Can LDA assign more than one topic to a word?

I have just started reading about Latent Dirichlet Allocation (LDA) and want to apply it to my project.
May I know if LDA is able to assign more than one topic to a word?
For example, Article A talks about "river banks" while Article B talks about "the role of banks in finance". Will LDA allow the word "banks" to be assigned to two different topics?

LDA topics are probability distributions over terms. Terms may have nonzero weight in any or all of the topics. You can turn that around and find the probability of each topic given a specific term. So yes, a term like "bank" can be assigned to many topics, but in general it will be assigned to some with a stronger weight than others.
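As a minimal sketch with gensim (the toy corpus, model settings, and variable names below are made up for illustration), you can inspect the weight of "banks" in each learned topic:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: two senses of "banks"
docs = [
    ["river", "banks", "water", "flood"],
    ["banks", "finance", "money", "loan"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50, random_state=0)

# Weight of "banks" in each topic; typically nonzero in several topics,
# stronger in some than in others
print(lda.get_term_topics(dictionary.token2id["banks"], minimum_probability=0.0))
```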

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable-length input, but what if the number of words correlates with the target? Say we want to perform sentiment analysis on reviews where longer reviews are more likely to be negative. How can the model harness this knowledge? Of course, a simple solution could be to add the word count as a feature after the self-attention layer. However, this hand-crafted approach wouldn't reveal more complex relations; for example, a high count of word X might correlate with target 1, except when there is also a high count of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also appreciated.
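One way to picture the simple baseline from the question (a made-up PyTorch sketch, not an established recipe): pool the encoder output and concatenate a normalised length feature before the classifier head. Capturing interactions like the X/Y example would instead require exposing the counts to the network earlier, e.g. as additional input features.

```python
import torch
import torch.nn as nn

class LengthAwareClassifier(nn.Module):
    """Pools encoder states and appends a sequence-length feature (hypothetical names)."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder  # assumed to return (batch, seq_len, hidden_dim)
        self.head = nn.Linear(hidden_dim + 1, num_classes)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(tokens)               # (batch, seq_len, hidden_dim)
        mask = mask.unsqueeze(-1).float()           # (batch, seq_len, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens only
        length = mask.sum(1) / tokens.size(1)          # length as a fraction of max
        return self.head(torch.cat([pooled, length], dim=-1))
```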

How to choose the best LDA model when coherence and perplexity show opposed trends?

I have a corpus of around 1,500,000 documents consisting of titles and abstracts from scientific research projects within STEM. I used Mallet (https://mimno.github.io/Mallet/transforms) to fit models from 10 to 790 topics in increments of 10 topics (allowing hyperparameter optimization). I ran three replicates with differing seeds, using 70% of the corpus for training and 30% for validation. The corpus was pre-processed to remove stop-words, lemmatise, etc.
I want to choose the model with the best number of topics. I understand that coherence and perplexity are two quantitative measures of model performance that should allow me to do this, and that they measure two 'separate' kinds of performance: coherence measures how well the model finds topics that are close to human-interpretable, while perplexity measures how well the trained model describes unseen data.
However, when I calculate these two measures they show opposing trends with no clear optimum in either.
I was hoping that both would show some agreement in trend, so I could choose the number of topics to work with (see figure below).
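For reference, a minimal gensim-style sketch of the kind of sweep described (the MALLET setup above would be analogous; `train_bow`, `valid_bow`, `texts`, and `dictionary` are assumed to come from a standard gensim pipeline):

```python
from gensim.models import CoherenceModel, LdaModel

# Assumed to exist: train_bow, valid_bow (bag-of-words corpora),
# texts (tokenised documents), dictionary (gensim Dictionary)
for k in range(10, 800, 10):
    lda = LdaModel(train_bow, num_topics=k, id2word=dictionary, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    # gensim's log_perplexity returns the per-word likelihood bound
    perplexity = 2 ** (-lda.log_perplexity(valid_bow))
    print(k, round(coherence, 3), round(perplexity, 1))
```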

Input documents to LDA

Assume I have N text documents and I run LDA in the following two ways:
1. run LDA over the N documents at once
2. run it on each document separately, so for N documents you run the algorithm N times
I'm also unsure what number of topics to choose: in the first case I can select N as the number of topics (assuming each document is about a single topic), but if I run it on each document separately I'm not sure how to select the number of topics.
What's going on in these two cases?
Latent Dirichlet Allocation is intended to model the topic and word distributions for each document in a corpus of documents.
Running LDA over all of the documents in the corpus at once is the normal approach; running it on a per-document basis is not something I've heard of, and I wouldn't recommend it. It's difficult to say what would happen, but I wouldn't expect the results to be nearly as useful, because you couldn't meaningfully compare one document/topic or topic/word distribution with another.
I'm thinking that your choice of N for the number of topics might be too high (what if you had thousands of documents in your corpus?), but it really depends on the nature of the corpus you are modelling. Remember that LDA assumes a document will be a distribution over topics, so it might be worth rethinking the assumption that each document is about one topic.
LDA is a statistical model that assigns topics to documents. It works by distributing the words of each document over topics (randomly the first time), then repeating this step for a number of iterations (500, say) until the word-to-topic assignments are almost stable. It can then assign topics to a document according to the words in the document that have high probability under each topic.
So it does not make sense to run it over a single document: the word-to-topic assignments from the first iteration will barely change over subsequent iterations because you are using only one document, and the topics assigned to that document will be meaningless.
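To make the iterative reassignment described above concrete, here is a toy collapsed Gibbs sampler for LDA (a minimal illustrative sketch, not production code; the function name and hyperparameters are made up):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iters=500, alpha=0.1, beta=0.01):
    """Toy collapsed Gibbs sampler. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # total words per topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iters):  # repeat until assignments are almost stable
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this token's current assignment
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # resample its topic given all other assignments
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return ndk, nkw  # doc-topic and topic-word counts after sampling
```

With a single document, the `ndk` row for that document is the only source of the document-topic signal, which is why the per-document run described above yields little useful structure.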

MALLET Ranking of Words in a topic

I am relatively new to mallet and need to know:
- are the words in each topic that mallet produces rank ordered in some way?
- if so, what is the ordering? (i.e., is the 1st word in a topic list the one with the highest weight across the corpus?)
Thanks!
They are ranked by the probabilities estimated during training, i.e. the first word is the most probable to appear in the topic, the second less probable, the third even less, and so on. These are not directly related to term frequencies, although the words with the highest TF-IDF weights are certainly more likely to be among the most probable. Also, Gibbs sampling has a lot to do with how words are ranked in topics: due to randomness in sampling you can get quite different probabilities for words within topics. Try, for example, saving the model and then retraining with the --input-model option; the topics will look very much alike, but not the same.
That said, if you need to see the actual weights of terms in the corpus, unrelated to LDA, you can use something like NLTK in Python to check frequency distributions, and something like sklearn's TF-IDF to get more meaningful weight distributions.
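A minimal sketch of both (the toy corpus is made up for illustration):

```python
from nltk import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the river bank flooded", "the bank issued a loan"]  # toy corpus

# Raw term frequencies with NLTK
fd = FreqDist(word for doc in docs for word in doc.split())
print(fd.most_common(5))

# TF-IDF weights with scikit-learn
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(dict(zip(vec.get_feature_names_out(), X.toarray().sum(axis=0).round(2))))
```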

LDA topic modeling - Training and testing

I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.
References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by the documents in that collection. Thus, using the LDA algorithm and the Gibbs sampler (or variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned probabilities.
What I don't understand is: if the above is true, then why do many topic modeling tutorials talk about separating the dataset into training and test sets?
Can anyone explain to me the steps (the basic concept) of how LDA can be used to train a model, which can then be used to analyze a separate test dataset?
Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is on the testing set. More specifically, you measure how well the word counts of the test documents are represented by the word distributions represented by the topics.
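As a concrete illustration, here is a minimal gensim sketch of that train/test split and held-out perplexity (`texts` is an assumed list of tokenised documents; the same idea applies to any LDA implementation):

```python
import random
from gensim import corpora
from gensim.models import LdaModel

# `texts` is assumed to be a list of tokenised documents
random.seed(0)
random.shuffle(texts)
split = int(0.7 * len(texts))
train, test = texts[:split], texts[split:]

dictionary = corpora.Dictionary(train)
train_bow = [dictionary.doc2bow(doc) for doc in train]
test_bow = [dictionary.doc2bow(doc) for doc in test]

lda = LdaModel(train_bow, num_topics=20, id2word=dictionary, passes=10)

# Held-out perplexity: 2^(-per-word log-likelihood bound)
print("test perplexity:", 2 ** (-lda.log_perplexity(test_bow)))
```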
Perplexity is good for relative comparisons between models or parameter settings, but its numerical value doesn't really mean much. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:
Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic" or just some random group of words?
Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about?
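A minimal gensim sketch of both inspection steps (assuming `lda`, `dictionary`, and a tokenised held-out document `held_out_tokens` already exist):

```python
# Inspect the topics: top words per topic
for topic_id, words in lda.show_topics(num_topics=5, num_words=10, formatted=False):
    print(topic_id, [word for word, prob in words])

# Inspect the topic assignments for a held-out document
bow = dictionary.doc2bow(held_out_tokens)
print(lda.get_document_topics(bow))
```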
I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.
Good luck!
The general rule that evaluating on the training data is subject to overfitting also applies to unsupervised learning like LDA, though it is less obvious. LDA optimizes some objective, i.e. generative probability, on the training data. It might be that in the training data two words are indicative of a topic, say "white house" for US politics. Assume the two words occur only once (in the training data). Then any algorithm fully relying on the assumption that they indicate only politics and nothing else would do great if you evaluated on the training data. However, if there are other topics like "architecture", you might question whether this is really the right thing to learn. Having a test dataset can resolve this issue to some extent:
- Since the co-occurrence "white house" is rare in the training data, it likely does not occur at all in the test data. If so, the evaluation shows how much your model relies on spurious relationships that might in fact not be helpful, compared to more general ones.
- "White house" does occur in the test data, say once for "US politics" and once in a document on architecture. Then the assumption that it only indicates "US politics" is too strong, and performance metrics will be worse, showing that your model is overfitting.