LDA with gensim - strange values for perplexity - lda

We're running LDA using gensim and we're getting some strange results for perplexity. We're finding that perplexity (and topic diff) both increase as the number of topics increases; we were expecting them to decline. We've tried many different numbers of topics: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100. We've also played around with alpha (symmetric and auto) and keep getting the same results.
Our documents have 20+ words but most of them are 20-30. Are these documents too small for LDA to work?
Should we try to increase the amount of training data (we are running on 100k)? Or increase number of passes (but it looks like it has converged)?
Thanks

Related

Saving Random Forest Classifiers (sklearn) with pickle/joblib creates huge files

I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is that the saved files get huge. This seems to be correlated with the amount of data that I use for training (several tens of millions of samples per forest, leading to dumped files on the order of up to 20GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress-option set to 3).
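import pickle
import joblib
from sklearn.ensemble import RandomForestClassifier
X_train, y_train = ...  # some data
classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)
# Option 1: pickle
with open(path + 'classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)
# Option 2: joblib (compress=True corresponds to level 3)
joblib.dump(classifier, path + 'classifier.joblib', compress=True)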
Since the saved files got quite big (5GB to nearly 20GB, compressed approx. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the training set, I found different sizes for the saved classifier, which makes me believe that information about the training data is pickled/joblib-dumped as well. This seems unintuitive to me: for predictions, I only need the information of all the trained weak predictors (decision trees), which should stay the same, and since the number of trees and the max depth are not too high, they should not take up that much space. And certainly not more due to a larger training set.
All in all, I suspect that the structure is containing more than I need. Yet, I couldn't find a good answer on how to exclude these parts from it and save only the necessary information for my future predictions.
I ran into a similar issue, and at first I also thought that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out that decision trees are memory-hungry structures consisting of multiple arrays whose length is the total number of nodes. The number of nodes generally grows with the size of the data (and parameters like max_depth cannot effectively be used to limit growth, since reasonable values still leave room to generate a huge number of nodes). See details in this answer, but the gist is:
- a single decision tree can easily grow to a few MB (the example above has a 5MB decision tree for 100K data and a 50MB decision tree for 1M data)
- a random forest commonly contains at least 100 such decision trees, so for the example above you would have models in the range of 0.5 to 5GB
- compression is usually not enough to reduce this to reasonable sizes (1/2 to 1/3 are the usual ratios)
Other notes:
- with a different algorithm, models might remain a more manageable size (e.g. with xgboost I saw much smaller serialized models)
- it is probably possible to "prune" some of the data used by decision trees if you only plan to reuse them for prediction. In particular, I imagine the array of impurity and possibly those on n_samples might not be needed, but I have not checked.
- with respect to your hypothesis that the random forest is saving the data on which it is trained: no, it is not, and the data itself would likely be one or more orders of magnitude smaller than the final model
- so in principle another strategy, if you have a reproducible training pipeline, could be to save the data instead of the model and retrain when needed, but this is only possible if you can spare the time to retrain (for example, in a use case where a long-running service holds the model in memory and you serialize the model only to have a backup for when the service goes down)
- there are probably also other options to limit the growth of a random forest; the best one I have found so far is in this answer, where the suggestion is to use min_samples_leaf and set it as a percentage of the data
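To make the point about node counts concrete, here is a minimal sketch (not taken from the answers above) that compares the total node count and pickled size of two small forests, one with default leaves and one with min_samples_leaf given as a fraction. The synthetic data, the sizes, and the helper name forest_footprint are assumptions for illustration only; real numbers depend heavily on your data.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def forest_footprint(clf):
    """Return (total node count, pickled size in bytes) for a fitted forest."""
    n_nodes = sum(est.tree_.node_count for est in clf.estimators_)
    n_bytes = len(pickle.dumps(clf))
    return n_nodes, n_bytes


# Synthetic, noisy data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = rng.rand(5000, 10)
y = (X[:, 0] + 0.5 * rng.randn(5000) > 0.5).astype(int)

# Default leaves: trees keep splitting until leaves are pure or max_depth is hit.
default_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)
default_rf.fit(X, y)

# A float min_samples_leaf is interpreted as a fraction of the training samples.
pruned_rf = RandomForestClassifier(
    n_estimators=10, max_depth=10, min_samples_leaf=0.01, random_state=0
)
pruned_rf.fit(X, y)

default_nodes, default_bytes = forest_footprint(default_rf)
pruned_nodes, pruned_bytes = forest_footprint(pruned_rf)
print("default:", default_nodes, "nodes,", default_bytes, "bytes")
print("min_samples_leaf=1%:", pruned_nodes, "nodes,", pruned_bytes, "bytes")
```

Because the leaf size is tied to a fraction of the data, the number of leaves per tree is bounded regardless of how many samples you train on, which is why this setting tends to keep the serialized size from growing with the training set. Whether the accuracy cost is acceptable is something to check on your own data.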

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The documentation of MALLET mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET provides furthermore an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that the model quality decreases with the --num-iterations set to a too high value?
On a personal project, averaged over 10-fold cross-validation, increasing the number of iterations from 100 to 1000 did not impact the average accuracy (measured as Mean Reciprocal Rank) for a downstream task. However, within the cross-validation splits the performance changed significantly, although the random seed was fixed and all other parameters were kept the same. What part of the background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
The 1000 iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1000000 iterations, and fully half the token assignments never changed from the 1000 iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in PyTorch. My target values can be expressed either in the range 0 to 100 or in the range 0 to 1 (they represent a percentage, or a percentage divided by 100).
The data is unbalanced: I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn: the validation loss doesn't improve, and the loss on the largest 25% of targets is very big, much bigger than the std in this group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and whether using the range 0-1 is "cheating", that would be great.
Also, should I scale the targets (whether I use the larger or the smaller range)?
Some additional info: I'm trying to fine-tune BERT for a specific task. I use MSELoss.
Thanks!
I think your observation relates to batch normalization. There is a paper written on the subject, and numerous Medium/Towards Data Science posts, which I will not list here. The idea is that if you have no non-linearities in your model and loss function, the scale doesn't matter. But even with MSE you do have a non-linearity, which makes it sensitive to the scaling of both target and source data. You can experiment with inserting batch normalization layers into your model, after dense or convolutional layers. In my experience it often improves accuracy.
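As a small illustration of the scale sensitivity of MSE mentioned above, the sketch below computes the loss and its gradient for the same predictions expressed on the 0-1 scale and on the 0-100 scale. The numbers are made up and the helper names mse and mse_grad are hypothetical; this only shows the arithmetic and is not a confirmed diagnosis of the problem in the question.

```python
def mse(preds, targets):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def mse_grad(preds, targets):
    """Gradient of the MSE with respect to each prediction."""
    n = len(targets)
    return [2.0 * (p - t) / n for p, t in zip(preds, targets)]


# Hypothetical targets and predictions on the 0-1 scale.
targets_01 = [0.05, 0.10, 0.20, 0.90]
preds_01 = [0.10, 0.10, 0.30, 0.50]

# The same values expressed as percentages (0-100 scale).
targets_100 = [100.0 * t for t in targets_01]
preds_100 = [100.0 * p for p in preds_01]

loss_01 = mse(preds_01, targets_01)
loss_100 = mse(preds_100, targets_100)
grad_01 = mse_grad(preds_01, targets_01)
grad_100 = mse_grad(preds_100, targets_100)

# The loss grows with the square of the scale, the gradient linearly.
print("loss ratio:", loss_100 / loss_01)
print("gradient ratio:", grad_100[3] / grad_01[3])
```

With a factor of 100 in the targets, the loss is larger by a factor of 100^2 and the gradient with respect to the predictions by a factor of 100, so a learning rate tuned for one scale may behave very differently on the other. Using the 0-1 range is not "cheating" as long as errors are compared in the same units, for example by multiplying the 0-1 RMSE by 100 before comparing.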

Input documents to LDA

Assume I have N text documents and I run LDA in the following 2 ways,
- run LDA over the N documents at once
- run it on each document separately, so for N documents you run the algorithm N times
There is also the question of what number of topics to choose: in the first case I can select N as the number of topics (assuming each document is about a single topic), but if I run it on each document separately, I'm not sure how to select the number of topics.
What's going on in these two cases?
Latent Dirichlet Allocation is intended to model the topic and word distributions for each document in a corpus of documents.
Running LDA over all of the documents in the corpus at once is the normal approach; running it on a per-document basis is not something I've heard of, and I wouldn't recommend doing this. It's difficult to say what would happen, but I wouldn't expect the results to be nearly as useful, because you couldn't meaningfully compare one document/topic or topic/word distribution with another.
I'm thinking that your choice of N for the number of topics might be too high (what if you had thousands of documents in your corpus?), but it really depends on the nature of the corpus you are modelling. Remember that LDA assumes a document will be a distribution over topics, so it might be worth rethinking the assumption that each document is about one topic.
LDA is a statistical model that assigns topics to documents. It works by distributing the words of each document over topics (randomly the first time), then repeating this step for a number of iterations (it could be 500) until the words assigned to each topic are almost stable. It can then assign N topics to a document according to the most frequent words in the document that have a high probability in each topic.
So it does not make sense to run it over a single document: the words assigned to the topics in the first iteration will not change over the iterations, because you are using only one document, and the topics assigned to the document will be meaningless.
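The procedure described above (random initial assignment, then repeated reassignment of each word) corresponds roughly to collapsed Gibbs sampling. The following is a minimal sketch of that idea in plain Python, not the implementation used by any particular library; the function name gibbs_lda, the toy documents, and the alpha and beta values are assumptions for illustration.

```python
import random


def gibbs_lda(docs, n_topics, n_iters, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Step 1: assign every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each token's topic given all other tokens.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized P(topic k | everything else).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


docs = [
    "cat dog cat pet dog".split(),
    "dog pet cat cat".split(),
    "stock market trade stock".split(),
    "market trade stock bank".split(),
]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, n_iters=200)
for d, counts in enumerate(doc_topic):
    print("doc", d, "topic counts:", counts)
```

The sampler relies on which words co-occur across documents to pull topics apart, which is why running it over the whole corpus at once gives it far more signal than running it on one document at a time.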

MALLET Ranking of Words in a topic

I am relatively new to MALLET and need to know:
- are the words in each topic that MALLET produces rank-ordered in some way?
- if so, what is the ordering, i.e. is the first word in a topic list the one with the highest distribution across the corpus?
Thanks!
They are ranked by the probabilities learned in training, i.e. the first word is the most probable to appear in this topic, the second is less probable, the third less still, and so on. These are not directly related to term frequencies, although the words with the highest TF-IDF weights are likely to be among the most probable. Also, Gibbs sampling has a lot to do with how words are ranked in topics: due to randomness in sampling you can get quite different probabilities for words within topics. Try, for example, saving the model and then retraining with the --input-model option: the topics will look very much alike but not the same.
That said, if you need to see actual weights of terms in the corpus unrelated to LDA, you can use something like NLTK in Python to check frequency distributions and also something like sklearn for TFIDF to get more meaningful weight distributions.
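To illustrate the difference between the two kinds of ranking, here is a minimal sketch in plain Python. It assumes, as the answer says, that words within a topic are ordered by their per-topic probability; the counts, the beta value, and the tiny corpus are made up, and collections.Counter is used here only as a stand-in for a frequency-distribution tool such as NLTK's.

```python
from collections import Counter

# Hypothetical topic-word counts, e.g. tallied from a sampler's final state.
topic_word_counts = {"market": 40, "stock": 55, "the": 12, "bank": 30}
beta = 0.01  # assumed symmetric smoothing prior
vocab_size = len(topic_word_counts)
total = sum(topic_word_counts.values())

# Smoothed P(word | topic), then rank from most to least probable.
word_probs = {
    w: (c + beta) / (total + vocab_size * beta)
    for w, c in topic_word_counts.items()
}
ranked = sorted(word_probs, key=word_probs.get, reverse=True)
print("topic ranking:", ranked)

# Plain corpus frequencies, independent of any topic model.
corpus = "the stock market and the bank and the stock".split()
corpus_freq = Counter(corpus)
print("corpus frequencies:", corpus_freq.most_common(3))
```

In this toy case the most frequent corpus word ("the") ends up last in the topic ranking, which is the sense in which topic order is not simply corpus frequency.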