How are Elastic Net penalties applied to Logistic Regression's Maximum Likelihood cost function?

I understand how Ridge / Lasso / Elastic Net regression penalties are applied to the linear regression's cost function, but I am trying to figure out how they are applied to Logistic Regression's Maximum Likelihood cost function.
I've searched around on Google, and it looks like it can be done (I believe Sci-Kit's logistic regression models accept L1 and L2 parameters, and I've seen some YouTube videos saying that the penalties can be applied to logistic models too). I've also found how they are added to the sum-of-squared-residuals cost function, but I am curious how the penalties are applied with the maximum likelihood cost function. Is it the maximum likelihood minus the penalties?

I got an answer by posting on Stats Stack Exchange (link). I'll post the answer from ofer-a here to help anyone searching on Stack Overflow for a similar answer.
The elastic net terms are added to the maximum likelihood cost function, i.e. the final cost function is:

$$J(\beta) = -\sum_{i=1}^{n}\Big[y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\Big] + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert + \lambda_2 \sum_{j=1}^{p} \beta_j^2, \qquad \hat p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}$$

The first term is the negative log-likelihood, the second term is the L1-norm part of the elastic net, and the third term is the L2-norm part; i.e. the model is trying to minimize the negative log-likelihood while also keeping the weights small.
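For concreteness, here is a minimal numpy sketch of that cost function. The function name and the use of two separate penalty weights `lam1` and `lam2` are illustrative choices, not something from the original answer:

```python
import numpy as np

def elastic_net_log_loss(w, b, X, y, lam1, lam2):
    """Negative log-likelihood of a logistic model plus elastic net penalties."""
    z = X @ w + b                                  # linear predictor
    nll = np.sum(np.logaddexp(0.0, z) - y * z)     # = -sum[y*log(p) + (1-y)*log(1-p)]
    l1 = lam1 * np.sum(np.abs(w))                  # lasso part (intercept not penalized)
    l2 = lam2 * np.sum(w ** 2)                     # ridge part
    return nll + l1 + l2
```

In scikit-learn the same idea is exposed through `LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=...)`, where a single `l1_ratio` splits the penalty between the L1 and L2 parts and `C` scales the data-fit term (larger `C` means weaker regularization).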

Related

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The MALLET documentation mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET provides furthermore an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that model quality decreases if --num-iterations is set too high?
On a personal project, averaged over 10-fold cross-validation, increasing the number of iterations from 100 to 1000 did not impact the average accuracy (measured as Mean Reciprocal Rank) on a downstream task. However, within the cross-validation splits the performance changed significantly, although the random seed was fixed and all other parameters were kept the same. What part of background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
The 1000 iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1000000 iterations, and fully half the token assignments never changed from the 1000 iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.
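If you want to check this empirically, a rough sketch of the sweep is below. It uses gensim's LdaModel (which runs variational inference rather than MALLET's collapsed Gibbs sampling) purely because it is easy to script; the toy corpus, the topic count, and the iteration budgets are placeholders:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy corpus standing in for the real collection
texts = [
    ["topic", "model", "gibbs", "sampling"],
    ["mallet", "lda", "iterations", "quality"],
    ["coherence", "perplexity", "topic", "evaluation"],
    ["sampling", "iterations", "convergence", "lda"],
] * 25

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Sweep the iteration budget and compare the quality measures mentioned above
for iterations in (50, 200, 1000):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4,
                   iterations=iterations, passes=5, random_state=42)
    coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    print(f"iterations={iterations:4d}  "
          f"log-perplexity bound={lda.log_perplexity(corpus):.3f}  "
          f"u_mass coherence={coherence:.3f}")
```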

Deep learning model test accuracy unstable

I am trying to train and test a PyTorch GCN model that is supposed to identify people, but the test accuracy is quite jumpy: it gives 49% at epoch 23 and then drops to around 45% at epoch 41. So it is not increasing all the time, even though the loss seems to decrease at every epoch.
My question is not about implementation errors; rather, I want to know why this happens. I don't think there is anything wrong with my code, as I have seen SOTA architectures show this type of behavior as well. The authors just picked the best result and published it, saying that their model gives that result.
Is it normal for the accuracy to be jumpy (up and down), and should I just take the best-ever weights that produce it?
Accuracy is naturally more "jumpy", as you put it. In terms of accuracy, you have a discrete outcome for each sample: you either get it right or wrong. This makes the results fluctuate, especially if you have a relatively low number of samples (since you then have a higher sampling variance).
On the other hand, the loss function should vary more smoothly. It is based on the probabilities for each class calculated at your softmax layer, which means that they vary continuously. With a small enough learning rate, the loss function should vary monotonically. Any bumps you see are due to the optimization algorithm taking discrete steps, with the assumption that the loss function is roughly linear in the vicinity of the current point.
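Regarding "take the best-ever weights": a common pattern is to checkpoint the model whenever the validation accuracy improves and restore that checkpoint at the end. A self-contained PyTorch sketch (tiny random data standing in for the asker's GCN setup):

```python
import copy
import torch
import torch.nn as nn

# Minimal stand-in for the asker's setup: a tiny classifier on random data,
# just to show the "keep the best validation weights" pattern.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x_train, y_train = torch.randn(200, 10), torch.randint(0, 2, (200,))
x_val, y_val = torch.randn(100, 10), torch.randint(0, 2, (100,))

best_acc, best_state = 0.0, None
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()

    if val_acc > best_acc:                       # validation accuracy jumps around,
        best_acc = val_acc                       # so remember only the best weights
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                # evaluate/test with the best checkpoint
```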

Is BatchNorm turned off when inferencing?

I have read several sources that implicitly suggest batch norm is turned off for inference, but I have found no definite answer.
The most common approach is to use a moving average of the mean and std for your batch normalization, as Keras does for example (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py). If you just turn it off, the network will perform worse on the same data, due to changes in how the images are processed.
This is done by storing the average mean and the average std over all the batches seen during training. At inference time this moving average is used for normalization.
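In code, the distinction is the training flag. A minimal Keras sketch (toy model, not from the question): with training=True the layer normalizes with the current batch's statistics and updates its moving averages, while training=False (the inference path) uses the stored moving averages:

```python
import numpy as np
from tensorflow import keras

# Toy model with batch normalization (illustrative only)
model = keras.Sequential([
    keras.Input(shape=(16,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.BatchNormalization(),   # tracks moving_mean / moving_variance
    keras.layers.Dense(2),
])

x = np.random.randn(8, 16).astype("float32")

# training=True: normalize with this batch's statistics and update the moving averages
y_train_mode = model(x, training=True)

# training=False (also what model.predict uses): normalize with the stored moving averages
y_infer_mode = model(x, training=False)
```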

RNN L2 Regularization stops learning

I use a bidirectional RNN to detect an event with unbalanced occurrence: the positive class is 100 times rarer than the negative class.
With no regularization I can get 100% accuracy on the training set and 30% on the validation set.
When I turn on L2 regularization I get only 30% accuracy on the training set as well, instead of longer learning and 100% accuracy on the validation set.
I was thinking that maybe my data set is too small, so just as an experiment I merged the training set with the test set, which I had not used before. The situation was the same as if I had used L2 regularization, even though I did not use it this time: I get 30% accuracy on train+test and on validation.
I use 128 hidden units and 80 timesteps in the mentioned experiments.
When I increased the number of hidden units to 256 I could again overfit the train+test set and reach 100% accuracy, but still only 30% on the validation set.
I have tried many hyperparameter options with almost no result. Maybe the weighted cross-entropy is causing the problem; in the experiments above, the weight on the positive class is 5. With larger weights the results are often worse, around 20% accuracy.
I tried LSTM and GRU cells, no difference.
The best result I got was with 2 hidden layers of 256 hidden units each; it took around 3 days of computation and 8 GB of GPU memory. I got around 40-50% accuracy before it started overfitting again, with L2 regularization turned on but not so strong.
Is there some general guideline what to do in this situation? I was not able to find anything.
Too many hidden units can make your model overfit. You can try a smaller number of hidden units. As you mentioned, training with more data might improve the performance. If you don't have enough data, you can generate some artificial data: researchers add distortions to their training data to increase its size, but in a controlled way. This type of strategy works well for image data; if you are dealing with text data, you can probably use some knowledge base to improve the performance.
There is a lot of ongoing work that uses knowledge bases to solve NLP and deep-learning-related tasks.
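As a reference point, here is a minimal PyTorch sketch (not the asker's code) of the setup discussed above: a bidirectional LSTM with a positive-class weight of 5 in the loss and L2 regularization applied through the optimizer's weight_decay. The class name, sizes, and random data are placeholders:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)        # single logit for the rare event

    def forward(self, x):
        out, _ = self.rnn(x)                        # (batch, timesteps, 2 * hidden)
        return self.head(out[:, -1]).squeeze(-1)    # use the last timestep's output

model = BiLSTMClassifier(n_features=20)

# pos_weight > 1 up-weights the rare positive class (the question uses a weight of 5)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))

# weight_decay applies an L2 penalty to the parameters at every update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(32, 80, 20)                         # 32 sequences, 80 timesteps, 20 features
y = (torch.rand(32) < 0.01).float()                 # heavily imbalanced 0/1 labels

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```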

Adjusting Binary Logistic Formula in SPSS

I am running a binary logistic regression in SPSS to test the effect of, e.g., TV advertisements on the probability that a consumer buys a product. My problem is that with the formula of binary logistic regression:
P = 1/(1 + e^(-(a + b*Adv)))
the maximum probability will be equal to 100%. However, even if I increase the number of advertisements by 1000, it is not sensible to assume that the probability of purchase will be 100%. So if I plot the logistic curve with the coefficients from the binary logistic regression, at some point the predicted probability approaches 100%, which is never the case in a real-life setting. How can I control for that?
Is there a way to change the SPSS binary logistic regression to have a maximum probability of, e.g., 20%?
Thank you!
The maximum hypothetical probability is 100%, but if you use real-world data, your model will fit the data in such a way that the predicted y-value for any given value of x will be no higher than the real-world y-value (plus or minus your model's error term). I wouldn't worry too much about the hypothetical maximum probability as long as the model fits the data reasonably well. One of the key reasons for using logistic regression instead of OLS linear regression is to avoid impossible predicted values.
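To see this concretely, here is a small simulation in Python rather than SPSS (the data-generating process, with purchases saturating around 20%, is made up for illustration): over the observed range of advertisements, the fitted probabilities stay near the data and far below the theoretical 100% ceiling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data in which the true purchase probability saturates around 20 %
ads = rng.integers(0, 50, size=5000).reshape(-1, 1)
true_p = 0.20 / (1.0 + np.exp(-(ads.ravel() - 10) / 5.0))
bought = rng.random(5000) < true_p

model = LogisticRegression().fit(ads, bought)

# Over the observed range of advertisements the fitted probabilities stay near the
# data (roughly 0.2-0.3 here), far below the curve's theoretical ceiling of 1.0
grid = np.arange(0, 51).reshape(-1, 1)
print(model.predict_proba(grid)[:, 1].max())
```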