Logistic regression model: testing the goodness of fit

I am trying to assess the goodness of fit of my logistic regression model against the original dataset. I plan to use the chi-square goodness-of-fit test and the Hosmer-Lemeshow test.
I would like to better understand the R code used for the chi-square test. Could anybody give a detailed explanation of the code?
Here is the model.
`model01 <- glm(low ~ lwt+age+race+smoke+ht+ui, family=binomial(link='logit'), data=lb)`
`summary(model01)`
Here is the chi-square test code that I need help understanding line by line. If you could briefly mention the related concepts, that would be much appreciated.
`sum(residuals(model01, type = "pearson")^2)`
`deviance(model01)`
`df.residual(model01)`
`1 - pchisq(sum(residuals(model01, type = "pearson")^2), df.residual(model01)-length(model01$coefficients)+1)`
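For readers who want to see what these four lines compute, here is a minimal sketch in Python rather than R. The outcomes `y`, the fitted probabilities `p`, and the coefficient count `n_coef` are made-up illustration values, and `pearson_chi2`, `deviance` and `chi2_sf` are hand-rolled helpers standing in for `sum(residuals(..., type = "pearson")^2)`, `deviance(...)` and `1 - pchisq(...)`. As far as I know, `df.residual()` for a `glm` already equals the number of observations minus the number of estimated coefficients, so the sketch uses that value directly; that is an assumption about the intended test, not a reading of the extra adjustment in the last line above.

```python
import math

def pearson_chi2(y, p):
    """Sum of squared Pearson residuals (y - p) / sqrt(p * (1 - p))."""
    return sum((yi - pi) ** 2 / (pi * (1.0 - pi)) for yi, pi in zip(y, p))

def deviance(y, p):
    """Residual deviance for 0/1 outcomes: -2 * log-likelihood."""
    return -2.0 * sum(math.log(pi if yi == 1 else 1.0 - pi)
                      for yi, pi in zip(y, p))

def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square with integer df (like 1 - pchisq).

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    with z = x / 2; fine for moderate df, may overflow for df in the hundreds.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                  # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))       # exact tail for df = 1
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical data: observed 0/1 outcomes and fitted probabilities.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.1, 0.5]
n_coef = 2                                        # assumed: intercept + one slope

x2 = pearson_chi2(y, p)                           # Pearson chi-square statistic
dev = deviance(y, p)                              # residual deviance
df = len(y) - n_coef                              # residual degrees of freedom
p_value = chi2_sf(x2, df)                         # large value = no evidence of lack of fit
print(x2, dev, df, p_value)
```

A caveat worth keeping in mind: with one row per subject the chi-square approximation for this statistic is usually considered unreliable, which is the motivation for grouping rows into covariate patterns or using the Hosmer-Lemeshow grouping.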
Also I have this code for getting the number of covariate patterns of the regression model.
Could you explain this code as well?
Thanks a lot
`model.mf = model.frame(model01)
model.cp = epi.cp(model.mf[-1])
model.cp$cov.pattern; model.cp$id
cov_pat <- nrow(model.cp$cov.pattern)
rm(model.mf)`
I tried checking the help page with `?function` for each line, but could not work it out.
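As I understand the covariate-pattern code: `model.frame(model01)` returns the data actually used in the fit with the response in the first column, `[-1]` drops that column, and `epi.cp()` (from the epiR package) groups the remaining rows into distinct combinations of predictor values. `cov.pattern` then holds one row per distinct combination and `id` maps each observation to its pattern, so `nrow(...)` is the number of covariate patterns. A minimal Python sketch of the same idea, with hypothetical predictor rows and variable names:

```python
from collections import Counter

# Hypothetical predictor rows (smoke, ht, ui); the response column is excluded,
# mirroring model.mf[-1] in the R code.
rows = [
    (0, 0, 0),
    (1, 0, 0),
    (0, 0, 0),
    (1, 0, 1),
    (1, 0, 0),
    (0, 1, 0),
]

counts = Counter(rows)                 # how many observations share each pattern
patterns = list(counts)                # distinct patterns, in first-seen order
pattern_id = {pat: i + 1 for i, pat in enumerate(patterns)}
ids = [pattern_id[r] for r in rows]    # pattern number for every observation
cov_pat = len(patterns)                # number of covariate patterns

print(patterns, ids, cov_pat)
```

With continuous predictors such as `lwt` and `age`, most rows tend to be their own pattern, so the number of patterns can be close to the number of observations.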

Related

StableBaselines3 - Why does calling "model.learn(50,000)" twice not give same result as calling "model.learn(100,000)" once?

I am working on a Reinforcement Learning problem in StableBaselines3.
I am trying to understand why this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)
Does not give the exact same result as this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)
I say they don't give the same results because in both cases, I tested out the model on a test-set through a for-loop, and the performance was different. Given that I set deterministic=True in the for-loop and I didn't change the seed, the different performance must mean the networks are different, which means the training process was different.
I was under the impression that if I call model.learn() on an existing model, it would just pick up the training where it previously left off, but I guess that's incorrect.
Can someone help me understand why those two situations deliver different results?
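One likely contributor, sketched under assumptions: as far as I know, PPO-style algorithms in Stable-Baselines3 collect experience in whole rollouts of `n_steps * n_envs` steps and keep going until the step counter reaches the requested total, so the number of steps actually run is rounded up to a multiple of the rollout size. The sketch assumes the default `n_steps=2048` (not set in the question) and a single environment; `steps_actually_run` is a hypothetical helper, not a library function.

```python
import math

def steps_actually_run(total_timesteps, n_steps=2048, n_envs=1):
    """Round the requested budget up to a whole number of rollouts."""
    rollout = n_steps * n_envs
    return math.ceil(total_timesteps / rollout) * rollout

once = steps_actually_run(100_000)        # one call with 100,000 steps
twice = 2 * steps_actually_run(50_000)    # two calls with 50,000 steps each

print(once, twice)                        # the two totals differ
```

Under these assumptions the two runs see a different number of rollouts and gradient updates, so the networks cannot be identical. Separately, `learn()` has a `reset_num_timesteps` argument that defaults to `True`; to my understanding a second call then restarts the step counter and resets the environment, so it is not a pure continuation unless `reset_num_timesteps=False` is passed.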

Interpretation, logistic regression

I have a quick question regarding logistic regression output.
My code (in Stata):
logit pass i.experience, or
pass is a binary variable indicating whether the test was passed or not; experience is a categorical variable consisting of 3 different experience groups. Reference group: experience=0 (no experience).
If the ORs are <1 for all experience groups (p<0.01), I conclude that:
having (any) experience = smaller chance of passing the test, compared
to having no experience.
My question: can I also turn this interpretation around, and conclude that:
Non-experienced students are more likely to pass the test, compared
to students with experience?
Thanks.
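A small worked example may help here. Swapping the reference group inverts the odds ratio, so an OR below 1 for "experience vs. no experience" is the same finding as an OR above 1 for "no experience vs. experience". The counts below are invented for illustration and `odds_ratio` is a hypothetical helper, not Stata output:

```python
def odds_ratio(pass_a, fail_a, pass_b, fail_b):
    """Odds of passing in group A divided by odds of passing in group B."""
    return (pass_a / fail_a) / (pass_b / fail_b)

# Hypothetical counts for two of the groups.
none_pass, none_fail = 40, 10    # no experience: odds of passing = 4.0
exp_pass, exp_fail = 20, 20      # one experience group: odds of passing = 1.0

or_exp_vs_none = odds_ratio(exp_pass, exp_fail, none_pass, none_fail)
or_none_vs_exp = odds_ratio(none_pass, none_fail, exp_pass, exp_fail)

print(or_exp_vs_none, or_none_vs_exp, or_exp_vs_none * or_none_vs_exp)
```

The reversal holds for each experience group compared with the reference group separately. Strictly speaking the OR compares odds, but lower odds always go with a lower probability, so "more likely to pass" is a fair reading of the direction.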

Can variational autoencoders be used on non-image data?

I have a question about variational autoencoders (VAEs).
I need to generate new data from my dataset, which contains only numerical data, so I want to use a VAE for that task. However, all the available tutorials and articles use images as input data for the variational autoencoder.
My question is: can I use a VAE to generate new data from my dataset even though my data is not images?
Thank you.
Short answer is yes. You should read up a bit on the basics of neural nets if this wasn't obvious already - an image is just a Channel X Height X Width dimensional vector. You might use different kinds of layers in your network to suit the kind of data that you have to give a better inductive bias, but otherwise nothing changes. Follow those tutorials!

Training and test diverge while running catboost

When I run the catboost regressor my training and test plots diverge with weird kinks at ~1000 iterations. The plot is appended below and my regressor setup is as follows:
cat_model=CatBoostRegressor(iterations=2500, depth=4, learning_rate=0.01, loss_function='RMSE', thread_count=-1, use_best_model = True, random_seed=12, random_strength=10, rsm=0.5)
I tried different values of leaf_estimation_iterations and bagging_temperature but did not have any success. Any suggestions on what I should try to get better results?
[Image: Model Fit Plot]
The divergence is normal. You will typically perform better on the training set, since the model overfits it, and your objective is to regularize the model using the validation set.
First, I would recommend reading about the bias-variance tradeoff for a general intuition on how to tackle this issue.
Specifically for CatBoost, you want to regularize the training procedure so that it generalizes better.
You can start by adding more data and setting a higher l2_leaf_reg parameter.
The official documentation has many more good suggestions on model tuning:
https://catboost.ai/docs/concepts/parameter-tuning.html
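As a rough sketch of the `l2_leaf_reg` suggestion, the settings from the question can be kept in a dictionary and a stronger L2 penalty added on top. The value 10 is an arbitrary starting point for tuning, not a recommendation from the documentation; as far as I know the default is 3. The constructor call is left as a comment so the snippet runs without CatBoost installed.

```python
# Parameters copied from the question.
base_params = dict(
    iterations=2500,
    depth=4,
    learning_rate=0.01,
    loss_function="RMSE",
    thread_count=-1,
    use_best_model=True,
    random_seed=12,
    random_strength=10,
    rsm=0.5,
)

# Assumed tweak: a larger L2 penalty on leaf values (illustrative value only).
regularized_params = {**base_params, "l2_leaf_reg": 10}

# from catboost import CatBoostRegressor
# cat_model = CatBoostRegressor(**regularized_params)
print(regularized_params)
```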

How to get the accuracy of classifier on test data in DeepLearning

I am trying to use DL4J for deep learning and have provided the training data with the labels. I am then trying to pass in test data by assigning a dummy label. Without a dummy label, it gives a runtime error. I don't understand why we need to assign a label to test data.
Additionally, I want to know the accuracy of the predictions made. From what I saw in the DL4J docs, there is something known as a confusion matrix which is generated. I understand that this just gives us an idea of how well the training data has trained the system. Is there a way to get the accuracy of prediction on test data? Since we are giving a dummy label for the test data, I feel that the confusion matrix is also not generated correctly.
First, how can you test whether the network outputs the correct labels if you don't know what the correct labels are? You should always have labels when training and testing, because that is how you can check whether the output is correct.
As for the second question, I found this on the DL4J website:
Evaluation eval = new Evaluation(3);
INDArray output = model.output(testData.getFeatures());
eval.eval(testData.getLabels(), output);
log.info(eval.stats());
It is stated there that the .stats() method displays the confusion matrix entries (one per line), accuracy, precision, recall and F1 score. Additionally, the Evaluation class can also calculate and return the following values:
Confusion Matrix
False Positive/Negative Rate
True Positive/Negative
Class Counts
F-beta, G-measure, Matthews Correlation Coefficient and more
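To make the quantities listed above concrete, here is a language-neutral sketch in plain Python of how a confusion matrix, accuracy, precision and recall follow from true and predicted labels. It is not DL4J's implementation; the labels and helper names are invented for illustration. It also shows why real test labels are needed: with dummy labels the `actual` list would be meaningless.

```python
def confusion_matrix(actual, predicted, n_classes):
    """m[a][p] counts examples whose true class is a and predicted class is p."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

def accuracy(m):
    """Share of examples on the diagonal (predicted class equals true class)."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

def precision_recall(m, cls):
    """Precision and recall for one class, returning 0.0 when undefined."""
    tp = m[cls][cls]
    predicted_as_cls = sum(row[cls] for row in m)   # column total
    actually_cls = sum(m[cls])                      # row total
    precision = tp / predicted_as_cls if predicted_as_cls else 0.0
    recall = tp / actually_cls if actually_cls else 0.0
    return precision, recall

# Hypothetical test-set labels for a 3-class problem.
actual = [0, 0, 1, 1, 2, 2, 2, 1]
predicted = [0, 1, 1, 1, 2, 0, 2, 1]

m = confusion_matrix(actual, predicted, 3)
print(m, accuracy(m), precision_recall(m, 1))
```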
I hope this helps you.
You may find people who can respond to your question in the DL4J dev community here: https://gitter.im/deeplearning4j/deeplearning4j/tuninghelp