Created factors with EFA, tried regressing (lm) with control variables - Error message "variable lengths differ" - regression

EFA first-timer here!
I ran an exploratory factor analysis (EFA) on a data set ("df1", 1320 observations) with 50 variables, after creating a subset containing only the relevant variables with no missing values ("df2", 301 observations).
The EFA yielded 4 factors (19 variables in total).
Now I would like to take those 4 factors and regress them on control variables.
For instance: Factor 1 (df2$fa1) describes job satisfaction.
I would like to control for age and marital status.
Fa1Regression <- lm(df2$fa1 ~ df1$age + df1$marital)
However, I receive the error message:
Error in model.frame.default(formula = df2$fa1 ~ df1$age + :
variable lengths differ (found for 'df1$age')
What can I do to run the regression correctly? Can I delete the observations from df1 that do not appear in df2, so that the variable lengths match?

lm is having a problem regressing a latent factor on other covariates here. Instead, use the lavaan package, where your model statement would be myModel <- 'fa1 ~ x1 + x2 + x3' (lavaan model syntax uses bare column names from the data frame you pass in, not df$ prefixes).
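For illustration, here is a minimal sketch of that approach, assuming the factor scores and the controls have first been brought into a single data frame (here df2 is assumed to also contain age and marital, e.g. after merging on a respondent ID):
library(lavaan)
# fa1, age and marital must all be columns of the data frame passed to sem()
myModel <- 'fa1 ~ age + marital'
fit <- sem(myModel, data = df2)
summary(fit)
Note that the same single-data-frame fix also rescues the original lm() call: lm(fa1 ~ age + marital, data = df2) avoids the length mismatch entirely, since every variable then has the 301 rows of df2.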


Having issue with max_norm parameter of torch.nn.Embedding

I use torch.nn.Embedding to embed my model's categorical input features; however, I run into problems when I set the max_norm parameter to a value other than None.
There is a note on the PyTorch docs page that explains how to use the max_norm parameter through the following example:
n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=True)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t()  # weight must be cloned for this to be differentiable
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()
I can't easily understand this example from the docs. What is the purpose of having both a and b, and why is out defined as out = (a.unsqueeze(0) + b.unsqueeze(1))?
Do we need to first clone the entire embedding tensor, as in a, and then look up the embeddings for our desired indices, as in b? And how do a and b need to be added?
In my code, I don't have W explicitly; I am assuming that W represents the weights applied by the torch.nn.Linear layers. So I just need to prepare the input (which includes the embeddings for the categorical features) that goes into my network.
I greatly appreciate any instructions on this, as understanding this example would help me adapt my code accordingly.
Because W in the line computing a requires gradients, autograd must save the value of embedding.weight in order to compute W's gradient in the backward pass. However, in the line computing b, the call embedding(idx) rescales embedding.weight to max_norm in place. So, without the clone in line a, embedding.weight would be modified when line b executes, corrupting the value that was saved for computing W's gradient. Hence the requirement to clone embedding.weight: it preserves the value before it gets rescaled in line b.
If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about all this.
If you get an error, post it (and your code).

"iteration limit reached" in lme4 GLMM - what does it mean?

I constructed several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc), I consistently get "iteration limit reached" without the usual prefix "Warning message: In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, : ...".
Here's what I know:
it appears to be a warning (judging by the color) but is not labeled as such
you can also get this warning with GLMs and LMERs
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one who regularly runs into this or similar problems: Using glmer.nb(), the error message "(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate" is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: did the model converge? what caused this? What can one do to fix it? Can we read more about this (link to GLMM FAQ - brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1 | f), family = "poisson",
                 newdata = dd,
                 newparams = list(beta = 1, theta = 1),
                 seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1 | f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
  iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(warn = 2, error = recover), warnings will be converted to errors, and errors will dump you into a debugger where you can see the sequence of calls that were active when the warning/error occurred.)
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
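As a rough check of that diagnosis (my own sketch, not part of the original answer): with only a random intercept, the within-group means and variances of the response should track each other when the conditional distribution is equidispersed.
# within-group mean and variance of y; for equidispersed data these track each other
with(dd, tapply(y, f, mean))
with(dd, tapply(y, f, var))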
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes the results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion. This is not directly possible within lme4, but see the sketch below.
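For example, a minimal sketch using the glmmTMB package's Conway-Maxwell-Poisson family (glmmTMB is my stand-in for the lost link; this family accommodates under- as well as overdispersion):
library(glmmTMB)
# compois = Conway-Maxwell-Poisson, which can be under- or overdispersed
m2 <- glmmTMB(y ~ 1 + (1 | f), family = compois, data = dd)
summary(m2)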
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)

p values for random effects in lmer

I am working on a mixed model using the lmer function. I want to obtain p-values for all the fixed and random effects. I am able to obtain p-values for the fixed effects using different methods, but I haven't found anything for the random effects. Every method I have found on the internet builds a null model and then gets the p-values by comparison. Is there a method that doesn't require fitting another model?
My model looks like:
mod1 = lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset)
You must do these through model comparison, as far as I know. The lmerTest package has a function called step, which will reduce your model to just the significant parameters (fixed and random) based on a number of different tests. The documentation isn't entirely clear on how everything is done, so I much prefer to use model comparison to get at specific tests.
For your model, you could test the random slope by specifying:
mod0 <- lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset, REML=TRUE)
mod1 <- lmer(Out ~ Var1 + (1 | Var3), data = dataset, REML=TRUE)
anova(mod0, mod1, refit=FALSE)
This will show you the log likelihood test and test statistic (chi-square distributed). But you are testing two parameters here: the random slope of Var2 and the covariance between the random slopes and random intercepts. So you need a p-value adjustment:
chisq <- anova(mod0, mod1, refit = FALSE)$Chisq[[2]]
1 - (0.5 * pchisq(chisq, df = 2) + 0.5 * pchisq(chisq, df = 1))
More on those tests here or here.
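If you would rather not write the reduced models by hand, the lmerTest package also provides ranova(), which performs these likelihood-ratio comparisons for each random-effects term automatically. A minimal sketch (note that ranova() reports plain LRT p-values and does not apply the 50:50 mixture adjustment above):
library(lmerTest)
mod0 <- lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset)
# ranova() refits the reduced models internally and reports an LRT per term
ranova(mod0)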

Dynamic Topic model output - Blei format

I am working with the Dynamic Topic Models package that was developed by Blei. I am new to LDA, but I understand the basics.
I would like to know what the output file named
lda-seq/topic-000-var-obs.dat stores.
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior, and that by applying the exponential to it I get the probability of each word within topic 001.
Thanks
topic-000-var-e-log-prob.dat stores the log of the variational posterior for topic 1.
topic-001-var-e-log-prob.dat stores the log of the variational posterior for topic 2.
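Following the logic in the question, a minimal sketch for turning one of these log files into word probabilities (the ncol = 10 layout, one row per word and one column per time slice, is an assumption borrowed from the matrix code further down):
logp <- scan("lda-seq/topic-001-var-e-log-prob.dat")
probs <- exp(matrix(logp, ncol = 10, byrow = TRUE))  # assumed: rows = words, cols = time slices
colSums(probs)  # each column (time slice) should sum to roughly 1 if the layout is right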
I have failed to find a concrete answer anywhere. However, the documentation's sample.sh states
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
without mentioning the topic-000-var-obs.dat file, which suggests that it is not imperative for most analyses.
Speculation
"obs" suggests observations. After a little dig around in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)  # presumably rows = tokens, cols = time slices
plot(rowSums(temp.matrix))
and the result is something like this (plot omitted):
The general trend of the non-negative values is decreasing, and many values are floored (in this case to -11.00972 = log(1.67e-05)), suggesting that these values are weightings or some other measure of each token's influence on the model. The model removes some tokens, and the influence/importance of the others tapers off over the index. The latter trend may be caused by preprocessing, such as sorting tokens by tf-idf when creating the dictionary.
Interestingly, the row sums vary for both the floored tokens and the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))

Interpreting libsvm epsilon-SVR result

I tried to train and cross-validate epsilon-SVR on a data set of 8616 samples.
Of these, I take 4368 samples for testing and 4248 for CV.
Kernel type = RBF. libsvm provides the result shown below.
optimization finished, #iter = 502363
nu = 0.689607
obj = -6383530527604706.000000, rho = 2884789.960212
nSV = 3023, nBSV = 3004
This result was obtained with the settings
-s 3 -t 2 -c 2^28 -g 2^-13 -p 2^12
(a) What does "nu" mean? Sometimes I get nu = 0.99xx for different parameters.
(b) "obj" seems surprisingly large. Does that sound correct? The libsvm FAQ says this is the "optimal objective value of the dual SVM problem". Does that mean this is the minimum value of f(alpha)?
(c) "rho" is large too. This is the bias term, b. The data set labels (y) consist of values between 82672 and 286026, so I guess this is reasonable. Am I right?
For training set,
Mean squared error = 1.26991e+008 (regression)
Squared correlation coefficient = 0.881112 (regression)
For cross-validation set,
Mean squared error = 1.38909e+008 (regression)
Squared correlation coefficient = 0.883144 (regression)
Using the selected parameters, I produced the result below:
kernel_type=2 (best c:2^28=2.68435e+008, g:2^-13=0.00012207, e:2^12=4096)
NRMS: 0.345139, best_gap:0.00199433
Mean Absolute Percent Error (MAPE): 5.39%
Mean Absolute Error (MAE): 8956.12 MWh
Daily Peak MAPE: 5.30%
The CV set MAPE is low (5.39%). Using a bias-variance test, the difference between the train set MAPE and the CV set MAPE is only 0.00199433, which suggests the parameters are set correctly. But I wonder whether the extremely large "obj" and "rho" values are correct...
I am very new to SVR, so do correct me if my interpretation or validation method is incorrect or insufficient.
Method to calculate MAPE:
train_model = svmtrain(train_label, train_data, cmd);
[result_label, train_accuracy, train_dec_values] = svmpredict(train_label, train_data, train_model);
train_err = train_label - result_label;                  % residuals
train_errpct = abs(train_err) ./ train_label * 100;      % percent error per sample
train_MAPE = mean(train_errpct(~isinf(train_errpct)));   % drop Inf caused by zero labels
The objective (obj) and rho values are so large because (most probably) the data were not scaled. Scaling is highly recommended to avoid numerical overflow; the overflow risk also depends on the type of kernel. By the way, when scaling the training data, do not forget to also scale the test data. The easiest way is to scale all the data first and then split it into a training and a test set.
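As a minimal illustration of the scale-all-then-split idea (sketched in R rather than the MATLAB above; X_all and train_idx are hypothetical names, and constant columns are assumed away):
mins <- apply(X_all, 2, min)
maxs <- apply(X_all, 2, max)
X_scaled <- scale(X_all, center = mins, scale = maxs - mins)  # maps each column to [0, 1]
X_train <- X_scaled[train_idx, ]   # split only after scaling
X_test  <- X_scaled[-train_idx, ]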