How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is: when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with respondents who are missing one of the variables that went into the factor because of the pairwise deletion? Does it calculate factor scores for those people at all, and if so, how?

Factor scores are a linear combination of their scaled inputs. That is, given standardized variables X_1, ..., X_n with loadings L_1, ..., L_n, the factor score is
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
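As a small illustration in R (the loadings are made-up numbers, just to show the arithmetic), the weighted sum simply propagates the missing value:
# made-up loadings / scoring coefficients for the two standardized inputs
L <- c(flavor = 0.55, texture = 0.55)
# respondent with both items answered
x_complete <- c(flavor = 1.2, texture = 0.8)
sum(L * x_complete)   # a factor score: 1.1
# respondent whose texture item is missing
x_missing <- c(flavor = 1.2, texture = NA)
sum(L * x_missing)    # NA -- no score unless the missing value is imputed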
I don't know what the analyst who created your report did. I suggest you ask them.

Related

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The documentation of MALLET mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET furthermore provides an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that model quality decreases if --num-iterations is set too high?
On a personal project, increasing the number of iterations from 100 to 1000 did not change the average accuracy (measured as Mean Reciprocal Rank) on a downstream task, averaged over 10-fold cross-validation. Within the individual cross-validation splits, however, performance changed substantially, even though the random seed was fixed and all other parameters were kept the same. What part of the background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
The 1000-iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1,000,000 iterations, and fully half the token assignments never changed from the 1000-iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.
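To make that concrete, here is the kind of per-fold comparison that would answer it, sketched in R with made-up MRR values:
# made-up per-fold MRR values at two iteration settings
results <- data.frame(
  fold = rep(1:10, times = 2),
  iterations = rep(c(100, 1000), each = 10),
  mrr = c(0.61, 0.58, 0.70, 0.55, 0.66, 0.59, 0.72, 0.63, 0.57, 0.69,
          0.63, 0.56, 0.71, 0.54, 0.68, 0.60, 0.70, 0.62, 0.59, 0.67)
)
# the overall means can look identical...
aggregate(mrr ~ iterations, data = results, mean)
# ...while individual folds move around; put the two settings side by side per fold
reshape(results, idvar = "fold", timevar = "iterations", direction = "wide")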

Making smaller models from pre-existing redundant models

Sorry for the vague title.
I will just start with an example. Say that I have a pre-existing model that can classify dogs, cats, and humans. However, all I need is a model that can classify between dogs and cats (humans are not needed). The pre-existing model is heavy and redundant, so I want to make a smaller, faster model that can do just the job needed.
What approaches exist?
I thought of utilizing knowledge distillation (using the previous model as a teacher and the new model as the student) and training a whole new model.
First, prune the teacher model to obtain a smaller version to use as the student in distillation. A simple regime such as magnitude-based pruning should suffice.
For distillation, the output vectors will no longer match (the student is 2-dimensional and the teacher is 3-dimensional), so you will have to take this into account and calculate the distillation loss based only on the overlapping dimensions. An alternative is layer-wise distillation, in which the output vectors are irrelevant and the distillation loss is calculated from the difference between intermediate layers of the teacher and student. In both cases, the total loss may include a term for the difference between the student output and the label, in addition to the difference between the student output and the teacher output.
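A minimal numerical sketch of the overlapping-dimensions idea (plain R with made-up logits and an assumed temperature; real training would of course use a deep learning framework):
softmax <- function(z) exp(z) / sum(exp(z))
# made-up logits for one example
teacher_logits <- c(dog = 2.0, cat = 0.5, human = -1.0)   # 3-class teacher
student_logits <- c(dog = 1.5, cat = 0.8)                 # 2-class student
temp <- 2   # distillation temperature (assumed)
# keep only the overlapping teacher classes and re-normalize them
p_teacher <- softmax(teacher_logits[c("dog", "cat")] / temp)
p_student <- softmax(student_logits / temp)
# distillation loss: KL divergence between teacher and student over the shared classes
distill_loss <- sum(p_teacher * (log(p_teacher) - log(p_student)))
# the total loss may also include a hard cross-entropy term against the true label
true_label <- "dog"
hard_loss <- -log(softmax(student_logits)[true_label])
total_loss <- 0.5 * distill_loss + 0.5 * hard_loss   # equal weights assumed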
For a simple task like this, it is also possible that basic transfer learning after pruning would suffice: just replace the 3-dimensional output layer with a 2-dimensional one and continue training.

Is there a way to not select a reference category for logistic regression in SPSS?

When doing logistic regression in SPSS, is there a way to remove the reference category in the independent variables so they're all compared against each other equally rather than against the reference category?
When you have a categorical predictor variable, the most fundamental way to encode it for modeling, sometimes referred to as the canonical representation, is to use a 0-1 indicator for each level of the predictor, where each case takes on a value of 1 for the indicator corresponding to its category, and 0 for all the other indicators. The multinomial logistic regression procedure in SPSS (NOMREG) uses this parameterization.
If you run NOMREG with a single categorical predictor with k levels, the design matrix is built with an intercept column and the k indicator variables, unless you suppress the intercept. If the intercept remains in the model, the last indicator will be redundant, linearly dependent on the intercept and the first k-1 indicators. Another way to say this is that the design matrix is of deficient rank, since any of the columns can be predicted given the other k columns.
The same redundancy will be true of any additional categorical predictors entered as main effects (only k-1 of k indicators can be nonredundant). If you add interactions among categorical predictors, an indicator for each combination of levels of the two predictors is generated, but more than one of these will also be redundant given the intercept and main effects preceding the interaction(s).
The fundamental or canonical representation of the model is thus overparameterized, meaning it has more parameters than can be uniquely estimated. There are multiple ways commonly used to deal with this fact. One approach is the one used in NOMREG and most other more recent regression-type modeling procedures in SPSS, which is to use a generalized inverse of the cross-product of the design matrix, which has the effect of aliasing parameters associated with redundant columns to 0. You'll see these parameters represented by 0 values with no standard errors or other statistics in the SPSS output.
The other way used in SPSS to handle the overparameterized nature of the basic model is to reparameterize the design matrix to full rank, which involves creating k-1 coded variables instead of k indicators for each main effect, and creating interaction variables from these. This is the approach taken in LOGISTIC REGRESSION.
Note that the overall model fit and predicted values from a logistic regression (or other form of linear or generalized linear model) will be the same regardless of what choices are made about parameterization, as long as the appropriate total number of unique columns are in the design matrix. Particular parameter estimates are of course highly dependent upon the particular parameterization used, but you can derive the results from any of the valid approaches using the results from any other valid approach.
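To see the redundancy and the equivalence of parameterizations concretely, here is a small sketch in R rather than SPSS (made-up data; R flags the redundant parameter as NA, which is analogous to SPSS aliasing it to 0):
set.seed(1)
# made-up 3-level predictor and binary outcome
f <- factor(rep(c("a", "b", "c"), each = 10))
y <- rbinom(30, 1, prob = c(0.2, 0.5, 0.8)[as.integer(f)])
# canonical (overparameterized) coding: an intercept plus one 0-1 indicator per level
X_over <- cbind(intercept = 1, model.matrix(~ 0 + f))
fit_over <- glm(y ~ X_over - 1, family = binomial)
coef(fit_over)   # one coefficient is NA (redundant), akin to being aliased to 0
# full-rank coding: intercept plus k - 1 indicators, first level as reference
fit_ref <- glm(y ~ f, family = binomial)
# fitted probabilities are identical under either parameterization
all.equal(unname(fitted(fit_over)), unname(fitted(fit_ref)))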
If there are k levels in a categorical predictor, there are k-1 degrees of freedom for comparing those k groups, meaning that once you've made k-1 linearly independent or nonredundant comparisons, any others can be derived from those.
So the short answer is no, you can't do what you're talking about, but you don't need to, because the results for any valid parameterization will allow you to derive those for any other one.

Computation consideration with different Caffe's network topology (difference in number of output)

I would like to use one of Caffe's reference models, i.e., bvlc_reference_caffenet. I found that my target class, person, is one of the classes included in the ILSVRC dataset the model was trained on. As my goal is to classify whether a test image contains a person or not, I could achieve this in one of the following ways:
1. Use inference directly with the 1000 outputs. This doesn't require any training/learning.
2. Change the network topology slightly so that the final FC layer's number of outputs (num_output) is set to 2 instead of 1000, and retrain it as a binary classification problem.
My concern is about the computational effort at the deployment/prediction phase (testing). The first option looks computationally more expensive than the second, because at prediction time it needs to compute all 1000 output scores to find the one with the highest score. What I'm not sure about is whether there is a heuristic (which I'm not aware of) that simplifies this computation.
Can somebody please help cross-check my understanding of this?
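For a rough sense of scale, the extra cost of keeping the 1000-way output is confined to the final fully connected layer. A back-of-the-envelope sketch in R (the 4096 figure is the usual fc7 width in this AlexNet-style reference net, an assumption to check against the deploy prototxt):
fc7_width <- 4096                  # inputs to the final FC layer (assumed)
mults_1000 <- fc7_width * 1000     # ~4.1 million multiply-adds for the 1000-way output
mults_2    <- fc7_width * 2        # ~8 thousand multiply-adds for a 2-way output
mults_1000 - mults_2               # the difference is small compared with the conv layers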

Extract GLM coefficients for one factor only

I am running a series of GLMs for a number of species of the form:
glm.sp <- glm(number ~ site + as.factor(year) + offset(log(visits)), family = poisson, data = data.sp)
Note that the year term is deliberately a factor, as it's not reasonable to assume a linear relationship. The yearly coefficients produced from this model are a measure of the number of each species per year, taking account of the amount of effort (visits). I then want to extract, exponentiate and index (relative to the last year) the year coefficients and run a GAM on them.
Currently I do this by eyeballing the coefficients and calling them directly:
data.sp.coef$coef <- exp(glm.sp$coefficients[60:77])
However, as the number of sites and the number of years recorded differ for each species, this means I need to eyeball each species; for example, a different species might have its year coefficients at positions 51:64. I'd rather not do that, and feel there must be a better way of pulling out the coefficients for the years.
I've tried the below (which doesn't work!):
> coef(glm.sp)["year"]
<NA>
NA
And I also tried saving all the coefficients as a data frame and using a fuzzy search to extract all the values whose names contain "year" (the coefficients are automatically saved in the format yearXXXX-YY).
I'm certain I'm missing something simple, so would very much appreciate being prodded in the right direction!
Thanks
Matt
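One way to implement the fuzzy-matching idea described above is to pattern-match on the coefficient names rather than their positions (a sketch, assuming the default as.factor(year)XXXX-YY naming that glm produces):
cf <- coef(glm.sp)
# keep only the year terms by matching on the coefficient names
year_cf <- cf[grepl("year", names(cf), fixed = TRUE)]
# exponentiate as before
data.sp.coef$coef <- exp(year_cf)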