tbl_regression - indicating that categorical variables were included without breaking down statistics for each category

My regression model uses several control variables, each with several categorical levels. When I use tbl_regression to get a table, it comes out very long. I don't really care about showing the results for these variables. Is there a way to collapse the tbl_regression display of a given variable, so that it shows the variable was included in the regression without listing all of its categorical levels?

Related

Evaluating the performance of variational autoencoder on unlabeled data

I've designed a variational autoencoder (VAE) that clusters sequential time series data.
To evaluate the performance of the VAE on labeled data, I first run KMeans on the raw data and compare the generated labels with the true labels using the Adjusted Mutual Information score (AMI). Then, after the model is trained, I pass validation data through it, run KMeans on the latent vectors, and compare the generated labels with the true labels of the validation data using AMI. Finally, I compare the two AMI scores with each other to see whether KMeans performs better on the latent vectors than on the raw data.
My question is this: How can we evaluate the performance of the VAE when the data is unlabeled?
I know we can run KMeans on the raw data and generate labels for it, but in this case, since we consider the generated labels as true labels, how can we compare the performance of KMeans on the raw data with KMeans on the latent vectors?
Note: The model is totally unsupervised. Labels (if they exist) are not used in the training process. They're used only for evaluation.
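In code, the labeled-data procedure described above looks roughly like this (a minimal sketch; the random arrays and the slicing step are stand-ins for my real validation data and the trained VAE's encoder):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_mutual_info_score

    # Stand-ins for the real validation data, true labels, and VAE encoder
    X_val = np.random.rand(300, 20)              # validation data
    y_val = np.random.randint(0, 3, size=300)    # true labels
    Z_val = X_val[:, :8]                         # stand-in for vae.encode(X_val)

    k = len(np.unique(y_val))
    labels_raw = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_val)
    labels_latent = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z_val)

    print("AMI on raw data:      ", adjusted_mutual_info_score(y_val, labels_raw))
    print("AMI on latent vectors:", adjusted_mutual_info_score(y_val, labels_latent))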
In unsupervised learning you evaluate the performance of a model either by using labelled data or by visual analysis. In your case you do not have labelled data, so you would need to do some analysis. One way is to look at the predictions: if you know how the raw data should be labelled, you can qualitatively evaluate the accuracy. Another method, since you are using KMeans, is to visualize the clusters. If the clusters are spread apart in distinct groups, that is usually a good sign; if they are close together and overlapping, the labelling of vectors in those areas may be less accurate. Alternatively, you can use an internal cluster-validity metric, such as the silhouette score, to evaluate the clusters, or come up with your own.
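As a concrete example of the metric route, the sketch below scores KMeans with the silhouette score, an internal cluster-validity metric that needs no labels (the arrays are stand-ins for real data and the VAE's latent vectors). One caveat: silhouette values computed in different feature spaces are not strictly comparable, so treat the comparison as a rough diagnostic rather than a definitive test.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def clustering_score(X, n_clusters=3, seed=0):
        """Fit KMeans and return the silhouette score (higher is better)."""
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        return silhouette_score(X, labels)

    X_raw = np.random.rand(500, 20)    # stand-in for the raw data
    Z = X_raw[:, :8]                   # stand-in for the VAE's latent vectors

    print("silhouette on raw data:      ", clustering_score(X_raw))
    print("silhouette on latent vectors:", clustering_score(Z))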

Encoding a lot of categorical variables

I have 10 million categorical variables (each variable has 3 categories). What is the best way to encode these 10 million variables to train a deep learning model on them? If I use one-hot encoding, I will end up with 30 million variables. An embedding layer with one output makes no sense (it is similar to integer encoding, and there is no order between these categories), and an embedding layer with two outputs does not make much of a difference; usually we use an embedding layer when the number of categories is large. Please give me your opinion.
You should treat this problem like word embeddings, where you also have a lot of entities (usually 30-50 thousand).
Make a random embedding for each category, of dimension 100-300. Use triplet loss or something like it to train the embeddings. Basically, create a valid pair of embeddings, or a pair of embedding and input. For word vectors, these are words that co-occur in a context window (they are near each other in a sentence). Then pick some other, unrelated words at random. Train the network so that the valid pair is closer (in cosine distance) than the random pairs; there are different loss functions you can try, but basically the closer the valid pair and the further the random pairs, the lower the loss.
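A minimal PyTorch sketch of this setup is below. It assumes you can sample co-occurring category pairs; here the pairs are random placeholders, and the vocabulary size is scaled down from the question's 10 million.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_categories = 10_000   # placeholder; scale up for the real problem
    emb_dim = 128

    emb = nn.Embedding(num_categories, emb_dim)
    opt = torch.optim.Adam(emb.parameters(), lr=1e-3)
    # Triplet loss with cosine distance: pull valid pairs together,
    # push randomly sampled pairs apart
    loss_fn = nn.TripletMarginWithDistanceLoss(
        distance_function=lambda a, b: 1 - F.cosine_similarity(a, b),
        margin=0.5)

    for step in range(1_000):
        # anchor/positive should be genuinely co-occurring category IDs;
        # random IDs are used here only as placeholders
        anchor = torch.randint(0, num_categories, (256,))
        positive = torch.randint(0, num_categories, (256,))
        negative = torch.randint(0, num_categories, (256,))
        loss = loss_fn(emb(anchor), emb(positive), emb(negative))
        opt.zero_grad()
        loss.backward()
        opt.step()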
However, I would think about how you have formulated your problem. Do you actually have 10 million categories? Why do you have more labels than there are words in any human language? If you can group them into hierarchies so that you have fewer labels at multiple stages, your model will be more effective.
Did you already try an ordinal encoder? This would encode the categories but won't increase the number of variables.
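For illustration, a minimal scikit-learn sketch with made-up toy data; note that, as the question points out, this does impose an arbitrary order on the categories:

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    X = np.array([["low", "red"],
                  ["mid", "blue"],
                  ["high", "red"]])   # toy data: 2 categorical variables

    enc = OrdinalEncoder()
    X_enc = enc.fit_transform(X)      # still 2 columns, now integer-coded
    print(X_enc)
    print(enc.categories_)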

Is there a way to not select a reference category for logistic regression in SPSS?

When doing logistic regression in SPSS, is there a way to remove the reference category in the independent variables so they're all compared against each other equally rather than against the reference category?
When you have a categorical predictor variable, the most fundamental way to encode it for modeling, sometimes referred to as the canonical representation, is to use a 0-1 indicator for each level of the predictor, where each case takes on a value of 1 for the indicator corresponding to its category, and 0 for all the other indicators. The multinomial logistic regression procedure in SPSS (NOMREG) uses this parameterization.
If you run NOMREG with a single categorical predictor with k levels, the design matrix is built with an intercept column and the k indicator variables, unless you suppress the intercept. If the intercept remains in the model, the last indicator will be redundant, linearly dependent on the intercept and the first k-1 indicators. Another way to say this is that the design matrix is of deficient rank, since any of the columns can be predicted given the other k columns.
The same redundancy will be true of any additional categorical predictors entered as main effects (only k-1 of k indicators can be nonredundant). If you add interactions among categorical predictors, an indicator for each combination of levels of the two predictors is generated, but more than one of these will also be redundant given the intercept and main effects preceding the interaction(s).
The fundamental or canonical representation of the model is thus overparameterized, meaning it has more parameters than can be uniquely estimated. There are multiple ways commonly used to deal with this fact. One approach is the one used in NOMREG and most other more recent regression-type modeling procedures in SPSS, which is to use a generalized inverse of the cross-product of the design matrix, which has the effect of aliasing parameters associated with redundant columns to 0. You'll see these parameters represented by 0 values with no standard errors or other statistics in the SPSS output.
The other way used in SPSS to handle the overparameterized nature of the basic model is to reparameterize the design matrix to full rank, which involves creating k-1 coded variables instead of k indicators for each main effect, and creating interaction variables from these. This is the approach taken in LOGISTIC REGRESSION.
Note that the overall model fit and predicted values from a logistic regression (or other form of linear or generalized linear model) will be the same regardless of what choices are made about parameterization, as long as the appropriate total number of unique columns are in the design matrix. Particular parameter estimates are of course highly dependent upon the particular parameterization used, but you can derive the results from any of the valid approaches using the results from any other valid approach.
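To illustrate this point outside of SPSS, here is a minimal Python sketch (statsmodels, with made-up data) showing that two different reference-category choices yield different coefficients but identical fitted probabilities:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "group": rng.choice(["a", "b", "c"], size=300),
        "y": rng.integers(0, 2, size=300),
    })

    # Same model, two different reference categories
    m1 = smf.logit("y ~ C(group, Treatment(reference='a'))", data=df).fit(disp=0)
    m2 = smf.logit("y ~ C(group, Treatment(reference='c'))", data=df).fit(disp=0)

    print(np.allclose(m1.predict(df), m2.predict(df)))   # True: identical fits
    print(m1.params, m2.params, sep="\n")                # different estimates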
If there are k levels in a categorical predictor, there are k-1 degrees of freedom for comparing those k groups, meaning that once you've made k-1 linearly independent or nonredundant comparisons, any others can be derived from those.
So the short answer is no, you can't do what you're talking about, but you don't need to, because the results for any valid parameterization will allow you to derive those for any other one.

How can I get the IDs of specific items in a PyTorch dataloader-based dataset with a query?

I have a large dataset (approx. 500 GB and 180k data points plus labels) in a PyTorch dataloader. Until now, I used torch.utils.data.random_split to split the dataset randomly into training and validation sets. However, this led to serious overfitting. Now I would rather use a deterministic split, i.e. based on the paths stored in the dataloader, I could figure out a non-random split. However, I have no idea how to do so... The question is: how can I get the IDs of about 10% of the data points based on some query that looks at the information about the files stored in the data loader (e.g. the paths)?
Have you used a custom dataset along with the dataloader? If the underlying dataset has some variable that stores the filenames of the individual files, you can access it using dataloader.dataset.filename_variable.
If that's not available, you can create a custom dataset yourself that essentially wraps the original dataset and exposes the information you need.
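A minimal sketch of that approach, assuming the dataset keeps one file path per item (the PathDataset class and its paths attribute are hypothetical stand-ins for your actual dataset). Hashing each path gives a deterministic, roughly 10% validation split:

    import hashlib
    from torch.utils.data import Dataset, Subset, DataLoader

    class PathDataset(Dataset):
        """Hypothetical dataset that records one file path per data point."""
        def __init__(self, paths):
            self.paths = paths
        def __len__(self):
            return len(self.paths)
        def __getitem__(self, idx):
            # real code would load and return the file at self.paths[idx]
            return idx

    def in_val_split(path, fraction=0.1):
        """Deterministically send ~`fraction` of paths to validation."""
        h = int(hashlib.md5(path.encode()).hexdigest(), 16)
        return (h % 1000) < fraction * 1000

    paths = [f"data/sample_{i}.pt" for i in range(100)]   # placeholder paths
    ds = PathDataset(paths)
    val_idx = [i for i, p in enumerate(ds.paths) if in_val_split(p)]
    train_idx = [i for i, p in enumerate(ds.paths) if not in_val_split(p)]

    train_loader = DataLoader(Subset(ds, train_idx), batch_size=16, shuffle=True)
    val_loader = DataLoader(Subset(ds, val_idx), batch_size=16)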

MANOVA or Multiple Regression

We have several independent variables (some continuous with more than 5 levels, some binary, and some quasi-interval, i.e. 5-level categorical). We also have 5 dependent variables that share a common construct. Is it more useful to conduct a MANOVA with the continuous/quasi-interval variables as covariates and the binary ones as factors, or to perform 5 separate multiple regression analyses?
Thank you
In general, it is inadvisable to perform multiple (univariate) analyses when you can replace them with one (multivariate) analysis. Neglecting this can inflate the Type I error rate when a suitable correction is not applied. It might also lead to overlooking an effect arising from interaction between variables analysed separately.
With this in mind, you would probably be better off performing a single MANOVA. Covariates, however, should be determined by your experimental design rather than by the type of variable (nominal, categorical, or continuous).
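For readers working in Python, here is a minimal statsmodels sketch of a single MANOVA with all five dependent variables on the left-hand side (all column names and data are made up for illustration):

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(1)
    n = 200
    df = pd.DataFrame({
        **{f"dv{i}": rng.normal(size=n) for i in range(1, 6)},  # 5 related DVs
        "cont": rng.normal(size=n),              # continuous predictor
        "binary": rng.integers(0, 2, size=n),    # binary factor
        "quasi": rng.integers(1, 6, size=n),     # 5-level quasi-interval item
    })

    m = MANOVA.from_formula(
        "dv1 + dv2 + dv3 + dv4 + dv5 ~ C(binary) + cont + quasi", data=df)
    print(m.mv_test())   # Wilks' lambda, Pillai's trace, etc. for each term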