Testing Data Consistency and its effect on Multilevel Modeling Multivariate Inference - lme4

I have a multilevel model (MLM) looking at the effect of the demographics of a few cities on a region-wide outcome variable, as follows:
RegionalProgress = β0j + β1j * Demographics + u0j + e0ij
The data used in this analysis come from 7 different cities, each with a different data source. The point I am trying to make is that these 7 data sets (which I have combined) have inconsistent structure and substance, and that these differences do (or do not) alter, or at least complicate, multivariate relationships. A tip I got was to use β1j and its variation across cities. I'm having trouble understanding how this would relate to demonstrating inconsistencies between data sets. I'm doing all of this in R, and my model looks like this, in case that's helpful:
model11 <- lmerTest::lmer(RegionalProgress ~ 1 + (1|CITY) + PopDen + HomeOwn + Income + Black + Asian + Hispanic + Age, data = data, REML = FALSE)
Can anyone help me understand the tip, or give me other tips on how to find evidence of whether:
there are meaningful differences between the data sets across cities, and
these differences affect multivariate relationships?
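One way to read the tip, sketched in Python with made-up numbers rather than the real data: estimate the slope of the outcome on one demographic variable separately for each city, then look at how much those city-specific slopes (the β1j) differ. Similar slopes would suggest the relationship is stable across sources; a wide spread would suggest the relationship itself changes from one data set to the next. The names `rows`, `ols_slope`, and `spread` are hypothetical. The closer lme4 analogue would be a random slope term such as (1 + PopDen | CITY) instead of (1 | CITY).

```python
from collections import defaultdict
from statistics import mean, pstdev

def ols_slope(xs, ys):
    """Least-squares slope of y on x for one city."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical (city, demographic, outcome) rows; not real data.
rows = [
    ("A", 1.0, 2.1), ("A", 2.0, 4.0), ("A", 3.0, 6.2),
    ("B", 1.0, 1.9), ("B", 2.0, 4.1), ("B", 3.0, 5.9),
    ("C", 1.0, 5.0), ("C", 2.0, 4.0), ("C", 3.0, 3.1),
]

# Group the x and y values by city.
by_city = defaultdict(lambda: ([], []))
for city, x, y in rows:
    by_city[city][0].append(x)
    by_city[city][1].append(y)

# One slope per city, plus the spread of those slopes across cities.
slopes = {city: ols_slope(xs, ys) for city, (xs, ys) in by_city.items()}
spread = pstdev(list(slopes.values()))

for city, slope in slopes.items():
    print(f"city {city}: slope = {slope:.2f}")
print(f"standard deviation of slopes across cities = {spread:.2f}")
```

In this toy data, cities A and B have nearly the same slope while city C has the opposite sign, which is the kind of pattern a random-slope variance would pick up.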

Related

Logit regression including year and industry fixed effects in Python

I am currently working with a dataframe containing post-IPO financial information for traditional IPO firms as well as firms that went public through a SPAC merger, all in the years 2020 to 2022. I am trying to model the likelihood of a firm going public via the SPAC merger route from a few key post-IPO financial variables (the independent variables). I want to use a logistic regression model with the dependent variable P(SPAC)i, which is binary and equals 1 for SPAC firms and 0 for IPO firms. The main specification is:
P(SPAC)i = 1 / (1 + e^(-(α + β1 X1,i + β2 X2,i + β3 X3,i + ... + Σj βj YearFE i,j + Σl βl IndustryFE i,l + u i)))
where individual firms are indexed by i, and YearFE and IndustryFE denote the year and industry fixed effects.
I don't know how to include year and industry fixed effects in my logit regression. Could anybody give me a hand with this?
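Year and industry fixed effects in a logit model are usually just sets of 0/1 indicator columns, one per level, with one level left out as the baseline. The sketch below builds those columns by hand in plain Python so the mechanics are visible; the helper `fixed_effect_dummies` and the sample firms are hypothetical. With the statsmodels formula interface, wrapping a column in `C()`, for example `spac ~ x1 + C(year) + C(industry)` passed to `statsmodels.formula.api.logit`, performs the same expansion (the column names `spac` and `x1` are assumptions).

```python
def fixed_effect_dummies(values, prefix):
    """One 0/1 column per level, dropping the first (sorted) level as the baseline."""
    levels = sorted(set(values))
    kept = levels[1:]  # the baseline level is omitted to avoid collinearity with the intercept
    names = [f"{prefix}_{level}" for level in kept]
    columns = [[1 if value == level else 0 for level in kept] for value in values]
    return names, columns

# Hypothetical firms: one year and one industry per firm; not real data.
years = [2020, 2021, 2022, 2021]
industries = ["tech", "health", "tech", "energy"]

year_names, year_cols = fixed_effect_dummies(years, "year")
industry_names, industry_cols = fixed_effect_dummies(industries, "industry")

# Each design row holds the year dummies followed by the industry dummies.
header = year_names + industry_names
design = [y + i for y, i in zip(year_cols, industry_cols)]

print(header)
for row in design:
    print(row)
```

These indicator columns would sit next to the financial variables in the design matrix; the coefficient on each one is the shift in log-odds relative to the omitted year or industry.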

FDR correction for univariate regression models

I have three different regression models, each including age and training slope:
X~Y + age + training.slope, X~Z + age + training.slope, X~V + age + training.slope.
I have separated these variables into different models for good reasons (e.g. avoiding regression to the mean). Further, I perform these analyses separately for two groups and then compare their coefficients. Could anyone suggest an appropriate way to FDR-correct them? Should I combine the p-values of Y, Z, and V and apply an FDR correction? Further, given that this is run for two groups, would you combine the p-values of the three variables for both groups and FDR-correct them all together?
Cheers!
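Whichever set of p-values is treated as one family (three per group, or all six together), the mechanics of the correction are the same. Below is a minimal Python sketch of the Benjamini-Hochberg step-up adjustment applied to six pooled p-values; the values are made up, and the function name `benjamini_hochberg` is hypothetical. In R the equivalent call is p.adjust(p, method = "BH").

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for Y, Z, V in group 1, then Y, Z, V in group 2.
pooled = [0.01, 0.04, 0.03, 0.20, 0.002, 0.50]
adjusted = benjamini_hochberg(pooled)

for raw, adj in zip(pooled, adjusted):
    print(f"raw p = {raw:.3f}, BH-adjusted p = {adj:.3f}")
```

Pooling more p-values into one family changes the multiplier m, so the same raw p-value can end up with a different adjusted value depending on how the family is defined.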

include random slope in binomial mixed model

I am using a binomial GLMM to examine the relationship between presence of individuals (# hours/day) at a site over time. Since presence is measured daily for several individuals, I've included a random intercept for individual ID.
e.g.,
presence <- cbind(hours, 24-hours)
glmer(presence ~ time + (1 | ID), family = binomial)
I'd like to also look at using ID as a random slope, but I don't know how to add this to my model. I've tried the two different approaches below, but I'm not sure which is correct.
glmer(presence ~ time + (1 + ID), family = binomial)
Error: No random effects terms specified in formula
glmer(presence ~ time + (1 + ID | ID), family = binomial)
Error: number of observations (=1639) < number of random effects (=5476) for term (1 + ID | ID); the random-effects parameters are probably unidentifiable
You cannot have a random slope for ID and have ID as a (level-two) grouping variable (see this documentation for more detail: https://cran.r-project.org/web/packages/lme4/lme4.pdf).
The grouping variable (ID in the models below) is the variable across which the random effects are allowed to vary. model_1 gives a random intercept for each ID. model_2 gives both a random intercept and a random slope for the time variable. In other words, model_1 allows the intercept of the relationship between presence and time to vary with ID (the slope remains the same), whereas model_2 allows both the intercept and the slope to vary with ID, so that the relationship between presence and time (i.e., the slope) can be different for each individual (ID).
model_1 = glmer(presence ~ time + (1 | ID), family = binomial)
model_2 = glmer(presence ~ time + (1 + time | ID), family = binomial)
I would also recommend:
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: an introduction to basic and advanced multilevel modeling (2nd ed.): Sage.

Can I use autoencoder for clustering?

In the code below, they use an autoencoder for supervised clustering or classification because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have its labels?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
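# Train an autoencoder that compresses columns 2:564 into two hidden nodes
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
# Extract the outputs of the first (and only) hidden layer
f <- h2o.deepfeatures(m, tfidf, layer = 1)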
The second command there extracts the hidden-node outputs (the deep features). f is a data frame with two numeric columns and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters.
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK.)
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize is a set of training data that you suspect has two primary classes, such as voter history data for Republicans and Democrats. If you take an autoencoder, encode the data to two dimensions, and plot the result on a scatter plot, this clustering becomes clearer. Below is a sample result from one of my models. You can see a noticeable split between the two classes, as well as a bit of expected overlap.
The code can be found here
This method does not require exactly two classes; you could train on as many different classes as you wish. Two polarized classes are just easier to visualize.
This method is not limited to two output dimensions either; that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain high-dimensional spaces to such a small space.
In cases where the encoded (clustered) layer has more dimensions, it is not as easy to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.
There are a couple of ways to determine which class a set of features belongs to. One is to feed the encoded data into a k-nearest-neighbors (kNN) algorithm. Alternatively, what I prefer to do is take the encoded vectors and pass them to a standard backpropagation neural network. Note that, depending on your data, you may find that feeding the raw data straight into your backpropagation neural network is sufficient.
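The kNN option mentioned above can be sketched in a few lines of Python. This assumes the autoencoder has already produced low-dimensional encodings and that some of them have known labels; the points, labels, and the function name `knn_predict` are all hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest encoded training points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D encodings (hidden-layer outputs) with known labels; not real data.
encoded = [(-0.9, -0.8), (-0.7, -0.9), (-0.8, -0.6), (0.8, 0.7), (0.9, 0.9), (0.6, 0.8)]
labels = ["D", "D", "D", "R", "R", "R"]

# Classify two new encoded vectors by their neighbours in the encoded space.
print(knn_predict(encoded, labels, (-0.75, -0.7)))
print(knn_predict(encoded, labels, (0.7, 0.75)))
```

The same function works unchanged for encodings with more than two dimensions, since `math.dist` accepts points of any matching length.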

glmer and odds ratios

Is there a way to calculate odds ratios for a glmer model?
My model is defined as follows:
results_reduced <- glmer(R0A1 ~ MPHW_Perc + AG_Perc + Shrub_Perc + Dist_PrimaryRoads
+ Dist_SecondaryRoads + (1 | ID), data = secondorder_st,
family = binomial)
I would like to calculate odds ratios for this model with associated confidence intervals; however, the following syntax that I would use in GLM continues to run without stopping:
#odds ratios and 95% CI
exp(cbind(OR = coef(results_reduced), confint(results_reduced)))
Therefore, I assume that the random effect is causing some hang-up issue. Is there a way to output the odds ratio for a mixed-effects model?
The random effects probably aren't the problem. First, coef(glmm_model) returns a list containing your fixed and random effects combined; you want fixef(glmm_model). Second, how large is your data set? Calculating confidence intervals for GLMMs takes time (sometimes minutes), so you might just not be waiting long enough. Third, for glmer the default method for calculating confidence intervals is "profile". Depending on your data, it will often throw warnings, which might not cooperate when nested inside your calls to cbind and exp. You could try confint(..., method = "Wald"), or do things piece by piece.
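For reference, a Wald interval on the odds-ratio scale is just the exponentiated estimate plus or minus z times its standard error. The Python sketch below shows that arithmetic with made-up estimates and standard errors on the log-odds scale; the numbers are not from the model in the question, and `odds_ratio_wald` is a hypothetical helper.

```python
from math import exp
from statistics import NormalDist

def odds_ratio_wald(estimate, std_error, level=0.95):
    """Odds ratio and Wald confidence interval from a log-odds coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    lower = exp(estimate - z * std_error)
    upper = exp(estimate + z * std_error)
    return exp(estimate), lower, upper

# Hypothetical fixed-effect estimates and standard errors (log-odds scale).
fixed_effects = {"MPHW_Perc": (0.40, 0.10), "AG_Perc": (-0.25, 0.12)}

for name, (est, se) in fixed_effects.items():
    odds_ratio, lower, upper = odds_ratio_wald(est, se)
    print(f"{name}: OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Wald intervals are quick because they need only the estimates and standard errors already in the model summary, whereas profile intervals refit the model repeatedly.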