FDR correction for univariate regression models - regression

I have three different regression models, each with age and training slope:
X ~ Y + age + training.slope, X ~ Z + age + training.slope, X ~ V + age + training.slope.
I have separated these variables into different models for good reason (e.g. avoiding regression to the mean). Further, I perform these analyses separately for two groups and then compare their coefficients. Could anyone suggest an appropriate way to FDR-correct them? Should I combine the p-values of Y, Z, and V and apply an FDR correction? And given that this is run for two groups, would you combine the p-values of the three variables for both groups and FDR-correct them all together?
Cheers!
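In case it helps to see the mechanics of the pooled option being asked about, here is a minimal sketch in R, assuming the six p-values (Y, Z, V in each of the two groups) have already been extracted from the model summaries; the numbers are made up purely for illustration.
p_group1 <- c(Y = 0.012, Z = 0.040, V = 0.300)   # hypothetical p-values, group 1
p_group2 <- c(Y = 0.008, Z = 0.150, V = 0.049)   # hypothetical p-values, group 2
# One FDR family across all variables and both groups:
p.adjust(c(p_group1, p_group2), method = "BH")
# Alternatively, correct within each group separately:
p.adjust(p_group1, method = "BH")
p.adjust(p_group2, method = "BH")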

Related

Lasso Regression

For the lasso (linear regression with L1 regularization) with a fixed value of λ, it is necessary to use cross-validation to select the best optimization algorithm.
I know that cross-validation can be used to find the optimal value of λ, but is it necessary to use cross-validation when λ is fixed?
Any thoughts please?
Cross-validation isn't about whether your regularization parameter is fixed or not; it's more related to the R² metric.
Let's say you have 100 records and divide the data into 5 sub-datasets, so each sub-dataset contains 20 records.
Out of these 5 sub-datasets, there are 5 different ways to assign one of them as the cross-validation (CV) set.
For each of these 5 scenarios you can compute R², and then take the average R².
This way, you can compare your R² score with the average R² score.
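To make that concrete, here is a minimal sketch of the 5-fold scheme in R for a lasso with a fixed λ; the glmnet package, the simulated data, and the λ value of 0.1 are assumptions chosen only for illustration.
# 5-fold CV R^2 for a lasso fit with a *fixed* lambda (illustrative value 0.1)
library(glmnet)

set.seed(1)
n <- 100
x <- matrix(rnorm(n * 10), n, 10)
y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)

folds <- sample(rep(1:5, length.out = n))   # assign each record to one of 5 folds
cv_r2 <- sapply(1:5, function(k) {
  train <- folds != k
  fit   <- glmnet(x[train, ], y[train], alpha = 1, lambda = 0.1)  # fixed lambda
  pred  <- predict(fit, newx = x[!train, ])
  1 - sum((y[!train] - pred)^2) / sum((y[!train] - mean(y[train]))^2)
})
cv_r2          # R^2 on each held-out fold
mean(cv_r2)    # average out-of-fold R^2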

Testing Data Consistency and its effect on Multilevel Modeling Multivariate Inference

I have an MLM looking at the effect of demographics of a few cities on a region-wide outcome variable, as follows:
RegionalProgress_ij = β_0j + β_1j * Demographics_ij + u_0j + e_ij
The data used in this analysis come from 7 different cities with different data sources. The point I am trying to make is that these 7 data sets (which I have combined) have inconsistent structure and substance, and that these differences do (or do not) alter, or at least complicate, the multivariate relationships. A tip I got was to use β_1j and its variation across cities. I'm having trouble understanding how this relates to demonstrating inconsistencies in the data sets. I'm doing all of this in R, and my model looks like this, in case that's helpful:
model11 <- lmerTest::lmer(RegionalProgress ~ 1 + (1|CITY) + PopDen + HomeOwn + Income + Black + Asian + Hispanic + Age, data = data, REML = FALSE)
Can anyone help me understand the tip, or give me other tips on how to find evidence of:
whether there are meaningful differences (or not) between the data sets across cities, and
how these differences affect the multivariate relationships?
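If it helps, here is a rough sketch of what the tip seems to suggest, reusing the data frame and variable names from your model and picking Income as the illustrative demographic: let its slope vary by CITY and look at how much it varies across cities.
library(lme4)
library(lmerTest)

m_fixed <- lmer(RegionalProgress ~ 1 + Income + (1 | CITY), data = data, REML = FALSE)
m_slope <- lmer(RegionalProgress ~ 1 + Income + (1 + Income | CITY), data = data, REML = FALSE)

VarCorr(m_slope)           # how much the Income slope (beta_1j) varies across cities
coef(m_slope)$CITY         # the city-specific intercepts and slopes themselves
anova(m_fixed, m_slope)    # does allowing the slope to vary by city improve the model?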

Is it possible for the number of basis functions to be larger than the number of observations in spline regression?

I want to fit a regression spline with a B-spline basis. The data are structured in such a way that the number of observations is less than the number of basis functions, and I get a good result.
But I'm not sure whether this is a valid thing to do.
Do I have to have more rows than columns, as in linear regression?
Thank you.
When the number of observations, N, is small, it's easy to fit a model with basis functions that has low squared error. If you have more basis functions than observations, you can even get zero residuals (a perfect fit to the data). But that fit is not to be trusted, because it may not be representative of new data points. So yes, you want more observations than columns. Mathematically, you cannot properly estimate more than N coefficients because of collinearity. As a rule of thumb, 15–20 observations are usually needed for each additional variable or spline term.
But this isn't always the case, such as in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with a small sample size, such as cross-validation and the bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 refits will probably do). Then average the splines and use the average as the final spline function. Or you could use cross-validation, where you train on a smaller subset (say 70%) and then test on the remaining data.
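A minimal sketch of that bootstrap idea, using simulated data and a B-spline basis from the splines package (the 6 degrees of freedom and 100 refits are arbitrary illustrative choices):
library(splines)

set.seed(1)
x <- seq(0, 1, length.out = 40)                  # toy data: 40 observations
y <- sin(2 * pi * x) + rnorm(40, sd = 0.3)
grid <- seq(0, 1, length.out = 200)              # common grid to evaluate each refit on

fits <- replicate(100, {
  d   <- data.frame(x = x, y = y)[sample(40, replace = TRUE), ]       # resample rows
  fit <- lm(y ~ bs(x, df = 6, Boundary.knots = c(0, 1)), data = d)    # B-spline fit
  predict(fit, newdata = data.frame(x = grid))                        # evaluate on the grid
})
avg_spline <- rowMeans(fits)   # bagged spline: the average of the 100 refits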
In the functional data analysis framework, there are R packages that construct and fit spline bases (cubic, B-spline, etc.); these include refund, fda, and fda.usc (and mgcv, whose smooth constructor is used below).
For example,
B <- smooth.construct.cc.smooth.spec(
  object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA),
  data = list(day.t = 200:320), knots = list()
)
constructs a cyclic cubic regression spline basis of dimension 12 over the time index day.t (analogous constructors handle B-spline-based bases), and you can also use these packages to help choose a basis dimension.
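Alternatively, a sketch of letting the software check the basis dimension for you, using mgcv's gam with a penalized B-spline (P-spline) smooth; the simulated response and k = 12 are only for illustration:
library(mgcv)

day.t <- 200:320                                          # same time index as above
y     <- sin(day.t / 20) + rnorm(length(day.t), sd = 0.2) # made-up response

fit <- gam(y ~ s(day.t, bs = "ps", k = 12))  # P-spline smooth: B-spline basis + wiggliness penalty
summary(fit)
gam.check(fit)   # includes a rough check of whether the basis dimension k is large enough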

What is the difference between RSE and MSE?

I am going through An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie, and Tibshirani). I came across two concepts, RSE and MSE. My understanding is:
RSE = sqrt(RSS / (N - 2))
MSE = RSS / N
Now I am building 3 models for a problem and need to compare them. While MSE comes intuitively to me, I was also wondering whether calculating RSS / (N - 2), which according to the above is RSE^2, would be of any use.
I am not sure which to use where.
RSE is an estimate of the standard deviation of the residuals, and therefore also of the observations around the regression line. That is why it equals sqrt(RSS / df), where df is the residual degrees of freedom; in your case of a simple linear model, df = n - 2.
MSE is the mean squared error observed in your models, and it is usually calculated on a test set to compare the predictive accuracy of your fitted models. Since we are concerned with the mean error of the model, we divide by n.
I think of RSE as essentially a version of MSE (i.e. the two are built from the same quantity),
with MSE = RSS / degrees of freedom.
The MSE for a single set of data (X1, X2, ..., Xn) would be RSS over N,
or, more accurately, RSS / (N - 1)
(since one degree of freedom is used up in estimating the mean).
But in a linear regression of Y on X with two estimated terms (intercept and slope), the degrees of freedom are reduced by both, thus N - 2,
so your MSE = RSS / (N - 2),
and this quantity is RSE^2 (its square root is the RSE).
And in an over-parameterized model, meaning one with a collection of many βs (more than two terms, y ~ β0 + β1*X1 + β2*X2 + ...), one can likewise penalize the fit by reducing the denominator by the number of parameters:
MSE = RSS / (N - p), where p is the number of fitted parameters.
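A small worked example in R of the two quantities, using the built-in cars data (one predictor, so the residual degrees of freedom are n - 2):
fit <- lm(dist ~ speed, data = cars)   # built-in data: one predictor plus intercept
rss <- sum(residuals(fit)^2)
n   <- nrow(cars)

rse <- sqrt(rss / (n - 2))   # residual standard error; matches summary(fit)$sigma
mse <- rss / n               # training-set mean squared error
c(RSE = rse, MSE = mse, sigma = summary(fit)$sigma)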

p values for random effects in lmer

I am working on a mixed model using the lmer function. I want to obtain p-values for all the fixed and random effects. I am able to obtain p-values for the fixed effects using different methods, but I haven't found anything for the random effects. Every method I have found on the internet requires fitting a null model and then getting the p-value by comparison. Is there a method that doesn't require fitting another model?
My model looks like:
mod1 = lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset)
You must do these through model comparison, as far as I know. The lmerTest package has a function called step, which will reduce your model to just the significant parameters (fixed and random) based on a number of different tests. The documentation isn't entirely clear on how everything is done, so I much prefer to use model comparison to get at specific tests.
For your model, you could test the random slope by specifying:
mod0 <- lmer(Out ~ Var1 + (1 + Var2 | Var3), data = dataset, REML=TRUE)
mod1 <- lmer(Out ~ Var1 + (1 | Var3), data = dataset, REML=TRUE)
anova(mod0, mod1, refit=FALSE)
This will show you the likelihood ratio test and its test statistic (chi-square distributed). But you are testing two parameters here: the random slope of Var2 and the covariance between the random slopes and random intercepts. So the p-value needs an adjustment:
lrt <- anova(mod0, mod1, refit = FALSE)   # likelihood ratio test comparing the two fits
# p-value from a 50:50 mixture of chi-square distributions with df = 2 and df = 1
1 - (0.5 * pchisq(lrt$Chisq[[2]], df = 2) + 0.5 * pchisq(lrt$Chisq[[2]], df = 1))