Multiple regression model

I have a multiple regression model that looks like this:
wine.lm <- lm(Alc.vol. ~ pc2 + pc1 * factor(pc_parametr$color),
              data = pc_parametr)
I have to extract the coefficients and translate them into one regression equation for each color (white, red).
                                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)                        12.6803846  0.2511799 50.4832872 1.104478e-26
pc2                                -3.7524814  3.5850681 -1.0466974 3.052541e-01
pc1                                -9.3332435  7.5111530 -1.2425847 2.255503e-01
factor(pc_parametr$color)white      0.4778615  0.2926116  1.6330914 1.149828e-01
pc1:factor(pc_parametr$color)white  6.7281697  8.0999853  0.8306397 4.140399e-01
I was trying to do it manually but I am confused.
Y = 12.68 - 3.75 * pc2 - 9.33 * pc1 + 0.48 * factor(pc_parametr$color)white + 6.73 * pc1:factor(pc_parametr$color)white + e
Is there code to do this calculation for the different colors, or what is the correct way to do it manually?

Here is the correct equation notation. Red is the reference level and is always included in the intercept, so its equation uses the intercept and main-effect slopes only; for white, the dummy coefficient is added to the intercept and the interaction coefficient is added to the pc1 slope:
Red:   Y = 12.68 - 3.75 * pc2 - 9.33 * pc1 + e
White: Y = (12.68 + 0.48) - 3.75 * pc2 + (-9.33 + 6.73) * pc1 + e
         = 13.16 - 3.75 * pc2 - 2.60 * pc1 + e
In the plot, the color term moves the regression line along the y-axis, and the interaction term changes the slope of pc1 for white wines.
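If you prefer not to do the arithmetic by hand, the same two equations can be assembled from the coefficient vector. A minimal sketch, assuming the wine.lm fit from the question (the coefficient names follow from the formula above):
cf <- coef(wine.lm)
# Red (reference level): intercept and main-effect slopes only
red   <- c(intercept = cf[["(Intercept)"]],
           pc2 = cf[["pc2"]],
           pc1 = cf[["pc1"]])
# White: the dummy shifts the intercept, the interaction shifts the pc1 slope
white <- c(intercept = cf[["(Intercept)"]] + cf[["factor(pc_parametr$color)white"]],
           pc2 = cf[["pc2"]],
           pc1 = cf[["pc1"]] + cf[["pc1:factor(pc_parametr$color)white"]])
red
white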


Unexpected output of emmeans averaged across variables

I transformed a variable (e.g. leaf_area) using a simple square transformation and then fitted the following model containing an interaction:
fit <- lmer(leaf_area^2 ~ genotype * soil_type + date_measurement + light +
              (1 | repetition) + (1 | y_position) + (1 | x_position),
            data = dataset)
To obtain the emmeans averaged across genotypes and soil type for each measurement date, I then use the following command:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement, type = "response")
The emmeans are, nevertheless, averaged over the variable date_measurement.
As shown in the following example, the emmeans are averages of genotypes x, y, and z in the soil MT at the measurement date 27.4, but the measurement dates actually occurred at 21, 23, 28, 30, and 35 das.
genotype soil_type date_measurement emmean     SE    df lower.CL upper.CL
x        MT                    27.4  0.190 0.0174 126.0    0.155    0.224
y        MT                    27.4  0.220 0.0147  74.1    0.191    0.250
z        MT                    27.4  0.210 0.0157 108.6    0.179    0.241
When I fit the model without the interaction between genotype and soil type and run emmeans, the results are still averaged over the measurement dates.
fit <- lmer(leaf_area^2 ~ genotype + soil_type + date_measurement + light +
              (1 | repetition) + (1 | y_position) + (1 | x_position),
            data = dataset)
fit.emm <- emmeans(fit, ~ genotype + soil_type + date_measurement, type = "response")
My question is: how can I obtain the emmeans averaged across genotype and soil but separated for each date of measurement?
Class of variables:
date_measurement, light, x_position, y_position: numeric
genotype and soil_type: factor
Thank you in advance.
When you have a numerical predictor in the model, the default is to obtain predictions at the average value of that covariate. If you want the covariates treated like factors, you have to say so:
fit.emm <- emmeans(fit, ~ genotype * soil_type + date_measurement,
                   cov.reduce = FALSE)
In addition, emmeans cannot auto-detect your square transformation. You can fix it up by doing
fit.emm <- update(fit.emm, tran = make.tran("power", 2),
                  type = "response")
Then I think you will want to subsequently obtain marginal means by averaging over date_measurement at least -- i.e.,
fit.emm2 <- emmeans(fit.emm, ~ genotype*soil_type)
It will retain the transformation and type = "response" setting.
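For reference, here is the whole pipeline in one place; a sketch assuming the dataset and variable names from the question. The final call shows the grouping asked for in the question (means per date, averaged over genotype and soil_type); swapping the specs gives the genotype-by-soil means discussed above:
library(lme4)
library(emmeans)
fit <- lmer(leaf_area^2 ~ genotype * soil_type + date_measurement + light +
              (1 | repetition) + (1 | y_position) + (1 | x_position),
            data = dataset)
# Keep every observed value of the numeric covariate instead of its mean:
fit.emm <- emmeans(fit, ~ genotype * soil_type + date_measurement,
                   cov.reduce = FALSE)
# Register the square transformation so type = "response" back-transforms:
fit.emm <- update(fit.emm, tran = make.tran("power", 2), type = "response")
# Means per measurement date, averaged over genotype and soil_type:
emmeans(fit.emm, ~ date_measurement)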

Mathematical equation of binomial probit gam (mgcv) with tensor product interactions?

I have the following binomial (probit) gam using mgcv, which includes y (0 or 1), two continuous predictors (xa, xb) plus the ‘ti’ interactions of a third covariate (xc) with these two predictors.
library(mgcViz)  # gamV() is a wrapper around mgcv::gam()
mygam <- gamV(y ~ s(xa, k = 10, bs = "cr") + s(xb, k = 10, bs = "cr") +
                ti(xc, xa, bs = c("cr", "cr"), k = c(5, 5)) +
                ti(xc, xb, bs = c("cr", "cr"), k = c(5, 5)),
              data = df, method = "ML", family = binomial(link = "probit"))
Using default k=10 for main effects and k=c(5,5) for interactions, the intercept and 50 coefficients are the following:
terms <- c("Intercept", "s(xa).1", "s(xa).2", "s(xa).3", "s(xa).4", "s(xa).5", "s(xa).6", "s(xa).7", "s(xa).8", "s(xa).9", "s(xb).1", "s(xb).2", "s(xb).3", "s(xb).4", "s(xb).5", "s(xb).6", "s(xb).7", "s(xb).8", "s(xb).9", "ti(xc,xa).1", "ti(xc,xa).2", "ti(xc,xa).3", "ti(xc,xa).4", "ti(xc,xa).5", "ti(xc,xa).6", "ti(xc,xa).7", "ti(xc,xa).8", "ti(xc,xa).9", "ti(xc,xa).10", "ti(xc,xa).11", "ti(xc,xa).12", "ti(xc,xa).13", "ti(xc,xa).14", "ti(xc,xa).15", "ti(xc,xa).16", "ti(xc,xb).1", "ti(xc,xb).2", "ti(xc,xb).3", "ti(xc,xb).4", "ti(xc,xb).5", "ti(xc,xb).6", "ti(xc,xb).7", "ti(xc,xb).8", "ti(xc,xb).9", "ti(xc,xb).10", "ti(xc,xb).11", "ti(xc,xb).12", "ti(xc,xb).13", "ti(xc,xb).14", "ti(xc,xb).15", "ti(xc,xb).16")
coefs <- c(-0.0702421404106311, 0.0768316292916553, 0.210036768213672, 0.409025596435604, 0.516554288252813, 0.314600352165584, -0.271938137725695, -1.1169186662112, -1.44829172827383, -2.39608336269616, 0.445091855160863, 0.119747299507175, -0.73508332280573, -1.3851857008194, -1.84125850675114, -1.77797283303084, -1.45118023146655, -1.56696555281429, -2.55103708393941, 0.0505422263407052, -0.110361707609838, -0.168897589312596, -0.0602318423244818, 0.095385784704545, -0.20818521830706, -0.318650042681766, -0.113613570916751, 0.123559386280642, -0.269467853796075, -0.412476320830133, -0.147039497705579, 0.189416535823022, -0.412990646359733, -0.632158143648671, -0.225344249076957, 0.0237165469278517, 0.0434926950921869, 0.080572361088243, 0.397397459143317, 0.0453636001566695, 0.0831126054198634, 0.153350111096294, 0.75009880522662, 0.0583689328419794, 0.107001374561518, 0.197852239031467, 0.970623037721609, 0.0894562434842868, 0.163989821269297, 0.303175057387294, 1.48718228468607)
df_coefs <- data.frame(terms, coefs)
I would like the mathematical equation of this model, which would allow one to determine the probability of y given known covariates. As an example from my dataset (n > 70000), the predicted probability ‘prob’ (type = “response”) obtained with xa = 7.116, xb = 2.6, and xc = 19 was prob = 0.76444141, which is the value the mathematical equation should reproduce.
Is this possible?
Thanks for your help and time.
Below is the summary(mygam) output:
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.07024 0.00709 -9.907 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(xa) 8.007 8.548 5602.328 < 2e-16 ***
s(xb) 8.448 8.908 16282.793 < 2e-16 ***
ti(xc,xa) 1.004 1.007 10.278 0.00138 **
ti(xc,xb) 1.021 1.042 7.718 0.00627 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.52 Deviance explained = 45.6%
-ML = 29379 Scale est. = 1 n = 77870
If you set type="terms" in the predict function, you get the contributions of the individual components to the linear predictor. Note, however, that these contributions are on the scale of the linear predictor, not on the scale of the outcome probability.
Because of the non-linear transformation of the linear predictor -- in your case with the probit link -- attributing the predicted probability to the individual components requires attribution methods that come with additional assumptions.
An example of such an attribution method is Shapley values.
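That said, the model itself has a closed form: P(y = 1) = pnorm(b0 + s(xa) + s(xb) + ti(xc,xa) + ti(xc,xb)), where each smooth is the sum of its basis functions multiplied by the coefficients listed above. Writing the spline basis functions out by hand is impractical, but mgcv will evaluate them for you, so the probability can be reconstructed explicitly from the coefficients. A sketch, assuming the fitted mygam and the example covariate values from the question:
newd <- data.frame(xa = 7.116, xb = 2.6, xc = 19)
Xp <- predict(mygam, newdata = newd, type = "lpmatrix")  # basis functions evaluated at newd
eta <- drop(Xp %*% coef(mygam))  # linear predictor: intercept + sum(basis * coefficient)
prob <- pnorm(eta)               # inverse probit link
prob  # should reproduce predict(mygam, newd, type = "response"), i.e. ~0.7644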

Julia and StatsModels: print results without header line (StatsModels.DataFrameRegressionModel{...})

I run a linear regression with Julia, GLM, and StatsModels and print the results, which I include directly in the research report. This printout includes a header line with the object type, which is a distraction in the report. For example, this code:
using GLM, StatsModels, DataFrames
df = DataFrames.DataFrame(a = rand(10), b = rand(10))
f = fit(LinearModel, @formula(a ~ b), df)
println(f)
prints:
StatsModels.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: a ~ 1 + b
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.238502 0.224529 1.06223 0.3191
b 0.333053 0.366105 0.909721 0.3896
I can avoid the first line by casting the object into a string and splitting at newlines:
f2 = split(string(f), "\n")
for i in 2:length(f2)
    println(f2[i])
end
and then I get:
Formula: a ~ 1 + b
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.238502 0.224529 1.06223 0.3191
b 0.333053 0.366105 0.909721 0.3896
But this is ugly and prone to errors. In the GLM documentation of methods applied to a fit object, I found no methods or arguments for this. Does anyone have a cleaner way?
Following the discussion in the comments: if you only need a summary of the coefficients of your model, write:
julia> coeftable(f)
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.337666 0.205716 1.64142 0.1393
b -0.0887478 0.378739 -0.234324 0.8206
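If the table is destined for a file in the report pipeline, the CoefTable returned by coeftable can be written with show, which likewise omits the type header. A small sketch, assuming the fit f from above (the file name is just an example):
open("coefficients.txt", "w") do io
    show(io, coeftable(f))  # writes only the coefficient table
end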

Calculate cutoff and sensitivity for specific values of specificity?

After calculating several regression models, I want to calculate sensitivity values and the cut-off for pre-specified values of specificity (i.e., 0.99, 0.90, 0.85, and so on) to find the best model. I have written code that calculates sensitivity and specificity for given values of the cut-off (from 0.1 to 0.9), but now I want to go the other way around, i.e., calculate the corresponding cut-off value and sensitivity for specific values of specificity, and here I'm stuck.
Suppose I have the following regression model (using the example dataset 'mtcars'):
data(mtcars)
model <- glm(formula = vs ~ wt + disp, data = mtcars, family = binomial)
Here is the code I've used to calculate sensitivity and specificity for given values of the cut-off:
library(caret)  # for confusionMatrix()
predvalues <- model$fitted.values
getMisclass <- function(cutoff, p, labels) {
  d <- cut(p, breaks = c(-Inf, cutoff, Inf), labels = c("0", "1"))
  print(confusionMatrix(d, factor(labels), positive = "1"))
  cat("cutoff", cutoff, ":\n")
  t <- table(d, labels)
  print(round(sum(t[c(1, 4)]) / sum(t), 2))  # overall accuracy
}
cutoffs <- seq(0.1, 0.9, by = 0.1)
sapply(cutoffs, getMisclass, p = predvalues, labels = mtcars$vs)
Can someone help me rewrite this code to calculate the cut-off scores and sensitivity values for a given range of specificity values? Is it possible?
The specificity values of interest should be
cutoffs <- c(0.99, 0.90, 0.85, 0.80, 0.75)
Thanks a lot!
This is closely related to how ROC curves are calculated: if those are calculated with fine granularity, you essentially get a sensitivity and specificity for "every" threshold value. So what you could do is simply calculate the sensitivities, specificities, and corresponding thresholds as if you wanted to obtain a ROC curve...
library(pROC)
myRoc <- roc(predictor = predvalues, response = mtcars$vs)
plot(myRoc)
myRoc$specificities
print(with(myRoc, data.frame(specificities, sensitivities, thresholds)))
# specificities sensitivities thresholds
# 1 0.00000000 1.00000000 -Inf
# 2 0.05555556 1.00000000 0.002462809
# 3 0.11111111 1.00000000 0.003577104
# 4 0.16666667 1.00000000 0.004656164
# 5 0.22222222 1.00000000 0.005191974
# 6 0.27777778 1.00000000 0.006171197
# [...]
...and then look up the corresponding sensitivities and thresholds for whichever specificities you are interested in, e.g. as:
cutoffs <- c(0.99, 0.90, 0.85, 0.80, 0.75)
myData <- with(myRoc, data.frame(specificities, sensitivities, thresholds))
library(plyr)
print(laply(cutoffs, function(cutoff) myData$sensitivities[which.min(abs(myData$specificities-cutoff))]))
# [1] 0.7857143 0.8571429 0.8571429 0.9285714 0.9285714
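Alternatively, newer versions of pROC can do this lookup directly with coords(). A sketch, assuming the myRoc object from above; note that if a requested specificity falls between points of the ROC curve, pROC interpolates the sensitivity and may report NA for the threshold:
specs <- c(0.99, 0.90, 0.85, 0.80, 0.75)
coords(myRoc, x = specs, input = "specificity",
       ret = c("threshold", "specificity", "sensitivity"))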

Using different colors in Gnuplot based on a CSV file column value

I have a CSV file with the following structure:
X,Y,Z
where X and Y are coordinates on a square plot and Z can be 0 or 1. I want to plot the points in different colors, depending on the value in the Z column.
Is that possible?
So far I have a file which just displays all the data on the square chart and colors them with only 1 color:
filename='test.csv'
set datafile separator ","
set title filename
set size square
plot filename using 1:2 linecolor rgb "yellow"
It's all in the documentation; check help rgbcolor variable:
rgb(r,g,b) = 65536 * int(r) + 256 * int(g) + int(b)
color1 = rgb(255,0,0); color2 = rgb(0,255,0)
plot filename using 1:2:($3 == 0 ? color1 : color2) w p lc rgb variable
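For completeness, a full script combining the settings from the question with the variable-color plot command; it assumes test.csv with the X,Y,Z layout described above, and the red/green choice is arbitrary:
filename = 'test.csv'
set datafile separator ","
set title filename
set size square
rgb(r,g,b) = 65536 * int(r) + 256 * int(g) + int(b)
color1 = rgb(255,0,0)  # points with Z == 0
color2 = rgb(0,255,0)  # points with Z == 1
plot filename using 1:2:($3 == 0 ? color1 : color2) with points lc rgb variable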