Julia and StatsModels: print results without header line (StatsModels.DataFrameRegressionModel{...}) - regression

I run a linear regression with Julia, GLM, and StatsModels and print the results, which I include directly in the research report. This printout includes a header line with the object type, which is a distraction in the report. For example, this code:
using GLM, StatsModels, DataFrames
df = DataFrames.DataFrame(a = rand(10), b = rand(10))
f = fit(LinearModel, #formula(a ~ b), df)
println(f)
prints:
StatsModels.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: a ~ 1 + b
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.238502 0.224529 1.06223 0.3191
b 0.333053 0.366105 0.909721 0.3896
I can avoid the first line by casting the object into a string and splitting at newlines:
f2 = split(string(f), "\n")
for i in 2:length(f2)
println(f2[i])
end
and then I get:
Formula: a ~ 1 + b
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.238502 0.224529 1.06223 0.3191
b 0.333053 0.366105 0.909721 0.3896
But this is ugly and prone to errors. In the GLM documentation of methods applied to a fit object, I found no methods or arguments for this. Does anyone have a cleaner way?

Following discussion in the comments. If you only need to get summary of coefficients of your model write:
julia> coeftable(f)
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.337666 0.205716 1.64142 0.1393
b -0.0887478 0.378739 -0.234324 0.8206

Related

Unexpected output of emmeans averaged accross variables

I transformed a variable (e.g. leaf_area) using a simple square transformation and then fitted to the following model containing an interaction:
fit <- lmer(leaf_area^2 ~genotype*soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
To obtain the emmeans averaged accross genotypes and soil type for each measurement date, I further use the following command:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement, type = "response")
The emmeans are, nevertheless, averaged for the variable date_measurement.
As represented in the following example, emmeans are averages of genotypes x, y and z in the soil MT and in the date of measurement 27.4, but the measurement dates actually occured on 21, 23, 28, 30 and 35 das.
genotype soil_type date_measurement emmean SE df lower.CL upper.CL
x MT 27.4 0.190 0.0174 126.0 0.155 0.224
y MT 27.4 0.220 0.0147 74.1 0.191 0.250
z MT 27.4 0.210 0.0157 108.6 0.179 0.241
When I fit the model without interaction between genotype and soil type and run the emmeans, the results are still averaged for the measurement dates.
fit <- lmer(leaf_area^2 ~genotype + soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
fit.emm <- emmeans(fit, ~ genotype + soil_type + date_measurement, type = "response")
My question is: how can I obtain the emmeans averaged accross genotype and soil but separated for each date of measurement?
Class of variables:
date_measurement, light, x_position, y_position: numeric
genotype and soil_type: factor
Thank you in advance.
When you have a numerical predictor in the model, the default is to obtain predictions at the average value of that covariate. If you want the covariates treated like factors, you have to say so:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement,
cov.reduce = FALSE)
In addition, emmeans cannot auto-detect your square transformation. You can fix it up by doing
fit.emm <- update(fit.emm, tran = make.tran("power", 2),
type = "response")
Then I think you will want to subsequently obtain marginal means by averaging over date_measurement at least -- i.e.,
fit.emm2 <- emmeans(fit.emm, ~ genotype*soil_type)
It will retain the transformation and type = "response" setting.

Different Sum Sq and MSS using lme4::lmer and lmerTest::lmer

I get sums of squares and mean sums of squares 10x higher when I use anova on lmerTest:: lmer compared to lme4:: lmer objects. See the R log file below. Note the warning that when I attach the lmerTest package, the stats::sigma function overrides the lme4::sigma function, and I suspect that it is this that leads to the discrepancy. In addition, the anova report now says that it is a Type III anova rather than the expected Type I. Is this a bug in the lmerTest package, or is there something about use of the Kenward-Roger approximation for degrees of freedom that changes the calculation of SumSQ and MSS and specification of the anova Type that I don't understand?
I would append the test file, but it is confidential clinical trial information. If necessary I can see if I can cobble up a test case.
Thanks in advance for any advice you all can provide about this.
> library(lme4)
Loading required package: Matrix
Attaching package: ‘lme4’
The following object is masked from ‘package:stats’:
sigma
> test100 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
Warning message:
replacing previous import ‘lme4::sigma’ by ‘stats::sigma’ when loading
‘lmerTest’
> test200 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> anova(test100)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
prepost 1 3.956 3.956 18.4825
lowhi 1 130.647 130.647 610.3836
prepost:lowhi 1 0.038 0.038 0.1758
> anova(test200, ddf = 'Ken')
Analysis of Variance Table of type III with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
prepost 37.15 37.15 1 308.04 18.68 2.094e-05 ***
lowhi 1207.97 1207.97 1 376.43 607.33 < 2.2e-16 ***
prepost:lowhi 0.35 0.35 1 376.43 0.17 0.676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Update: Thanks, Ben. I did a little code archeology on lmerTest to see if I could dope out an explanation for the above anomalies. First, it turns out that lmerTest::lmer just submits the model to lme4::lmer and then relabels the result as an "mermodLmerTest" object. The only effect of this is to invoke versions of summary() and anova() from the lmerTest package rather than the usual defaults from base and stats. (These lmerTest functions are compiled, and I have not yet gone farther to look at the C++ code.) lmerTest::summary just adds three columns to the base::summary result, giving df, t value, and Pr. Note that lmerTest::anova, by default, computes a type III anova rather than a type I as in stats::anova. (Explanation of my second question above.) Not a great choice if one's model includes interactions. One can request a type I, II, or III anova using the type = 1/2/3 option.
However there are other surprises using the nlmeTest versions of summary and anova, as shown in the R console file below. I used lmerTest's included sleepstudy data so that this code should be replicable.
a. Note that "sleepstudy" has 180 records (with 3 variables)
b. The summaries of fm1 and fm1a are identical except for the added Fixed effects columns. But note that in the lmerTest::summary the ddfs for the intercept and Days are 1371 and 1281 respectively; odd given that there are only 180 records in "sleepstudy."
c. Just as in my original model above, the nlm4 anad nlmrTest versions of anova give very different values of Sum Sq and Mean Sq. (30031 and 446.65 respectively).
d: Interestingly, the nlmrTest versions of anova using Satterthwaite and Kenward-Rogers estimates of the DenDF are wildly different (5794080 and 28 respecitvely). The K-R value seems more reasonable.
Given the above issues, I am reluctant at this point to depend on lmerTest to give reliable p-values. Based on your (Doug Bates's) blog entry (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html), I am using now (and recommending) the method from the posting by Dan Mirman (http://mindingthebrain.blogspot.ch/2014/02/three-ways-to-get-parameter-specific-p.html) in the final bit of code below to estimate a naive t-test p-value (assuming essentially infinite degrees of freedom) and a Kenward-Rogers estimate of df (using the R package 'pbkrtest' -- the same package used by lmerTest). I couldn't find R code to compute the Satterthwaite estimates. The naive t-test p-value is clearly anti-conservative, but the KR estimate is thought to be pretty good. If the two give similar estimates of p, then I think that one can feel comfortable with a p-value in the range of [naive t-test, KR estimate].
> library(lme4); library(lmerTest); library(pbkrtest);
dim(sleepstudy)
[1] 180 3
>
> fm1 <- lme4::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm1a <- lmerTest::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> summary(fm1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.84
Days 10.467 1.546 6.77
Correlation of Fixed Effects:
(Intr)
Days -0.138
> summary(fm1a)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 251.405 6.825 1371.100 302.06 <2e-16 ***
Days 10.467 1.546 1281.700 55.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
Days -0.138
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML
criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> anova(fm1)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
Days 1 30031 30031 45.853
> anova(fm1a, ddf = 'Sat', type = 1)
Analysis of Variance Table of type I with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 5794080 45.853 1.275e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
> anova(fm1a, ddf = 'Ken', type = 1)
Analysis of Variance Table of type I with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 27.997 45.853 2.359e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> # t.test
> coefs <- data.frame(coef(summary(fm1)))
> coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
> coefs
Estimate Std..Error t.value p.z
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11
>
> # Kenward-Rogers
> df.KR <- get_ddf_Lb(fm1, fixef(fm1))
> df.KR
[1] 25.89366
> coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
> coefs
Estimate Std..Error t.value p.z p.KR
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00 0.0000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11 3.5447e-07

for every row, reshape and calculate eigenvectors in a vectorized way

I have been wanting to figure this out for a long time, but have had no success yet. I am assuming I will use arrayfun, but I couldn't figure it yet. Appreciate help. Here is the problem:
Given a matrix of many rows and N^2 columns, reshape every row to NxN matrix and calculate eigenvalues, and do this in a vectorized way not using for loop. For example
A=
0.6060168 0.8340029 0.0064574 0.7133187
0.6325375 0.0919912 0.5692567 0.7432627
0.8292699 0.5136958 0.4171895 0.2530783
0.7966113 0.1975865 0.6687064 0.3226548
0.0163615 0.2123476 0.9868179 0.1478827
for every **i**
m=reshape(A(i,:),2,2)
[vc vl]=eig(m)
I am inclined to do something like
f = #(x) eig(reshape(x,2,2))
arrayfun(f,A)
but of course I am getting an error like
octave:5> arrayfun(f,A)
error: reshape: can't reshape 1x1 array to 2x2 array
error: evaluating argument list element number 1
error: evaluating argument list element number 1
error: called from:
error: at line -1, column -1
error: cellfun: too many output arguments
error: /usr/share/octave/3.2.4/m/general/arrayfun.m at line 168, column 21
A = [0.6060168 0.8340029 0.0064574 0.7133187;
0.6325375 0.0919912 0.5692567 0.7432627;
0.8292699 0.5136958 0.4171895 0.2530783;
0.7966113 0.1975865 0.6687064 0.3226548;
0.0163615 0.2123476 0.9868179 0.1478827];
N = 2;
[mc, ml] = arrayfun (#(row) eig (reshape (A (row, :), N, N)), 1:rows(A), "UniformOutput", false)
mc =
{
[1,1] =
-0.170783 -0.044626
0.985309 -0.999004
[1,2] =
-0.95343 -0.89053
0.30161 -0.45492
(cropped)
}
ml =
{
[1,1] =
Diagonal Matrix
0.56876 0
0 0.75057
[1,2] =
Diagonal Matrix
0.45246 0
0 0.92334
(cropped)
With the Ndpar package, calculations can be parallelized over multiple cores.
Borrowing from Andy's answer,
pkg load ndpar
A = [0.6060168 0.8340029 0.0064574 0.7133187;
0.6325375 0.0919912 0.5692567 0.7432627;
0.8292699 0.5136958 0.4171895 0.2530783;
0.7966113 0.1975865 0.6687064 0.3226548;
0.0163615 0.2123476 0.9868179 0.1478827];
N = 2;
[eigenvectors, eigenvalues] = ndpar_arrayfun(nproc,
#(row) eig(reshape(row, N, N)),
A, "IdxDimensions", 1, "Uniformoutput", false)
yields the same output.
EDIT - Or with the original pararrayfun from the octave-forge parallel package:
[eigenvectors, eigenvalues] = pararrayfun(nproc,
#(row_idx) eig(reshape(A(row_idx, :), N, N)),
1:rows(A), "UniformOutput", false)

Problem with spline method = 'monoH.FC''

I am interested in using the monotone spline, but I get an error when R tries to use it. I am using R 2.12.0, and the method 'monoH.FC' says that it has been supported since 2.8.0
Reproducible example (same result for more complicated (x,y) relationships)
x<-1:2
y<-1:2
spline(x,y,method="monoH.FC")
Error in spline(x, y, method = "monoH.FC") : invalid interpolation method
What I have tried
?spline returns:
...
Usage:
...
spline(x, y = NULL, n = 3*length(x), method = "fmm",
xmin = min(x), xmax = max(x), xout, ties = mean)
...
Arguments:
method: specifies the type of spline to be used. Possible values are
‘"fmm"’, ‘"natural"’, ‘"periodic"’ and ‘"monoH.FC"’.
...
But the spline function itself indicates that the 'monoH.FC' method is not supported:
...
method <- pmatch(method, c("periodic", "natural", "fmm"))
if (is.na(method))
stop("invalid interpolation method")
...
Question
How can I use method = 'monoH.FC' with spline?
Use splinefun; it supports method=monoH.FC.
The last example in ?spline shows you how to do it.
## An example of monotone interpolation
n <- 20
set.seed(11)
x. <- sort(runif(n)) ; y. <- cumsum(abs(rnorm(n)))
plot(x.,y.)
curve(splinefun(x.,y.)(x), add=TRUE, col=2, n=1001)
curve(splinefun(x.,y., method="mono")(x), add=TRUE, col=3, n=1001)
legend("topleft", paste("splinefun( \"", c("fmm", "monoH.CS"), "\" )", sep=''),
col=2:3, lty=1)

What's the correct way to expand a [0,1] interval to [a,b]?

Many random-number generators return floating numbers between 0 and 1.
What's the best and correct way to get integers between a and b?
Divide the interval [0,1] in B-A+1 bins
Example A=2, B=5
[----+----+----+----]
0 1/4 1/2 3/4 1
Maps to 2 3 4 5
The problem with the formula
Int (Rnd() * (B-A+1)) + A
is that your Rnd() generation interval is closed on both sides, thus the 0 and the 1 are both possible outputs and the formula gives 6 when the Rnd() is exactly 1.
In a real random distribution (not pseudo), the 1 has probability zero. I think it is safe enough to program something like:
r=Rnd()
if r equal 1
MyInt = B
else
MyInt = Int(r * (B-A+1)) + A
endif
Edit
Just a quick test in Mathematica:
Define our function:
f[a_, b_] := If[(r = RandomReal[]) == 1, b, IntegerPart[r (b - a + 1)] + a]
Build a table with 3 10^5 numbers in [1,100]:
table = SortBy[Tally[Table[f[1, 100], {300000}]], First]
Check minimum and maximum:
In[137]:= {Max[First /# table], Min[First /# table]}
Out[137]= {100, 1}
Lets see the distribution:
BarChart[Last /# SortBy[Tally[Table[f[1, 100], {300000}]], First],
ChartStyle -> "DarkRainbow"]
X = (Rand() * (B - A)) + A
Another way to look at it, where r is your random number in the range 0 to 1:
(1-r)a + rb
As for your additional requirement of the result being an integer, maybe (apart from using built in casting) the modulus operator can help you out. Check out this question and the answer:
Expand a random range from 1–5 to 1–7
Well, why not just look at how Python does it itself? Read random.py in your installation's lib directory.
After gutting it to only support the behavior of random.randint() (which is what you want) and removing all error checks for non-integer or out-of-bounds arguments, you get:
import random
def randint(start, stop):
width = stop+1 - start
return start + int(random.random()*width)
Testing:
>>> l = []
>>> for i in range(2000000):
... l.append(randint(3,6))
...
>>> l.count(3)
499593
>>> l.count(4)
499359
>>> l.count(5)
501432
>>> l.count(6)
499616
>>>
Assuming r_a_b is the desired random number between a and b and r_0_1 is a random number between 0 and 1 the following should work just fine:
r_a_b = (r_0_1 * (b-a)) + a