ANOVA mean test for Vector Valued Response - anova

What codes should I use in R for an ANOVA model when the response is vector valued...i.e.
Suppose I have longitudinal data for 20 individuals each having measurements of 10 time points...now I have a factor X having 3 levels,say 0,1,2...I need to test if the levels differ significantly from each other...I have to test for the mean vector (vector since each individual contains 10 time points)..i.e. if the mean vector for level 0, mean vector for level 1 and mean vector for level 2 are significantly different...
My sample data is:
Y
[1,] 9.759608 15.02230 17.70331
[2,] 9.596711 15.50542 18.49343
[3,] 11.298570 17.44781 19.48276
[4,] 8.519376 13.73086 17.05881
[5,] 10.232851 15.85302 19.87476
[6,] 10.888219 16.05568 20.12624
[7,] 9.688724 15.50494 18.82778
[8,] 10.309219 16.78230 18.80428
[9,] 9.620743 15.84582 19.32465
[10,] 10.418802 16.18098 17.94019
>treatment=c(0,1,1,2,0,2,1,1,0,1)
>treatment=factor(treatment)
> result=aov(Y~treatment)
Error in model.frame.default(formula = Y ~ treatment, drop.unused.levels = TRUE) :
object is not a matrix

Maybe it's just a problem of type of object. Try :
as.matrix(Y)

Related

Unexpected output of emmeans averaged accross variables

I transformed a variable (e.g. leaf_area) using a simple square transformation and then fitted to the following model containing an interaction:
fit <- lmer(leaf_area^2 ~genotype*soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
To obtain the emmeans averaged accross genotypes and soil type for each measurement date, I further use the following command:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement, type = "response")
The emmeans are, nevertheless, averaged for the variable date_measurement.
As represented in the following example, emmeans are averages of genotypes x, y and z in the soil MT and in the date of measurement 27.4, but the measurement dates actually occured on 21, 23, 28, 30 and 35 das.
genotype soil_type date_measurement emmean SE df lower.CL upper.CL
x MT 27.4 0.190 0.0174 126.0 0.155 0.224
y MT 27.4 0.220 0.0147 74.1 0.191 0.250
z MT 27.4 0.210 0.0157 108.6 0.179 0.241
When I fit the model without interaction between genotype and soil type and run the emmeans, the results are still averaged for the measurement dates.
fit <- lmer(leaf_area^2 ~genotype + soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
fit.emm <- emmeans(fit, ~ genotype + soil_type + date_measurement, type = "response")
My question is: how can I obtain the emmeans averaged accross genotype and soil but separated for each date of measurement?
Class of variables:
date_measurement, light, x_position, y_position: numeric
genotype and soil_type: factor
Thank you in advance.
When you have a numerical predictor in the model, the default is to obtain predictions at the average value of that covariate. If you want the covariates treated like factors, you have to say so:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement,
cov.reduce = FALSE)
In addition, emmeans cannot auto-detect your square transformation. You can fix it up by doing
fit.emm <- update(fit.emm, tran = make.tran("power", 2),
type = "response")
Then I think you will want to subsequently obtain marginal means by averaging over date_measurement at least -- i.e.,
fit.emm2 <- emmeans(fit.emm, ~ genotype*soil_type)
It will retain the transformation and type = "response" setting.

Contrast emmeans: post-hoc t-test as the average differences of the differences between baseline and treatment periods

I am using the lme4 package in R to undertake linear mixed effect models (LMM). Essentially all participants received two interventions (an intervention treatment and a placebo (control)) and were separated by a washout period. However, the order or sequence they received the interventions differed.
An interaction term of intervention and visit was included in the LMM with eight levels including all combinations of intervention (2 levels: control and intervention) and visit (4 levels: visit 1=baseline 1, visit 2, visit 3=post-randomization baseline 2, visit 4).
My question is how do I determine the intervention effect by a post-hoc t-test as the average differences of the differences between interventions, hence between visits 1 and 2 and between visits 3 and 4. I also want to determine the effects of the intervention and control compared to baseline.
Please see code below:
model1<- lmer(X ~ treatment_type:visit_code + (1|SID) + (1|SID:period), na.action= na.omit, data = data.x)
emm <- emmeans(model1 , ~treatment_type:visit_code)
My results of model 1 is:
emm
treatment_type visit_code emmean SE df lower.CL upper.CL
Control T0 -0.2915 0.167 26.0 -0.635 0.0520
Intervention T0 -0.1424 0.167 26.0 -0.486 0.2011
Control T1 -0.2335 0.167 26.0 -0.577 0.1100
Intervention T1 0.0884 0.167 26.0 -0.255 0.4319
Control T2 0.0441 0.167 26.0 -0.299 0.3876
Intervention T2 -0.2708 0.168 26.8 -0.616 0.0748
Control T3 0.1272 0.167 26.0 -0.216 0.4708
Intervention T3 0.0530 0.168 26.8 -0.293 0.3987
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
I first created a matrix/ vectors:
#name vectors
Control.B1<- c(1,0,0,0,0,0,0,0) #control baseline 1 (visit 1)
Intervention.B1<- c(0,1,0,0,0,0,0,0) #intervention baseline 1 (visit 1)
Control.A2<- c(0,0,1,0,0,0,0,0) #post control 1 (visit 2)
Intervention.A2<- c(0,0,0,1,0,0,0,0) #post intervention 1 (visit 2)
ControlB3<- c(0,0,0,0,1,0,0,0) #control baseline 2 (visit 3)
Intervention.B3<- c(0,0,0,0,0,1,0,0) #intervention baseline 2 (visit 3)
Control.A4<- c(0,0,0,0,0,0,1,0) #post control 2 (visit 4)
Intervention.A4<- c(0,0,0,0,0,0,0,1) #post intervention 2 (visit 4)
Contbaseline = (Control.B1 + Control.B3)/2 # average of control baseline visits
Intbaseline = (Intervention. B1 + Intervention.B3)/2 # average of intervention baseline visits
ControlAfter= (Control.A2 + Control.A4)/2 # average of after control visits
IntervAfter= (Intervention.A2 + Intervention.A4)/2 # average of after intervention visits
Control.vs.Baseline = (ControlAfter-Contbaseline)
Intervention.vs.Baseline = (IntervAfter-Intbaseline)
Control.vs.Intervention = ((Control.vs.Baseline)-(Intervention.vs.Baseline))
the output of these are as follows:
> Control.vs.Baseline
[1] -0.5 0.0 0.5 0.0 -0.5 0.0 0.5 0.0
> Intervention.vs.Baseline
[1] 0.0 -0.5 0.0 0.5 0.0 -0.5 0.0 0.5
> Control.vs.Intervention
[1] -0.5 0.5 0.5 -0.5 -0.5 0.5 0.5 -0.5
Is this correct to the average differences of the differences between baseline and treatment periods?
Many thanks in advance!
A two-period crossover is the same as a repeated 2x2 Latin square. My suggestion for future such experiments is to structure the data accordingly, using variables for sequence (rows), period (columns), and treatment (assigned in the pattern (A,B) first sequence and (B,A) second sequence. The subjects are randomized to which sequence they are in.
So with your data, you would need to add a variable sequence that has the level AB for those subjects who receive the treatment sequence A, A, B, B, and level BA for those who receive B, B, A, A (though I guess the 1st and 3rd are really baseline for everybody).
Since there are 4 visits, it helps keep things sorted if you recode that as two factors trial and period, as follows:
visit trial period
1 base 1
2 test 1
3 base 2
4 test 2
Then fit the model with formula
model2 <- lmer(X ~ (sequence + period + treatment_type) * trial +
(1|SID:sequence), ...etc...)
The parenthesized part is the standard model for a Latin square. Then the analysis can be done without custom contrasts as follows:
RG <- ref_grid(model2) # same really as emmeans() for all 4 factors
CHG <- contrast(RG, "consec", simple = "trial")
CHG <- update(CHG, by = NULL, infer = c(TRUE, FALSE))
CHG contains the differences from baseline (trial differences for each combination of the other three factors. The update() step removes the by variables saved from contrast(). Now, we can get the marginal means and comparisons for each factor:
emmeans(CHG, consec ~ treatment_type)
emmeans(CHG, consec ~ period)
emmeans(CHG, consec ~ sequence)
These will be the same results you got the other way via custom contrasts. The one that was a difference of differences before is now handled by sequence. This works because in a 2x2 Latin square, the main effect of each factor is confounded with the two-way interaction of the other two factors.

Comparison the words with the original file in the R

I have original dataset in json format. Let's load it in R.
library("rjson")
setwd("mydir")
getwd()
json_data <- fromJSON(paste(readLines("N1.json"), collapse=""))
uu <- unlist(json_data)
uutext <- uu[names(uu) == "text"]
And I have another dataset mydata2
mydata=read.csv(path to data/words)
I need to find the words in mydata2, only which are present in messages in json file. And then write this messages into the new document, "xyz.txt" How to do it?
chalk indirect pick reaction team skip pumpkin surprise bless ignorance
1 time patient road extent decade cemetery staircase monarch bubble abbey
2 service conglomerate banish pan friendly position tight highlight rice disappear
3 write swear break tire jam neutral momentum requirement relationship matrix
4 inspire dose jump promote trace latest absolute adjust joystick habit
5 wrong behave claim dedicate threat sell particle statement teach lamb
6 eye tissue prescription problem secretion revenge barrel beard mechanism platform
7 forest kick face wisecrack uncertainty ratio complain doubt reflection realism
8 total fee debate hall soft smart sip ritual pill category
9 contain headline lump absorption superintendent digital increase key banner second
i mean
chalk -1 number1 indirect -2 number2
template
Word1-1 number1-1; Word1-2 number 1-2; …; Word 1-10 number 1-10
Word2-1 number2-1; Word2-2 number 2-2; …; Word 2-10 number 2-10
Next time pls include real data. Simplified model:
library(data.table)
word = c("test","meh","blah")
jsonF = c("let's do test", "blah is right", "test blah", "test test")
outp <- list()
for (i in 1:length(word)) {
outp[[i]] = as.data.frame(grep(word[i],jsonF,v=T,fixed=T)) # possibly, ignore.case=T
}
qq = rbindlist(outp)
qq = unique(qq)
print(qq)
1: let's do test
2: test blah
3: test test
4: blah is right
Edit: quick and dirty paste/collapse:
library(data.table)
x = LETTERS[1:10]
y = LETTERS[11:20]
df = rbind(x,y)
L = list()
for (i in 1:nrow(df)) {
L[i] = paste0(df[i,],"-",seq(1,10)," ",i,"-",seq(1,10),collapse="; ")
}
Fin = cbind(L)
View(Fin)
Gives:
> Fin
L
[1,] "A-1 1-1; B-2 1-2; C-3 1-3; D-4 1-4; E-5 1-5; F-6 1-6; G-7 1-7; H-8 1-8; I-9 1-9; J-10 1-10"
[2,] "K-1 2-1; L-2 2-2; M-3 2-3; N-4 2-4; O-5 2-5; P-6 2-6; Q-7 2-7; R-8 2-8; S-9 2-9; T-10 2-10"

Different Sum Sq and MSS using lme4::lmer and lmerTest::lmer

I get sums of squares and mean sums of squares 10x higher when I use anova on lmerTest:: lmer compared to lme4:: lmer objects. See the R log file below. Note the warning that when I attach the lmerTest package, the stats::sigma function overrides the lme4::sigma function, and I suspect that it is this that leads to the discrepancy. In addition, the anova report now says that it is a Type III anova rather than the expected Type I. Is this a bug in the lmerTest package, or is there something about use of the Kenward-Roger approximation for degrees of freedom that changes the calculation of SumSQ and MSS and specification of the anova Type that I don't understand?
I would append the test file, but it is confidential clinical trial information. If necessary I can see if I can cobble up a test case.
Thanks in advance for any advice you all can provide about this.
> library(lme4)
Loading required package: Matrix
Attaching package: ‘lme4’
The following object is masked from ‘package:stats’:
sigma
> test100 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
Warning message:
replacing previous import ‘lme4::sigma’ by ‘stats::sigma’ when loading
‘lmerTest’
> test200 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> anova(test100)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
prepost 1 3.956 3.956 18.4825
lowhi 1 130.647 130.647 610.3836
prepost:lowhi 1 0.038 0.038 0.1758
> anova(test200, ddf = 'Ken')
Analysis of Variance Table of type III with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
prepost 37.15 37.15 1 308.04 18.68 2.094e-05 ***
lowhi 1207.97 1207.97 1 376.43 607.33 < 2.2e-16 ***
prepost:lowhi 0.35 0.35 1 376.43 0.17 0.676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Update: Thanks, Ben. I did a little code archeology on lmerTest to see if I could dope out an explanation for the above anomalies. First, it turns out that lmerTest::lmer just submits the model to lme4::lmer and then relabels the result as an "mermodLmerTest" object. The only effect of this is to invoke versions of summary() and anova() from the lmerTest package rather than the usual defaults from base and stats. (These lmerTest functions are compiled, and I have not yet gone farther to look at the C++ code.) lmerTest::summary just adds three columns to the base::summary result, giving df, t value, and Pr. Note that lmerTest::anova, by default, computes a type III anova rather than a type I as in stats::anova. (Explanation of my second question above.) Not a great choice if one's model includes interactions. One can request a type I, II, or III anova using the type = 1/2/3 option.
However there are other surprises using the nlmeTest versions of summary and anova, as shown in the R console file below. I used lmerTest's included sleepstudy data so that this code should be replicable.
a. Note that "sleepstudy" has 180 records (with 3 variables)
b. The summaries of fm1 and fm1a are identical except for the added Fixed effects columns. But note that in the lmerTest::summary the ddfs for the intercept and Days are 1371 and 1281 respectively; odd given that there are only 180 records in "sleepstudy."
c. Just as in my original model above, the nlm4 anad nlmrTest versions of anova give very different values of Sum Sq and Mean Sq. (30031 and 446.65 respectively).
d: Interestingly, the nlmrTest versions of anova using Satterthwaite and Kenward-Rogers estimates of the DenDF are wildly different (5794080 and 28 respecitvely). The K-R value seems more reasonable.
Given the above issues, I am reluctant at this point to depend on lmerTest to give reliable p-values. Based on your (Doug Bates's) blog entry (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html), I am using now (and recommending) the method from the posting by Dan Mirman (http://mindingthebrain.blogspot.ch/2014/02/three-ways-to-get-parameter-specific-p.html) in the final bit of code below to estimate a naive t-test p-value (assuming essentially infinite degrees of freedom) and a Kenward-Rogers estimate of df (using the R package 'pbkrtest' -- the same package used by lmerTest). I couldn't find R code to compute the Satterthwaite estimates. The naive t-test p-value is clearly anti-conservative, but the KR estimate is thought to be pretty good. If the two give similar estimates of p, then I think that one can feel comfortable with a p-value in the range of [naive t-test, KR estimate].
> library(lme4); library(lmerTest); library(pbkrtest);
dim(sleepstudy)
[1] 180 3
>
> fm1 <- lme4::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm1a <- lmerTest::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> summary(fm1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.84
Days 10.467 1.546 6.77
Correlation of Fixed Effects:
(Intr)
Days -0.138
> summary(fm1a)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 251.405 6.825 1371.100 302.06 <2e-16 ***
Days 10.467 1.546 1281.700 55.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
Days -0.138
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML
criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> anova(fm1)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
Days 1 30031 30031 45.853
> anova(fm1a, ddf = 'Sat', type = 1)
Analysis of Variance Table of type I with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 5794080 45.853 1.275e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
> anova(fm1a, ddf = 'Ken', type = 1)
Analysis of Variance Table of type I with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 27.997 45.853 2.359e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> # t.test
> coefs <- data.frame(coef(summary(fm1)))
> coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
> coefs
Estimate Std..Error t.value p.z
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11
>
> # Kenward-Rogers
> df.KR <- get_ddf_Lb(fm1, fixef(fm1))
> df.KR
[1] 25.89366
> coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
> coefs
Estimate Std..Error t.value p.z p.KR
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00 0.0000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11 3.5447e-07

Evaluate Multivariate Function on a grid for two of the variables

I should probably delete this question, since I found an answer elsewhere on stackoverflow.
Using 'Vectorize' solved the problem.
outer(x,y,Vectorize(model))
I'll leave this up for a bit in case this helps anyone else. The issue I had is described below.
I am an occasional user of R, mostly for Scientific Programming. I want to evaluate a function on a grid of x,y values. Something similar to this simple example:
> x=seq(1,5,1)
> y=seq(1,5,1)
> model = function(a,b){a+b}
> (outer(x,y,model))
[,1] [,2] [,3] [,4] [,5]
[1,] 2 3 4 5 6
[2,] 3 4 5 6 7
[3,] 4 5 6 7 8
[4,] 5 6 7 8 9
[5,] 6 7 8 9 10
The input and output of my actual function looks like this (the volumetric flow rate of a 'Cross' fluid in a circular pipe):
> SimpleCrossModelCircPipe(eta_0 = 176.45, lambda = 2.4526, m=0.83122, mesh=1000,
pipe_length=1.925, press_drop=20.3609, pipe_diameter=0.0413)
[1] 0.005635064
I want to evaluate the function on a grid of values, for example, press_drop and pipe_diameter. So, define a function of these two variables.
> model = function(a,b){SimpleCrossModelCircPipe(eta_0 = 176.45, lambda = 2.4526,
m=0.83122, mesh=1000,pipe_length=1.925, press_drop=a, pipe_diameter=b)}
For just one pair of values this does work:
> model(1,2)
[1] 107249.8
But, if I use with 'outer', there is an error.
> (outer(x,y,model))
Error in seq.default(tau_wall, 0, length.out = mesh + 1) :
'from' must be of length 1
If I place a print statement for the local variable, tau_wall, using outer generates all 25 values, before moving to the line, which generates the error, since tau_wall should have length 1, not 25.
(outer(x,y,model))
[1] 352.5289 705.0579 1057.5868 1410.1158 1762.6447 705.0579 1410.1158 2115.1736
[9] 2820.2315 3525.2894 1057.5868 2115.1736 3172.7605 4230.3473 5287.9341 1410.1158
[17] 2820.2315 4230.3473 5640.4630 7050.5788 1762.6447 3525.2894 5287.9341 7050.5788
[25] 8813.2235
Is there a simple fix to make this work?
Best,
Steve