Why does lmerTest give different p-values when the data values are very small? - lme4

I'm new to statistics and this package. I expected the p-value to stay the same if I multiply or divide all my data by the same number, for example all *10 or all *100.
But because my data values are very small (~10^-9), the p-value is almost 1 to begin with. When I multiply the data (the 'x' in the model below), the p-value decreases until the values reach ~10^-5, after which it stops changing.
test <- lmer(x ~ a + b + c + (1|rep), data = data)
Estimate Std. Error df t value Pr(>|t|)
a2 -5.783e-09 1.232e-09 8.879e-05 -4.693 0.999 (raw data)
a2 -5.783e-08 1.232e-08 6.177e-03 -4.693 0.971 (raw data*10)
a2 -5.783e-07 1.232e-07 3.473e-01 -4.693 0.397 (raw data*100)
a2 -5.783e-06 1.232e-06 7.851e+00 -4.693 0.00164 ** (raw data*1000)
a2 -5.783e-05 1.232e-05 9.596e+01 -4.693 8.95e-06 *** (raw data*10000)
a2 -0.0005783 0.0001232 95.9638425 -4.693 8.95e-06 *** (raw data*100000)
I don't understand why these p-values change and then become constant. Could someone kindly explain this to me?

Okay, after a bit of digging, I think I have found a solution and an explanation. As you can see in your example, the t-value is not changing; the changes in p-value are due to changes in the estimated degrees of freedom. The default method for these is the Satterthwaite approximation, which, according to one of the package authors, depends on the scale of the dependent variable (see the post here: https://stats.stackexchange.com/questions/342848/satterthwaite-degrees-of-freedom-in-a-mixed-model-change-drastically-depending-o)
Within a normal range of orders of magnitude, the degrees of freedom do not change and the p-values remain constant. You approached this from the other direction in your example, noting that the numbers stopped changing after a certain point (once the numbers in the DV were sufficiently large). Here I show that they are stable using the iris dataset included with R:
# Preparing data
library(lmerTest)  # lmerTest's lmer() adds Satterthwaite df and p-values
d <- iris
d$width <- d$Sepal.Width
d$Species <- as.factor(d$Species)
# Creating slightly smaller versions of the DV
d$length <- d$Sepal.Length
d$length_10 <- d$Sepal.Length/10
d$length_1e2 <- d$Sepal.Length/1e2
d$length_1e3 <- d$Sepal.Length/1e3
# fitting the models
m1 <- lmer(length ~ width + (1|Species),data = d)
m2 <- lmer(length_10 ~ width + (1|Species),data = d)
m3 <- lmer(length_1e2 ~ width + (1|Species),data = d)
m4 <- lmer(length_1e3 ~ width + (1|Species),data = d)
# The coefficients are all the same
> summary(m1)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.4061671 0.6683080 3.405002 5.096703 1.065543e-02
width 0.7971543 0.1062064 146.664820 7.505711 5.453404e-12
> summary(m2)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.34061671 0.06683080 3.405002 5.096703 1.065543e-02
width 0.07971543 0.01062064 146.664820 7.505711 5.453404e-12
> summary(m3)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.034061671 0.006683079 3.405003 5.096703 1.065542e-02
width 0.007971543 0.001062064 146.664820 7.505711 5.453405e-12
> summary(m4)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.0034061671 0.0006683079 3.405003 5.096703 1.065542e-02
width 0.0007971543 0.0001062064 146.664820 7.505711 5.453405e-12
However, your numbers are much smaller than this, so I made much smaller versions of the DV to try to re-create your example. As you can see, the degrees of freedom approach zero, which pushes the p-values toward one.
# Much smaller numbers
d$length_1e6 <- d$Sepal.Length/1e6
d$length_1e7 <- d$Sepal.Length/1e7
d$length_1e8 <- d$Sepal.Length/1e8
# fitting the models
m5 <- lmer(length_1e6 ~ width + (1|Species),data = d)
m6 <- lmer(length_1e7 ~ width + (1|Species),data = d)
m7 <- lmer(length_1e8 ~ width + (1|Species),data = d)
# Here we recreate the problem
> summary(m5)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.406167e-06 6.683079e-07 0.5618686 5.096703 0.2522273
width 7.971543e-07 1.062064e-07 0.6730683 7.505711 0.1599534
> summary(m6)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.406167e-07 6.683080e-08 0.01224581 5.096703 0.9461743
width 7.971543e-08 1.062064e-08 0.01229056 7.505711 0.9415154
> summary(m7)$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.406167e-08 6.683080e-09 0.0001784636 5.096703 0.9988162
width 7.971543e-09 1.062064e-09 0.0001784738 7.505711 0.9987471
A possible solution is to use another approximation method, Kenward-Roger. Taking the model with the smallest transformation of the DV, we can do that with the following code:
summary(m7, ddf="Kenward-Roger")$coefficients
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 3.406167e-08 6.687077e-09 3.408815 5.093656 1.064475e-02
width 7.971543e-09 1.064752e-09 146.666335 7.486759 6.053660e-12
As you can see, with this method the numbers from the smallest version of our transformation now match the stable numbers from the larger transformations. Exactly why small numbers are a problem for the Satterthwaite method is beyond my understanding of the methods employed by lmerTest, but I know at least one of the authors is around and might be able to provide additional insight. I suspect it is related to numerical underflow, since your numbers are very small, but I can't be sure.
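If changing the df method isn't an option, a simpler workaround (my suggestion, not from the question) is to rescale the DV into an ordinary range before fitting and divide the estimates back afterwards. The t-value is scale-invariant, so only the df approximation is affected. A minimal sketch:

```r
library(lmerTest)

d <- iris
d$width <- d$Sepal.Width
d$length_tiny <- d$Sepal.Length / 1e8   # the problematic scale
d$length_back <- d$length_tiny * 1e8    # rescaled to an ordinary range

m <- lmer(length_back ~ width + (1 | Species), data = d)
summary(m)$coefficients
# Divide Estimate and Std. Error by 1e8 to report them on the original
# tiny scale; the t-value, df, and p-values are the stable ones from m1.
```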
I hope this helps!

Related

Mixed regression model controlling for Genetic Relatedness

I am trying to perform a regression analysis while controlling for genetic relatedness among individuals (a genetic relatedness matrix). See the code below for an example including identical twins (relatedness 1) and siblings (relatedness 0.5); N = 10.
I know I should use a mixed model. However, I am not able to include the genetic relatedness matrix in the model. I have seen that two packages are often used for this ("kinship2" and "pedigreemm").
Here is the code, but I am unable to fit the model.
require("pedigreemm")
require("lme4")
library(kinship2)
library(car)
FAMID <- c(1,1,2,2,3,3,4,4,5,5)
UNIQUEID <- 1:10
Pheno <- c(2,4,5,5,8,10,15,20,0,0)
PRS <- c(0.1,0.5, 1,1, 2,2 ,3,3, -0.5, -0.4)
data <- as.data.frame(cbind(FAMID,UNIQUEID,Pheno,PRS))
RELMAT <- matrix(c(1,1,0.04,0.03,0.03,0.03,0.03,0.03,0.03,0.03,
1,1,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,
0.04,0.03,1,1,0.03,0.05,0.03,0.03,0.03,0.03,
0.03,0.03,1,1,0.03,0.03,0.03,0.03,0.03,0.03,
0.03,0.03,0.03,0.03,1,0.5,0.03,0.03,0.03,0.03,
0.03,0.03,0.05,0.03,0.5,1,0.03,0.03,0.03,0.03,
0.03,0.03,0.03,0.03,0.03,0.03,1,0.5,0.03,0.03,
0.03,0.03,0.03,0.03,0.03,0.03,0.5,1,0.03,0.03,
0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,1,0.5,
0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.5,1), nrow = 10, ncol = 10, byrow = T)
This is the model that I want to fit:
m1 = lmer(Pheno ~ PRS + (1 | RELMAT), data = data)
Thank you so much in advance.
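lmer() itself cannot take a covariance matrix inside a random-effects term, which is why the model above fails. One possible approach (a sketch, assuming the lme4qtl package from GitHub, which extends lmer() with a relmat argument for exactly this situation; pedigreemm is an alternative when a pedigree rather than a matrix is available):

```r
## Assumes lme4qtl is installed, e.g.
## remotes::install_github("variani/lme4qtl")
library(lme4qtl)

## the relatedness matrix needs dimnames matching the grouping factor
rownames(RELMAT) <- colnames(RELMAT) <- as.character(data$UNIQUEID)
data$ID <- as.factor(data$UNIQUEID)

## random intercept whose covariance is proportional to RELMAT
m1 <- relmatLmer(Pheno ~ PRS + (1 | ID), data = data,
                 relmat = list(ID = RELMAT))
summary(m1)
```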

non-conformable arguments error from lmer when trying to extract information from the model matrix

I have some longitudinal data from which I'd like to get the predicted means at specified times. The model includes 2 terms, their interaction and a spline term for the time variable. When I try to obtain the predicted means, I get "Error in mm %*% fixef(m4) : non-conformable arguments"
I've used the sleepstudy data set from the lme4 package to illustrate my problem. First, I import the data and create a variable "age" for my interaction.
sleep <- as.data.frame(sleepstudy) #get the sleep data
# create fake variable for age with 3 levels
set.seed(1234567)
sleep$age <- as.factor(sample(1:3,length(sleep),rep=TRUE))
Then I run my lmer model
library(lme4)
library(splines)
m4 <- lmer(Reaction ~ Days + ns(Days, df=4) + age + Days:age + (Days | Subject), sleep)
Finally, I create the data and matrix needed to obtain predicted means
#new data frame for predicted means
d <- c(0:9) # make a vector of days = 0 to 9 to obtain predictions for each day
newdat <- as.data.frame(cbind(Days=d, age=rep(c(1:3),length(d))))
newdat$Days <- as.numeric(as.character(newdat$Days))
newdat$age <- as.factor(newdat$age)
# create a matrix
mm<-model.matrix(~Days + ns(Days, df=4) + age + Days:age, newdat)
newdat$pred<-mm%*%fixef(m4)
It's at this point that I get the error:
Error in mm %*% fixef(m4) : non-conformable arguments
I can use predict to get the means
newdat$pred <- predict(m4, newdata=newdat, re.form=NA)
which works fine, but I want to be able to calculate a confidence interval, so I need a conformable matrix.
I read somewhere that the problem may be that lmer drops aliased (collinear) columns (I can't find that post now). The comment was made with regard to not being able to use effect() for a similar task. I couldn't quite understand how to overcome the problem; moreover, that post was a little old, and I hoped the aliasing problem might no longer be relevant.
If anyone has a suggestion for what I may be doing wrong, I'd appreciate the feedback. Thanks.
There are a couple of things here:
- you need to drop columns to make your model matrix commensurate with the fixed-effect vector that was actually fitted (i.e., commensurate with the model matrix that was actually used for fitting, after dropping collinear columns);
- for additional confusion, you happened to sample only ages 2 and 3 (out of a possible {1,2,3}).
I've cleaned up the code a little bit ...
library("lme4")
library("splines")
sleep <- sleepstudy #get the sleep data
set.seed(1234567)
## next line happens to sample only 2 and 3 ...
sleep$age <- as.factor(sample(1:3,length(sleep),rep=TRUE))
length(levels(sleep$age)) ## 2
Fit model:
m4 <- lmer(Reaction ~ Days + ns(Days, df=4) +
age + Days:age + (Days | Subject), sleep)
## message; fixed-effect model matrix is
## rank deficient so dropping 1 column / coefficient
Check fixed effects:
f1 <- fixef(m4)
length(f1) ## 7
f2 <- fixef(m4,add.dropped=TRUE)
length(f2) ## 8
We could use this extended version of the fixed effects (which has an NA value in it), but this would just mess us up by propagating NA values through the computation ...
Check model matrix:
X <- getME(m4,"X")
ncol(X) ## 7
(which.dropped <- attr(getME(m4,"X"),"col.dropped"))
## ns(Days, df = 4)4
## 6
New data frame for predicted means
d <- 0:9
## best to use data.frame() directly, avoid cbind()
## generate age based on *actual* levels in data
newdat <- data.frame(Days=d,
age=factor(rep(levels(sleep$age),length(d))))
Create a matrix:
mm <- model.matrix(formula(m4,fixed.only=TRUE)[-2], newdat)
mm <- mm[,-which.dropped] ## drop redundant columns
## newdat$pred <- mm%*%fixef(m4) ## works now
Added by sianagh: Code to obtain confidence intervals and plot the data:
predFun <- function(x) predict(x,newdata=newdat,re.form=NA)
newdat$pred <- predFun(m4)
bb <- bootMer(m4,
FUN=predFun,
nsim=200)
## nb. this produces an error message on its first run,
## but not on subsequent runs (using the development version of lme4)
bb_ci <- as.data.frame(t(apply(bb$t,2,quantile,c(0.025,0.975))))
names(bb_ci) <- c("lwr","upr")
newdat <- cbind(newdat,bb_ci)
Plot:
plot(Reaction~Days,sleep)
with(newdat,
matlines(Days,cbind(pred,lwr,upr),
col=c("red","green","green"),
lty=2,
lwd=c(3,2,2)))

ggplot2 is not printing all the information I need in R

I am trying to replicate the following script: San Francisco Crime Classification
here is my code:
library(dplyr)
library(ggmap)
library(ggplot2)
library(readr)
library(rjson)
library(RCurl)
library(RJSONIO)
library(jsonlite)
train=jsonlite::fromJSON("/home/felipe/Templates/Archivo de prueba/databritanica.json")
counts <- summarise(group_by(train, Crime_type), Counts=length(Crime_type))
#counts <- counts[order(-counts$Crime_type),]
# This removes the "Other Offenses" category
top12 <- train[train$Crime_type %in% counts$Crime_type[c(1,3:13)],]
map<-get_map(location=c(lon = -2.747770, lat = 53.389499) ,zoom=12,source="osm")
p <- ggmap(map) +
geom_point(data=top12, aes(x=Longitude, y=Latitude, color=factor(Crime_type)), alpha=0.05) +
guides(colour = guide_legend(override.aes = list(alpha=1.0, size=6.0),
title="Type of Crime")) +
scale_colour_brewer(type="qual",palette="Paired") +
ggtitle("Top Crimes in Britain") +
theme_light(base_size=20) +
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank())
ggsave("united kingdom_top_crimes_map.png", p, width=14, height=10, units="in")
I am reading the data from a JSON file and trying to plot points over the map according to the data. Each point is a type of crime that has been committed; the location of each point depends on two parameters: longitude and latitude.
The problem is that the points are not being printed: the script generates a new map without the points it is supposed to show.
This is the original map: [image omitted]
And this is the result: [image omitted]
Any ideas?
Here is an example of the data contained in the JSON file:
[
{"Month":"2014-05","Longitude":-2.747770,"Latitude":53.389499,"Location":"On or near Cronton Road","LSOA_name":"Halton 001B","Crime_type":"Other theft"},
{"Month":"2014-05","Longitude":-2.799099,"Latitude":53.354676,"Location":"On or near Old Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"},
{"Month":"2014-05","Longitude":-2.804451,"Latitude":53.352456,"Location":"On or near Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"}
]
Short answer:
Your alpha = 0.05 is making the points practically invisible when plotted on the colorful map background, as mentioned by @aosmith.
Longer answer:
I suggest the following changes to your geom_point:
- Increase the alpha to something more reasonable.
- Increase the size of the points.
- Optionally, change the shape to one with a border and a fill for better visibility. This requires moving the fill parameter into aes(), as well as changing scale_colour_brewer to scale_fill_brewer.
Example:
# Load required packages
library(dplyr)
library(ggplot2)
library(ggmap)
library(jsonlite)
# Example data provided in question, with one manually entered entry with
# Crime_type = "Other Offenses"
'[
{"Month":"2014-05","Longitude":-2.747770,"Latitude":53.389499,"Location":"On or near Cronton Road","LSOA_name":"Halton 001B","Crime_type":"Other theft"},
{"Month":"2014-05","Longitude":-2.799099,"Latitude":53.354676,"Location":"On or near Old Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"},
{"Month":"2014-05","Longitude":-2.804451,"Latitude":53.352456,"Location":"On or near Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"},
{"Month":"2014-05","Longitude":-2.81,"Latitude":53.36,"Location":"On or near Higher Road","LSOA_name":"Halton 008B","Crime_type":"Other Offenses"}
]' -> example_json
train <- fromJSON(example_json)
# Process the data, the dplyr way
counts <- train %>%
group_by(Crime_type) %>%
summarise(Counts = length(Crime_type))
# This removes the "Other Offenses" category
top12 <- train %>%
filter(Crime_type != "Other Offenses")
# Get the map
map <- get_map(location=c(lon = -2.747770, lat = 53.389499), zoom=12, source="osm")
# Plotting code
p <- ggmap(map) +
# Changes made to geom_point.
# I increased the alpha and size, and I used a shape that has
# a black border and a fill determined by Crime_type.
geom_point(data=top12, aes(x=Longitude, y=Latitude, fill=factor(Crime_type)),
shape = 21, alpha = 0.75, size = 3.5, color = "black") +
guides(fill = guide_legend(override.aes = list(alpha=1.0, size=6.0),
title="Type of Crime")) +
# Changed scale_color_brewer to scale_fill_brewer
scale_fill_brewer(type="qual", palette="Paired") +
ggtitle("Top Crimes in Britain") +
theme_light(base_size=20) +
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank())

Cannot resolve "Undefined control sequence" in Sweave (using \Sexpr{})

I am having trouble finding a solution to a basic error.
I was trying to use a .Rnw document, as created by default in RStudio, to report some results. I have to admit that I am new to Sweave. I used \Sexpr{} to represent some numbers defined in a chunk; it did not work and returned the error "Undefined control sequence". So I tried the example given in the official introduction:
<<echo=FALSE>>=
x <- 2
y <- 3
(z <- x + y)
@
However, I get the same error ("Undefined control sequence") when using $z=\Sexpr{z}$.
I assume something is fundamentally wrong with my file, but I can't figure out what. Here is my whole document:
\documentclass{report}
\usepackage[noae]{Sweave}
\usepackage{graphicx, verbatim}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{-1.5cm}
\begin{document}
\SweaveOpts{concordance=TRUE}
\begin{center}
\textbf{Data}
\end{center}
\section*{Questionnaires}
\subsection*{Stress}
<<echo=FALSE>>=
x <- 2
y <- 3
(z <- x + y)
@
Defining $z$ as above we get $z=\Sexpr{z}$.
\end{document}
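One common cause of this error (an assumption, since the question doesn't show how the file is compiled): \Sexpr{} is only defined after the .Rnw file has been run through Sweave, which executes the chunks and replaces \Sexpr{} with the computed values. Compiling the .Rnw directly with pdflatex skips that step and produces exactly this error. From the command line (report.Rnw is a hypothetical filename):

```shell
# Run Sweave first; it produces report.tex with \Sexpr{} replaced
R CMD Sweave report.Rnw
# Then compile the generated .tex, not the .Rnw
pdflatex report.tex
```

In RStudio, the equivalent check is that the document is being woven (Compile PDF) rather than passed straight to LaTeX.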

R: calling rq() within a function and defining the linear predictor

I am trying to call rq() from the quantreg package within a function. Below is a simplified description of my problem.
If I follow the recommendations found at
http://developer.r-project.org/model-fitting-functions.txt, I have a design matrix after the line
x <- model.matrix(mt, mf, contrasts)
with the first column full of 1's to create an intercept.
Now, when I call rq(), I am obliged to use something like
fit <- rq (y ~ x [,2], tau = 0.5, ...)
My problem happens if there is more than 1 explanatory variable. I don't know how to find an automatic way to write:
x [,2] + x [,3] + x [,4] + ...
Here is the complete simplified code:
ao_qr <- function (formula, data, method = "br",...) {
cl <- match.call ()
## keep only the arguments which should go into the model
## frame
mf <- match.call (expand.dots = FALSE)
m <- match (c ("formula", "data"), names (mf), 0)
mf <- mf[c (1, m)]
mf$drop.unused.levels <- TRUE
mf[[1]] <- as.name ("model.frame")
mf <- eval.parent (mf)
if (method == "model.frame") return (mf)
## allow model.frame to update the terms object before
## saving it
mt <- attr (mf, "terms")
y <- model.response (mf, "numeric")
x <- model.matrix (mt, mf, contrasts)
## proceed with the quantile regression
fit <- rq (y ~ x[,2], tau = 0.5, ...)
print (summary (fit, se = "boot", R = 100))
}
I call the function with:
ao_qr(pain ~ treatment + extra, data = data.subset)
And here is how to get the data:
require (lqmm)
data(labor)
data <- labor
data.subset <- subset (data, time == 90)
data.subset$extra <- rnorm (65)
In this case, with this code, my linear predictor only includes "treatment". If I want "extra" as well, I have to manually add x[,3] to the linear predictor of rq() in the code. This is not automatic and will not work on other datasets with an unknown number of variables.
Does anyone know how to tackle this?
Any help would be greatly appreciated!
I found a simple solution:
x[,2:ncol(x)]
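Two more general alternatives (my sketch, not part of the original answer): put the whole matrix on the formula's right-hand side, or bypass the formula interface with quantreg's low-level fitter rq.fit(), which accepts the design matrix directly. A self-contained illustration using built-in data:

```r
library(quantreg)

# Build y and x the same way the ao_qr() function does
mf <- model.frame(mpg ~ wt + hp, data = mtcars)
y  <- model.response(mf, "numeric")
x  <- model.matrix(attr(mf, "terms"), mf)

# Option 1: a matrix on the formula RHS expands to one term per
# column; drop column 1 because rq() adds its own intercept
fit1 <- rq(y ~ x[, -1], tau = 0.5)

# Option 2: rq.fit() takes the full design matrix as-is,
# intercept column included
fit2 <- rq.fit(x, y, tau = 0.5)

coef(fit1)
fit2$coefficients
```

Both fits use every column of the design matrix, however many explanatory variables the formula contains.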