non-conformable arguments error from lmer when trying to extract information from the model matrix - lme4

I have some longitudinal data from which I'd like to get the predicted means at specified times. The model includes two terms, their interaction, and a spline term for the time variable. When I try to obtain the predicted means, I get "Error in mm %*% fixef(m4) : non-conformable arguments".
I've used the sleepstudy data set from lme4 to illustrate my problem. First, I import the data and create a variable "age" for my interaction:
sleep <- as.data.frame(sleepstudy) #get the sleep data
# create fake variable for age with 3 levels
set.seed(1234567)
sleep$age <- as.factor(sample(1:3,length(sleep),rep=TRUE))
Then I run my lmer model
library(lme4)
library(splines)
m4 <- lmer(Reaction ~ Days + ns(Days, df=4) + age + Days:age + (Days | Subject), sleep)
Finally, I create the data and matrix needed to obtain predicted means
#new data frame for predicted means
d <- c(0:9) # make a vector of days = 0 to 9 to obtain predictions for each day
newdat <- as.data.frame(cbind(Days=d, age=rep(c(1:3),length(d))))
newdat$Days <- as.numeric(as.character(newdat$Days))
newdat$age <- as.factor(newdat$age)
# create a matrix
mm<-model.matrix(~Days + ns(Days, df=4) + age + Days:age, newdat)
newdat$pred<-mm%*%fixef(m4)
It's at this point that I get the error:
Error in mm %*% fixef(m4) : non-conformable arguments
I can use predict to get the means
newdat$pred <- predict(m4, newdata=newdat, re.form=NA)
which works fine, but I want to be able to calculate a confidence interval, so I need a conformable matrix.
I read somewhere that the problem may be that lmer creates aliases (I can't find that post). That comment was made with regard to not being able to use effect() for a similar task, and I couldn't quite understand how to overcome the problem. Moreover, the post was a little old, so I hoped the alias problem might no longer be relevant.
If anyone has a suggestion for what I may be doing wrong, I'd appreciate the feedback. Thanks.

There are a couple of things here.
First, you need to drop columns to make your model matrix commensurate with the fixed-effect vector that was actually fitted (i.e., commensurate with the model matrix that was actually used for fitting, after dropping collinear columns).
Second, for additional confusion, you happened to sample only ages 2 and 3 (out of a possible {1,2,3}).
I've cleaned up the code a little bit ...
library("lme4")
library("splines")
sleep <- sleepstudy #get the sleep data
set.seed(1234567)
## next line happens to sample only 2 and 3 ...
sleep$age <- as.factor(sample(1:3,length(sleep),rep=TRUE))
length(levels(sleep$age)) ## 2
Fit model:
m4 <- lmer(Reaction ~ Days + ns(Days, df=4) +
age + Days:age + (Days | Subject), sleep)
## message; fixed-effect model matrix is
## rank deficient so dropping 1 column / coefficient
Check fixed effects:
f1 <- fixef(m4)
length(f1) ## 7
f2 <- fixef(m4,add.dropped=TRUE)
length(f2) ## 8
We could use this extended version of the fixed effects (which has an NA value in it), but this would just mess us up by propagating NA values through the computation ...
Check model matrix:
X <- getME(m4,"X")
ncol(X) ## 7
(which.dropped <- attr(getME(m4,"X"),"col.dropped"))
## ns(Days, df = 4)4
## 6
New data frame for predicted means
d <- 0:9
## best to use data.frame() directly, avoid cbind()
## generate age based on *actual* levels in data
newdat <- data.frame(Days=d,
age=factor(rep(levels(sleep$age),length(d))))
Create a matrix:
mm <- model.matrix(formula(m4,fixed.only=TRUE)[-2], newdat)
mm <- mm[,-which.dropped] ## drop redundant columns
## newdat$pred <- mm%*%fixef(m4) ## works now
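Since the whole point of building a conformable matrix was to compute confidence intervals, here is a minimal sketch (not part of the original answer) of the usual fixed-effects-only Wald intervals based on mm; note that it ignores the uncertainty in the random effects:
fe_pred <- drop(mm %*% fixef(m4))
fe_var  <- diag(mm %*% as.matrix(vcov(m4)) %*% t(mm)) # prediction variances, fixed effects only
fe_lwr  <- fe_pred - 1.96 * sqrt(fe_var)
fe_upr  <- fe_pred + 1.96 * sqrt(fe_var)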
Added by sianagh: Code to obtain confidence intervals and plot the data:
predFun <- function(x) predict(x,newdata=newdat,re.form=NA)
newdat$pred <- predFun(m4)
bb <- bootMer(m4,
FUN=predFun,
nsim=200)
## nb. this produces an error message on its first run,
## but not on subsequent runs (using the development version of lme4)
bb_ci <- as.data.frame(t(apply(bb$t,2,quantile,c(0.025,0.975))))
names(bb_ci) <- c("lwr","upr")
newdat <- cbind(newdat,bb_ci)
Plot:
plot(Reaction~Days,sleep)
with(newdat,
     matlines(Days, cbind(pred, lwr, upr),
              col=c("red","green","green"),
              lty=2,
              lwd=c(3,2,2)))

The error is caused by the drift component; if you pass allowdrift=FALSE to auto.arima, it will be fixed.
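For reference, a minimal sketch of what that looks like (assuming the forecast package and some univariate series y, which are not taken from the original question):
library(forecast)
fit <- auto.arima(y, allowdrift = FALSE)  # suppress the drift term
forecast(fit, h = 10)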


Univariate cox regression analysis with multiple covariates

library(survival)

covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(time, status)~', x)))
univ_models <- lapply(univ_formulas, function(x){coxph(x, data = lung)})
# Extract data
univ_results <- lapply(univ_models,
                       function(x){
                         x <- summary(x)
                         p.value <- signif(x$wald["pvalue"], digits=2)
                         wald.test <- signif(x$wald["test"], digits=2)
                         beta <- signif(x$coef[1], digits=2)  # coefficient beta
                         HR <- signif(x$coef[2], digits=2)    # exp(beta)
                         HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                         HR.confint.upper <- signif(x$conf.int[,"upper .95"], 2)
                         HR <- paste0(HR, " (",
                                      HR.confint.lower, "-", HR.confint.upper, ")")
                         res <- c(beta, HR, wald.test, p.value)
                         names(res) <- c("beta", "HR (95% CI for HR)", "wald.test",
                                         "p.value")
                         return(res)
                         #return(exp(cbind(coef(x),confint(x))))
                       })
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)
Normally I use this code for univariate Cox regression analysis, but now I have more than 20,000 genes that I want to run as independent variables in univariate Cox regressions, and I am not sure how I can run this code without typing the individual covariates (gene names) out. All my column names for the genes begin with "ENSG..".
Is there an efficient way to do univariate Cox regression on that many genes? Thanks in advance.
There are several ways to make the list of variable names without typing them out. Probably one of the easiest is to use names() to get all of the variable names in the data, then remove time and status from that list (as well as any other variables you don't want to include). For example, for the veteran dataset:
covariates <- names(survival::veteran)
covariates # Look at which names were detected
#> [1] "trt" "celltype" "time" "status" "karno" "diagtime" "age"
#> [8] "prior"
covariates <- covariates[-which(covariates %in% c("time", "status"))]
covariates # Confirm time and status have been removed
#> [1] "trt" "celltype" "karno" "diagtime" "age" "prior"
Created on 2022-08-30 by the reprex package (v2.0.1)
You could also programmatically create a list of names. For example:
covariates <- sapply(1:10, FUN = function(x) paste0("ENSG.", x))
covariates
#> [1] "ENSG.1" "ENSG.2" "ENSG.3" "ENSG.4" "ENSG.5" "ENSG.6" "ENSG.7"
#> [8] "ENSG.8" "ENSG.9" "ENSG.10"
This approach might be better if the naming is easy to program. If the gene names are irregular it might be more difficult.
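Since all of your gene columns start with "ENSG", you can also pull them straight out of the column names. A sketch, assuming your data frame is called df and also contains the time and status columns:
library(survival)
## keep only the columns whose names start with "ENSG"
covariates <- grep("^ENSG", names(df), value = TRUE)
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste("Surv(time, status) ~", x)))
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = df))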
As far as efficiency goes, I don't think much can be done to improve the overall runtime. The bulk of the runtime is the actual coxph() calculations. There are other questions/answers on the site about optimizing R code; if you want to pursue optimization, I'd suggest looking through those and then making a new question with a reproducible example if you need more help.

For Loop reading API response with existing Data Frame

I have a dataframe:
df
NAME ARTISTNAME COL3
1 Everything_Now (continued) Arcade Fire Everything_Now%20(continued)%20Arcade%20Fire
2 Everything Now Arcade Fire Everything%20Now%20Arcade%20Fire
3 Signs of Life Arcade Fire Signs%20of%20Life%20Arcade%20Fire
4 Creature Comfort Arcade Fire Creature%20Comfort%20Arcade%20Fire
5 Peter Pan Arcade Fire Peter%20Pan%20Arcade%20Fire
6 Chemistry Arcade Fire Chemistry%20Arcade%20Fire
My goal is to loop this with Genius Lyric's API to get the lyric url for each value in COL3.
If I were to not loop this and just do it for each song individually, then my output for one would look like this:
genius_url <- "https://api.genius.com/search?q=Everything_Now%20(continued)%20Arcade%20Fire"
getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
answer <- data.frame(geniuslyrics$response$hits$result$url[1])
answer
X.https...genius.com.Arcade.fire.everything.now.continued.lyrics.
1 https://genius.com/Arcade-fire-everything-now-continued-lyrics
str(answer)
'data.frame': 1 obs. of 1 variable:
$ X.https...genius.com.Arcade.fire.everything.now.continued.lyrics.: Factor w/ 1 level "https://genius.com/Arcade-fire-everything-now-continued-lyrics": 1
This was my attempt at the for-loop so far but I am getting an error:
for(i in 1:length(df[,3])) {
  genius_url <- paste("https://api.genius.com/search?q=",
                      df3[i,3],
                      sep="")
  getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
  geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
  answer <- data.frame(geniuslyrics$response$hits$result$url[1])
  df[i,4] <- answer[1,]
}
The error message I am getting is:
Error in x[...] <- m : replacement has length zero
In addition: There were 26 warnings (use warnings() to see them)
Hope this makes sense. Any help would be great, thanks.
Does your dataframe already have column three, or do you have to create it from columns 1 and 2? I assumed you have to create the third column from the first and the second.
Try rewriting your single-request code as a function, something like this:
funfun <- function(...){
  x <- unlist(list(...))
  A <- paste(unlist(lapply(x, strsplit, " ")), collapse = "%20")
  genius_url <- paste0("https://api.genius.com/search?q=", A)
  getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
  geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
  answer <- data.frame(geniuslyrics$response$hits$result$url[1])
  answer
}
Now, from here you can loop or use apply functions:
apply(df[,1:2],1,funfun)
If you already have the third column, then your life is easier:
funfun_1 <- function(x){
  genius_url <- paste0("https://api.genius.com/search?q=", x)
  getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
  geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
  answer <- data.frame(geniuslyrics$response$hits$result$url[1])
  answer
}
sapply(df[,3],funfun_1)
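If you then want to store the urls back in the data frame, as in your original loop, a sketch (assuming httr and jsonlite are loaded, HeaderValue is defined, and every query returns at least one hit) could be:
library(httr)     # GET(), add_headers(), content()
library(jsonlite) # toJSON(), fromJSON()
## add the first matching url for each row as a new column
df$url <- sapply(df[, 3], function(x) as.character(funfun_1(x)[1, 1]))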

Find Jaccard distance of tweets and cluster in Kmeans

This is a follow-up question to a problem I've been working on for a while. I have two questions. One regards an algorithm that works on two tweets, which I revised to measure 10 tweets; I'm wondering what my revision is measuring. I get a result, but I want it to measure several tweets' Jaccard distances, not just return one value. Since it's returning one value, I think it's just adding everything up. The other question is about my attempt to create a for loop and assign clusters.
I'm trying to find the Jaccard distance between a dataset of tweets, then cluster those tweets with the Kmeans algorithm.
This is where I'm retrieving the data from:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
What I have so far is this
install.packages("rjson")
library("rjson")
#download JSON File and put into a dataframe
download.file("http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json",
              tf <- tempfile())
library(jsonlite)
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file(tf)), collapse=",")))
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
# *Works. Compares two tweets and measures Jaccard Distance
tweetText <- list(tweet1 = tweet.features$text[1], tweet2 = tweet.features$text[2])
jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
All of that measures the jaccard distance of two specified tweets. Which is great.
But now I'm trying to modify to compare the distances between several tweets. This time 10 random tweets which I retrieved from the Sample command in R.
# Generates two sets of 5 random tweets
tweetText <- list(sample(tweet.features$text, replace = FALSE, size = 5),
                  sample(tweet.features$text, replace = FALSE, size = 5))
jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
This gives me results, but it can't be correct.
I'm trying to build an algorithm that can measure all the tweets, compare their jaccard distance, then cluster based on the Jaccard distance with Kmeans.
So for another attempt, I thought to make a For Loop.
I decided to make 10 cluster centers with the 10 random tweets
c <- sample(tweet.features$text, replace = FALSE, size = 10)
Now I did a For Loop hoping to measure the tweets which I figured I could assign to an array and cluster
#Algorithm attempt
for(i in tweet.features$text){
  for (j in c){
    i <- length(intersect(i, j))
    u <- length(union(i, j))
    j = i/u
  }
  #assign(my.array)
}
I don't believe that's doing anything useful, but it's an attempt to create a loop to measure the Jaccard distance.
I'm sorry that this is a loaded question. Any help would be appreciated as I'm a bit lost.
In your first function, you are calling unlist on your lists of words, so tw1 and tw2 hold the global lists of words and you cannot use them for a tweet-by-tweet Jaccard. You could accomplish it by removing the unlist; then tw1 and tw2 are lists of lists of terms, and you can compare them using mapply. Something as follows.
jaccard_i <- function(tw1, tw2){
  tw1 <- strsplit(tw1, " |\\.")
  tw2 <- strsplit(tw2, " |\\.")
  i <- mapply(function(tw1, tw2) {
    length(intersect(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  u <- mapply(function(tw1, tw2) {
    length(union(tw1, tw2))
  }, tw1=tw1, tw2=tw2)
  list(i=i, u=u, j=i/u)
}
Silly example:
> tw1 = c("we yes you no", "we are the people")
> tw2= c("we are the people", "we yes you no")
> tweetText = list(tw1, tw2)
> jaccard_i(tweetText[[1]], tweetText[[2]])
$i
[1] 1 1
$u
[1] 7 7
$j
[1] 0.1428571 0.1428571
As for the second part of your question, the double loop, an easy way to start addressing it would be like this,
tw <- c("we yes you no", "we are the people")
lapply(tw, function(tweet1) {
  lapply(tw, function(tweet2) {
    jaccard_i(tweet1, tweet2)
  })
})
With results that look like this,
[[1]]
[[1]][[1]]
[[1]][[1]]$i
[1] 4
[[1]][[1]]$u
[1] 4
[[1]][[1]]$j
[1] 1
[[1]][[2]]
[[1]][[2]]$i
[1] 1
[[1]][[2]]$u
[1] 7
[[1]][[2]]$j
[1] 0.1428571
[[2]]
[[2]][[1]]
[[2]][[1]]$i
[1] 1
[[2]][[1]]$u
[1] 7
[[2]][[1]]$j
[1] 0.1428571
[[2]][[2]]
[[2]][[2]]$i
[1] 4
[[2]][[2]]$u
[1] 4
[[2]][[2]]$j
[1] 1
where you should, of course, skip the values on the diagonal; just a starting point, as I said.
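To collect all the pairwise values into something a clustering routine can use, one option (a sketch, not part of the original answer) is to fill a symmetric matrix of Jaccard distances (1 - similarity) and hand it to a distance-based method:
jaccard_dist_matrix <- function(tweets) {
  words <- strsplit(tweets, " |\\.")
  n <- length(words)
  d <- matrix(0, n, n)
  for (a in seq_len(n)) {
    for (b in seq_len(n)) {
      i <- length(intersect(words[[a]], words[[b]]))
      u <- length(union(words[[a]], words[[b]]))
      d[a, b] <- 1 - i / u   # distance = 1 - Jaccard similarity
    }
  }
  d
}
d <- jaccard_dist_matrix(tw)
## e.g. hierarchical clustering on the distance matrix:
## hc <- hclust(as.dist(d)); cutree(hc, k = 2)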
Hope it helps.

delete vertices while preserving node IDs

I am using the function "delete.vertices", and I found some strange behavior in my networks.
After reading the documentation of igraph, I found that:
"delete.vertices removes the specified vertices from the graph together with their adjacent edges. The ids of the vertices are not preserved."
is there any work-around to preserve the ids of the original network?
Yes, assign a vertex attribute to the graph; the name attribute is probably best. These are kept after deletion.
g <- graph.ring(10)
V(g)$name <- letters[1:10]
g2 <- delete.vertices(g, c("a", "b", "f"))
str(g2)
# IGRAPH UN-- 7 5 -- Ring graph
# + attr: name (g/c), mutual (g/l), circular (g/l), name (v/c)
# + edges (vertex names):
# [1] c--d d--e g--h h--i i--j
If you want to preserve the original numeric vertex ids, then assign them as names:
gg <- graph.ring(10)
V(gg)$name <- V(gg)
gg2 <- delete.vertices(gg, c(1,2,6))
str(gg2)
# IGRAPH UN-- 7 5 -- Ring graph
# + attr: name (g/c), mutual (g/l), circular (g/l), name (v/n)
# + edges (vertex names):
# [1] 3-- 4 4-- 5 7-- 8 8-- 9 9--10
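A small usage note (not part of the original answer): because the original numeric ids were stored as names, you can read them back directly after deletion:
V(gg2)$name
# [1]  3  4  5  7  8  9 10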

Cannot resolve "Undefined control sequence" in Sweave (using \Sexpr{})

I am having trouble finding a solution for a basic error I get.
I was trying to use a .Rnw document, as created by default in RStudio, to report some results. I have to admit that I am new to Sweave. I used \Sexpr{} in order to represent some numbers defined in a chunk. It did not work and returned the error "Undefined control sequence". So I tried the example given in the official introduction:
<<echo=FALSE>>=
x <- 2
y <- 3
(z <- x + y)
#
However, I get the same error when using $z=\Sexpr{z}$ ("Undefined control sequence").
I assume something is fundamentally wrong with my file, but I can't figure out what. Here is my whole document:
\documentclass{report}
\usepackage[noae]{Sweave}
\usepackage{graphicx, verbatim}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{-1.5cm}
\begin{document}
\SweaveOpts{concordance=TRUE}
\begin{center}
\textbf{Data}
\end {center}
\section*{Questionnaires}
\subsection*{Stress}
<<echo=FALSE>>=
x <- 2
y <- 3
(z <- x + y)
#
Defining $z$ as above we get $z=\Sexpr{z}$.
\end{document}