Related
I'm having the hardest time generating confidence intervals for my glmer poisson model. After following several very helpful tutorials (such as https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/) as well as stackoverflow posts, I keep getting very strange results, i.e. the upper and lower limits of the CI are identical.
Here is a reproducible example containing a response variable called "production," a fixed effect called "Treatment_Num" and a random effect called "Genotype":
df1 <- data.frame(production=c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0), Treatment_Num=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), Genotype=c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))
#run the glmer model
df1_glmer <- glmer(production ~ Treatment_Num +(1|Genotype),
data = df1, family = poisson(link = "log"))
#make an empty data set to predict from, that contains the explanatory variables but no response
require(magrittr)
df_empty <- df1 %>%
tidyr::expand(Treatment_Num, Genotype)
#create new column containing predictions
df_empty$PopPred <- predict(df1_glmer, newdata = df_empty, type="response",re.form = ~0)
#function for bootMer
myFunc_df1_glmer <- function(mm) {
predict(df1_glmer, newdata = df_empty, type="response",re.form=~0)
}
#run bootMer
require(lme4)
merBoot_df1_glmer <- bootMer(df1_glmer, myFunc_df1_glmer, nsim = 10)
#get confidence intervals out of it
predCL <- t(apply(merBoot_df1_glmer$t, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)))
#enter lower and upper limits of confidence interval into df_empty
df_empty$lci <- predCL[, 1]
df_empty$uci <- predCL[, 2]
#when viewing df_empty the problem becomes clear: the lci and uci are identical!
df_empty
Any insights you can give me will be much appreciated!
Ignore my comment!
The issue is with the function you created to pass to bootMer(). You wrote:
myFunc_df1_glmer <- function(mm) {
predict(df1_glmer, newdata = df_empty, type="response",re.form=~0)
}
The argument mm should be a fitted model object derived from the bootstrapped data.
However, you don't pass this object to predict(), but rather the original model
object. If you change the function to:
myFunc_df1_glmer <- function(mm) {
predict(mm, newdata = df_empty, type="response",re.form=~0)
#^^ pass in the object created by bootMer
}
then it works:
> df_empty
# A tibble: 8 x 5
Treatment_Num Genotype PopPred lci uci
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 12.9 9.63 15.7
2 1 2 12.9 9.63 15.7
3 2 1 5.09 3.87 5.89
4 2 2 5.09 3.87 5.89
5 3 1 2.01 1.20 2.46
6 3 2 2.01 1.20 2.46
7 4 1 0.796 0.361 1.14
8 4 2 0.796 0.361 1.14
As an aside -- how many genotypes in your actual data? If less than 5-7 you might
do better using a straight up glm() with genotype as a factor using sum-to-zero
contrasts.
If you request web data through R, you often work with json or xml where the fields are not named if there is no value for them. Sometimes, there isn't even any data and it comes out as an empty list for a certain index. So, I see this as two different problems. I'm proposing the solution I use to solve this as well but I know there are some better ones out there. I have for starters, a very messy and fake list that I created that is missing field names (on purpose from the xml, json spec) AND missing whole indexes (also on purpose).
(messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5)))
Now, here is how I break it down to what I would say is "solved".
library(plyr)
messy_list_no_empties <- lapply(messy_list, function(x) if(length(x) == 0) {list(NA, NA)} else x)
ldply(messy_list_no_empties, data.frame)[,1:2]
The end result is what I am looking for but I would like to find a more elegant way to deal with this problem.
With purrr::map_df,
library(purrr)
messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5))
messy_list %>% map_df(~list(x = .x$x %||% NA,
y = .x$y %||% NA))
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
map_df iterates over the list like lapply and coerces the results to a data.frame. The function (in purrr's formula form) assembles a list with an x and a y element, looking for existing values if they're there. If they're not, the subsetting will return NULL, which %||% will replace with the value after it, NA.
In mostly-equivalent base R,
as.data.frame(do.call(rbind,
lapply(messy_list, function(.x){
list(x = ifelse(is.null(.x$x), NA, .x$x),
y = ifelse(is.null(.x$y), NA, .x$y))
})))
#> x y
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
Note the base approach won't handle different types well. To do so, coerce everything to character (rbind probably will anyway, so just add stringsAsFactors = FALSE to as.data.frame) and lapply type.convert.
Your method is already pretty compact, but if you're looking for other methods, one way might be to use rbindlist from data.table:
library(data.table)
new_list <- lapply(messy_list, function(x) if(identical(x,list())){list(x = NA)} else {x})
rbindlist(new_list, fill = T, use.names = T)
# x y
#1: 2 3
#2: NA NA
#3: NA 4
#4: 5 NA
Note we need the lapply so it doesn't drop the rows that are empty
This comes up a lot when dealing with API's.
Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.
I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.
I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,
But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.
#####################
# Some Test Data
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
############################################################
# Run through the list and rename all list elements,
# We unlist once at time, adding "__" at each unlist step
# until the object is no longer a list
renameVars <- function(lst, sep = '__') {
if(is.list(lst)) {
names(lst) <- paste0(names(lst),sep)
renameVars(unlist(lst, recursive = FALSE),sep = sep)
} else {
lst
}
}
res <- renameVars(aNestedTree)
We can check the output and see that we have a strangely named object,
But there's a method to this madness.
> res
a________ b________ c__.a____1__ c__.a____2__ c__.a____3__
1 2 1 2 3
c__.a____4__ c__.a____5__ c__.b______ c__.c__.a____ c__.c__.d____
4 5 2 1 3
d__.y______ d__.z______
3 2
Now I put this in a data.table, so I can shape it.
library(data.table)
dt <- data.table(values = res, name = names(res))
# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests
> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
values name V1 V2 V3
1: 1 a________ a NA NA
2: 2 b________ b NA NA
3: 1 c__.a____1__ c a 1
4: 2 c__.a____2__ c a 2
5: 3 c__.a____3__ c a 3
6: 4 c__.a____4__ c a 4
7: 5 c__.a____5__ c a 5
8: 2 c__.b______ c b NA
9: 1 c__.c__.a____ c c a
10: 3 c__.c__.d____ c c d
11: 3 d__.y______ d y NA
12: 2 d__.z______ d z NA
I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)
I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.
I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.
Notes on approach:
That all values would be numeric (I cast them to character afterwards)
Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
level is kept track of by level of recursion
First define the helpers:
recurse_gather <- function(.x,.level) {
.x <- tidyjson::bind_rows(
gobj(.x,.level)
, garr(.x,.level)
, gpersist(.x,.level)
)
if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
.x <- recurse_gather(.x,.level+1)
}
return(.x)
}
gobj <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='object') %>%
gather_object(paste0('v',.level)) %>%
select(-type)
}
gpersist <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(! type %in% c('object','array')) %>%
mutate_(.dots=setNames(
paste0('as.character(NA)')
,paste0('v',.level)
)) %>%
select(-type)
}
garr <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='array') %>%
gather_array('arridx') %>%
append_values_number(paste0('v',.level)) %>%
mutate_(.dots=setNames(
paste0('as.character(v',.level,')')
,paste0('v',.level)
)) %>%
select(-arridx,-type)
}
Then using the helpers is pretty straight-forward.
library(dplyr)
library(tidyjson)
j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#> document.id v1 v2 v3 v4
#> * <int> <chr> <chr> <chr> <chr>
#> 1 1 a 1 <NA> <NA>
#> 2 1 b 2 <NA> <NA>
#> 3 1 c a 1 <NA>
#> 4 1 c a 2 <NA>
#> 5 1 c a 3 <NA>
#> 6 1 c a 4 <NA>
#> 7 1 c a 5 <NA>
#> 8 1 c b 2 <NA>
#> 9 1 c c a 1
#> 10 1 c c d 3
#> 11 1 d y 3 <NA>
#> 12 1 d z 2 <NA>
Hopeful that future development on the tidyjson package will make this an easier problem to tackle!
I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson
As dracodoc pointed out, data.tree might help. E.g. like this:
library(data.tree)
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
tree <- FromListSimple(aNestedTree)
print(tree)
This will give:
levelName z
1 Root NA
2 ¦--c NA
3 ¦ ¦--a NA
4 ¦ °--c NA
5 ¦ °--e NA
6 °--d 2
And:
tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"
Side note: typically, you could do something like this:
do.call("print", c(tree, tree$fieldsAll))
However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.
I am trying to read data from tables similar to the following http://www.fec.gov/pubrec/fe1996/hraz.htm using R but have been unable to make progress. I realize that to do so I need to use XML and RCurl but in spite of the numerous other examples on the web concerning similar problems I have not been able to resolve this one.
The first issue is that the table is only a table when viewing it but is not coded as such. Treating it as an xml document I can access the "data" in the table but because there are several tables I would like to get I don't believe this to be the most elegant solution.
Treating it as an html document might work better but I am relatively unfamiliar with xpathApply and do not know how to get at the actual "data" in the table since it is not bracketed by anything (i.e. a i-/i or b-/b).
I have had some success using xml files in the past but this is my first attempt at doing something similar with html files. These files in particular seem to have less structure then other examples I have seen.
Any help is much appreciated.
Assuming you can read the html output into a text file (the equivalent of copying+pasting form your web browser),
this should get you a good chunk of the way there:
# x is the output from the website
library(stringr)
library(data.table)
# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)
# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]
# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)
# grab the straggling district number, use it as a name and remove it
# end of first line
eofl <- str_locate(districts, "\n")[,2]
# trim white space and assign as name
names(districts) <- str_trim(substr(districts, 1, eofl))
# remove first line
districts <- substr(districts, eofl+1, nchar(districts))
# replace the ending '-------' and trime white space
districts <- str_trim(str_replace_all(districts, "---*", ""))
# Adjust delimeter (this is the tricky part)
## more than two spaces are a spearator
districts <- str_replace_all(districts, " +", "\t")
## lines that are total tallies are missing two columns.
## thus, need to add two extra delims. After the first and third columns
# this function will
padDelims <- function(section, splton) {
# split into lines
section <- strsplit(section, splton)[[1]]
# identify lines starting with totals
LinesToFix <- str_detect(section, "^Total")
# pad appropriate columns
section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])
# any rows missing delims, pad at end
counts <- str_count(section, "\t")
toadd <- max(counts) - counts
section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd)
# paste it back together and return
paste(section, collapse=splton)
}
districts <- lapply(districts, padDelims, splton="\n")
# reading the table and simultaneously addding the district column
districtTables <-
lapply(names(districts), function(d)
data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
# ... or without adding district number:
## lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))
# flatten it
votes <- do.call(rbind, districtTables)
setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
Sample table:
votes
Candidate Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
1: Salmon, Matt R 33672 100.00 135634.00 60.18 1
2: Total Party Votes: 33672 NA NA NA 1
3: NA NA NA NA 1
4: Cox, John W(D)/D 1942 100.00 89738.00 39.82 1
5: Total Party Votes: 1942 NA NA NA 1
6: NA NA NA NA 1
7: Total District Votes: 35614 NA 225372.00 NA 1
8: Pastor, Ed D 29969 100.00 81982.00 65.01 2
9: Total Party Votes: 29969 NA NA NA 2
10: NA NA NA NA 2
...
51: Hayworth, J.D. R 32554 100.00 121431.00 47.57 6
52: Total Party Votes: 32554 NA NA NA 6
53: NA NA NA NA 6
54: Owens, Steve D 35137 100.00 118957.00 46.60 6
55: Total Party Votes: 35137 NA NA NA 6
56: NA NA NA NA 6
57: Anderson, Robert LBT 148 100.00 14899.00 5.84 6
58: NA NA NA NA 6
59: Total District Votes: 67839 NA 255287.00 NA 6
60: NA NA NA NA 6
61: Total State Votes: 368185 NA 1356446.00 NA 6
Candidate Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simply my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
A B Total
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 70 70
6 80 80
7 90 90
I’ve tried various method, but this way holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this way did succeeded in combining two data frames of different lengths, the ‘rowSums’ function gives the wrong values in this example. Besides that, my original data isn't in a list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how can I …
Subset data from a data frame on the basis of ‘Type’,
Insert these individual subsets of different lengths into a new data frame,
Add an ‘Total’ column to this data frame which is the correct sum of the
individual subsets.
An added complication to this problem is that this needs to be done in an function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
tmpNumberOfTypes <- c(levels(x$Type))
for( i in tmpNumberOfTypes){
subSetofData <- subset(x, Type = i, select = Value)
if( i == 1) {
totalDf <- subSetOfData }
else{
totalDf <- cbind(totalDf, subSetofData)}
}
return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to the comment of Joris (see below) I got an end in the right direction, however, when trying to translate his solution to my data frame, I run into additional problems. His proposed answer works, and gives me the following (correct) sum of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…
Two thoughts :
1) you could use na.rm=T in rowSums
2) How do you know which one has to go with which? You might add some indexing.
eg :
DF <- data.frame(
type=c(rep("A",4),rep("B",6)),
value = 1:10,
stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe
tapply(DF$value,DF$id,sum)
And, more importantly, get your dataframe in the correct form :
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
, Value = c(10,15,20,25,30,40,50,60,70,80,90)
, stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the <NA>
# for(i in 1:ncol(tvWide)){
#
# if (is.na(tvWide[j,i])){
# tvWide[j,i] = 0
# }
#
# }
# }
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))