Parsing incomplete lists into data frames with two different problems - json

When you request web data through R, you often work with JSON or XML in which a field is simply omitted when it has no value. Sometimes there is no data at all for a record, and it comes back as an empty list at that index. I see these as two separate problems. Below is the solution I currently use, but I suspect there are better ones. To start, here is a deliberately messy, made-up list that is missing field names (as the XML and JSON specs allow) and also has a completely empty element.
(messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5)))
Now, here is how I break it down to what I would say is "solved".
library(plyr)
messy_list_no_empties <- lapply(messy_list, function(x) if(length(x) == 0) {list(NA, NA)} else x)
ldply(messy_list_no_empties, data.frame)[,1:2]
The end result is what I am looking for but I would like to find a more elegant way to deal with this problem.

With purrr::map_df,
library(purrr)
messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5))
messy_list %>% map_df(~list(x = .x$x %||% NA,
y = .x$y %||% NA))
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
map_df iterates over the list like lapply and coerces the results to a data.frame. The function (in purrr's formula form) assembles a list with an x and a y element, looking for existing values if they're there. If they're not, the subsetting will return NULL, which %||% will replace with the value after it, NA.
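To make the NULL-replacement step concrete, here is a minimal sketch of what the operator does. The definition below is a stand-in written for illustration (the name record is also made up); purrr ships its own %||%, and this is only assumed to behave the same way for this case.

```r
# Illustrative stand-in for the null-default operator:
# return the left side unless it is NULL, otherwise the right side.
`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs

record <- list(y = 4)   # hypothetical record with no x field

x_val <- record$x %||% NA   # record$x is NULL, so the default NA is used
y_val <- record$y %||% NA   # record$y exists, so its value is kept

print(x_val)
print(y_val)
```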
In mostly-equivalent base R,
as.data.frame(do.call(rbind,
lapply(messy_list, function(.x){
list(x = ifelse(is.null(.x$x), NA, .x$x),
y = ifelse(is.null(.x$y), NA, .x$y))
})))
#> x y
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
Note the base approach won't handle different types well. To do so, coerce everything to character (rbind probably will anyway, so just add stringsAsFactors = FALSE to as.data.frame) and lapply type.convert.
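One way to read the suggestion above is sketched here, assuming a made-up list with mixed types (mixed_list and fields are names invented for the example). Every field is turned into a character string first, and type.convert then restores a sensible type per column.

```r
# Hypothetical input: x is numeric, y is character, some fields are missing
mixed_list <- list(list(x = 2, y = "a"),
                   list(),
                   list(y = "b"),
                   list(x = 5))
fields <- c("x", "y")

# One character vector per element, with NA for absent fields
rows <- lapply(mixed_list, function(el) {
  vapply(fields, function(f) {
    if (is.null(el[[f]])) NA_character_ else as.character(el[[f]])
  }, character(1))
})

# Stack into a character data frame, then convert each column's type
chr_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
res <- as.data.frame(lapply(chr_df, type.convert, as.is = TRUE),
                     stringsAsFactors = FALSE)
str(res)
```

The field names have to be known up front in this version; collecting them with unique(unlist(lapply(mixed_list, names))) would be one way to avoid hard-coding them.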

Your method is already pretty compact, but if you're looking for other methods, one way might be to use rbindlist from data.table:
library(data.table)
new_list <- lapply(messy_list, function(x) if(identical(x,list())){list(x = NA)} else {x})
rbindlist(new_list, fill = TRUE, use.names = TRUE)
# x y
#1: 2 3
#2: NA NA
#3: NA 4
#4: 5 NA
Note that the lapply step is needed so that rbindlist does not drop the rows that are empty.


problem with bootMer CI: upper and lower limits are identical

I'm having the hardest time generating confidence intervals for my glmer Poisson model. After following several very helpful tutorials (such as https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/) as well as Stack Overflow posts, I keep getting very strange results: the upper and lower limits of the CI are identical.
Here is a reproducible example containing a response variable called "production," a fixed effect called "Treatment_Num" and a random effect called "Genotype":
df1 <- data.frame(production=c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0), Treatment_Num=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), Genotype=c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))
library(lme4) # provides glmer() and bootMer(); must be loaded before the model is fit
#run the glmer model
df1_glmer <- glmer(production ~ Treatment_Num +(1|Genotype),
data = df1, family = poisson(link = "log"))
#make an empty data set to predict from, that contains the explanatory variables but no response
require(magrittr)
df_empty <- df1 %>%
tidyr::expand(Treatment_Num, Genotype)
#create new column containing predictions
df_empty$PopPred <- predict(df1_glmer, newdata = df_empty, type="response",re.form = ~0)
#function for bootMer
myFunc_df1_glmer <- function(mm) {
predict(df1_glmer, newdata = df_empty, type="response",re.form=~0)
}
#run bootMer
require(lme4)
merBoot_df1_glmer <- bootMer(df1_glmer, myFunc_df1_glmer, nsim = 10)
#get confidence intervals out of it
predCL <- t(apply(merBoot_df1_glmer$t, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)))
#enter lower and upper limits of confidence interval into df_empty
df_empty$lci <- predCL[, 1]
df_empty$uci <- predCL[, 2]
#when viewing df_empty the problem becomes clear: the lci and uci are identical!
df_empty
Any insights you can give me will be much appreciated!
The issue is with the function you created to pass to bootMer(). You wrote:
myFunc_df1_glmer <- function(mm) {
predict(df1_glmer, newdata = df_empty, type="response",re.form=~0)
}
The argument mm should be a fitted model object derived from the bootstrapped data.
However, you don't pass this object to predict(), but rather the original model
object. If you change the function to:
myFunc_df1_glmer <- function(mm) {
predict(mm, newdata = df_empty, type="response",re.form=~0)
#^^ pass in the object created by bootMer
}
then it works:
> df_empty
# A tibble: 8 x 5
Treatment_Num Genotype PopPred lci uci
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 12.9 9.63 15.7
2 1 2 12.9 9.63 15.7
3 2 1 5.09 3.87 5.89
4 2 2 5.09 3.87 5.89
5 3 1 2.01 1.20 2.46
6 3 2 2.01 1.20 2.46
7 4 1 0.796 0.361 1.14
8 4 2 0.796 0.361 1.14
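The same pitfall can be reproduced without lme4. The sketch below is not bootMer; boot_refit is a hypothetical stand-in that simulates responses from an ordinary lm fit, refits, and applies a function to each refit. All names and data in it are made up for illustration.

```r
set.seed(42)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20)
fit <- lm(y ~ x, data = dat)
newdata <- data.frame(x = c(5, 15))

# Hypothetical stand-in for bootMer(): simulate responses from the fitted
# model, refit on each simulated response, and apply FUN to every refit.
boot_refit <- function(model, data, FUN, nsim) {
  sims <- simulate(model, nsim = nsim)
  t(sapply(sims, function(y_sim) {
    data$y <- y_sim
    FUN(lm(formula(model), data = data))
  }))
}

# Ignores its argument and always predicts from the original fit
ignores_refit <- function(mm) predict(fit, newdata = newdata)
# Predicts from the refitted model it is handed
uses_refit <- function(mm) predict(mm, newdata = newdata)

flat <- boot_refit(fit, dat, ignores_refit, nsim = 50)
varied <- boot_refit(fit, dat, uses_refit, nsim = 50)

# Width of the bootstrap distribution for each prediction
spread <- function(m) apply(m, 2, function(col) diff(range(col)))
spread(flat)    # every replicate is identical, so any quantiles coincide
spread(varied)  # replicates differ, so an interval has nonzero width
```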
As an aside: how many genotypes are in your actual data? If fewer than 5-7, you might
do better using a straight-up glm() with genotype as a factor using sum-to-zero
contrasts.

How can I get multiple outputs from single operation?

This code was written to analyze within- and between-species interactions.
I tried, unsuccessfully, to get separate outputs from each analysis.
lapply(data.list, function(x) {
grp <- factor(x$species)
window <- ripras(x$utmX, x$utmY)
pp.grp <- ppp(x$utmX, x$utmY, window=window, marks=grp)
split.grp <- split(pp.grp)
L <- (alltypes(pp.grp, "L"))
LE <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
return("L", "LE")
})
plot(L[1])
So my question is how I can get multiple outputs from a single operation?
Thank you so much in advance!
The most common way to handle multiple outputs from a function in R is to put
the results in a list and return that. Hopefully this can inspire you:
f <- function(x){
L <- x
LE <- matrix(x, 2, 2)
rslt <- list(L = L, LE = LE)
return(rslt)
}
y <- f(7)
Now y is a list with two elements: L and LE
y
#> $L
#> [1] 7
#>
#> $LE
#> [,1] [,2]
#> [1,] 7 7
#> [2,] 7 7
Use $ to get a named element (in this case L – same as y[[1]]):
y$L
#> [1] 7
Created on 2019-03-16 by the reprex package (v0.2.1)
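Applied to an lapply call, the same idea gives one list per input, each holding both results. This is only a sketch with made-up data and stand-in computations (mean and range replace the spatial analyses, which are not reproduced here).

```r
# Hypothetical input: one numeric vector per data set
data.list <- list(a = c(1, 2, 3), b = c(10, 20))

results <- lapply(data.list, function(x) {
  L <- mean(x)     # stand-in for the first analysis
  LE <- range(x)   # stand-in for the second analysis
  list(L = L, LE = LE)   # return both results together
})

results$a$L       # first result for data set "a"
results[[2]]$LE   # second result for the second data set
```

Because the results are kept in the returned list, they remain available after lapply finishes; objects created inside the anonymous function are not visible outside it.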

Tidy nested json tree

This comes up a lot when dealing with APIs.
Most of the time, to do real analysis, I'd like to get my dataset tidy, but this typically requires a separate solution for each type of tree rather than something more general.
I figured it would be nice to have one function that generates tidy data (albeit with a ton of NAs in deeply nested trees with many different factor levels).
I have a hackish solution, shown below, that uses unlist(..., recursive = FALSE) plus a naming convention,
but I'd like to see if someone here has a better way to tidy these kinds of list structures.
#####################
# Some Test Data
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
############################################################
# Run through the list and rename all list elements,
# We unlist one level at a time, adding "__" at each unlist step
# until the object is no longer a list
renameVars <- function(lst, sep = '__') {
if(is.list(lst)) {
names(lst) <- paste0(names(lst),sep)
renameVars(unlist(lst, recursive = FALSE),sep = sep)
} else {
lst
}
}
res <- renameVars(aNestedTree)
We can check the output and see that we have a strangely named object,
but there's a method to this madness.
> res
a________ b________ c__.a____1__ c__.a____2__ c__.a____3__
1 2 1 2 3
c__.a____4__ c__.a____5__ c__.b______ c__.c__.a____ c__.c__.d____
4 5 2 1 3
d__.y______ d__.z______
3 2
Now I put this in a data.table, so I can shape it.
library(data.table)
dt <- data.table(values = res, name = names(res))
# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests
> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
values name V1 V2 V3
1: 1 a________ a NA NA
2: 2 b________ b NA NA
3: 1 c__.a____1__ c a 1
4: 2 c__.a____2__ c a 2
5: 3 c__.a____3__ c a 3
6: 4 c__.a____4__ c a 4
7: 5 c__.a____5__ c a 5
8: 2 c__.b______ c b NA
9: 1 c__.c__.a____ c c a
10: 3 c__.c__.d____ c c d
11: 3 d__.y______ d y NA
12: 2 d__.z______ d z NA
I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)
I thought about going through bind.c and pulling out do_unlist to make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
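For comparison, a plain recursive walk can produce a long table of paths and values without relying on the names that unlist builds. This is a rough sketch, not the approach above; flatten_tree is a hypothetical helper, and it assumes every leaf is an atomic vector.

```r
aNestedTree <- list(a = 1, b = 2,
                    c = list(a = list(1:5), b = 2,
                             c = list(a = 1, d = 3, e = list())),
                    d = list(y = 3, z = 2))

# Hypothetical helper: walk the list, carrying the path of names so far
flatten_tree <- function(node, path = character(0)) {
  if (!is.list(node)) {
    # Leaf: one row per value, labelled with the full path
    return(data.frame(path = paste(path, collapse = "/"),
                      value = node, stringsAsFactors = FALSE))
  }
  if (length(node) == 0) return(NULL)   # empty lists contribute no rows
  nms <- names(node)
  if (is.null(nms)) nms <- as.character(seq_along(node))  # unnamed: use index
  parts <- Map(function(child, nm) flatten_tree(child, c(path, nm)), node, nms)
  do.call(rbind, unname(parts))
}

flat <- flatten_tree(aNestedTree)
print(flat)
```

Splitting the path column on "/" (for example with strsplit) would then give one column per nesting level.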
I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.
I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.
Notes on approach:
It assumes that all values are numeric (I cast them to character afterwards).
Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
The nesting level is tracked by the depth of recursion.
First define the helpers:
recurse_gather <- function(.x,.level) {
.x <- tidyjson::bind_rows(
gobj(.x,.level)
, garr(.x,.level)
, gpersist(.x,.level)
)
if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
.x <- recurse_gather(.x,.level+1)
}
return(.x)
}
gobj <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='object') %>%
gather_object(paste0('v',.level)) %>%
select(-type)
}
gpersist <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(! type %in% c('object','array')) %>%
mutate_(.dots=setNames(
paste0('as.character(NA)')
,paste0('v',.level)
)) %>%
select(-type)
}
garr <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='array') %>%
gather_array('arridx') %>%
append_values_number(paste0('v',.level)) %>%
mutate_(.dots=setNames(
paste0('as.character(v',.level,')')
,paste0('v',.level)
)) %>%
select(-arridx,-type)
}
Then using the helpers is pretty straightforward.
library(dplyr)
library(tidyjson)
j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#> document.id v1 v2 v3 v4
#> * <int> <chr> <chr> <chr> <chr>
#> 1 1 a 1 <NA> <NA>
#> 2 1 b 2 <NA> <NA>
#> 3 1 c a 1 <NA>
#> 4 1 c a 2 <NA>
#> 5 1 c a 3 <NA>
#> 6 1 c a 4 <NA>
#> 7 1 c a 5 <NA>
#> 8 1 c b 2 <NA>
#> 9 1 c c a 1
#> 10 1 c c d 3
#> 11 1 d y 3 <NA>
#> 12 1 d z 2 <NA>
Hopeful that future development on the tidyjson package will make this an easier problem to tackle!
I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson
As dracodoc pointed out, data.tree might help. E.g. like this:
library(data.tree)
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
tree <- FromListSimple(aNestedTree)
print(tree)
This will give:
levelName z
1 Root NA
2 ¦--c NA
3 ¦ ¦--a NA
4 ¦ °--c NA
5 ¦ °--e NA
6 °--d 2
And:
tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"
Side note: typically, you could do something like this:
do.call("print", c(tree, tree$fieldsAll))
However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.

How to write a JSON object from R dataframe with grouping

In general, I feel there is a need to make JSON objects by folding multiple columns. There is no direct way to do this as far as I know; please point it out if there is.
I have data of this form:
A B C
1 a x
1 a y
1 c z
2 d p
2 f q
2 f r
How do I write a json which looks like
{'query':'1', 'type':[{'name':'a', 'values':[{'value':'x'}, {'value':'y'}]}, {'name':'c', 'values':[{'value':'z'}]}]}
and similarly for 'query':'2'
I am looking to spit them in the mongo import/export individual json lines format.
Any pointers are also appreciated..
You've got a little "non-standard" thing going with two keys of "value" (I don't know if this is legal json), as you can see here:
(js <- jsonlite::fromJSON('{"query":"1", "type":[{"name":"a", "values":[{"value":"x"}, {"value":"y"}]}, {"name":"c", "values":[{"value":"z"}]}]}'))
## $query
## [1] "1"
##
## $type
## name values
## 1 a x, y
## 2 c z
... with a data.frame cell containing a list of data.frames:
js$type$values[[1]]
## value
## 1 x
## 2 y
class(js$type$values[[1]])
## [1] "data.frame"
If you can accept your "type" variable containing a vector instead of a named-list, then perhaps the following code will suffice:
jsonlite::toJSON(lapply(unique(dat[, 'A']), function(a1) {
list(query = a1,
type = lapply(unique(dat[dat$A == a1, 'B']), function(b2) {
list(name = b2,
values = dat[(dat$A == a1) & (dat$B == b2), 'C'])
}))
}))
## [{"query":[1],"type":[{"name":["a"],"values":["x","y"]},{"name":["c"],"values":["z"]}]},{"query":[2],"type":[{"name":["d"],"values":["p"]},{"name":["f"],"values":["q","r"]}]}]
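If the nested "value" objects and one JSON document per line are needed, something along these lines might get closer to the requested shape. It is a sketch under assumptions: dat is reconstructed from the table in the question, and it relies on jsonlite serializing a data frame as an array of row objects and on auto_unbox = TRUE turning length-one vectors into scalars.

```r
library(jsonlite)

# Data reconstructed from the table in the question
dat <- data.frame(A = c(1, 1, 1, 2, 2, 2),
                  B = c("a", "a", "c", "d", "f", "f"),
                  C = c("x", "y", "z", "p", "q", "r"),
                  stringsAsFactors = FALSE)

# One JSON string per value of A
json_lines <- vapply(split(dat, dat$A), function(grp) {
  doc <- list(
    query = as.character(grp$A[1]),
    type = lapply(unique(grp$B), function(b) {
      # A one-column data frame becomes an array of {"value": ...} objects
      list(name = b,
           values = data.frame(value = grp$C[grp$B == b],
                               stringsAsFactors = FALSE))
    })
  )
  as.character(toJSON(doc, auto_unbox = TRUE))
}, character(1))

writeLines(json_lines)   # one document per line
```

Passing a file path as the second argument of writeLines would write the lines to disk for import.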

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z){
a <- subset(x, (x$var1 == y))
b <- subset(a, (a$var2 == z))
n=nrow(b)
return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar. Instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1)
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments, as in foobar(foo, df), and use y[, c(var1)] and y[, c(var2)] instead of the two parameters y and z, it still doesn't work. How can I do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
library(plyr)
match_df <- function(x, match) {
  vars <- names(match)
  # Build row keys over both data frames together so they are comparable
  keys <- join.keys(x, match, vars)
  # Keep the rows of x whose key also occurs in match
  x[keys$x %in% keys$y, , drop = FALSE]
}
match_df(foo, df)
# var1 var2
# 1 1 A
# 2 1 A
# 3 1 B
# 6 3 C
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at trying to figure out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that the use of c() is not needed here and you can reference the columns you want by placing the name of the column in quotes OR reference the column by number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column named var1, which has a length of three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your data frame to grab a single value, process it, and then go to the next row. That is what the apply function does here. It is equivalent to writing a loop along the lines of for (i in 1:nrow(df)), but is a more idiomatic way of handling such issues.
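Written out as an explicit loop, the row-by-row iteration looks roughly like this. It is a sketch that redefines the data and a compact version of foobar so that it runs on its own; the counts vector is a name chosen for the example.

```r
foo <- data.frame(var1 = c(1, 1, 1, 2, 2, 3),
                  var2 = c("A", "A", "B", "B", "C", "C"))
df <- data.frame(var1 = c(1, 1, 3), var2 = c("A", "B", "C"))

# Compact equivalent of foobar: count rows matching one (var1, var2) pair
foobar <- function(x, y, z) {
  nrow(x[x$var1 == y & x$var2 == z, ])
}

# Visit each row of df in turn and pass its two values as scalars
counts <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  counts[i] <- foobar(foo, df$var1[i], as.character(df$var2[i]))
}
counts
#> [1] 2 1 1
```

seq_len(nrow(df)) is used instead of 1:nrow(df) so that the loop is skipped cleanly when df has no rows.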
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treat these as numeric, in my opinion. Compare:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
.dofun <- function(y, z){
a <- subset(x, x$var1==y)
b <- subset(a, a$var2==z)
n <- nrow(b)
return (n)
}
ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
names(ans) <- NULL
return(ans)
}