R: Generic flattening of JSON to data.frame - json

This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.
There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.
In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].
If one has a collection of these data structures that may not be entirely homogeneous the key to doing a successful flattening to a dataframe would be to discover the names of all possible dataframe columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.
This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.

Hi #Sim I had cause to reflect on your problem yesterday define:
flatten<-function(x) {
dumnames<-unlist(getnames(x,T))
dumnames<-gsub("(*.)\\.1","\\1",dumnames)
repeat {
x <- do.call(.Primitive("c"), x)
if(!any(vapply(x, is.list, logical(1)))){
names(x)<-dumnames
return(x)
}
}
}
getnames<-function(x,recursive){
nametree <- function(x, parent_name, depth) {
if (length(x) == 0)
return(character(0))
x_names <- names(x)
if (is.null(x_names)){
x_names <- seq_along(x)
x_names <- paste(parent_name, x_names, sep = "")
}else{
x_names[x_names==""] <- seq_along(x)[x_names==""]
x_names <- paste(parent_name, x_names, sep = "")
}
if (!is.list(x) || (!recursive && depth >= 1L))
return(x_names)
x_names <- paste(x_names, ".", sep = "")
lapply(seq_len(length(x)), function(i) nametree(x[[i]],
x_names[i], depth + 1L))
}
nametree(x, "", 0L)
}
(getnames is adapted from AnnotationDbi:::make.name.tree)
(flatten is adapted from discussion here How to flatten a list to a list without coercion?)
as a simple example
my_data<-list(x=list(1,list(1,2,y='e'),3))
> my_data[['x']][[2]][['y']]
[1] "e"
> out<-flatten(my_data)
> out
$x.1
[1] 1
$x.2.1
[1] 1
$x.2.2
[1] 2
$x.2.y
[1] "e"
$x.3
[1] 3
> out[['x.2.y']]
[1] "e"
so the result is a flattened list with roughly the naming structure you suggest. Coercion is avoided also which is a plus.
A more complicated example
library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)
UPDATE
naive way to remove trailing .1
my_data<-list(x=list(1,list(1,2,y='e'),3))
gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
> gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
[1] "x.1" "x.2.1" "x.2.2" "x.2.y" "x.3"

R has two packages for dealing with JSON input: rjson and RJSONIO. If I understand correctly what you mean by "collection of non-cyclical homogeneous or heterogeneous data structures", I think either of these packages will import that sort of structure as a list.
You can then flatten that list (into a vector) using the unlist function.
If the list is suitably structured (a non-nested list where each element is the same length) then as.data.frame prvoides an alternative to convert the list to be a data frame.
An example:
(my_data <- list(x = list('1' = 1, '2' = list(y = 2))))
unlist(my_data)

The jsonlite package is a fork of RJSONIO specifically designed to make conversion between JSON and data frames easier. You don't provide any example json data, but I think this might be what you are looking for. Have a look at this blog post or the vignette.

Great answer with the flatten and getnames functions. Took a few minutes to figure out all the options needed to get from a vector of JSON strings to a data.frame, so I thought I'd record that here. Suppose jsonvec is a vector of JSON strings. The following builds a data.frame (data.table) where there is one row per string, and each column corresponds to a different possible leaf node of the JSON tree. Any string missing a particular leaf node is filled with NA.
library(data.table)
library(jsonlite)
parsed = lapply(jsonvec, fromJSON, simplifyVector=FALSE)
flattened = lapply(parsed, flatten) #using flatten from accepted answer
d = rbindlist(flattened, fill=TRUE)

I'm now a big fan of simply:
library(jsonlite)
library(tidyverse)
fromJSON("file_path.json") %>%
unlist() %>%
enframe()
And then potentially, depending on your data, piping that into
%>%
pivot_wider()
Once it's in a flat table shape, there are a load of tools in tidyverse and other R libraries more generally for wrangling things around and e.g., dealing with columns with similar prefixes (which will result from the above pipeline as the parent name of the children within a nested json chunk will be prefixed to the child's name).

Related

How to use/convert my R vector object in a rest (est.ensembl.org) api query?

Good morning.
I want to use the the following rest: https://rest.ensembl.org/documentation/info/sequence_id_post
I have the vector object (ids) in R:
> ids
[1] "NM_007294.3:c.932_933insT" "NM_007294.3:c.1883C>T" "NM_007294.3:c.2183A>C"
[4] "NM_007294.3:c.2321C>T" "NM_007294.3:c.4585G>A" "NM_007294.3:c.4681C>A"
I have to put this vector(ids) with more than 200 variables in the body= ids variable (bellow), according to the example of code below, for it works:
Code:
library(httr)
library(jsonlite)
library(xml2)
server <- "https://rest.ensembl.org"
ext <- "/vep/human/hgvs"
r <- POST(paste(server, ext, sep = ""), content_type("application/json"), accept("application/json"), body = '{ "hgvs_notations" : ["NM_007294.3:c.932_933insT", "NM_007294.3:c.1883C>T"] }')
stop_for_status(r)
head(fromJSON(toJSON(content(r))))
I know it's a json format, but when I convert my variable ids to json it's not in the correct format.
Do you have any suggestions?
Thanks for any help.
Leandro
I think that NM_007294.3:c.2321C>T is not a valid query to /sequence/id REST endpoint. It contains a sequence id (NM_007294.3) and a variant (c.2321C>T) and if you understood this literally, you are asking the server a letter T, since this call returns sequences.
Valid query would contain only sequence ids and you can use it like that (provided you have your ids in a vector):
r <- POST(paste(server, ext, sep = ""), content_type("application/json"), accept("application/json"), body = paste0('{ "ids" :', jsonlite::toJSON(ids), ' }')
Depending on the downstream scenario, making your ids unique might help/speed things up.

Convert R data.frame to multilevel JSON

I have a periodic process in R that yields me a data.frame.
I want to use this data.frame to create a dropdown selector with AngularJS.
My final data.frame will look more or less as follows (my real example might have a deeper hierarchical structure):
DF<-data.frame(hie1=c(rep("Cl1",2),"Cl2"),hie2=c("Cl1op1","Cl1op2","Clop1"),
hie3=c("/first.html","/second.html","/third.html"))
I need to convert that data.frame into a JSON with the following structure :
{
"Cl1":{"Cl1op1":"/first.html","Cl1op2": "/second.html"},
"Cl2":{"Cl2op1":"/third.html"}
}
So far, I have tried all the toJSON commands of the rjson and RJSONIO packages for the data.frame with and without column names:
library(rjson)
#library(RJSONIO)
DF2<-DF
colnames(DF2)<-NULL
cat(toJSON(DF))
cat(toJSON(DF2))
I thought about using reshape2's dcast function beforeusing toJSON, but I do not know what kind of structure I need to achieve my goal.
I also used the functions toJSON2 an toJSONArray from the rCharts with no success.
Is there an appropriate transformation in R to get the output I am looking for?
P.S. (I do not mind having [] instead of {})
EDIT:
I have created a couple of functions (included below) to fulfil my needs.
However, they are not too clean and I believe that there must be a better way to perform this transformation in R.
I keep this question open expecting a better solution.
linktwo<-function(V){
paste0(sapply(V,function(x) paste0("'",toString(x),"'")),collapse=":")
}
pastehier<-function(DF){
if(ncol(DF)==2){
return(paste0(apply(DF,1,linktwo),collapse=","))
}else{
u<-unique(DF[,1])
output=character()
for(i in u){
output<-append(output,paste0(paste0("'",i,"'"),":{",pastehier(DF[DF[,1]==i,-1]),
"}"))
}
return(paste0(output,collapse=","))
}
}
pastehier(DF)
I do not fully understand your request and maybe my solution is useless, but here is a try:
library(reshape2)
prova <- dcast(DF, hie1 ~ ... )
toJSON(prova, pretty = TRUE)
[
{
"hie1": "Cl1",
"Cl1op1": "/first.html",
"Cl1op2": "/second.html"
},
{
"hie1": "Cl2",
"Clop1": "/third.html"
}
]
where:
> prova
hie1 Cl1op1 Cl1op2 Clop1
1 Cl1 /first.html /second.html <NA>
2 Cl2 <NA> <NA> /third.html

Is it possible, in R, to access the values of a list with a for loop on the names of the fields?

I have a big json file, containing 18 fields, some of which contain some other subfields. I read the file in R in the following way:
json_file <- "daily_profiles_Bnzai_20150914_20150915_20150914.json"
data <- fromJSON(sprintf("[%s]", paste(readLines(json_file), collapse=",")))
This gives me a giant list with all the fields contained in the json file. I want to make it into a data.frame and do some operations in the meantime. For example if I do:
doc_length <- data.frame(t(apply(as.data.frame(data$doc_lenght_map), 1, unlist)))
os <- data.frame(t(apply(as.data.frame(data$operating_system), 1, unlist)))
navigation <- as.data.frame(data$navigation)
monday <- data.frame(t(apply(navigation[,grep("Monday",names(data$navigation))],1,unlist)))
Monday <- data.frame(apply(monday, 1, sum))
works fine, I get what I want, with all the right subfields and then I want to join them in a final data.frame that I will use to do other operations.
Now, I'd like to do something like that on the subset of fields where I don't need to do operations. So, for example, the days of the week contained in navigation are not included. I'd like to have something like (suppose I have a data.frame df):
for(name in names(data))
{
df <- cbind(df, data.frame(t(apply(as.data.frame(data$name), 1, unlist)))
}
The above loop gives me errors. So, what I want to do is finding a way to access all the fields of the list in an automatic way, as in the loop, where the iterator "name" takes on all the fields of the list, without having to call them singularly and then doing some operations with those fields. I tried even with
for(name in names(data))
{
df <- cbind(df, data.frame(t(apply(as.data.frame(data[name]), 1, unlist)))
}
but it doesn't take all of the subfields. I also tried with
data[, name]
but it doesn't work either. So I think I need to use the "$" operator.
Is it possible to do something like that?
Thank you a lot!
Davide
Like the other commenters, I am confused, but I will throw this out to see if it might point you in the right direction.
# make mtcars a list as an example
data <- lapply(mtcars,identity)
do.call(
cbind,
lapply(
names(data),
function(name){
data.frame(data[name])
}
)
)

Parallel programming in R

I have a file that consists of multiple JSON objects. I need to read through these files and extract certain fields from the JSON objects. To complicate things, some of the objects do not contain all the fields. I am dealing with a large file of over 200,000 JSON objects. I would like to split job across multiple cores. I have tried to experiment with doSNOW, foreach, and parallel and really do not understand how to do this. The following is my code that I would like to make more efficient.
foreach (i in 2:length(linn)) %dopar% {
json_data <- fromJSON(linn[i])
if(names(json_data)[1]=="info")
next
mLocation <- ifelse('location' %!in% names(json_data$actor),'NULL',json_data$actor$location$displayName)
mRetweetCount <- ifelse('retweetCount' %!in% names(json_data),0,json_data$retweetCount)
mGeo <- ifelse('geo' %!in% names(json_data),c(-0,-0),json_data$geo$coordinates)
tweet <- rbind(tweet,
data.frame(
record.no = i,
id = json_data$id,
objecttype = json_data$actor$objectType,
postedtime = json_data$actor$postedTime,
location = mLocation,
displayname = json_data$generator$displayName,
link = json_data$generator$link,
body = json_data$body,
retweetcount = mRetweetCount,
geo = mGeo)
)
}
Rather than trying to parallelize an iteration, I think you're better off trying to vectorize (hmm, actually most of the below is still iterating...). For instance here we get all our records (no speed gain yet, though see below...)
json_data <- lapply(linn, fromJSON)
For location we pre-allocate a vector of NAs to represent records for which there is no location, then find records that do have a location (maybe there's a better way of doing this...) and update them
mLocation <- rep(NA, length(json_data))
idx <- sapply(json_data, function(x) "location" %in% names(x$actor))
mLocation[idx] <- sapply(json_data[idx], function(x) x$location$displayName)
Finally, create a 200,000 row data frame in a single call (rather than your 'copy and append' pattern, which makes a copy of the first row, then the first and second row, then the first, second, third row, then ... so N-squared rows, in addition to recreating factors and other data.frame specific expenses; this is likely where you spend most of your time)
data.frame(i=seq_along(json_data), location=mLocation)
The idea would be to accumulate all the columns, and then do just one call to data.frame(). I think you could cheat on parsing line-at-a-time, by pasting everything into a single string repersenting a JSON array, and parsing in one call
json_data <- fromJSON(sprintf("[%s]", paste(linn, collapse=",")))

Convert character to html in R

What's the prefered way in R to convert a character (vector) containing non-ASCII characters to html? I would for example like to convert
"ü"
to
"ü"
I am aware that this is possible by a clever use of gsub (but has anyone doen it once and for all?) and I thought that the package R2HTML would do that, but it doesn't.
EDIT: Here is what I ended up using; it can obviously be extended by modifying the dictionary:
char2html <- function(x){
dictionary <- data.frame(
symbol = c("ä","ö","ü","Ä", "Ö", "Ü", "ß"),
html = c("ä","ö", "ü","Ä",
"Ö", "Ü","ß"))
for(i in 1:dim(dictionary)[1]){
x <- gsub(dictionary$symbol[i],dictionary$html[i],x)
}
x
}
x <- c("Buschwindröschen", "Weißdorn")
char2html(x)
This question is pretty old but I couldn't find any straightforward answer... So I came up with this simple function which uses the numerical html codes and works for LATIN 1 - Supplement (integer values 161 to 255). There's probably (certainly?) a function in some package that does it more thoroughly, but what follows is probably good enough for many applications...
conv_latinsupp <- function(...) {
out <- character()
for (s in list(...)) {
splitted <- unlist(strsplit(s, ""))
intvalues <- utf8ToInt(enc2utf8(s))
pos_to_modify <- which(intvalues >=161 & intvalues <= 255)
splitted[pos_to_modify] <- paste0("&#0", intvalues[pos_to_modify], ";")
out <- c(out, paste0(splitted, collapse = ""))
}
out
}
conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou" "àéïôù12345"
The XML uses a method insertEntities for this, but that method is internal. So you may use it at your own risk, as there are no guarantees that it will remain to operate like this in future versions.
Right now, your code could be accomplished using
char2html <- function(x) XML:::insertEntities(x, c("ä"="auml", "ö"="ouml", …))
The use of a named list instead of a data.frame feels kind of elegant, but doesn't change the core of things. Under the hood, insertEntities calls gsub in much the same way your code does.
If numeric HTML entities are valid in your environment, then you could probably convert all your text into those using utf8ToInt and then turn safely printable ASCII characters back into unescaped form. This would save you the trouble of maintaining a dictionary for your entities.