Flatten deeply nested JSON in R

I am trying to use R to convert a nested JSON file into a two-dimensional data frame. My JSON file has a nested structure, but the names and properties are the same across levels.
{"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}
The desired dataset would look like this, although the exact column names can be different.
name  value  c_name  c_value  c_c_name  c_c_value
A     1      a1      11       a11       111
A     1      a1      11       a12       112
A     1      a2      12
The code I have so far flattens the data, but it only seems to work for the first level.
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
data <- fromJSON(json_file, flatten = TRUE)
View(data)
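A quick str() of the parse result seems to show why it stops: the second level survives only as a list column (output roughly):
str(data)
# List of 3
#  $ name : chr "A"
#  $ value: chr "1"
#  $ c    :'data.frame': 2 obs. of 3 variables:
#   ..$ name : chr [1:2] "a1" "a2"
#   ..$ value: chr [1:2] "11" "12"
#   ..$ c    :List of 2   <- the nested level is still a list column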
I tried multiple packages, including jsonlite and RJSONIO. I spent the last five hours debugging this and trying various online tutorials, but without success. Thanks for your help!

Firstly, that is some ugly JSON; if you have a way of avoiding it, do so. Consequently, what follows is also pretty ugly—to the degree that I normally wouldn't post it, but I am doing so now in the hope that some of the approaches may be of use. If it offends your eyes, let me know and I'll delete it.
library(jsonlite) # for fromJSON
library(reshape2) # for melt
library(dplyr) # for inner_join, select
jlist <- fromJSON(json_file)
jdf <- as.data.frame(jlist)
jdf$c.value <- as.numeric(jdf$c.value) # fix type
jdf$L1 <- as.integer(factor(jdf$c.name)) # for use as a key with an artifact of melt later *urg, sorry*
ccdf <- melt(jdf$c.c) # get nested list into usable form
names(ccdf)[1:2] <- c('c.c.name', 'c.c.value') # fix names so they won't cause problems with the join
df3 <- inner_join(jdf[, -5], ccdf) # join, take out nested column
df3$c.c.value <- as.numeric(df3$c.c.value) # fix type
df3 <- df3 %>% select(-L1, -c) # get rid of useless columns
which leaves you with
> df3
  name value c.name c.value c.c.name c.c.value
1    A     1     a1      11      a11       111
2    A     1     a1      11      a12       112
3    A     1     a2      12     <NA>        NA
with reasonably sensible types. The packages used are avoidable, if you like.
Is this scalable? Well, not really, without more of the same mess. If anybody else has a less nasty and more scalable approach for dealing with nasty JSON, please post it; I'd be as grateful as the OP.
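For later readers: one possibly less nasty route is tidyr::unnest(), which handles the NULL child via keep_empty. This is a sketch, assuming tidyr >= 1.0 and dplyr >= 1.0 (for keep_empty/names_sep and mutate(.before)); the column names are chosen to match the desired output above.
library(jsonlite)
library(dplyr)
library(tidyr)
top <- fromJSON(json_file)  # list: $name, $value, $c (a data frame with a list column)
flat <- as_tibble(top$c) %>%
  unnest(c, keep_empty = TRUE, names_sep = ".") %>%  # expand the nested level; NULL becomes an NA row
  rename(c_name = name, c_value = value,
         c_c_name = c.name, c_c_value = c.value) %>%
  mutate(name = top$name, value = top$value, .before = 1)  # repeat the root attributes
which should give the same three rows as df3, with everything still character.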

I think I figured out a way to do this, and it seems to work with larger trees. The idea is to unlist the JSON and use the names attribute of the unlisted elements. In this example, if a node has one parent, its name will start with "c."; if it has a parent and a grandparent, it will start with "c.c.", and so on. The code below uses this structure to find the level of nesting and place each node in the appropriate columns. The rest of the code adds the attributes of the parent nodes and deletes the extra rows generated. I know it is not elegant, but I thought it might be useful for others.
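To see the structure this relies on: for the sample JSON, the names of the unlisted result look like this (illustrative, written out by hand):
# names(unlist(fromJSON(json_file, simplifyVector = FALSE))) gives, in order:
# "name"      "value"     "c.name"    "c.value"   "c.c.name"
# "c.c.value" "c.c.name"  "c.c.value" "c.name"    "c.value"
Counting the occurrences of "c." in each name gives the nesting depth of that node.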
library(stringr)
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
nestedjson <- fromJSON(json_file, simplifyVector = FALSE)  # read the JSON as nested lists
nAttrPerNode <- 2  # number of attributes per node
strChild <- "c."   # marker that determines the level of nesting
unnestedjson <- unlist(nestedjson)            # flatten the JSON to a named vector
unnestednames <- attr(unnestedjson, "names")  # get the names of the cells
depthTree <- (max(str_count(unnestednames, strChild)) + 1) * nAttrPerNode  # maximum tree depth (in columns)
htTree <- length(unnestednames) / nAttrPerNode  # maximum tree height (number of branches)
X <- array("", c(htTree, depthTree))

# Iterate through the branches and place each node based on the count of strChild in its name
for (nodeht in 1:htTree) {
  nodeIndex <- nodeht * nAttrPerNode
  nodedepth <- str_count(unnestednames[nodeIndex], strChild) + 1
  X[nodeht, nodedepth * nAttrPerNode - 1] <- unnestedjson[nodeIndex - 1]
  X[nodeht, nodedepth * nAttrPerNode] <- unnestedjson[nodeIndex]
}

# Repeat the parent node attributes for the children
for (nodeht in 2:htTree) {
  nodedepth <- 0
  repeat {
    nodedepth <- nodedepth + 1
    startcol <- nodedepth * nAttrPerNode - 1
    endcol <- startcol + nAttrPerNode - 1
    if (X[nodeht, startcol] == "" & nodedepth < depthTree / nAttrPerNode) {
      X[nodeht, startcol:endcol] <- X[nodeht - 1, startcol:endcol]
    } else {
      break
    }
  }
}

# Finally, delete the rows that only contain the parent attributes of nodes that have children
deleteRows <- NULL
strBranches <- apply(X, 1, paste, collapse = "")
for (nodeht in 1:(htTree - 1)) {
  branch2sub <- substr(strBranches[nodeht + 1], 1, nchar(strBranches[nodeht]))
  if (strBranches[nodeht] == branch2sub) {
    deleteRows <- c(deleteRows, nodeht)
  }
}
if (length(deleteRows) > 0) X <- X[-deleteRows, , drop = FALSE]  # guard against the no-deletion case
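To finish with a data frame and column names along the lines of the desired output (a small follow-on; these names assume this particular sample, where the matrix ends up with six columns):
X <- as.data.frame(X, stringsAsFactors = FALSE)
names(X) <- c("name", "value", "c_name", "c_value", "c_c_name", "c_c_value")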

Related

rbind fromJSON page: duplicate rowname error

I was trying to rbind some JSON data scraped from an API.
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
  # Generate the URL for each page
  url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=', i)
  # Get the JSON data from each page and transform it into a data frame
  dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
  pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Changing row.names to NULL doesn't work. I heard from someone that it is due to the fact that some data are stored as lists here, which I don't quite understand.
I understand that there is an alternative package, WDI, to access this data, and it works well, but I want to know how to resolve the duplicate row-name problem in general, so that I can deal with similar situations where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is that a single bracket can select multiple elements, whereas [[ selects only one.
You can see how this works with some fake data.
foo <- list(
  a = rnorm(100),
  b = rnorm(100),
  c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select a single data.frame inside of the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
  url <- paste0(
    'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
    i
  )
  res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
  final <- rbind(final, res)
}
An alternative solution with lapply:
urls <- sprintf(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
  1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2)  # use lapply to select the 2nd element of each list element
resl <- do.call(rbind, resl)   # take all elements of the list and use them as the arguments for rbind
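If purrr is available, the same idea reads a little more compactly (a sketch; list_rbind() requires purrr >= 1.0):
library(purrr)
resl <- map(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- map(resl, 2)       # pluck the 2nd element from each response
final <- list_rbind(resl)  # row-bind all pages at once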

R: jsonlite - export key:value pairs from a list of lists

I have a list of lists which are of variable length. The first value of each nested list is the key, and the rest of the values in the list will be the array entry. It looks something like this:
[[1]]
[1] "Bob" "Apple"
[[2]]
[1] "Cindy" "Apple" "Banana" "Orange" "Pear" "Raspberry"
[[3]]
[1] "Mary" "Orange" "Strawberry"
[[4]]
[1] "George" "Banana"
I've extracted the keys and entries as follows:
keys <- lapply(x, '[', 1)
entries <- lapply(x, '[', -1)
but now that I have these, I don't know how to associate the key:value pairs in R without creating a matrix first. That seems silly, since my data don't fit in a rectangle anyway (every example I've seen uses the column names of a matrix as the keys).
This is my crappy method using a matrix: assigning column names, then using jsonlite to export to JSON.
# Create a matrix from entries, without recycling.
# I found this function on StackOverflow, and it seems to work...
cbind.fill <- function(...) {
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}
# Call said function
matrix <- cbind.fill(entries)
# Transpose the thing
matrix <- t(matrix)
# Set column names
colnames(matrix) <- keys
# Export to JSON
json <- toJSON(matrix)
The result is good, but the implementation sucks. Result:
[{"Bob":["Apple"],"Cindy":["Apple","Banana","Orange","Pear","Raspberry"],"Mary":["Orange","Strawberry"],"George":["Banana"]}]
Please let me know of better ways that might exist to accomplish this.
How about:
names(entries) <- unlist(keys)
toJSON(entries)
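For completeness, a self-contained sketch that reconstructs x from the question and applies the same idea:
library(jsonlite)
x <- list(
  c("Bob", "Apple"),
  c("Cindy", "Apple", "Banana", "Orange", "Pear", "Raspberry"),
  c("Mary", "Orange", "Strawberry"),
  c("George", "Banana")
)
keys <- lapply(x, `[`, 1)      # the first element of each vector is the key
entries <- lapply(x, `[`, -1)  # the rest are the array entries
names(entries) <- unlist(keys)
toJSON(entries)
# {"Bob":["Apple"],"Cindy":["Apple","Banana","Orange","Pear","Raspberry"],"Mary":["Orange","Strawberry"],"George":["Banana"]}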
Consider the following lapply() approach:
library(jsonlite)
entries <- list(c('Bob', 'Apple'),
                c('Cindy', 'Apple', 'Banana', 'Orange', 'Pear', 'Raspberry'),
                c('Mary', 'Orange', 'Strawberry'),
                c('George', 'Banana'))
# ITERATE OVER ALL CONTENTS EXCEPT THE FIRST
nestlist <- lapply(entries,
                   function(i) {
                     inner <- i[2:length(i)]
                     return(inner)
                   })
# NAME EACH ELEMENT WITH ITS FIRST ELEMENT
names(nestlist) <- lapply(entries, function(i) i[1])
#$Bob
#[1] "Apple"
#$Cindy
#[1] "Apple" "Banana" "Orange" "Pear" "Raspberry"
#$Mary
#[1] "Orange" "Strawberry"
#$George
#[1] "Banana"
x <- toJSON(list(nestlist), pretty=TRUE)
x
#[
# {
# "Bob": ["Apple"],
# "Cindy": ["Apple", "Banana", "Orange", "Pear", "Raspberry"],
# "Mary": ["Orange", "Strawberry"],
# "George": ["Banana"]
# }
#]
I think this has already been sufficiently answered, but here is a method using purrr and jsonlite.
library(purrr)
library(jsonlite)
sample_data <- list(
  list("Bob", "Apple"),
  list("Cindy", "Apple", "Banana", "Orange", "Pear", "Raspberry"),
  list("Mary", "Orange", "Strawberry"),
  list("George", "Banana")
)
sample_data %>%
  map(~ set_names(list(.x[-1]), .x[1])) %>%
  toJSON(auto_unbox = TRUE, pretty = TRUE)

Trouble spreading values using tidyjson

I am trying to convert the following multi-document JSON file into a data.frame.
x <- '[
  {"name": "Bob", "groupIds": ["kwt6x61", "yiahf43"]},
  {"name": "Sally", "groupIds": "yiahf43"}
]'
I'm almost there by using
y <- x %>% gather_array() %>%
  spread_values(
    name = jstring("name"),
    groupIds = jstring("groupIds")
  )
print(y)
Which returns:
  document.id array.index  name                   groupIds
1           1           1   Bob list("kwt6x61", "yiahf43")
2           1           2 Sally                    yiahf43
Can someone help me spread the groupIds into additional rows?
This is an interesting problem. The issue stems from the fact that an array of length 1 is stored as a plain string; otherwise, enter_object('groupIds') %>% gather_array %>% append_values_string would work nicely. tidyjson does not seem to handle this situation gracefully. Strictly speaking the JSON is valid; it is just awkward that groupIds is a string in one record and an array in another.
In any case, although this is not an ideal solution, you can use json_types() to expose the difference and then treat each type conditionally. I converted to a tbl_df (i.e., dropped the JSON component) for further processing once the parsing is done.
library(tidyjson)
library(dplyr)
library(tidyr)
x <- '[
  {"name": "Bob", "groupIds": ["kwt6x61", "yiahf43"]},
  {"name": "Sally", "groupIds": "yiahf43"}
]'
## Show the different types
z <- x %>% gather_array() %>%
  spread_values(name = jstring('name')) %>%
  enter_object('groupIds') %>% json_types()
## Conditionally treat each type
final <- bind_rows(
  z[z$type == 'array', ] %>% gather_array('id') %>% append_values_string('groupId'),
  z[z$type == 'string', ] %>% append_values_string('groupId') %>% mutate(id = 1)
) %>% tbl_df
## Spread them out, maybe? Depends on what you're looking for
final %>% spread('id', 'groupId')
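If tidyjson feels heavy for this, a jsonlite-only sketch that normalises the string-or-array field (assuming the same x as above):
library(jsonlite)
recs <- fromJSON(x, simplifyVector = FALSE)
flat <- do.call(rbind, lapply(recs, function(r) {
  data.frame(name = r$name,
             groupId = unlist(r$groupIds),  # a bare string and a list both collapse to a character vector
             stringsAsFactors = FALSE)
}))
flat
#    name groupId
# 1   Bob kwt6x61
# 2   Bob yiahf43
# 3 Sally yiahf43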

Ragged dataframe in R, jsonlite::fromJSON

I am new to importing .json files for use in R. I'm trying to create a 'long'-format data frame: each row is one participant, each column is one variable. Most of my dataset is compatible after calling fromJSON, but one nested JSON structure results in a ragged list, with NULL, 1, 2, or 3 entries per participant (in theory there could be more).
Sample:
testdf <- fromJSON("[[\"MMM\",\"AAA\"],null,[\"GGG\",\"CCC\",\"NNN \"],null,null,[\"AAA\",\"NNN \"],null,[\"MMM\",\"AAA\"],null,null,null,null,[\"MMM\",\"AAA\"],[\"CCC\",\"AAA\"],\"NNN \",[\"MMM\",\"NNN \",\"EEE\"],null,null,[\"CCC\",\"MMM\",\"AAA\"],[\"HHH\",\"AAA\"],\"AAA\",[\"MMM\",\"AAA\",\"NNN \"],[\"CCC\",\"AAA\"],[\"MMM\",\"AAA\",\"NNN \"],[\"AAA\",\"NNN \"],[\"MMM\",\"AAA\"],null,null,null,null,null,null]", flatten=TRUE)
How can I transform this list into a 32 x n data frame that preserves the NULL values?
Variations on unlist remove the NULL values; rbind.fill moves entries to the next row, of course. Could something like cbind.fill work (cbind a df with an empty df)?
Something hidden in plyr?
Thanks for any suggestions.
Fairly straightforward:
t(sapply(testdf, function(x) {
  if (is.null(x)) x <- NA_character_  # turn NULL entries into a single NA
  length(x) <- 3                      # pad shorter vectors with NAs up to length 3
  x
}))
If you want to choose the number of columns automatically, then you need to calculate that first:
nc <- max(sapply(testdf, length))  # the widest entry determines the column count
t(sapply(testdf, function(x) {
  if (is.null(x)) x <- NA_character_
  length(x) <- nc
  x
}))
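If a data frame is wanted rather than a character matrix (the question did ask for one), a small follow-on:
mat <- t(sapply(testdf, function(x) {
  if (is.null(x)) x <- NA_character_
  length(x) <- nc
  x
}))
df <- as.data.frame(mat, stringsAsFactors = FALSE)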

Substring in Data Frame R

I have data from a GPS log like this (the data sits in the rows of a data frame column):
{"mAccuracy":20.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":21677339000000,"mExtras":{"networkLocationSource":"cached","networkLocationType":"wifi","noGPSLocation":{"mAccuracy":20.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":21677339000000,"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811956,"mLongitude":126.9104909,"mProvider":"network","mSpeed":0.0,"mTime":1402801381486},"travelState":"stationary"},"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811956,"mLongitude":126.9104909,"mProvider":"network","mSpeed":0.0,"mTime":1402801381486,"timestamp":1402801665.512}
The problem is that I only need the latitude and longitude values, so I thought I could use substring and sapply to apply it to all the data in the data frame.
But I am not sure this approach is sensible, because with substr (e.g. substr("abcdef", 2, 4)) I would need to count how many characters there are from the beginning of the string up to "mLatitude". Can anybody suggest a faster way of processing this?
Thank you to @mnel for answering the question. It works, but I still have a problem.
From mnel's answer I've created a function like this:
fgps <- function(x) {
  out <- fromJSON(x)
  c(out$mExtras$noGPSLocation$mLatitude,
    out$mExtras$noGPSLocation$mLongitude)
}
and this is my data:
gpsdata <- head(dfallgps[,4],2)
[1] "{\"mAccuracy\":23.128,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":76437488000000,\"mExtras\":{\"networkLocationSource\":\"cached\",\"networkLocationType\":\"wifi\",\"noGPSLocation\":{\"mAccuracy\":23.128,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":76437488000000,\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1779956,\"mLongitude\":126.9089661,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894224187},\"travelState\":\"stationary\"},\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1779956,\"mLongitude\":126.9089661,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894224187,\"timestamp\":1402894517.425}"
[2] "{\"mAccuracy\":1625.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":77069916000000,\"mExtras\":{\"networkLocationSource\":\"cached\",\"networkLocationType\":\"cell\",\"noGPSLocation\":{\"mAccuracy\":1625.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":77069916000000,\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811881,\"mLongitude\":126.9084072,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894857416},\"travelState\":\"stationary\"},\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811881,\"mLongitude\":126.9084072,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894857416,\"timestamp\":1402894857.519}"
When I run sapply, why does the result still show the full JSON strings, and not just the values?
sapply(gpsdata, function(gpsdata) fgps(gpsdata))
{"mAccuracy":23.128,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":76437488000000,"mExtras":{"networkLocationSource":"cached","networkLocationType":"wifi","noGPSLocation":{"mAccuracy":23.128,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":76437488000000,"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1779956,"mLongitude":126.9089661,"mProvider":"network","mSpeed":0.0,"mTime":1402894224187},"travelState":"stationary"},"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1779956,"mLongitude":126.9089661,"mProvider":"network","mSpeed":0.0,"mTime":1402894224187,"timestamp":1402894517.425}
[1,] 35.178
[2,] 126.909
{"mAccuracy":1625.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":77069916000000,"mExtras":{"networkLocationSource":"cached","networkLocationType":"cell","noGPSLocation":{"mAccuracy":1625.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":77069916000000,"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811881,"mLongitude":126.9084072,"mProvider":"network","mSpeed":0.0,"mTime":1402894857416},"travelState":"stationary"},"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811881,"mLongitude":126.9084072,"mProvider":"network","mSpeed":0.0,"mTime":1402894857416,"timestamp":1402894857.519}
[1,] 35.18119
[2,] 126.90841
I want the result to look like this:
[1] 35.178 126.909
[2] 35.18119 126.90841
Thank you
It would appear that your data is in JSON format, so use RJSONIO::fromJSON to read it.
E.g.:
txt <- "{\"mAccuracy\":20.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":21677339000000,\"mExtras\":{\"networkLocationSource\":\"cached\",\"networkLocationType\":\"wifi\",\"noGPSLocation\":{\"mAccuracy\":20.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":21677339000000,\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811956,\"mLongitude\":126.9104909,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402801381486},\"travelState\":\"stationary\"},\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811956,\"mLongitude\":126.9104909,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402801381486,\"timestamp\":1402801665.512}"
Then process:
library(RJSONIO)
out <- fromJSON(txt)
out$mLongitude
#[1] 126.9105
out$mLatitude
#[1] 35.1812
# to process multiple values
tt <- rep(txt,2)
myData <- lapply(tt, fromJSON)
latlong <- do.call(rbind,lapply(myData, `[` ,c('mLatitude','mLongitude')))
# or using rbind list
library(data.table)
latlong <- rbindlist(lapply(myData, `[` ,c('mLatitude','mLongitude')))
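As for the follow-up about sapply() printing the whole JSON strings: by default sapply() uses a character input as the names of the result, so the long strings become column names. USE.NAMES = FALSE turns that off; a sketch using the fgps() defined above:
res <- t(sapply(gpsdata, fgps, USE.NAMES = FALSE))
res
#          [,1]     [,2]
# [1,] 35.17800 126.9090
# [2,] 35.18119 126.9084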