Extract JSON data from the rows of an R data frame - json

I have a data frame where the values of the column Parameters are JSON strings:
# Parameters
#1 {"a":0,"b":[10.2,11.5,22.1]}
#2 {"a":3,"b":[4.0,6.2,-3.3]}
...
I want to extract the parameters of each row and append them to the data frame as columns A, B1, B2 and B3.
How can I do it?
I would rather use dplyr if it is possible and efficient.

In your example data, each row contains one JSON object. This format is called JSON Lines (also known as NDJSON), and the jsonlite package has a dedicated function, stream_in, that parses such data into a data frame:
# Example data
mydata <- data.frame(parameters = c(
'{"a":0,"b":[10.2,11.5,22.1]}',
'{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)
# Parse json lines
res <- jsonlite::stream_in(textConnection(mydata$parameters))
# Extract columns
a <- res$a
b1 <- sapply(res$b, "[", 1)
b2 <- sapply(res$b, "[", 2)
b3 <- sapply(res$b, "[", 3)
In your example, the json structure is fairly simple so the other suggestions work as well, but this solution will generalize to more complex json structures.
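To get all the way to the data frame the question asks for (columns A, B1, B2, B3 appended to the original), the extracted pieces still have to be bound back on. Here is a minimal sketch of that last step, assuming every row's b array has the same length; the names b and out are just illustrative, and it uses base R's cbind rather than dplyr to keep dependencies small.

```r
library(jsonlite)

mydata <- data.frame(parameters = c(
  '{"a":0,"b":[10.2,11.5,22.1]}',
  '{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)

# Parse one JSON object per row (verbose = FALSE silences the progress output)
res <- stream_in(textConnection(mydata$parameters), verbose = FALSE)

# res$b is expected to be a list with one numeric vector per row;
# stack the vectors row-wise into a matrix (assumes equal lengths)
b <- if (is.list(res$b)) do.call(rbind, res$b) else res$b
colnames(b) <- paste0("B", seq_len(ncol(b)))

# Append the parsed values to the original data frame
out <- cbind(mydata, A = res$a, as.data.frame(b))
print(out)
```

If the b arrays can differ in length, they would need to be padded with NA first before stacking.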

I had a similar problem: several variables in my data frame held JSON objects, many of the values were NA, and I did not want to drop the rows containing NAs. I wrote a function that takes a data frame, the name of an id column within it (usually a record ID), and the quoted name of the variable to parse. The function creates two subsets, one for records that contain JSON objects and one that keeps track of the records where the same variable is NA. It then joins those two data frames and merges the result into the original data frame, replacing the original variable.
Perhaps it will help you or someone else, as it has worked for me in a few cases now. I haven't really cleaned it up, so I apologize if the variable names are confusing; this was a very ad hoc function I wrote for work. I should also say that I used another poster's idea for replacing the original variable with the new variables created from the JSON object. You can find that here: Add (insert) a column between two columns in a data.frame
One last note: there is a package called tidyjson which would've had a simpler solution but apparently cannot work with list type JSON objects. At least that's my interpretation.
library(jsonlite)
library(stringr)
library(dplyr)
parse_var <- function(df,id, var) {
m <- df[,var]
# logical indexing also works when the variable contains no NAs
# (m[-which(is.na(m))] would return an empty vector in that case)
is_na <- is.na(m)
p <- m[!is_na]
n <- df[,id]
key <- n[!is_na]
#create df for rows which are NA
key_na <- n[is_na]
q <- m[is_na]
parse_df_na <- data.frame(key_na,q,stringsAsFactors = FALSE)
#Parse JSON values and bind them together into a dataframe.
p <- lapply(p,function(x){
fromJSON(x) %>% data.frame(stringsAsFactors = FALSE)}) %>% bind_rows()
#bind the record id's of the JSON values to the above JSON parsed dataframe and name the columns appropriately.
parse_df <- data.frame(key,p,stringsAsFactors = FALSE)
## The new variables begin with a capital 'X', so I replace that prefix with the name of the parsed variable
n <- names(parse_df) %>% str_replace('^X',paste(var,".",sep = ""))
n <- n[2:length(n)]
colnames(parse_df) <- c(id,n)
#join the dataframe for NA JSON values and the dataframe containing parsed JSON values, then remove the NA column,q.
parse_df <- merge(parse_df,parse_df_na,by.x = id,by.y = 'key_na',all = TRUE)
#Remove the new column formed by the NA values#
parse_df <- parse_df[,-which(names(parse_df) =='q')]
####Replace variable that is being parsed in dataframe with the new parsed and names values.######
new_df <- data.frame(append(df,parse_df[,-which(names(parse_df) == id)],after = which(names(df) == var)),stringsAsFactors = FALSE)
new_df <- new_df[,-which(names(new_df) == var)]
return(new_df)
}
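For readers who only need the core idea (parse the non-NA rows, then join back so the NA rows survive), here is a much smaller sketch. It is not the function above: the toy data frame, the column names id and params, and the use of dplyr::left_join are all assumptions made for illustration, and it assumes each JSON object is flat.

```r
library(jsonlite)
library(dplyr)

# Toy data: the second record has no JSON at all
df <- data.frame(
  id = 1:3,
  params = c('{"a":1,"b":"x"}', NA, '{"a":2,"b":"y"}'),
  stringsAsFactors = FALSE
)

# Parse only the rows that actually contain JSON
ok <- !is.na(df$params)
parsed <- bind_rows(lapply(df$params[ok], function(x) {
  as.data.frame(fromJSON(x), stringsAsFactors = FALSE)
}))
parsed$id <- df$id[ok]

# A left join keeps every original row; unparsed rows get NA in the new columns
out <- left_join(df, parsed, by = "id")
print(out)
```

Unlike parse_var, this appends the new columns at the end instead of inserting them where the original variable was.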

Related

rbind fromJSON page: duplicate rowname error

I was trying to rbind some json data scraped from api
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
# Generate url for each page
url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',i)
# Get json data from each page and transform it into dataframe
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Setting row.names to NULL doesn't help. I heard from someone that this happens because some of the data are stored as lists, which I don't quite understand.
I know there is an alternative package, WDI, that accesses this data and works well, but I want to know how to resolve the duplicate row names problem in general so that I can deal with similar situations where no alternative package is available.
"I heard from someone it is due to the fact that some data are stored as lists..."
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
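The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is that a single bracket can select multiple elements, whereas [[ selects only one. Applied to a list, [ always returns a list (here, a list of length one that wraps the data frame), while [[ returns the element itself (the data frame).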
You can see how this works with some fake data.
foo <- list(
a = rnorm(100),
b = rnorm(100),
c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select a single data.frame inside of the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
url <- paste0(
'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
i
)
res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2) # Use lapply to select the second element from each list element.
resl <- do.call(rbind, resl) # This takes all the elements of the list and uses those elements as the arguments for rbind.
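If you want to see the difference without hitting the API, here is a small offline sketch. The two fake "pages" are made up for illustration; they only mimic the assumed shape of each response, a list whose second element is a data frame.

```r
# Two fake "pages", each shaped like list(metadata, data frame)
page1 <- list(list(page = 1), data.frame(id = 1:2, value = c(10, 20)))
page2 <- list(list(page = 2), data.frame(id = 3:4, value = c(30, 40)))
pages <- list(page1, page2)

class(page1[2])    # "list": a list of length one that wraps the data frame
class(page1[[2]])  # "data.frame": the element itself

# Pull the data frame out of every page with [[ and stack the results
combined <- do.call(rbind, lapply(pages, `[[`, 2))
print(combined)
```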

jsonlite's fromJSON is returning a list of 2 lists instead of a df

I'm following the FCC's documentation to download some metadata about proceedings.
I don't believe I can post the data but you can get a free API key.
My code results in a list containing 2 nested lists instead of a structured df built from the JSON.
My goal is to have a data frame where each JSON element is its own column, like a normal df.
library(httr)
library(jsonlite)
datahere = "C:/fcc/"
setwd(datahere)
URL <- "https://publicapi.fcc.gov/ecfs/filings?api_key=<KEY HERE>&proceedings.name=14-28&sort=date_disseminated,DESC"
dataDF <- GET(URL)
dataJSON <- content(dataDF, as="text")
dataJSON <- fromJSON(dataJSON)
# NAs
dataJSON2 <- lapply(dataJSON, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
x <- do.call("rbind", dataJSON2)
x <- as.data.frame(x)
The JSON is really deeply nested, so you need to put a little more thought into converting between list and data.frame. The logic below pulls out a data.frame of 25 filings (102 variables) and 10 aggregations (25 variables).
library(plyr) # provides ldply
# tackle the filings object
filings_df <- ldply(dataJSON$filings, function(x) {
# removes null list elements
x[sapply(x, is.null)] <- NA
# converts to a named character vector
unlisted_x <- unlist(x)
# converts the named character vector to data.frame
# with 1 column and rows for each element
d <- as.data.frame(unlisted_x)
# we need to transpose this data.frame because
# the rows should be columns, and don't check names when converting
d <- as.data.frame(t(d), check.names=F)
# now assign the actual names based on that original
# unlisted character vector
colnames(d) <- names(unlisted_x)
# now return to ldply function, which will automatically stack them together
return(d)
})
# tackle the aggregations object
# same exact logic to create the data.frame
aggregations_df <- ldply(dataJSON$aggregations, function(x) {
# removes null list elements
x[sapply(x, is.null)] <- NA
# converts to a named character vector
unlisted_x <- unlist(x)
# converts the named character vector to data.frame
# with 1 column and rows for each element
d <- as.data.frame(unlisted_x)
# we need to transpose this data.frame because
# the rows should be columns, and don't check names when converting
d <- as.data.frame(t(d), check.names=F)
# now assign the actual names based on that original
# unlisted character vector
colnames(d) <- names(unlisted_x)
# now return to ldply function, which will automatically stack them together
return(d)
})
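Since both blocks apply exactly the same logic, it could be wrapped in one helper. The sketch below is only an illustration: flatten_records is a hypothetical name, the toy list stands in for the FCC response (which I can't reproduce here), and it builds the one-row data frame with as.list instead of transposing.

```r
library(plyr)

# Hypothetical helper: flatten each nested record into a one-row data frame
# and let ldply stack the rows, filling missing columns with NA
flatten_records <- function(records) {
  ldply(records, function(x) {
    # replace top-level NULL elements so they are not silently dropped
    x[sapply(x, is.null)] <- NA
    # named character vector, nested names are joined with "."
    unlisted_x <- unlist(x)
    as.data.frame(as.list(unlisted_x), stringsAsFactors = FALSE)
  })
}

# Toy stand-in for a nested API response (made-up fields)
toy <- list(
  list(id = "1", filer = list(name = "A")),
  list(id = "2", filer = list(name = "B"), extra = "z")
)
flat <- flatten_records(toy)
print(flat)
```

With the real data this would be called as flatten_records(dataJSON$filings) and flatten_records(dataJSON$aggregations), assuming those elements are lists of records as in the answer above. Note that as.data.frame may alter non-syntactic column names, which the transpose approach avoids.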

converting a column in json format into a new data frame

I have a CSV file in which one of the columns is in JSON format.
A value in that column looks like this:
{"title":" ","body":" ","url":"thedailygreen print this healthy eating eat safe Dirty Dozen Foods page all"}
I have read this file using read.csv in R. Now, how do I create a new data frame from this column, with title, body and url as the field names?
You can use package RJSONIO to parse the column values, e.g. :
library(RJSONIO)
# create an example data.frame with a json column
cell1 <- '{"title":"A","body":"X","url":"http://url1.x"}'
cell2 <- '{"title":"B","body":"Y","url":"http://url2.y"}'
cell3 <- '{"title":"C","body":"Z","url":"http://url3.z"}'
df <- data.frame(jsoncol = c(cell1,cell2,cell3),stringsAsFactors=F)
# parse json and create a data.frame
res <- do.call(rbind.data.frame,
lapply(df$jsoncol, FUN=function(x){ as.list(fromJSON(x))}))
> res
title body url
A X http://url1.x
B Y http://url2.y
C Z http://url3.z
N.B. :
the code above assumes that all the cells contain title, body and url only. If there can be other properties in the JSON cells, use this code instead:
vals <- lapply(df$jsoncol,fromJSON)
res <- do.call(rbind, lapply(vals,FUN=function(v){ data.frame(title=v['title'],
body =v['body'],
url =v['url']) }))
EDIT (as per comment):
I've read the file using the following code :
df <- read.table(file="c:\\sample.tsv",
header=T, sep="\t", colClasses="character")
then parsed using this code :
# define a simple function to turn NULL to NA
naIfnull <- function(x){if(!is.null(x)) x else NA}
vals <- lapply(df$boilerplate,fromJSON)
res <- do.call(rbind,
lapply(vals,FUN=function(v){ v <- as.list(v)
data.frame(title=naIfnull(v$title),
body =naIfnull(v$body),
url =naIfnull(v$url)) }))
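The same NULL-to-NA idea can be written once for any set of fields. Below is a minimal sketch that swaps RJSONIO for jsonlite (an assumption on my part, not what the code above uses); the fields vector and the toy rows are made up for illustration.

```r
library(jsonlite)

# Toy column: the second cell lacks "body" and carries an extra property
df <- data.frame(jsoncol = c(
  '{"title":"A","body":"X","url":"http://url1.x"}',
  '{"title":"B","url":"http://url2.y","extra":1}'
), stringsAsFactors = FALSE)

fields <- c("title", "body", "url")

rows <- lapply(df$jsoncol, function(x) {
  v <- fromJSON(x)  # jsonlite returns a named list for a JSON object
  # look up each wanted field; absent keys come back as NULL, which becomes NA
  vals <- lapply(fields, function(f) {
    if (is.null(v[[f]])) NA_character_ else as.character(v[[f]])
  })
  as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
})
res <- do.call(rbind, rows)
print(res)
```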

Ragged dataframe in R, jsonlite::fromJSON

I am new to importing .json files for use in R. I'm trying to create a 'long' format dataframe - each row is one participant, each column is one variable. Most of my dataset is compatible after calling fromJSON, but one nested json structure results in a ragged list, with Null, 1, 2, or 3 entries for each participant (in theory there could be more).
Sample:
testdf <- fromJSON("[[\"MMM\",\"AAA\"],null,[\"GGG\",\"CCC\",\"NNN \"],null,null,[\"AAA\",\"NNN \"],null,[\"MMM\",\"AAA\"],null,null,null,null,[\"MMM\",\"AAA\"],[\"CCC\",\"AAA\"],\"NNN \",[\"MMM\",\"NNN \",\"EEE\"],null,null,[\"CCC\",\"MMM\",\"AAA\"],[\"HHH\",\"AAA\"],\"AAA\",[\"MMM\",\"AAA\",\"NNN \"],[\"CCC\",\"AAA\"],[\"MMM\",\"AAA\",\"NNN \"],[\"AAA\",\"NNN \"],[\"MMM\",\"AAA\"],null,null,null,null,null,null]", flatten=TRUE)
How can I transform this list into a 32 x n dataframe which preserves the null values?
Variations on unlist remove the null values; rbind.fill moves entries to the next row, of course - could something like cbind.fill work? (cbind a df with an empty df (cbind.fill?))
Something hidden in plyr?
Thanks for any suggestions.
Fairly straightforward:
t(sapply(testdf, function(x) {
if (is.null(x)) x <- NA_character_
length(x) <- 3
x })
)
If you want to choose the number of columns automatically, then you need to calculate that first:
nc <- max(sapply(testdf, length))
t(sapply(testdf, function(x) {
if (is.null(x)) x <- NA_character_
length(x) <- nc
x })
)
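Note that t(sapply(...)) gives you a character matrix. If you need an actual data frame, one more step does it. The sketch below uses a short hand-built list in the same shape as testdf (vectors of varying length mixed with NULLs); the column names entry1, entry2, ... are an arbitrary choice.

```r
# Hand-built stand-in for the ragged list returned by fromJSON
ragged <- list(c("MMM", "AAA"), NULL, c("GGG", "CCC", "NNN"), "AAA")

# Widest entry decides the number of columns
nc <- max(lengths(ragged))

# Pad every entry (including NULLs) to length nc, one row per entry
mat <- t(sapply(ragged, function(x) {
  if (is.null(x)) x <- NA_character_
  length(x) <- nc
  x
}))

# Convert the character matrix to a data frame with readable column names
out <- as.data.frame(mat, stringsAsFactors = FALSE)
names(out) <- paste0("entry", seq_len(nc))
print(out)
```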

convert json into data frame in R?

I need to convert a JSON file into a data frame. Each record in the JSON file may have a different number of entries. For example:
{"timestamp":"2016-12-13T04:04:06.394-0500",
"test101":"2016-12-13T04:04:06.382-0500",
"error":"false","from":"xon","event":"DAT","BT":"work","cd":"E","id":"IBM",
"key":"20161213040330617511","begin_work":"2016-12-13T04:04:06.383-0500"","#version:"1","#timestamp":"2016-12-14T20:04:29.502Z"}
{"timestamp":"2016-12-13T04:04:05.318-0500","test101":"2016-12-13T04:03:46.074-0500","error":"false","from":"de","event":"cp","BT":"work","cd":"dsh","id":"appl",
"key":"142314089",
"begin_work":"2016-12-13T04:03:46.074-0500",
"refresh":"2016-12-13T04:03:45.920-0500",
"co_refresh":"2016-12-13T04:03:45.769-0500",
"test104":"2016-12-13T04:03:45.832-0500",
"test104":"2016-12-13T04:03:45.832-0500",
"test105":"2016-12-13T04:03:46.031-0500",
"test7":"2016-12-13T04:03:46.032-0500",
"t-test9":"2016-12-13T04:03:45.704-0500",
"test10_StartDateTimeStamp":"2016-12-13T04:03:45.704-0500",
"stop":"2016-12-13T04:03:50.772-0500",
"stop_again":"2016-12-13T04:03:46.091-0500",
"#version":"1","#timestamp":"2016-12-14T20:04:29.503Z"}
{"timestamp":"2016-12-13T04:04:07.113-0500","test101":"2016-12-13T04:04:07.068-0500","error":"false","from":"xon","event":"DAT","BT":"work","cd":"E","id":"3YPS","key":"20161213040318326935","begin_work":"2016-12-13T04:04:07.069-0500","#version":"1","#timestamp":"2016-12-14T20:04:29.505Z"}
I need to start parsing each record from a keyword called "key" and stop at a keyword called "#version".
The data frame needs to look something like this:
key group time
20161213040330617511 begin_work 2016-12-13T04:04:06.383-0500
142314089 begin_work 2016-12-13T04:03:46.074-0500
142314089 refresh 2016-12-13T04:03:45.920-0500
142314089 co_refresh 2016-12-13T04:03:45.769-0500
142314089 test104 2016-12-13T04:03:45.832-0500
etc
I have tried something like this:
library(jsonlite)
library(data.table)
setwd("C:/file/")
filenames <- list.files("system", pattern="*json*", full.names=TRUE)
dflist <- lapply(filenames, function(i) {
jsonlite::fromJSON(
paste0("[",
paste0(readLines(i),collapse=","),
"]"),flatten=TRUE
)
})
d<-rbindlist(dflist, use.names=TRUE, fill=TRUE)
I need to put the key-value pairs into a 3-column data frame.
Instead, I am getting the field names after key as columns and NA as the values. Any ideas how I could convert this JSON to a data frame in R?
This is something you can try, a combination of dplyr and tidyr :
library(dplyr)
library(tidyr)
library(jsonlite)
data <- jsonlite::fromJSON("data.json")
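# one row per record; fields missing from a record become NA
lapply(data, function(d) as_tibble(d)) %>%
bind_rows() %>%
# reshape to long format: one (group, time) pair per remaining field, dropping NAs
gather(group, time, -timestamp, -key, na.rm = TRUE) %>%
select(key, group, time)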
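As an aside, if you only want the fields that sit between "key" and "#version", as the question asks, that can be done by position on the parsed records. This is a rough base R sketch under a few assumptions: each record is a named list in file order, both "key" and "#version" are present, and the two short records below (values taken from the question) stand in for the parsed file.

```r
# Stand-ins for two parsed records, fields in their original order
records <- list(
  list(timestamp = "2016-12-13T04:04:06.394-0500",
       key = "20161213040330617511",
       begin_work = "2016-12-13T04:04:06.383-0500",
       `#version` = "1"),
  list(timestamp = "2016-12-13T04:04:05.318-0500",
       key = "142314089",
       begin_work = "2016-12-13T04:03:46.074-0500",
       refresh = "2016-12-13T04:03:45.920-0500",
       `#version` = "1")
)

rows <- lapply(records, function(r) {
  nm <- names(r)
  first <- match("key", nm) + 1       # first field after "key"
  last <- match("#version", nm) - 1   # last field before "#version"
  if (last < first) return(NULL)      # nothing between the two markers
  idx <- first:last
  data.frame(key = r[["key"]],
             group = nm[idx],
             time = unlist(r[idx], use.names = FALSE),
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, rows)
print(out)
```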
BTW I had to change your json example a little bit.
Here's the json file I use:
{"x":{"timestamp":"2016-12-13T04:04:06.394-0500",
"test101":"2016-12-13T04:04:06.382-0500",
"error":"false","from":"xon","event":"DAT","BT":"work","cd":"E","id":"IBM",
"key":"20161213040330617511","begin_work":"2016-12-13T04:04:06.383-0500","#version":"1","#timestamp":"2016-12-14T20:04:29.502Z"},
"y":{"timestamp":"2016-12-13T04:04:05.318-0500","test101":"2016-12-13T04:03:46.074-0500","error":"false","from":"de","event":"cp","BT":"work","cd":"dsh","id":"appl",
"key":"142314089",
"begin_work":"2016-12-13T04:03:46.074-0500",
"refresh":"2016-12-13T04:03:45.920-0500",
"co_refresh":"2016-12-13T04:03:45.769-0500",
"test104":"2016-12-13T04:03:45.832-0500",
"test105":"2016-12-13T04:03:46.031-0500",
"test7":"2016-12-13T04:03:46.032-0500",
"t-test9":"2016-12-13T04:03:45.704-0500",
"test10_StartDateTimeStamp":"2016-12-13T04:03:45.704-0500",
"stop":"2016-12-13T04:03:50.772-0500",
"stop_again":"2016-12-13T04:03:46.091-0500",
"#version":"1","#timestamp":"2016-12-14T20:04:29.503Z"},
"z":{"timestamp":"2016-12-13T04:04:07.113-0500","test101":"2016-12-13T04:04:07.068-0500","error":"false","from":"xon","event":"DAT","BT":"work","cd":"E","id":"3YPS","key":"20161213040318326935","begin_work":"2016-12-13T04:04:07.069-0500","#version":"1","#timestamp":"2016-12-14T20:04:29.505Z"}}