From list of json files to data.table: partial variable list - json

I have a list of more than 100,000 json files from which I want to get a data.table with only a few variables. Unfortunately the files are complex. The content of each json file looks like:
Sample 1
$id
[1] "10.1"
$title
$title$value
[1] "Why this item"
$itemsource
$itemsource$id
[1] "AA"
$date
[1] "1992-01-01"
$itemType
[1] "art"
$creators
list()
Sample 2
$id
[1] "10.2"
$title
$title$value
[1] "We need this item"
$itemsource
$itemsource$id
[1] "AY"
$date
[1] "1999-01-01"
$itemType
[1] "art"
$creators
type name firstname surname affiliationIds
1 Person Frank W. Cornell. Frank W. Cornell. a1
2 Person David A. Chen. David A. Chen. a1
$affiliations
id name
1 a1 Foreign Affairs Desk, New York Times
What I need from this set of files is a table with creator names, item ids and dates. For the two sample files above:
id date name firstname lastname creatortype
"10.1" "1992-01-01" NA NA NA NA
"10.2" "1999-01-01" Frank W. Cornell. Frank W. Cornell. Person
"10.2" "1999-01-01" David A. Chen. David A. Chen. Person
What I have done so far:
library(parallel)
library(data.table)
library(jsonlite)
library(dplyr)
filelist = list.files(pattern = "*.json", recursive = TRUE, include.dirs = TRUE)
parsed = mclapply(filelist, function(x) fromJSON(x), mc.cores = 24)
data = rbindlist(mclapply(1:length(parsed), function(x) {
  # ignoring the firstname/lastname fields here for convenience
  a = data.table(item = parsed[[x]]$id,
                 date = list(list(parsed[[x]]$date)),
                 name = list(list(parsed[[x]]$name)),
                 creatortype = list(list(parsed[[x]]$creatortype)))
  b = data.table(id = a$item, date = unlist(a$date),
                 name = unlist(a$name), creatortype = unlist(a$creatortype))
  return(b)
}, mc.cores = 24))
However, on the last step, I get this error:
"Error in rbindlist(mclapply(1:length(parsed), function(x){:
Item 1 of list is not a data.frame, data.table or list"
Thanks in advance for your suggestions.
Related questions include:
Extract data from list of lists [R]
R convert json to list to data.table
I want to convert JSON file into data.table in r
How can read files from directory using R?
Convert R data table column from JSON to data table

From the error message, I suppose this basically means that one of the results from mclapply() is empty: either NULL, a data.table with 0 rows, or a try-error produced inside the parallel processing.
What you could do is:
add more checks inside the mclapply() call, e.g. wrap the body in tryCatch() and verify the class of b and nrow(b) before returning it
when you use rbindlist(), add the argument fill = TRUE
Hope this solves your problem.
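As a concrete sketch of those checks: extract_one below is a hypothetical helper (the field names id, date, creators$firstname, creators$surname, creators$type are taken from the sample files; adjust to your real schema). In real use you would pass file paths from filelist, e.g. mclapply(filelist, extract_one, mc.cores = 24); the demo uses inline JSON mirroring the two samples since fromJSON accepts both.

```r
library(data.table)
library(jsonlite)

# return either a well-formed data.table or NULL,
# so rbindlist(fill = TRUE) never sees a malformed element
extract_one <- function(x) {
  p <- tryCatch(fromJSON(x), error = function(e) NULL)
  if (is.null(p)) return(NULL)  # unreadable file: drop it
  cr <- p$creators
  if (NROW(cr) == 0) {
    # item without creators: keep it, with NA creator fields
    return(data.table(id = p$id, date = p$date,
                      firstname = NA_character_, lastname = NA_character_,
                      creatortype = NA_character_))
  }
  data.table(id = p$id, date = p$date,
             firstname = cr$firstname, lastname = cr$surname,
             creatortype = cr$type)
}

# inline stand-ins for the two sample files
samples <- c(
  '{"id":"10.1","date":"1992-01-01","itemType":"art","creators":[]}',
  '{"id":"10.2","date":"1999-01-01","itemType":"art","creators":[
     {"type":"Person","firstname":"Frank W. Cornell.","surname":"Frank W. Cornell."},
     {"type":"Person","firstname":"David A. Chen.","surname":"David A. Chen."}]}'
)
results <- Filter(Negate(is.null), lapply(samples, extract_one))
out <- rbindlist(results, fill = TRUE)
```

Because every element is either NULL (filtered out) or a data.table with the same columns, the rbindlist() step can no longer hit "Item 1 of list is not a data.frame, data.table or list".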

Related

How to parse a weirdly formatted data file?

How to read a weirdly formatted data file?
For example, what if different types of separators (, : |) are all used together?
Looking at a dataframe example, something along these lines:
A monstrous response to the monstrous data. First, split each column containing k:v pairs and convert them to pandas Series. Combine the results for all three "Other" columns into one dataframe:
others = pd.concat(data[x].str.split(':').apply(pd.Series)
                   for x in ('Other1', 'Other2', 'Other3')).dropna(how='all')
# 0 1
#0 Hospital Awesome Hospital
#1 Hobbies Cooking
#2 Hospital Awesome Hospital
#0 Maiden Name Rubin
#1 Hobby Experience 10 years
#2 Maiden Name Simpson
#0 DOB 2015/04/09
#2 DOB 2015/04/16
Do some index manipulations (we want the keys to become column names):
others = others.reset_index().set_index(['index',0]).unstack()
# 1
#0 DOB Hobbies Hobby Experience Hospital Maiden Name
#index
#0 2015/04/09 None None Awesome Hospital Rubin
#1 None Cooking 10 years None None
#2 2015/04/16 None None Awesome Hospital Simpson
Remove the hierarchical column index produced by unstack():
others.columns = others.columns.get_level_values(0)
Put the pieces together again:
pd.concat([data[["Full Name","Town"]], others], axis=1)
parse has a nice interface and might be a good option for pulling out data such as this:
>>> import parse
>>> format_spec='{}: {}'
>>> string='Hobbies: Cooking'
>>> parse.parse(format_spec, string).fixed
('Hobbies', 'Cooking')
Use compile if you will parse the same spec over and over:
>>> other_parser = parse.compile(format_spec)
>>> other_parser.parse(string).fixed
('Hobbies', 'Cooking')
>>> other_parser.parse('Maiden Name: Rubin').fixed
('Maiden Name', 'Rubin')
The fixed property returns the parsed arguments as a tuple. Using these tuples we can just create a bunch of dictionaries, feed them into pd.DataFrame, and merge with the first df:
import parse
import pandas as pd
# slice first two columns from original dataframe
first_df = pd.read_csv(filepath, sep='\t').iloc[:, 0:2]
# make the parser
other_parser = parse.compile('{}: {}')
# parse remaining columns to a new dataframe
with open(filepath) as f:
# a generator of dict objects is fed into DataFrame
# the dict keys are column names
others_df = pd.DataFrame(dict(other_parser.parse(substr).fixed for substr in line.split('\t')[2:]) for line in f)
# merge on the indexes
df = pd.merge(first_df, others_df, left_index=True, right_index=True)

JSON data to dataframe in R

I have a json file from which I am importing the data:
myList = rjson::fromJSON(file = "JsData.json")
myList
[[1]]
[[1]]$key
[1] "type1|new york, ny|NYC|hit"
[[1]]$doc_count
[1] 12
[[2]]
[[2]]$key
[1] "type1|omaha, ne|Omaha|hit"
[[2]]$doc_count
[1] 8
But when I try to convert it to a data frame with the function below,
do.call(rbind, lapply(myList, data.frame))
I get an error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0
I need to parse this data so that it can be used in an Excel CSV. I looked at the solution for Getting imported json data into a data frame in R, but the output does not come out in a format usable in Excel.
And the JsData.json sample data looks like this:
[{"key":"type1|new york, ny|NYC|hit","doc_count":12},
{"key":"type1|omaha, ne|Omaha|hit","doc_count":8},
{"key":"type2|yuba city, ca|Yuba|hit","doc_count":9}]
You can try:
require(jsonlite)
s ='[{"key":"type1|new york, ny|NYC|hit","doc_count":12},
.......
"key":"type2|yuba city, ca|Yuba|hit","doc_count":9}]'
df <- fromJSON(s)
df
key doc_count
1 type1|new york, ny|NYC|hit 12
2 type1|omaha, ne|Omaha|hit 8
3 type2|yuba city, ca|Yuba|hit 9
I don't know how you want to deal with your key, though.
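If the pipe-delimited key does need to be broken out into columns (the part left open above), data.table's tstrsplit() is one way; the column names below are only guesses at what each segment means:

```r
library(jsonlite)
library(data.table)

s <- '[{"key":"type1|new york, ny|NYC|hit","doc_count":12},
       {"key":"type1|omaha, ne|Omaha|hit","doc_count":8},
       {"key":"type2|yuba city, ca|Yuba|hit","doc_count":9}]'

dt <- as.data.table(fromJSON(s))
# split each key into its four pipe-separated parts;
# "type"/"location"/"city"/"status" are assumed labels, not from the data
dt[, c("type", "location", "city", "status") := tstrsplit(key, "|", fixed = TRUE)]
```

The result is then flat enough to write out with fwrite() for Excel.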

R convert json to list to data.table

I have a data.table where one of the columns contains JSON. I am trying to extract the content so that each variable is a column.
library(jsonlite)
library(data.table)
df<-data.table(a=c('{"tag_id":"34","response_id":2}',
'{"tag_id":"4","response_id":1,"other":4}',
'{"tag_id":"34"}'),stringsAsFactors=F)
The desired result, that does not refer to the "other" variable:
tag_id response_id
1 "34" 2
2 "4" 1
3 "34" NA
I have tried several versions of:
parseLog <- function(x) {
  if (is.na(x))
    e = c(tag_id = NA, response_id = NA)
  else {
    j = fromJSON(x)
    e = c(tag_id = as.integer(j$tag_id), response_id = j$response_id)
  }
  e
}
that seems to work well to retrieve a list of vectors (or lists, if c is replaced by list), but when I try to convert the list to a data.table something doesn't work as expected.
parsed<-lapply(df$a,parseLog)
rparsed<-do.call(rbind.data.frame,parsed)
colnames(rparsed)<-c("tag_id","response_id")
This fails because of the missing value in the third row. How can I solve this in a clean, R-ish way? How can I make my parse method return NA for the missing value? Alternatively, is there a parameter like fill in rbind that can be used in rbind.data.frame or an analogous method?
The dataset I am using has 11M rows so performance is important.
Additionally, is there an equivalent of rbind.data.frame that produces a data.table, and how would it be used? When I check the documentation it refers me to rbindlist, but it complains the parameter is not used, and if I call it directly (without do.call) it complains about the type of parsed:
rparsed<-do.call(rbindlist,fill=T,parsed)
EDIT: The case I need to cover is more general, in a set of 11M records all the possible circumstances happen:
df<-data.table(a=c('{"tag_id":"34","response_id":2}',
'{"trash":"34","useless":2}',
'{"tag_id":"4","response_id":1,"other":4}',
NA,
'{"response_id":"34"}',
'{"tag_id":"34"}'),stringsAsFactors=F)
and the output should only contain tag_id and response_id columns.
There might be a simpler way but this seems to be working:
library(data.table)
library(jsonlite)
df[, json := sapply(a, fromJSON)][, rbindlist(lapply(json, data.frame), fill=TRUE)]
#or if you need all the columns :
#df[, json := sapply(a, fromJSON)][,
# c('tag_id', 'response_id') := rbindlist(lapply(json, data.frame), fill=TRUE)]
Output:
> df[, json := sapply(a, fromJSON)][, rbindlist(lapply(json, data.frame), fill=TRUE)]
tag_id response_id
1: 34 2
2: 4 1
3: 34 NA
EDIT:
This solution comes after the edit of the question with additional requests.
There are lots of ways to do this but I find the simplest one is at the creation of the data.frame like this:
df[, json := sapply(a, fromJSON)][,
rbindlist(lapply(json, function(x) data.frame(x)[-3]), fill=TRUE)]
# tag_id response_id
#1: 34 2
#2: 4 1
#3: 34 NA
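An alternative sketch for the general 11M-row case: whitelist the two wanted fields inside the parser so every element has an identical shape, which sidesteps the stray keys and the NA row at once (parse_row is a made-up helper name, and everything is coerced to character since response_id appears both as a number and as a string in the sample):

```r
library(data.table)
library(jsonlite)

df <- data.table(a = c('{"tag_id":"34","response_id":2}',
                       '{"trash":"34","useless":2}',
                       '{"tag_id":"4","response_id":1,"other":4}',
                       NA,
                       '{"response_id":"34"}',
                       '{"tag_id":"34"}'))

wanted <- c("tag_id", "response_id")

# always return both wanted fields, so rbindlist() never sees ragged input
parse_row <- function(x) {
  out <- list(tag_id = NA_character_, response_id = NA_character_)
  if (!is.na(x)) {
    j <- fromJSON(x)
    for (k in wanted) if (!is.null(j[[k]])) out[[k]] <- as.character(j[[k]])
  }
  out
}

res <- rbindlist(lapply(df$a, parse_row))
```

Because the output shape is fixed up front, no fill argument or column dropping is needed afterwards.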

How to parse JSON with fromJSON on a dataframe column?

I have the following data.frame with one column called "json" and two rows of JSON data:
df <- data.frame(json = c('{"client":"ABC Company","totalUSD":7110.0000,"durationDays":731,"familySize":4,"assignmentType":"Long Term","homeLocation":"Australia","hostLocation":"United States","serviceName":"Service ABC","homeLocationGeoLat":-25.274398,"homeLocationGeoLng":133.775136,"hostLocationGeoLat":37.09024,"hostLocationGeoLng":-95.712891}', '{"client":"ABC Company","totalUSD":7110.0000,"durationDays":731,"familySize":4,"assignmentType":"Long Term","homeLocation":"Australia","hostLocation":"United States","serviceName":"Service XYZ","homeLocationGeoLat":-25.274398,"homeLocationGeoLng":133.775136,"hostLocationGeoLat":37.09024,"hostLocationGeoLng":-95.712891}'))
I am trying to parse the JSON into a data.frame using fromJSON from the rjson package.
I cast the column as character type and then attempt to parse:
> df$json <- as.character(df$json)
> final <- fromJSON(json_str = df$json)
However, it only seems to give me the first row of JSON, whereas I expect 2 rows.
How can I parse the JSON into a data.frame from df$json?
You probably want a resultant data frame from this exercise, so:
do.call(rbind.data.frame, lapply(df$json, rjson::fromJSON))
## client totalUSD durationDays familySize assignmentType homeLocation hostLocation serviceName homeLocationGeoLat
## 2 ABC Company 7110 731 4 Long Term Australia United States Service ABC -25.2744
## 21 ABC Company 7110 731 4 Long Term Australia United States Service XYZ -25.2744
## homeLocationGeoLng hostLocationGeoLat hostLocationGeoLng
## 2 133.7751 37.09024 -95.71289
## 21 133.7751 37.09024 -95.71289
The exact same results will come from:
do.call(rbind.data.frame, lapply(df$json, jsonlite::fromJSON))
do.call(rbind.data.frame, lapply(df$json, RJSONIO::fromJSON))
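With jsonlite you can also consume the whole column in one pass: since each element is a single JSON object, pasting them with newlines gives the NDJSON layout that stream_in() expects. A sketch with a trimmed-down two-row stand-in for the sample column (field set abbreviated):

```r
library(jsonlite)

# trimmed-down stand-in for df$json from the question
df <- data.frame(json = c('{"client":"ABC Company","totalUSD":7110.0,"serviceName":"Service ABC"}',
                          '{"client":"ABC Company","totalUSD":7110.0,"serviceName":"Service XYZ"}'),
                 stringsAsFactors = FALSE)

# one object per line is exactly what stream_in() expects
final <- stream_in(textConnection(paste(df$json, collapse = "\n")), verbose = FALSE)
```

This avoids the per-row lapply() entirely and returns one data.frame with a row per JSON object.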

Is it possible to write a table to a file in JSON format in R?

I'm making word frequency tables with R and the preferred output format would be a JSON file, something like
{
"word" : "dog",
"frequency" : 12
}
Is there any way to save the table directly into this format? I've been using the write.csv() function and converting the output into JSON, but this is very complicated and time consuming.
set.seed(1)
( tbl <- table(round(runif(100, 1, 5))) )
## 1 2 3 4 5
## 9 24 30 23 14
library(rjson)
sink("json.txt")
cat(toJSON(tbl))
sink()
file.show("json.txt")
## {"1":9,"2":24,"3":30,"4":23,"5":14}
or even better:
set.seed(1)
( tab <- table(letters[round(runif(100, 1, 26))]) )
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 4 3 2 5 4 3 5 3 9 4 7 2 2 2 5 5 5 6 5 3 7 3 2 1
sink("lets.txt")
cat(toJSON(tab))
sink()
file.show("lets.txt")
## {"a":1,"b":2,"c":4,"d":3,"e":2,"f":5,"g":4,"h":3,"i":5,"j":3,"k":9,"l":4,"m":7,"n":2,"o":2,"p":2,"q":5,"r":5,"s":5,"t":6,"u":5,"v":3,"w":7,"x":3,"y":2,"z":1}
Then validate it with http://www.jsonlint.com/ to get pretty formatting. If you have multidimensional table, you'll have to work it out a bit...
EDIT:
Oh, now I see: you want the dataset characteristics sink-ed to a JSON file. No problem, just give us some sample data and I'll work on the code a bit. Practically, you need to get the data into the desired shape, then convert it to JSON; a list should suffice. Give me a sec, I'll update my answer.
EDIT #2:
Well, time is relative... it's common knowledge... Here you go:
( dtf <- structure(list(word = structure(1:3, .Label = c("cat", "dog",
"mouse"), class = "factor"), frequency = c(12, 32, 18)), .Names = c("word",
"frequency"), row.names = c(NA, -3L), class = "data.frame") )
## word frequency
## 1 cat 12
## 2 dog 32
## 3 mouse 18
If dtf is a simple data.frame, fine; if it's not, coerce it! Long story short, you can do:
toJSON(as.data.frame(t(dtf)))
## [1] "{\"V1\":{\"word\":\"cat\",\"frequency\":\"12\"},\"V2\":{\"word\":\"dog\",\"frequency\":\"32\"},\"V3\":{\"word\":\"mouse\",\"frequency\":\"18\"}}"
I thought I'd need melt for this one, but a simple t did the trick. Now you only need to deal with the column names after transposing the data.frame. t coerces data.frames to a matrix, so you need to convert it back to a data.frame. I used as.data.frame, but you can also use toJSON(data.frame(t(dtf))) - you'll get X instead of V as the variable name. Alternatively, you can use a regexp to clean the JSON file (if needed), but that's lousy practice; try to work it out by preparing the data.frame.
I hope this helped a bit...
These days I would typically use the jsonlite package.
library("jsonlite")
toJSON(mydatatable, pretty = TRUE)
This turns the data table into a JSON array of key/value pair objects directly.
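A minimal sketch of that, using a made-up word/frequency table in the shape the question describes:

```r
library(jsonlite)

# hypothetical frequency table
freq <- data.frame(word = c("dog", "cat"), frequency = c(12, 7),
                   stringsAsFactors = FALSE)

# a data.frame serializes row-wise, i.e. to an array of
# {"word": ..., "frequency": ...} objects
json <- toJSON(freq, pretty = TRUE)
writeLines(json, "word_freq.json")
```

No transposing or column-name cleanup is needed, in contrast to the rjson approach above.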
RJSONIO is a package "that allows conversion to and from data in Javascript object notation (JSON) format". You can use it to export your object as a JSON file.
library(RJSONIO)
writeLines(toJSON(anobject), "afile.JSON")