Parse nested JSON to Data Frame in R

I'm having trouble with a very nasty nested JSON.
The format is like this:
{
"matches": [
{
"matchId": 1,
"region": "BR",
"participants": [
{
"participantId": 0,
"teamId": 200,
"stats": {
"winner": true,
"champLevel": 16,
"item0": 3128,
}
{
"matchId": 2,
"region": "BR",
"participants": [
{
"participantId": 0,
"teamId": 201,
"stats": {
"winner": false,
"champLevel": 18,
"item0": 3128,
"item1": 3157,
"item1": 3158,
}
As you can see, in the second match the number of items increased, but in the data frame both rows get the same columns:
MatchId region ... stats.winner stats.champLevel stats.item0 stats.item1 stats.item2
1 BR TRUE 16 3128 1 BR
2 BR FALSE 18 3128 3157 3158
See, the first row is shorter than the second, so R recycles its values to fill the extra columns ....
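To see why, here is a minimal sketch of the recycling behaviour, using made-up vectors shaped like the two matches above:
# rbind() recycles the shorter vector instead of padding it with NA
do.call("rbind", list(c(1, "BR", TRUE, 16, 3128),
                      c(2, "BR", FALSE, 18, 3128, 3157, 3158)))
# The first row ends up padded with "1", "BR" (its own values recycled),
# which is exactly the corruption shown in the table above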
If you want the full data you can grab it at:
http://pastebin.com/HQDf2ase
How I parsed the JSON into a data.frame:
json.matchData <- fromJSON(file = "file.json")
# Unlist the elements of the JSON and convert them to a data frame
matchData.i <- lapply(json.matchData$matches, function(x) unlist(x))
# Transform into a data frame
matchData <- do.call("rbind", matchData.i)
matchData <- as.data.frame(matchData)
But the dataframe is messed up, because some fields should be NA but they are filled with wrong values.

I think using the plyr rbind.fill() function would be helpful here. How about this:
library(plyr)
matchData <- rbind.fill(lapply(matchData.i,
function(x) do.call("data.frame", as.list(x))
))
The lapply() bit turns the intermediate lists into data.frames, which rbind.fill() requires.
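Putting the pieces together, a minimal end-to-end sketch - assuming the JSON is read with the rjson package (as the fromJSON(file = ...) call above suggests) and that file.json has the structure shown in the question:
library(rjson)
library(plyr)

json.matchData <- fromJSON(file = "file.json")
matchData.i <- lapply(json.matchData$matches, unlist)
matchData <- rbind.fill(lapply(matchData.i,
                               function(x) do.call("data.frame", as.list(x))))
# Columns missing for a given match are now NA instead of recycled values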

Related

Exclude column header when writing DataFrame to json

I have the following dataframe df1
SomeJson
=================
[{
"Number": "1234",
"Color": "blue",
"size": "Medium"
}, {
"Number": "2222",
"Color": "red",
"size": "Small"
}
]
and I am trying to write just the contents of this column to blob storage as a json.
df1.select("SomeJson")
.write
.option("header", false)
.mode("append")
.json(blobStorageOutput)
This code works, but it creates the following JSON in blob storage:
{
"SomeJson": [{
"Number": "1234",
"Color": "blue",
"size": "Medium"
}, {
"Number": "2222",
"Color": "red",
"size": "Small"
}
]
}
But I just want the contents of the column, not the column header as well; I don't want the "SomeJson" key in my final JSON. Any suggestions?
If you don't want the dataframe column name to be included, write your dataframe as text rather than as JSON. That will write only the content of your column.
df1.select("SomeJson")
.write
.option("header", false)
.mode("append")
.text(blobStorageOutput)
An additional scenario for this question: when the JSON structure itself is derived from the dataset and we then run into the same header problem, we can follow the approach below.
spark.sql("SELECT COLLECT_SET(STRUCT(<field_name>)) AS `` FROM <table_name> LIMIT 1").coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(<Blob Path1/ ADLS Path1>)
The output will look like:
{"":[{<field_name>:<field_value>}]}
The header can then be avoided with the following 3 lines (assuming no tilde character appears in the data):
jsonToCsvDF=spark.read.format("com.databricks.spark.csv").option("delimiter", "~").load(<Blob Path1/ ADLS Path1>)
jsonToCsvDF.createOrReplaceTempView("json_to_csv")
spark.sql("SELECT SUBSTR(`_c0`,5,length(`_c0`)-5) FROM json_to_csv").coalesce(1).write.option("header",false).mode("overwrite").text(<Blob Path2/ ADLS Path2>)

Combine json files and preserve format in R

I have two json files in the exact format as below. My goal is to combine them and keep the exact same format - just basically stack one on top of the other.
I have tried the following, but it does not correctly combine the files and preserve the format, because each file's contents remain bracketed with [ ] separately. How does one combine them and keep only one pair of brackets around the entire file?
files <- c("test.json","test2.json")
jsonl <- lapply(files, function(f) fromJSON(file = f))
jsonc <- toJSON(jsonl)
write(jsonc, file = "two.json")
Are there any better solutions in R?
test.json:
[
{
"vendor": 0,
"startTime": 4380,
"endTime": 4445
},
{
"vendor": 0,
"startTime": 4448,
"endTime": 4453
},
{
"vendor": 0,
"startTime": 4696,
"endTime": 4880
}
]
undesired output:
[
[
{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880}],
[{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880}
]
]
desired output:
[
{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880},
{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880}
]
You can combine them with rbind() before writing:
files <- c("test.json","test2.json")
jsonl <- do.call("rbind", lapply(files, function(f) fromJSON(f)))
write(toJSON(jsonl), file = "two.json")
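Note that this assumes the jsonlite package, whose fromJSON() simplifies an array of flat objects into a data frame; the fromJSON(file = f) call in the question looks like the rjson package, which returns nested lists instead. A minimal sketch under the jsonlite assumption:
library(jsonlite)

files <- c("test.json", "test2.json")
jsonl <- do.call("rbind", lapply(files, fromJSON))  # one data frame per file
write(toJSON(jsonl, pretty = TRUE), file = "two.json")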

Python pandas: convert CSV to nested JSON up to level "n"?

I want to convert a CSV file into nested JSON, up to n levels. I am using the code below from this link to get the desired output, but I am getting an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-109-89d84e9a61bf> in <module>()
40 # make a list of keys
41 keys_list = []
---> 42 for item in d['children']:
43 keys_list.append(item['name'])
44
TypeError: 'datetime.date' object has no attribute '__getitem__'
Below is the code:
# CSV 2 flare.json
# convert a csv file to flare.json for use with many D3.js viz's
# This script outputs a flare.json file with 2 levels of nesting.
# For additional nested layers, add them in lines 32 - 47
# sample: http://bl.ocks.org/mbostock/1283663
# author: Andrew Heekin
# MIT License
import pandas as pd
import json
df = pd.read_csv('file.csv')
# choose columns to keep, in the desired nested json hierarchical order
df = df[['Group','year','quarter']]
# order in the groupby here matters, it determines the json nesting
# the groupby call makes a pandas series by grouping 'the_parent' and 'the_child', while summing the numerical column 'child_size'
df1 = df.groupby(['Group','year','quarter'])['quarter'].count()
df1 = df1.reset_index(name = "count")
#print df1.head()
# start a new flare.json document
flare = dict()
flare = {"name":"flare", "children": []}
#df1['year'] = [str(yr) for yr in df1['year']]
for line in df1.values:
    the_parent = line[0]
    the_child = line[1]
    child_size = line[2]

    # make a list of keys
    keys_list = []
    for item in d['children']:
        keys_list.append(item['name'])

    # if 'the_parent' is NOT a key in the flare.json yet, append it
    if not the_parent in keys_list:
        d['children'].append({"name":the_parent, "children":[{"name":the_child, "size":child_size}]})
    # if 'the_parent' IS a key in the flare.json, add a new child to it
    else:
        d['children'][keys_list.index(the_parent)]['children'].append({"name":the_child, "size":child_size})

flare = d

# export the final result to a json file
with open('flare.json', 'w') as outfile:
    json.dump(flare, outfile)
Expected output in the format below:
{
"name": "stock",
"children": [
{"name": "fruits",
"children": [
{"name": "berries",
"children": [
{"count": 20, "name": "blueberry"},
{"count": 70, "name": "cranberry"},
{"count": 96, "name": "raspberry"},
{"count": 140, "name": "strawberry"}]
},
{"name": "citrus",
"children": [
{"count": 20, "name": "grapefruit"},
{"count": 120, "name": "lemon"},
{"count": 50, "name": "orange"}]
},
{"name": "dried fruit",
"children": [
{"count": 25, "name": "dates"},
{"count": 10, "name": "raisins"}]
}]
},
{"name": "vegtables",
"children": [
{"name": "green leaf",
"children": [
{"count": 19, "name": "cress"},
{"count": 18, "name": "spinach"}]
},
{
"name": "legumes",
"children": [
{"count": 27, "name": "beans"},
{"count": 12, "name": "chickpea"}]
}]
}]
}
Could anyone please help me resolve this error?
Thanks

Transform sequence of data into JSON for D3.js visualization

I have data that shows a series of actions (column Action) performed by several users (column Id). The order of the data frame is important - it is the order the actions were performed in. For each Id, the first action performed is start. Consecutive identical actions are possible (for example, the sequence start -> D -> D -> D is valid). This is some code to generate the data:
set.seed(10)
i <- 0
all_id <- NULL
all_vals <- NULL
while (i < 5) {
i <- i + 1
print(i)
size <- sample(3:5, size = 1)
tmp_id <- rep(i, times = size + 1)
tmp_vals <- c("start",sample(LETTERS, size = size) )
all_id <- c(all_id, tmp_id)
all_vals <- c(all_vals, tmp_vals)
}
df <- data.frame(Id = all_id,
Action = all_vals)
Goal - transform this data into a JSON nested on multiple levels that will be used in a D3.js visualization (like this). I would like to see a counter for how many times each child appears for its respective parent (and maybe even a percentage out of the total appearances of the parent) - but I hope I can do that part myself.
Expected output below - this is generic, not from the data I generated above, and real data will have quite a lot of nested values (count and percentage are optional at this point):
{
"action": "start",
"parent": "null",
"count": "10",
"percentage": "100",
"children": [
{
"action": "H",
"parent": "start",
"count": "6",
"percentage": "60",
"children": [
{
"action": "D",
"parent": "H",
"count": "5",
"percentage": "83.3"
},
{
"action": "B",
"parent": "H",
"count": "3",
"percentage": "50"
}
]
},
{
"action": "R",
"parent": "start",
"count": "4",
"percentage": "40"
}
]
}
I know I am supposed to post something I've tried, but I really don't have anything remotely worth showing.
I have just started writing some R -> d3.js converters in https://github.com/timelyportfolio/d3r that should work well in these types of situations. I will work up an example later today with your data.
The internal hierarchy builder in https://github.com/timelyportfolio/sunburstR also might work well here.
I'll add to the answer as I explore both of these paths.
example 1
set.seed(10)
i <- 0
all_id <- NULL
all_vals <- NULL
while (i < 5) {
i <- i + 1
print(i)
size <- sample(3:5, size = 1)
tmp_id <- rep(i, times = size + 1)
tmp_vals <- c("start",sample(LETTERS, size = size) )
all_id <- c(all_id, tmp_id)
all_vals <- c(all_vals, tmp_vals)
}
df <- data.frame(Id = all_id,
Action = all_vals)
# not sure I completely understand what this is
# supposed to become but here is a first try
# find position of start
start_pos <- which(df$Action=="start")
# get the sequences
# surely there is a better way but do this for now
sequences <- paste(
start_pos+1,
c(start_pos[-1],nrow(df))-1,
sep=":"
)
paths <- lapply(
sequences,
function(x){
data.frame(
t(as.character(df[eval(parse(text=x)),]$Action)),
stringsAsFactors=FALSE
)
}
)
paths_df <- dplyr::bind_rows(paths)
# use d3r
# devtools::install_github("timelyportfolio/d3r")
library(d3r)
d3_nest(paths_df) # if want list, then json=FALSE
# visualize with listviewer
# devtools::install_github("timelyportfolio/listviewer")
listviewer::jsonedit(d3_nest(paths_df))
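As a hedged sketch of the second path mentioned above (the sunburstR hierarchy builder, not worked out in the original answer): sunburst() accepts a two-column data frame of dash-delimited sequences and counts, which can be built straight from df.
# install.packages("sunburstR") or
# devtools::install_github("timelyportfolio/sunburstR")
library(dplyr)
library(sunburstR)

# one dash-delimited sequence per Id, then count identical sequences
sequences_df <- df %>%
  group_by(Id) %>%
  summarise(sequence = paste(Action, collapse = "-")) %>%
  count(sequence)

sunburst(sequences_df)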

Make deeply nested JSON from data frame in R

I'm looking to take a nice tidy data frame and turn it into a deeply nested JSON using R. So far I haven't been able to find any other resources that directly address this task - most seem to be trying to take it in the other direction (un-nesting a JSON).
Here's a small dummy version of the data frame I'm starting with. Imagine a survey was given to two audiences within a company, one for managers and a separate one for employees. The surveys have different sets of questions with different IDs but many questions overlap and I want to compare the responses between the two groups. The end goal is to make a JSON that matches up section IDs, question IDs, and option IDs/text from two surveys in the correct hierarchy. Some questions have subquestions that require a further level of nesting, which is what I’m having difficulty doing.
library(dplyr)
library(tidyr)
library(jsonlite)
dummyDF <- data_frame(sectionId = c(rep(1,9),rep(2,3)),
questionId = c(rep(1,3),rep(2,6),rep(3,3)),
subquestionId = c(rep(NA,3),rep("2a",3),rep("2b",3),rep(NA,3)),
deptManagerQId = c(rep("m1",3),rep("m2",3),rep("m3",3),rep("m4",3)),
deptEmployeeQId = c(rep("e1",3),rep("e3",3),rep("e4",3),rep("e7",3)),
optionId = rep(c(1,2,3),4),
text = rep(c("yes","neutral","no"),4))
And here’s the end result I’m trying to achieve:
theGoal <- fromJSON('{
"sections": [
{
"sectionId": "1",
"questions": [
{
"questionId": "1",
"deptManagerQId": "m1",
"deptEmployeeQId": "e1",
"options": [
{
"optionId": 1,
"text": "yes"
},
{
"optionId": 2,
"text": "neutral"
},
{
"optionId": 3,
"text": "no"
}
]
},
{
"questionId": "2",
"options": [
{
"optionId": 1,
"text": "yes"
},
{
"optionId": 2,
"text": "neutral"
},
{
"optionId": 3,
"text": "no"
}
],
"subquestions": [
{
"subquestionId": "2a",
"deptManagerQId": "m2",
"deptEmployeeQId": "e3"
},
{
"subquestionId": "2b",
"deptManagerQId": "m3",
"deptEmployeeQId": "e4"
}
]
},
{
"questionId": "3",
"deptManagerQId": "m4",
"deptEmployeeQId": "e7",
"options": [
{
"optionId": 1,
"text": "yes"
},
{
"optionId": 2,
"text": "neutral"
},
{
"optionId": 3,
"text": "no"
}
]
}
]
}
]
}')
Here are a few approaches I’ve tried using nest from tidyr that end up either only getting me part of the way there or throwing an error message.
1
list1 <- dummyDF %>% nest(-sectionId, .key=questions) %>%
mutate(questions = lapply(seq_along(.$questions), function(x) nest(.$questions[[x]], optionId, text, .key = options))) %>%
list(sections = .)
2
nested1 <- dummyDF %>% nest(-sectionId, .key=questions) %>%
mutate(questions = lapply(seq_along(.$questions), function(x) nest(.$questions[[x]], optionId, text, .key = options)))
nested2 <- nested1 %>% mutate(questions = lapply(seq_along(.$questions), function(x) nest(.$questions[[x]], subquestionId, .key = subquestions)))
#Gives this error: cannot group column options, of class 'list'
3
list2 <- dummyDF %>% nest(-sectionId, .key=questions) %>%
mutate(questions = lapply(seq_along(.$questions),
function(x) {ifelse(is.na(.$questions[[x]]$subquestionId),
function(x) {.$questions[[x]] %>% select(-subquestionId) %>% nest(optionId, text, .key = options)},
function(x) {.$questions[[x]] %>% nest(subquestion_id, .key = subquestions)})})) %>%
list(sections = .)
#Gives this error: attempt to replicate an object of type 'closure'
Any ideas would be greatly appreciated. I’m open to any approaches. I took the issue to a local R user group meet-up but wasn’t able to come up with any solutions so I’ve got my fingers crossed here. I realize R might not be the best tool to accomplish this but it’s the one I know so I’m giving it a shot. Thanks.
jsonlite::toJSON looks like a nice solution to your problem.
It works seamlessly up to column types and column order (which I correct below to illustrate that the objects are identical). If you need any other type of JSON structure, I would recommend restructuring the data_frame on the front end first, using something like dplyr or tidyr.
library(jsonlite)
library(dplyr)
dummyDF <- data_frame(sectionId = c(rep(1,9),rep(2,3)),
questionId = c(rep(1,3),rep(2,6),rep(3,3)),
subquestionId = c(rep(NA,3),rep("2a",3),rep("2b",3),rep(NA,3)),
deptManagerQId = c(rep("m1",3),rep("m2",3),rep("m3",3),rep("m4",3)),
deptEmployeeQId = c(rep("e1",3),rep("e3",3),rep("e4",3),rep("e7",3)),
optionId = rep(c(1,2,3),4),
text = rep(c("yes","neutral","no"),4))
## Convert to a JSON object
json <- jsonlite::toJSON(dummyDF)
theGoal <- fromJSON(json) %>% tbl_df() %>% select_(.dots=names(dummyDF)) %>%
## Convert integer columns to numeric
mutate_if(function(x) {if (typeof(x)=='integer') {TRUE} else {FALSE}},as.numeric)
## Compare the objects
all.equal(theGoal,dummyDF)
# TRUE
identical(theGoal,dummyDF)
# TRUE
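If the fully nested shape in theGoal is needed (rather than the flat round trip above), here is a hedged sketch that gets most of the way there - assuming tidyr >= 1.0, where nest() can pack list-columns; the subquestions level would still need separate handling:
library(dplyr)
library(tidyr)
library(jsonlite)

nested <- dummyDF %>%
  nest(options = c(optionId, text)) %>%
  nest(questions = c(questionId, subquestionId,
                     deptManagerQId, deptEmployeeQId, options)) %>%
  list(sections = .)

toJSON(nested, pretty = TRUE, auto_unbox = TRUE)
# Produces sections -> questions -> options; subquestions remain flat fields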