I have data that shows a series of actions (column Action) performed by several users (column Id). The order of the data frame is important - it is the order the actions were performed in. For each id, the first action performed is start. Consecutive identical actions are possible (for example, the sequence start -> D -> D -> D is valid). This is some code to generate data:
set.seed(10)
i <- 0
all_id <- NULL
all_vals <- NULL
while (i < 5) {
i <- i + 1
print(i)
size <- sample(3:5, size = 1)
tmp_id <- rep(i, times = size + 1)
tmp_vals <- c("start",sample(LETTERS, size = size) )
all_id <- c(all_id, tmp_id)
all_vals <- c(all_vals, tmp_vals)
}
df <- data.frame(Id = all_id,
Action = all_vals)
Goal - transform this data into a JSON nested on multiple levels that will be used in a D3.js visualization. I would like to see a counter for how many times each child appears for its respective parent (and maybe even a percentage out of the total appearances of the parent) - but I hope I can do that myself.
Expected output below - this is generic, not from the data I generated above, and real data will have quite a lot of nested values ( count and percentage are optional at this point in time):
{
"action": "start",
"parent": "null",
"count": "10",
"percentage": "100",
"children": [
{
"action": "H",
"parent": "start",
"count": "6",
"percentage": "60",
"children": [
{
"action": "D",
"parent": "H",
"count": "5",
"percentage": "83.3"
},
{
"action": "B",
"parent": "H",
"count": "3",
"percentage": "50"
}
]
},
{
"action": "R",
"parent": "start",
"count": "4",
"percentage": "40"
}
]
}
I know I am supposed to post something I've tried, but I really don't have anything remotely worthy of being shown.
I have just started writing some R -> d3.js converters in https://github.com/timelyportfolio/d3r that should work well in these types of situations. I will work up an example later today with your data.
The internal hierarchy builder in https://github.com/timelyportfolio/sunburstR also might work well here.
I'll add to the answer as I explore both of these paths.
example 1
set.seed(10)
i <- 0
all_id <- NULL
all_vals <- NULL
while (i < 5) {
i <- i + 1
print(i)
size <- sample(3:5, size = 1)
tmp_id <- rep(i, times = size + 1)
tmp_vals <- c("start",sample(LETTERS, size = size) )
all_id <- c(all_id, tmp_id)
all_vals <- c(all_vals, tmp_vals)
}
df <- data.frame(Id = all_id,
Action = all_vals)
# not sure I completely understand what this is
# supposed to become but here is a first try
# find position of start
start_pos <- which(df$Action=="start")
# get the sequences
# surely there is a better way but do this for now
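sequences <- paste(
start_pos+1,
# each sequence ends one row before the next start;
# the last sequence runs to the final row of df
c(start_pos[-1]-1,nrow(df)),
sep=":"
)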
paths <- lapply(
sequences,
function(x){
data.frame(
t(as.character(df[eval(parse(text=x)),]$Action)),
stringsAsFactors=FALSE
)
}
)
paths_df <- dplyr::bind_rows(paths)
# use d3r
# devtools::install_github("timelyportfolio/d3r")
library(d3r)
d3_nest(paths_df) # if want list, then json=FALSE
# visualize with listviewer
# devtools::install_github("timelyportfolio/listviewer")
listviewer::jsonedit(d3_nest(paths_df))
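The count and percentage fields from the question can also be sketched in base R without d3r. This is a minimal sketch, not the d3r implementation: `build_tree` is a hypothetical helper, the sample paths are made up, and it assumes each path is a character vector beginning with "start" (for the question's data, `split(as.character(df$Action), df$Id)` would produce such a list). Here count is the number of paths passing through a node, and percentage is that count relative to the parent's count.

```r
# Hypothetical helper: recursively group paths by their next action.
build_tree <- function(paths, action = "start", parent = "null",
                       parent_count = length(paths)) {
  node <- list(
    action = action,
    parent = parent,
    count = length(paths),
    percentage = round(100 * length(paths) / parent_count, 1)
  )
  # Drop the action just consumed and keep only paths that continue
  rest <- lapply(paths, function(p) p[-1])
  rest <- rest[lengths(rest) > 0]
  if (length(rest) > 0) {
    next_action <- vapply(rest, function(p) p[1], character(1))
    groups <- split(rest, next_action)
    # One child per distinct next action (unnamed list, so it maps to a JSON array)
    node$children <- lapply(names(groups), function(a) {
      build_tree(groups[[a]], action = a, parent = action,
                 parent_count = length(paths))
    })
  }
  node
}

# Made-up paths for illustration
paths <- list(
  c("start", "H", "D"),
  c("start", "H", "D"),
  c("start", "H", "B"),
  c("start", "R")
)
tree <- build_tree(paths)
```

The resulting nested list can then be serialized, for example with jsonlite::toJSON(tree, auto_unbox = TRUE, pretty = TRUE).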
Related
I have two json files in the exact format as below. My goal is to combine them and keep the exact same format - just basically stack one on top of the other.
I have tried the following but this does not correctly combine both files and preserve the format as both files are bracketed with [ ] separately. How does one combine and keep only one pair of brackets around the entire file?
files <- c("test.json","test2.json")
jsonl <- lapply(files, function(f) fromJSON(file = f))
jsonc <- toJSON(jsonl)
write(jsonc, file = "two.json")
Are there any better solutions in R?
test.json:
[
{
"vendor": 0,
"startTime": 4380,
"endTime": 4445
},
{
"vendor": 0,
"startTime": 4448,
"endTime": 4453
},
{
"vendor": 0,
"startTime": 4696,
"endTime": 4880
}
]
undesired output:
[
[
{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880}],
[{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880}
]
]
desired output:
[
{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880},
{"vendor":0,"startTime":4380,"endTime":4445},
{"vendor":0,"startTime":4448,"endTime":4453},
{"vendor":0,"startTime":4696,"endTime":4880}
]
You can join them before writing using rbind:
library(jsonlite) # fromJSON(path) returns a data frame for an array of objects
files <- c("test.json","test2.json")
jsonl <- do.call("rbind", lapply(files, function(f) fromJSON(f)))
write(toJSON(jsonl), file = "two.json")
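To try the rbind approach without the original files, here is a self-contained sketch. It assumes the jsonlite package is installed; the temporary files and their one-row contents are stand-ins for test.json and test2.json.

```r
library(jsonlite)

# Two small stand-in files (hypothetical contents mirroring test.json)
f1 <- tempfile(fileext = ".json")
f2 <- tempfile(fileext = ".json")
writeLines('[{"vendor":0,"startTime":4380,"endTime":4445}]', f1)
writeLines('[{"vendor":0,"startTime":4448,"endTime":4453}]', f2)

# Each file parses to a data frame; rbind stacks the rows
combined <- do.call(rbind, lapply(c(f1, f2), fromJSON))

# A data frame serializes as a single array of objects,
# so the output has only one pair of outer brackets
out <- tempfile(fileext = ".json")
write(toJSON(combined), file = out)
```

Because the rows are stacked before serializing, the nested [[...],[...]] shape from the question does not appear. Note that rbind needs every file to have the same set of fields.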
I have a .txt file with this structure
section1#[{"p": "0.999834", "tag": "MA"},{"p": "1", "tag": "MO"},...etc...}]
section1#[{"p": "0.9995", "tag": "NC"},{"p": "1", "tag": "FL"},...etc...}]
...
section2#[{"p": "0.9995", "tag": "NC"},{"p": "1", "tag": "FL"},...etc...}]
I am trying to read it by using R with the commands
library(jsonlite)
data <- fromJSON("myfile.txt")
But I get this
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
lexical error: invalid char in json text.
section2#[{"p": "0.99
(right here) ------^
How can I read it even by splitting by sections?
Remove the prefix and bind the flattened JSON arrays together into a data frame:
raw_dat <- readLines(textConnection('section1#[{"p": "0.999834", "tag": "MA"},{"p": "1", "tag": "MO"}]
section1#[{"p": "0.9995", "tag": "NC"},{"p": "1", "tag": "FL"}]
section2#[{"p": "0.9995", "tag": "NC"},{"p": "1", "tag": "FL"}]'))
library(stringi)
library(purrr)
library(jsonlite)
stri_replace_first_regex(raw_dat, "^section[[:digit:]]+#", "") %>%
map_df(fromJSON)
## p tag
## 1 0.999834 MA
## 2 1 MO
## 3 0.9995 NC
## 4 1 FL
## 5 0.9995 NC
## 6 1 FL
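If the section label needs to survive (the question asks about splitting by sections), a variation along these lines may help. This is a sketch assuming jsonlite is installed and that the first "#" on each line separates the label from the JSON array; the sample lines are shortened from the question.

```r
library(jsonlite)

raw_dat <- c(
  'section1#[{"p": "0.999834", "tag": "MA"},{"p": "1", "tag": "MO"}]',
  'section2#[{"p": "0.9995", "tag": "NC"},{"p": "1", "tag": "FL"}]'
)

# Split each line at the first "#": label before it, JSON array after it
section <- sub("#.*$", "", raw_dat)
payload <- sub("^[^#]*#", "", raw_dat)

# Parse each array and tag its rows with the section they came from
parsed <- Map(function(s, j) {
  d <- fromJSON(j)
  d$section <- s
  d
}, section, payload)

result <- do.call(rbind, unname(parsed))
rownames(result) <- NULL
```

From there, split(result, result$section) gives one data frame per section.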
Remove the section# prefix from each line. Each line is then a JSON array of objects, so the .txt as a whole can be treated as a 2D array.
You can access elements as foo[0][0] for the first object of the first line, up to foo[m][n], where m is the number of section lines minus 1 and n is the number of objects on that line minus 1.
I'm looking to take a nice tidy data frame and turn it into a deeply nested JSON using R. So far I haven't been able to find any other resources that directly address this task - most seem to be trying to take it in the other direction (un-nesting a JSON).
Here's a small dummy version of the data frame I'm starting with. Imagine a survey was given to two audiences within a company, one for managers and a separate one for employees. The surveys have different sets of questions with different IDs but many questions overlap and I want to compare the responses between the two groups. The end goal is to make a JSON that matches up section IDs, question IDs, and option IDs/text from two surveys in the correct hierarchy. Some questions have subquestions that require a further level of nesting, which is what I’m having difficulty doing.
library(dplyr)
library(tidyr)
library(jsonlite)
dummyDF <- data_frame(sectionId = c(rep(1,9),rep(2,3)),
questionId = c(rep(1,3),rep(2,6),rep(3,3)),
subquestionId = c(rep(NA,3),rep("2a",3),rep("2b",3),rep(NA,3)),
deptManagerQId = c(rep("m1",3),rep("m2",3),rep("m3",3),rep("m4",3)),
deptEmployeeQId = c(rep("e1",3),rep("e3",3),rep("e4",3),rep("e7",3)),
optionId = rep(c(1,2,3),4),
text = rep(c("yes","neutral","no"),4))
And here’s the end result I’m trying to achieve:
theGoal <- fromJSON('{
"sections": [
{
"sectionId": "1",
"questions": [
{
"questionId": "1",
"deptManagerQId": "m1",
"deptEmployeeQId": "e1",
"options": [
{
"optionId": 1,
"text": "yes"
},
{
"optionId": 2,
"text": "neutral"
},
{
"optionId": 3,
"text": "no"
}
]
},
{
"questionId": "2",
"options": [
{
"optionId": 1,
"text": "yes"
},
{
"optionId": 2,
"text": "neutral"
},
{
"optionId": 3,
"text": "no"
}
],
"subquestions": [
{
"subquestionId": "2a",
"deptManagerQId": "m2",
"deptEmployeeQId": "e3"
},
{
"subquestionId": "2b",
"deptManagerQId": "m3",
"deptEmployeeQId": "e4"
}
]
},
{
"questionId": "3",
"deptManagerQId": "m4",
"deptEmployeeQId": "e7",
"options": [
{
"optionId": 1,
"text": "yes"
},
{
"optionId": 2,
"text": "neutral"
},
{
"optionId": 3,
"text": "no"
}
]
}
]
}
]
}')
Here are a few approaches I’ve tried using nest from tidyr that end up either only getting me part of the way there or throwing an error message.
1
list1 <- dummyDF %>% nest(-sectionId, .key=questions) %>%
mutate(questions = lapply(seq_along(.$questions), function(x) nest(.$questions[[x]], optionId, text, .key = options))) %>%
list(sections = .)
2
nested1 <- dummyDF %>% nest(-sectionId, .key=questions) %>%
mutate(questions = lapply(seq_along(.$questions), function(x) nest(.$questions[[x]], optionId, text, .key = options)))
nested2 <- nested1 %>% mutate(questions = lapply(seq_along(.$questions), function(x) nest(.$questions[[x]], subquestionId, .key = subquestions)))
#Gives this error: cannot group column options, of class 'list'
3
list2 <- dummyDF %>% nest(-sectionId, .key=questions) %>%
mutate(questions = lapply(seq_along(.$questions),
function(x) {ifelse(is.na(.$questions[[x]]$subquestionId),
function(x) {.$questions[[x]] %>% select(-subquestionId) %>% nest(optionId, text, .key = options)},
function(x) {.$questions[[x]] %>% nest(subquestion_id, .key = subquestions)})})) %>%
list(sections = .)
#Gives this error: attempt to replicate an object of type 'closure'
Any ideas would be greatly appreciated. I’m open to any approaches. I took the issue to a local R user group meet-up but wasn’t able to come up with any solutions so I’ve got my fingers crossed here. I realize R might not be the best tool to accomplish this but it’s the one I know so I’m giving it a shot. Thanks.
jsonlite::toJSON looks like a nice solution to your problem.
It works seamlessly up to column types and column order (I corrected both below to show that the round-tripped object is identical to the original). If you need any other type of JSON structure, I would recommend restructuring the data_frame on the front end first using something like dplyr or tidyr.
library(jsonlite)
library(dplyr)
dummyDF <- data_frame(sectionId = c(rep(1,9),rep(2,3)),
questionId = c(rep(1,3),rep(2,6),rep(3,3)),
subquestionId = c(rep(NA,3),rep("2a",3),rep("2b",3),rep(NA,3)),
deptManagerQId = c(rep("m1",3),rep("m2",3),rep("m3",3),rep("m4",3)),
deptEmployeeQId = c(rep("e1",3),rep("e3",3),rep("e4",3),rep("e7",3)),
optionId = rep(c(1,2,3),4),
text = rep(c("yes","neutral","no"),4))
## Convert to a JSON object
json <- jsonlite::toJSON(dummyDF)
theGoal <- fromJSON(json) %>% tbl_df() %>% select_(.dots=names(dummyDF)) %>%
## Convert integer columns to numeric
mutate_if(is.integer, as.numeric)
## Compare the objects
all.equal(theGoal,dummyDF)
# TRUE
identical(theGoal,dummyDF)
# TRUE
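For the nested shape in the question, the restructuring step could look roughly like the following base R sketch. It is not part of the answer above: `build_question` is a hypothetical helper, and the rule it applies (questions with a non-NA subquestionId get a subquestions array, others carry the department ids directly) is an assumption read off the expected JSON. It follows the data as given, so question 3 lands under section 2.

```r
dummyDF <- data.frame(
  sectionId = c(rep(1, 9), rep(2, 3)),
  questionId = c(rep(1, 3), rep(2, 6), rep(3, 3)),
  subquestionId = c(rep(NA, 3), rep("2a", 3), rep("2b", 3), rep(NA, 3)),
  deptManagerQId = c(rep("m1", 3), rep("m2", 3), rep("m3", 3), rep("m4", 3)),
  deptEmployeeQId = c(rep("e1", 3), rep("e3", 3), rep("e4", 3), rep("e7", 3)),
  optionId = rep(c(1, 2, 3), 4),
  text = rep(c("yes", "neutral", "no"), 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn the rows of one question into a nested list
build_question <- function(q) {
  ids <- c("deptManagerQId", "deptEmployeeQId")
  out <- list(questionId = as.character(q$questionId[1]))
  opts <- unique(q[c("optionId", "text")])
  options <- lapply(seq_len(nrow(opts)), function(i) {
    list(optionId = opts$optionId[i], text = opts$text[i])
  })
  if (all(is.na(q$subquestionId))) {
    # Plain question: department ids sit on the question itself
    out[ids] <- as.list(q[1, ids])
    out$options <- options
  } else {
    # Question with subquestions: ids move down one level
    out$options <- options
    subs <- unique(q[c("subquestionId", ids)])
    out$subquestions <- lapply(seq_len(nrow(subs)), function(i) as.list(subs[i, ]))
  }
  out
}

sections <- lapply(split(dummyDF, dummyDF$sectionId), function(s) {
  list(
    sectionId = as.character(s$sectionId[1]),
    questions = unname(lapply(split(s, s$questionId), build_question))
  )
})
nested <- list(sections = unname(sections))
```

The list can then be serialized with jsonlite::toJSON(nested, auto_unbox = TRUE, pretty = TRUE); unname() keeps sections and questions as JSON arrays rather than objects keyed by id.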
I'm having trouble with a very nasty nested JSON.
The format is like this
{
"matches": [
{
"matchId": 1,
"region": "BR",
"participants": [
{
"participantId": 0,
"teamId": 200,
"stats": {
"winner": true,
"champLevel": 16,
"item0": 3128,
}
{
"matchId": 2,
"region": "BR",
"participants": [
{
"participantId": 0,
"teamId": 201,
"stats": {
"winner": false,
"champLevel": 18,
"item0": 3128,
"item1": 3157,
"item2": 3158,
}
As you can see, the second match has more items than the first, but in the data frame the first row gets the same columns:
MatchId region ... stats.winner stats.champLevel stats.item0 stats.item1 stats.item2
1 BR TRUE 16 3128 1 BR
1 BR TRUE 16 3128 3157 3158
The first row is shorter than the second, so R recycles its values to fill the remaining columns.
If you want the full data you can grab it at:
http://pastebin.com/HQDf2ase
How I parsed the json to data.frame:
library(rjson) # fromJSON(file = ...) is the rjson interface
json.matchData <- fromJSON(file="file.json")
Unlist the elements of the Json and convert it to a data frame
matchData.i <- lapply(json.matchData$matches, function(x){ unlist(x)})
Transform into Data Frame
matchData <- do.call("rbind", matchData.i)
matchData <- as.data.frame(matchData)
But the dataframe is messed up, because some fields should be NA but they are filled with wrong values.
I think using the plyr rbind.fill() function would be helpful here. How about this:
library(plyr)
matchData <- rbind.fill(lapply(matchData.i,
function(x) do.call("data.frame", as.list(x))
))
The lapply() bit turns the intermediate results into data.frames, which rbind.fill requires.
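If adding plyr is not an option, the same NA-filling idea can be sketched in base R. This is an alternative technique, not what rbind.fill does internally; the two rows are made-up stand-ins for the named vectors that unlist() returns.

```r
# Named character vectors of unequal length (made-up values)
rows <- list(
  c(matchId = "1", stats.item0 = "3128"),
  c(matchId = "2", stats.item0 = "3128", stats.item1 = "3157")
)

# Union of all field names, in order of first appearance
all_cols <- unique(unlist(lapply(rows, names)))

# Pad every row to the full set of fields, using NA where a field is absent
filled <- lapply(rows, function(r) {
  out <- setNames(rep(NA_character_, length(all_cols)), all_cols)
  out[names(r)] <- r
  out
})

# All rows now have the same length, so rbind no longer recycles values
matchData <- as.data.frame(do.call(rbind, filled), stringsAsFactors = FALSE)
```

All columns come out as character here, so numeric fields would still need converting afterwards.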
I have this materialized path tree structure built using PostgreSQL's ltree module.
id1
id1.id2
id1.id2.id3
id1.id2.id5
id1.id2.id3.id4 ... etc
I can of course easily use ltree to get all nodes from the entire tree or from a specific path/subpath, but when I do that, naturally what I get is a lot of rows (which in the end amounts to an array/slice of nodes in Golang or whatever programming language you use).
What I'm after is to fetch the tree - ideally from a certain start and ending path/point - as a hierarchical JSON tree object like this:
{
"id": 1,
"path": "1",
"name": "root",
"children": [
{
"id": 2,
"path": "1.2",
"name": "Node 2",
"children": [
{
"id": 3,
"path": "1.2.3",
"name": "Node 3",
"children": [
{
"id": 4,
"path": "1.2.3.4",
"name": "Node 4",
"children": [
]
}
]
},
{
"id": 5,
"path": "1.2.5",
"name": "Node 5",
"children": [
]
}
]
}
]
}
I know that from a linear (non-hierarchical) row/array/slice result set I can of course explode the path in Golang and write the necessary business logic there to create this JSON, but it would certainly be much better if there were a handy way of achieving this with PostgreSQL directly.
So how would you output an ltree tree structure to JSON in PostgreSQL - potentially from a starting to an ending path?
If you don't know ltree, I guess the question could be generalized to "Materialized path tree to hierarchical JSON".
Also I'm playing with the thought of adding a parent_id on all nodes in addition to the ltree path, since at least then I would be able to use recursive calls using that id to fetch the json I guess... also I've thought about putting a trigger on that parent_id to manage the path (keep it updated) based on when a change in parent id happens - I know it's another question, but perhaps you could tell me your opinion as well, about this?
I hope some genius can help me with this. :)
For your convenience here's a sample create script you can use to save time:
CREATE TABLE node
(
id bigserial NOT NULL,
path ltree NOT NULL,
name character varying(255),
CONSTRAINT node_pkey PRIMARY KEY (id)
);
INSERT INTO node (path,name)
VALUES ('1','root');
INSERT INTO node (path,name)
VALUES ('1.2','Node 1');
INSERT INTO node (path,name)
VALUES ('1.2.3','Node 3');
INSERT INTO node (path,name)
VALUES ('1.2.3.4','Node 4');
INSERT INTO node (path,name)
VALUES ('1.2.5','Node 5');
I was able to find a solution and slightly change it to work with ltree's materialized paths instead of the parent ids often used in adjacency-list tree structures.
While I still hope for a better solution, I guess this will get the job done.
I kind of feel I have to add the parent_id in addition to the ltree path, since this is of course nowhere near as fast as referencing parent ids.
Credit goes to the author of the original solution; here's my slightly modified code using ltree's subpath, ltree2text and nlevel to achieve the same result:
WITH RECURSIVE c AS (
SELECT *, 1 as lvl
FROM node
WHERE id=1
UNION ALL
SELECT node.*, c.lvl + 1 as lvl
FROM node
JOIN c ON ltree2text(subpath(node.path,nlevel(node.path)-2 ,nlevel(node.path))) = CONCAT(subpath(c.path,nlevel(c.path)-1,nlevel(c.path)),'.',node.id)
),
maxlvl AS (
SELECT max(lvl) maxlvl FROM c
),
j AS (
SELECT c.*, json '[]' children
FROM c, maxlvl
WHERE lvl = maxlvl
UNION ALL
SELECT (c).*, json_agg(j) children FROM (
SELECT c, j
FROM j
JOIN c ON ltree2text(subpath(j.path,nlevel(j.path)-2,nlevel(j.path))) = CONCAT(subpath(c.path,nlevel(c.path)-1,nlevel(c.path)),'.',j.id)
) v
GROUP BY v.c
)
SELECT row_to_json(j)::text json_tree
FROM j
WHERE lvl = 1;
There is a big problem with this solution so far, though: Node 5 is missing from the output.
The reason node 5 does not show up is because it is a leaf node that is not at the max level and the subsequent join on condition excluded it.
The true base case for recursing through a tree is a node that is a leaf. By starting at the max level, that implicitly selects all leaf nodes but misses leaf nodes that occur at a lower level. Here is what we want to do in pseudo code:
for each node:
if node is leaf, then return empty array
else return the aggregated children
I found this hard to express in SQL though. Instead, I used the same strategy of starting from the max level and then moving up one level at a time. However, I added some code to handle the leaf node base case when I was above the max level.
Here is what I came up with:
WITH RECURSIVE c AS (
SELECT
name,
path,
nlevel(path) AS lvl
FROM node
),
maxlvl AS (
SELECT max(lvl) maxlvl FROM c
),
j AS (
SELECT
c.*,
json '[]' AS children
FROM c, maxlvl
WHERE lvl = maxlvl
UNION ALL
SELECT
(c).*,
CASE
WHEN COUNT(j) > 0 -- check if returned record is null
THEN json_agg(j) -- if not null, aggregate
ELSE json '[]' -- if null, then we are a leaf, so return empty array
END AS children
FROM (
SELECT
c,
CASE
WHEN c.path = subpath(j.path, 0, nlevel(j.path) - 1) -- c is a parent of the child
THEN j
ELSE NULL -- if c is not a parent, return NULL to trigger base case
END AS j
FROM j
JOIN c ON c.lvl = j.lvl - 1
) AS v
GROUP BY v.c
)
SELECT row_to_json(j)::text AS json_tree
FROM j
WHERE lvl = 1;
My solution only uses the path (and the derived level from the path). It does not need name or id to properly recurse.
Here is the result I get (I included a node 6 to make sure I handled multiple leaf nodes at the same level):
{
"name": "root",
"path": "1",
"lvl": 1,
"children": [
{
"name": "Node 1",
"path": "1.2",
"lvl": 2,
"children": [
{
"name": "Node 5",
"path": "1.2.5",
"lvl": 3,
"children": []
},
{
"name": "Node 3",
"path": "1.2.3",
"lvl": 3,
"children": [
{
"name": "Node 6",
"path": "1.2.3.4",
"lvl": 4,
"children": []
},
{
"name": "Node 4",
"path": "1.2.3.4",
"lvl": 4,
"children": []
}
]
}
]
}
]
}
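Outside SQL, the leaf-first pseudocode from the answer above is short to express as a plain recursion. The following is an R sketch of that logic only, not a replacement for the query: `parent_of` and `build_subtree` are hypothetical helpers, and the node table mirrors the sample create script.

```r
nodes <- data.frame(
  path = c("1", "1.2", "1.2.3", "1.2.3.4", "1.2.5"),
  name = c("root", "Node 1", "Node 3", "Node 4", "Node 5"),
  stringsAsFactors = FALSE
)

# Parent path = path with its last label removed ("" for the root)
parent_of <- function(p) sub("\\.?[^.]+$", "", p)

# If the node is a leaf, children is an empty list;
# otherwise it holds the recursively built child nodes
build_subtree <- function(path) {
  row <- nodes[nodes$path == path, ]
  kids <- nodes$path[parent_of(nodes$path) == path]
  list(
    path = row$path,
    name = row$name,
    children = lapply(kids, build_subtree)
  )
}

tree <- build_subtree("1")
```

Because every node looks up its own children, a leaf such as Node 5 is kept regardless of the level it sits on, which is the case the max-level approach had to special-case.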