How to parse JSON in a DataFrame column using R - json

How do I come from here ...
| ID | JSON Request |
==============================================================================
| 1 | {"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]} |
------------------------------------------------------------------------------
| 2 | {"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]} |
------------------------------------------------------------------------------
to here (The requirement is to make a table of JSON in column 2):
| User | P1 | P2 | domains |
============================
| xyz1 | 0 |100 | a1, b1 |
----------------------------
| xyz2 |100 | 0 | a2, b2 |
----------------------------
Here is the code to generate the data.frame:
raw_df <-
data.frame(
id = 1:2,
json =
c(
'{"user": "xyz2", "weightmap": {"P1":100,"P2":0}, "domains": ["a2","b2"]}',
'{"user": "xyz1", "weightmap": {"P1":0,"P2":100}, "domains": ["a1","b1"]}'
),
stringsAsFactors = FALSE
)

Here's a tidyverse solution (also using jsonlite) if you're happy to work in a long format (for domains in this case):
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)
d <- data.frame(
id = c(1, 2),
json = c(
'{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
'{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
),
stringsAsFactors = FALSE
)
d %>%
mutate(json = map(json, ~ fromJSON(.) %>% as.data.frame())) %>%
unnest(json)
#> id user weightmap.P1 weightmap.P2 domains
#> 1 1 xyz1 0 100 a1
#> 2 1 xyz1 0 100 b1
#> 3 2 xyz2 100 0 a2
#> 4 2 xyz2 100 0 b2
mutate... is converting from a string to column of nested data frames.
unnest... is unnesting these data frames into multiple columns

I would go for the jsonlite package in combination with the usage of mapply, a transformation function and data.table's rbindlist.
# data
raw_df <- data.frame(id = 1:2, json = c('{"user": "xyz2", "weightmap": {"P1":100,"P2":0}, "domains": ["a2","b2"]}', '{"user": "xyz1", "weightmap": {"P1":0,"P2":100}, "domains": ["a1","b1"]}'), stringsAsFactors = FALSE)
# libraries
library(jsonlite)
library(data.table)
# 1) First, make a transformation function that works for a single entry
f <- function(json, id){
# transform json to list
tmp <- jsonlite::fromJSON(json)
# transform list to data.frame
tmp <- as.data.frame(tmp)
# add id
tmp$id <- id
# return
return(tmp)
}
# 2) apply it via mapply
json_dfs <-
mapply(f, raw_df$json, raw_df$id, SIMPLIFY = FALSE)
# 3) combine the fragments via rbindlist
clean_df <-
data.table::rbindlist(json_dfs)
# 4) et-voila
clean_df
## user weightmap.P1 weightmap.P2 domains id
## 1: xyz2 100 0 a2 1
## 2: xyz2 100 0 b2 1
## 3: xyz1 0 100 a1 2
## 4: xyz1 0 100 b1 2

Could not get the flatten parameter to work as I expected so needed to unlist and then "re-list" before rbinding with do.call:
library(jsonlite)
do.call( rbind,
lapply(raw_df$json,
function(j) as.list(unlist(fromJSON(j, flatten=TRUE)))
) )
user weightmap.P1 weightmap.P2 domains1 domains2
[1,] "xyz2" "100" "0" "a2" "b2"
[2,] "xyz1" "0" "100" "a1" "b1"
Admittedly, this will require further processing since it coerces all the lines to character.

library(jsonlite)
json = c(
'{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
'{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
)
json <- lapply( paste0("[", json ,"]"),
function(x) jsonlite::fromJSON(x))
df <- data.frame(matrix(unlist(json), nrow=2, ncol=5, byrow=T))
df <- df %>% unite(Domains, X4, X5, sep = ", ")
colnames(df) <- c("user", "P1", "P2", "domains")
head(df)
The output is:
user P1 P2 domains
1 xyz1 0 100 a1, b1
2 xyz2 100 0 a2, b2

Using tidyjson
https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html
install.packages("tidyjson")
library(tidyjson)
json_as_df <- raw_df$json %>% spread_all
# retain columns
json_as_df <- raw_df %>% as.tbl_json(json.column = "json") %>% spread_all

Related

Extract nested JSON from R dataframe without knowing keys

I am trying to extract JSON from a TSV column. The difficulty is the JSON is shallowly nested, and the key values may not be present in every row.
I have a minimal example to illustrate my point.
df <- tibble(index = c(1, 2),
data = c('{"json_char":"alpha", "json_list1":["x","y"]}',
'{"json_char":"beta", "json_list1":["x","y","z"], "json_list2":["a","b","c"]}'))
The desired result:
df <- tibble::tibble(index = list(1, 2),
json_char = list("alpha", "beta"),
json_list1 = list(list("x","y"), list("x","y","z")),
json_list2 = list(NA, list("a","b","c")))
After a fair amount of experimentation, I have this function:
extract_json_column <- function(df) {
df %>%
magrittr::use_series(data) %>%
purrr::map(jsonlite::fromJSON) %>%
purrr::map(purrr::simplify) %>%
tibble::enframe() %>%
tidyr::spread("name", "value") %>%
purrr::flatten_dfr()
}
Which gives me the following error: Error in bind_rows_(x, .id) : Argument 2 must be length 3, not 7.
The first row sets the number of parameters for the rest of dataframe. Is there anyway to avoid that behavior?
I modified your function to the following. I hope this helps.
library(tidyverse)
library(rjson)
extract_json_column <- function(df){
df %>%
rowwise() %>%
mutate(data = map(data, fromJSON)) %>%
split(.$index) %>%
map(~.$data[[1]]) %>%
map(~map_if(., function(x) length(x) != 1, list)) %>%
map(as_data_frame) %>%
bind_rows(.id = "index")
}
extract_json_column(df)
# A tibble: 2 x 4
index json_char json_list1 json_list2
<chr> <chr> <list> <list>
1 1 alpha <chr [2]> <NULL>
2 2 beta <chr [3]> <chr [3]>

How to clean and split HTML tags in R?

My parser create a data frame, which looks like:
name html
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>
So how I can extract usefull information from HTML? For example, I want to use some HTML attributes as features:
name minute second id
1 John 68 37 8028
2 Steve 69 4 132205
If you already have the data frame in your question, you can try the following. Your data frame is called mydf here. You can extract all numbers with stri_extract_all_regex(). Then, you follow the classic method converting a list to a data frame. Then, you assign new column names and bind the result with the column, name in the original data frame.
library(stringi)
library(dplyr)
stri_extract_all_regex(str = mydf$url, pattern = "[0-9]+") %>%
unlist %>%
matrix(ncol = 4, byrow = T) %>%
data.frame %>%
setNames(c("minute", "second", "ID", "data")) %>%
bind_cols(mydf["name"], .)
# name minute second ID data
#1 John 68 37 8028 68
#2 Steve 69 4 132205 69
DATA
mydf <- structure(list(name = c("John", "Steve"), url = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>",
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>"
)), .Names = c("name", "url"), row.names = c(NA, -2L), class = "data.frame")
An alternate rvest approach using purrr and dplyr:
library(rvest)
library(purrr)
library(dplyr)
df <- read.table(stringsAsFactors=FALSE, header=TRUE, sep=",", text='name,html
John,<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
Steve,<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>')
by_row(df, .collate="cols",
~read_html(.$html) %>%
html_nodes("span:first-of-type") %>%
html_attrs() %>%
flatten_chr() %>%
as.list() %>%
flatten_df()) %>%
select(-html, -class1) %>%
setNames(gsub("^data-|1$", "", colnames(.)))
## # A tibble: 2 × 4
## name minute second id
## <chr> <chr> <chr> <chr>
## 1 John 68 37 8028
## 2 Steve 69 4 132205
regex is possible, but I prefer the rvest package for this,
this is easier with data.table or dplyr, but lets do it base R, (on the off-chance that those are new concepts)
# Example data
df <- structure(list(name = c("John", "Steve"), html = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>",
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>"
)), .Names = c("name", "html"), row.names = c(NA, -2L), class = "data.frame")
rvest lets us split this up using the DOM, which can be a lot nicer than working with regex for the same thing.
library(rvest)
# Get span attributes from each row:
spanattrs <-
lapply(df$html,
function(y) read_html(y) %>% html_node('span') %>% html_attrs)
# rbind to get a data.frame with all attributes
final <- data.frame(df, do.call(rbind,spanattrs))
> final
name html class
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> incident-icon
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> incident-icon
data.minute data.second data.id
1 68 37 8028
2 69 4 132205
Lets remove the html so it's a little nicer in the viewer here:
> final$html <- NULL
> final
name class data.minute data.second data.id
1 John incident-icon 68 37 8028
2 Steve incident-icon 69 4 132205

Tidy nested json tree

This comes up a lot when dealing with API's.
Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.
I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.
I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,
But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.
#####################
# Some Test Data
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
############################################################
# Run through the list and rename all list elements,
# We unlist once at time, adding "__" at each unlist step
# until the object is no longer a list
renameVars <- function(lst, sep = '__') {
if(is.list(lst)) {
names(lst) <- paste0(names(lst),sep)
renameVars(unlist(lst, recursive = FALSE),sep = sep)
} else {
lst
}
}
res <- renameVars(aNestedTree)
We can check the output and see that we have a strangely named object,
But there's a method to this madness.
> res
a________ b________ c__.a____1__ c__.a____2__ c__.a____3__
1 2 1 2 3
c__.a____4__ c__.a____5__ c__.b______ c__.c__.a____ c__.c__.d____
4 5 2 1 3
d__.y______ d__.z______
3 2
Now I put this in a data.table, so I can shape it.
library(data.table)
dt <- data.table(values = res, name = names(res))
# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests
> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
values name V1 V2 V3
1: 1 a________ a NA NA
2: 2 b________ b NA NA
3: 1 c__.a____1__ c a 1
4: 2 c__.a____2__ c a 2
5: 3 c__.a____3__ c a 3
6: 4 c__.a____4__ c a 4
7: 5 c__.a____5__ c a 5
8: 2 c__.b______ c b NA
9: 1 c__.c__.a____ c c a
10: 3 c__.c__.d____ c c d
11: 3 d__.y______ d y NA
12: 2 d__.z______ d z NA
I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)
I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.
I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.
Notes on approach:
That all values would be numeric (I cast them to character afterwards)
Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
level is kept track of by level of recursion
First define the helpers:
recurse_gather <- function(.x,.level) {
.x <- tidyjson::bind_rows(
gobj(.x,.level)
, garr(.x,.level)
, gpersist(.x,.level)
)
if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
.x <- recurse_gather(.x,.level+1)
}
return(.x)
}
gobj <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='object') %>%
gather_object(paste0('v',.level)) %>%
select(-type)
}
gpersist <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(! type %in% c('object','array')) %>%
mutate_(.dots=setNames(
paste0('as.character(NA)')
,paste0('v',.level)
)) %>%
select(-type)
}
garr <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='array') %>%
gather_array('arridx') %>%
append_values_number(paste0('v',.level)) %>%
mutate_(.dots=setNames(
paste0('as.character(v',.level,')')
,paste0('v',.level)
)) %>%
select(-arridx,-type)
}
Then using the helpers is pretty straight-forward.
library(dplyr)
library(tidyjson)
j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#> document.id v1 v2 v3 v4
#> * <int> <chr> <chr> <chr> <chr>
#> 1 1 a 1 <NA> <NA>
#> 2 1 b 2 <NA> <NA>
#> 3 1 c a 1 <NA>
#> 4 1 c a 2 <NA>
#> 5 1 c a 3 <NA>
#> 6 1 c a 4 <NA>
#> 7 1 c a 5 <NA>
#> 8 1 c b 2 <NA>
#> 9 1 c c a 1
#> 10 1 c c d 3
#> 11 1 d y 3 <NA>
#> 12 1 d z 2 <NA>
Hopeful that future development on the tidyjson package will make this an easier problem to tackle!
I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson
As dracodoc pointed out, data.tree might help. E.g. like this:
library(data.tree)
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
tree <- FromListSimple(aNestedTree)
print(tree)
This will give:
levelName z
1 Root NA
2 ¦--c NA
3 ¦ ¦--a NA
4 ¦ °--c NA
5 ¦ °--e NA
6 °--d 2
And:
tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"
Side note: typically, you could do something like this:
do.call("print", c(tree, tree$fieldsAll))
However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.

Simple formatting with tidyjson with R

I have a simple JSON file which I'm attempting to coerce into an R data.frame.
json = "
{ \"objects\":
{
\"object_one\": {
\"key1\" : \"value1\",
\"key2\" : \"value2\",
\"key3\" : \"0\",
\"key4\" : \"value3\",
\"key5\" : \"False\",
\"key6\" : \"False\"
},
\"object_two\": {
\"key1\" : \"0.5\",
\"key2\" : \"0\",
\"key3\" : \"343\",
\"key4\" : \"value4\",
\"key5\" : \"True\",
\"key6\" : \"True\"
}
}
}
"
and I simply want to extract the name of each object as a index key (or rowname), create column names from the keys and spread the values.
Unfortunately I've had no luck unpicking the syntax. Can anyone help?
Thanks
Stuart
There are two ways to do this with tidyjson, the first is to use tidyjson::append_values_string and then tidyr::spread:
library(tidyjson)
library(dplyr)
library(tidyr)
json %>%
enter_object("objects") %>%
gather_keys("object") %>%
gather_keys("key") %>%
append_values_string("value") %>%
tbl_df %>% spread(key, value)
#> # A tibble: 2 x 8
#> document.id object key1 key2 key3 key4 key5 key6
#> * <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 object_one value1 value2 0 value3 False False
#> 2 1 object_two 0.5 0 343 value4 True True
The other way is to use tidyjson::spread_values to specific each key separately:
json %>%
enter_object("objects") %>%
gather_keys("object") %>%
spread_values(
key1 = jstring("key1"),
key2 = jstring("key2"),
key3 = jnumber("key3"),
key4 = jstring("key4"),
key5 = jstring("key5"),
key6 = jstring("key6")
)
#> document.id object key1 key2 key3 key4 key5 key6
#> 1 1 object_one value1 value2 0 value3 False False
#> 2 1 object_two 0.5 0 343 value4 True True
The advantage of the second approach is that you can (a) specify the types of each column and (b) will be guaranteed to get the same data.frame structure even if the keys change (or are missing) in some documents or objects.
Not entirely sure on your desired output, but you can use jsonlite::fromJSON to extract the data, and data.table::rbindlist to put it into a data.table
library(jsonlite)
library(data.table)
rbindlist(fromJSON(json))
# object_one object_two
# 1: value1 0.5
# 2: value2 0
# 3: 0 343
# 4: value3 value4
# 5: False True
# 6: False True
Based on your comment, another approach that involves some reshaping
library(jsonlite)
library(reshape2)
lst <- fromJSON(json)
lst <- lapply(lst[[1]], unlist)
df <- as.data.frame(lst)
df$key <- rownames(df)
df <- melt(df, id = "key")
df <- dcast(df, formula = variable ~ key)
df
# variable key1 key2 key3 key4 key5 key6
# 1 object_one value1 value2 0 value3 False False
# 2 object_two 0.5 0 343 value4 True True

convert JSON to data.frame in R

I have problem with converting lists to data.frame
First I have downloaded dataset in JSON format from Data API:
request1 <- POST(url = "https://api.data-api.io/v1/subjekti", add_headers('x-dataapi-key' = "xxxxxxx", 'content-type'= "application/json"), body = list(oib = oibreq), encode = "json")
json1 <- content(request1, type = "application/json")
json2 <- fromJSON(toJSON(json1, null = "null"), flatten = TRUE)
The problem is that data are elements of lists. For example
> json2[['oib']]
[[1]]
[1] "00045103869"
[[2]]
[1] "18527887472"
[[3]]
[1] "92680516748"
all colnames:
> colnames(json2)
[1] "oib" "mb" "mbs" "mbo" "rno" "naziv"
[7] "adresa" "grad" "posta" "zupanija" "nkd2007" "puo"
[13] "godinaOsnivanja" "status" "temeljniKapital" "isActive" "datumBrisanja" "predmetPoslovanja"
How can I convert this lists to data.frame?
Sorry, that was my first question on stockoverflow. There is my dataset:
> data <- dput(json3)
structure(list(oib = list("00045103869", "18527887472", "92680516748"),
mb = list("01699032", "03858731", "02591596"), mbs = list(
"080451345", "060060881", "040260786"), mbo = c(NA, NA,
NA), rno = c(NA, NA, NA), naziv = list("INTERIJER DIZAJN d.o.o.",
"M - Đ COMMERCE d.o.o.", "HIP REKLAME d.o.o. u stečaju"),
adresa = list("Savska cesta 179", "Put Piketa 0", "Sadska 2"),
grad = list("Zagreb", "Sinj", "Rijeka"), posta = list("10000",
"21230", "51000"), zupanija = list("Grad Zagreb", "Splitsko-dalmatinska",
"Primorsko-goranska"), nkd2007 = list("1623", "4719",
"4711"), puo = list(92L, 92L, 92L), godinaOsnivanja = list(
"2003", "1995", "2009"), status = list("bez postupka",
"bez postupka", "stečaj"), temeljniKapital = list("20.000,00 kn",
"509.100,00 kn", "20.000,00 kn"), isActive = list(TRUE,
TRUE, FALSE), datumBrisanja = list(NULL, NULL, "2015-12-24T00:00:00+01:00")), .Names = c("oib",
"mb", "mbs", "mbo", "rno", "naziv", "adresa", "grad", "posta",
"zupanija", "nkd2007", "puo", "godinaOsnivanja", "status", "temeljniKapital",
"isActive", "datumBrisanja"), class = "data.frame", row.names = c(NA,
3L))
A quick & dirty way would be to substitute the NULL values by e.g. NAs like this
f <- function(lst) lapply(lst, function(x) if (is.list(x)) f(x) else if (is.null(x)) NA_character_ else x)
df <- as.data.frame(lapply(f(json2), unlist))
str(df)
# 'data.frame': 3 obs. of 17 variables:
# $ oib : Factor w/ 3 levels "00045103869",..: 1 2 3
# $ mb : Factor w/ 3 levels "01699032","02591596",..: 1 3 2
# $ mbs : Factor w/ 3 levels "040260786","060060881",..: 3 2 1
# $ mbo : logi NA NA NA
# $ rno : logi NA NA NA
# $ naziv : Factor w/ 3 levels "HIP REKLAME d.o.o. u stecaju",..: 2 3 1
# $ adresa : Factor w/ 3 levels "Put Piketa 0",..: 3 1 2
# $ grad : Factor w/ 3 levels "Rijeka","Sinj",..: 3 2 1
# $ posta : Factor w/ 3 levels "10000","21230",..: 1 2 3
# $ zupanija : Factor w/ 3 levels "Grad Zagreb",..: 1 3 2
# $ nkd2007 : Factor w/ 3 levels "1623","4711",..: 1 3 2
# $ puo : int 92 92 92
# $ godinaOsnivanja: Factor w/ 3 levels "1995","2003",..: 2 1 3
# $ status : Factor w/ 2 levels "bez postupka",..: 1 1 2
# $ temeljniKapital: Factor w/ 2 levels "20.000,00 kn",..: 1 2 1
# $ isActive : logi TRUE TRUE FALSE
# $ datumBrisanja : Factor w/ 1 level "2015-12-24T00:00:00+01:00": NA NA 1
But there may be better options.