I have a simple JSON file which I'm attempting to coerce into an R data.frame.
json = "
{ \"objects\":
{
\"object_one\": {
\"key1\" : \"value1\",
\"key2\" : \"value2\",
\"key3\" : \"0\",
\"key4\" : \"value3\",
\"key5\" : \"False\",
\"key6\" : \"False\"
},
\"object_two\": {
\"key1\" : \"0.5\",
\"key2\" : \"0\",
\"key3\" : \"343\",
\"key4\" : \"value4\",
\"key5\" : \"True\",
\"key6\" : \"True\"
}
}
}
"
and I simply want to extract the name of each object as an index key (or row name), create column names from the keys, and spread the values.
Unfortunately I've had no luck unpicking the syntax. Can anyone help?
Thanks
Stuart
There are two ways to do this with tidyjson. The first is to use tidyjson::append_values_string and then tidyr::spread:
library(tidyjson)
library(dplyr)
library(tidyr)
json %>%
enter_object("objects") %>%
gather_keys("object") %>%
gather_keys("key") %>%
append_values_string("value") %>%
tbl_df %>% spread(key, value)
#> # A tibble: 2 x 8
#> document.id object key1 key2 key3 key4 key5 key6
#> * <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 object_one value1 value2 0 value3 False False
#> 2 1 object_two 0.5 0 343 value4 True True
The other way is to use tidyjson::spread_values to specify each key separately:
json %>%
enter_object("objects") %>%
gather_keys("object") %>%
spread_values(
key1 = jstring("key1"),
key2 = jstring("key2"),
key3 = jnumber("key3"),
key4 = jstring("key4"),
key5 = jstring("key5"),
key6 = jstring("key6")
)
#> document.id object key1 key2 key3 key4 key5 key6
#> 1 1 object_one value1 value2 0 value3 False False
#> 2 1 object_two 0.5 0 343 value4 True True
The advantage of the second approach is that you can (a) specify the type of each column and (b) be guaranteed to get the same data.frame structure even if the keys change (or are missing) in some documents or objects.
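The same two guarantees (explicit column types and a fixed set of columns) can also be sketched without tidyjson. The following is only an illustration using jsonlite and base R; the shortened input and the expected_keys vector are assumptions made for this example, not part of the answer above.

```r
library(jsonlite)

# Shortened, made-up input: object_one has no "key2"
json <- '{"objects": {
  "object_one": {"key1": "value1", "key3": "0"},
  "object_two": {"key1": "0.5", "key2": "0", "key3": "343"}
}}'

# Hypothetical fixed schema: the columns we always want, in this order
expected_keys <- c("key1", "key2", "key3")

objects <- fromJSON(json)$objects

rows <- lapply(objects, function(obj) {
  # look up every expected key; keys that are absent become NA
  vals <- lapply(expected_keys, function(k) {
    if (is.null(obj[[k]])) NA_character_ else obj[[k]]
  })
  as.data.frame(setNames(vals, expected_keys), stringsAsFactors = FALSE)
})

# one row per object; the object names become the row names
result <- do.call(rbind, rows)

# set the column type explicitly instead of relying on guessing
result$key3 <- as.numeric(result$key3)
result
```

Because the columns come from expected_keys rather than from the data, an object with a missing key still yields a full row, with NA in the gap.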
I'm not entirely sure of your desired output, but you can use jsonlite::fromJSON to extract the data and data.table::rbindlist to put it into a data.table:
library(jsonlite)
library(data.table)
rbindlist(fromJSON(json))
# object_one object_two
# 1: value1 0.5
# 2: value2 0
# 3: 0 343
# 4: value3 value4
# 5: False True
# 6: False True
Based on your comment, here is another approach that involves some reshaping:
library(jsonlite)
library(reshape2)
lst <- fromJSON(json)
lst <- lapply(lst[[1]], unlist)
df <- as.data.frame(lst)
df$key <- rownames(df)
df <- melt(df, id = "key")
df <- dcast(df, formula = variable ~ key)
df
# variable key1 key2 key3 key4 key5 key6
# 1 object_one value1 value2 0 value3 False False
# 2 object_two 0.5 0 343 value4 True True
I am trying to extract JSON from a TSV column. The difficulty is the JSON is shallowly nested, and the key values may not be present in every row.
I have a minimal example to illustrate my point.
df <- tibble(index = c(1, 2),
data = c('{"json_char":"alpha", "json_list1":["x","y"]}',
'{"json_char":"beta", "json_list1":["x","y","z"], "json_list2":["a","b","c"]}'))
The desired result:
df <- tibble::tibble(index = list(1, 2),
json_char = list("alpha", "beta"),
json_list1 = list(list("x","y"), list("x","y","z")),
json_list2 = list(NA, list("a","b","c")))
After a fair amount of experimentation, I have this function:
extract_json_column <- function(df) {
df %>%
magrittr::use_series(data) %>%
purrr::map(jsonlite::fromJSON) %>%
purrr::map(purrr::simplify) %>%
tibble::enframe() %>%
tidyr::spread("name", "value") %>%
purrr::flatten_dfr()
}
This gives me the following error: Error in bind_rows_(x, .id) : Argument 2 must be length 3, not 7.
The first row sets the number of parameters for the rest of the data frame. Is there any way to avoid that behavior?
I modified your function to the following. I hope this helps.
library(tidyverse)
library(rjson)
extract_json_column <- function(df){
df %>%
rowwise() %>%
mutate(data = map(data, fromJSON)) %>%
split(.$index) %>%
map(~.$data[[1]]) %>%
map(~map_if(., function(x) length(x) != 1, list)) %>%
map(as_data_frame) %>%
bind_rows(.id = "index")
}
extract_json_column(df)
# A tibble: 2 x 4
index json_char json_list1 json_list2
<chr> <chr> <list> <list>
1 1 alpha <chr [2]> <NULL>
2 2 beta <chr [3]> <chr [3]>
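For comparison, here is a minimal base R sketch of the same idea: collect the union of keys across rows and store each value in a list column, so rows with fewer keys get NA. It uses jsonlite rather than rjson, and the names parsed, all_keys and out are made up for this illustration.

```r
library(jsonlite)

data <- c(
  '{"json_char":"alpha", "json_list1":["x","y"]}',
  '{"json_char":"beta", "json_list1":["x","y","z"], "json_list2":["a","b","c"]}'
)

# parse every JSON string into a named list
parsed <- lapply(data, fromJSON)

# union of all keys seen in any row, in order of first appearance
all_keys <- unique(unlist(lapply(parsed, names)))

out <- data.frame(index = seq_along(parsed))
for (k in all_keys) {
  # one list element per row; a key missing from a row becomes NA
  out[[k]] <- lapply(parsed, function(p) if (is.null(p[[k]])) NA else p[[k]])
}

str(out)
```

Every JSON key becomes a list column here, including json_char, which keeps the number of columns independent of what the first row happens to contain.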
I have this R code to convert JSON data to a data.frame. It works fine, but it is rather slow for huge JSON files. What is a more efficient way to do this (I wouldn't mind a data.table output)?
json_data <- fromJSON(json_dt_url)
json_data <- json_data[['data']]
my_df <- data.frame()
for (i in 1:length(json_data))
{
my_df <- rbind(my_df, as.data.frame(json_data[[i]]))
}
If you are looking for fast JSON parsing, take a look at RcppSimdJson.
library(RcppSimdJson)
jsonfile <- system.file("jsonexamples", "small", "demo.json", package="RcppSimdJson")
res <- fload(jsonfile)
str(res)
#> List of 1
#> $ Image:List of 6
#> ..$ Width : int 800
#> ..$ Height : int 600
#> ..$ Title : chr "View from 15th Floor"
#> ..$ Thumbnail:List of 3
#> .. ..$ Url : chr "http://www.example.com/image/481989943"
#> .. ..$ Height: int 125
#> .. ..$ Width : int 100
#> ..$ Animated : logi FALSE
#> ..$ IDs : int [1:4] 116 943 234 38793
Created on 2020-08-05 by the reprex package (v0.3.0)
Using the benchmarking code from the package, we can compare different parsing approaches:
file <- system.file("jsonexamples", "mesh.json", package = "RcppSimdJson")
res <- bench::mark(
RcppSimdJson = RcppSimdJson::fload(file),
jsonlite = jsonlite::fromJSON(file),
jsonify = jsonify::from_json(file),
RJSONIO = RJSONIO::fromJSON(file),
ndjson = ndjson::stream_in(file),
check = FALSE
)
res
#> # A tibble: 5 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 RcppSimdJson 1.51ms 1.67ms 582. 5.82MB 5.98
#> 2 jsonlite 44.68ms 48.95ms 18.8 2.74MB 22.6
#> 3 jsonify 9.76ms 11.34ms 87.5 1.12MB 43.7
#> 4 RJSONIO 33.11ms 35.17ms 28.6 2.93MB 3.82
#> 5 ndjson 136.35ms 138.67ms 7.21 9.41MB 30.6
We see that RcppSimdJson is by far the fastest.
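Parsing speed aside, the loop in the question is also slow because rbind copies the growing data frame on every iteration. A minimal sketch of combining all records in one step, assuming each element of json_data is a flat named list (the sample records here are made up):

```r
# Made-up stand-in for json_data[['data']]: a list of flat records
json_data <- list(
  list(id = 1, name = "a"),
  list(id = 2, name = "b"),
  list(id = 3, name = "c")
)

# convert each record to a one-row data frame ...
rows <- lapply(json_data, as.data.frame, stringsAsFactors = FALSE)

# ... and bind them all in a single call instead of growing my_df in a loop
my_df <- do.call(rbind, rows)
my_df
```

If data.table is available, data.table::rbindlist(json_data) should produce the same table in one call.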
data2 <- fromJSON("data.json", flatten = TRUE)
Reference: https://rdrr.io/cran/jsonlite/f/vignettes/json-apis.Rmd
Try this way:
library(jsonlite)
json_data <- read_json("data.json", simplifyVector = TRUE)
Include the sample input so that I can test the solution myself!
I have the following code, which extracts data from a JSON file.
library(jsonlite)
file_path <- 'C:/some/file/path.json'
df <- jsonlite::fromJSON(txt = file_path ,
simplifyVector = FALSE,
simplifyDataFrame = TRUE,
simplifyMatrix = FALSE,
flatten = FALSE)
The data structure is highly nested. My approach extracts 99% of it just fine, but in one particular part of the data I came across a phenomenon that I would describe as an "embedded" data frame:
df <- structure(
list(
ID = c(1L, 2L, 3L, 4L, 5L),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = structure(
list(
var2a = c('v', 'w', 'x', 'y', 'z'),
var2b = c('vv', 'ww', 'xx', 'yy', 'zz')),
.Names = c('var2a', 'var2b'),
row.names = c(NA, 5L),
class = 'data.frame'),
var3 = c('aa', 'bb', 'cc', 'dd', 'ee')),
.Names = c('ID', 'var1', 'var2', 'var3'),
row.names = c(NA, 5L),
class = 'data.frame')
# Looks like this:
# ID var1 var2.var2a var2.var2b var3
# 1 1 a v vv aa
# 2 2 b w ww bb
# 3 3 c x xx cc
# 4 4 d y yy dd
# 5 5 e z zz ee
This looks like a normal data frame, and it behaves like that for the most part.
class(df)
# [1] "data.frame"
df[1,]
# ID var1 var2.var2a var2.var2b var3
# 1 a v vv aa
dim(df)
# [1] 5 4
# One less than expected due to embedded data frame
lapply(df, class)
# $ID
# [1] "integer"
#
# $var1
# [1] "character"
#
# $var2
# [1] "data.frame"
#
# $var3
# [1] "character"
str(df)
# 'data.frame': 5 obs. of 4 variables:
# $ ID : int 1 2 3 4 5
# $ var1: chr "a" "b" "c" "d" ...
# $ var2:'data.frame': 5 obs. of 2 variables:
# ..$ var2a: chr "v" "w" "x" "y" ...
# ..$ var2b: chr "vv" "ww" "xx" "yy" ...
# $ var3: chr "aa" "bb" "cc" "dd" ...
What is going on here? Why is jsonlite creating this odd structure instead of just a simple data.frame? Can I avoid this behaviour and, if not, how can I most elegantly rectify it? I've used the approach below, but it feels very hacky at best.
# Any columns with embedded data frame?
newX <- X[,-which(lapply(X, class) == 'data.frame')] %>%
# Append them to the end
cbind(X[,which(lapply(X, class) == 'data.frame')])
Update
The suggested workaround solves my issue, but I still feel like I don't understand the strange embedded data.frame structure. I would have thought that such a structure would be illegal by R data format conventions, or at least behave differently in terms of subsetting using [. I have opened a separate question on that.
I think you want to flatten your df object:
json <- toJSON(df)
flat_df <- fromJSON(json, flatten = TRUE)
str(flat_df)
# 'data.frame': 5 obs. of 5 variables:
# $ ID : int 1 2 3 4 5
# $ var1 : chr "a" "b" "c" "d" ...
# $ var3 : chr "aa" "bb" "cc" "dd" ...
# $ var2.var2a: chr "v" "w" "x" "y" ...
# $ var2.var2b: chr "vv" "ww" "xx" "yy" ...
Is that closer to what you're looking for?
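If the data frame is already in memory, the round trip through toJSON may not be needed: jsonlite also exports a flatten() function that works on a data frame directly. A small sketch with a made-up two-row version of the structure from the question:

```r
library(jsonlite)

# Made-up two-row version of the question's data
df <- data.frame(ID = 1:2, var1 = c("a", "b"), stringsAsFactors = FALSE)

# assigning a data frame to a column creates the "embedded" structure
df$var2 <- data.frame(
  var2a = c("v", "w"),
  var2b = c("vv", "ww"),
  stringsAsFactors = FALSE
)
ncol(df)   # the nested data frame counts as a single column

# flatten() expands nested data frame columns into ordinary columns
flat <- jsonlite::flatten(df)
names(flat)
```

The package prefix is written out because other packages (purrr, for example) also export a function called flatten.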
How do I get from here ...
| ID | JSON Request |
==============================================================================
| 1 | {"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]} |
------------------------------------------------------------------------------
| 2 | {"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]} |
------------------------------------------------------------------------------
to here (the requirement is to turn the JSON in column 2 into a table):
| User | P1 | P2 | domains |
============================
| xyz1 | 0 |100 | a1, b1 |
----------------------------
| xyz2 |100 | 0 | a2, b2 |
----------------------------
Here is the code to generate the data.frame:
raw_df <-
data.frame(
id = 1:2,
json =
c(
'{"user": "xyz2", "weightmap": {"P1":100,"P2":0}, "domains": ["a2","b2"]}',
'{"user": "xyz1", "weightmap": {"P1":0,"P2":100}, "domains": ["a1","b1"]}'
),
stringsAsFactors = FALSE
)
Here's a tidyverse solution (also using jsonlite) if you're happy to work in a long format (for domains in this case):
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)
d <- data.frame(
id = c(1, 2),
json = c(
'{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
'{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
),
stringsAsFactors = FALSE
)
d %>%
mutate(json = map(json, ~ fromJSON(.) %>% as.data.frame())) %>%
unnest(json)
#> id user weightmap.P1 weightmap.P2 domains
#> 1 1 xyz1 0 100 a1
#> 2 1 xyz1 0 100 b1
#> 3 2 xyz2 100 0 a2
#> 4 2 xyz2 100 0 b2
mutate(...) converts each JSON string into a data frame, giving a column of nested data frames.
unnest(...) expands these nested data frames into regular columns, with one row per domain.
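If the wide layout from the question is needed instead (one row per user, domains collapsed into a single string), a minimal sketch with jsonlite and base R could look like this. The column names User, P1, P2 and domains follow the desired table in the question; the helper variables rows and wide are made up for this example.

```r
library(jsonlite)

raw_df <- data.frame(
  id = 1:2,
  json = c(
    '{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
    '{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
  ),
  stringsAsFactors = FALSE
)

rows <- lapply(raw_df$json, function(j) {
  x <- fromJSON(j)
  data.frame(
    User = x$user,
    P1 = x$weightmap$P1,
    P2 = x$weightmap$P2,
    # collapse the domains array into one comma-separated string
    domains = paste(x$domains, collapse = ", "),
    stringsAsFactors = FALSE
  )
})

wide <- do.call(rbind, rows)
wide
```

This hard-codes the field names, so it only suits JSON whose structure is known in advance.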
I would go for the jsonlite package in combination with the usage of mapply, a transformation function and data.table's rbindlist.
# data
raw_df <- data.frame(id = 1:2, json = c('{"user": "xyz2", "weightmap": {"P1":100,"P2":0}, "domains": ["a2","b2"]}', '{"user": "xyz1", "weightmap": {"P1":0,"P2":100}, "domains": ["a1","b1"]}'), stringsAsFactors = FALSE)
# libraries
library(jsonlite)
library(data.table)
# 1) First, make a transformation function that works for a single entry
f <- function(json, id){
# transform json to list
tmp <- jsonlite::fromJSON(json)
# transform list to data.frame
tmp <- as.data.frame(tmp)
# add id
tmp$id <- id
# return
return(tmp)
}
# 2) apply it via mapply
json_dfs <-
mapply(f, raw_df$json, raw_df$id, SIMPLIFY = FALSE)
# 3) combine the fragments via rbindlist
clean_df <-
data.table::rbindlist(json_dfs)
# 4) et voilà
clean_df
## user weightmap.P1 weightmap.P2 domains id
## 1: xyz2 100 0 a2 1
## 2: xyz2 100 0 b2 1
## 3: xyz1 0 100 a1 2
## 4: xyz1 0 100 b1 2
I could not get the flatten parameter to work as I expected, so I needed to unlist and then "re-list" before rbinding with do.call:
library(jsonlite)
do.call( rbind,
lapply(raw_df$json,
function(j) as.list(unlist(fromJSON(j, flatten=TRUE)))
) )
user weightmap.P1 weightmap.P2 domains1 domains2
[1,] "xyz2" "100" "0" "a2" "b2"
[2,] "xyz1" "0" "100" "a1" "b1"
Admittedly, this will require further processing, since it coerces all the values to character.
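One possible form of that further processing is to let type.convert re-guess each column's type. This is only a sketch: it assumes a plain character matrix shaped like the output above, rebuilt by hand here so the example is self-contained.

```r
# Hand-built character matrix mimicking the output above (an assumption)
m <- rbind(
  c(user = "xyz2", weightmap.P1 = "100", weightmap.P2 = "0",
    domains1 = "a2", domains2 = "b2"),
  c(user = "xyz1", weightmap.P1 = "0", weightmap.P2 = "100",
    domains1 = "a1", domains2 = "b1")
)

df <- as.data.frame(m, stringsAsFactors = FALSE)

# re-guess the type of every column; as.is = TRUE keeps strings as character
df[] <- lapply(df, type.convert, as.is = TRUE)

sapply(df, class)
```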
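library(jsonlite)
library(tidyr)  # provides unite() and re-exports %>%
json = c(
'{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
'{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
)
# wrap each string in [] so that fromJSON returns a one-row data frame
json <- lapply( paste0("[", json ,"]"),
function(x) jsonlite::fromJSON(x))
# unlist yields user, P1, P2 and the two domains for each row
df <- data.frame(matrix(unlist(json), nrow=2, ncol=5, byrow=TRUE))
# merge the two domain columns into one
df <- df %>% unite(Domains, X4, X5, sep = ", ")
colnames(df) <- c("user", "P1", "P2", "domains")
head(df)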
The output is:
user P1 P2 domains
1 xyz1 0 100 a1, b1
2 xyz2 100 0 a2, b2
Using tidyjson
https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html
install.packages("tidyjson")
library(tidyjson)
json_as_df <- raw_df$json %>% spread_all
# retain columns
json_as_df <- raw_df %>% as.tbl_json(json.column = "json") %>% spread_all
I have a problem converting lists to a data.frame.
First, I downloaded a dataset in JSON format from the Data API:
library(httr)      # POST(), add_headers(), content()
library(jsonlite)  # fromJSON(), toJSON()
request1 <- POST(url = "https://api.data-api.io/v1/subjekti", add_headers('x-dataapi-key' = "xxxxxxx", 'content-type'= "application/json"), body = list(oib = oibreq), encode = "json")
json1 <- content(request1, type = "application/json")
json2 <- fromJSON(toJSON(json1, null = "null"), flatten = TRUE)
The problem is that the data are stored as elements of lists. For example:
> json2[['oib']]
[[1]]
[1] "00045103869"
[[2]]
[1] "18527887472"
[[3]]
[1] "92680516748"
All column names:
> colnames(json2)
[1] "oib" "mb" "mbs" "mbo" "rno" "naziv"
[7] "adresa" "grad" "posta" "zupanija" "nkd2007" "puo"
[13] "godinaOsnivanja" "status" "temeljniKapital" "isActive" "datumBrisanja" "predmetPoslovanja"
How can I convert these lists to a data.frame?
Sorry, that was my first question on Stack Overflow. Here is my dataset:
> data <- dput(json3)
structure(list(oib = list("00045103869", "18527887472", "92680516748"),
mb = list("01699032", "03858731", "02591596"), mbs = list(
"080451345", "060060881", "040260786"), mbo = c(NA, NA,
NA), rno = c(NA, NA, NA), naziv = list("INTERIJER DIZAJN d.o.o.",
"M - Đ COMMERCE d.o.o.", "HIP REKLAME d.o.o. u stečaju"),
adresa = list("Savska cesta 179", "Put Piketa 0", "Sadska 2"),
grad = list("Zagreb", "Sinj", "Rijeka"), posta = list("10000",
"21230", "51000"), zupanija = list("Grad Zagreb", "Splitsko-dalmatinska",
"Primorsko-goranska"), nkd2007 = list("1623", "4719",
"4711"), puo = list(92L, 92L, 92L), godinaOsnivanja = list(
"2003", "1995", "2009"), status = list("bez postupka",
"bez postupka", "stečaj"), temeljniKapital = list("20.000,00 kn",
"509.100,00 kn", "20.000,00 kn"), isActive = list(TRUE,
TRUE, FALSE), datumBrisanja = list(NULL, NULL, "2015-12-24T00:00:00+01:00")), .Names = c("oib",
"mb", "mbs", "mbo", "rno", "naziv", "adresa", "grad", "posta",
"zupanija", "nkd2007", "puo", "godinaOsnivanja", "status", "temeljniKapital",
"isActive", "datumBrisanja"), class = "data.frame", row.names = c(NA,
3L))
A quick and dirty way would be to replace the NULL values with, for example, NAs, like this:
f <- function(lst) lapply(lst, function(x) if (is.list(x)) f(x) else if (is.null(x)) NA_character_ else x)
df <- as.data.frame(lapply(f(json2), unlist))
str(df)
# 'data.frame': 3 obs. of 17 variables:
# $ oib : Factor w/ 3 levels "00045103869",..: 1 2 3
# $ mb : Factor w/ 3 levels "01699032","02591596",..: 1 3 2
# $ mbs : Factor w/ 3 levels "040260786","060060881",..: 3 2 1
# $ mbo : logi NA NA NA
# $ rno : logi NA NA NA
# $ naziv : Factor w/ 3 levels "HIP REKLAME d.o.o. u stecaju",..: 2 3 1
# $ adresa : Factor w/ 3 levels "Put Piketa 0",..: 3 1 2
# $ grad : Factor w/ 3 levels "Rijeka","Sinj",..: 3 2 1
# $ posta : Factor w/ 3 levels "10000","21230",..: 1 2 3
# $ zupanija : Factor w/ 3 levels "Grad Zagreb",..: 1 3 2
# $ nkd2007 : Factor w/ 3 levels "1623","4711",..: 1 3 2
# $ puo : int 92 92 92
# $ godinaOsnivanja: Factor w/ 3 levels "1995","2003",..: 2 1 3
# $ status : Factor w/ 2 levels "bez postupka",..: 1 1 2
# $ temeljniKapital: Factor w/ 2 levels "20.000,00 kn",..: 1 2 1
# $ isActive : logi TRUE TRUE FALSE
# $ datumBrisanja : Factor w/ 1 level "2015-12-24T00:00:00+01:00": NA NA 1
But there may be better options.
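One such option, sketched here under the assumption that every list column holds at most one scalar per row, is to replace NULL with NA column by column and pass stringsAsFactors = FALSE so that text stays character instead of becoming a factor. The three-row data frame and the helper name null_to_na are made up for this example.

```r
# Made-up miniature of json2: list columns, one of them containing NULLs
json2 <- data.frame(id = 1:3)
json2$oib <- list("00045103869", "18527887472", "92680516748")
json2$datumBrisanja <- list(NULL, NULL, "2015-12-24T00:00:00+01:00")

# hypothetical helper: turn a list column into an atomic vector,
# substituting NA for NULL so that no row is dropped by unlist()
null_to_na <- function(col) {
  if (!is.list(col)) return(col)
  unlist(lapply(col, function(x) if (is.null(x)) NA else x))
}

clean <- as.data.frame(lapply(json2, null_to_na), stringsAsFactors = FALSE)
str(clean)
```

Columns that are already atomic, such as mbo and rno in the real data, pass through unchanged.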