Tidyjson: is there an 'exit_object()' equivalent?

I'm using the tidyjson package to parse a JSON string and extract the key values into columns. The JSON is nested, and while I can drill down into a node, I can't figure out a way to go back up to the previous level. The code is below:
library(tidyjson)
library(data.table)
library(dplyr)

input <- '{
  "name": "Bob",
  "age": 30,
  "social": {
    "married": "yes",
    "kids": "no"
  },
  "work": {
    "title": "engineer",
    "salary": 5000
  }
}'

output <- input %>% as.tbl_json() %>%
  spread_values(name = jstring("name"),
                age = jnumber("age")) %>%
  enter_object("social") %>%
  spread_values(married = jstring("married"),
                kids = jstring("kids")) %>%
  #### I would need an exit_object() here
  enter_object("work") %>%
  spread_values(title = jstring("title"),
                salary = jnumber("salary"))

There's a note in the documentation:
"Note that there are often situations where there are multiple arrays
or objects of differing types that exist at the same level of the JSON
hierarchy. In this case, you need to use enter_object() to enter each
of them in separate pipelines to create separate data.frames that can
then be joined relationally."
As such, I've been staging my tidyjson commands and putting the outputs together with merge, e.g.:
# first the high-level values
output_table <- input_tbl_json %>%
  spread_values(val1 = jstring('val1'),
                val2 = jnumber('val2'))
# then enter an object and get something from inside, merging it as a new column
output_table <- merge(output_table,
                      input_tbl_json %>%
                        enter_object('thing') %>%
                        spread_values(val3 = jstring('thing1')),
                      by = c('document.id'))
The output table columns should look like | document.id | val1 | val2 | val3 |.
That workflow may fall over with operations like gather_keys() that add rows, but I haven't had occasion to test it.
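Applied to the JSON in the question, that staged workflow would look something like this minimal sketch (the social and work variable names are mine; document.id is added automatically by as.tbl_json()):
library(tidyjson)
library(dplyr)

input_tbl_json <- input %>% as.tbl_json()

# pipeline 1: top-level values plus the "social" object
social <- input_tbl_json %>%
  spread_values(name = jstring("name"),
                age = jnumber("age")) %>%
  enter_object("social") %>%
  spread_values(married = jstring("married"),
                kids = jstring("kids"))

# pipeline 2: the "work" object, in place of an exit_object()
work <- input_tbl_json %>%
  enter_object("work") %>%
  spread_values(title = jstring("title"),
                salary = jnumber("salary"))

# join the two pipelines relationally, as the documentation suggests
output <- merge(social, work, by = "document.id")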

I think an overlooked piece of functionality within tidyjson is the ability to use more complex paths in the jnumber, jstring, etc. functions.
You can do something like the following without "entering an object." I find this to be a very satisfying solution, for the most part. Perhaps more satisfying than multiple enter/exits.
input <- '{
  "name": "Bob",
  "age": 30,
  "social": {
    "married": "yes",
    "kids": "no"
  },
  "work": {
    "title": "engineer",
    "salary": 5000
  }
}'

output <- input %>% as.tbl_json() %>%
  spread_values(
    name    = jstring('name'),
    age     = jnumber('age'),
    married = jstring('social', 'married'),
    kids    = jstring('social', 'kids'),
    title   = jstring('work', 'title'),
    salary  = jnumber('work', 'salary')
  )
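For reference, this should produce a single-row result along these lines (values read directly off the JSON above; the exact print format depends on your tidyjson version):
output
#   document.id name age married kids title    salary
# 1           1  Bob  30     yes   no engineer   5000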

Related

Web scrape request does not work for long date range input in R

The code here works for web scraping by sending repeated requests based on the input date range (startDate and endDate). The data is then saved in a csv file. I have used this code before with a different xPath in the html_node() argument, and it worked fine. Now, with a different xPath, it seems it cannot handle a longer date range. I also can't detect which data went missing, because the code fails when the heading element is applied via attr(). I've tried increasing the range slowly and it only works until [1:29]. After that, no matter what range is used, it keeps showing the old return. In some cases, I can see str(new_df) complete the request, but it keeps saving the old return, as if the bind_rows fails. Sometimes (using dates from a different year), I was able to extract the desired range by slowly increasing it (the desired range is [1:92]). That made me hopeful, but when I change the date input to another year, it returns only the last record, sometimes with an error and sometimes without. I include the lengthy code here so anyone can reproduce it. I wonder if the website is burdened by the repeated requests or my PC is getting muzzy. Kindly help.
get_sounding_data <- function(region = c("naconf", "samer", "pac", "nz", "ant",
                                         "np", "europe", "africa", "seasia", "mideast"),
                              date,
                              from_hr = c("00", "12", "all"),
                              to_hr = c("00", "12", "all"),
                              station_number = 48615) {
  # we use these pkgs; the parsing at the end also needs stringr, purrr,
  # tibble, tidyr and dplyr (for str_split/str_squish, map_chr, tibble,
  # separate/spread and filter)
  suppressPackageStartupMessages({
    require("xml2", quietly = TRUE)
    require("httr", quietly = TRUE)
    require("rvest", quietly = TRUE)
    require("stringr", quietly = TRUE)
    require("purrr", quietly = TRUE)
    require("tibble", quietly = TRUE)
    require("tidyr", quietly = TRUE)
    require("dplyr", quietly = TRUE)
  })
  # validate region
  region <- match.arg(
    arg = region,
    choices = c(
      "naconf", "samer", "pac", "nz", "ant",
      "np", "europe", "africa", "seasia", "mideast"
    )
  )
  # this actually validates the date for us if it's a character string
  date <- as.Date(date)
  # get year and month
  year <- as.integer(format(date, "%Y"))
  stopifnot(year %in% 1973:as.integer(format(Sys.Date(), "%Y")))
  year <- as.character(year)
  month <- format(date, "%m")
  # we need these to translate day & *_hr to the param the app needs
  c(
    "0100", "0112", "0200", "0212", "0300", "0312", "0400", "0412",
    "0500", "0512", "0600", "0612", "0700", "0712", "0800", "0812",
    "0900", "0912", "1000", "1012", "1100", "1112", "1200", "1212",
    "1300", "1312", "1400", "1412", "1500", "1512", "1600", "1612",
    "1700", "1712", "1800", "1812", "1900", "1912", "2000", "2012",
    "2100", "2112", "2200", "2212", "2300", "2312", "2400", "2412",
    "2500", "2512", "2600", "2612", "2700", "2712", "2800", "2812",
    "2900", "2912", "3000", "3012", "3100", "3112"
  ) -> hr_vals
  c(
    "01/00Z", "01/12Z", "02/00Z", "02/12Z", "03/00Z", "03/12Z", "04/00Z",
    "04/12Z", "05/00Z", "05/12Z", "06/00Z", "06/12Z", "07/00Z", "07/12Z",
    "08/00Z", "08/12Z", "09/00Z", "09/12Z", "10/00Z", "10/12Z", "11/00Z",
    "11/12Z", "12/00Z", "12/12Z", "13/00Z", "13/12Z", "14/00Z", "14/12Z",
    "15/00Z", "15/12Z", "16/00Z", "16/12Z", "17/00Z", "17/12Z", "18/00Z",
    "18/12Z", "19/00Z", "19/12Z", "20/00Z", "20/12Z", "21/00Z", "21/12Z",
    "22/00Z", "22/12Z", "23/00Z", "23/12Z", "24/00Z", "24/12Z", "25/00Z",
    "25/12Z", "26/00Z", "26/12Z", "27/00Z", "27/12Z", "28/00Z", "28/12Z",
    "29/00Z", "29/12Z", "30/00Z", "30/12Z", "31/00Z", "31/12Z"
  ) -> hr_inputs
  hr_trans <- stats::setNames(hr_vals, hr_inputs)
  o_from_hr <- from_hr <- as.character(tolower(from_hr))
  o_to_hr <- to_hr <- as.character(tolower(to_hr))
  if ((from_hr == "all") || (to_hr == "all")) {
    from_hr <- to_hr <- "all"
  } else {
    from_hr <- hr_trans[sprintf("%s/%02dZ", format(date, "%d"), as.integer(from_hr))]
    match.arg(from_hr, hr_vals)
    to_hr <- hr_trans[sprintf("%s/%02dZ", format(date, "%d"), as.integer(to_hr))]
    match.arg(to_hr, hr_vals)
  }
  # clean up the station number if it was entered as a double
  station_number <- as.character(as.integer(station_number))
  # execute the API call
  httr::GET(
    url = "http://weather.uwyo.edu/cgi-bin/sounding",
    query = list(
      region = region,
      TYPE = "TEXT:LIST",
      YEAR = year,
      MONTH = sprintf("%02d", as.integer(month)),
      FROM = from_hr,
      TO = to_hr,
      STNM = station_number
    )
  ) -> res
  # check for super bad errors (that we can't handle nicely)
  httr::stop_for_status(res)
  # get the page content
  doc <- httr::content(res, as = "text")
  # if the site reports no data, issue a warning and return an empty data frame
  if (grepl("Can't get", doc)) {
    doc <- xml2::read_html(doc)
    msg <- rvest::html_nodes(doc, "body")
    msg <- rvest::html_text(msg, trim = TRUE)
    msg <- gsub("\n\n+.*$", "", msg)
    warning(msg)
    return(data.frame(stringsAsFactors = FALSE))
  }
  # turn it into something we can parse
  doc <- xml2::read_html(doc)
  # get the metadata
  #meta <- rvest::html_node(doc, "h2")
  #meta <- rvest::html_text(meta, trim=TRUE)
  #attr(doc, "meta") <- meta
  raw_dat <- doc %>%
    html_nodes("pre + h3") %>%
    html_text()
  # split the page text into lines and keep the "Station: Value" pairs
  indices <- doc %>%
    str_split(pattern = "\n", simplify = TRUE) %>%
    map_chr(str_squish) %>%
    tibble(x = .) %>%
    separate(x, into = c("Station", "Value"), sep = ": ") %>%
    filter(!is.na(Value))
  # one row per sounding, one column per index
  data <- tidyr::spread(indices, Station, Value)
  data
}
# tidyverse loads the packages used below (purrr for map, readr for
# parse_guess, plus dplyr and tidyr)
library(tidyverse)

startDate <- as.Date("01-11-1984", format = "%d-%m-%Y")
endDate <- as.Date("04-11-1984", format = "%d-%m-%Y")
#startDate <- as.Date("01-11-1984", format="%d-%m-%Y")
#endDate <- as.Date("31-01-1985",format="%d-%m-%Y")
days <- seq(startDate, endDate, "day")
# wanted to have [1:92], but it's not working
lapply(days[1:4], function(day) {
  get_sounding_data(
    region = "seasia",
    date = day,
    from_hr = "00",
    to_hr = "00",
    station_number = "48615"
  )
}) -> soundings_48615
warnings()
new_df <- map(soundings_48615, . %>% mutate_all(parse_guess))
#str(new_df)
dat <- bind_rows(new_df)
dat <- dat %>% separate(col = `Observation time`, into = c('Date', 'time'), sep = '/')
dat$Date <- as.Date(dat$Date, format = "%y%m%d")
# save to file
library(xlsx)
write.csv(dat, 'c:/Users/Hp/Documents/1984.csv')
get_sounding_data <- NULL
The errors:
Error in bind_rows_(x, .id) :
  Column `1000 hPa to 500 hPa thickness` can't be converted from numeric to character
dat <- dat %>% separate(col = `Observation time`, into = c('Date', 'time'), sep = '/')
Error in eval_tidy(enquo(var), var_env) :
  object 'Observation time' not found
I've installed different R versions, but this error keeps coming up, so I ignored it.
Error: package or namespace load failed for ‘xlsx’:
 .onLoad failed in loadNamespace() for 'rJava', details:
  call: fun(libname, pkgname)
  error: No CurrentVersion entry in Software/JavaSoft registry! Try re-installing Java and make sure R and Java have matching architectures.
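No answer is quoted here, but one plausible cause of the bind_rows() error is that mutate_all(parse_guess) guesses column types independently for each day's data frame, so a column such as `1000 hPa to 500 hPa thickness` can come back numeric for one day and character for another. A minimal sketch of a workaround, assuming the soundings_48615 list from above and dplyr >= 1.0 for across(): keep every column as character while binding, then parse the combined table once. The follow-on 'Observation time' error may be a knock-on effect of days where the site returned no data, since those produce empty data frames without that column. (As an aside, write.csv() is base R, so the failing library(xlsx) call can simply be dropped.)
library(tidyverse)

# coerce everything to character first so bind_rows() cannot hit
# conflicting column types, then guess types once on the combined table
dat <- soundings_48615 %>%
  map(~ mutate(.x, across(everything(), as.character))) %>%
  bind_rows() %>%
  mutate(across(everything(), readr::parse_guess))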

Passing a character variable into a function in R (Tidyjson)

I'm messing around with tidyjson (the latest from GitHub, published by Jeremy Stanley). I wanted to more or less automate searching for and extracting the nested arrays. The examples below provide the output I want.
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object("name") %>%
gather_keys %>%
append_values_string
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object(name) %>%
gather_keys %>%
append_values_string
These both give the same output:
# A tbl_json: 2 x 3 tibble with a "JSON" attribute
  `attr(., "JSON")` document.id key   string
  <chr>                   <int> <chr> <chr>
1 "bob"                       1 first bob
2 "jones"                     1 last  jones
However, if I declare a character variable beforehand and try to pass it along, it fails:
object_name <- "name"
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
  enter_object(list(name = "name")) %>%
  gather_keys %>%
  append_values_string

Error: Path components must be single names or character strings
Any ideas why this would happen?
If you are familiar with Hadley's book Advanced R, this is a piece of non-standard evaluation that unfortunately does not presently have a workaround in pure tidyjson (I would prefer an enter_object_ that uses standard evaluation, more like dplyr). I am hopeful that functionality will be available at some point, because as you suggest, it would be nice to vectorize and automate these sorts of programs.
The non-standard evaluation is the "magic" that allows you to pass in the unquoted name and still get good results in your second example (instead of the program looking for an object called name). The hazard is that it does not resolve variables like object_name in your case.
That said, it seems you can work around this with do.call and a list of parameters (I fixed your example, as I think it went a bit awry):
library(tidyjson)
json <- '{"name": {"first": "bob", "last": "jones"}, "age": 32}'
object_name <- "name"

do.call(enter_object, args = list(json, object_name)) %>%
  gather_object %>%
  append_values_string
#> # A tbl_json: 2 x 3 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id name  string
#>   <chr>                   <int> <chr> <chr>
#> 1 "\"bob\""                   1 first bob
#> 2 "\"jones\""                 1 last  jones
I definitely recommend checking out some of the new features and functionality in the development version of tidyjson with devtools::install_github('jeremystan/tidyjson'), but unfortunately there is no support yet for standard evaluation in paths.
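Building on the do.call() workaround, iterating over a vector of object names might look something like the sketch below (enter_one is a made-up helper name, not tidyjson API):
library(tidyjson)

json <- '{"name": {"first": "bob", "last": "jones"}, "age": 32}'

# hypothetical helper: enter an object whose name is held in a character variable
enter_one <- function(x, obj) {
  do.call(enter_object, args = list(x, obj)) %>%
    gather_object %>%
    append_values_string
}

# iterate over several object names (only "name" exists in this JSON)
object_names <- c("name")
results <- setNames(lapply(object_names, function(obj) enter_one(json, obj)),
                    object_names)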

Extract data in columnar format from JSON in R

I want to extract data from the json object in R
R packages used: tidyjson, magrittr, jsonlite.
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
  gather_array %>% # stack as an array
  spread_values(USER_ID = jstring("KEYS.USER_ID"),
                MOBILE_NO = jstring("KEYs.MOBILE_NO"))
The output of this code is not as required. Does anyone have suggestions?
  document.id array.index USER_ID MOBILE_NO
1           1           1    <NA>      <NA>
Expected output:
  document.id array.index USER_ID  MOBILE_NO
1           1           1    1266 9000000000
tidyjson uses multi-parameter paths rather than the "dot-separated" paths you attempted. You can tackle this in two ways:
Recommended, as it does not throw away the rest of the object:
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
  gather_array %>% # stack as an array
  spread_values(USER_ID = jstring('KEYS', 'USER_ID'),
                MOBILE_NO = jstring('KEYS', 'MOBILE_NO'))

You can also use enter_object if preferred or necessary:
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
  gather_array %>% # stack as an array
  enter_object('KEYS') %>%
  spread_values(USER_ID = jstring('USER_ID'),
                MOBILE_NO = jstring('MOBILE_NO'))

R: jsonlite - export key:value pairs from a list of lists

I have a list of lists which are of variable length. The first value of each nested list is the key, and the rest of the values in the list will be the array entry. It looks something like this:
[[1]]
[1] "Bob" "Apple"
[[2]]
[1] "Cindy" "Apple" "Banana" "Orange" "Pear" "Raspberry"
[[3]]
[1] "Mary" "Orange" "Strawberry"
[[4]]
[1] "George" "Banana"
I've extracted the keys and entries as follows:
keys <- lapply(x, '[', 1)
entries <- lapply(x, '[', -1)
but now that I have these, I don't know how to associate a key:value pair in R without creating a matrix first, which is silly since my data don't fit in a rectangle anyway (every example I've seen uses the column names of a matrix as the keys).
This is my crappy method using a matrix, assigning column names, and then using jsonlite to export to JSON.
# Create a matrix from entries, without recycling
# I found this function on StackOverflow which seems to work...
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}
# Call said function
matrix <- cbind.fill(entries)
# Transpose the thing
matrix <- t(matrix)
# Set column names
colnames(matrix) <- keys
# Export to json
json <- toJSON(matrix)
The result is good, but the implementation sucks. Result:
[{"Bob":["Apple"],"Cindy":["Apple","Banana","Orange","Pear","Raspberry"],"Mary":["Orange","Strawberry"],"George":["Banana"]}]
Please let me know of better ways that might exist to accomplish this.
How about:
names(entries) <- unlist(keys)
toJSON(entries)
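Run end-to-end on the sample data from the question, that gives (wrapping in list() reproduces the enclosing JSON array from the original output):
library(jsonlite)

x <- list(c("Bob", "Apple"),
          c("Cindy", "Apple", "Banana", "Orange", "Pear", "Raspberry"),
          c("Mary", "Orange", "Strawberry"),
          c("George", "Banana"))

keys <- lapply(x, '[', 1)
entries <- lapply(x, '[', -1)

names(entries) <- unlist(keys)
toJSON(entries)
# {"Bob":["Apple"],"Cindy":["Apple","Banana","Orange","Pear","Raspberry"],"Mary":["Orange","Strawberry"],"George":["Banana"]}

toJSON(list(entries))
# [{"Bob":["Apple"],"Cindy":["Apple","Banana","Orange","Pear","Raspberry"],"Mary":["Orange","Strawberry"],"George":["Banana"]}]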
Consider the following lapply() approach:
library(jsonlite)

entries <- list(c('Bob', 'Apple'),
                c('Cindy', 'Apple', 'Banana', 'Orange', 'Pear', 'Raspberry'),
                c('Mary', 'Orange', 'Strawberry'),
                c('George', 'Banana'))

# ITERATE ALL CONTENTS EXCEPT FIRST
nestlist <- lapply(entries,
                   function(i) {
                     inner <- i[2:length(i)]
                     return(inner)
                   })
# NAME EACH ELEMENT WITH FIRST ELEMENT
names(nestlist) <- lapply(entries, function(i) i[1])
#$Bob
#[1] "Apple"
#$Cindy
#[1] "Apple"     "Banana"    "Orange"    "Pear"      "Raspberry"
#$Mary
#[1] "Orange"     "Strawberry"
#$George
#[1] "Banana"

x <- toJSON(list(nestlist), pretty = TRUE)
x
#[
#  {
#    "Bob": ["Apple"],
#    "Cindy": ["Apple", "Banana", "Orange", "Pear", "Raspberry"],
#    "Mary": ["Orange", "Strawberry"],
#    "George": ["Banana"]
#  }
#]
I think this has already been sufficiently answered but here is a method using purrr and jsonlite.
library(purrr)
library(jsonlite)

sample_data <- list(
  list("Bob", "Apple"),
  list("Cindy", "Apple", "Banana", "Orange", "Pear", "Raspberry"),
  list("Mary", "Orange", "Strawberry"),
  list("George", "Banana")
)

sample_data %>%
  map(~ set_names(list(.x[-1]), .x[1])) %>%
  toJSON(auto_unbox = TRUE, pretty = TRUE)

Trouble spreading values using tidyjson

I am trying to convert the following multi-document JSON file into a data.frame.
x = '[
{"name": "Bob","groupIds": ["kwt6x61", "yiahf43"]},
{"name": "Sally","groupIds": "yiahf43"}
]'
I'm almost there by using
y = x %>% gather_array() %>%
  spread_values(
    name = jstring("name"),
    groupIds = jstring("groupIds")
  )
print(y)
Which returns:
  document.id array.index  name                   groupIds
1           1           1   Bob list("kwt6x61", "yiahf43")
2           1           2 Sally                    yiahf43
Can someone help spread the groupIds into additional rows?
This is an interesting problem. The issue stems from the fact that a single groupId is stored as a plain string rather than as a length-one array. Otherwise, enter_object('groupIds') %>% gather_array %>% append_values_string would work nicely. tidyjson does not seem to handle this situation gracefully. (Mixing a string and an array under the same key is still well-formed JSON, but it makes for an awkward schema.)
In any case, although this is not an ideal solution, you can use json_types() to expose the difference and then treat each case conditionally. I converted to a tbl_df (i.e., dropped the JSON component) for further processing once parsing is done.
library(tidyjson)
library(dplyr)
library(tidyr)

x = '[
  {"name": "Bob", "groupIds": ["kwt6x61", "yiahf43"]},
  {"name": "Sally", "groupIds": "yiahf43"}
]'

## Show the different types
z <- x %>% gather_array() %>%
  spread_values(name = jstring('name')) %>%
  enter_object('groupIds') %>%
  json_types()

## Conditionally treat each
final <- bind_rows(
  z[z$type == 'array', ] %>% gather_array('id') %>% append_values_string('groupId'),
  z[z$type == 'string', ] %>% append_values_string('groupId') %>% mutate(id = 1)
) %>% tbl_df

## Spread them out, maybe? Depends on what you're looking for
final %>% spread('id', 'groupId')
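For reference, before the final spread() the final object should hold one row per (name, groupId) pair, roughly along these lines (my reading of the code above, not verified output):
final
# # A tibble: 3 x 6
#   document.id array.index name  type    id groupId
# 1           1           1 Bob   array    1 kwt6x61
# 2           1           1 Bob   array    2 yiahf43
# 3           1           2 Sally string   1 yiahf43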