How can I get multiple outputs from single operation? - function

These are the codes that have been written to analyze within and between interactions of different species.
in this code, I tried to get separate outputs from each analyzes unsuccessfully.
lapply(data.list, function(x) {
grp <- factor(x$species)
window <- ripras(x$utmX, x$utmY)
pp.grp <- ppp(x$utmX, x$utmY, window=window, marks=grp)
split.grp <- split(pp.grp)
L <- (alltypes(pp.grp, "L"))
LE <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
return("L", "LE")
})
plot(L[1])
So my question is how I can get multiple outputs from a single operation?
Thank you so much in advance!

The most common way to handle multiple outputs from a function in R is to put
the results in a list and return that. Hopefully this can inspire you:
f <- function(x){
L <- x
LE <- matrix(x, 2, 2)
rslt <- list(L = L, LE = LE)
return(rslt)
}
y <- f(7)
Now y is a list with two elements: L and LE
y
#> $L
#> [1] 7
#>
#> $LE
#> [,1] [,2]
#> [1,] 7 7
#> [2,] 7 7
Use $ to get a named element (in this case L – same as y[[1]]):
y$L
#> [1] 7
Created on 2019-03-16 by the reprex package (v0.2.1)

Related

Webscraping does not identify the full HTML link

My problem is very similar to this one. I want to identify all the HTML links in this website so I can then open the link and download the tables.
The problem is that when I create the extract_links functions as pointed out in that answer, I get a list of all the HTMLs, but this are not complete.
To make it more clear:
If you press "Junio" in year "2022" the real HTML is the following:
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=7658
but the HTML that I am recovering from the source of the website lacks the last bit (&ia=7658):
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022
Which does not direct me to the table I want.
The problem is that these numbers do not seem to follow any logic and change between year/month links. Any help on how to retrieve the full HTML links will be greatly appreciated. If you also happen to know how can I retrieve the year/month of the file to add as an extra column that would also be great.
Thanks to the help of #margusl I was able to realize that rvest redirects automatically and that solves my problem.
I am trying to use the following code to loop over different links to obtain the tables, store them in a data frame and then download them:
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
combined_df <- data.frame()
for (yr in yr.list){
for (mes in mes.list) {
root <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21"
# Full link
url <- paste(root,"&m=",mes,"&a=",yr,sep="")
# Parse HTML File
file<-read_html(url, encoding = "latin1")
file<- rvest::html_table(file)
str(file)
# This is the relevant table
table <- as.data.frame(file[[1]])
# in your loop, add the files that you read to the combined_df
combined_df <- rbind(combined_df, table)
}
}
It does not work because the read_html code with the encoding works only for some years, but not for all. for example, when running:
url <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015"
file<-read_html(url, encoding = "latin1")
It does not recover the tables with names/surnames that recovers in the previous months but something else. Why can't this work on all the sub-pages? Is this a encoding problem again?
If you open that last page you had issues with, you'll see that it serves a sort of a submenu with 2 more links - http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015 . Meaning that it's not enough to just generate links for each month & year and extract first table of each page, all those pages should be checked for content and exceptions should be handled.
Though I took somewhat opportunistic approach and it happened to work with URL range defined in question + those few odd samples, but there could be other surprises down the road. Switched to httr for making requests as it allows to collect and monitor response headers, separating content retrieval and parsing also seems to work around encoding issues, at least in this case. First collecting and then parsing also simplifies debugging, you can check if certain responses / headers were different from the rest (i.e. response length being 10x smaller than average or final, redirected, url differs from the rest). And it's easy to change content handling
/ parsing for a small subset of responses, if needed. If you are not sure what rvest has retrieved, you can always save the response to a html file and check it with browser or editor, something like
html <- read_html(url_or_text_content); write(as.character(html), "dump.html")
library(rvest)
library(httr)
library(purrr)
library(dplyr)
library(tidyr)
library(stringr)
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
# combine mes.list & yr.list
url.params <- expand.grid(mes = mes.list, yr = yr.list)
# few extra samples:
url.params <- rbind(url.params,
list(mes = 6, yr = 2022), # here rvest strugglest with correct encoding
list(mes = 3, yr = 2015) # returns page with sub-categories
)
url.list <- str_glue("http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m={url.params$mes}&a={url.params$yr}")
url.list
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=1&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=2&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2019
#> ...
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=11&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=12&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=6&a=2022
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# url list for input, output is a tibble with all responses (incl. "url", "date",
# "status_code", header details and response body)
fetch_urls <- function(url.list){
# collect all responses to a list with httr, enable verbose, parse responses later
# add progress bar - requests take a while
resp.list = vector(mode = "list", length = length(url.list))
pb <- txtProgressBar(max = length(url.list), style = 3)
for (i in seq_along(url.list)){
resp.list[[i]] <- GET(url.list[i])
setTxtProgressBar(pb,i)
}
close(pb)
# turn responses into tibble to check urls, response sizes and status codes
resp.tibble <- bind_cols(
map_df(resp.list, ~ .[c("url", "date", "status_code")], .id = "req_id"),
map_df(resp.list, headers) %>% rename_with(~ paste0("header_",.x)),
# map_df(resp_follow.list, "times"),
map_chr(resp.list, content, as = "text") %>% tibble(html_doc = .)
)
return(resp.tibble)
}
resp.tibble <- fetch_urls(url.list)
# check resulting table without html_doc column
# View(resp.tibble[-ncol(resp.tibble)])
resp.tibble %>%
select(req_id:status_code,`header_content-length`) %>%
arrange(`header_content-length`)
#> # A tibble: 26 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 14 http://transparencia.uantof.cl/in… 2022-10-31 17:29:12 200 21371
#> 2 26 http://transparencia.uantof.cl/in… 2022-10-31 17:31:45 200 2230
#> 3 24 http://transparencia.uantof.cl/in… 2022-10-31 17:31:21 200 24035
#> 4 21 http://transparencia.uantof.cl/in… 2022-10-31 17:30:42 200 24173
#> 5 20 http://transparencia.uantof.cl/in… 2022-10-31 17:30:29 200 24183
#> 6 23 http://transparencia.uantof.cl/in… 2022-10-31 17:31:08 200 24184
#> 7 22 http://transparencia.uantof.cl/in… 2022-10-31 17:30:55 200 24207
#> 8 18 http://transparencia.uantof.cl/in… 2022-10-31 17:30:04 200 24405
#> 9 16 http://transparencia.uantof.cl/in… 2022-10-31 17:29:38 200 24715
#> 10 7 http://transparencia.uantof.cl/in… 2022-10-31 17:27:32 200 24716
#> # … with 16 more rows, and abbreviated variable names ¹​status_code,
#> # ²​`header_content-length`
# 26. is kind of suspicious:
# 25 http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=76…
# 26 http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# looks like there has been no redirection and its header_content-length is about 10x smaller than for other responses
# checking it more closely reveals that the page includes a "submenu" instead of table(s):
# <p class="subMenu_interiores">
# <b>2015 - Marzo</b>
# ABRIL 2015
# Marzo 2015
# </p>
# lets' collect urls that were not redirected from our tibble and harvest links from stored html:
suburl.list <- resp.tibble %>%
# urls that do NOT include "plantillas_generar_plantilla"
filter(!str_detect(url, "plantillas_generar_plantilla")) %>%
pull(html_doc) %>%
# rvest does not like lists, thus let's map()
map( ~ read_html(.x) %>% html_elements("#columna1_interiores a") %>% html_attr("href")) %>%
unlist() %>%
paste0("http://transparencia.uantof.cl/",.)
suburl.list
#> [1] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=772"
#> [2] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=648"
# fetch content from those submenu urls
subresp.tibble <- fetch_urls(suburl.list)
# sanity check:
subresp.tibble %>%
select(req_id:status_code,`header_content-length`)
#> # A tibble: 2 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 1 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:52 200 25385
#> 2 2 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:59 200 25332
#> # … with abbreviated variable names ¹​status_code, ²​`header_content-length`
# better, sizes align with previous results.
# collect all relevant responses
table_1 <- resp.tibble %>%
filter(str_detect(url, "plantillas_generar_plantilla")) %>%
bind_rows(subresp.tibble) %>%
# extract html (as strings)
pull(html_doc) %>%
# rvest does not like lists, thus let's map(), pluck(1) extracts first table (from each page)
map(~ read_html(.x) %>% html_table() %>% pluck(1)) %>%
# first attempt to bind rows fails, aparently column types differ
# change all non-character columns to character
map (~ mutate(.x, across(!where(is.character),as.character))) %>%
# bind all tables by rows
bind_rows()
# columns vary across tables so number of NA fields in final result is rather high
Final result for 26 pages, a 10,987 × 30 tibble:
table_1
#> # A tibble: 10,987 × 30
#> Nº PLANTA PATERNO MATERNO NOMBRES G TITULO CARGO REGION ASIGN…¹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 DIRECTIVO ABARZA CASTRO JESSIC… 6 ADMIN… JEFE… SEGUN… (1)(8)…
#> 2 2 PROFESIONAL ABASOLO QUINTE… NURY D… 12 EDUCA… PROF… SEGUN… (4)(8)…
#> 3 3 ACADEMICO ACOSTA PENA ROXANA… 11 EDUCA… PROF… SEGUN… (2)(8)…
#> 4 4 AUXILIARES ACOSTA PIZARRO ROBERT… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 5 5 DIRECTIVO AGENO SIERRA ROSELL… 4 MATRO… DIRE… SEGUN… (1)(8)…
#> 6 6 AUXILIARES AGUIRRE LAZO RENE G… 16 LICEN… AUXI… SEGUN… (7)(8)…
#> 7 7 TECNICOS ALAMOS MARIN SERGIO… 13 TECNI… TECN… SEGUN… (5)(8)…
#> 8 8 AUXILIARES ALAYANA CORTES CHRIST… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 9 9 ACADEMICO ALCOTA AGUIRRE PATRIC… 9 ING. … PROF… SEGUN… (2)(8)…
#> 10 10 ADMINISTRATI… ALFARO BARRAZA MARIA … 23 LICEN… ADMI… SEGUN… (6)(8)…
#> # … with 10,977 more rows, 20 more variables: `UNID MONETARIA` <chr>,
#> # `REMUNERACION MENSUAL BRUTA` <chr>, HORAS <chr>, `CANT. HORAS` <chr>,
#> # `MONTO HORAS EXTRAS` <chr>, `FECHA DE INGRESO` <chr>, `F. HASTA` <chr>,
#> # OBSERVACIONES <chr>, GRADO <chr>, ESTAMENTO <chr>,
#> # `Apellido Paterno` <chr>, `Apellido Materno` <chr>, Nombres <chr>,
#> # `Grado ERUA` <chr>, `CALIFICACION PROFESIONAL O FORMACION` <chr>,
#> # `CARGO O FUNCION` <chr>, `R BRUTA` <chr>, `Horas Extras` <chr>, …
Created on 2022-10-31 with reprex v2.0.2

Is there a reason the xgboost code snippet from the usemodels package has one_hot set to TRUE?

Is there a reason the recipe code snippet for xgboost classifier has one_hot = TRUE? This creates "n" dummy variables instead of "n-1". I usually set it to FALSE but just want to make sure I'm not missing something.
Code -
data <- mtcars %>%
as_tibble() %>%
mutate(cyl = cyl %>% as.factor)
usemodels::use_xgboost(mpg ~ cyl, data = data)
Output -
xgboost_recipe <-
recipe(formula = mpg ~ cyl, data = data) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
loss_reduction = tune(), sample_size = tune()) %>%
set_mode("regression") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
set.seed(28278)
xgboost_tune <-
tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
The idea there is that, as a tree-based model, xgboost can handle all the levels (unlike a linear model) and can actually require more splits to fit well if you don't include all the categories. Read more about this here.
You don't see the same for the ranger random forest because it can handle factors natively.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
cars <- as_tibble(mtcars) %>%
mutate(cyl = cyl %>% as.factor)
usemodels::use_ranger(mpg ~ cyl, data = cars)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
#> ranger_recipe <-
#> recipe(formula = mpg ~ cyl, data = cars)
#>
#> ranger_spec <-
#> rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
#> set_mode("regression") %>%
#> set_engine("ranger")
#>
#> ranger_workflow <-
#> workflow() %>%
#> add_recipe(ranger_recipe) %>%
#> add_model(ranger_spec)
#>
#> set.seed(54153)
#> ranger_tune <-
#> tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
Created on 2021-04-07 by the reprex package (v2.0.0)

Parsing incomplete lists into data frames with two different problems

If you request web data through R, you often work with json or xml where the fields are not named if there is no value for them. Sometimes, there isn't even any data and it comes out as an empty list for a certain index. So, I see this as two different problems. I'm proposing the solution I use to solve this as well but I know there are some better ones out there. I have for starters, a very messy and fake list that I created that is missing field names (on purpose from the xml, json spec) AND missing whole indexes (also on purpose).
(messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5)))
Now, here is how I break it down to what I would say is "solved".
library(plyr)
messy_list_no_empties <- lapply(messy_list, function(x) if(length(x) == 0) {list(NA, NA)} else x)
ldply(messy_list_no_empties, data.frame)[,1:2]
The end result is what I am looking for but I would like to find a more elegant way to deal with this problem.
With purrr::map_df,
library(purrr)
messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5))
messy_list %>% map_df(~list(x = .x$x %||% NA,
y = .x$y %||% NA))
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
map_df iterates over the list like lapply and coerces the results to a data.frame. The function (in purrr's formula form) assembles a list with an x and a y element, looking for existing values if they're there. If they're not, the subsetting will return NULL, which %||% will replace with the value after it, NA.
In mostly-equivalent base R,
as.data.frame(do.call(rbind,
lapply(messy_list, function(.x){
list(x = ifelse(is.null(.x$x), NA, .x$x),
y = ifelse(is.null(.x$y), NA, .x$y))
})))
#> x y
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
Note the base approach won't handle different types well. To do so, coerce everything to character (rbind probably will anyway, so just add stringsAsFactors = FALSE to as.data.frame) and lapply type.convert.
Your method is already pretty compact, but if you're looking for other methods, one way might be to use rbindlist from data.table:
library(data.table)
new_list <- lapply(messy_list, function(x) if(identical(x,list())){list(x = NA)} else {x})
rbindlist(new_list, fill = T, use.names = T)
# x y
#1: 2 3
#2: NA NA
#3: NA 4
#4: 5 NA
Note we need the lapply so it doesn't drop the rows that are empty

How to unpack nested JSON type column in a dataframe with R(plus RegEx issue)

I'm very new to R,and I'm currently stuck on this problem:
so I imported a JSON file and already***convert it to a dataframe***, now I need to return rows under condition:
As you can see in the picture, I have a column recording hours(payload.hours)
My GOAL is to find out the hours that meet: 1. sunday 2. time ealier than 10AM.
I tried several ways but somehow it even doesn't come close at all... I havent dealt with such nested form before... so I have to seek your idea&help...
e.g. one element in payload.hours column
payload.hours
...
[530] "{\"monday\":[[\"10:30\",\"16:00\"]],\"tuesday\":[[\"10:30\",\"16:00\"]],\"wednesday\":[[\"10:30\",\"16:00\"]],\"thursday\":[[\"10:30\",\"16:00\"]],\"friday\":[[\"10:30\",\"16:00\"]],\"saturday\":[[\"10:30\",\"16:00\"]],\"sunday\":[[\"10:30\",\"16:00\"]]}"
this is what I used for unpacking the nested lists in "hours" column...but it doesn't work...
library(ndjson)
json<- ndjson::stream_in("#localpath")
#successfully converted json to a dataframe...but each element in payload.hours column remains nested like above.
lapply(json$payload.hours, jsonlite::fromJSON)
#continue unwarp nested jason column BUT RESULT Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < :missing value where TRUE/FALSE needed
Another approach I tried(FOR A LONG TIME) is RegEx
hrs<-json1$payload.hours #select column hours into hrs
tme<-"sunday{1}.{8}[0-9]{1}\"" # ???(not sure about this...seruously)...? match string with sunday and after 8characters..aka find preceding digit{1} when meet ":"
iftme<-grepl(tme,hrs) #set logical factor T/F if matches
checkhrs<-hrs[iftme] #check if open hours are correct
checkhrs
And this seems to work...but I am not sure why...(YES.IDK WHY)...so if anyone could explain to me that would be great!
This is the original json file:
https://drive.google.com/open?id=0B-jU6pp4pjS4Smg2RGpHSTlvN2c
This is RegEx output...seems right...but I am not sure about my expression..LOL
Unpacking JSON can be a lot of work, particularly if it is deeply nested. Most JSON reading packages (jsonlite, RJSONIO, etc.) can turn data into something close to a data.frame, but fixing the structure requires an understanding that the reader functions don't have. Since JSON most nearly corresponds to R's lists, cleaning up data coming from JSON typically involves a lot of lapply and its variants. Here I'll use purrr, which has many useful variants and helper functions and works neatly with dplyr.
library(tidyverse)
# Read data
json <- jsonlite::stream_in(file('~/Downloads/jsondata.json'))
# Initial cleanup to proper data.frame
json <- json$payload %>% map_df(simplify_all) %>% dmap(simplify) %>%
mutate(uuid = json$uuid, # re-add uuid subset out at beginning
# Convert hours to a list column of data.frames
hours = hours %>% map_if(negate(is.na), jsonlite::fromJSON) %>%
map(~map_df(.x, as_data_frame, .id = 'day')),
# Add Boolean variable for whether Sunday opening hours are before 10a. Subset,
open_sun_before_10 = hours %>% map(~.x %>% filter(day == 'sunday') %>% .[[2]]) %>%
map(as.POSIXct, format = '%H:%M') %>% # convert to datetime,
map(~.x < as.POSIXct('10:00', format = '%H:%M')) %>% # compare to 10a
map_lgl(~ifelse(length(.x) == 0, NA, .x))) # and cleanup.
Whereas stream_in returned a data.frame with two columns (one very deeply nested), the columns are now less nested. There are still JSON structures in some of the untouched columns, though, which will have to be addressed if you want to use the data.
json
#> # A tibble: 538 × 42
#> existence_full geo_virtual latitude
#> <dbl> <chr> <chr>
#> 1 1.000000 ["56.9459720|-2.1971226|20|within_50m|4"] 56.945972
#> 2 1.000000 ["56.237480|-5.073578|20|within_50m|4"] 56.237480
#> 3 1.000000 ["51.483872|-0.606820|100|rooftop|2"] 51.483872
#> 4 1.000000 ["57.343233|-2.191955|100|rooftop|4"] 57.343233
#> 5 1.000000 ["53.225815|-4.094775|20|within_50m|4"] 53.225815
#> 6 1.000000 ["58.9965740|-3.1882195|20|within_50m|4"] 58.996574
#> 7 1.000000 ["57.661419|-2.520144|100|rooftop|4"] 57.661419
#> 8 1.000000 ["51.642727|-3.934845|20|within_50m|4"] 51.642727
#> 9 0.908251 <NA> <NA>
#> 10 1.000000 ["56.510558|-5.401638|100|rooftop|2"] 56.510558
#> # ... with 528 more rows, and 39 more variables: locality <chr>,
#> # `_records_touched` <chr>, address <chr>, email <chr>,
#> # existence_ml <dbl>, domain_aggregate <chr>, name <chr>,
#> # search_tags <list>, admin_region <chr>, existence <dbl>,
#> # category_labels <list>, post_town <chr>, region <chr>,
#> # review_count <chr>, geocode_level <chr>, tel <chr>, placerank <int>,
#> # longitude <chr>, placerank_ml <dbl>, fax <chr>,
#> # category_ids_text_search <chr>, website <chr>, status <chr>,
#> # geocode_confidence <chr>, postcode <chr>, category_ids <list>,
#> # country <chr>, `_geocode_quality` <chr>, hours_display <chr>,
#> # hours <list>, neighborhood <list>, languages <chr>,
#> # address_extended <chr>, status_closed <chr>, po_box <chr>,
#> # name_variants <list>, yext_id <chr>, uuid <chr>,
#> # open_sun_before_10 <lgl>
And the columns created:
json %>% select(hours, open_sun_before_10)
#> # A tibble: 538 × 2
#> hours open_sun_before_10
#> <list> <lgl>
#> 1 <tibble [1 × 2]> NA
#> 2 <tibble [1 × 2]> NA
#> 3 <tibble [7 × 3]> FALSE
#> 4 <tibble [1 × 2]> NA
#> 5 <tibble [7 × 3]> FALSE
#> 6 <tibble [1 × 2]> NA
#> 7 <tibble [1 × 2]> NA
#> 8 <tibble [6 × 3]> NA
#> 9 <tibble [1 × 2]> NA
#> 10 <tibble [7 × 3]> TRUE
#> # ... with 528 more rows

Tidy nested json tree

This comes up a lot when dealing with API's.
Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.
I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.
I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,
But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.
#####################
# Some Test Data
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
############################################################
# Run through the list and rename all list elements,
# We unlist once at time, adding "__" at each unlist step
# until the object is no longer a list
renameVars <- function(lst, sep = '__') {
if(is.list(lst)) {
names(lst) <- paste0(names(lst),sep)
renameVars(unlist(lst, recursive = FALSE),sep = sep)
} else {
lst
}
}
res <- renameVars(aNestedTree)
We can check the output and see that we have a strangely named object,
But there's a method to this madness.
> res
a________ b________ c__.a____1__ c__.a____2__ c__.a____3__
1 2 1 2 3
c__.a____4__ c__.a____5__ c__.b______ c__.c__.a____ c__.c__.d____
4 5 2 1 3
d__.y______ d__.z______
3 2
Now I put this in a data.table, so I can shape it.
library(data.table)
dt <- data.table(values = res, name = names(res))
# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests
> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
values name V1 V2 V3
1: 1 a________ a NA NA
2: 2 b________ b NA NA
3: 1 c__.a____1__ c a 1
4: 2 c__.a____2__ c a 2
5: 3 c__.a____3__ c a 3
6: 4 c__.a____4__ c a 4
7: 5 c__.a____5__ c a 5
8: 2 c__.b______ c b NA
9: 1 c__.c__.a____ c c a
10: 3 c__.c__.d____ c c d
11: 3 d__.y______ d y NA
12: 2 d__.z______ d z NA
I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)
I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.
I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.
Notes on approach:
That all values would be numeric (I cast them to character afterwards)
Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
level is kept track of by level of recursion
First define the helpers:
recurse_gather <- function(.x,.level) {
.x <- tidyjson::bind_rows(
gobj(.x,.level)
, garr(.x,.level)
, gpersist(.x,.level)
)
if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
.x <- recurse_gather(.x,.level+1)
}
return(.x)
}
gobj <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='object') %>%
gather_object(paste0('v',.level)) %>%
select(-type)
}
gpersist <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(! type %in% c('object','array')) %>%
mutate_(.dots=setNames(
paste0('as.character(NA)')
,paste0('v',.level)
)) %>%
select(-type)
}
garr <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='array') %>%
gather_array('arridx') %>%
append_values_number(paste0('v',.level)) %>%
mutate_(.dots=setNames(
paste0('as.character(v',.level,')')
,paste0('v',.level)
)) %>%
select(-arridx,-type)
}
Then using the helpers is pretty straight-forward.
library(dplyr)
library(tidyjson)
j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#> document.id v1 v2 v3 v4
#> * <int> <chr> <chr> <chr> <chr>
#> 1 1 a 1 <NA> <NA>
#> 2 1 b 2 <NA> <NA>
#> 3 1 c a 1 <NA>
#> 4 1 c a 2 <NA>
#> 5 1 c a 3 <NA>
#> 6 1 c a 4 <NA>
#> 7 1 c a 5 <NA>
#> 8 1 c b 2 <NA>
#> 9 1 c c a 1
#> 10 1 c c d 3
#> 11 1 d y 3 <NA>
#> 12 1 d z 2 <NA>
Hopeful that future development on the tidyjson package will make this an easier problem to tackle!
I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson
As dracodoc pointed out, data.tree might help. E.g. like this:
library(data.tree)
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
tree <- FromListSimple(aNestedTree)
print(tree)
This will give:
levelName z
1 Root NA
2 ¦--c NA
3 ¦ ¦--a NA
4 ¦ °--c NA
5 ¦ °--e NA
6 °--d 2
And:
tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"
Side note: typically, you could do something like this:
do.call("print", c(tree, tree$fieldsAll))
However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.