My problem is very similar to this one. I want to identify all the links in this website so I can then open each link and download the tables.
The problem is that when I create the extract_links function as pointed out in that answer, I get a list of all the URLs, but these are not complete.
To make it more clear:
If you press "Junio" in year "2022" the real URL is the following:
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=7658
but the URL that I am recovering from the source of the website lacks the last bit (&ia=7658):
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022
This does not direct me to the table I want.
The problem is that these numbers do not seem to follow any logic and change between year/month links. Any help on how to retrieve the full URLs will be greatly appreciated. If you also happen to know how I can retrieve the year/month of the file to add as an extra column, that would also be great.
Thanks to the help of @margusl I was able to realize that rvest follows redirects automatically, and that solves my problem.
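Just to illustrate the fix (a minimal check of my own): requesting the partial link still lands on the full table, because the server redirects and read_html() follows it.

library(rvest)
# partial link from the page source, without the trailing &ia=... part
url <- "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022"
page <- read_html(url)  # follows the redirect to the full URL
html_table(page)[[1]]   # the table behind "Junio" 2022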
I am trying to use the following code to loop over different links to obtain the tables, store them in a data frame and then download them:
library(rvest)

yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
combined_df <- data.frame()
for (yr in yr.list){
  for (mes in mes.list) {
    root <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21"
    # Full link
    url <- paste(root,"&m=",mes,"&a=",yr,sep="")
    # Parse HTML file
    file <- read_html(url, encoding = "latin1")
    file <- rvest::html_table(file)
    str(file)
    # This is the relevant table
    table <- as.data.frame(file[[1]])
    # add the tables that you read to combined_df
    combined_df <- rbind(combined_df, table)
  }
}
It does not work because read_html() with that encoding works only for some years, but not for all. For example, when running:
url <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015"
file<-read_html(url, encoding = "latin1")
It does not recover the tables with names/surnames that it recovers in the previous months, but something else. Why can't this work on all the sub-pages? Is this an encoding problem again?
If you open that last page you had issues with - http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015 - you'll see that it serves a sort of submenu with 2 more links. Meaning that it's not enough to just generate links for each month & year and extract the first table of each page; all those pages should be checked for content, and exceptions should be handled.
I took a somewhat opportunistic approach and it happened to work with the URL range defined in the question plus those few odd samples, but there could be other surprises down the road. I switched to httr for making requests as it allows collecting and monitoring response headers, and separating content retrieval from parsing also seems to work around the encoding issues, at least in this case. Collecting first and parsing later simplifies debugging too: you can check if certain responses / headers were different from the rest (i.e. response length being 10x smaller than average, or the final, redirected url differing from the rest), and it's easy to change content handling / parsing for a small subset of responses, if needed. If you are not sure what rvest has retrieved, you can always save the response to a html file and check it with a browser or editor, something like:
html <- read_html(url_or_text_content); write(as.character(html), "dump.html")
library(rvest)
library(httr)
library(purrr)
library(dplyr)
library(tidyr)
library(stringr)
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
# combine mes.list & yr.list
url.params <- expand.grid(mes = mes.list, yr = yr.list)
# few extra samples:
url.params <- rbind(url.params,
                    list(mes = 6, yr = 2022), # here rvest struggles with correct encoding
                    list(mes = 3, yr = 2015)  # returns page with sub-categories
)
url.list <- str_glue("http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m={url.params$mes}&a={url.params$yr}")
url.list
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=1&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=2&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2019
#> ...
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=11&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=12&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=6&a=2022
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# url list for input, output is a tibble with all responses (incl. "url", "date",
# "status_code", header details and response body)
fetch_urls <- function(url.list){
  # collect all responses to a list with httr, parse responses later
  # add progress bar - requests take a while
  resp.list = vector(mode = "list", length = length(url.list))
  pb <- txtProgressBar(max = length(url.list), style = 3)
  for (i in seq_along(url.list)){
    resp.list[[i]] <- GET(url.list[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)

  # turn responses into tibble to check urls, response sizes and status codes
  resp.tibble <- bind_cols(
    map_df(resp.list, ~ .[c("url", "date", "status_code")], .id = "req_id"),
    map_df(resp.list, headers) %>% rename_with(~ paste0("header_", .x)),
    map_chr(resp.list, content, as = "text") %>% tibble(html_doc = .)
  )
  return(resp.tibble)
}
resp.tibble <- fetch_urls(url.list)
# check resulting table without html_doc column
# View(resp.tibble[-ncol(resp.tibble)])
resp.tibble %>%
  select(req_id:status_code, `header_content-length`) %>%
  arrange(`header_content-length`)
#> # A tibble: 26 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 14 http://transparencia.uantof.cl/in… 2022-10-31 17:29:12 200 21371
#> 2 26 http://transparencia.uantof.cl/in… 2022-10-31 17:31:45 200 2230
#> 3 24 http://transparencia.uantof.cl/in… 2022-10-31 17:31:21 200 24035
#> 4 21 http://transparencia.uantof.cl/in… 2022-10-31 17:30:42 200 24173
#> 5 20 http://transparencia.uantof.cl/in… 2022-10-31 17:30:29 200 24183
#> 6 23 http://transparencia.uantof.cl/in… 2022-10-31 17:31:08 200 24184
#> 7 22 http://transparencia.uantof.cl/in… 2022-10-31 17:30:55 200 24207
#> 8 18 http://transparencia.uantof.cl/in… 2022-10-31 17:30:04 200 24405
#> 9 16 http://transparencia.uantof.cl/in… 2022-10-31 17:29:38 200 24715
#> 10 7 http://transparencia.uantof.cl/in… 2022-10-31 17:27:32 200 24716
#> # … with 16 more rows, and abbreviated variable names ¹status_code,
#> # ²`header_content-length`
# request 26 is kind of suspicious:
# 25 http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=76…
# 26 http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# looks like there has been no redirection and its header_content-length is about 10x smaller than for other responses
# checking it more closely reveals that the page includes a "submenu" instead of table(s):
# <p class="subMenu_interiores">
# <b>2015 - Marzo</b>
# ABRIL 2015
# Marzo 2015
# </p>
# let's collect urls that were not redirected from our tibble and harvest links from stored html:
suburl.list <- resp.tibble %>%
  # urls that do NOT include "plantillas_generar_plantilla"
  filter(!str_detect(url, "plantillas_generar_plantilla")) %>%
  pull(html_doc) %>%
  # rvest does not like lists, thus let's map()
  map(~ read_html(.x) %>% html_elements("#columna1_interiores a") %>% html_attr("href")) %>%
  unlist() %>%
  paste0("http://transparencia.uantof.cl/", .)
suburl.list
#> [1] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=772"
#> [2] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=648"
# fetch content from those submenu urls
subresp.tibble <- fetch_urls(suburl.list)
# sanity check:
subresp.tibble %>%
  select(req_id:status_code, `header_content-length`)
#> # A tibble: 2 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 1 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:52 200 25385
#> 2 2 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:59 200 25332
#> # … with abbreviated variable names ¹status_code, ²`header_content-length`
# better, sizes align with previous results.
# collect all relevant responses
table_1 <- resp.tibble %>%
  filter(str_detect(url, "plantillas_generar_plantilla")) %>%
  bind_rows(subresp.tibble) %>%
  # extract html (as strings)
  pull(html_doc) %>%
  # rvest does not like lists, thus let's map(); pluck(1) extracts the first table (from each page)
  map(~ read_html(.x) %>% html_table() %>% pluck(1)) %>%
  # first attempt to bind rows fails, apparently column types differ;
  # change all non-character columns to character
  map(~ mutate(.x, across(!where(is.character), as.character))) %>%
  # bind all tables by rows
  bind_rows()
# columns vary across tables so number of NA fields in final result is rather high
Final result for 26 pages, a 10,987 × 30 tibble:
table_1
#> # A tibble: 10,987 × 30
#> Nº PLANTA PATERNO MATERNO NOMBRES G TITULO CARGO REGION ASIGN…¹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 DIRECTIVO ABARZA CASTRO JESSIC… 6 ADMIN… JEFE… SEGUN… (1)(8)…
#> 2 2 PROFESIONAL ABASOLO QUINTE… NURY D… 12 EDUCA… PROF… SEGUN… (4)(8)…
#> 3 3 ACADEMICO ACOSTA PENA ROXANA… 11 EDUCA… PROF… SEGUN… (2)(8)…
#> 4 4 AUXILIARES ACOSTA PIZARRO ROBERT… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 5 5 DIRECTIVO AGENO SIERRA ROSELL… 4 MATRO… DIRE… SEGUN… (1)(8)…
#> 6 6 AUXILIARES AGUIRRE LAZO RENE G… 16 LICEN… AUXI… SEGUN… (7)(8)…
#> 7 7 TECNICOS ALAMOS MARIN SERGIO… 13 TECNI… TECN… SEGUN… (5)(8)…
#> 8 8 AUXILIARES ALAYANA CORTES CHRIST… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 9 9 ACADEMICO ALCOTA AGUIRRE PATRIC… 9 ING. … PROF… SEGUN… (2)(8)…
#> 10 10 ADMINISTRATI… ALFARO BARRAZA MARIA … 23 LICEN… ADMI… SEGUN… (6)(8)…
#> # … with 10,977 more rows, 20 more variables: `UNID MONETARIA` <chr>,
#> # `REMUNERACION MENSUAL BRUTA` <chr>, HORAS <chr>, `CANT. HORAS` <chr>,
#> # `MONTO HORAS EXTRAS` <chr>, `FECHA DE INGRESO` <chr>, `F. HASTA` <chr>,
#> # OBSERVACIONES <chr>, GRADO <chr>, ESTAMENTO <chr>,
#> # `Apellido Paterno` <chr>, `Apellido Materno` <chr>, Nombres <chr>,
#> # `Grado ERUA` <chr>, `CALIFICACION PROFESIONAL O FORMACION` <chr>,
#> # `CARGO O FUNCION` <chr>, `R BRUTA` <chr>, `Horas Extras` <chr>, …
Created on 2022-10-31 with reprex v2.0.2
I'm having the hardest time generating confidence intervals for my glmer poisson model. After following several very helpful tutorials (such as https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/) as well as stackoverflow posts, I keep getting very strange results, i.e. the upper and lower limits of the CI are identical.
Here is a reproducible example containing a response variable called "production," a fixed effect called "Treatment_Num" and a random effect called "Genotype":
df1 <- data.frame(production = c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0),
                  Treatment_Num = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                  Genotype = c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))
#run the glmer model (glmer needs lme4 loaded first)
library(lme4)
df1_glmer <- glmer(production ~ Treatment_Num + (1|Genotype),
                   data = df1, family = poisson(link = "log"))
#make an empty data set to predict from, that contains the explanatory variables but no response
require(magrittr)
df_empty <- df1 %>%
  tidyr::expand(Treatment_Num, Genotype)
#create new column containing predictions
df_empty$PopPred <- predict(df1_glmer, newdata = df_empty, type="response",re.form = ~0)
#function for bootMer
myFunc_df1_glmer <- function(mm) {
  predict(df1_glmer, newdata = df_empty, type="response", re.form=~0)
}
#run bootMer (lme4 is already loaded above)
merBoot_df1_glmer <- bootMer(df1_glmer, myFunc_df1_glmer, nsim = 10)
#get confidence intervals out of it
predCL <- t(apply(merBoot_df1_glmer$t, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)))
#enter lower and upper limits of confidence interval into df_empty
df_empty$lci <- predCL[, 1]
df_empty$uci <- predCL[, 2]
#when viewing df_empty the problem becomes clear: the lci and uci are identical!
df_empty
Any insights you can give me will be much appreciated!
The issue is with the function you created to pass to bootMer(). You wrote:
myFunc_df1_glmer <- function(mm) {
  predict(df1_glmer, newdata = df_empty, type="response", re.form=~0)
}
The argument mm should be a fitted model object derived from the bootstrapped data. However, you don't pass this object to predict(), but rather the original model object. If you change the function to:
myFunc_df1_glmer <- function(mm) {
  predict(mm, newdata = df_empty, type="response", re.form=~0)
  #^^ pass in the object created by bootMer
}
then it works:
> df_empty
# A tibble: 8 x 5
Treatment_Num Genotype PopPred lci uci
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 12.9 9.63 15.7
2 1 2 12.9 9.63 15.7
3 2 1 5.09 3.87 5.89
4 2 2 5.09 3.87 5.89
5 3 1 2.01 1.20 2.46
6 3 2 2.01 1.20 2.46
7 4 1 0.796 0.361 1.14
8 4 2 0.796 0.361 1.14
As an aside -- how many genotypes are in your actual data? If fewer than 5-7, you might do better using a straight-up glm() with genotype as a factor using sum-to-zero contrasts.
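Something like this, as a rough sketch on the example data above (the Genotype_f name and the use of contr.sum are mine, just to make the suggestion concrete):

#treat Genotype as a fixed factor with sum-to-zero contrasts
df1$Genotype_f <- factor(df1$Genotype)
contrasts(df1$Genotype_f) <- contr.sum(nlevels(df1$Genotype_f))
df1_glm <- glm(production ~ Treatment_Num + Genotype_f,
               data = df1, family = poisson(link = "log"))
#confidence intervals for the coefficients come straight from confint()
confint(df1_glm)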
I am trying to load a JSON file into a data.frame in R. I have had some luck with the fromJSON function in the jsonlite package, but am getting nested lists and am not sure how to flatten the input into a two-dimensional data.frame. jsonlite reads the file in as a data.frame, but leaves nested lists in some of the variables.
Does anyone have any tips on loading a JSON file into a data.frame when it reads in with nested lists?
# -------------------- HERE IS MY EXAMPLE --------------------
# loads the packages
library("httr")
library( "jsonlite")
# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )
# but many of the columns are still lists:
sapply( providers , class)
# Some of these lists have a single level
head( providers$facility_type )
# Some have a lot more than two - for example nine
providers[ , 6][[1]]
I want one row per npi, and then separate columns for each of the slices of the individual lists - so that the data frame has columns for "plan_id_type", "plan_id" and "network_tier" nine times over, with column name suffixes, say, from 0 to 8.
I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file into two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would like to get a file with this structure loaded as a data.frame using the fromJSON function.
Here are a few of the things I have tried. I have thought of two approaches.
First: use a different function to read in the JSON file. I have looked at rjson, but that reads in a list:
library(rjson)
library(RCurl)  # for getURL()
providers <- fromJSON(getURL("https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json"))
class(providers)
and I have tried RJSONIO - following Getting imported json data into a data frame in R:
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
json_file <- lapply(providers, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
# but when converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)
So, the second approach I have tried is to convert all the lists into variables in my data.frame
detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )
library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
providers <- flatten( providers )
I am able to pull one of the lists - but because of missing values I can't merge back onto my data.frame:
a <- data.frame(Reduce(rbind, providers$facility_type))
length( a ) == nrow( providers )
I also tried these suggestions: Converting nested list to dataframe. As well as some other stuff, but haven't had any luck:
a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )
Any help much appreciated
Update: 21 February 2016
col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.
In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.
Here's a sample dataset that recreates those conditions:
mydf <- data.frame(id = 1:3, type = c("A", "A", "B"),
facility = I(list(c("x", "y"), NULL, "x")),
address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3),
data.frame(v1 = 1:2, v2 = 3:4, v3 = 5),
data.frame(v1 = 1, v2 = NA, v3 = 3))))
mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
preference = c(TRUE, FALSE, TRUE))
The str of this sample data.frame looks like:
str(mydf)
## 'data.frame': 3 obs. of 5 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ facility:List of 3
## ..$ : chr "x" "y"
## ..$ : NULL
## ..$ : chr "x"
## ..- attr(*, "class")= chr "AsIs"
## $ address :List of 3
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: num 2
## .. ..$ v4: num 3
## ..$ :'data.frame': 2 obs. of 3 variables:
## .. ..$ v1: int 1 2
## .. ..$ v2: int 3 4
## .. ..$ v3: num 5 5
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: logi NA
## .. ..$ v3: num 3
## ..- attr(*, "class")= chr "AsIs"
## $ person :'data.frame': 3 obs. of 3 variables:
## ..$ name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## ..$ age : num 20 32 23
## ..$ preference: logi TRUE FALSE TRUE
## NULL
One way you can "flatten" this is to "fix" the list columns. There are three fixes.
flatten (from "jsonlite") will take care of columns like the "person" column.
Columns like the "facility" column can be fixed using toString, which would convert each element to a comma separated item or which can be converted into multiple columns.
Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together).
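To illustrate the second fix on the sample data: toString collapses each list element into a single comma-separated string, with NULL becoming an empty string.

sapply(mydf$facility, toString)
## [1] "x, y" ""     "x"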
We can take care of the second and third points with a function like the following:
col_fixer <- function(x, vec2col = FALSE) {
  if (!is.list(x[[1]])) {
    if (isTRUE(vec2col)) {
      as.data.table(data.table::transpose(x))
    } else {
      vapply(x, toString, character(1L))
    }
  } else {
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}
We'll integrate that and the flatten function in another function that would do most of the processing.
Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  indf <- flatten(indf)
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}
Running the function gives us:
Flattener(mydf)
## id type person.name person.age person.preference facility address.v1_1
## 1 1 A AA 20 TRUE x, y 1
## 2 2 A BB 32 FALSE 1
## 3 3 B CC 23 TRUE x 1
## address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1 NA 2 NA 3 NA NA
## 2 2 3 4 NA NA 5
## 3 NA NA NA NA NA 3
## address.v3_2
## 1 NA
## 2 5
## 3 NA
Or, with the vectors going into separate columns:
Flattener(mydf, TRUE)
## id type person.name person.age person.preference facility.V1 facility.V2
## 1 1 A AA 20 TRUE x y
## 2 2 A BB 32 FALSE <NA> <NA>
## 3 3 B CC 23 TRUE x <NA>
## address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1 1 NA 2 NA 3 NA
## 2 1 2 3 4 NA NA
## 3 1 NA NA NA NA NA
## address.v3_1 address.v3_2
## 1 NA NA
## 2 5 5
## 3 3 NA
Here's the str:
str(Flattener(mydf))
## 'data.frame': 3 obs. of 14 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ person.name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## $ person.age : num 20 32 23
## $ person.preference: logi TRUE FALSE TRUE
## $ facility : chr "x, y" "" "x"
## $ address.v1_1 : num 1 1 1
## $ address.v1_2 : num NA 2 NA
## $ address.v2_1 : num 2 3 NA
## $ address.v2_2 : num NA 4 NA
## $ address.v4_1 : num 3 NA NA
## $ address.v4_2 : num NA NA NA
## $ address.v3_1 : num NA 5 3
## $ address.v3_2 : num NA 5 NA
## NULL
On your "providers" object, this runs very quickly and consistently:
library(microbenchmark)
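# note: flattenList() and jsonRList come from the recursive base R solution in a later answer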
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
# expr min lq mean median uq max neval
# Flattener(providers) 104.18939 126.59295 157.3744 138.4185 174.5222 308.5218 100
# Flattener(providers, TRUE) 67.56471 86.37789 109.8921 96.3534 121.4443 301.4856 100
# flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793 100
library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via @TylerRinker
My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:
##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);
##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data
Next, to get a deep understanding of the structure and cleanness of the data, I wrote a set of helper functions:
##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
func=identity, ## a function to run separately on each node once keyList has been exhausted
..., ## further arguments passed to func()
joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
if (length(keyList) == 0L) {
ret <- if (is.null(nodes)) NULL else func(nodes,...)
} else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
if (!is.null(joinFunc))
ret <- do.call(joinFunc,ret);
} else {
ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
}; ## end if
ret;
}; ## end levelApply()
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret
The key to understanding the above is the keyList parameter. Let's say you have a list like this:
list(NULL,'addresses',2:3,'city')
That would select all city strings underneath the second and third address elements underneath the addresses list underneath all elements of the main list.
There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:
unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St" "Fairbanks" "1650 Cowles St" "Fairbanks"
Here are the remaining helper functions I wrote in the process of working on this problem:
## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };
## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };
## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));
## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);
## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);
I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:
##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"
## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"
## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
## [1] "npi" "type" "facility_name" "facility_type" "addresses" "plans" "last_updated_on" "name" "speciality" "accepting" "languages" "gender"
## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
## npi type facility_name facility_type addresses plans last_updated_on name speciality accepting languages gender .count
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE 279
## 2 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2973
## 3 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 4
## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
## key tl
## 1 npi character[1]
## 2 type character[1]
## 3 facility_name character[1]
## 4 facility_type character[1],character[2],character[3]
## 5 addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6 plans list[9]
## 7 last_updated_on character[1]
## 8 name namedlist
## 9 speciality character[1],character[2],character[3],character[4]
## 10 accepting character[1]
## 11 languages character[2],character[3],character[4],character[6],character[5]
## 12 gender character[1]
## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them
## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"
## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address" "city" "state" "zip" "phone" "address_2"
## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## address city state zip phone address_2 .count
## 1 TRUE TRUE TRUE TRUE TRUE FALSE 1898
## 2 TRUE TRUE TRUE TRUE TRUE TRUE 2575
## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## key tl
## 1 address character[1]
## 2 city character[1]
## 3 state character[1]
## 4 zip character[1]
## 5 phone character[1]
## 6 address_2 character[1]
## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"
## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id" "network_tier"
## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## plan_id_type plan_id network_tier .count
## 1 TRUE TRUE TRUE 29304
## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## key tl
## 1 plan_id_type character[1]
## 2 plan_id character[1]
## 3 network_tier character[1]
## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first" "middle" "last"
## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
## first middle last .count
## 1 TRUE TRUE TRUE 2679
## 2 TRUE FALSE TRUE 298
## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
## key tl
## 1 first character[1]
## 2 middle character[1]
## 3 last character[1]
Here's my summary of the data:
one top-level main list, length 3256.
each element is a hash with inconsistent key sets. There are 12 keys in total across all main hashes, with 3 patterns of key sets present.
6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a list of variable length, plans is a list always of length 9, and name is a hash.
each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
each plans list element is a hash with 3 keys to scalar strings, no inconsistencies.
each name hash has first and last but not always middle scalar strings.
The most important observation here is that there are no type-inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion. We can flatten all the data into a two-dimensional structure provided we associate columns with deep-enough nodes, such that all columns correspond to a single scalar string node in the input list.
Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() I defined earlier.
##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
nodes, ## current level node selection
..., ## additional arguments to data.frame()
keyList=list(), ## current key path under main list
sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
tlList <- lapply(nodes,tl);
typeList <- do.call(c,lapply(tlList,`[[`,'type'));
if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
type <- typeList[1L];
if (type == 'namedlist') { ## hash; recurse
allKeys <- unique(do.call(c,lapply(nodes,names)));
ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
} else if (type == 'list') { ## array; recurse
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
allIndexes <- seq_len(maxLen);
ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
} else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
if (is.null(sep)) {
ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
} else {
## keep original type if maxLen is 1, IOW don't stringify
ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
}; ## end if
} else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where the value was missing, and then transforming to a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList to the string column name. Multiple columns are returned as a list of data.frames from each recursive call and likewise from the top-level call.
It also validates that there are no type-inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, so this validation step is appropriate.
flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.
An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.
I've also added a sep parameter which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements to a single string, and only one column is generated for the node.
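For example, a hypothetical call combining those options (the df2 name is mine): join multi-element leaf nodes on commas rather than splitting them into indexed columns, and keep character columns as plain strings.

df2 <- flattenList(jsonRList,sep=',',stringsAsFactors=F);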
In terms of performance, it's very fast. Here's a demo:
## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
## user system elapsed
## 2.265 0.000 2.268
Result:
class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256 126
## [1] "npi" "type" "facility_name" "facility_type.1" "facility_type.2" "facility_type.3" "addresses.1.address" "addresses.1.city" "addresses.1.state"
## [10] "addresses.1.zip" "addresses.1.phone" "addresses.1.address_2" "addresses.2.address" "addresses.2.city" "addresses.2.state" "addresses.2.zip" "addresses.2.phone" "addresses.2.address_2"
## [19] "addresses.3.address" "addresses.3.city" "addresses.3.state" "addresses.3.zip" "addresses.3.phone" "addresses.3.address_2" "addresses.4.address" "addresses.4.city" "addresses.4.state"
## [28] "addresses.4.zip" "addresses.4.phone" "addresses.4.address_2" "addresses.5.address" "addresses.5.address_2" "addresses.5.city" "addresses.5.state" "addresses.5.zip" "addresses.5.phone"
## [37] "addresses.6.address" "addresses.6.address_2" "addresses.6.city" "addresses.6.state" "addresses.6.zip" "addresses.6.phone" "addresses.7.address" "addresses.7.address_2" "addresses.7.city"
## [46] "addresses.7.state" "addresses.7.zip" "addresses.7.phone" "addresses.8.address" "addresses.8.address_2" "addresses.8.city" "addresses.8.state" "addresses.8.zip" "addresses.8.phone"
## [55] "addresses.9.address" "addresses.9.address_2" "addresses.9.city" "addresses.9.state" "addresses.9.zip" "addresses.9.phone" "addresses.10.address" "addresses.10.address_2" "addresses.10.city"
## [64] "addresses.10.state" "addresses.10.zip" "addresses.10.phone" "addresses.11.address" "addresses.11.address_2" "addresses.11.city" "addresses.11.state" "addresses.11.zip" "addresses.11.phone"
## [73] "addresses.12.address" "addresses.12.address_2" "addresses.12.city" "addresses.12.state" "addresses.12.zip" "addresses.12.phone" "addresses.13.address" "addresses.13.city" "addresses.13.state"
## [82] "addresses.13.zip" "addresses.13.phone" "plans.1.plan_id_type" "plans.1.plan_id" "plans.1.network_tier" "plans.2.plan_id_type" "plans.2.plan_id" "plans.2.network_tier" "plans.3.plan_id_type"
## [91] "plans.3.plan_id" "plans.3.network_tier" "plans.4.plan_id_type" "plans.4.plan_id" "plans.4.network_tier" "plans.5.plan_id_type" "plans.5.plan_id" "plans.5.network_tier" "plans.6.plan_id_type"
## [100] "plans.6.plan_id" "plans.6.network_tier" "plans.7.plan_id_type" "plans.7.plan_id" "plans.7.network_tier" "plans.8.plan_id_type" "plans.8.plan_id" "plans.8.network_tier" "plans.9.plan_id_type"
## [109] "plans.9.plan_id" "plans.9.network_tier" "last_updated_on" "name.first" "name.middle" "name.last" "speciality.1" "speciality.2" "speciality.3"
## [118] "speciality.4" "accepting" "languages.1" "languages.2" "languages.3" "languages.4" "languages.5" "languages.6" "gender"
The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:
rowToFrame(df[1L,]);
## column value
## 1 npi 1063645026
## 2 type FACILITY
## 3 facility_name EXPRESS SCRIPTS
## 4 facility_type.1 Pharmacies
## 5 facility_type.2 <NA>
## 6 facility_type.3 <NA>
## 7 addresses.1.address 4750 E 450 S
## 8 addresses.1.city WHITESTOWN
## 9 addresses.1.state IN
## 10 addresses.1.zip 46075
## 11 addresses.1.phone 2012695236
## 12 addresses.1.address_2 <NA>
## 13 addresses.2.address <NA>
## 14 addresses.2.city <NA>
## 15 addresses.2.state <NA>
## 16 addresses.2.zip <NA>
## 17 addresses.2.phone <NA>
## 18 addresses.2.address_2 <NA>
## 19 addresses.3.address <NA>
## 20 addresses.3.city <NA>
## 21 addresses.3.state <NA>
##
## ... snip ...
##
## 77 addresses.12.zip <NA>
## 78 addresses.12.phone <NA>
## 79 addresses.13.address <NA>
## 80 addresses.13.city <NA>
## 81 addresses.13.state <NA>
## 82 addresses.13.zip <NA>
## 83 addresses.13.phone <NA>
## 84 plans.1.plan_id_type HIOS-PLAN-ID
## 85 plans.1.plan_id 38344AK0620003
## 86 plans.1.network_tier HERITAGE-PLUS
## 87 plans.2.plan_id_type HIOS-PLAN-ID
## 88 plans.2.plan_id 38344AK0620004
## 89 plans.2.network_tier HERITAGE-PLUS
## 90 plans.3.plan_id_type HIOS-PLAN-ID
## 91 plans.3.plan_id 38344AK0620006
## 92 plans.3.network_tier HERITAGE-PLUS
## 93 plans.4.plan_id_type HIOS-PLAN-ID
## 94 plans.4.plan_id 38344AK0620008
## 95 plans.4.network_tier HERITAGE-PLUS
## 96 plans.5.plan_id_type HIOS-PLAN-ID
## 97 plans.5.plan_id 38344AK0570001
## 98 plans.5.network_tier HERITAGE-PLUS
## 99 plans.6.plan_id_type HIOS-PLAN-ID
## 100 plans.6.plan_id 38344AK0570002
## 101 plans.6.network_tier HERITAGE-PLUS
## 102 plans.7.plan_id_type HIOS-PLAN-ID
## 103 plans.7.plan_id 38344AK0980003
## 104 plans.7.network_tier HERITAGE-PLUS
## 105 plans.8.plan_id_type HIOS-PLAN-ID
## 106 plans.8.plan_id 38344AK0980006
## 107 plans.8.network_tier HERITAGE-PLUS
## 108 plans.9.plan_id_type HIOS-PLAN-ID
## 109 plans.9.plan_id 38344AK0980012
## 110 plans.9.network_tier HERITAGE-PLUS
## 111 last_updated_on 2015-10-14
## 112 name.first <NA>
## 113 name.middle <NA>
## 114 name.last <NA>
## 115 speciality.1 <NA>
## 116 speciality.2 <NA>
## 117 speciality.3 <NA>
## 118 speciality.4 <NA>
## 119 accepting <NA>
## 120 languages.1 <NA>
## 121 languages.2 <NA>
## 122 languages.3 <NA>
## 123 languages.4 <NA>
## 124 languages.5 <NA>
## 125 languages.6 <NA>
## 126 gender <NA>
I've tested the result pretty thoroughly by doing many spot-checks on individual records, and it all looks correct. Let me know if you have any questions.
This answer is rather a data organization suggestion (and is much shorter than the bounty-attracting answers around;)
If you want to keep the semantics of the fields, like keep all plan_ids in a single column, you can normalize your data design a bit, and do joins afterwards, if you need the information together:
library(dplyr)
# notice the simplifyVector=F
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F)
# pick and repeat fields for each element of array
# {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
# gives data.frame
# field1, field2 array.af1 array.af2
# val val2 av1 av2
# val val2 av3 av4
denormalize <- function(data, fields, array) {
  data.frame(
    c(
      data[fields],
      as.list(
        bind_rows(
          lapply(data[[array]], data.frame)))))
}
plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]),
                         c('npi', 'type', 'last_updated_on')))
Then you can first filter on the data and join in other information afterwards:
addresses_df %>%
  filter(city == "Healy") %>%
  left_join(plans_df, by = "npi") ->
  plans_in_healy
So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.
First, I had to understand your data set. It appears to be information about health providers.
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE )
types = sapply(providers,"[[","type")
table(types)
# FACILITY INDIVIDUAL
# 279 2977
FACILITY entries have the "ID" fields facility_name and facility_type.
INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
All entries have "ID" fields npi and last_updated_on.
All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.
Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tac on a similarly denormalized data frame with "ID" information using another merge.
library(dplyr)
unfurl_npi_data = function(x) {
  repeat_cols = c("plans", "addresses")
  id_cols = setdiff(names(x), repeat_cols)
  repeat_data = x[repeat_cols]
  id_data = x[id_cols]
  # Denormalized ID data
  id_data_df = Reduce(function(x, y) merge(x, y, by = NULL), id_data, "")[, -1]
  atomic_colnames = names(which(!sapply(id_data, is.list)))
  df_atomic_cols = unlist(sapply(id_data, function(x) if (is.list(x)) rep(FALSE, length(x)) else TRUE))
  colnames(id_data_df)[df_atomic_cols] = atomic_colnames
  # Join the plans and addresses (denormalized)
  repeated_data = lapply(repeat_data, rbind_all)
  repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
  merge(id_data_df, repeated_data_crossed)
}
providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))
Then do some cleanup.
unique_df = function(x) {
  chr_col_names = names(which(sapply(x, class) == "character"))
  for (col in chr_col_names)
    x[[col]] = toupper(x[[col]])
  unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)
And now you can ask some interesting questions. For example, how many addresses does each health care provider have?
unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
num_addresses = unique_providers %>% count(first, middle, last, gender)
table(num_addresses$n)
# 1 2 3 4 5 6 7 8 9 12 13
# 2258 492 119 33 43 21 6 1 2 1 1
At addresses with more than five people, what is the percent of male healthcare providers?
address_pcts = unique_providers %>%
  group_by(address, city, state) %>%
  filter(n() > 5) %>%
  arrange(address) %>%
  summarise(pct_male = sum(gender == "MALE") / n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
And on and on...
I’m using the R programming language (and R Studio) and am having trouble organizing some data that I’m pulling via API so that it’s writeable to a table. I’m using the StubHub API to get a JSON response that contains all ticket listings for a particular event. I can successfully make the call to StubHub and get the successful response. Here’s the code I am using to grab the response:
# get the content part of the response
msgContent = content(response)
# format to JSON object
jsonContent = jsonlite::fromJSON(toJSON(msgContent),flatten=TRUE,simplifyVector=TRUE)
This JSON object has a node called “listing” and that’s what I’m most interested in, so I set a variable to that part of the object:
friListings = jsonContent$listing
Checking the class of “friListings” I see I have a data.frame:
> class(friListings)
[1] "data.frame"
When I click on this variable in R Studio — View(friListings) — it opens in a new tab and looks pretty and nicely formatted. There are 21 variables (columns) and 609 observations (rows). I see null values for certain cells, which is expected.
I would like to write this data.frame out as a table in a file on my computer. When I try to do that, I get this error.
> write.table(friListings,file="data",row.names=FALSE)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L) X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
Looking at other postings, it appears this is happening because my data.frame is actually not “flat” and is a list of lists with different classes and nesting. I validated this by calling str() on each of the columns in friListings:
> str(friListings[1])
'data.frame': 609 obs. of 1 variable:
$ listingId:List of 609
..$ : int 1138579989
..$ : int 1138969061
..$ : int 1138958138
(this is just the first couple of lines, there are hundreds)
Another example:
> str(friListings[6])
'data.frame': 609 obs. of 1 variable:
$ sellerSectionName:List of 609
..$ : chr "Upper 354 - No View"
..$ : chr "Club 303 - Obstructed/No View"
..$ : chr "Middle 254 - Obstructed/No View"
(this is just the first couple of lines, there are hundreds)
Here is the head of friListings, shared using dput() as suggested in the reproducible example post:
> dput(head(friListings,4))
structure(list(listingId = list(1138579989L, 1138969061L, 1138958138L,
1139003985L), sectionId = list(1552295L, 1552172L, 1552220L,
1552289L), row = list("16", "6", "22", "26"), quantity = list(
1L, 2L, 4L, 1L), sellerSectionName = list("Upper 354 - No View",
"Club 303 - Obstructed/No View", "Middle 254 - Obstructed/No View",
"353"), sectionName = list("Upper 354 - Obstructed/No View",
"Club 303 - Obstructed/No View", "Middle 254 - Obstructed/No View",
"Upper 353 - Obstructed/No View"), seatNumbers = list("21",
"7,8", "13,14,15,16", "General Admission"), zoneId = list(
232917L, 232909L, 232914L, 232917L), zoneName = list("Upper",
"Club", "Middle", "Upper"), listingAttributeList = list(structure(c(204L,
201L), .Dim = c(2L, 1L)), structure(c(4369L, 5370L), .Dim = c(2L,
1L)), structure(c(4369L, 5989L), .Dim = c(2L, 1L)), structure(c(204L,
4369L), .Dim = c(2L, 1L))), listingAttributeCategoryList = list(
structure(1L, .Dim = c(1L, 1L)), structure(1L, .Dim = c(1L,
1L)), structure(1L, .Dim = c(1L, 1L)), structure(1L, .Dim = c(1L,
1L))), deliveryTypeList = list(structure(5L, .Dim = c(1L,
1L)), structure(5L, .Dim = c(1L, 1L)), structure(5L, .Dim = c(1L,
1L)), structure(5L, .Dim = c(1L, 1L))), dirtyTicketInd = list(
FALSE, FALSE, FALSE, FALSE), splitOption = list("0", "0",
"1", "1"), ticketSplit = list("1", "2", "2", "1"), splitVector = list(
structure(1L, .Dim = c(1L, 1L)), structure(2L, .Dim = c(1L,
1L)), structure(c(2L, 4L), .Dim = c(2L, 1L)), structure(1L, .Dim = c(1L,
1L))), sellerOwnInd = list(0L, 0L, 0L, 0L), currentPrice.amount = list(
468.99, 475L, 475L, 550.45), currentPrice.currency = list(
"USD", "USD", "USD", "USD"), faceValue.amount = list(NULL,
NULL, NULL, NULL), faceValue.currency = list(NULL, NULL,
NULL, NULL)), .Names = c("listingId", "sectionId", "row",
"quantity", "sellerSectionName", "sectionName", "seatNumbers",
"zoneId", "zoneName", "listingAttributeList", "listingAttributeCategoryList",
"deliveryTypeList", "dirtyTicketInd", "splitOption", "ticketSplit",
"splitVector", "sellerOwnInd", "currentPrice.amount", "currentPrice.currency",
"faceValue.amount", "faceValue.currency"), row.names = c(NA,
4L), class = "data.frame")
I tried to get around this by going through each column in friListings, unlisting that node, saving it to a vector and then doing a cbind to stitch them all together. But when I do that, I get vectors of different lengths because of the nulls. I took this approach one step further and tried to coerce each column's class to force NAs and so preserve the nulls, but that isn't working. And, regardless, there's got to be a better approach than this. Here's some output to illustrate what happens when I attempt this approach.
# Take the column zoneId and cast it as numeric to force NA
friListings$zoneId<-lapply(friListings$zoneId, as.numeric)
# check the length
> length(friListings$zoneId)
[1] 609
# unlist and check the length... and I lost 11 items
> zoneid <- unlist(friListings$zoneId, use.names=FALSE)
> length(zoneid)
[1] 598
# here's the tail of the column... (because I happen to know that's where the empty values that are being dropped are)
> tail(friListings$zoneId)
[[1]]
numeric(0)
[[2]]
numeric(0)
[[3]]
numeric(0)
[[4]]
numeric(0)
[[5]]
numeric(0)
[[6]]
numeric(0)
I know people work with JSON and R all the time (I'm obviously not one of those people!), so maybe I’m missing something obvious. But I’ve spent 5 hours trying different ways to clean this data and searching the internet for answers. I read the JSON package documentation, too.
I really just want to "flatten" this object so that it’s pretty and structured the same way R Studio renders it when I do View(friListings). I'm already passing "flatten=TRUE" in my "fromJSON" call above and it doesn't seem to be doing what I expect. Same with "simplifyVector=TRUE" (which is TRUE by default according to the docs, but I added it for clarity).
Thanks for any insight or guidance you may be able to offer!!!
You might want to try and adapt this approach:
f <- function(x) {
  if (is.list(x)) {
    unlist(lapply(x, f))
  } else {
    x[which(is.null(x))] <- NA
    paste(x, collapse = ",")
  }
}
df <- as.data.frame(do.call(cbind, lapply(friListings, f)))
write.table(df, tf <- tempfile(fileext = "csv"))
df <- read.table(tf)
str(df)
# 'data.frame': 4 obs. of 21 variables:
# $ listingId : int 1138579989 1138969061 1138958138 1139003985
# $ sectionId : int 1552295 1552172 1552220 1552289
# $ row : int 16 6 22 26
# $ quantity : int 1 2 4 1
# $ sellerSectionName : Factor w/ 4 levels "353","Club 303 - Obstructed/No View",..: 4 2 3 1
# $ sectionName : Factor w/ 4 levels "Club 303 - Obstructed/No View",..: 4 1 2 3
# $ seatNumbers : Factor w/ 4 levels "13,14,15,16",..: 2 3 1 4
# $ zoneId : int 232917 232909 232914 232917
# $ zoneName : Factor w/ 3 levels "Club","Middle",..: 3 1 2 3
# $ listingAttributeList : Factor w/ 4 levels "204,201","204,4369",..: 1 3 4 2
# $ listingAttributeCategoryList: int 1 1 1 1
# $ deliveryTypeList : int 5 5 5 5
# $ dirtyTicketInd : logi FALSE FALSE FALSE FALSE
# $ splitOption : int 0 0 1 1
# $ ticketSplit : int 1 2 2 1
# $ splitVector : Factor w/ 3 levels "1","2","2,4": 1 2 3 1
# $ sellerOwnInd : int 0 0 0 0
# $ currentPrice.amount : num 469 475 475 550
# $ currentPrice.currency : Factor w/ 1 level "USD": 1 1 1 1
# $ faceValue.amount : logi NA NA NA NA
# $ faceValue.currency : logi NA NA NA NA