Read Json file into a data.frame without nested lists - json

I am trying to load a json file into a data.frame in r. I have had some luck with the fromJSON function in the jsonlite package - But am getting nested lists and am not sure how to flatten the input into a two dimensional data.frame. Jsonlite reads the file in as a data.frame, but leaves nested lists in some of the variables.
Does Anyone have any tips in loading a JSON file to a data.frame when it reads in with nested lists.
#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")
# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )
# but many of the columns are still lists:
sapply( providers , class)
# Some of these lists have a single level
head( providers$facility_type )
# Some have lot more than two - for example nine
providers[ , 6][[1]]
I want one row per npi, and than seperate columns for each of the slices of the individual lists - so that the data frame has cols for "plan_id_type","plan_id","network_tier" nine times, maybe colnames, from 0 to 8.
I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file in two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would like to get a file with this structure load as a data.frame using the the fromJson function
HERE are a few of the things I have tried;
So I have thought of two approaches;
First: use a different function to read in the Json file, I have looked at
rjson but that reads in a list
library( rjson )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )
and I have tried RJSONIO - I tried this Getting imported json data into a data frame in R
json-data-into-a-data-frame-in-r
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
json_file <- lapply(providers, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
# but When converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)
So, the second approach I have tried is to convert all the lists into variables in my data.frame
detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )
library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
providers <- flatten( providers )
I am able to pull one of the lists - but because of missings I can't merge back on to my dataframe
a <- data.frame(Reduce(rbind, providers$facility_type))
length( a ) == nrow( providers )
I also tried these suggestions: Converting nested list to dataframe. A well as some other stuff but haven't had any luck
a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )
Any help much appreciated

Update: 21 February 2016
col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.
In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.
Here's a sample dataset that recreates those conditions:
mydf <- data.frame(id = 1:3, type = c("A", "A", "B"),
facility = I(list(c("x", "y"), NULL, "x")),
address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3),
data.frame(v1 = 1:2, v2 = 3:4, v3 = 5),
data.frame(v1 = 1, v2 = NA, v3 = 3))))
mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
preference = c(TRUE, FALSE, TRUE))
The str of this sample data.frame looks like:
str(mydf)
## 'data.frame': 3 obs. of 5 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ facility:List of 3
## ..$ : chr "x" "y"
## ..$ : NULL
## ..$ : chr "x"
## ..- attr(*, "class")= chr "AsIs"
## $ address :List of 3
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: num 2
## .. ..$ v4: num 3
## ..$ :'data.frame': 2 obs. of 3 variables:
## .. ..$ v1: int 1 2
## .. ..$ v2: int 3 4
## .. ..$ v3: num 5 5
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: logi NA
## .. ..$ v3: num 3
## ..- attr(*, "class")= chr "AsIs"
## $ person :'data.frame': 3 obs. of 3 variables:
## ..$ name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## ..$ age : num 20 32 23
## ..$ preference: logi TRUE FALSE TRUE
## NULL
One way you can "flatten" this is to "fix" the list columns. There are three fixes.
flatten (from "jsonlite") will take care of columns like the "person" column.
Columns like the "facility" column can be fixed using toString, which would convert each element to a comma separated item or which can be converted into multiple columns.
Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together).
We can take care of the second and third points with a function like the following:
col_fixer <- function(x, vec2col = FALSE) {
if (!is.list(x[[1]])) {
if (isTRUE(vec2col)) {
as.data.table(data.table::transpose(x))
} else {
vapply(x, toString, character(1L))
}
} else {
temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
temp[, .time := sequence(.N), by = .id]
value_vars <- setdiff(names(temp), c(".id", ".time"))
dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
}
}
We'll integrate that and the flatten function in another function that would do most of the processing.
Flattener <- function(indf, vec2col = FALSE) {
require(data.table)
require(jsonlite)
indf <- flatten(indf)
listcolumns <- sapply(indf, is.list)
newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
indf[listcolumns] <- list(NULL)
cbind(indf, newcols)
}
Running the function gives us:
Flattener(mydf)
## id type person.name person.age person.preference facility address.v1_1
## 1 1 A AA 20 TRUE x, y 1
## 2 2 A BB 32 FALSE 1
## 3 3 B CC 23 TRUE x 1
## address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1 NA 2 NA 3 NA NA
## 2 2 3 4 NA NA 5
## 3 NA NA NA NA NA 3
## address.v3_2
## 1 NA
## 2 5
## 3 NA
Or, with the vectors going into separate columns:
Flattener(mydf, TRUE)
## id type person.name person.age person.preference facility.V1 facility.V2
## 1 1 A AA 20 TRUE x y
## 2 2 A BB 32 FALSE <NA> <NA>
## 3 3 B CC 23 TRUE x <NA>
## address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1 1 NA 2 NA 3 NA
## 2 1 2 3 4 NA NA
## 3 1 NA NA NA NA NA
## address.v3_1 address.v3_2
## 1 NA NA
## 2 5 5
## 3 3 NA
Here's the str:
str(Flattener(mydf))
## 'data.frame': 3 obs. of 14 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ person.name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## $ person.age : num 20 32 23
## $ person.preference: logi TRUE FALSE TRUE
## $ facility : chr "x, y" "" "x"
## $ address.v1_1 : num 1 1 1
## $ address.v1_2 : num NA 2 NA
## $ address.v2_1 : num 2 3 NA
## $ address.v2_2 : num NA 4 NA
## $ address.v4_1 : num 3 NA NA
## $ address.v4_2 : num NA NA NA
## $ address.v3_1 : num NA 5 3
## $ address.v3_2 : num NA 5 NA
## NULL
On your "providers" object, this runs very quickly and consistently:
library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
# expr min lq mean median uq max neval
# Flattener(providers) 104.18939 126.59295 157.3744 138.4185 174.5222 308.5218 100
# Flattener(providers, TRUE) 67.56471 86.37789 109.8921 96.3534 121.4443 301.4856 100
# flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793 100
library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via #TylerRinker

My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:
##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);
##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data
Next, to get a deep understanding of the structure and cleanness of the data, I wrote a set of helper functions:
##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
func=identity, ## a function to run separately on each node once keyList has been exhausted
..., ## further arguments passed to func()
joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
if (length(keyList) == 0L) {
ret <- if (is.null(nodes)) NULL else func(nodes,...)
} else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
if (!is.null(joinFunc))
ret <- do.call(joinFunc,ret);
} else {
ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
}; ## end if
ret;
}; ## end if
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret
The key to understanding the above is the keyList parameter. Let's say you have a list like this:
list(NULL,'addresses',2:3,'city')
That would select all city strings underneath the second and third address elements underneath the addresses list underneath all elements of the main list.
There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:
unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St" "Fairbanks" "1650 Cowles St" "Fairbanks"
Here are the remaining helper functions I wrote in the process of working on this problem:
## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };
## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };
## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));
## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);
## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);
I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:
##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"
## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"
## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
## [1] "npi" "type" "facility_name" "facility_type" "addresses" "plans" "last_updated_on" "name" "speciality" "accepting" "languages" "gender"
## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
## npi type facility_name facility_type addresses plans last_updated_on name speciality accepting languages gender .count
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE 279
## 2 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2973
## 3 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 4
## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
## key tl
## 1 npi character[1]
## 2 type character[1]
## 3 facility_name character[1]
## 4 facility_type character[1],character[2],character[3]
## 5 addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6 plans list[9]
## 7 last_updated_on character[1]
## 8 name namedlist
## 9 speciality character[1],character[2],character[3],character[4]
## 10 accepting character[1]
## 11 languages character[2],character[3],character[4],character[6],character[5]
## 12 gender character[1]
## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them
## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"
## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address" "city" "state" "zip" "phone" "address_2"
## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## address city state zip phone address_2 .count
## 1 TRUE TRUE TRUE TRUE TRUE FALSE 1898
## 2 TRUE TRUE TRUE TRUE TRUE TRUE 2575
## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## key tl
## 1 address character[1]
## 2 city character[1]
## 3 state character[1]
## 4 zip character[1]
## 5 phone character[1]
## 6 address_2 character[1]
## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"
## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id" "network_tier"
## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## plan_id_type plan_id network_tier .count
## 1 TRUE TRUE TRUE 29304
## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## key tl
## 1 plan_id_type character[1]
## 2 plan_id character[1]
## 3 network_tier character[1]
## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first" "middle" "last"
## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
## first middle last .count
## 1 TRUE TRUE TRUE 2679
## 2 TRUE FALSE TRUE 298
## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
## key tl
## 1 first character[1]
## 2 middle character[1]
## 3 last character[1]
Here's my summary of the data:
one top-level main list, length 3256.
each element is a hash with inconsistent key sets. There are 12 keys in total across all main hashes, with 3 patterns of key sets present.
6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a list of variable length, plans is a list always of length 9, and name is a hash.
each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
each plans list element is a hash with 3 keys to scalar strings, no inconsistencies.
each name hash has first and last but not always middle scalar strings.
The most important observation here is that there are no type-inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion. We can flatten all the data into a two-dimensional structure provided we associate columns with deep-enough nodes, such that all columns correspond to a single scalar string node in the input list.
Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() I defined earlier.
##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
nodes, ## current level node selection
..., ## additional arguments to data.frame()
keyList=list(), ## current key path under main list
sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
tlList <- lapply(nodes,tl);
typeList <- do.call(c,lapply(tlList,`[[`,'type'));
if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
type <- typeList[1L];
if (type == 'namedlist') { ## hash; recurse
allKeys <- unique(do.call(c,lapply(nodes,names)));
ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
} else if (type == 'list') { ## array; recurse
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
allIndexes <- seq_len(maxLen);
ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
} else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
if (is.null(sep)) {
ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
} else {
## keep original type if maxLen is 1, IOW don't stringify
ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
}; ## end if
} else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where the value was missing, and then transforming to a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList to the string column name. Multiple columns are returned as a list of data.frames from each recursive call and likewise from the top-level call.
It also validates that there are no type-inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, so this validation step is appropriate.
flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.
An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.
I've also added a sep parameter which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements to a single string, and only one column is generated for the node.
In terms of performance, it's very fast. Here's a demo:
## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
## user system elapsed
## 2.265 0.000 2.268
Result:
class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256 126
## [1] "npi" "type" "facility_name" "facility_type.1" "facility_type.2" "facility_type.3" "addresses.1.address" "addresses.1.city" "addresses.1.state"
## [10] "addresses.1.zip" "addresses.1.phone" "addresses.1.address_2" "addresses.2.address" "addresses.2.city" "addresses.2.state" "addresses.2.zip" "addresses.2.phone" "addresses.2.address_2"
## [19] "addresses.3.address" "addresses.3.city" "addresses.3.state" "addresses.3.zip" "addresses.3.phone" "addresses.3.address_2" "addresses.4.address" "addresses.4.city" "addresses.4.state"
## [28] "addresses.4.zip" "addresses.4.phone" "addresses.4.address_2" "addresses.5.address" "addresses.5.address_2" "addresses.5.city" "addresses.5.state" "addresses.5.zip" "addresses.5.phone"
## [37] "addresses.6.address" "addresses.6.address_2" "addresses.6.city" "addresses.6.state" "addresses.6.zip" "addresses.6.phone" "addresses.7.address" "addresses.7.address_2" "addresses.7.city"
## [46] "addresses.7.state" "addresses.7.zip" "addresses.7.phone" "addresses.8.address" "addresses.8.address_2" "addresses.8.city" "addresses.8.state" "addresses.8.zip" "addresses.8.phone"
## [55] "addresses.9.address" "addresses.9.address_2" "addresses.9.city" "addresses.9.state" "addresses.9.zip" "addresses.9.phone" "addresses.10.address" "addresses.10.address_2" "addresses.10.city"
## [64] "addresses.10.state" "addresses.10.zip" "addresses.10.phone" "addresses.11.address" "addresses.11.address_2" "addresses.11.city" "addresses.11.state" "addresses.11.zip" "addresses.11.phone"
## [73] "addresses.12.address" "addresses.12.address_2" "addresses.12.city" "addresses.12.state" "addresses.12.zip" "addresses.12.phone" "addresses.13.address" "addresses.13.city" "addresses.13.state"
## [82] "addresses.13.zip" "addresses.13.phone" "plans.1.plan_id_type" "plans.1.plan_id" "plans.1.network_tier" "plans.2.plan_id_type" "plans.2.plan_id" "plans.2.network_tier" "plans.3.plan_id_type"
## [91] "plans.3.plan_id" "plans.3.network_tier" "plans.4.plan_id_type" "plans.4.plan_id" "plans.4.network_tier" "plans.5.plan_id_type" "plans.5.plan_id" "plans.5.network_tier" "plans.6.plan_id_type"
## [100] "plans.6.plan_id" "plans.6.network_tier" "plans.7.plan_id_type" "plans.7.plan_id" "plans.7.network_tier" "plans.8.plan_id_type" "plans.8.plan_id" "plans.8.network_tier" "plans.9.plan_id_type"
## [109] "plans.9.plan_id" "plans.9.network_tier" "last_updated_on" "name.first" "name.middle" "name.last" "speciality.1" "speciality.2" "speciality.3"
## [118] "speciality.4" "accepting" "languages.1" "languages.2" "languages.3" "languages.4" "languages.5" "languages.6" "gender"
The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:
rowToFrame(df[1L,]);
## column value
## 1 npi 1063645026
## 2 type FACILITY
## 3 facility_name EXPRESS SCRIPTS
## 4 facility_type.1 Pharmacies
## 5 facility_type.2 <NA>
## 6 facility_type.3 <NA>
## 7 addresses.1.address 4750 E 450 S
## 8 addresses.1.city WHITESTOWN
## 9 addresses.1.state IN
## 10 addresses.1.zip 46075
## 11 addresses.1.phone 2012695236
## 12 addresses.1.address_2 <NA>
## 13 addresses.2.address <NA>
## 14 addresses.2.city <NA>
## 15 addresses.2.state <NA>
## 16 addresses.2.zip <NA>
## 17 addresses.2.phone <NA>
## 18 addresses.2.address_2 <NA>
## 19 addresses.3.address <NA>
## 20 addresses.3.city <NA>
## 21 addresses.3.state <NA>
##
## ... snip ...
##
## 77 addresses.12.zip <NA>
## 78 addresses.12.phone <NA>
## 79 addresses.13.address <NA>
## 80 addresses.13.city <NA>
## 81 addresses.13.state <NA>
## 82 addresses.13.zip <NA>
## 83 addresses.13.phone <NA>
## 84 plans.1.plan_id_type HIOS-PLAN-ID
## 85 plans.1.plan_id 38344AK0620003
## 86 plans.1.network_tier HERITAGE-PLUS
## 87 plans.2.plan_id_type HIOS-PLAN-ID
## 88 plans.2.plan_id 38344AK0620004
## 89 plans.2.network_tier HERITAGE-PLUS
## 90 plans.3.plan_id_type HIOS-PLAN-ID
## 91 plans.3.plan_id 38344AK0620006
## 92 plans.3.network_tier HERITAGE-PLUS
## 93 plans.4.plan_id_type HIOS-PLAN-ID
## 94 plans.4.plan_id 38344AK0620008
## 95 plans.4.network_tier HERITAGE-PLUS
## 96 plans.5.plan_id_type HIOS-PLAN-ID
## 97 plans.5.plan_id 38344AK0570001
## 98 plans.5.network_tier HERITAGE-PLUS
## 99 plans.6.plan_id_type HIOS-PLAN-ID
## 100 plans.6.plan_id 38344AK0570002
## 101 plans.6.network_tier HERITAGE-PLUS
## 102 plans.7.plan_id_type HIOS-PLAN-ID
## 103 plans.7.plan_id 38344AK0980003
## 104 plans.7.network_tier HERITAGE-PLUS
## 105 plans.8.plan_id_type HIOS-PLAN-ID
## 106 plans.8.plan_id 38344AK0980006
## 107 plans.8.network_tier HERITAGE-PLUS
## 108 plans.9.plan_id_type HIOS-PLAN-ID
## 109 plans.9.plan_id 38344AK0980012
## 110 plans.9.network_tier HERITAGE-PLUS
## 111 last_updated_on 2015-10-14
## 112 name.first <NA>
## 113 name.middle <NA>
## 114 name.last <NA>
## 115 speciality.1 <NA>
## 116 speciality.2 <NA>
## 117 speciality.3 <NA>
## 118 speciality.4 <NA>
## 119 accepting <NA>
## 120 languages.1 <NA>
## 121 languages.2 <NA>
## 122 languages.3 <NA>
## 123 languages.4 <NA>
## 124 languages.5 <NA>
## 125 languages.6 <NA>
## 126 gender <NA>
I've tested the result pretty thoroughly by doing many spot-checks on individual records, and it all looks correct. Let me know if you have any questions.

This answer is rather a data organization suggestion (and is much shorter than the bounty-attracting answers around;)
If you want to keep the semantics of the fields, like keep all plan_ids in a single column, you can normalize your data design a bit, and do joins afterwards, if you need the information together:
library(dplyr)
# notice the simplifyVector=F
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F)
# pick and repeat fields for each element of array
# {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
# gives data.frame
# field1, field2 array.af1 array.af2
# val val2 av1 av2
# val val2 av3 av4
denormalize <- function(data, fields, array) {
data.frame(
c(
data[fields],
as.list(
bind_rows(
lapply(data[[array]], data.frame)))))
}
plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]),
c('npi', 'type', 'last_updated_on')))
Then you can first filter on the data and join in other information afterwards:
addresses_df %>%
filter(city == "Healy") %>%
left_join(plans_df, by="npi") ->
plans_in_healy

So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.
First, I had to understand your data set. It appears to be information about health providers.
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE )
types = sapply(providers,"[[","type")
table(types)
# FACILITY INDIVIDUAL
# 279 2977
FACILITY entries have the "ID" fields facility_name and facility_type.
INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
All entries have "ID" fields npi and last_updated_on.
All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.
Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tac on a similarly denormalized data frame with "ID" information using another merge.
library(dplyr)
unfurl_npi_data = function (x) {
repeat_cols = c("plans","addresses")
id_cols = setdiff(names(x),repeat_cols)
repeat_data = x[repeat_cols]
id_data = x[id_cols]
# Denormalized ID data
id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
atomic_colnames = names(which(!sapply(id_data, is.list)))
df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
colnames(id_data_df)[df_atomic_cols] = atomic_colnames
# Join the plans and addresses (denormalized)
repeated_data = lapply(repeat_data, rbind_all)
repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
merge(id_data_df, repeated_data_crossed)
}
providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))
Then do some cleanup.
unique_df = function(x) {
chr_col_names = names(which(sapply(x, class) == "character"))
for( col in chr_col_names )
x[[col]] = toupper(x[[col]])
unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)
And now you can ask some interesting questions. For example, how many addresses does each health care provider have?
unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
num_addresses = unique_providers %>% count(first, middle, last, gender)
table(num_addresses$n)
# 1 2 3 4 5 6 7 8 9 12 13
# 2258 492 119 33 43 21 6 1 2 1 1
At addresses with more than five people, what is the percent of male healthcare providers?
address_pcts = unique_providers %>%
group_by(address, city, state) %>%
filter(n()>5) %>%
arrange(address) %>%
summarise(pct_male = sum(gender=="MALE")/n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
And on and on...

Related

Webscraping does not identify the full HTML link

My problem is very similar to this one. I want to identify all the HTML links in this website so I can then open the link and download the tables.
The problem is that when I create the extract_links functions as pointed out in that answer, I get a list of all the HTMLs, but this are not complete.
To make it more clear:
If you press "Junio" in year "2022" the real HTML is the following:
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=7658
but the HTML that I am recovering from the source of the website lacks the last bit (&ia=7658):
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022
Which does not direct me to the table I want.
The problem is that these numbers do not seem to follow any logic and change between year/month links. Any help on how to retrieve the full HTML links will be greatly appreciated. If you also happen to know how can I retrieve the year/month of the file to add as an extra column that would also be great.
Thanks to the help of #margusl I was able to realize that rvest redirects automatically and that solves my problem.
I am trying to use the following code to loop over different links to obtain the tables, store them in a data frame and then download them:
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
combined_df <- data.frame()
for (yr in yr.list){
for (mes in mes.list) {
root <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21"
# Full link
url <- paste(root,"&m=",mes,"&a=",yr,sep="")
# Parse HTML File
file<-read_html(url, encoding = "latin1")
file<- rvest::html_table(file)
str(file)
# This is the relevant table
table <- as.data.frame(file[[1]])
# in your loop, add the files that you read to the combined_df
combined_df <- rbind(combined_df, table)
}
}
It does not work because the read_html code with the encoding works only for some years, but not for all. for example, when running:
url <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015"
file<-read_html(url, encoding = "latin1")
It does not recover the tables with names/surnames that recovers in the previous months but something else. Why can't this work on all the sub-pages? Is this a encoding problem again?
If you open that last page you had issues with, you'll see that it serves a sort of a submenu with 2 more links - http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015 . Meaning that it's not enough to just generate links for each month & year and extract first table of each page, all those pages should be checked for content and exceptions should be handled.
Though I took somewhat opportunistic approach and it happened to work with URL range defined in question + those few odd samples, but there could be other surprises down the road. Switched to httr for making requests as it allows to collect and monitor response headers, separating content retrieval and parsing also seems to work around encoding issues, at least in this case. First collecting and then parsing also simplifies debugging, you can check if certain responses / headers were different from the rest (i.e. response length being 10x smaller than average or final, redirected, url differs from the rest). And it's easy to change content handling
/ parsing for a small subset of responses, if needed. If you are not sure what rvest has retrieved, you can always save the response to a html file and check it with browser or editor, something like
html <- read_html(url_or_text_content); write(as.character(html), "dump.html")
library(rvest)
library(httr)
library(purrr)
library(dplyr)
library(tidyr)
library(stringr)
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
# combine mes.list & yr.list
url.params <- expand.grid(mes = mes.list, yr = yr.list)
# few extra samples:
url.params <- rbind(url.params,
list(mes = 6, yr = 2022), # here rvest strugglest with correct encoding
list(mes = 3, yr = 2015) # returns page with sub-categories
)
url.list <- str_glue("http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m={url.params$mes}&a={url.params$yr}")
url.list
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=1&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=2&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2019
#> ...
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=11&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=12&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=6&a=2022
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# url list for input, output is a tibble with all responses (incl. "url", "date",
# "status_code", header details and response body)
fetch_urls <- function(url.list){
# collect all responses to a list with httr, enable verbose, parse responses later
# add progress bar - requests take a while
resp.list = vector(mode = "list", length = length(url.list))
pb <- txtProgressBar(max = length(url.list), style = 3)
for (i in seq_along(url.list)){
resp.list[[i]] <- GET(url.list[i])
setTxtProgressBar(pb,i)
}
close(pb)
# turn responses into tibble to check urls, response sizes and status codes
resp.tibble <- bind_cols(
map_df(resp.list, ~ .[c("url", "date", "status_code")], .id = "req_id"),
map_df(resp.list, headers) %>% rename_with(~ paste0("header_",.x)),
# map_df(resp_follow.list, "times"),
map_chr(resp.list, content, as = "text") %>% tibble(html_doc = .)
)
return(resp.tibble)
}
resp.tibble <- fetch_urls(url.list)
# check resulting table without html_doc column
# View(resp.tibble[-ncol(resp.tibble)])
resp.tibble %>%
select(req_id:status_code,`header_content-length`) %>%
arrange(`header_content-length`)
#> # A tibble: 26 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 14 http://transparencia.uantof.cl/in… 2022-10-31 17:29:12 200 21371
#> 2 26 http://transparencia.uantof.cl/in… 2022-10-31 17:31:45 200 2230
#> 3 24 http://transparencia.uantof.cl/in… 2022-10-31 17:31:21 200 24035
#> 4 21 http://transparencia.uantof.cl/in… 2022-10-31 17:30:42 200 24173
#> 5 20 http://transparencia.uantof.cl/in… 2022-10-31 17:30:29 200 24183
#> 6 23 http://transparencia.uantof.cl/in… 2022-10-31 17:31:08 200 24184
#> 7 22 http://transparencia.uantof.cl/in… 2022-10-31 17:30:55 200 24207
#> 8 18 http://transparencia.uantof.cl/in… 2022-10-31 17:30:04 200 24405
#> 9 16 http://transparencia.uantof.cl/in… 2022-10-31 17:29:38 200 24715
#> 10 7 http://transparencia.uantof.cl/in… 2022-10-31 17:27:32 200 24716
#> # … with 16 more rows, and abbreviated variable names ¹​status_code,
#> # ²​`header_content-length`
# 26. is kind of suspicious:
# 25 http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=76…
# 26 http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# looks like there has been no redirection and its header_content-length is about 10x smaller than for other responses
# checking it more closely reveals that the page includes a "submenu" instead of table(s):
# <p class="subMenu_interiores">
# <b>2015 - Marzo</b>
# ABRIL 2015
# Marzo 2015
# </p>
# lets' collect urls that were not redirected from our tibble and harvest links from stored html:
suburl.list <- resp.tibble %>%
# urls that do NOT include "plantillas_generar_plantilla"
filter(!str_detect(url, "plantillas_generar_plantilla")) %>%
pull(html_doc) %>%
# rvest does not like lists, thus let's map()
map( ~ read_html(.x) %>% html_elements("#columna1_interiores a") %>% html_attr("href")) %>%
unlist() %>%
paste0("http://transparencia.uantof.cl/",.)
suburl.list
#> [1] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=772"
#> [2] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=648"
# fetch content from those submenu urls
subresp.tibble <- fetch_urls(suburl.list)
# sanity check:
subresp.tibble %>%
select(req_id:status_code,`header_content-length`)
#> # A tibble: 2 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 1 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:52 200 25385
#> 2 2 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:59 200 25332
#> # … with abbreviated variable names ¹​status_code, ²​`header_content-length`
# better, sizes align with previous results.
# collect all relevant responses
table_1 <- resp.tibble %>%
filter(str_detect(url, "plantillas_generar_plantilla")) %>%
bind_rows(subresp.tibble) %>%
# extract html (as strings)
pull(html_doc) %>%
# rvest does not like lists, thus let's map(), pluck(1) extracts first table (from each page)
map(~ read_html(.x) %>% html_table() %>% pluck(1)) %>%
# first attempt to bind rows fails, aparently column types differ
# change all non-character columns to character
map (~ mutate(.x, across(!where(is.character),as.character))) %>%
# bind all tables by rows
bind_rows()
# columns vary across tables so number of NA fields in final result is rather high
Final result for 26 pages, a 10,987 × 30 tibble:
table_1
#> # A tibble: 10,987 × 30
#> Nº PLANTA PATERNO MATERNO NOMBRES G TITULO CARGO REGION ASIGN…¹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 DIRECTIVO ABARZA CASTRO JESSIC… 6 ADMIN… JEFE… SEGUN… (1)(8)…
#> 2 2 PROFESIONAL ABASOLO QUINTE… NURY D… 12 EDUCA… PROF… SEGUN… (4)(8)…
#> 3 3 ACADEMICO ACOSTA PENA ROXANA… 11 EDUCA… PROF… SEGUN… (2)(8)…
#> 4 4 AUXILIARES ACOSTA PIZARRO ROBERT… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 5 5 DIRECTIVO AGENO SIERRA ROSELL… 4 MATRO… DIRE… SEGUN… (1)(8)…
#> 6 6 AUXILIARES AGUIRRE LAZO RENE G… 16 LICEN… AUXI… SEGUN… (7)(8)…
#> 7 7 TECNICOS ALAMOS MARIN SERGIO… 13 TECNI… TECN… SEGUN… (5)(8)…
#> 8 8 AUXILIARES ALAYANA CORTES CHRIST… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 9 9 ACADEMICO ALCOTA AGUIRRE PATRIC… 9 ING. … PROF… SEGUN… (2)(8)…
#> 10 10 ADMINISTRATI… ALFARO BARRAZA MARIA … 23 LICEN… ADMI… SEGUN… (6)(8)…
#> # … with 10,977 more rows, 20 more variables: `UNID MONETARIA` <chr>,
#> # `REMUNERACION MENSUAL BRUTA` <chr>, HORAS <chr>, `CANT. HORAS` <chr>,
#> # `MONTO HORAS EXTRAS` <chr>, `FECHA DE INGRESO` <chr>, `F. HASTA` <chr>,
#> # OBSERVACIONES <chr>, GRADO <chr>, ESTAMENTO <chr>,
#> # `Apellido Paterno` <chr>, `Apellido Materno` <chr>, Nombres <chr>,
#> # `Grado ERUA` <chr>, `CALIFICACION PROFESIONAL O FORMACION` <chr>,
#> # `CARGO O FUNCION` <chr>, `R BRUTA` <chr>, `Horas Extras` <chr>, …
Created on 2022-10-31 with reprex v2.0.2

R Read Json into a data.frame

I am trying to load a json file into a data.frame in R. But there is some list() which is null in my data.
Here is my json data:
json_file1 <- jsonlite::fromJSON('{"txtId":"20180101","data":{"user":[{"id":"123","phone":"00001","realName":"Eric","addr":{},"source":{},"registerDate":{},"type":0,"remain":{}}],"score":[]}}')
json_file2 <- jsonlite::fromJSON('{"txtId":"20180102","data":{"user":[{"id":"456","phone":"00002","realName":"Amy","addr":{},"source":{},"registerDate":{},"type":0,"remain":100}],"score":[]}}')
json_file = list(json_file1, json_file2)
zt.detail = lapply(json_file, function(y){
if(!is.null(y$data$user)) data.frame(y$data$user, stringsAsFactors = F)
})
when I rbind zt.detail, I get the error:
# > dat_callrecord = data.table::rbindlist(zt.detail, fill = T)
# Error in data.table::rbindlist(zt.detail, fill = T) :
# Column 4 of item 1 is length 0, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table
# > str(zt.detail[[1]])
# 'data.frame': 1 obs. of 9 variables:
# $ id : chr "123"
# $ phone : chr "00001"
# $ realName : chr "Eric"
# $ addr :'data.frame': 1 obs. of 0 variables
# $ source :'data.frame': 1 obs. of 0 variables
# $ registerDate:'data.frame': 1 obs. of 0 variables
# $ type : int 0
# $ remain :'data.frame': 1 obs. of 0 variables
The error was caused because the structure of my data contains data.frame of 1 observation but 0 variables. So I want to transfer those list() into NA before and get the following result:
> dat_callrecord
id phone realName type remain addr source registerDate
123 00001 Eric 0 NA NA NA NA
456 00002 Amy 0 100 NA NA NA
We can loop through the list and if there is a data.frame, replace it with NA and then do the rbindlist
data.table::rbindlist(lapply(zt.detail, function(x) {
x[] <- lapply(x, function(y) if(is.data.frame(y)) NA else y)
x}))
# id phone realName addr source registerDate type remain
#1: 123 00001 Eric NA NA NA 0 NA
#2: 456 00002 Amy NA NA NA 0 100

Parsing incomplete lists into data frames with two different problems

If you request web data through R, you often work with json or xml where the fields are not named if there is no value for them. Sometimes, there isn't even any data and it comes out as an empty list for a certain index. So, I see this as two different problems. I'm proposing the solution I use to solve this as well but I know there are some better ones out there. I have for starters, a very messy and fake list that I created that is missing field names (on purpose from the xml, json spec) AND missing whole indexes (also on purpose).
(messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5)))
Now, here is how I break it down to what I would say is "solved".
library(plyr)
messy_list_no_empties <- lapply(messy_list, function(x) if(length(x) == 0) {list(NA, NA)} else x)
ldply(messy_list_no_empties, data.frame)[,1:2]
The end result is what I am looking for but I would like to find a more elegant way to deal with this problem.
With purrr::map_df,
library(purrr)
messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5))
messy_list %>% map_df(~list(x = .x$x %||% NA,
y = .x$y %||% NA))
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
map_df iterates over the list like lapply and coerces the results to a data.frame. The function (in purrr's formula form) assembles a list with an x and a y element, looking for existing values if they're there. If they're not, the subsetting will return NULL, which %||% will replace with the value after it, NA.
In mostly-equivalent base R,
as.data.frame(do.call(rbind,
lapply(messy_list, function(.x){
list(x = ifelse(is.null(.x$x), NA, .x$x),
y = ifelse(is.null(.x$y), NA, .x$y))
})))
#> x y
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
Note the base approach won't handle different types well. To do so, coerce everything to character (rbind probably will anyway, so just add stringsAsFactors = FALSE to as.data.frame) and lapply type.convert.
Your method is already pretty compact, but if you're looking for other methods, one way might be to use rbindlist from data.table:
library(data.table)
new_list <- lapply(messy_list, function(x) if(identical(x,list())){list(x = NA)} else {x})
rbindlist(new_list, fill = T, use.names = T)
# x y
#1: 2 3
#2: NA NA
#3: NA 4
#4: 5 NA
Note we need the lapply so it doesn't drop the rows that are empty

Transform list cell in data frame into rows

I'm sorry for no code to replicate, I can provide a picture only. See it below please.
A data frame with Facebook insights data prepared from JSON consists a column "values" with list values. For the next manipulation I need to have only one value in the column. So the row 3 on picture should be transformed into two (with list content or value directly):
post_story_adds_by_action_type_unique lifetime list(like = 38)
post_story_adds_by_action_type_unique lifetime list(share = 11)
If there are 3 or more values in data frame list cell, it should make 3 or more single value rows.
Do you know how to do it?
I use this code to get the json and data frame:
i <- fromJSON(post.request.url)
i <- as.data.frame(i$insights$data)
Edit:
There will be no deeper nesting, just this one level.
The list is not needed in the result, I need just the values and their names.
Let's assume you're starting with something that looks like this:
mydf <- data.frame(a = c("A", "B", "C", "D"), period = "lifetime")
mydf$values <- list(list(value = 42), list(value = 5),
list(value = list(like = 38, share = 11)),
list(value = list(like = 38, share = 13)))
str(mydf)
## 'data.frame': 4 obs. of 3 variables:
## $ a : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
## $ period: Factor w/ 1 level "lifetime": 1 1 1 1
## $ values:List of 4
## ..$ :List of 1
## .. ..$ value: num 42
## ..$ :List of 1
## .. ..$ value: num 5
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 11
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 13
## NULL
Instead of retaining lists in your output, I would suggest flattening out the data, perhaps using a function like this:
myFun <- function(indt, col) {
if (!is.data.table(indt)) indt <- as.data.table(indt)
other_names <- setdiff(names(indt), col)
list_col <- indt[[col]]
rep_out <- sapply(list_col, function(x) length(unlist(x, use.names = FALSE)))
flat <- {
if (is.null(names(list_col))) names(list_col) <- seq_along(list_col)
setDT(tstrsplit(names(unlist(list_col)), ".", fixed = TRUE))[
, val := unlist(list_col, use.names = FALSE)][]
}
cbind(indt[rep(1:nrow(indt), rep_out)][, (col) := NULL], flat)
}
Here's what it does with the "mydf" I shared:
myFun(mydf, "values")
## a period V1 V2 V3 val
## 1: A lifetime 1 value NA 42
## 2: B lifetime 2 value NA 5
## 3: C lifetime 3 value like 38
## 4: C lifetime 3 value share 11
## 5: D lifetime 4 value like 38
## 6: D lifetime 4 value share 13

How to write a JSON object from R dataframe with grouping

In general I feel there is a need to make JSON objects by folding multiple columns. There is no direct way to do this afaik. Please point it out if there is ..
I have data of this from
A B C
1 a x
1 a y
1 c z
2 d p
2 f q
2 f r
How do I write a json which looks like
{'query':'1', 'type':[{'name':'a', 'values':[{'value':'x'}, {'value':'y'}]}, {'name':'c', 'values':[{'value':'z'}]}]}
and similarly for 'query':'2'
I am looking to spit them in the mongo import/export individual json lines format.
Any pointers are also appreciated..
You've got a little "non-standard" thing going with two keys of "value" (I don't know if this is legal json), as you can see here:
(js <- jsonlite::fromJSON('{"query":"1", "type":[{"name":"a", "values":[{"value":"x"}, {"value":"y"}]}, {"name":"c", "values":[{"value":"z"}]}]}'))
## $query
## [1] "1"
##
## $type
## name values
## 1 a x, y
## 2 c z
... with a data.frame cell containing a list of data.frames:
js$type$values[[1]]
## value
## 1 x
## 2 y
class(js$type$values[[1]])
## [1] "data.frame"
If you can accept your "type" variable containing a vector instead of a named-list, then perhaps the following code will suffice:
jsonlite::toJSON(lapply(unique(dat[, 'A']), function(a1) {
list(query = a1,
type = lapply(unique(dat[dat$A == a1, 'B']), function(b2) {
list(name = b2,
values = dat[(dat$A == a1) & (dat$B == b2), 'C'])
}))
}))
## [{"query":[1],"type":[{"name":["a"],"values":["x","y"]},{"name":["c"],"values":["z"]}]},{"query":[2],"type":[{"name":["d"],"values":["p"]},{"name":["f"],"values":["q","r"]}]}]