Transform list cell in data frame into rows - json

I'm sorry for no code to replicate, I can provide a picture only. See it below please.
A data frame with Facebook insights data prepared from JSON consists a column "values" with list values. For the next manipulation I need to have only one value in the column. So the row 3 on picture should be transformed into two (with list content or value directly):
post_story_adds_by_action_type_unique lifetime list(like = 38)
post_story_adds_by_action_type_unique lifetime list(share = 11)
If there are 3 or more values in data frame list cell, it should make 3 or more single value rows.
Do you know how to do it?
I use this code to get the json and data frame:
i <- fromJSON(post.request.url)
i <- as.data.frame(i$insights$data)
Edit:
There will be no deeper nesting, just this one level.
The list is not needed in the result, I need just the values and their names.

Let's assume you're starting with something that looks like this:
mydf <- data.frame(a = c("A", "B", "C", "D"), period = "lifetime")
mydf$values <- list(list(value = 42), list(value = 5),
list(value = list(like = 38, share = 11)),
list(value = list(like = 38, share = 13)))
str(mydf)
## 'data.frame': 4 obs. of 3 variables:
## $ a : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
## $ period: Factor w/ 1 level "lifetime": 1 1 1 1
## $ values:List of 4
## ..$ :List of 1
## .. ..$ value: num 42
## ..$ :List of 1
## .. ..$ value: num 5
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 11
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 13
## NULL
Instead of retaining lists in your output, I would suggest flattening out the data, perhaps using a function like this:
myFun <- function(indt, col) {
if (!is.data.table(indt)) indt <- as.data.table(indt)
other_names <- setdiff(names(indt), col)
list_col <- indt[[col]]
rep_out <- sapply(list_col, function(x) length(unlist(x, use.names = FALSE)))
flat <- {
if (is.null(names(list_col))) names(list_col) <- seq_along(list_col)
setDT(tstrsplit(names(unlist(list_col)), ".", fixed = TRUE))[
, val := unlist(list_col, use.names = FALSE)][]
}
cbind(indt[rep(1:nrow(indt), rep_out)][, (col) := NULL], flat)
}
Here's what it does with the "mydf" I shared:
myFun(mydf, "values")
## a period V1 V2 V3 val
## 1: A lifetime 1 value NA 42
## 2: B lifetime 2 value NA 5
## 3: C lifetime 3 value like 38
## 4: C lifetime 3 value share 11
## 5: D lifetime 4 value like 38
## 6: D lifetime 4 value share 13

Related

R: Missed values when scraping webpage

When scraping data from a webpage, some elements/values are not returned.
Specifically, I use the rvest package to scrap.
The webpage that contains the information I want is https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/ - however, when I scrap the data, the columns with prices only return "$-".
Sample code:
library(rvest)
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[1:(length(tbls)-2)] %>%
html_table()
Output of first df:
> List of 22 $ :'data.frame': 7 obs. of 6 variables: ..$ Instance
> : chr [1:7] "B1L" "B1S" "B2S" "B1MS" ... ..$ Cores
> : int [1:7] 1 1 2 1 2 4 8 ..$ RAM
> : chr [1:7] "0.50 GiB" "1.00 GiB" "4.00 GiB" "2.00 GiB" ... ..$
> Temporary Storage : chr [1:7] "1 GiB" "2
> GiB" "8 GiB" "4 GiB" ... ..$ Price
> : chr [1:7] "$-" "$-" "$-" "$-" ... ..$ Prices with Azure Hybrid
> Benefit1 (% savings): chr [1:7] "$-" "$-" "$-" "$-" ...
What can I do to get the whole value of these specific elements?
They have a single set of price data irrespective of the filter. So you need to take that attribute's value and parse the json.
library(rvest)
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
webpage %>%
html_nodes("table") %>%
.[1:(length(tbls)-2)] %>%
html_table()
ss <- webpage %>% html_nodes("table span.price-data ") %>% xml_attr('data-amount')
lapply(ss,function(x){data.frame(jsonlite::fromJSON(x))})
Sample output:
[[176]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 1.496 1.496 1.376 1.376
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 1.488 1.464 1.448 1.373 1.504
regional.us.west regional.us.west.2
1 1.376 1.248
[[177]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 4.464 4.464 4.224 4.224
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 4.448 4.4 4.368 4.365 4.48
regional.us.west regional.us.west.2
1 4.224 3.968
You need to match that particular value and take the price from this.

Read Json file into a data.frame without nested lists

I am trying to load a json file into a data.frame in r. I have had some luck with the fromJSON function in the jsonlite package - But am getting nested lists and am not sure how to flatten the input into a two dimensional data.frame. Jsonlite reads the file in as a data.frame, but leaves nested lists in some of the variables.
Does Anyone have any tips in loading a JSON file to a data.frame when it reads in with nested lists.
#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")
# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )
# but many of the columns are still lists:
sapply( providers , class)
# Some of these lists have a single level
head( providers$facility_type )
# Some have lot more than two - for example nine
providers[ , 6][[1]]
I want one row per npi, and than seperate columns for each of the slices of the individual lists - so that the data frame has cols for "plan_id_type","plan_id","network_tier" nine times, maybe colnames, from 0 to 8.
I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file in two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would like to get a file with this structure load as a data.frame using the the fromJson function
HERE are a few of the things I have tried;
So I have thought of two approaches;
First: use a different function to read in the Json file, I have looked at
rjson but that reads in a list
library( rjson )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )
and I have tried RJSONIO - I tried this Getting imported json data into a data frame in R
json-data-into-a-data-frame-in-r
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
json_file <- lapply(providers, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
# but When converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)
So, the second approach I have tried is to convert all the lists into variables in my data.frame
detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )
library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
providers <- flatten( providers )
I am able to pull one of the lists - but because of missings I can't merge back on to my dataframe
a <- data.frame(Reduce(rbind, providers$facility_type))
length( a ) == nrow( providers )
I also tried these suggestions: Converting nested list to dataframe. A well as some other stuff but haven't had any luck
a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )
Any help much appreciated
Update: 21 February 2016
col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.
In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.
Here's a sample dataset that recreates those conditions:
mydf <- data.frame(id = 1:3, type = c("A", "A", "B"),
facility = I(list(c("x", "y"), NULL, "x")),
address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3),
data.frame(v1 = 1:2, v2 = 3:4, v3 = 5),
data.frame(v1 = 1, v2 = NA, v3 = 3))))
mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
preference = c(TRUE, FALSE, TRUE))
The str of this sample data.frame looks like:
str(mydf)
## 'data.frame': 3 obs. of 5 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ facility:List of 3
## ..$ : chr "x" "y"
## ..$ : NULL
## ..$ : chr "x"
## ..- attr(*, "class")= chr "AsIs"
## $ address :List of 3
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: num 2
## .. ..$ v4: num 3
## ..$ :'data.frame': 2 obs. of 3 variables:
## .. ..$ v1: int 1 2
## .. ..$ v2: int 3 4
## .. ..$ v3: num 5 5
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: logi NA
## .. ..$ v3: num 3
## ..- attr(*, "class")= chr "AsIs"
## $ person :'data.frame': 3 obs. of 3 variables:
## ..$ name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## ..$ age : num 20 32 23
## ..$ preference: logi TRUE FALSE TRUE
## NULL
One way you can "flatten" this is to "fix" the list columns. There are three fixes.
flatten (from "jsonlite") will take care of columns like the "person" column.
Columns like the "facility" column can be fixed using toString, which would convert each element to a comma separated item or which can be converted into multiple columns.
Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together).
We can take care of the second and third points with a function like the following:
col_fixer <- function(x, vec2col = FALSE) {
if (!is.list(x[[1]])) {
if (isTRUE(vec2col)) {
as.data.table(data.table::transpose(x))
} else {
vapply(x, toString, character(1L))
}
} else {
temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
temp[, .time := sequence(.N), by = .id]
value_vars <- setdiff(names(temp), c(".id", ".time"))
dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
}
}
We'll integrate that and the flatten function in another function that would do most of the processing.
Flattener <- function(indf, vec2col = FALSE) {
require(data.table)
require(jsonlite)
indf <- flatten(indf)
listcolumns <- sapply(indf, is.list)
newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
indf[listcolumns] <- list(NULL)
cbind(indf, newcols)
}
Running the function gives us:
Flattener(mydf)
## id type person.name person.age person.preference facility address.v1_1
## 1 1 A AA 20 TRUE x, y 1
## 2 2 A BB 32 FALSE 1
## 3 3 B CC 23 TRUE x 1
## address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1 NA 2 NA 3 NA NA
## 2 2 3 4 NA NA 5
## 3 NA NA NA NA NA 3
## address.v3_2
## 1 NA
## 2 5
## 3 NA
Or, with the vectors going into separate columns:
Flattener(mydf, TRUE)
## id type person.name person.age person.preference facility.V1 facility.V2
## 1 1 A AA 20 TRUE x y
## 2 2 A BB 32 FALSE <NA> <NA>
## 3 3 B CC 23 TRUE x <NA>
## address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1 1 NA 2 NA 3 NA
## 2 1 2 3 4 NA NA
## 3 1 NA NA NA NA NA
## address.v3_1 address.v3_2
## 1 NA NA
## 2 5 5
## 3 3 NA
Here's the str:
str(Flattener(mydf))
## 'data.frame': 3 obs. of 14 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ person.name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## $ person.age : num 20 32 23
## $ person.preference: logi TRUE FALSE TRUE
## $ facility : chr "x, y" "" "x"
## $ address.v1_1 : num 1 1 1
## $ address.v1_2 : num NA 2 NA
## $ address.v2_1 : num 2 3 NA
## $ address.v2_2 : num NA 4 NA
## $ address.v4_1 : num 3 NA NA
## $ address.v4_2 : num NA NA NA
## $ address.v3_1 : num NA 5 3
## $ address.v3_2 : num NA 5 NA
## NULL
On your "providers" object, this runs very quickly and consistently:
library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
# expr min lq mean median uq max neval
# Flattener(providers) 104.18939 126.59295 157.3744 138.4185 174.5222 308.5218 100
# Flattener(providers, TRUE) 67.56471 86.37789 109.8921 96.3534 121.4443 301.4856 100
# flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793 100
library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via #TylerRinker
My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:
##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);
##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data
Next, to get a deep understanding of the structure and cleanness of the data, I wrote a set of helper functions:
##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
func=identity, ## a function to run separately on each node once keyList has been exhausted
..., ## further arguments passed to func()
joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
if (length(keyList) == 0L) {
ret <- if (is.null(nodes)) NULL else func(nodes,...)
} else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
if (!is.null(joinFunc))
ret <- do.call(joinFunc,ret);
} else {
ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
}; ## end if
ret;
}; ## end if
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret
The key to understanding the above is the keyList parameter. Let's say you have a list like this:
list(NULL,'addresses',2:3,'city')
That would select all city strings underneath the second and third address elements underneath the addresses list underneath all elements of the main list.
There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:
unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St" "Fairbanks" "1650 Cowles St" "Fairbanks"
Here are the remaining helper functions I wrote in the process of working on this problem:
## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };
## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };
## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));
## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);
## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);
I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:
##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"
## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"
## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
## [1] "npi" "type" "facility_name" "facility_type" "addresses" "plans" "last_updated_on" "name" "speciality" "accepting" "languages" "gender"
## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
## npi type facility_name facility_type addresses plans last_updated_on name speciality accepting languages gender .count
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE 279
## 2 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2973
## 3 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 4
## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
## key tl
## 1 npi character[1]
## 2 type character[1]
## 3 facility_name character[1]
## 4 facility_type character[1],character[2],character[3]
## 5 addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6 plans list[9]
## 7 last_updated_on character[1]
## 8 name namedlist
## 9 speciality character[1],character[2],character[3],character[4]
## 10 accepting character[1]
## 11 languages character[2],character[3],character[4],character[6],character[5]
## 12 gender character[1]
## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them
## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"
## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address" "city" "state" "zip" "phone" "address_2"
## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## address city state zip phone address_2 .count
## 1 TRUE TRUE TRUE TRUE TRUE FALSE 1898
## 2 TRUE TRUE TRUE TRUE TRUE TRUE 2575
## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## key tl
## 1 address character[1]
## 2 city character[1]
## 3 state character[1]
## 4 zip character[1]
## 5 phone character[1]
## 6 address_2 character[1]
## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"
## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id" "network_tier"
## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## plan_id_type plan_id network_tier .count
## 1 TRUE TRUE TRUE 29304
## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## key tl
## 1 plan_id_type character[1]
## 2 plan_id character[1]
## 3 network_tier character[1]
## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first" "middle" "last"
## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
## first middle last .count
## 1 TRUE TRUE TRUE 2679
## 2 TRUE FALSE TRUE 298
## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
## key tl
## 1 first character[1]
## 2 middle character[1]
## 3 last character[1]
Here's my summary of the data:
one top-level main list, length 3256.
each element is a hash with inconsistent key sets. There are 12 keys in total across all main hashes, with 3 patterns of key sets present.
6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a list of variable length, plans is a list always of length 9, and name is a hash.
each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
each plans list element is a hash with 3 keys to scalar strings, no inconsistencies.
each name hash has first and last but not always middle scalar strings.
The most important observation here is that there are no type-inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion. We can flatten all the data into a two-dimensional structure provided we associate columns with deep-enough nodes, such that all columns correspond to a single scalar string node in the input list.
Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() I defined earlier.
##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
nodes, ## current level node selection
..., ## additional arguments to data.frame()
keyList=list(), ## current key path under main list
sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
tlList <- lapply(nodes,tl);
typeList <- do.call(c,lapply(tlList,`[[`,'type'));
if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
type <- typeList[1L];
if (type == 'namedlist') { ## hash; recurse
allKeys <- unique(do.call(c,lapply(nodes,names)));
ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
} else if (type == 'list') { ## array; recurse
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
allIndexes <- seq_len(maxLen);
ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
} else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
if (is.null(sep)) {
ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
} else {
## keep original type if maxLen is 1, IOW don't stringify
ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
}; ## end if
} else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where the value was missing, and then transforming to a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList to the string column name. Multiple columns are returned as a list of data.frames from each recursive call and likewise from the top-level call.
It also validates that there are no type-inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, so this validation step is appropriate.
flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.
An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.
I've also added a sep parameter which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements to a single string, and only one column is generated for the node.
In terms of performance, it's very fast. Here's a demo:
## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
## user system elapsed
## 2.265 0.000 2.268
Result:
class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256 126
## [1] "npi" "type" "facility_name" "facility_type.1" "facility_type.2" "facility_type.3" "addresses.1.address" "addresses.1.city" "addresses.1.state"
## [10] "addresses.1.zip" "addresses.1.phone" "addresses.1.address_2" "addresses.2.address" "addresses.2.city" "addresses.2.state" "addresses.2.zip" "addresses.2.phone" "addresses.2.address_2"
## [19] "addresses.3.address" "addresses.3.city" "addresses.3.state" "addresses.3.zip" "addresses.3.phone" "addresses.3.address_2" "addresses.4.address" "addresses.4.city" "addresses.4.state"
## [28] "addresses.4.zip" "addresses.4.phone" "addresses.4.address_2" "addresses.5.address" "addresses.5.address_2" "addresses.5.city" "addresses.5.state" "addresses.5.zip" "addresses.5.phone"
## [37] "addresses.6.address" "addresses.6.address_2" "addresses.6.city" "addresses.6.state" "addresses.6.zip" "addresses.6.phone" "addresses.7.address" "addresses.7.address_2" "addresses.7.city"
## [46] "addresses.7.state" "addresses.7.zip" "addresses.7.phone" "addresses.8.address" "addresses.8.address_2" "addresses.8.city" "addresses.8.state" "addresses.8.zip" "addresses.8.phone"
## [55] "addresses.9.address" "addresses.9.address_2" "addresses.9.city" "addresses.9.state" "addresses.9.zip" "addresses.9.phone" "addresses.10.address" "addresses.10.address_2" "addresses.10.city"
## [64] "addresses.10.state" "addresses.10.zip" "addresses.10.phone" "addresses.11.address" "addresses.11.address_2" "addresses.11.city" "addresses.11.state" "addresses.11.zip" "addresses.11.phone"
## [73] "addresses.12.address" "addresses.12.address_2" "addresses.12.city" "addresses.12.state" "addresses.12.zip" "addresses.12.phone" "addresses.13.address" "addresses.13.city" "addresses.13.state"
## [82] "addresses.13.zip" "addresses.13.phone" "plans.1.plan_id_type" "plans.1.plan_id" "plans.1.network_tier" "plans.2.plan_id_type" "plans.2.plan_id" "plans.2.network_tier" "plans.3.plan_id_type"
## [91] "plans.3.plan_id" "plans.3.network_tier" "plans.4.plan_id_type" "plans.4.plan_id" "plans.4.network_tier" "plans.5.plan_id_type" "plans.5.plan_id" "plans.5.network_tier" "plans.6.plan_id_type"
## [100] "plans.6.plan_id" "plans.6.network_tier" "plans.7.plan_id_type" "plans.7.plan_id" "plans.7.network_tier" "plans.8.plan_id_type" "plans.8.plan_id" "plans.8.network_tier" "plans.9.plan_id_type"
## [109] "plans.9.plan_id" "plans.9.network_tier" "last_updated_on" "name.first" "name.middle" "name.last" "speciality.1" "speciality.2" "speciality.3"
## [118] "speciality.4" "accepting" "languages.1" "languages.2" "languages.3" "languages.4" "languages.5" "languages.6" "gender"
The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:
rowToFrame(df[1L,]);
## column value
## 1 npi 1063645026
## 2 type FACILITY
## 3 facility_name EXPRESS SCRIPTS
## 4 facility_type.1 Pharmacies
## 5 facility_type.2 <NA>
## 6 facility_type.3 <NA>
## 7 addresses.1.address 4750 E 450 S
## 8 addresses.1.city WHITESTOWN
## 9 addresses.1.state IN
## 10 addresses.1.zip 46075
## 11 addresses.1.phone 2012695236
## 12 addresses.1.address_2 <NA>
## 13 addresses.2.address <NA>
## 14 addresses.2.city <NA>
## 15 addresses.2.state <NA>
## 16 addresses.2.zip <NA>
## 17 addresses.2.phone <NA>
## 18 addresses.2.address_2 <NA>
## 19 addresses.3.address <NA>
## 20 addresses.3.city <NA>
## 21 addresses.3.state <NA>
##
## ... snip ...
##
## 77 addresses.12.zip <NA>
## 78 addresses.12.phone <NA>
## 79 addresses.13.address <NA>
## 80 addresses.13.city <NA>
## 81 addresses.13.state <NA>
## 82 addresses.13.zip <NA>
## 83 addresses.13.phone <NA>
## 84 plans.1.plan_id_type HIOS-PLAN-ID
## 85 plans.1.plan_id 38344AK0620003
## 86 plans.1.network_tier HERITAGE-PLUS
## 87 plans.2.plan_id_type HIOS-PLAN-ID
## 88 plans.2.plan_id 38344AK0620004
## 89 plans.2.network_tier HERITAGE-PLUS
## 90 plans.3.plan_id_type HIOS-PLAN-ID
## 91 plans.3.plan_id 38344AK0620006
## 92 plans.3.network_tier HERITAGE-PLUS
## 93 plans.4.plan_id_type HIOS-PLAN-ID
## 94 plans.4.plan_id 38344AK0620008
## 95 plans.4.network_tier HERITAGE-PLUS
## 96 plans.5.plan_id_type HIOS-PLAN-ID
## 97 plans.5.plan_id 38344AK0570001
## 98 plans.5.network_tier HERITAGE-PLUS
## 99 plans.6.plan_id_type HIOS-PLAN-ID
## 100 plans.6.plan_id 38344AK0570002
## 101 plans.6.network_tier HERITAGE-PLUS
## 102 plans.7.plan_id_type HIOS-PLAN-ID
## 103 plans.7.plan_id 38344AK0980003
## 104 plans.7.network_tier HERITAGE-PLUS
## 105 plans.8.plan_id_type HIOS-PLAN-ID
## 106 plans.8.plan_id 38344AK0980006
## 107 plans.8.network_tier HERITAGE-PLUS
## 108 plans.9.plan_id_type HIOS-PLAN-ID
## 109 plans.9.plan_id 38344AK0980012
## 110 plans.9.network_tier HERITAGE-PLUS
## 111 last_updated_on 2015-10-14
## 112 name.first <NA>
## 113 name.middle <NA>
## 114 name.last <NA>
## 115 speciality.1 <NA>
## 116 speciality.2 <NA>
## 117 speciality.3 <NA>
## 118 speciality.4 <NA>
## 119 accepting <NA>
## 120 languages.1 <NA>
## 121 languages.2 <NA>
## 122 languages.3 <NA>
## 123 languages.4 <NA>
## 124 languages.5 <NA>
## 125 languages.6 <NA>
## 126 gender <NA>
I've tested the result pretty thoroughly by doing many spot-checks on individual records, and it all looks correct. Let me know if you have any questions.
This answer is rather a data organization suggestion (and is much shorter than the bounty-attracting answers around;)
If you want to keep the semantics of the fields, like keep all plan_ids in a single column, you can normalize your data design a bit, and do joins afterwards, if you need the information together:
library(dplyr)
# notice the simplifyVector=F
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F)
# pick and repeat fields for each element of array
# {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
# gives data.frame
# field1, field2 array.af1 array.af2
# val val2 av1 av2
# val val2 av3 av4
denormalize <- function(data, fields, array) {
data.frame(
c(
data[fields],
as.list(
bind_rows(
lapply(data[[array]], data.frame)))))
}
plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]),
c('npi', 'type', 'last_updated_on')))
Then you can first filter on the data and join in other information afterwards:
addresses_df %>%
filter(city == "Healy") %>%
left_join(plans_df, by="npi") ->
plans_in_healy
So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.
First, I had to understand your data set. It appears to be information about health providers.
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE )
types = sapply(providers,"[[","type")
table(types)
# FACILITY INDIVIDUAL
# 279 2977
FACILITY entries have the "ID" fields facility_name and facility_type.
INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
All entries have "ID" fields npi and last_updated_on.
All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.
Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tac on a similarly denormalized data frame with "ID" information using another merge.
library(dplyr)
unfurl_npi_data = function (x) {
repeat_cols = c("plans","addresses")
id_cols = setdiff(names(x),repeat_cols)
repeat_data = x[repeat_cols]
id_data = x[id_cols]
# Denormalized ID data
id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
atomic_colnames = names(which(!sapply(id_data, is.list)))
df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
colnames(id_data_df)[df_atomic_cols] = atomic_colnames
# Join the plans and addresses (denormalized)
repeated_data = lapply(repeat_data, rbind_all)
repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
merge(id_data_df, repeated_data_crossed)
}
providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))
Then do some cleanup.
unique_df = function(x) {
chr_col_names = names(which(sapply(x, class) == "character"))
for( col in chr_col_names )
x[[col]] = toupper(x[[col]])
unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)
And now you can ask some interesting questions. For example, how many addresses does each health care provider have?
unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
num_addresses = unique_providers %>% count(first, middle, last, gender)
table(num_addresses$n)
# 1 2 3 4 5 6 7 8 9 12 13
# 2258 492 119 33 43 21 6 1 2 1 1
At addresses with more than five people, what is the percent of male healthcare providers?
address_pcts = unique_providers %>%
group_by(address, city, state) %>%
filter(n()>5) %>%
arrange(address) %>%
summarise(pct_male = sum(gender=="MALE")/n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
And on and on...

"NA" in JSON file translates to NA logical

I have json files with data for countries. One of the files has the following data:
"[{\"count\":1,\"subject\":{\"name\":\"Namibia\",\"alpha2\":\"NA\"}}]"
I have the following code convert the json into a data.frame using the jsonlite package:
df = as.data.frame(fromJSON(jsonfile), flatten=TRUE))
I was expecting a data.frame with numbers and strings:
count subject.name subject.alpha2
1 Namibia "NA"
Instead, the NA alpha2 code is being automatically converted into NA logical, and this is what I get:
str(df)
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: logi NA
I want alpha2 to be a string, not logical. How do I fix this?
That particular implementation of fromJSON (and there are three different packages with that name for a function) has a simplifyVector argument which appears to prevent the corecion:
require(jsonlite)
> as.data.frame( fromJSON(test, simplifyVector=FALSE ) )
count subject.name subject.alpha2
1 1 Namibia NA
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : Factor w/ 1 level "Namibia": 1
$ subject.alpha2: Factor w/ 1 level "NA": 1
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ,stringsAsFactors=FALSE) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: chr "NA"
I tried seeing if that option worked well with the flatten argument, but was disappointed:
> str( fromJSON(test, simplifyVector=FALSE, flatten=TRUE) )
List of 1
$ :List of 2
..$ count : int 1
..$ subject:List of 2
.. ..$ name : chr "Namibia"
.. ..$ alpha2: chr "NA"
The accepted answer did not solve my use case.
However, rjson::fromJSON does this naturally, and to my surprise, 10 times faster on my data.

How do I write a json array from R that has a sequence of lat and long?

How do I write a json array from R that has a sequence of lat and long?
I would like to write:
[[[1,2],[3,4],[5,6]]]
the best I can do is:
toJSON(matrix(1:6, ncol = 2, byrow = T))
#"[ [ 1, 2 ],\n[ 3, 4 ],\n[ 5, 6 ] ]"
How can I wrap the thing in another array (the json kind)?
This is important to me so I can write files into a geojson format as a LineString.
I usually use fromJSON to get the target object :
ll <- fromJSON('[[[1,2],[3,4],[5,6]]]')
str(ll)
List of 1
$ :List of 3
..$ : num [1:2] 1 2
..$ : num [1:2] 3 4
..$ : num [1:2] 5 6
So we should create , a list of unnamed list, each containing 2 elements:
xx <- list(setNames(split(1:6,rep(1:3,each=2)),NULL))
identical(toJSON(xx),'[[[1,2],[3,4],[5,6]]]')
[1] TRUE
If you have a matrix
m1 <- matrix(1:6, ncol=2, byrow=T)
may be this helps:
library(rjson)
paste0("[",toJSON(setNames(split(m1, row(m1)),NULL)),"]")
#[1] "[[[1,2],[3,4],[5,6]]]"

Split and transform a R character string into numerical vector

I want to convert the following json and put the values into a data frame. It almost works but as.data.frame() puts everything into one row.
require(rjson)
require(RCurl)
y = getURI(url1)
y
[1] "[{\"close\":5.45836392962902,\"highest\":5.45837200714172,\"lowest\":5.45836392962902,\"open\":5.45837200714172,\"start_time\":\"2012-01-29T18:29:24-08:00\"},{\"close\":5.45837200714172,\"highest\":5.45837200714172,\"lowest\":5.45834791002201,\"open\":5.45835598753471,\"start_time\":\"2012-01-29T18:28:24-08:00\"}]"
x = fromJSON(y)
> str(x)
List of 2
$ :List of 5
..$ close : num 5.46
..$ highest : num 5.46
..$ lowest : num 5.46
..$ open : num 5.46
..$ start_time: chr "2012-01-29T18:29:24-08:00"
$ :List of 5
..$ close : num 5.46
..$ highest : num 5.46
..$ lowest : num 5.46
..$ open : num 5.46
..$ start_time: chr "2012-01-29T18:28:24-08:00"
as.data.frame(x)
close highest lowest open start_time close.1 highest.1 lowest.1 open.1 start_time.1
1 5.458364 5.458372 5.458364 5.458372 2012-01-29T18:29:24-08:00 5.458372 5.458372 5.458348 5.458356 2012-01-29T18:28:24-08:00
Instead of it being on one row. I want them in two rows.
close highest lowest open start_time
1 5.458364 5.458372 5.458364 5.458372 2012-01-29T18:29:24-08:00
2 5.458372 5.458372 5.458348 5.458356 2012-01-29T18:28:24-08:00
Is there something I can specify in as.data.table for this to work?
EDIT:
do.call(rbind,lapply(x,as.data.frame))
The above was able to coerce it into a data frame, but the time stamp column has two factors. This next part has its own question here
y = do.call(rbind,lapply(x,as.data.frame))
str(x)
'data.frame': 2 obs. of 5 variables:
$ close : num 5.46 5.46
$ highest : num 5.47 5.46
$ lowest : num 5.46 5.46
$ open : num 5.46 5.46
$ start_time: Factor w/ 2 levels "2012-01-29T21:48:24-05:00",..: 1 2
If I try to convert the POSIX format I get
x$start_time = as.POSIXct(x$start_time)
x$start_time
[1] "2012-01-29 CST" "2012-01-29 CST"
But it loses the time data.
You might try:
do.call(rbind,lapply(x,as.data.frame))