How to write a JSON object from R dataframe with grouping - json

In general I feel there is a need to make JSON objects by folding multiple columns. There is no direct way to do this afaik. Please point it out if there is ..
I have data of this from
A B C
1 a x
1 a y
1 c z
2 d p
2 f q
2 f r
How do I write a json which looks like
{'query':'1', 'type':[{'name':'a', 'values':[{'value':'x'}, {'value':'y'}]}, {'name':'c', 'values':[{'value':'z'}]}]}
and similarly for 'query':'2'
I am looking to spit them in the mongo import/export individual json lines format.
Any pointers are also appreciated..

You've got a little "non-standard" thing going with two keys of "value" (I don't know if this is legal json), as you can see here:
(js <- jsonlite::fromJSON('{"query":"1", "type":[{"name":"a", "values":[{"value":"x"}, {"value":"y"}]}, {"name":"c", "values":[{"value":"z"}]}]}'))
## $query
## [1] "1"
##
## $type
## name values
## 1 a x, y
## 2 c z
... with a data.frame cell containing a list of data.frames:
js$type$values[[1]]
## value
## 1 x
## 2 y
class(js$type$values[[1]])
## [1] "data.frame"
If you can accept your "type" variable containing a vector instead of a named-list, then perhaps the following code will suffice:
jsonlite::toJSON(lapply(unique(dat[, 'A']), function(a1) {
list(query = a1,
type = lapply(unique(dat[dat$A == a1, 'B']), function(b2) {
list(name = b2,
values = dat[(dat$A == a1) & (dat$B == b2), 'C'])
}))
}))
## [{"query":[1],"type":[{"name":["a"],"values":["x","y"]},{"name":["c"],"values":["z"]}]},{"query":[2],"type":[{"name":["d"],"values":["p"]},{"name":["f"],"values":["q","r"]}]}]

Related

Parsing incomplete lists into data frames with two different problems

If you request web data through R, you often work with json or xml where the fields are not named if there is no value for them. Sometimes, there isn't even any data and it comes out as an empty list for a certain index. So, I see this as two different problems. I'm proposing the solution I use to solve this as well but I know there are some better ones out there. I have for starters, a very messy and fake list that I created that is missing field names (on purpose from the xml, json spec) AND missing whole indexes (also on purpose).
(messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5)))
Now, here is how I break it down to what I would say is "solved".
library(plyr)
messy_list_no_empties <- lapply(messy_list, function(x) if(length(x) == 0) {list(NA, NA)} else x)
ldply(messy_list_no_empties, data.frame)[,1:2]
The end result is what I am looking for but I would like to find a more elegant way to deal with this problem.
With purrr::map_df,
library(purrr)
messy_list <- list(list(x = 2, y = 3),
list(),
list(y = 4),
list(x = 5))
messy_list %>% map_df(~list(x = .x$x %||% NA,
y = .x$y %||% NA))
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
map_df iterates over the list like lapply and coerces the results to a data.frame. The function (in purrr's formula form) assembles a list with an x and a y element, looking for existing values if they're there. If they're not, the subsetting will return NULL, which %||% will replace with the value after it, NA.
In mostly-equivalent base R,
as.data.frame(do.call(rbind,
lapply(messy_list, function(.x){
list(x = ifelse(is.null(.x$x), NA, .x$x),
y = ifelse(is.null(.x$y), NA, .x$y))
})))
#> x y
#> 1 2 3
#> 2 NA NA
#> 3 NA 4
#> 4 5 NA
Note the base approach won't handle different types well. To do so, coerce everything to character (rbind probably will anyway, so just add stringsAsFactors = FALSE to as.data.frame) and lapply type.convert.
Your method is already pretty compact, but if you're looking for other methods, one way might be to use rbindlist from data.table:
library(data.table)
new_list <- lapply(messy_list, function(x) if(identical(x,list())){list(x = NA)} else {x})
rbindlist(new_list, fill = T, use.names = T)
# x y
#1: 2 3
#2: NA NA
#3: NA 4
#4: 5 NA
Note we need the lapply so it doesn't drop the rows that are empty

Tidy nested json tree

This comes up a lot when dealing with API's.
Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.
I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.
I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,
But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.
#####################
# Some Test Data
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
############################################################
# Run through the list and rename all list elements,
# We unlist once at time, adding "__" at each unlist step
# until the object is no longer a list
renameVars <- function(lst, sep = '__') {
if(is.list(lst)) {
names(lst) <- paste0(names(lst),sep)
renameVars(unlist(lst, recursive = FALSE),sep = sep)
} else {
lst
}
}
res <- renameVars(aNestedTree)
We can check the output and see that we have a strangely named object,
But there's a method to this madness.
> res
a________ b________ c__.a____1__ c__.a____2__ c__.a____3__
1 2 1 2 3
c__.a____4__ c__.a____5__ c__.b______ c__.c__.a____ c__.c__.d____
4 5 2 1 3
d__.y______ d__.z______
3 2
Now I put this in a data.table, so I can shape it.
library(data.table)
dt <- data.table(values = res, name = names(res))
# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests
> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
values name V1 V2 V3
1: 1 a________ a NA NA
2: 2 b________ b NA NA
3: 1 c__.a____1__ c a 1
4: 2 c__.a____2__ c a 2
5: 3 c__.a____3__ c a 3
6: 4 c__.a____4__ c a 4
7: 5 c__.a____5__ c a 5
8: 2 c__.b______ c b NA
9: 1 c__.c__.a____ c c a
10: 3 c__.c__.d____ c c d
11: 3 d__.y______ d y NA
12: 2 d__.z______ d z NA
I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)
I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.
I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.
Notes on approach:
That all values would be numeric (I cast them to character afterwards)
Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
level is kept track of by level of recursion
First define the helpers:
recurse_gather <- function(.x,.level) {
.x <- tidyjson::bind_rows(
gobj(.x,.level)
, garr(.x,.level)
, gpersist(.x,.level)
)
if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
.x <- recurse_gather(.x,.level+1)
}
return(.x)
}
gobj <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='object') %>%
gather_object(paste0('v',.level)) %>%
select(-type)
}
gpersist <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(! type %in% c('object','array')) %>%
mutate_(.dots=setNames(
paste0('as.character(NA)')
,paste0('v',.level)
)) %>%
select(-type)
}
garr <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='array') %>%
gather_array('arridx') %>%
append_values_number(paste0('v',.level)) %>%
mutate_(.dots=setNames(
paste0('as.character(v',.level,')')
,paste0('v',.level)
)) %>%
select(-arridx,-type)
}
Then using the helpers is pretty straight-forward.
library(dplyr)
library(tidyjson)
j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#> document.id v1 v2 v3 v4
#> * <int> <chr> <chr> <chr> <chr>
#> 1 1 a 1 <NA> <NA>
#> 2 1 b 2 <NA> <NA>
#> 3 1 c a 1 <NA>
#> 4 1 c a 2 <NA>
#> 5 1 c a 3 <NA>
#> 6 1 c a 4 <NA>
#> 7 1 c a 5 <NA>
#> 8 1 c b 2 <NA>
#> 9 1 c c a 1
#> 10 1 c c d 3
#> 11 1 d y 3 <NA>
#> 12 1 d z 2 <NA>
Hopeful that future development on the tidyjson package will make this an easier problem to tackle!
I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson
As dracodoc pointed out, data.tree might help. E.g. like this:
library(data.tree)
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
tree <- FromListSimple(aNestedTree)
print(tree)
This will give:
levelName z
1 Root NA
2 ¦--c NA
3 ¦ ¦--a NA
4 ¦ °--c NA
5 ¦ °--e NA
6 °--d 2
And:
tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"
Side note: typically, you could do something like this:
do.call("print", c(tree, tree$fieldsAll))
However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.

Read Json file into a data.frame without nested lists

I am trying to load a json file into a data.frame in r. I have had some luck with the fromJSON function in the jsonlite package - But am getting nested lists and am not sure how to flatten the input into a two dimensional data.frame. Jsonlite reads the file in as a data.frame, but leaves nested lists in some of the variables.
Does Anyone have any tips in loading a JSON file to a data.frame when it reads in with nested lists.
#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")
# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )
# but many of the columns are still lists:
sapply( providers , class)
# Some of these lists have a single level
head( providers$facility_type )
# Some have lot more than two - for example nine
providers[ , 6][[1]]
I want one row per npi, and than seperate columns for each of the slices of the individual lists - so that the data frame has cols for "plan_id_type","plan_id","network_tier" nine times, maybe colnames, from 0 to 8.
I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file in two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would like to get a file with this structure load as a data.frame using the the fromJson function
HERE are a few of the things I have tried;
So I have thought of two approaches;
First: use a different function to read in the Json file, I have looked at
rjson but that reads in a list
library( rjson )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )
and I have tried RJSONIO - I tried this Getting imported json data into a data frame in R
json-data-into-a-data-frame-in-r
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
json_file <- lapply(providers, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
# but When converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)
So, the second approach I have tried is to convert all the lists into variables in my data.frame
detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )
library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
providers <- flatten( providers )
I am able to pull one of the lists - but because of missings I can't merge back on to my dataframe
a <- data.frame(Reduce(rbind, providers$facility_type))
length( a ) == nrow( providers )
I also tried these suggestions: Converting nested list to dataframe. A well as some other stuff but haven't had any luck
a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )
Any help much appreciated
Update: 21 February 2016
col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.
In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.
Here's a sample dataset that recreates those conditions:
mydf <- data.frame(id = 1:3, type = c("A", "A", "B"),
facility = I(list(c("x", "y"), NULL, "x")),
address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3),
data.frame(v1 = 1:2, v2 = 3:4, v3 = 5),
data.frame(v1 = 1, v2 = NA, v3 = 3))))
mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
preference = c(TRUE, FALSE, TRUE))
The str of this sample data.frame looks like:
str(mydf)
## 'data.frame': 3 obs. of 5 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ facility:List of 3
## ..$ : chr "x" "y"
## ..$ : NULL
## ..$ : chr "x"
## ..- attr(*, "class")= chr "AsIs"
## $ address :List of 3
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: num 2
## .. ..$ v4: num 3
## ..$ :'data.frame': 2 obs. of 3 variables:
## .. ..$ v1: int 1 2
## .. ..$ v2: int 3 4
## .. ..$ v3: num 5 5
## ..$ :'data.frame': 1 obs. of 3 variables:
## .. ..$ v1: num 1
## .. ..$ v2: logi NA
## .. ..$ v3: num 3
## ..- attr(*, "class")= chr "AsIs"
## $ person :'data.frame': 3 obs. of 3 variables:
## ..$ name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## ..$ age : num 20 32 23
## ..$ preference: logi TRUE FALSE TRUE
## NULL
One way you can "flatten" this is to "fix" the list columns. There are three fixes.
flatten (from "jsonlite") will take care of columns like the "person" column.
Columns like the "facility" column can be fixed using toString, which would convert each element to a comma separated item or which can be converted into multiple columns.
Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together).
We can take care of the second and third points with a function like the following:
col_fixer <- function(x, vec2col = FALSE) {
if (!is.list(x[[1]])) {
if (isTRUE(vec2col)) {
as.data.table(data.table::transpose(x))
} else {
vapply(x, toString, character(1L))
}
} else {
temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
temp[, .time := sequence(.N), by = .id]
value_vars <- setdiff(names(temp), c(".id", ".time"))
dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
}
}
We'll integrate that and the flatten function in another function that would do most of the processing.
Flattener <- function(indf, vec2col = FALSE) {
require(data.table)
require(jsonlite)
indf <- flatten(indf)
listcolumns <- sapply(indf, is.list)
newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
indf[listcolumns] <- list(NULL)
cbind(indf, newcols)
}
Running the function gives us:
Flattener(mydf)
## id type person.name person.age person.preference facility address.v1_1
## 1 1 A AA 20 TRUE x, y 1
## 2 2 A BB 32 FALSE 1
## 3 3 B CC 23 TRUE x 1
## address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1 NA 2 NA 3 NA NA
## 2 2 3 4 NA NA 5
## 3 NA NA NA NA NA 3
## address.v3_2
## 1 NA
## 2 5
## 3 NA
Or, with the vectors going into separate columns:
Flattener(mydf, TRUE)
## id type person.name person.age person.preference facility.V1 facility.V2
## 1 1 A AA 20 TRUE x y
## 2 2 A BB 32 FALSE <NA> <NA>
## 3 3 B CC 23 TRUE x <NA>
## address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1 1 NA 2 NA 3 NA
## 2 1 2 3 4 NA NA
## 3 1 NA NA NA NA NA
## address.v3_1 address.v3_2
## 1 NA NA
## 2 5 5
## 3 3 NA
Here's the str:
str(Flattener(mydf))
## 'data.frame': 3 obs. of 14 variables:
## $ id : int 1 2 3
## $ type : Factor w/ 2 levels "A","B": 1 1 2
## $ person.name : Factor w/ 3 levels "AA","BB","CC": 1 2 3
## $ person.age : num 20 32 23
## $ person.preference: logi TRUE FALSE TRUE
## $ facility : chr "x, y" "" "x"
## $ address.v1_1 : num 1 1 1
## $ address.v1_2 : num NA 2 NA
## $ address.v2_1 : num 2 3 NA
## $ address.v2_2 : num NA 4 NA
## $ address.v4_1 : num 3 NA NA
## $ address.v4_2 : num NA NA NA
## $ address.v3_1 : num NA 5 3
## $ address.v3_2 : num NA 5 NA
## NULL
On your "providers" object, this runs very quickly and consistently:
library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
# expr min lq mean median uq max neval
# Flattener(providers) 104.18939 126.59295 157.3744 138.4185 174.5222 308.5218 100
# Flattener(providers, TRUE) 67.56471 86.37789 109.8921 96.3534 121.4443 301.4856 100
# flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793 100
library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via #TylerRinker
My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:
##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);
##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data
Next, to get a deep understanding of the structure and cleanness of the data, I wrote a set of helper functions:
##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
func=identity, ## a function to run separately on each node once keyList has been exhausted
..., ## further arguments passed to func()
joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
if (length(keyList) == 0L) {
ret <- if (is.null(nodes)) NULL else func(nodes,...)
} else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
if (!is.null(joinFunc))
ret <- do.call(joinFunc,ret);
} else {
ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
}; ## end if
ret;
}; ## end if
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret
The key to understanding the above is the keyList parameter. Let's say you have a list like this:
list(NULL,'addresses',2:3,'city')
That would select all city strings underneath the second and third address elements underneath the addresses list underneath all elements of the main list.
There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:
unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St" "Fairbanks" "1650 Cowles St" "Fairbanks"
Here are the remaining helper functions I wrote in the process of working on this problem:
## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };
## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };
## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));
## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);
## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);
I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:
##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"
## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"
## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
## [1] "npi" "type" "facility_name" "facility_type" "addresses" "plans" "last_updated_on" "name" "speciality" "accepting" "languages" "gender"
## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
## npi type facility_name facility_type addresses plans last_updated_on name speciality accepting languages gender .count
## 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE 279
## 2 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2973
## 3 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 4
## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
## key tl
## 1 npi character[1]
## 2 type character[1]
## 3 facility_name character[1]
## 4 facility_type character[1],character[2],character[3]
## 5 addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6 plans list[9]
## 7 last_updated_on character[1]
## 8 name namedlist
## 9 speciality character[1],character[2],character[3],character[4]
## 10 accepting character[1]
## 11 languages character[2],character[3],character[4],character[6],character[5]
## 12 gender character[1]
## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them
## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"
## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address" "city" "state" "zip" "phone" "address_2"
## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## address city state zip phone address_2 .count
## 1 TRUE TRUE TRUE TRUE TRUE FALSE 1898
## 2 TRUE TRUE TRUE TRUE TRUE TRUE 2575
## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
## key tl
## 1 address character[1]
## 2 city character[1]
## 3 state character[1]
## 4 zip character[1]
## 5 phone character[1]
## 6 address_2 character[1]
## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"
## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id" "network_tier"
## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## plan_id_type plan_id network_tier .count
## 1 TRUE TRUE TRUE 29304
## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
## key tl
## 1 plan_id_type character[1]
## 2 plan_id character[1]
## 3 network_tier character[1]
## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first" "middle" "last"
## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
## first middle last .count
## 1 TRUE TRUE TRUE 2679
## 2 TRUE FALSE TRUE 298
## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
## key tl
## 1 first character[1]
## 2 middle character[1]
## 3 last character[1]
Here's my summary of the data:
one top-level main list, length 3256.
each element is a hash with inconsistent key sets. There are 12 keys in total across all main hashes, with 3 patterns of key sets present.
6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a list of variable length, plans is a list always of length 9, and name is a hash.
each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
each plans list element is a hash with 3 keys to scalar strings, no inconsistencies.
each name hash has first and last but not always middle scalar strings.
The most important observation here is that there are no type-inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion. We can flatten all the data into a two-dimensional structure provided we associate columns with deep-enough nodes, such that all columns correspond to a single scalar string node in the input list.
Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() I defined earlier.
##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
nodes, ## current level node selection
..., ## additional arguments to data.frame()
keyList=list(), ## current key path under main list
sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
tlList <- lapply(nodes,tl);
typeList <- do.call(c,lapply(tlList,`[[`,'type'));
if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
type <- typeList[1L];
if (type == 'namedlist') { ## hash; recurse
allKeys <- unique(do.call(c,lapply(nodes,names)));
ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
} else if (type == 'list') { ## array; recurse
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
allIndexes <- seq_len(maxLen);
ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
} else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
lenList <- do.call(c,lapply(tlList,`[[`,'len'));
maxLen <- max(lenList,na.rm=T);
if (is.null(sep)) {
ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
} else {
## keep original type if maxLen is 1, IOW don't stringify
ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
}; ## end if
} else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where the value was missing, and then transforming to a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList to the string column name. Multiple columns are returned as a list of data.frames from each recursive call and likewise from the top-level call.
It also validates that there are no type-inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, so this validation step is appropriate.
flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.
An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.
I've also added a sep parameter which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements to a single string, and only one column is generated for the node.
In terms of performance, it's very fast. Here's a demo:
## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
## user system elapsed
## 2.265 0.000 2.268
Result:
class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256 126
## [1] "npi" "type" "facility_name" "facility_type.1" "facility_type.2" "facility_type.3" "addresses.1.address" "addresses.1.city" "addresses.1.state"
## [10] "addresses.1.zip" "addresses.1.phone" "addresses.1.address_2" "addresses.2.address" "addresses.2.city" "addresses.2.state" "addresses.2.zip" "addresses.2.phone" "addresses.2.address_2"
## [19] "addresses.3.address" "addresses.3.city" "addresses.3.state" "addresses.3.zip" "addresses.3.phone" "addresses.3.address_2" "addresses.4.address" "addresses.4.city" "addresses.4.state"
## [28] "addresses.4.zip" "addresses.4.phone" "addresses.4.address_2" "addresses.5.address" "addresses.5.address_2" "addresses.5.city" "addresses.5.state" "addresses.5.zip" "addresses.5.phone"
## [37] "addresses.6.address" "addresses.6.address_2" "addresses.6.city" "addresses.6.state" "addresses.6.zip" "addresses.6.phone" "addresses.7.address" "addresses.7.address_2" "addresses.7.city"
## [46] "addresses.7.state" "addresses.7.zip" "addresses.7.phone" "addresses.8.address" "addresses.8.address_2" "addresses.8.city" "addresses.8.state" "addresses.8.zip" "addresses.8.phone"
## [55] "addresses.9.address" "addresses.9.address_2" "addresses.9.city" "addresses.9.state" "addresses.9.zip" "addresses.9.phone" "addresses.10.address" "addresses.10.address_2" "addresses.10.city"
## [64] "addresses.10.state" "addresses.10.zip" "addresses.10.phone" "addresses.11.address" "addresses.11.address_2" "addresses.11.city" "addresses.11.state" "addresses.11.zip" "addresses.11.phone"
## [73] "addresses.12.address" "addresses.12.address_2" "addresses.12.city" "addresses.12.state" "addresses.12.zip" "addresses.12.phone" "addresses.13.address" "addresses.13.city" "addresses.13.state"
## [82] "addresses.13.zip" "addresses.13.phone" "plans.1.plan_id_type" "plans.1.plan_id" "plans.1.network_tier" "plans.2.plan_id_type" "plans.2.plan_id" "plans.2.network_tier" "plans.3.plan_id_type"
## [91] "plans.3.plan_id" "plans.3.network_tier" "plans.4.plan_id_type" "plans.4.plan_id" "plans.4.network_tier" "plans.5.plan_id_type" "plans.5.plan_id" "plans.5.network_tier" "plans.6.plan_id_type"
## [100] "plans.6.plan_id" "plans.6.network_tier" "plans.7.plan_id_type" "plans.7.plan_id" "plans.7.network_tier" "plans.8.plan_id_type" "plans.8.plan_id" "plans.8.network_tier" "plans.9.plan_id_type"
## [109] "plans.9.plan_id" "plans.9.network_tier" "last_updated_on" "name.first" "name.middle" "name.last" "speciality.1" "speciality.2" "speciality.3"
## [118] "speciality.4" "accepting" "languages.1" "languages.2" "languages.3" "languages.4" "languages.5" "languages.6" "gender"
The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:
rowToFrame(df[1L,]);
## column value
## 1 npi 1063645026
## 2 type FACILITY
## 3 facility_name EXPRESS SCRIPTS
## 4 facility_type.1 Pharmacies
## 5 facility_type.2 <NA>
## 6 facility_type.3 <NA>
## 7 addresses.1.address 4750 E 450 S
## 8 addresses.1.city WHITESTOWN
## 9 addresses.1.state IN
## 10 addresses.1.zip 46075
## 11 addresses.1.phone 2012695236
## 12 addresses.1.address_2 <NA>
## 13 addresses.2.address <NA>
## 14 addresses.2.city <NA>
## 15 addresses.2.state <NA>
## 16 addresses.2.zip <NA>
## 17 addresses.2.phone <NA>
## 18 addresses.2.address_2 <NA>
## 19 addresses.3.address <NA>
## 20 addresses.3.city <NA>
## 21 addresses.3.state <NA>
##
## ... snip ...
##
## 77 addresses.12.zip <NA>
## 78 addresses.12.phone <NA>
## 79 addresses.13.address <NA>
## 80 addresses.13.city <NA>
## 81 addresses.13.state <NA>
## 82 addresses.13.zip <NA>
## 83 addresses.13.phone <NA>
## 84 plans.1.plan_id_type HIOS-PLAN-ID
## 85 plans.1.plan_id 38344AK0620003
## 86 plans.1.network_tier HERITAGE-PLUS
## 87 plans.2.plan_id_type HIOS-PLAN-ID
## 88 plans.2.plan_id 38344AK0620004
## 89 plans.2.network_tier HERITAGE-PLUS
## 90 plans.3.plan_id_type HIOS-PLAN-ID
## 91 plans.3.plan_id 38344AK0620006
## 92 plans.3.network_tier HERITAGE-PLUS
## 93 plans.4.plan_id_type HIOS-PLAN-ID
## 94 plans.4.plan_id 38344AK0620008
## 95 plans.4.network_tier HERITAGE-PLUS
## 96 plans.5.plan_id_type HIOS-PLAN-ID
## 97 plans.5.plan_id 38344AK0570001
## 98 plans.5.network_tier HERITAGE-PLUS
## 99 plans.6.plan_id_type HIOS-PLAN-ID
## 100 plans.6.plan_id 38344AK0570002
## 101 plans.6.network_tier HERITAGE-PLUS
## 102 plans.7.plan_id_type HIOS-PLAN-ID
## 103 plans.7.plan_id 38344AK0980003
## 104 plans.7.network_tier HERITAGE-PLUS
## 105 plans.8.plan_id_type HIOS-PLAN-ID
## 106 plans.8.plan_id 38344AK0980006
## 107 plans.8.network_tier HERITAGE-PLUS
## 108 plans.9.plan_id_type HIOS-PLAN-ID
## 109 plans.9.plan_id 38344AK0980012
## 110 plans.9.network_tier HERITAGE-PLUS
## 111 last_updated_on 2015-10-14
## 112 name.first <NA>
## 113 name.middle <NA>
## 114 name.last <NA>
## 115 speciality.1 <NA>
## 116 speciality.2 <NA>
## 117 speciality.3 <NA>
## 118 speciality.4 <NA>
## 119 accepting <NA>
## 120 languages.1 <NA>
## 121 languages.2 <NA>
## 122 languages.3 <NA>
## 123 languages.4 <NA>
## 124 languages.5 <NA>
## 125 languages.6 <NA>
## 126 gender <NA>
I've tested the result pretty thoroughly by doing many spot-checks on individual records, and it all looks correct. Let me know if you have any questions.
This answer is rather a data organization suggestion (and is much shorter than the bounty-attracting answers around;)
If you want to keep the semantics of the fields, like keep all plan_ids in a single column, you can normalize your data design a bit, and do joins afterwards, if you need the information together:
library(dplyr)
# notice the simplifyVector=F
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F)
# pick and repeat fields for each element of array
# {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
# gives data.frame
# field1, field2 array.af1 array.af2
# val val2 av1 av2
# val val2 av3 av4
denormalize <- function(data, fields, array) {
data.frame(
c(
data[fields],
as.list(
bind_rows(
lapply(data[[array]], data.frame)))))
}
plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]),
c('npi', 'type', 'last_updated_on')))
Then you can first filter on the data and join in other information afterwards:
addresses_df %>%
filter(city == "Healy") %>%
left_join(plans_df, by="npi") ->
plans_in_healy
So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.
First, I had to understand your data set. It appears to be information about health providers.
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE )
types = sapply(providers,"[[","type")
table(types)
# FACILITY INDIVIDUAL
# 279 2977
FACILITY entries have the "ID" fields facility_name and facility_type.
INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
All entries have "ID" fields npi and last_updated_on.
All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.
Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tac on a similarly denormalized data frame with "ID" information using another merge.
library(dplyr)
unfurl_npi_data = function (x) {
repeat_cols = c("plans","addresses")
id_cols = setdiff(names(x),repeat_cols)
repeat_data = x[repeat_cols]
id_data = x[id_cols]
# Denormalized ID data
id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
atomic_colnames = names(which(!sapply(id_data, is.list)))
df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
colnames(id_data_df)[df_atomic_cols] = atomic_colnames
# Join the plans and addresses (denormalized)
repeated_data = lapply(repeat_data, rbind_all)
repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
merge(id_data_df, repeated_data_crossed)
}
providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))
Then do some cleanup.
unique_df = function(x) {
chr_col_names = names(which(sapply(x, class) == "character"))
for( col in chr_col_names )
x[[col]] = toupper(x[[col]])
unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)
And now you can ask some interesting questions. For example, how many addresses does each health care provider have?
unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
num_addresses = unique_providers %>% count(first, middle, last, gender)
table(num_addresses$n)
# 1 2 3 4 5 6 7 8 9 12 13
# 2258 492 119 33 43 21 6 1 2 1 1
At addresses with more than five people, what is the percent of male healthcare providers?
address_pcts = unique_providers %>%
group_by(address, city, state) %>%
filter(n()>5) %>%
arrange(address) %>%
summarise(pct_male = sum(gender=="MALE")/n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
And on and on...

Add column to data.frame in R with look-up table [duplicate]

Given two data frames:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
How can I do database style, i.e., sql style, joins? That is, how do I get:
An inner join of df1 and df2:
Return only the rows in which the left table have matching keys in the right table.
An outer join of df1 and df2:
Returns all rows from both tables, join records from the left which have matching keys in the right table.
A left outer join (or simply left join) of df1 and df2
Return all rows from the left table, and any rows with matching keys from the right table.
A right outer join of df1 and df2
Return all rows from the right table, and any rows with matching keys from the left table.
Extra credit:
How can I do a SQL style select statement?
By using the merge function and its optional parameters:
Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.
library(sqldf)
## inner join
df3 <- sqldf("SELECT CustomerId, Product, State
FROM df1
JOIN df2 USING(CustomerID)")
## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State
FROM df1
LEFT JOIN df2 USING(CustomerID)")
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
See Gabor's sqldf GitHub for more information on joins.
You can do joins as well using Hadley Wickham's awesome dplyr package.
library(dplyr)
#make sure that CustomerId cols are both the same type
#they aren’t in the provided data (one is integer and one is double)
df1$CustomerId <- as.double(df1$CustomerId)
Mutating joins: add columns to df1 using matches in df2
#inner
inner_join(df1, df2)
#left outer
left_join(df1, df2)
#right outer
right_join(df1, df2)
#alternate right outer
left_join(df2, df1)
#full join
full_join(df1, df2)
Filtering joins: filter out rows in df1, don't modify columns
#keep only observations in df1 that match in df2.
semi_join(df1, df2)
#drop all observations in df1 that match in df2.
anti_join(df1, df2)
There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
library(data.table)
dt1 <- data.table(df1, key = "CustomerId")
dt2 <- data.table(df2, key = "CustomerId")
joined.dt1.dt.2 <- dt1[dt2]
merge also works on data.tables (as it is generic and calls merge.data.table)
merge(dt1, dt2)
data.table documented on stackoverflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?
Yet another option is the join function found in the plyr package. [Note from 2022: plyr is now retired and has been superseded by dplyr. Join operations in dplyr are described in this answer.]
library(plyr)
join(df1, df2,
type = "inner")
# CustomerId Product State
# 1 2 Toaster Alabama
# 2 4 Radio Alabama
# 3 6 Radio Ohio
Options for type: inner, left, right, full.
From ?join: Unlike merge, [join] preserves the order of x no matter what join type is used.
There are some good examples of doing this over at the R Wiki. I'll steal a couple here:
Merge Method
Since your keys are named the same the short way to do an inner join is merge():
merge(df1, df2)
a full inner join (all records from both tables) can be created with the "all" keyword:
merge(df1, df2, all=TRUE)
a left outer join of df1 and df2:
merge(df1, df2, all.x=TRUE)
a right outer join of df1 and df2:
merge(df1, df2, all.y=TRUE)
you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)
Subscript Method
A left outer join with df1 on the left using a subscript method would be:
df1[,"State"]<-df2[df1[ ,"Product"], "State"]
The other combination of outer joins can be created by mungling the left outer join subscript example. (yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")
Update on data.table methods for joining datasets. See below examples for each type of join. There are two methods, one from [.data.table when passing second data.table as the first argument to subset, another way is to use merge function which dispatches to fast data.table method.
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2L, 4L, 7L), State = c(rep("Alabama", 2), rep("Ohio", 1))) # one value changed to show full outer join
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, CustomerId)
setkey(dt2, CustomerId)
# right outer join keyed data.tables
dt1[dt2]
setkey(dt1, NULL)
setkey(dt2, NULL)
# right outer join unkeyed data.tables - use `on` argument
dt1[dt2, on = "CustomerId"]
# left outer join - swap dt1 with dt2
dt2[dt1, on = "CustomerId"]
# inner join - use `nomatch` argument
dt1[dt2, nomatch=NULL, on = "CustomerId"]
# anti join - use `!` operator
dt1[!dt2, on = "CustomerId"]
# inner join - using merge method
merge(dt1, dt2, by = "CustomerId")
# full outer join
merge(dt1, dt2, by = "CustomerId", all = TRUE)
# see ?merge.data.table arguments for other cases
Below benchmark tests base R, sqldf, dplyr and data.table.
Benchmark tests unkeyed/unindexed datasets.
Benchmark is performed on 50M-1 rows datasets, there are 50M-2 common values on join column so each scenario (inner, left, right, full) can be tested and join is still not trivial to perform. It is type of join which well stress join algorithms. Timings are as of sqldf:0.4.11, dplyr:0.7.8, data.table:1.12.0.
# inner
Unit: seconds
expr min lq mean median uq max neval
base 111.66266 111.66266 111.66266 111.66266 111.66266 111.66266 1
sqldf 624.88388 624.88388 624.88388 624.88388 624.88388 624.88388 1
dplyr 51.91233 51.91233 51.91233 51.91233 51.91233 51.91233 1
DT 10.40552 10.40552 10.40552 10.40552 10.40552 10.40552 1
# left
Unit: seconds
expr min lq mean median uq max
base 142.782030 142.782030 142.782030 142.782030 142.782030 142.782030
sqldf 613.917109 613.917109 613.917109 613.917109 613.917109 613.917109
dplyr 49.711912 49.711912 49.711912 49.711912 49.711912 49.711912
DT 9.674348 9.674348 9.674348 9.674348 9.674348 9.674348
# right
Unit: seconds
expr min lq mean median uq max
base 122.366301 122.366301 122.366301 122.366301 122.366301 122.366301
sqldf 611.119157 611.119157 611.119157 611.119157 611.119157 611.119157
dplyr 50.384841 50.384841 50.384841 50.384841 50.384841 50.384841
DT 9.899145 9.899145 9.899145 9.899145 9.899145 9.899145
# full
Unit: seconds
expr min lq mean median uq max neval
base 141.79464 141.79464 141.79464 141.79464 141.79464 141.79464 1
dplyr 94.66436 94.66436 94.66436 94.66436 94.66436 94.66436 1
DT 21.62573 21.62573 21.62573 21.62573 21.62573 21.62573 1
Be aware there are other types of joins you can perform using data.table:
- update on join - if you want to lookup values from another table to your main table
- aggregate on join - if you want to aggregate on key you are joining you do not have to materialize all join results
- overlapping join - if you want to merge by ranges
- rolling join - if you want merge to be able to match to values from preceeding/following rows by rolling them forward or backward
- non-equi join - if your join condition is non-equal
Code to reproduce:
library(microbenchmark)
library(sqldf)
library(dplyr)
library(data.table)
sapply(c("sqldf","dplyr","data.table"), packageVersion, simplify=FALSE)
n = 5e7
set.seed(108)
df1 = data.frame(x=sample(n,n-1L), y1=rnorm(n-1L))
df2 = data.frame(x=sample(n,n-1L), y2=rnorm(n-1L))
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
mb = list()
# inner join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x"),
sqldf = sqldf("SELECT * FROM df1 INNER JOIN df2 ON df1.x = df2.x"),
dplyr = inner_join(df1, df2, by = "x"),
DT = dt1[dt2, nomatch=NULL, on = "x"]) -> mb$inner
# left outer join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x", all.x = TRUE),
sqldf = sqldf("SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.x = df2.x"),
dplyr = left_join(df1, df2, by = c("x"="x")),
DT = dt2[dt1, on = "x"]) -> mb$left
# right outer join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x", all.y = TRUE),
sqldf = sqldf("SELECT * FROM df2 LEFT OUTER JOIN df1 ON df2.x = df1.x"),
dplyr = right_join(df1, df2, by = "x"),
DT = dt1[dt2, on = "x"]) -> mb$right
# full outer join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x", all = TRUE),
dplyr = full_join(df1, df2, by = "x"),
DT = merge(dt1, dt2, by = "x", all = TRUE)) -> mb$full
lapply(mb, print) -> nul
New in 2014:
Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing etc.), you should definitely take a look at dplyr, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain other database types. It even offers quite an elaborate SQL interface, and even a function to convert (most) SQL code directly into R.
The four joining-related functions in the dplyr package are (to quote):
inner_join(x, y, by = NULL, copy = FALSE, ...): return all rows from
x where there are matching values in y, and all columns from x and y
left_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x, and all columns from x and y
semi_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in
y, keeping just columns from x.
anti_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x
where there are not matching values in y, keeping just columns from x
It's all here in great detail.
Selecting columns can be done by select(df,"column"). If that's not SQL-ish enough for you, then there's the sql() function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly, sql("SELECT * FROM hflights") will select all the columns from the "hflights" dplyr table (a "tbl").
dplyr since 0.4 implemented all those joins including outer_join, but it was worth noting that for the first few releases prior to 0.4 it used not to offer outer_join, and as a result there was a lot of really bad hacky workaround user code floating around for quite a while afterwards (you can still find such code in SO, Kaggle answers, github from that period. Hence this answer still serves a useful purpose.)
Join-related release highlights:
v0.5 (6/2016)
Handling for POSIXct type, timezones, duplicates, different factor levels. Better errors and warnings.
New suffix argument to control what suffix duplicated variable names receive (#1296)
v0.4.0 (1/2015)
Implement right join and outer join (#96)
Mutating joins, which add new variables to one table from matching rows in another. Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.
v0.3 (10/2014)
Can now left_join by different variables in each table: df1 %>% left_join(df2, c("var1" = "var2"))
v0.2 (5/2014)
*_join() no longer reorders column names (#324)
v0.1.3 (4/2014)
has inner_join, left_join, semi_join, anti_join
outer_join not implemented yet, fallback is use base::merge() (or plyr::join())
didn't yet implement right_join and outer_join
Hadley mentioning other advantages here
one minor feature merge currently has that dplyr doesn't is the ability to have separate by.x,by.y columns as e.g. Python pandas does.
Workarounds per hadley's comments in that issue:
right_join(x,y) is the same as left_join(y,x) in terms of the rows, just the columns will be different orders. Easily worked around with select(new_column_order)
outer_join is basically union(left_join(x, y), right_join(x, y)) - i.e. preserve all rows in both data frames.
For the case of a left join with a 0..*:0..1 cardinality or a right join with a 0..1:0..* cardinality it is possible to assign in-place the unilateral columns from the joiner (the 0..1 table) directly onto the joinee (the 0..* table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing+ordering the joiner's rows accordingly for the assignment.
If the key is a single column, then we can use a single call to match() to do the matching. This is the case I'll cover in this answer.
Here's an example based on the OP, except I've added an extra row to df2 with an id of 7 to test the case of a non-matching key in the joiner. This is effectively df1 left join df2:
df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L)));
df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas'));
df1[names(df2)[-1L]] <- df2[match(df1[,1L],df2[,1L]),-1L];
df1;
## CustomerId Product State
## 1 1 Toaster <NA>
## 2 2 Toaster Alabama
## 3 3 Toaster <NA>
## 4 4 Radio Alabama
## 5 5 Radio <NA>
## 6 6 Radio Ohio
In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.
I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.
On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. Although in that case you could simply make a copy and perform the in-place assignment(s) on the copy.
As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:
inefficient concatenations. e.g. match(interaction(df1$a,df1$b),interaction(df2$a,df2$b)), or the same idea with paste().
inefficient cartesian conjunctions, e.g. outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`).
base R merge() and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.
For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.
Benchmarking
I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.
Testing code:
library(microbenchmark);
library(data.table);
library(sqldf);
library(plyr);
library(dplyr);
solSpecs <- list(
merge=list(testFuncs=list(
inner=function(df1,df2,key) merge(df1,df2,key),
left =function(df1,df2,key) merge(df1,df2,key,all.x=T),
right=function(df1,df2,key) merge(df1,df2,key,all.y=T),
full =function(df1,df2,key) merge(df1,df2,key,all=T)
)),
data.table.unkeyed=list(argSpec='data.table.unkeyed',testFuncs=list(
inner=function(dt1,dt2,key) dt1[dt2,on=key,nomatch=0L,allow.cartesian=T],
left =function(dt1,dt2,key) dt2[dt1,on=key,allow.cartesian=T],
right=function(dt1,dt2,key) dt1[dt2,on=key,allow.cartesian=T],
full =function(dt1,dt2,key) merge(dt1,dt2,key,all=T,allow.cartesian=T) ## calls merge.data.table()
)),
data.table.keyed=list(argSpec='data.table.keyed',testFuncs=list(
inner=function(dt1,dt2) dt1[dt2,nomatch=0L,allow.cartesian=T],
left =function(dt1,dt2) dt2[dt1,allow.cartesian=T],
right=function(dt1,dt2) dt1[dt2,allow.cartesian=T],
full =function(dt1,dt2) merge(dt1,dt2,all=T,allow.cartesian=T) ## calls merge.data.table()
)),
sqldf.unindexed=list(testFuncs=list( ## note: must pass connection=NULL to avoid running against the live DB connection, which would result in collisions with the residual tables from the last query upload
inner=function(df1,df2,key) sqldf(paste0('select * from df1 inner join df2 using(',paste(collapse=',',key),')'),connection=NULL),
left =function(df1,df2,key) sqldf(paste0('select * from df1 left join df2 using(',paste(collapse=',',key),')'),connection=NULL),
right=function(df1,df2,key) sqldf(paste0('select * from df2 left join df1 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do right join proper, not yet supported; inverted left join is equivalent
##full =function(df1,df2,key) sqldf(paste0('select * from df1 full join df2 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
)),
sqldf.indexed=list(testFuncs=list( ## important: requires an active DB connection with preindexed main.df1 and main.df2 ready to go; arguments are actually ignored
inner=function(df1,df2,key) sqldf(paste0('select * from main.df1 inner join main.df2 using(',paste(collapse=',',key),')')),
left =function(df1,df2,key) sqldf(paste0('select * from main.df1 left join main.df2 using(',paste(collapse=',',key),')')),
right=function(df1,df2,key) sqldf(paste0('select * from main.df2 left join main.df1 using(',paste(collapse=',',key),')')) ## can't do right join proper, not yet supported; inverted left join is equivalent
##full =function(df1,df2,key) sqldf(paste0('select * from main.df1 full join main.df2 using(',paste(collapse=',',key),')')) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
)),
plyr=list(testFuncs=list(
inner=function(df1,df2,key) join(df1,df2,key,'inner'),
left =function(df1,df2,key) join(df1,df2,key,'left'),
right=function(df1,df2,key) join(df1,df2,key,'right'),
full =function(df1,df2,key) join(df1,df2,key,'full')
)),
dplyr=list(testFuncs=list(
inner=function(df1,df2,key) inner_join(df1,df2,key),
left =function(df1,df2,key) left_join(df1,df2,key),
right=function(df1,df2,key) right_join(df1,df2,key),
full =function(df1,df2,key) full_join(df1,df2,key)
)),
in.place=list(testFuncs=list(
left =function(df1,df2,key) { cns <- setdiff(names(df2),key); df1[cns] <- df2[match(df1[,key],df2[,key]),cns]; df1; },
right=function(df1,df2,key) { cns <- setdiff(names(df1),key); df2[cns] <- df1[match(df2[,key],df1[,key]),cns]; df2; }
))
);
getSolTypes <- function() names(solSpecs);
getJoinTypes <- function() unique(unlist(lapply(solSpecs,function(x) names(x$testFuncs))));
getArgSpec <- function(argSpecs,key=NULL) if (is.null(key)) argSpecs$default else argSpecs[[key]];
initSqldf <- function() {
sqldf(); ## creates sqlite connection on first run, cleans up and closes existing connection otherwise
if (exists('sqldfInitFlag',envir=globalenv(),inherits=F) && sqldfInitFlag) { ## false only on first run
sqldf(); ## creates a new connection
} else {
assign('sqldfInitFlag',T,envir=globalenv()); ## set to true for the one and only time
}; ## end if
invisible();
}; ## end initSqldf()
setUpBenchmarkCall <- function(argSpecs,joinType,solTypes=getSolTypes(),env=parent.frame()) {
## builds and returns a list of expressions suitable for passing to the list argument of microbenchmark(), and assigns variables to resolve symbol references in those expressions
callExpressions <- list();
nms <- character();
for (solType in solTypes) {
testFunc <- solSpecs[[solType]]$testFuncs[[joinType]];
if (is.null(testFunc)) next; ## this join type is not defined for this solution type
testFuncName <- paste0('tf.',solType);
assign(testFuncName,testFunc,envir=env);
argSpecKey <- solSpecs[[solType]]$argSpec;
argSpec <- getArgSpec(argSpecs,argSpecKey);
argList <- setNames(nm=names(argSpec$args),vector('list',length(argSpec$args)));
for (i in seq_along(argSpec$args)) {
argName <- paste0('tfa.',argSpecKey,i);
assign(argName,argSpec$args[[i]],envir=env);
argList[[i]] <- if (i%in%argSpec$copySpec) call('copy',as.symbol(argName)) else as.symbol(argName);
}; ## end for
callExpressions[[length(callExpressions)+1L]] <- do.call(call,c(list(testFuncName),argList),quote=T);
nms[length(nms)+1L] <- solType;
}; ## end for
names(callExpressions) <- nms;
callExpressions;
}; ## end setUpBenchmarkCall()
harmonize <- function(res) {
res <- as.data.frame(res); ## coerce to data.frame
for (ci in which(sapply(res,is.factor))) res[[ci]] <- as.character(res[[ci]]); ## coerce factor columns to character
for (ci in which(sapply(res,is.logical))) res[[ci]] <- as.integer(res[[ci]]); ## coerce logical columns to integer (works around sqldf quirk of munging logicals to integers)
##for (ci in which(sapply(res,inherits,'POSIXct'))) res[[ci]] <- as.double(res[[ci]]); ## coerce POSIXct columns to double (works around sqldf quirk of losing POSIXct class) ----- POSIXct doesn't work at all in sqldf.indexed
res <- res[order(names(res))]; ## order columns
res <- res[do.call(order,res),]; ## order rows
res;
}; ## end harmonize()
checkIdentical <- function(argSpecs,solTypes=getSolTypes()) {
for (joinType in getJoinTypes()) {
callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
if (length(callExpressions)<2L) next;
ex <- harmonize(eval(callExpressions[[1L]]));
for (i in seq(2L,len=length(callExpressions)-1L)) {
y <- harmonize(eval(callExpressions[[i]]));
if (!isTRUE(all.equal(ex,y,check.attributes=F))) {
ex <<- ex;
y <<- y;
solType <- names(callExpressions)[i];
stop(paste0('non-identical: ',solType,' ',joinType,'.'));
}; ## end if
}; ## end for
}; ## end for
invisible();
}; ## end checkIdentical()
testJoinType <- function(argSpecs,joinType,solTypes=getSolTypes(),metric=NULL,times=100L) {
callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
bm <- microbenchmark(list=callExpressions,times=times);
if (is.null(metric)) return(bm);
bm <- summary(bm);
res <- setNames(nm=names(callExpressions),bm[[metric]]);
attr(res,'unit') <- attr(bm,'unit');
res;
}; ## end testJoinType()
testAllJoinTypes <- function(argSpecs,solTypes=getSolTypes(),metric=NULL,times=100L) {
joinTypes <- getJoinTypes();
resList <- setNames(nm=joinTypes,lapply(joinTypes,function(joinType) testJoinType(argSpecs,joinType,solTypes,metric,times)));
if (is.null(metric)) return(resList);
units <- unname(unlist(lapply(resList,attr,'unit')));
res <- do.call(data.frame,c(list(join=joinTypes),setNames(nm=solTypes,rep(list(rep(NA_real_,length(joinTypes))),length(solTypes))),list(unit=units,stringsAsFactors=F)));
for (i in seq_along(resList)) res[i,match(names(resList[[i]]),names(res))] <- resList[[i]];
res;
}; ## end testAllJoinTypes()
testGrid <- function(makeArgSpecsFunc,sizes,overlaps,solTypes=getSolTypes(),joinTypes=getJoinTypes(),metric='median',times=100L) {
res <- expand.grid(size=sizes,overlap=overlaps,joinType=joinTypes,stringsAsFactors=F);
res[solTypes] <- NA_real_;
res$unit <- NA_character_;
for (ri in seq_len(nrow(res))) {
size <- res$size[ri];
overlap <- res$overlap[ri];
joinType <- res$joinType[ri];
argSpecs <- makeArgSpecsFunc(size,overlap);
checkIdentical(argSpecs,solTypes);
cur <- testJoinType(argSpecs,joinType,solTypes,metric,times);
res[ri,match(names(cur),names(res))] <- cur;
res$unit[ri] <- attr(cur,'unit');
}; ## end for
res;
}; ## end testGrid()
Here's a benchmark of the example based on the OP that I demonstrated earlier:
## OP's example, supplemented with a non-matching row in df2
argSpecs <- list(
default=list(copySpec=1:2,args=list(
df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L))),
df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas')),
'CustomerId'
)),
data.table.unkeyed=list(copySpec=1:2,args=list(
as.data.table(df1),
as.data.table(df2),
'CustomerId'
)),
data.table.keyed=list(copySpec=1:2,args=list(
setkey(as.data.table(df1),CustomerId),
setkey(as.data.table(df2),CustomerId)
))
);
## prepare sqldf
initSqldf();
sqldf('create index df1_key on df1(CustomerId);'); ## upload and create an sqlite index on df1
sqldf('create index df2_key on df2(CustomerId);'); ## upload and create an sqlite index on df2
checkIdentical(argSpecs);
testAllJoinTypes(argSpecs,metric='median');
## join merge data.table.unkeyed data.table.keyed sqldf.unindexed sqldf.indexed plyr dplyr in.place unit
## 1 inner 644.259 861.9345 923.516 9157.752 1580.390 959.2250 270.9190 NA microseconds
## 2 left 713.539 888.0205 910.045 8820.334 1529.714 968.4195 270.9185 224.3045 microseconds
## 3 right 1221.804 909.1900 923.944 8930.668 1533.135 1063.7860 269.8495 218.1035 microseconds
## 4 full 1302.203 3107.5380 3184.729 NA NA 1593.6475 270.7055 NA microseconds
Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses 0..1:0..1 cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.
makeArgSpecs.singleIntegerKey.optionalOneToOne <- function(size,overlap) {
com <- as.integer(size*overlap);
argSpecs <- list(
default=list(copySpec=1:2,args=list(
df1 <- data.frame(id=sample(size),y1=rnorm(size),y2=rnorm(size)),
df2 <- data.frame(id=sample(c(if (com>0L) sample(df1$id,com) else integer(),seq(size+1L,len=size-com))),y3=rnorm(size),y4=rnorm(size)),
'id'
)),
data.table.unkeyed=list(copySpec=1:2,args=list(
as.data.table(df1),
as.data.table(df2),
'id'
)),
data.table.keyed=list(copySpec=1:2,args=list(
setkey(as.data.table(df1),id),
setkey(as.data.table(df2),id)
))
);
## prepare sqldf
initSqldf();
sqldf('create index df1_key on df1(id);'); ## upload and create an sqlite index on df1
sqldf('create index df2_key on df2(id);'); ## upload and create an sqlite index on df2
argSpecs;
}; ## end makeArgSpecs.singleIntegerKey.optionalOneToOne()
## cross of various input sizes and key overlaps
sizes <- c(1e1L,1e3L,1e6L);
overlaps <- c(0.99,0.5,0.01);
system.time({ res <- testGrid(makeArgSpecs.singleIntegerKey.optionalOneToOne,sizes,overlaps); });
## user system elapsed
## 22024.65 12308.63 34493.19
I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.
I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.
plotRes <- function(res,titleFunc,useFloor=F) {
solTypes <- setdiff(names(res),c('size','overlap','joinType','unit')); ## derive from res
normMult <- c(microseconds=1e-3,milliseconds=1); ## normalize to milliseconds
joinTypes <- getJoinTypes();
cols <- c(merge='purple',data.table.unkeyed='blue',data.table.keyed='#00DDDD',sqldf.unindexed='brown',sqldf.indexed='orange',plyr='red',dplyr='#00BB00',in.place='magenta');
pchs <- list(inner=20L,left='<',right='>',full=23L);
cexs <- c(inner=0.7,left=1,right=1,full=0.7);
NP <- 60L;
ord <- order(decreasing=T,colMeans(res[res$size==max(res$size),solTypes],na.rm=T));
ymajors <- data.frame(y=c(1,1e3),label=c('1ms','1s'),stringsAsFactors=F);
for (overlap in unique(res$overlap)) {
x1 <- res[res$overlap==overlap,];
x1[solTypes] <- x1[solTypes]*normMult[x1$unit]; x1$unit <- NULL;
xlim <- c(1e1,max(x1$size));
xticks <- 10^seq(log10(xlim[1L]),log10(xlim[2L]));
ylim <- c(1e-1,10^((if (useFloor) floor else ceiling)(log10(max(x1[solTypes],na.rm=T))))); ## use floor() to zoom in a little more, only sqldf.unindexed will break above, but xpd=NA will keep it visible
yticks <- 10^seq(log10(ylim[1L]),log10(ylim[2L]));
yticks.minor <- rep(yticks[-length(yticks)],each=9L)*1:9;
plot(NA,xlim=xlim,ylim=ylim,xaxs='i',yaxs='i',axes=F,xlab='size (rows)',ylab='time (ms)',log='xy');
abline(v=xticks,col='lightgrey');
abline(h=yticks.minor,col='lightgrey',lty=3L);
abline(h=yticks,col='lightgrey');
axis(1L,xticks,parse(text=sprintf('10^%d',as.integer(log10(xticks)))));
axis(2L,yticks,parse(text=sprintf('10^%d',as.integer(log10(yticks)))),las=1L);
axis(4L,ymajors$y,ymajors$label,las=1L,tick=F,cex.axis=0.7,hadj=0.5);
for (joinType in rev(joinTypes)) { ## reverse to draw full first, since it's larger and would be more obtrusive if drawn last
x2 <- x1[x1$joinType==joinType,];
for (solType in solTypes) {
if (any(!is.na(x2[[solType]]))) {
xy <- spline(x2$size,x2[[solType]],xout=10^(seq(log10(x2$size[1L]),log10(x2$size[nrow(x2)]),len=NP)));
points(xy$x,xy$y,pch=pchs[[joinType]],col=cols[solType],cex=cexs[joinType],xpd=NA);
}; ## end if
}; ## end for
}; ## end for
## custom legend
## due to logarithmic skew, must do all distance calcs in inches, and convert to user coords afterward
## the bottom-left corner of the legend will be defined in normalized figure coords, although we can convert to inches immediately
leg.cex <- 0.7;
leg.x.in <- grconvertX(0.275,'nfc','in');
leg.y.in <- grconvertY(0.6,'nfc','in');
leg.x.user <- grconvertX(leg.x.in,'in');
leg.y.user <- grconvertY(leg.y.in,'in');
leg.outpad.w.in <- 0.1;
leg.outpad.h.in <- 0.1;
leg.midpad.w.in <- 0.1;
leg.midpad.h.in <- 0.1;
leg.sol.w.in <- max(strwidth(solTypes,'in',leg.cex));
leg.sol.h.in <- max(strheight(solTypes,'in',leg.cex))*1.5; ## multiplication factor for greater line height
leg.join.w.in <- max(strheight(joinTypes,'in',leg.cex))*1.5; ## ditto
leg.join.h.in <- max(strwidth(joinTypes,'in',leg.cex));
leg.main.w.in <- leg.join.w.in*length(joinTypes);
leg.main.h.in <- leg.sol.h.in*length(solTypes);
leg.x2.user <- grconvertX(leg.x.in+leg.outpad.w.in*2+leg.main.w.in+leg.midpad.w.in+leg.sol.w.in,'in');
leg.y2.user <- grconvertY(leg.y.in+leg.outpad.h.in*2+leg.main.h.in+leg.midpad.h.in+leg.join.h.in,'in');
leg.cols.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.join.w.in*(0.5+seq(0L,length(joinTypes)-1L)),'in');
leg.lines.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in-leg.sol.h.in*(0.5+seq(0L,length(solTypes)-1L)),'in');
leg.sol.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.main.w.in+leg.midpad.w.in,'in');
leg.join.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in+leg.midpad.h.in,'in');
rect(leg.x.user,leg.y.user,leg.x2.user,leg.y2.user,col='white');
text(leg.sol.x.user,leg.lines.y.user,solTypes[ord],cex=leg.cex,pos=4L,offset=0);
text(leg.cols.x.user,leg.join.y.user,joinTypes,cex=leg.cex,pos=4L,offset=0,srt=90); ## srt rotation applies *after* pos/offset positioning
for (i in seq_along(joinTypes)) {
joinType <- joinTypes[i];
points(rep(leg.cols.x.user[i],length(solTypes)),ifelse(colSums(!is.na(x1[x1$joinType==joinType,solTypes[ord]]))==0L,NA,leg.lines.y.user),pch=pchs[[joinType]],col=cols[solTypes[ord]]);
}; ## end for
title(titleFunc(overlap));
readline(sprintf('overlap %.02f',overlap));
}; ## end for
}; ## end plotRes()
titleFunc <- function(overlap) sprintf('R merge solutions: single-column integer key, 0..1:0..1 cardinality, %d%% overlap',as.integer(overlap*100));
plotRes(res,titleFunc,T);
Here's a second large-scale benchmark that's more heavy-duty, with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is, 0..*:0..*). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with the sqldf.indexed solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)
makeArgSpecs.assortedKey.optionalManyToMany <- function(size,overlap,uniquePct=75) {
## number of unique keys in df1
u1Size <- as.integer(size*uniquePct/100);
## (roughly) divide u1Size into bases, so we can use expand.grid() to produce the required number of unique key values with repetitions within individual key columns
## use ceiling() to ensure we cover u1Size; will truncate afterward
u1SizePerKeyColumn <- as.integer(ceiling(u1Size^(1/3)));
## generate the unique key values for df1
keys1 <- expand.grid(stringsAsFactors=F,
idCharacter=replicate(u1SizePerKeyColumn,paste(collapse='',sample(letters,sample(4:12,1L),T))),
idInteger=sample(u1SizePerKeyColumn),
idLogical=sample(c(F,T),u1SizePerKeyColumn,T)
##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+sample(u1SizePerKeyColumn)
)[seq_len(u1Size),];
## rbind some repetitions of the unique keys; this will prepare one side of the many-to-many relationship
## also scramble the order afterward
keys1 <- rbind(keys1,keys1[sample(nrow(keys1),size-u1Size,T),])[sample(size),];
## common and unilateral key counts
com <- as.integer(size*overlap);
uni <- size-com;
## generate some unilateral keys for df2 by synthesizing outside of the idInteger range of df1
keys2 <- data.frame(stringsAsFactors=F,
idCharacter=replicate(uni,paste(collapse='',sample(letters,sample(4:12,1L),T))),
idInteger=u1SizePerKeyColumn+sample(uni),
idLogical=sample(c(F,T),uni,T)
##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+u1SizePerKeyColumn+sample(uni)
);
## rbind random keys from df1; this will complete the many-to-many relationship
## also scramble the order afterward
keys2 <- rbind(keys2,keys1[sample(nrow(keys1),com,T),])[sample(size),];
##keyNames <- c('idCharacter','idInteger','idLogical','idPOSIXct');
keyNames <- c('idCharacter','idInteger','idLogical');
## note: was going to use raw and complex type for two of the non-key columns, but data.table doesn't seem to fully support them
argSpecs <- list(
default=list(copySpec=1:2,args=list(
df1 <- cbind(stringsAsFactors=F,keys1,y1=sample(c(F,T),size,T),y2=sample(size),y3=rnorm(size),y4=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
df2 <- cbind(stringsAsFactors=F,keys2,y5=sample(c(F,T),size,T),y6=sample(size),y7=rnorm(size),y8=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
keyNames
)),
data.table.unkeyed=list(copySpec=1:2,args=list(
as.data.table(df1),
as.data.table(df2),
keyNames
)),
data.table.keyed=list(copySpec=1:2,args=list(
setkeyv(as.data.table(df1),keyNames),
setkeyv(as.data.table(df2),keyNames)
))
);
## prepare sqldf
initSqldf();
sqldf(paste0('create index df1_key on df1(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df1
sqldf(paste0('create index df2_key on df2(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df2
argSpecs;
}; ## end makeArgSpecs.assortedKey.optionalManyToMany()
sizes <- c(1e1L,1e3L,1e5L); ## 1e5L instead of 1e6L to respect more heavy-duty inputs
overlaps <- c(0.99,0.5,0.01);
solTypes <- setdiff(getSolTypes(),'in.place');
system.time({ res <- testGrid(makeArgSpecs.assortedKey.optionalManyToMany,sizes,overlaps,solTypes); });
## user system elapsed
## 38895.50 784.19 39745.53
The resulting plots, using the same plotting code given above:
titleFunc <- function(overlap) sprintf('R merge solutions: character/integer/logical key, 0..*:0..* cardinality, %d%% overlap',as.integer(overlap*100));
plotRes(res,titleFunc,F);
In joining two data frames with ~1 million rows each, one with 2 columns and the other with ~20, I've surprisingly found merge(..., all.x = TRUE, all.y = TRUE) to be faster then dplyr::full_join(). This is with dplyr v0.4
Merge takes ~17 seconds, full_join takes ~65 seconds.
Some food for though, since I generally default to dplyr for manipulation tasks.
Using merge function we can select the variable of left table or right table, same way like we all familiar with select statement in SQL (EX : Select a.* ...or Select b.* from .....)
We have to add extra code which will subset from the newly joined table .
SQL :- select a.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :- merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]
Same way
SQL :- select b.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId
R :- merge(df1, df2, by.x = "CustomerId", by.y =
"CustomerId")[,names(df2)]
For an inner join on all columns, you could also use fintersect from the data.table-package or intersect from the dplyr-package as an alternative to merge without specifying the by-columns. This will give the rows that are equal between two dataframes:
merge(df1, df2)
# V1 V2
# 1 B 2
# 2 C 3
dplyr::intersect(df1, df2)
# V1 V2
# 1 B 2
# 2 C 3
data.table::fintersect(setDT(df1), setDT(df2))
# V1 V2
# 1: B 2
# 2: C 3
Example data:
df1 <- data.frame(V1 = LETTERS[1:4], V2 = 1:4)
df2 <- data.frame(V1 = LETTERS[2:3], V2 = 2:3)
Update join. One other important SQL-style join is an "update join" where columns in one table are updated (or created) using another table.
Modifying the OP's example tables...
sales = data.frame(
CustomerId = c(1, 1, 1, 3, 4, 6),
Year = 2000:2005,
Product = c(rep("Toaster", 3), rep("Radio", 3))
)
cust = data.frame(
CustomerId = c(1, 1, 4, 6),
Year = c(2001L, 2002L, 2002L, 2002L),
State = state.name[1:4]
)
sales
# CustomerId Year Product
# 1 2000 Toaster
# 1 2001 Toaster
# 1 2002 Toaster
# 3 2003 Radio
# 4 2004 Radio
# 6 2005 Radio
cust
# CustomerId Year State
# 1 2001 Alabama
# 1 2002 Alaska
# 4 2002 Arizona
# 6 2002 Arkansas
Suppose we want to add the customer's state from cust to the purchases table, sales, ignoring the year column. With base R, we can identify matching rows and then copy values over:
sales$State <- cust$State[ match(sales$CustomerId, cust$CustomerId) ]
# CustomerId Year Product State
# 1 2000 Toaster Alabama
# 1 2001 Toaster Alabama
# 1 2002 Toaster Alabama
# 3 2003 Radio <NA>
# 4 2004 Radio Arizona
# 6 2005 Radio Arkansas
# cleanup for the next example
sales$State <- NULL
As can be seen here, match selects the first matching row from the customer table.
Update join with multiple columns. The approach above works well when we are joining on only a single column and are satisfied with the first match. Suppose we want the year of measurement in the customer table to match the year of sale.
As #bgoldst's answer mentions, match with interaction might be an option for this case. More straightforwardly, one could use data.table:
library(data.table)
setDT(sales); setDT(cust)
sales[, State := cust[sales, on=.(CustomerId, Year), x.State]]
# CustomerId Year Product State
# 1: 1 2000 Toaster <NA>
# 2: 1 2001 Toaster Alabama
# 3: 1 2002 Toaster Alaska
# 4: 3 2003 Radio <NA>
# 5: 4 2004 Radio <NA>
# 6: 6 2005 Radio <NA>
# cleanup for next example
sales[, State := NULL]
Rolling update join. Alternately, we may want to take the last state the customer was found in:
sales[, State := cust[sales, on=.(CustomerId, Year), roll=TRUE, x.State]]
# CustomerId Year Product State
# 1: 1 2000 Toaster <NA>
# 2: 1 2001 Toaster Alabama
# 3: 1 2002 Toaster Alaska
# 4: 3 2003 Radio <NA>
# 5: 4 2004 Radio Arizona
# 6: 6 2005 Radio Arkansas
The three examples above all focus on creating/adding a new column. See the related R FAQ for an example of updating/modifying an existing column.

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z){
a <- subset(x, (x$var1 == y))
b <- subset(a, (a$var2 == z))
n=nrow(b)
return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar. Instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1)
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments like foobar(foo, df) and use y[, c(var1)] and y[, c(var2)] instead of the two parameters x and y it still doesn't work. Which way is there to do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
library(plyr)
match_df <- function(x, match) {
vars <- names(match)
# Create unique id for each row
x_id <- id(match[vars])
match_id <- id(x[vars])
# Match identifiers and return subsetted data frame
x[match(x_id, match_id, nomatch = 0), ]
}
match_df(foo, df)
# var1 var2
# 1 1 A
# 3 1 B
# 5 2 C
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at trying to figure out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that the use of c() is not needed here and you can reference the columns you want by placing the name of the column in quotes OR reference the column by number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column names var1 which has a length of three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your dataframe to grab a single value, process it, and then go to the next row in the data.frame. That is what the apply function does. It is equivalent to saying something along the lines of for (i in 1: length(nrow(df)) but is a more idiomatic way of handling such issues.
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treate these as numeric in my opinion. Compare:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
.dofun <- function(y, z){
a <- subset(x, x$var1==y)
b <- subset(a, a$var2==z)
n <- nrow(b)
return (n)
}
ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
names(ans) <- NULL
return(ans)
}