I have a large data file full of key-value pairs. The key is an ID and the value is a huge JSON object. I have been trying to convert this data file to a data frame in R by importing the data as a two-column table and then converting the value column to a data frame.
I keep getting this error, even after I validated my JSON:
Error: lexical error: invalid string in json text.
[{ f: { SEQNUM: [ 455043, 455044,
(right here) ------^
Below is my code:
part00013 <- read.table(PatientData, sep = '\t', header = F, as.is = T)
colnames(part00013) <- c('k','v')
make_indexDateLists <- function(x) {
# x['v'] <- lapply(x['v'], function(y) as.character(y))
# x['v'] <- lapply(x['v'], function(y) substr(y,1, nchar(y)-1 ))
# x['v'] <- lapply(x['v'], function(y) substr(y,2,nchar(y)))
x["v"] <- lapply(as.character(x["v"]), function(y) jsonlite::fromJSON(y,simplifyVector = T))
#do assignpatienttocohorts
x["v"] <- lapply(x["v"], function(y) RJSONIO::toJSON(y))
cbind(x$k, x$v)
}
make_indexDateLists(part00013)
And here is a sample file: https://drive.google.com/open?id=0B6hKduYaYwdJQ3BwbUpNSW9EZk0
It's invalid JSON, but you can turn it into valid JSON:
library(stringi)
library(jsonlite)
library(tidyverse)
tmp <- readLines("oneline_part00013")
parts <- stri_split_fixed(tmp, "\t", 2)[[1]]
fromJSON(parts[2], flatten = FALSE) %>%
glimpse()
## Observations: 1
## Variables: 7
## $ f <data.frame> 455043, 455044, 455045, 455046, 455047, 455048, 45504...
## $ s <data.frame> 246549, 246550, 246551, 246552, 246553, 246554, 24655...
## $ i <data.frame> 8224, 8788, 770102, 30, 10, 30, 3301, 3301, 3301, 192...
## $ d <data.frame> 1114386, 1114387, 1114388, 1114389, 1114390, 1114391,...
## $ o <data.frame> 162072527, 162072528, 162072529, 162072530, 162072531...
## $ t <data.frame> 408352, 408353, 408354, 408355, 408356, 408357, 40835...
## $ a <data.frame> 36527, 42259, 35562, 42458, 39119, 30, 10, 30, 20, 30...
flatten = TRUE will un-nest all the data.frame columns (you'll end up with over 450 columns that way)
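For a file with many such lines, the same split-then-parse idea can be applied line by line. The following is only a sketch on two made-up lines (the real file's keys and fields will differ). It uses base R to split each line at the first tab and jsonlite::fromJSON() to parse each value separately, since fromJSON() expects a single JSON document per call.

```r
library(jsonlite)

# Two invented lines in the "key<TAB>json" layout described in the question
lines <- c('id1\t{"f":{"SEQNUM":[1,2]},"s":{"SEQNUM":[3]}}',
           'id2\t{"f":{"SEQNUM":[4]},"s":{"SEQNUM":[5,6]}}')

# Split at the first tab only, so any tabs inside the JSON are left alone
tab_pos <- regexpr("\t", lines, fixed = TRUE)
keys    <- substr(lines, 1, tab_pos - 1)
values  <- substr(lines, tab_pos + 1, nchar(lines))

# Parse one JSON string at a time and label each result with its key
parsed <- lapply(values, fromJSON, simplifyVector = TRUE)
names(parsed) <- keys

str(parsed$id1)
```

Each element of parsed is then an ordinary nested list that can be reshaped on its own before anything is bound together.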
I'm following the FCC's documentation to download some metadata about proceedings.
I don't believe I can post the data but you can get a free API key.
My code results in a list of 2 lists instead of a structured data frame built from the JSON.
My goal is to have a data frame where each JSON element is its own column, like a normal df.
library(httr)
library(jsonlite)
datahere = "C:/fcc/"
setwd(datahere)
URL <- "https://publicapi.fcc.gov/ecfs/filings?api_key=<KEY HERE>&proceedings.name=14-28&sort=date_disseminated,DESC"
dataDF <- GET(URL)
dataJSON <- content(dataDF, as="text")
dataJSON <- fromJSON(dataJSON)
# NAs
dataJSON2 <- lapply(dataJSON, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
x <- do.call("rbind", dataJSON2)
x <- as.data.frame(x)
The JSON is really deeply nested, so you need to put a little more thought into converting between list and data.frame. The logic below pulls out a data.frame of 25 filings (102 variables) and 10 aggregations (25 variables).
library(plyr)  # provides ldply()
# tackle the filings object
filings_df <- ldply(dataJSON$filings, function(x) {
# removes null list elements
x[sapply(x, is.null)] <- NA
# converts to a named character vector
unlisted_x <- unlist(x)
# converts the named character vector to data.frame
# with 1 column and rows for each element
d <- as.data.frame(unlisted_x)
# we need to transpose this data.frame because
# the rows should be columns, and don't check names when converting
d <- as.data.frame(t(d), check.names=F)
# now assign the actual names based on that original
# unlisted character vector
colnames(d) <- names(unlisted_x)
# now return to ldply function, which will automatically stack them together
return(d)
})
# tackle the aggregations object
# same exact logic to create the data.frame
aggregations_df <- ldply(dataJSON$aggregations, function(x) {
# removes null list elements
x[sapply(x, is.null)] <- NA
# converts to a named character vector
unlisted_x <- unlist(x)
# converts the named character vector to data.frame
# with 1 column and rows for each element
d <- as.data.frame(unlisted_x)
# we need to transpose this data.frame because
# the rows should be columns, and don't check names when converting
d <- as.data.frame(t(d), check.names=F)
# now assign the actual names based on that original
# unlisted character vector
colnames(d) <- names(unlisted_x)
# now return to ldply function, which will automatically stack them together
return(d)
})
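Because both calls above repeat the same steps, the per-record logic can be pulled into one helper. The sketch below is a base-R variant of that idea and not the answer's exact code: the records list and the name flatten_record are invented for illustration, and the NA padding that ldply() does automatically is done by hand here.

```r
# Toy stand-in for a list of nested records (not real FCC data)
records <- list(
  list(id = "a", filer = list(name = "X")),
  list(id = "b", filer = list(name = "Y", city = "Z"), pages = 3)
)

# Flatten one nested record into a one-row data frame
flatten_record <- function(x) {
  x[sapply(x, is.null)] <- NA        # keep top-level NULLs as NA
  flat <- unlist(x)                  # named vector, e.g. "filer.name"
  as.data.frame(as.list(flat), stringsAsFactors = FALSE)
}

rows <- lapply(records, flatten_record)

# Pad every row with the columns it lacks, then stack
all_cols <- unique(unlist(lapply(rows, names)))
rows <- lapply(rows, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
})
result <- do.call(rbind, rows)
result
```

Note that unlist() turns every value into a character string when any field is character, so numeric columns need converting back afterwards.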
I have a list of lists which are of variable length. The first value of each nested list is the key, and the rest of the values in the list will be the array entry. It looks something like this:
[[1]]
[1] "Bob" "Apple"
[[2]]
[1] "Cindy" "Apple" "Banana" "Orange" "Pear" "Raspberry"
[[3]]
[1] "Mary" "Orange" "Strawberry"
[[4]]
[1] "George" "Banana"
I've extracted the keys and entries as follows:
keys <- lapply(x, '[', 1)
entries <- lapply(x, '[', -1)
Now that I have these, I don't know how to associate each key with its values in R without creating a matrix first. That seems silly, since my data don't fit in a rectangle anyway (every example I've seen uses the column names of a matrix as the keys).
This is my crappy method: build a matrix, assign column names, and then use jsonlite to export to JSON.
#Create a matrix from entries, without recycling
#I found this function on StackOverflow which seems to work...
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
#Call said function
matrix <- cbind.fill(entries)
#Transpose the thing
matrix <- t(matrix)
#Set column names
colnames(matrix) <- keys
#Export to json
json<-toJSON(matrix)
The result is good, but the implementation sucks. Result:
[{"Bob":["Apple"],"Cindy":["Apple","Banana","Orange","Pear","Raspberry"],"Mary":["Orange","Strawberry"],"George":["Banana"]}]
Please let me know of better ways that might exist to accomplish this.
How about:
names(entries) <- unlist(keys)
toJSON(entries)
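A self-contained version of this suggestion might look like the sketch below. The input list is a shortened copy of the question's data plus one hypothetical key ("Dave") with no entries, added to show that negative indexing still behaves for a length-one vector.

```r
library(jsonlite)

x <- list(c("Bob", "Apple"),
          c("Cindy", "Apple", "Banana"),
          c("Dave"))                        # hypothetical key with no entries

keys    <- vapply(x, `[`, character(1), 1)  # first element of each vector
entries <- lapply(x, `[`, -1)               # everything after the key
names(entries) <- keys

json <- toJSON(entries)
json

# Reading the JSON back recovers the named structure
back <- fromJSON(json)
str(back)
```

Wrapping the list once more, as in toJSON(list(entries)), adds the outer square brackets shown in the question's result.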
Consider the following lapply() approach:
library(jsonlite)
entries <- list(c('Bob', 'Apple'),
c('Cindy', 'Apple', 'Banana', 'Orange','Pear','Raspberry'),
c('Mary', 'Orange', 'Strawberry'),
c('George', 'Banana'))
# ITERATE ALL CONTENTS EXCEPT FIRST
inner <- list()
nestlist <- lapply(entries,
function(i) {
inner <- i[2:length(i)]
return(inner)
})
# NAME EACH ELEMENT WITH FIRST ELEMENT
names(nestlist) <- lapply(entries, function(i) i[1])
#$Bob
#[1] "Apple"
#$Cindy
#[1] "Apple" "Banana" "Orange" "Pear" "Raspberry"
#$Mary
#[1] "Orange" "Strawberry"
#$George
#[1] "Banana"
x <- toJSON(list(nestlist), pretty=TRUE)
x
#[
# {
# "Bob": ["Apple"],
# "Cindy": ["Apple", "Banana", "Orange", "Pear", "Raspberry"],
# "Mary": ["Orange", "Strawberry"],
# "George": ["Banana"]
# }
#]
I think this has already been sufficiently answered but here is a method using purrr and jsonlite.
library(purrr)
library(jsonlite)
sample_data <- list(
list("Bob","Apple"),
list("Cindy","Apple","Banana","Orange","Pear","Raspberry"),
list("Mary","Orange","Strawberry"),
list("George","Banana")
)
sample_data %>%
map(~set_names(list(.x[-1]),.x[1])) %>%
toJSON(auto_unbox=TRUE, pretty=TRUE)
I have some JSON that looks like this:
"total_rows":141,"offset":0,"rows":[
{"id":"1","key":"a","value":{"SP$Sale_Price":"240000","CONTRACTDATE$Contract_Date":"2006-10-26T05:00:00"}},
{"id":"2","key":"b","value":{"SP$Sale_Price":"2000000","CONTRACTDATE$Contract_Date":"2006-08-22T05:00:00"}},
{"id":"3","key":"c","value":{"SP$Sale_Price":"780000","CONTRACTDATE$Contract_Date":"2007-01-18T06:00:00"}},
...
In R, what would be the easiest way to produce a scatter-plot of SP$Sale_Price versus CONTRACTDATE$Contract_Date?
I got this far:
install.packages("rjson")
library("rjson")
json_file <- "http://localhost:5984/testdb/_design/sold/_view/sold?limit=100"
json_data <- fromJSON(file=json_file)
install.packages("plyr")
library(plyr)
asFrame <- do.call("rbind.fill", lapply(json_data, as.data.frame))
but now I'm stuck...
> plot(CONTRACTDATE$Contract_Date, SP$Sale_Price)
Error in plot(CONTRACTDATE$Contract_Date, SP$Sale_Price) :
object 'CONTRACTDATE' not found
How to make this work?
Suppose you have the following JSON-file:
txt <- '{"total_rows":141,"offset":0,"rows":[
{"id":"1","key":"a","value":{"SP$Sale_Price":"240000","CONTRACTDATE$Contract_Date":"2006-10-26T05:00:00"}},
{"id":"2","key":"b","value":{"SP$Sale_Price":"2000000","CONTRACTDATE$Contract_Date":"2006-08-22T05:00:00"}},
{"id":"3","key":"c","value":{"SP$Sale_Price":"780000","CONTRACTDATE$Contract_Date":"2007-01-18T06:00:00"}}]}'
Then you can read it as follows with the jsonlite package:
library(jsonlite)
json_data <- fromJSON(txt, flatten = TRUE)
# get the needed dataframe
dat <- json_data$rows
# set convenient names for the columns
# this step is optional, it just gives you nicer columnnames
names(dat) <- c("id","key","sale_price","contract_date")
# convert the 'contract_date' column to a datetime format
dat$contract_date <- strptime(dat$contract_date, format="%Y-%m-%dT%H:%M:%S", tz="GMT")
Now you can plot:
plot(dat$contract_date, as.numeric(dat$sale_price))
Which gives: [scatter plot of sale_price against contract_date]
If you choose not to flatten the JSON, you can do:
json_data <- fromJSON(txt)
dat <- json_data$rows$value
sp <- strtoi(dat$`SP$Sale_Price`)
cd <- strptime(dat$`CONTRACTDATE$Contract_Date`, format="%Y-%m-%dT%H:%M:%S", tz="GMT")
plot(cd,sp)
Which gives the same plot.
I found a way that doesn't discard the field names:
install.packages("jsonlite")
install.packages("curl")
library(jsonlite)
json <- fromJSON(json_file)
r <- json$rows
At this point r looks like this:
> class(r)
[1] "data.frame"
> colnames(r)
[1] "id" "key" "value"
After some more Googling and trial-and-error I landed on this:
f <- r$value
sp <- strtoi(f[["SP$Sale_Price"]])
cd <- strptime(f[["CONTRACTDATE$Contract_Date"]], format="%Y-%m-%dT%H:%M:%S", tz="GMT")
plot(cd,sp)
And the result on my full data-set...
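Both answers depend on converting the string fields before plotting. The sketch below shows that conversion step on a two-row excerpt of the sample JSON. It is a variation on the code above, not a replacement: it assumes as.numeric() and as.POSIXct() are acceptable substitutes for strtoi() and strptime(), and the short column names are chosen only for convenience.

```r
library(jsonlite)

txt <- '{"rows":[
  {"id":"1","value":{"SP$Sale_Price":"240000","CONTRACTDATE$Contract_Date":"2006-10-26T05:00:00"}},
  {"id":"2","value":{"SP$Sale_Price":"2000000","CONTRACTDATE$Contract_Date":"2006-08-22T05:00:00"}}]}'

# flatten = TRUE lifts the nested "value" fields into top-level columns
dat <- fromJSON(txt, flatten = TRUE)$rows
names(dat) <- c("id", "sale_price", "contract_date")

# Prices arrive as strings; timestamps are ISO-like text without a zone
dat$sale_price    <- as.numeric(dat$sale_price)
dat$contract_date <- as.POSIXct(dat$contract_date,
                                format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")

# Sort by date so a line plot would also read left to right
dat <- dat[order(dat$contract_date), ]
str(dat)
```

With both columns typed this way, plot(dat$contract_date, dat$sale_price) draws the scatter plot directly from the data frame.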
Actual question
How can I serialize objects to ASCII and unserialize them again from ASCII without having to write to and read from a file connection (i.e. from ASCII that is in-memory)?
Background
In a state-less client-server framework, I would like to make certain information persistent across calls (serialize >> send to client >> get serialized info back from client >> unserialize) without caching it on the server side.
Note that my JSON object/string also contains other, unserialized information and is thus mixed with the serialized information, which is why the approach explained in this post doesn't completely do the trick.
Now, the thing is that I would like to unserialize the object solely based on the already-read JSON string. So to speak: from "in-memory ASCII" instead of from a file connection. How would I do that?
Here's what I tried:
require(forecast)
Approach 1
## SERVER: estimates initial model and writes JSON to socket
model <- auto.arima(AirPassengers, trace = TRUE)
## Model trace:
# ARIMA(2,1,2)(1,1,1)[12] : Inf
# ARIMA(0,1,0)(0,1,0)[12] : 967.6773
# ARIMA(1,1,0)(1,1,0)[12] : 965.4487
# ARIMA(0,1,1)(0,1,1)[12] : 957.1797
# ARIMA(0,1,1)(1,1,1)[12] : 963.5291
# ARIMA(0,1,1)(0,1,0)[12] : 956.7848
# ARIMA(1,1,1)(0,1,0)[12] : 959.4575
# ARIMA(0,1,2)(0,1,0)[12] : 958.8701
# ARIMA(1,1,2)(0,1,0)[12] : 961.3943
# ARIMA(0,1,1)(0,1,0)[12] : 956.7848
# ARIMA(0,1,1)(1,1,0)[12] : 964.7139
#
# Best model: ARIMA(0,1,1)(0,1,0)[12]
fc <- as.data.frame(forecast(model))
deparsed <- deparse(model)
json_out <- list(data = AirPassengers, model = deparsed, fc = fc)
json_out <- jsonlite::toJSON(json_out)
## CLIENT: keeps estimated model, updates data, writes to socket
json_in <- jsonlite::fromJSON(json_out)
json_in$data <- window(AirPassengers, end = 1949 + (1/12 * 14))
## SERVER: reads new JSON and applies model to new data
data <- json_in$data
model_0 <- json_in$model
model_1 <- eval(parse(text = model_0))
## Model trace:
# ARIMA(2,1,2)(1,1,1)[12] : Inf
# ARIMA(0,1,0)(0,1,0)[12] : 967.6773
# ARIMA(1,1,0)(1,1,0)[12] : 965.4487
# ARIMA(0,1,1)(0,1,1)[12] : 957.1797
# ARIMA(0,1,1)(1,1,1)[12] : 963.5291
# ARIMA(0,1,1)(0,1,0)[12] : 956.7848
# ARIMA(1,1,1)(0,1,0)[12] : 959.4575
# ARIMA(0,1,2)(0,1,0)[12] : 958.8701
# ARIMA(1,1,2)(0,1,0)[12] : 961.3943
# ARIMA(0,1,1)(0,1,0)[12] : 956.7848
# ARIMA(0,1,1)(1,1,0)[12] : 964.7139
#
# Best model: ARIMA(0,1,1)(0,1,0)[12]
# Warning message:
# In auto.arima(x = structure(list(x = structure(c(112, 118, 132, :
# Unable to fit final model using maximum likelihood. AIC value approximated
fc <- as.data.frame(forecast(Arima(data, model = model_1)))
## And so on ...
That works, but note that eval(parse(text = json_in$model)) actually re-runs the call to auto.arima() instead of just re-establishing/unserializing the object (note the trace information printed to the console that I included as comments).
That's not quite what I want, as I simply want to re-establish the final model object in the fastest possible way.
That's why I turned to serialize() next.
Approach 2
## SERVER: estimates initial model and writes JSON to socket
model <- auto.arima(AirPassengers, trace = TRUE)
fc <- as.data.frame(forecast(model))
serialized <- serialize(model, NULL)
class(serialized)
json_out <- list(data = AirPassengers, model = serialized, fc = fc)
json_out <- jsonlite::toJSON(json_out)
## CLIENT: keeps estimated model, updates data, writes to socket
json_in <- jsonlite::fromJSON(json_out)
json_in$data <- window(AirPassengers, end = 1949 + (1/12 * 14))
## SERVER: reads new JSON and applies model to new data
data <- json_in$data
model_0 <- json_in$model
try(model_1 <- unserialize(model_0))
## --> error:
# Error in unserialize(model_0) :
# character vectors are no longer accepted by unserialize()
Unfortunately, unserialize() does not accept the character data that comes back from the JSON; it expects a connection (or a raw vector) instead of "plain ASCII".
So that's why I need to do the following workaround.
Approach 3
## SERVER: estimates initial model and writes JSON to socket
model <- auto.arima(AirPassengers, trace = TRUE)
fc <- as.data.frame(forecast(model))
con <- file("serialized", "w+")
serialize(model, con)
close(con)
json_out <- list(data = AirPassengers, model = "serialized", fc = fc)
json_out <- jsonlite::toJSON(json_out)
## CLIENT: keeps estimated model, updates data, writes to socket
json_in <- jsonlite::fromJSON(json_out)
json_in$data <- window(AirPassengers, end = 1949 + (1/12 * 14))
## SERVER: reads new JSON and applies model to new data
data <- json_in$data
model_0 <- json_in$model
con <- file(model_0, "r+")
model_1 <- unserialize(con)
close(con)
fc <- as.data.frame(forecast(Arima(data, model = model_1)))
## And so on ...
Unserializing works now without the actual auto.arima() call being re-evaluated. But it's against my state-less paradigm as now the actual information is cached on the server side instead of actually being sent via the JSON object/string.
Does this fit your needs?
It follows the general strategy in your Approach 2. The only difference is that it uses as.character() to convert the serialized object to a character vector before passing it to toJSON(), and then uses as.raw(as.hexmode()) to convert it back to a raw vector "on the other side". (I've marked the two edited lines with comments reading ## <<- Edited.)
library(forecast)
## SERVER: estimates initial model and writes JSON to socket
model <- auto.arima(AirPassengers, trace = TRUE)
fc <- as.data.frame(forecast(model))
serialized <- as.character(serialize(model, NULL)) ## <<- Edited
class(serialized)
json_out <- list(data = AirPassengers, model = serialized, fc = fc)
json_out <- jsonlite::toJSON(json_out)
## CLIENT: keeps estimated model, updates data, writes to socket
json_in <- jsonlite::fromJSON(json_out)
json_in$data <- window(AirPassengers, end = 1949 + (1/12 * 14))
## SERVER: reads new JSON and applies model to new data
data <- json_in$data
model_0 <- as.raw(as.hexmode(json_in$model)) ## <<- Edited
unserialize(model_0)
## Series: AirPassengers
## ARIMA(0,1,1)(0,1,0)[12]
##
## Coefficients:
## ma1
## -0.3184
## s.e. 0.0877
##
## sigma^2 estimated as 137.3: log likelihood=-508.32
## AIC=1020.64 AICc=1020.73 BIC=1026.39
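The same round trip can be written with the serialized bytes collapsed into a single hex string, which keeps the model as one JSON field instead of an array with one element per byte. This is only a sketch of that variation: an lm() fit on the built-in cars data stands in for the ARIMA model so that the forecast package is not needed, and the field names note and model are invented.

```r
library(jsonlite)

model <- lm(dist ~ speed, data = cars)   # stand-in for the ARIMA model

## SERVER: serialize in memory and collapse the bytes to one hex string
raw_model <- serialize(model, NULL)
hex <- paste(as.character(raw_model), collapse = "")
json_out <- toJSON(list(note = "plain field", model = hex), auto_unbox = TRUE)

## SERVER (later call): read the JSON and rebuild the raw vector
json_in <- fromJSON(json_out)
n <- nchar(json_in$model)
bytes <- substring(json_in$model, seq(1, n, by = 2), seq(2, n, by = 2))
raw_back <- as.raw(strtoi(bytes, base = 16L))

# unserialize() accepts a raw vector directly, so no file is involved
restored <- unserialize(raw_back)
coef(restored)
```

Hex doubles the payload size relative to the raw bytes. If that matters, a base64 encoding (for example jsonlite's base64_enc() and base64_dec()) is a more compact option, though it is not shown here.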