Convert JSON URL to R Data Frame

I'm having trouble converting a JSON file (from an API) to a data frame in R. An example is the URL http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json
I've tried a few different suggestions from S/O, including
convert json data to data frame in R and various blog posts such as http://zevross.com/blog/2015/02/12/using-r-to-download-and-parse-json-an-example-using-data-from-an-open-data-portal/
The closest I've been is using the code below, which gives me a large matrix with 4 "rows" and a bunch of "variables" (V1, V2, etc.). I'm assuming this JSON file is in a different format than "normal" ones.
library(RJSONIO)
library(RCurl)  # getURL() comes from RCurl, not RJSONIO
raw_data <- getURL("http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
data <- fromJSON(raw_data)
final_data <- do.call(rbind, data)
I'm pretty agnostic as to how to get it to work so any R packages/process are welcome. Thanks in advance.

The jsonlite package automatically simplifies the nested JSON into a data frame:
library(jsonlite)
mydata <- fromJSON("http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
names(mydata$players)
# [1] "id" "esbid" "gsisPlayerId" "name"
# [5] "position" "teamAbbr" "stats" "seasonPts"
# [9] "seasonProjectedPts" "weekPts" "weekProjectedPts"
head(mydata$players)
# id esbid gsisPlayerId name position teamAbbr stats.1
# 1 100029 FALSE FALSE San Francisco 49ers DEF SF 16
# 2 729 ABD660476 00-0025940 Husain Abdullah DB KC 15
# 3 2504171 ABR073003 00-0019546 John Abraham LB 15
# 4 2507266 ADA509576 00-0025668 Michael Adams DB 13
# 5 2505708 ADA515576 00-0022247 Mike Adams DB IND 15
# 6 1037889 ADA534252 00-0027610 Phillip Adams DB ATL 11
You can control this with the simplifyVector, simplifyDataFrame, and simplifyMatrix arguments of jsonlite::fromJSON().
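For instance, with a small inline JSON string (a stand-in for the API response) you can see the effect of simplifyDataFrame:

```r
library(jsonlite)

# A small JSON array of objects, similar in shape to the API's "players" list
txt <- '[{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]'

# Default behaviour: objects with matching keys become a data frame
df <- fromJSON(txt)
class(df)   # "data.frame"

# With simplification turned off you keep the raw nested list
lst <- fromJSON(txt, simplifyDataFrame = FALSE)
class(lst)  # "list"
```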

There's nothing "abnormal" about this JSON; it's just not a rectangular structure that fits trivially into a data frame. JSON can represent much richer data structures.
For example, using the rjson package:
> data = rjson::fromJSON(file="http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
> length(data[[4]][[10]]$stats)
[1] 14
> length(data[[4]][[1]]$stats)
[1] 21
(data[[1]] through data[[3]] look like header metadata.)
The "stats" of the 10th element of data[[4]] has 14 elements, while the "stats" of the first has 21. How is that going to fit into a rectangular data frame? R has stored it in a list because that's R's best way of storing irregular data structures.
Unless you can define a way of mapping the irregular data into a rectangular data frame, you can't store it in a data frame. Do you understand the structure of the data? That's essential.
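One common mapping, sketched here on made-up data under the assumption that each player's stats is a named list, is to take the union of all stat names across players and fill the gaps with NA:

```r
# Two "players" with different sets of stats (irregular, like the API data)
p1 <- list(name = "A", stats = list(s1 = 10, s2 = 20))
p2 <- list(name = "B", stats = list(s2 = 5, s3 = 7))
players <- list(p1, p2)

# Union of all stat names across players
all_stats <- unique(unlist(lapply(players, function(p) names(p$stats))))

# Build one row per player, filling missing stats with NA
rows <- lapply(players, function(p) {
  vals <- setNames(as.list(rep(NA, length(all_stats))), all_stats)
  vals[names(p$stats)] <- p$stats
  data.frame(name = p$name, vals, stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)
df
#   name s1 s2 s3
# 1    A 10 20 NA
# 2    B NA  5  7
```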

rjson and jsonlite export similarly named functions, like fromJSON, so depending on the order you load them, one will mask the other. For my purposes, rjson structures the data better than jsonlite, so I make sure to load the packages in the correct order, or only load rjson.
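A safer alternative to relying on load order is to call each package's function through its namespace explicitly, so it never matters which one was attached last:

```r
txt <- '[{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]'

# jsonlite simplifies the array of objects into a 2-row data frame
jsonlite::fromJSON(txt)

# rjson leaves it as a list of two named lists
str(rjson::fromJSON(txt))
```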

# jsonlite is loaded
library(jsonlite)
# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
# Import Quandl data:
quandl_data <- fromJSON(quandl_url)
# quandl_data is a list
quandl_data
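Assuming the response follows Quandl's usual layout (a dataset_data element containing column_names and a data matrix; run str(quandl_data) to confirm before relying on this), you could turn the list into a data frame like so:

```r
# Sketch, not verified against the live API: after jsonlite's
# simplification, dataset_data$data is typically a character matrix
fb <- as.data.frame(quandl_data$dataset_data$data,
                    stringsAsFactors = FALSE)
names(fb) <- quandl_data$dataset_data$column_names
head(fb)
```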

Related

Parsing unstructured data - Append rows with non-existent values to a dataset

I am attempting to build a dataset from unstructured data. The raw data is a series of JSON files, each of which contains information about a single element of the data (e.g. each file becomes a row in the final data). I am looping through the JSON files using jsonlite to turn each into a huge nested list. This whole operation is falling apart on the basis of a seemingly simple problem:
I need to append rows to my data where some of the elements do not exist.
The raw data is like so:
jsn1 <- list(id='id1', col1=1, col2='A', col3=11)
jsn2 <- list(id='id2', col1=2, col2='B', col3=12)
jsn3 <- list(id='id3', col2='C', col3=13)
jsn4 <- list(id='id4', col1=3, col3=14)
The structure I am trying to get to is this:
df <- data.frame(id=c('id1','id2','id3','id4'),
                 col1=c(1,2,NA,3),
                 col2=c('A','B','C',NA),
                 col3=c(11,12,13,14))
> df
id col1 col2 col3
1 id1 1 A 11
2 id2 2 B 12
3 id3 NA C 13
4 id4 3 <NA> 14
My approach is along the lines of:
#Collect the json names in a vector
files=c('jsn1','jsn2','jsn3','jsn4')
#Initialize the dataframe with the first row filling in any missing values.
#I didn't do this at first, but it seems helpful.
df1 <- data.frame(id=jsn1$id,
                  col1=jsn1$col1,
                  col2=jsn1$col2,
                  col3=jsn1$col3,
                  stringsAsFactors=F)
#Create a loop to loop through the files extracting the values then add them to a dataframe.
for (i in 2:length(files)) {
  a <- get(files[i])
  new.row <- list(id=a$id,
                  col1=a$col1,
                  col2=a$col2,
                  col3=a$col3)
  df1 <- rbind(df1, new.row)
}
However, this doesn't work because df1 <- rbind(df1,new.row) requires the columns to be the same length. I have tried df1 <- rbind.fill(df1,new.row), rbindlist(list(df1,new.row),use.names=T,fill=T), and df[nrow(df1) +1,names(new.row)] <- new.row. And read this and this among others.
Most answers can add to the data frame by "knowing" a priori what columns will be null /not null. Then constructing a df without those columns and adding it with fill. This won't work as I have no idea what columns will be present ahead of time. The missing ones currently end up with 0 elements which is the root of the problem, but I need to check if they are present. It seems like there should be an easy way to handle this either "on read" or on the rbind, but I can't seem to figure it out.
There are potentially hundreds of columns and millions of rows (though only 10s and 100s right now). The JSON files are large, so reading them all into memory / concatenating the lists somehow is probably not possible with the real data. A solution using data.table would probably be ideal, but any help is appreciated. Thanks.
You could do
data.table::rbindlist(mget(ls(pattern = "jsn[1-4]")), fill = TRUE)
# id col1 col2 col3
# 1: id1 1 A 11
# 2: id2 2 B 12
# 3: id3 NA C 13
# 4: id4 3 NA 14
Here mget(ls(pattern = "jsn[1-4]")) is a more programmatic way to gather the lists from the global environment that match the pattern jsn followed by the numbers 1-4. It's just the same as list(jsn1, jsn2, jsn3, jsn4) except it comes with names. You could just as easily do
rbindlist(list(jsn1, jsn2, jsn3, jsn4), fill = TRUE)
The ls() method will be better if you have many more jsn* lists.
You want rbind.fill from plyr. First convert all your lists to data frames (here using lapply(mylists, as.data.frame)), then use rbind.fill to bind them, filling missing values with NA:
library(plyr)
rbind.fill(lapply(list(jsn1, jsn2, jsn3, jsn4), as.data.frame))
id col1 col2 col3
1 id1 1 A 11
2 id2 2 B 12
3 id3 NA C 13
4 id4 3 <NA> 14
I am going to jump back in because I have figured this out and want to share in case anyone else comes across this problem. The problem was that looking up values not present in a given list, inside the loop, was creating NULLs that prevent rbindlist and rbind.fill from working.
To be more specific, I was doing this:
new.row <- list(id=a$id,
col1=a$col1,
col2=a$col2,
col3=a$col3)
when a$col2 was not in the list read from the JSON. This causes new.row$col2 to be NULL, and then you cannot use new.row in rbindlist or rbind.fill. However, all I needed to do was remove these NULLs from the list like so:
plyr::compact(new.row)
before then using rbindlist. Both answers were helpful by showing me that rbindlist or rbind.fill would work without the null values.
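Putting the pieces together, a minimal sketch of the fixed loop (using two of the jsn* lists from the question) looks like this:

```r
library(data.table)

jsn1 <- list(id = 'id1', col1 = 1, col2 = 'A', col3 = 11)
jsn3 <- list(id = 'id3', col2 = 'C', col3 = 13)  # col1 missing

rows <- list()
for (nm in c('jsn1', 'jsn3')) {
  a <- get(nm)
  new.row <- list(id = a$id, col1 = a$col1, col2 = a$col2, col3 = a$col3)
  rows[[nm]] <- plyr::compact(new.row)  # drop the NULL entries
}

# fill = TRUE pads the missing col1 with NA
rbindlist(rows, use.names = TRUE, fill = TRUE)
#     id col1 col2 col3
# 1: id1    1    A   11
# 2: id3   NA    C   13
```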

Selectively import only JSON data in a txt file into R

I have 3 questions I would like to ask, as I am relatively new to both R and the JSON format. I have read quite a bit but still don't quite understand.
1.) Can R parse JSON data when the txt file contains other irrelevant information as well?
Assuming I can't, I uploaded the text file into R and did some cleaning up. So that it will be easier to read the file.
require(plyr)
require(rjson)
small.f.2 <- subset(small.f.1, ! V1 %in% c("Level_Index:", "Feature_Type:", "Goals:", "Move_Count:"))
small.f.3 <- small.f.2[,-1]
This would give me a single column with all the json data in each line.
I tried to write new .txt file .
write.table(small.f.3, file="small clean.txt", row.names = FALSE)
json_data <- fromJSON(file="small clean.txt")
The problem was it only converted 'x' (first row) into a character and ignored everything else. I imagined it was the problem with "x" so I took that out from the .txt file and ran it again.
json_data <- fromJSON(file="small clean copy.txt")
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=""))
Both times it worked and I managed to create a list, but it only takes the data from the first row and ignores the rest. This leads to my second question.
I tried this..
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=","))
Error in fromJSON(paste(readLines("small clean copy.txt"), collapse = ",")) :
unexpected character ','
2.) How can I extract the rest of the rows in the .txt file?
3.) Is it possible for R to read the Json data from one row, and extract only the nested data that I need, and subsequently go on to the next row, like a loop? For example, in each array, I am only interested in the Action vectors and the State Feature vectors, but I am not interested in the rest of the data. If I can somehow extract only the information I need before moving on to the next array, than I can save a lot of memory space.
I validated the arrays online, but the .txt file as a whole is not JSON formatted; only each individual array is. I hope this makes sense. Each row is a nested array.
The data looks something like this. I have about 65 rows (nested arrays) in total.
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],"LightningIndices":[],"SelectedAction":12,"State":{"Features":{"Data":[21.0,58.0,0.599999964237213,12.0,9.0,3.0,1.0,0.0,11.0,2.0,1.0,0.0,0.0,0.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12213890532609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13055793241076,0.0,0.0,0.0,0.0,0.0,0.231325346416068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.949158357257511,0.0,0.0,0.0,0.0,0.0,0.369666537828737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0851765937900996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223409208023677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.698640447815897,1.69496718435102,0.0,0.0,0.0,0.0,1.42312654023416,0.0,0.38394999584831,0.0,0.0,0.0,0.0,1.0,1.22164326251584,1.30980246401454,1.00411570750454,0.0,0.0,0.0,1.44306759429513,0.0,0.00568191150434618,0.0,0.0,0.0,0.0,0.0,0.0,0.157705869690127,0.0,0.0,0.0,0.0,0.102089274086033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37039305683305,2.64354332879095,0.0,0.456876463171171,0.0,0.0,0.208651305680117,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.46713142511126,2.26785558685153,0.284845692694476,0.29200364444299,0.0,0.562185300773834,1.79134869431988,0.423426746571872,0.0,0.0,0.0,0.0,5.06772310533214,0.0,1.95593334724537,2.08448537685298,1.22045520912269,0.251119892385839,0.0,4.86192274732091,0.0,0.186941346075472,0.0,0.0,0.0,0.0,4.37998688020614,0.0,3.04406665275463,1.0,0.49469909818283,0.0,0.0,1.57589195190525,0.0,0.0,0.0,0.0,0.0,0.0,3.55229001446173]}},......
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,24],"LightningIndices":[[15,16,17,18,19,20,21,22,23]],"SelectedAction":15,"State":{"Features":{"Data":[20.0,53.0,0.0,11.0,10.0,2.0,1.0,0.0,12.0,2.0,1.0,0.0,0.0,1.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110686363475575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13427913742728,0.0,0.0,0.0,0.0,0.0,0.218834141070836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.939443046803111,0.0,0.0,0.0,0.0,0.0,0.357568892126985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0889329732996782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22521492930721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.700441220022084,1.6762090551226,0.0,0.0,0.0,0.0,1.44526456614638,0.0,0.383689214317325,0.0,0.0,0.0,0.0,1.0,1.22583659574753,1.31795156033445,0.99710368703165,0.0,0.0,0.0,1.44325394830013,0.0,0.00418600599483917,0.0,0.0,0.0,0.0,0.0,0.0,0.157518319482216,0.0,0.0,0.0,0.0,0.110244186273209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369899973785845,2.55505143302811,0.0,0.463342609296841,0.0,0.0,0.226088384842823,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.47842109127488,2.38476342332125,0.0698115810371108,0.276804206873942,0.0,1.53514282355593,1.77391161515718,0.421465101754304,0.0,0.0,0.0,0.0,4.45530484778828,0.0,1.43798302409155,3.46965807176681,0.468528940277049,0.259853183829217,0.0,4.86988325473155,0.0,0.190659677933533,0.0,0.0,0.963116148760181,0.0,4.29930830894124,0.0,2.56201697590845,0.593423384852181,0.46165947868584,0.0,0.0,1.59497392171253,0.0,0.0,0.0,0.0,0.0368838512398189,0.0,4.24538684327048]}},......
I would really appreciate any advice here.
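Since each row of the file is a self-contained JSON document, one line-by-line approach, sketched here with field names taken from the sample rows above (SelectedAction, State$Features$Data), is to parse one line at a time and keep only the fields of interest, so the full nested structure is never held in memory at once:

```r
library(rjson)

# Hypothetical two-line input in the same shape as the sample rows;
# in practice: lines <- readLines("small clean copy.txt")
lines <- c('{"SelectedAction": 12, "State": {"Features": {"Data": [1, 2]}}}',
           '{"SelectedAction": 15, "State": {"Features": {"Data": [3, 4]}}}')

# Parse each row separately and extract only what is needed
extracted <- lapply(lines, function(l) {
  row <- fromJSON(l)
  list(action = row$SelectedAction,
       state_features = row$State$Features$Data)
})

length(extracted)      # 2
extracted[[1]]$action  # 12
```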

Reading text/number mixed CSV files as tables in Octave

Is there an easy way in Octave to load data from a CSV into a data structure similar to data frames in R? I tried csvread and dlmread, but Octave keeps reading the text as imaginary numbers; plus I'd like to have the column headers as references. I saw some examples online which seem way too twisted. How is it possible that there is no function similar to R's data.frame? I saw a package called dataframe, but I can't seem to figure out how it works. Any tip or suggestion?
csvread('x') %returns 1 column imaginary numbers
dlmread('x') %returns N columns imaginary numbers
Any working alternative?
Why are you unable to make the dataframe package work? You need to be more specific. Here's a simple example:
$ cat cars.csv
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
$ octave
octave-cli-3.8.2:1> pkg load dataframe
octave-cli-3.8.2:2> cars = dataframe ("cars.csv")
cars = dataframe with 2 rows and 3 columns
Src: cars.csv
_1 Year Make Model
Nr double char char
1 1997 Ford E350
2 2000 Mercury Cougar

Use MySQL database query results to plot R graph using RApache and Brew

I am trying to plot a graph using R which is populated by MySQL query results. I have the following code:
rs = dbSendQuery(con, "SELECT BuildingCode, AccessTime from access")
data = fetch(rs, n=-1)
x = data[,1]
y = data[,2]
cat(colnames(data),x,y)
This gives me an output of:
BuildingCode AccessTime TEST-0 TEST-1 TEST-2 TEST-3 TEST-4 14:40:59 07:05:00 20:10:59 08:40:00 07:30:59
But this is where I get stuck. I have no idea how to pass the "cat" data into an R plot. I have spent hours searching online, and most of the examples of R plots I have come across use read.table(text=""). This is not feasible for me as the data has to come from a database and not be hard-coded in. I also found something about saving the output as a CSV, but MySQL cannot overwrite existing files, so after the code was executed once I was unable to do it again as the file already existed.
My question is, how can I use the "cat" data (or another way of doing it if there is a better way) to plot a graph using data that isn't hard coded?
Note: I am using RApache as my web server and I have installed the Brew package.
Make the plot using R and just pass the path to the file back in cat
<%
## Your other code to get the data, assuming it gets a data.frame called data
## Plot code
library(Cairo)
myplotfilename <- "/path/to/dir/myplot.png"
CairoPNG(filename = myplotfilename, width = 480, height = 480)
plot(x=data[,1],y=data[,2])
tmp <- dev.off()
cat(myplotfilename)
%>

Extract text from HTML node tree with R

I'm currently trying to scrape text from an HTML tree that I've parsed as follows:
require(RCurl)
require(XML)
query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)
query.IMDB
query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")
My first attempt was just to use grep on the resulting vector, but this fails.
data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable
My next attempt was to use grep on the individual points in the query.IMDB vector:-
vect <- numeric(length(df.IMDB))
for (i in 1:length(df.IMDB)) {
  vect[i] <- data[grep("Users rated this", "", df.IMDB)]
}
but this also throws the closure not subsettable error.
Finally trying the above function without data[] around the grep throws
Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero
I'm actually hoping to eventually replace everything except a number of the form [0-9].[0-9] following the given text string with blank space, but I'm doing a simpler version first to get the thing working.
Can anyone advise what function I should be using to edit the text in each point on my query.IMDB vector
No need to use grep here (avoid regular expressions with HTML files). Use the handy function readHTMLTable from the XML package:
library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire 8.2 2,694
2 Bart the Genius 7.8 1,167
3 Homer's Odyssey 7.5 1,005
4 There's No Disgrace Like Home 7.9 1,017
5 Bart the General 8.0 992
6 Moaning Lisa 7.4 988
This gives you the table of ratings. You may also want to convert UserVotes to numeric.
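Since the scraped UserVotes column comes through as character strings with thousands separators, a minimal sketch of that conversion:

```r
# UserVotes arrives as character strings like "2,694"
votes <- c("2,694", "1,167", "1,005")

# Strip the thousands separators, then convert to numeric
votes_num <- as.numeric(gsub(",", "", votes, fixed = TRUE))
votes_num
# [1] 2694 1167 1005
```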