rjson::fromJSON returns only the first item - json

I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column.
I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?
Here's an example that mimics my problem:
> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)
$3x
[1] 20
$6y
[1] 23
The desired output is something like:
# a data frame for both variables
3x 6y
1 20 23
2 60 50
or,
# a data frame for each variable
3x
1 20
2 60
6y
1 23
2 50

What you are looking for is actually a combination of lapply and some application of rbind or related.
I'll extend your data a little, just to have more than 2 elements.
eg <- c("{\"3x\": 20, \"6y\": 23}",
"{\"3x\": 60, \"6y\": 50}",
"{\"3x\": 99, \"6y\": 72}")
library(jsonlite)
Using base R, we can do
do.call(rbind.data.frame, lapply(eg, fromJSON))
# X3x X6y
# 1 20 23
# 2 60 50
# 3 99 72
You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.
Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)
If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps
eg <- c(eg, "{\"3x\": 99}")
then you'll notice that the base R solution no longer works by default.
do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) :
# numbers of columns of arguments do not match
There may be techniques to try to normalize the elements such that you can be assured of matches. However, if you're not averse to a tidyverse package:
library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
# `3x` `6y`
# <int> <int>
# 1 20 23
# 2 60 50
# 3 99 72
# 4 99 NA
though you cannot call it as directly with the dollar-method, you can still use [[ or backticks.
eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99

Related

NetLogo - using BehaviorSpace get all turtles locations as the result of each repetition

I am using BehaviorSpace to run the model hundreds of times with different parameters. But I need to know the locations of all turtles as a result instead of only the number of turtles. How can I achieve it with BehaviorSpace?
Currently, I output the results in a csv file by this code:
to-report get-locations
report (list xcor ycor)
end
to generate-output
file-open "model_r_1.0_locations.csv"
file-print csv:to-row get-locations
file-close
end
but all results are popped into same csv file, so I can't tell the condition of each running.
Seth's suggestion of incorporating behaviorspace-run-number in the filename of your csv output is one alternative. It would allow you to associate that file with the summary data in your main BehaviorSpace output file.
Another option is to include list reporters as "measures" in your behavior space experiment definition. For example, in your case:
map [ t -> [ xcor ] of t ] sort turtles
map [ t -> [ ycor ] of t ] sort turtles
You can then parse the resulting list "manually" in your favourite data analysis language. I've used the following function for this before, in Julia:
parselist(strlist, T = Float64) = parse.(T, split(strlist[2:end-1]))
I'm sure you can easily write some equivalent code in Python or R or whatever language you're using.
In the example above, I've outputted separate lists for the xcor and the ycor of turtles. You could also output a single "list of lists", but the parsing would be trickier.
Edit: How to do this using the csv extension and R
Coincidentally, I had to do something similar today for a different project, and I realized that a combination of the csv extension and R can make this very easy.
The general idea is the following:
In NetLogo, use csv:to-string to encode list data into a string and then write that string directly in the BehaviorSpace output.
In R, use purrr::map and readr::read_csv, followed by tidyr::unnest, to unpack everything in a neat "one observation per row" dataframe.
In other words: we like CSV, so we put CSV in our CSV so we can parse while we parse.
Here is a full-fledged example. Let's say we have the following NetLogo model:
extensions [ csv ]
to setup
clear-all
create-turtles 2 [ move-to one-of patches ]
reset-ticks
end
to go
ask turtles [ forward 1 ]
tick
end
to-report positions
let coords [ (list who xcor ycor) ] of turtles
report csv:to-string fput ["who" "x" "y"] coords
end
We then define the following tiny BehaviorSpace experiment, with only two repetitions and a time limit of two, using our positions reporter as an output:
The R code to process this is pleasantly straightforward:
library(tidyverse)
df <- read_csv("experiment-table.csv", skip = 6) %>%
mutate(positions = map(positions, read_csv)) %>%
unnest()
Which results in the following dataframe, all neat and tidy:
> df
# A tibble: 12 x 5
`[run number]` `[step]` who x y
<int> <int> <int> <dbl> <dbl>
1 1 0 0 16 10
2 1 0 1 10 -2
3 1 1 1 9.03 -2.24
4 1 1 0 -16.0 10.1
5 1 2 1 8.06 -2.48
6 1 2 0 -15.0 10.3
7 2 0 1 -14 1
8 2 0 0 13 15
9 2 1 0 14.0 15.1
10 2 1 1 -13.7 0.0489
11 2 2 0 15.0 15.1
12 2 2 1 -13.4 -0.902
The same thing in Julia:
using CSV, DataFrames
df = CSV.read("experiment-table.csv", header = 7)
cols = filter(col -> col != :positions, names(df))
df = by(df -> CSV.read(IOBuffer(df[:positions][1])), df, cols)

Evaluate Multivariate Function on a grid for two of the variables

I should probably delete this question, since I found an answer elsewhere on stackoverflow.
Using 'Vectorize' solved the problem.
outer(x,y,Vectorize(model))
I'll leave this up for a bit in case this helps anyone else. The issue I had is described below.
I am an occasional user of R, mostly for Scientific Programming. I want to evaluate a function on a grid of x,y values. Something similar to this simple example:
> x=seq(1,5,1)
> y=seq(1,5,1)
> model = function(a,b){a+b}
> (outer(x,y,model))
[,1] [,2] [,3] [,4] [,5]
[1,] 2 3 4 5 6
[2,] 3 4 5 6 7
[3,] 4 5 6 7 8
[4,] 5 6 7 8 9
[5,] 6 7 8 9 10
The input and output of my actual function looks like this (the volumetric flow rate of a 'Cross' fluid in a circular pipe):
> SimpleCrossModelCircPipe(eta_0 = 176.45, lambda = 2.4526, m=0.83122, mesh=1000,
pipe_length=1.925, press_drop=20.3609, pipe_diameter=0.0413)
[1] 0.005635064
I want to evaluate the function on a grid of values, for example, press_drop and pipe_diameter. So, define a function of these two variables.
> model = function(a,b){SimpleCrossModelCircPipe(eta_0 = 176.45, lambda = 2.4526,
m=0.83122, mesh=1000,pipe_length=1.925, press_drop=a, pipe_diameter=b)}
For just one pair of values this does work:
> model(1,2)
[1] 107249.8
But, if I use with 'outer', there is an error.
> (outer(x,y,model))
Error in seq.default(tau_wall, 0, length.out = mesh + 1) :
'from' must be of length 1
If I place a print statement for the local variable, tau_wall, using outer generates all 25 values, before moving to the line, which generates the error, since tau_wall should have length 1, not 25.
(outer(x,y,model))
[1] 352.5289 705.0579 1057.5868 1410.1158 1762.6447 705.0579 1410.1158 2115.1736
[9] 2820.2315 3525.2894 1057.5868 2115.1736 3172.7605 4230.3473 5287.9341 1410.1158
[17] 2820.2315 4230.3473 5640.4630 7050.5788 1762.6447 3525.2894 5287.9341 7050.5788
[25] 8813.2235
Is there a simple fix to make this work?
Best,
Steve

Importing/Conditioning a file.txt with a "kind" of json structure in R

I wanted to import a .txt file in R but the format is really special and it's looks like a json format but I don't know how to import it. There is an example of my data:
{"datetime":"2015-07-08 09:10:00","subject":"MMM","sscore":"-0.2280","smean":"0.2593","svscore":"-0.2795","sdispersion":"0.375","svolume":"8","sbuzz":"0.6026","lastclose":"155.430000000","companyname":"3M Company"},{"datetime":"2015-07-07 09:10:00","subject":"MMM","sscore":"0.2977","smean":"0.2713","svscore":"-0.7436","sdispersion":"0.400","svolume":"5","sbuzz":"0.4895","lastclose":"155.080000000","companyname":"3M Company"},{"datetime":"2015-07-06 09:10:00","subject":"MMM","sscore":"-1.0057","smean":"0.2579","svscore":"-1.3796","sdispersion":"1.000","svolume":"1","sbuzz":"0.4531","lastclose":"155.380000000","companyname":"3M Company"}
To deal with this is used this code:
test1 <- read.csv("C:/Users/test1.txt", header=FALSE)
## Import as 5 observations (5th is all empty) of 1700 variables
#(in fact 40 observations of 11 variables). In fact when I imported the
#.txt file, it's having one line (5th obs) empty, and 4 lines of data and
#placed next to each other 4 lines of data of 11 variables.
# Get the different lines
part1=test1[1:10]
part2=test1[11:20]
part3=test1[21:30]
part4=test1[31:40]
...
## Remove the empty line (there were an empty line after each)
part1=part1[-5,]
part2=part2[-5,]
part3=part3[-5,]
...
## Rename the columns
names(part1)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
names(part2)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
names(part3)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
...
## Assemble data to have one dataset
data=rbind(part1,part2,part3,part4,part5,part6,part7,part8,part9,part10)
## Formate Date Time
times <- as.POSIXct(data$`Date Time`, format='{datetime:%Y-%m-%d %H:%M:%S')
data$`Date Time` <- times
## Keep only the Date
data$Date <- as.Date(times)
## Formate data - Remove text
data$Subject <- gsub("subject:", "", data$Subject)
data$Sscore <- gsub("sscore:", "", data$Sscore)
...
So My code is working to reinstate the data but it's maybe very difficult and more long I know there is better ways to do it, so if you could help me with that I would be very grateful.
There are many packages that read JSON, e.g. rjson, jsonlite, RJSONIO (they will turn in up a google search) - just pick one and give it a go.
e.g.
library(jsonlite)
json.text <- '{"datetime":"2015-07-08 09:10:00","subject":"MMM","sscore":"-0.2280","smean":"0.2593","svscore":"-0.2795","sdispersion":"0.375","svolume":"8","sbuzz":"0.6026","lastclose":"155.430000000","companyname":"3M Company"},{"datetime":"2015-07-07 09:10:00","subject":"MMM","sscore":"0.2977","smean":"0.2713","svscore":"-0.7436","sdispersion":"0.400","svolume":"5","sbuzz":"0.4895","lastclose":"155.080000000","companyname":"3M Company"},{"datetime":"2015-07-06 09:10:00","subject":"MMM","sscore":"-1.0057","smean":"0.2579","svscore":"-1.3796","sdispersion":"1.000","svolume":"1","sbuzz":"0.4531","lastclose":"155.380000000","companyname":"3M Company"}'
x <- fromJSON(paste0('[', json.text, ']'))
datetime subject sscore smean svscore sdispersion svolume sbuzz lastclose companyname
1 2015-07-08 09:10:00 MMM -0.2280 0.2593 -0.2795 0.375 8 0.6026 155.430000000 3M Company
2 2015-07-07 09:10:00 MMM 0.2977 0.2713 -0.7436 0.400 5 0.4895 155.080000000 3M Company
3 2015-07-06 09:10:00 MMM -1.0057 0.2579 -1.3796 1.000 1 0.4531 155.380000000 3M Company
I paste the '[' and ']' around your JSON because you have multiple JSON elements (the rows in the dataframe above) and for this to be well-formed JSON it needs to be an array, i.e. [ {...}, {...}, {...} ] rather than {...}, {...}, {...}.

Scraping data from tables on multiple web pages in R (football players)

I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.
http://www.sports-reference.com/cfb/players/ryan-aplin-1.html
I cannot find an aggregate of all players so I need to go page by page and pull out the bottom row of each passing scoring Rushing & receiving etc. html table
Each player is catagorized by their last name with links to each alphabet going here.
http://www.sports-reference.com/cfb/players/
For instance, each player with the last name A is found here.
http://www.sports-reference.com/cfb/players/a-index.html
This is my first time really getting into data scraping so I tried to find similar questions with answers. The closest answer I found was this question
I believe I could use something very similar where I switch page number with the collected player's name. However, I'm not sure how to change it to look for player name instead of page number.
Samuel L. Ventura also gave a talk about data scraping for NFL data recently that can be found here.
EDIT:
Ben was really helpful and provided some great code. The first part works really well, however when I attempt to run the second part I run into this.
> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it!
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) :
unexpected '<' occurred on line 1
use a command like
x <- edit()
to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
deparse may be incomplete
This could be an error due to not having sufficient memory (only 4gb on computer I am currently using). Although I do not understand the error
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
Looking through my other datasets my players really only go back to 2007. If there would be some way to pull just people from 2007 onwards that may help shrink the data. If I had a list of people whose names I wanted to pull could I just replace the lnk in
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
with only the players that I need?
Here's how you can easily get all the data in all the tables on all the player pages...
First make a list of the URLs for all the players' pages...
require(RCurl); require(XML)
n <- length(letters)
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
print(i) # keep track of what the function is up to
# get all html on each page of the a-z index pages
inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
# scrape URLs for each player from each index page
lnk <- unname(xpathSApply(inx_page, "//a/#href"))
# skip first 63 and last 10 links as they are constant on each page
lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
# only keep links that go to players (exclude schools)
lnk <- lnk[grep("players", lnk)]
# now we have a list of all the URLs to all the players on that index page
# but the URLs are incomplete, so let's complete them so we can use them from
# anywhere
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right?), so:
Second, scrape all the tables at each URL to get their data, like so:
# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = (length(links)))
for(i in 1:length(links)){
print(i)
# error handling - skips to next URL if it gets an error
result <- try(
all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
); if(class(result) == "try-error") next;
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
The result looks like this (this is just a snippet of the output):
all_tables
$`neli-aasa`
$`neli-aasa`$defense
Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0
$`neli-aasa`$kick_ret
Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
1 *2007 Utah MWC FR DL 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0
$`neli-aasa`$receiving
Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0
Finally, let's say we just want to look at the passing tables...
# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)
And we end up with a data frame that is ready for further analyses (also just a snippet)...
Year School Conf Class Pos Cmp Att Pct Yds Y/A AY/A TD Int Rate
james-aaron 1978 Air Force Ind QB 28 56 50.0 316 5.6 3.6 1 3 92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA JR QB 100 182 54.9 1135 6.2 6.0 5 3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA SR QB 77 148 52.0 828 5.6 4.3 4 6 99.8

Is it possible to write a table to a file in JSON format in R?

I'm making word frequency tables with R and the preferred output format would be a JSON file. sth like
{
"word" : "dog",
"frequency" : 12
}
Is there any way to save the table directly into this format? I've been using the write.csv() function and convert the output into JSON but this is very complicated and time consuming.
set.seed(1)
( tbl <- table(round(runif(100, 1, 5))) )
## 1 2 3 4 5
## 9 24 30 23 14
library(rjson)
sink("json.txt")
cat(toJSON(tbl))
sink()
file.show("json.txt")
## {"1":9,"2":24,"3":30,"4":23,"5":14}
or even better:
set.seed(1)
( tab <- table(letters[round(runif(100, 1, 26))]) )
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 4 3 2 5 4 3 5 3 9 4 7 2 2 2 5 5 5 6 5 3 7 3 2 1
sink("lets.txt")
cat(toJSON(tab))
sink()
file.show("lets.txt")
## {"a":1,"b":2,"c":4,"d":3,"e":2,"f":5,"g":4,"h":3,"i":5,"j":3,"k":9,"l":4,"m":7,"n":2,"o":2,"p":2,"q":5,"r":5,"s":5,"t":6,"u":5,"v":3,"w":7,"x":3,"y":2,"z":1}
Then validate it with http://www.jsonlint.com/ to get pretty formatting. If you have multidimensional table, you'll have to work it out a bit...
EDIT:
Oh, now I see, you want the dataset characteristics sink-ed to a JSON file. No problem, just give us a sample data, and I'll work on a code a bit. Practically, you need to carry out the data into desirable format, hence convert it to JSON. list should suffice. Give me a sec, I'll update my answer.
EDIT #2:
Well, time is relative... it's a common knowledge... Here you go:
( dtf <- structure(list(word = structure(1:3, .Label = c("cat", "dog",
"mouse"), class = "factor"), frequency = c(12, 32, 18)), .Names = c("word",
"frequency"), row.names = c(NA, -3L), class = "data.frame") )
## word frequency
## 1 cat 12
## 2 dog 32
## 3 mouse 18
If dtf is a simple data frame, yes, data.frame, if it's not, coerce it! Long story short, you can do:
toJSON(as.data.frame(t(dtf)))
## [1] "{\"V1\":{\"word\":\"cat\",\"frequency\":\"12\"},\"V2\":{\"word\":\"dog\",\"frequency\":\"32\"},\"V3\":{\"word\":\"mouse\",\"frequency\":\"18\"}}"
I though I'll need some melt with this one, but simple t did the trick. Now, you only need to deal with column names after transposing the data.frame. t coerces data.frames to matrix, so you need to convert it back to data.frame. I used as.data.frame, but you can also use toJSON(data.frame(t(dtf))) - you'll get X instead of V as a variable name. Alternatively, you can use regexp to clean the JSON file (if needed), but it's a lousy practice, try to work it out by preparing the data.frame.
I hope this helped a bit...
These days I would typically use the jsonlite package.
library("jsonlite")
toJSON(mydatatable, pretty = TRUE)
This turns the data table into a JSON array of key/value pair objects directly.
RJSONIO is a package "that allows conversion to and from data in Javascript object notation (JSON) format". You can use it to export your object as a JSON file.
library(RJSONIO)
writeLines(toJSON(anobject), "afile.JSON")