Scraping data from tables on multiple web pages in R (football players) - html

I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.
http://www.sports-reference.com/cfb/players/ryan-aplin-1.html
I cannot find an aggregate of all players so I need to go page by page and pull out the bottom row of each passing scoring Rushing & receiving etc. html table
Each player is catagorized by their last name with links to each alphabet going here.
http://www.sports-reference.com/cfb/players/
For instance, each player with the last name A is found here.
http://www.sports-reference.com/cfb/players/a-index.html
This is my first time really getting into data scraping so I tried to find similar questions with answers. The closest answer I found was this question
I believe I could use something very similar where I switch page number with the collected player's name. However, I'm not sure how to change it to look for player name instead of page number.
Samuel L. Ventura also gave a talk about data scraping for NFL data recently that can be found here.
EDIT:
Ben was really helpful and provided some great code. The first part works really well, however when I attempt to run the second part I run into this.
> # unlist into a single character vector
> links <- unlist(links)
> # Go to each URL in the list and scrape all the data from the tables
> # this will take some time... don't interrupt it!
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
> # Put player names in the list so we know who the data belong to
> # extract names from the URLs to their stats page...
> toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
> player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
Error: cannot allocate vector of size 512 Kb
> # assign player names to list of tables
> names(all_tables) <- player_names
Error: object 'player_names' not found
> fix(inx_page)
Error in edit(name, file, title, editor) :
unexpected '<' occurred on line 1
use a command like
x <- edit()
to recover
In addition: Warning message:
In edit.default(name, file, title, editor = defaultEditor) :
deparse may be incomplete
This could be an error due to not having sufficient memory (only 4gb on computer I am currently using). Although I do not understand the error
> all_tables <- lapply(links, readHTMLTable, stringsAsFactors = FALSE)
Error in UseMethod("xmlNamespaceDefinitions") :
no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
Looking through my other datasets my players really only go back to 2007. If there would be some way to pull just people from 2007 onwards that may help shrink the data. If I had a list of people whose names I wanted to pull could I just replace the lnk in
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
with only the players that I need?

Here's how you can easily get all the data in all the tables on all the player pages...
First make a list of the URLs for all the players' pages...
require(RCurl); require(XML)
n <- length(letters)
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
print(i) # keep track of what the function is up to
# get all html on each page of the a-z index pages
inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
# scrape URLs for each player from each index page
lnk <- unname(xpathSApply(inx_page, "//a/#href"))
# skip first 63 and last 10 links as they are constant on each page
lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
# only keep links that go to players (exclude schools)
lnk <- lnk[grep("players", lnk)]
# now we have a list of all the URLs to all the players on that index page
# but the URLs are incomplete, so let's complete them so we can use them from
# anywhere
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right?), so:
Second, scrape all the tables at each URL to get their data, like so:
# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = (length(links)))
for(i in 1:length(links)){
print(i)
# error handling - skips to next URL if it gets an error
result <- try(
all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
); if(class(result) == "try-error") next;
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
The result looks like this (this is just a snippet of the output):
all_tables
$`neli-aasa`
$`neli-aasa`$defense
Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0
$`neli-aasa`$kick_ret
Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
1 *2007 Utah MWC FR DL 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0
$`neli-aasa`$receiving
Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0
Finally, let's say we just want to look at the passing tables...
# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)
And we end up with a data frame that is ready for further analyses (also just a snippet)...
Year School Conf Class Pos Cmp Att Pct Yds Y/A AY/A TD Int Rate
james-aaron 1978 Air Force Ind QB 28 56 50.0 316 5.6 3.6 1 3 92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA JR QB 100 182 54.9 1135 6.2 6.0 5 3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA SR QB 77 148 52.0 828 5.6 4.3 4 6 99.8

Related

Troubleshooting a Loop in R

I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000)
{tryCatch({
{
url_i<- paste0(part1, i+1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
}
}, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
In this loop, I added a line of code that prints which iteration the loop is currently on, and the cumulative number of results that the loop has scraped as of the current iteration. I also added a line of code ("tryCatch") that in the worst case scenario, forces this loop to skip an iteration which produces an error - however, I was not anticipating that to happen very often.
However, I am noticing that this loop is producing errors and often skipping iterations, far more than I had expected. For example (left column is the iteration number, right column is the cumulative results).
My guess is that between certain times, the API might not have recorded any comments that were left between those times thus not adding any new results to the list. E.g.
iteration_number cumulative_results
1 13432 5673
2 13433 5673
3 13434 5673
But in my case - can someone please help me understand why this loop is producing so many errors that is resulting in so many skipped iterations?
Thank you!
Your issue is almost certainly caused by rate limits imposed on your requests by the Pushshift API. When doing scraping tasks like you are here, the server may track how many requests a client has made within a certain time interval (1) and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting and is a way for web sites to limit abuse, charge customers for usage, or both.
Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (Also: See their /meta endpoint).
I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:
}, error = function(e){})
to:
}, error = function(e){ message(e) })
After a while, I got output like:
HTTP error 429
The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is add a call to Sys.sleep(1) into your for loop, where 1 is the number of seconds to pause execution.
I modified your script as follows:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000)
{tryCatch({
{
Sys.sleep(1) # Changed. Change the value as needed.
url_i<- paste0(part1, i+1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
}
}, error = function(e){
message(e) # Changed. Prints to the console on error.
})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Note that you may have to try a number larger than 1 and you'll notice that your script takes longer to run.
As the error seems to come randomly, depending on API availability, you could retry, and set a maxattempt number:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
maxattempt <- 3
for (i in 1:1000)
{
attempt <- 1
r_i <- NULL
while( is.null(r_i) && attempt <= maxattempt ) {
if (attempt>1) {print(paste("retry: i =",i))}
attempt <- attempt + 1
url_i<- paste0(part1, i+1, part2, i, part3)
try({
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i))) })
}
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Result:
[1] 1 250
[1] 2 498
[1] 3 748
[1] 4 997
[1] 5 1247
[1] 6 1497
[1] 7 1747
[1] 8 1997
[1] 9 2247
[1] 10 2497
...
[1] 416 101527
[1] 417 101776
Error in open.connection(con, "rb") :
Timeout was reached: [api.pushshift.io] Operation timed out after 21678399 milliseconds with 0 out of 0 bytes received
[1] "retry: i = 418"
[1] 418 102026
[1] 419 102276
...
On above example, a retry occured for i = 418.
Note that for i in 1:500, results size is already 250Mb, meaning that you could expect 25Gb for i in 1:50000 : do you have enough RAM?

NetLogo - using BehaviorSpace get all turtles locations as the result of each repetition

I am using BehaviorSpace to run the model hundreds of times with different parameters. But I need to know the locations of all turtles as a result instead of only the number of turtles. How can I achieve it with BehaviorSpace?
Currently, I output the results in a csv file by this code:
to-report get-locations
report (list xcor ycor)
end
to generate-output
file-open "model_r_1.0_locations.csv"
file-print csv:to-row get-locations
file-close
end
but all results are popped into same csv file, so I can't tell the condition of each running.
Seth's suggestion of incorporating behaviorspace-run-number in the filename of your csv output is one alternative. It would allow you to associate that file with the summary data in your main BehaviorSpace output file.
Another option is to include list reporters as "measures" in your behavior space experiment definition. For example, in your case:
map [ t -> [ xcor ] of t ] sort turtles
map [ t -> [ ycor ] of t ] sort turtles
You can then parse the resulting list "manually" in your favourite data analysis language. I've used the following function for this before, in Julia:
parselist(strlist, T = Float64) = parse.(T, split(strlist[2:end-1]))
I'm sure you can easily write some equivalent code in Python or R or whatever language you're using.
In the example above, I've outputted separate lists for the xcor and the ycor of turtles. You could also output a single "list of lists", but the parsing would be trickier.
Edit: How to do this using the csv extension and R
Coincidentally, I had to do something similar today for a different project, and I realized that a combination of the csv extension and R can make this very easy.
The general idea is the following:
In NetLogo, use csv:to-string to encode list data into a string and then write that string directly in the BehaviorSpace output.
In R, use purrr::map and readr::read_csv, followed by tidyr::unnest, to unpack everything in a neat "one observation per row" dataframe.
In other words: we like CSV, so we put CSV in our CSV so we can parse while we parse.
Here is a full-fledged example. Let's say we have the following NetLogo model:
extensions [ csv ]
to setup
clear-all
create-turtles 2 [ move-to one-of patches ]
reset-ticks
end
to go
ask turtles [ forward 1 ]
tick
end
to-report positions
let coords [ (list who xcor ycor) ] of turtles
report csv:to-string fput ["who" "x" "y"] coords
end
We then define the following tiny BehaviorSpace experiment, with only two repetitions and a time limit of two, using our positions reporter as an output:
The R code to process this is pleasantly straightforward:
library(tidyverse)
df <- read_csv("experiment-table.csv", skip = 6) %>%
mutate(positions = map(positions, read_csv)) %>%
unnest()
Which results in the following dataframe, all neat and tidy:
> df
# A tibble: 12 x 5
`[run number]` `[step]` who x y
<int> <int> <int> <dbl> <dbl>
1 1 0 0 16 10
2 1 0 1 10 -2
3 1 1 1 9.03 -2.24
4 1 1 0 -16.0 10.1
5 1 2 1 8.06 -2.48
6 1 2 0 -15.0 10.3
7 2 0 1 -14 1
8 2 0 0 13 15
9 2 1 0 14.0 15.1
10 2 1 1 -13.7 0.0489
11 2 2 0 15.0 15.1
12 2 2 1 -13.4 -0.902
The same thing in Julia:
using CSV, DataFrames
df = CSV.read("experiment-table.csv", header = 7)
cols = filter(col -> col != :positions, names(df))
df = by(df -> CSV.read(IOBuffer(df[:positions][1])), df, cols)

rjson::fromJSON returns only the first item

I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column.
I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?
Here's an example that mimics my problem:
> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)
$3x
[1] 20
$6y
[1] 23
The desired output is something like:
# a data frame for both variables
3x 6y
1 20 23
2 60 50
or,
# a data frame for each variable
3x
1 20
2 60
6y
1 23
2 50
What you are looking for is actually a combination of lapply and some application of rbind or related.
I'll extend your data a little, just to have more than 2 elements.
eg <- c("{\"3x\": 20, \"6y\": 23}",
"{\"3x\": 60, \"6y\": 50}",
"{\"3x\": 99, \"6y\": 72}")
library(jsonlite)
Using base R, we can do
do.call(rbind.data.frame, lapply(eg, fromJSON))
# X3x X6y
# 1 20 23
# 2 60 50
# 3 99 72
You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.
Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)
If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps
eg <- c(eg, "{\"3x\": 99}")
then you'll notice that the base R solution no longer works by default.
do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) :
# numbers of columns of arguments do not match
There may be techniques to try to normalize the elements such that you can be assured of matches. However, if you're not averse to a tidyverse package:
library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
# `3x` `6y`
# <int> <int>
# 1 20 23
# 2 60 50
# 3 99 72
# 4 99 NA
though you cannot call it as directly with the dollar-method, you can still use [[ or backticks.
eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99

Data left out when reading json online to R

I try to read online json data to R through the below codes in R:
library('jsonlite')
address<-'https://data.cityofchicago.org/resource/qnmj-8ku6.json'
sample<-fromJSON(address)
The codes did run and have results in right format of a table. But only produced 1000 observations while the original city portal database has more than 200,000 observations. I am not sure what to be fixed to download the whole dataset. Please help.
You're using the wrong link to get the data. You can see the correct link by going to 'Export'
library(jsonlite)
address <- "https://data.cityofchicago.org/api/views/qnmj-8ku6/rows.json?accessType=DOWNLOAD"
sample <- fromJSON(address)
length(sample)
# [1]
length(sample[[2]])
# [1] 274228
Although, you may want to get it as a .csv to make it easier to work with straight away?
address <- "https://data.cityofchicago.org/api/views/qnmj-8ku6/rows.csv?accessType=DOWNLOAD"
sample_csv <- read.csv(address)
nrow(sample_csv)
# [1] 274228
str(sample_csv)
# 'data.frame': 274228 obs. of 22 variables:
# $ ID : int 10512552 10517063 10517120 10518590 10518648
# $ Case.Number : Factor w/ 274219 levels "HA107183","HA156050",..
# $ Date : Factor w/ 112977 levels "01/01/2014 01:00:00 AM",..
# $ Block : Factor w/ 27499 levels "0000X E 100TH PL",..
# $ IUCR : Factor w/ 331 levels "0110","0141",..
# $ Primary.Type : Factor w/ 33 levels "ARSON","ASSAULT",..
# $ Description : Factor w/ 310 levels "$500 AND UNDER",..
# ... etc

Is it possible to write a table to a file in JSON format in R?

I'm making word frequency tables with R and the preferred output format would be a JSON file. sth like
{
"word" : "dog",
"frequency" : 12
}
Is there any way to save the table directly into this format? I've been using the write.csv() function and convert the output into JSON but this is very complicated and time consuming.
set.seed(1)
( tbl <- table(round(runif(100, 1, 5))) )
## 1 2 3 4 5
## 9 24 30 23 14
library(rjson)
sink("json.txt")
cat(toJSON(tbl))
sink()
file.show("json.txt")
## {"1":9,"2":24,"3":30,"4":23,"5":14}
or even better:
set.seed(1)
( tab <- table(letters[round(runif(100, 1, 26))]) )
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 4 3 2 5 4 3 5 3 9 4 7 2 2 2 5 5 5 6 5 3 7 3 2 1
sink("lets.txt")
cat(toJSON(tab))
sink()
file.show("lets.txt")
## {"a":1,"b":2,"c":4,"d":3,"e":2,"f":5,"g":4,"h":3,"i":5,"j":3,"k":9,"l":4,"m":7,"n":2,"o":2,"p":2,"q":5,"r":5,"s":5,"t":6,"u":5,"v":3,"w":7,"x":3,"y":2,"z":1}
Then validate it with http://www.jsonlint.com/ to get pretty formatting. If you have multidimensional table, you'll have to work it out a bit...
EDIT:
Oh, now I see, you want the dataset characteristics sink-ed to a JSON file. No problem, just give us a sample data, and I'll work on a code a bit. Practically, you need to carry out the data into desirable format, hence convert it to JSON. list should suffice. Give me a sec, I'll update my answer.
EDIT #2:
Well, time is relative... it's a common knowledge... Here you go:
( dtf <- structure(list(word = structure(1:3, .Label = c("cat", "dog",
"mouse"), class = "factor"), frequency = c(12, 32, 18)), .Names = c("word",
"frequency"), row.names = c(NA, -3L), class = "data.frame") )
## word frequency
## 1 cat 12
## 2 dog 32
## 3 mouse 18
If dtf is a simple data frame, yes, data.frame, if it's not, coerce it! Long story short, you can do:
toJSON(as.data.frame(t(dtf)))
## [1] "{\"V1\":{\"word\":\"cat\",\"frequency\":\"12\"},\"V2\":{\"word\":\"dog\",\"frequency\":\"32\"},\"V3\":{\"word\":\"mouse\",\"frequency\":\"18\"}}"
I though I'll need some melt with this one, but simple t did the trick. Now, you only need to deal with column names after transposing the data.frame. t coerces data.frames to matrix, so you need to convert it back to data.frame. I used as.data.frame, but you can also use toJSON(data.frame(t(dtf))) - you'll get X instead of V as a variable name. Alternatively, you can use regexp to clean the JSON file (if needed), but it's a lousy practice, try to work it out by preparing the data.frame.
I hope this helped a bit...
These days I would typically use the jsonlite package.
library("jsonlite")
toJSON(mydatatable, pretty = TRUE)
This turns the data table into a JSON array of key/value pair objects directly.
RJSONIO is a package "that allows conversion to and from data in Javascript object notation (JSON) format". You can use it to export your object as a JSON file.
library(RJSONIO)
writeLines(toJSON(anobject), "afile.JSON")