Web Scraping with rvest and xml2

I'm trying to scrape the date and policy type for COVID-related announcements from this URL: https://covid19.healthdata.org/united-states-of-america/alabama
The first date I'm trying to pull is the "April 4th, 2020" date for Alabama's Stay at Home Order.
As far as I can tell (as I am new to this), it has the xpath:
"//[#id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"
I've been using the following lines to try to retrieve it:
library(rvest)
data <- read_html(url) %>%
  html_nodes("span.ant-statistic-content-value")
data <- read_html(url) %>%
  html_nodes(xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")
Neither are able to pull the information I'm looking for. Any help would be appreciated!

The data for this page is stored in a series of JSON files. If you use your browser's developer tools and look in the Network tab for requests of type XHR, you should see a list of these files. Right-click a request name to copy its URL.
This script should get you started:
library(jsonlite)
# obtain the list of locations
locations <- fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)
head(locations[, 1:9])
# get the list of US locations
US <- locations$children[locations$location_name == "United States of America"]
head(US[[1]])
# Get the interventions data frame
# Build the link with the desired location_id (569 is Virginia)
# paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)
Interventions
#         date_reported covid_intervention_id location_id covid_intervention_measure_id   covid_intervention_measure_name
# 1 2020-03-30 00:00:00                   110         569                             1 People instructed to stay at home
# 2 2020-03-16 00:00:00                   258         569                             2     Educational facilities closed
# 3 2020-04-19 00:00:00                   437         569                             7          Assumed_implemented_date
#Repeat for other links of interest
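To pull the same data for several locations at once, here is a minimal sketch along the same lines (the ids below are placeholders; look up the real location_id values in the US locations table obtained above):
library(jsonlite)
ids <- c(569, 523)  # hypothetical location_id values; replace with the ids you need
interventions_all <- do.call(rbind, lapply(ids, function(id) {
  fromJSON(paste0("https://covid19.healthdata.org/api/data/intervention?location=", id),
           flatten = TRUE)
}))
interventions_all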


Scraping html header with R

My objective
I'm attempting to use R to scrape text from a web page: https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150. For the purposes of this question, my goal is to access the header text that contains the station name and number ("Bridgeport, CT - Station ID: 8467150"), which is visible on the page and in the inspect-element pane.
My old approach was to access the full text of the site with readLines(). A recent update to the website has made the text more difficult to access, and the station name/number is no longer visible to readLines():
url <- "https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150"
stn <- "8467150"
webpage <- readLines(url, warn = FALSE)
### grep indicates that the station number is not present in the scraped text
grep(x = webpage, pattern = stn, value = TRUE)
Potential solutions
I am therefore looking for a new way to access my target text. I have tried using httr, but still cannot get all the html text to be included in what I scrape from the web page. The XML and rvest packages also seem promising, but I am not sure how to identify the relevant CSS selector or XPath expression.
### an attempt using httr
hDat <- httr::RETRY("GET", url, times = 10)
txt <- httr::content(hDat, "text")
### grep indicates that the station number is still not present
grep(x = txt, pattern = stn, value = TRUE)
### a partial attempt using XML (the xml2 document is converted to text before being handed to XML)
h <- xml2::read_html(url)
h2 <- XML::htmlTreeParse(as.character(h), useInternalNodes = TRUE, asText = TRUE)
### this may end up working, but I'm not sure how to identify the correct path
### (note: xpathApply() expects an XPath expression, not a CSS selector like "div.span8")
html.parse <- XML::xpathApply(h2, path = "//div[contains(@class, 'span8')]", XML::xmlValue)
Regardless of the approach, I would welcome any suggestions that can help me access the header text containing the station name/number.
Unless you use Selenium, this will be very hard. NOAA encourages you to use its free RESTful JSON APIs, and it goes to great lengths to discourage HTML scraping.
That said, the following code will get what you want from a NOAA JSON endpoint into a data frame:
library(tidyverse)
library(jsonlite)
j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)
j1$stations %>% as_tibble() %>% select(name, state, id)
Results
# A tibble: 1 x 3
  name       state id
  <chr>      <chr> <chr>
1 Bridgeport CT    8467150
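If the goal is just to confirm the header text from the question, you can rebuild it from those fields; here is a small follow-on sketch using the columns returned above (the paste0() format string is an assumption about how the page composes the header):
stn_info <- j1$stations %>% as_tibble() %>% select(name, state, id)
header <- paste0(stn_info$name, ", ", stn_info$state, " - Station ID: ", stn_info$id)
header
# [1] "Bridgeport, CT - Station ID: 8467150"
grepl(stn, header)  # stn ("8467150") as defined in the question
# [1] TRUE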

Webscrape with R for counting images

How can I web scrape in R to get counts of the images and videos on this page? Sorry, I'm new to web scraping and would like some help.
Here is the link:
https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm
It should yield a video count of 1 and an image count of 9, but I'm only able to get this far:
library(dplyr)
library(rvest)
library(xml2)
library(selectr)
library(httr)
website <- read_html("https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm")
website %>% html_nodes("div.template.asset")
I think the main issue is that the main body of the site, which contains all the images, is loaded by JavaScript only when you open the page in a browser. As far as I know, rvest cannot handle this kind of site architecture. Using RSelenium is a bit more involved, but it works fine for your kind of problem.
library(RSelenium)
# open server session and open browser client ----
rD <- rsDriver(browser = c("firefox"), verbose = FALSE)
# adapt this to your browser version
remDr <- rD$client
# navigate to url in browser client ----
url <- "https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm"
remDr$navigate(url)
# access image elements using xpath ----
elements_image <- remDr$findElements(using = "xpath", "//*[@class='template asset']")
# using only "//img" would show all images on the site = 12
# for some time the xpath "//img[@class='fit lazyloaded']" worked as well; at some point it stopped doing so
length(elements_image)
# show number of images on site
# [1] 9
unlist(lapply(elements_image, function(x) {x$getElementAttribute("alt")}))
# print the attribute "alt" to check output
# [1] "Vertical Columns hold 18 heads of greens, fed by a closed circuit of nutrient-rich water. "
# [2] ""
# [3] ""
# [4] "our farms can be found in unused, unloved, and unlikely spaces"
# [5] "Overtown 1 location at Lotus House homeless shelter"
# [6] "led spectrum manipulated to encourage leafy-green growth"
# [7] "vertical columns are harvested and planted in the same day to maximize efficiency "
# [8] "Chef Aaron shows off some young Romaine"
# [9] "Farmer Thomas keeps the whole team moving"
# access video elements using xpath ----
elements_video <- remDr$findElements(using = "xpath", "//video")
# since there is only one video and since its class seems very specific for the video, "//video" seems the better solution here
length(elements_video)
# show number of videos on site
# [1] 1
unlist(lapply(elements_video, function(x) {x$getElementAttribute("class")}))
# print the attribute "class" to check output
# [1] "aspect-ratio--object z1 has_hls hide"
# close browser client and close server session ----
remDr$close()
rD$server$stop()

How to read a commented out HTML table using readHTMLTable in R

In the past, I have been able to use readHTMLTable in R to pull some football stats. When trying to do so again this year, the tables aren't showing up, even though they are visible on the webpage. Here is an example: http://www.pro-football-reference.com/boxscores/201609080den.htm
When I view the source for the page, the tables are all commented out (which I suspect is why readHTMLTable didn't find them).
Example: search for "team_stats" in source code...
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_stats">
<table class="stats_table" id="team_stats" data-cols-to-freeze=1><caption>Team Stats Table</caption>
Questions:
How can the table be commented out in the source yet display in the browser?
Is there a way to read the commented out tables using readHTMLTable (or some other method)?
As to why: the site ships these tables inside HTML comments and un-comments them with JavaScript after the page loads, so the browser displays them even though a static parse never sees them as real nodes. You can, in fact, grab such a table if you use the XPath comment() selector:
library(rvest)
url <- 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
url %>% read_html() %>% # parse html
html_nodes('#all_team_stats') %>% # select node with comment
html_nodes(xpath = 'comment()') %>% # select comments within node
html_text() %>% # return contents as text
read_html() %>% # parse text as html
html_node('table') %>% # select table node
html_table() # parse table and return data.frame
##                          CAR           DEN
## 1         First Downs      21            21
## 2        Rush-Yds-TDs  32-157-1      29-148-2
## 3   Cmp-Att-Yd-TD-INT 18-33-194-1-1 18-26-178-1-2
## 4        Sacked-Yards      3-18          2-19
## 5      Net Pass Yards       176           159
## 6         Total Yards       333           307
## 7        Fumbles-Lost       0-0           1-1
## 8           Turnovers         1             3
## 9     Penalties-Yards      8-85          4-22
## 10   Third Down Conv.      9-15          5-10
## 11  Fourth Down Conv.       0-0           1-1
## 12 Time of Possession     32:19         27:41
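The same trick generalizes: to pull every commented-out table on the page in one pass, a sketch along the same lines (untested against the live page) is:
library(rvest)
url <- 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
all_tables <- url %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%  # every comment node on the page
  html_text() %>%
  paste(collapse = '') %>%               # stitch the comment contents into one HTML blob
  read_html() %>%
  html_nodes('table') %>%                # the tables that were hidden in comments
  html_table(fill = TRUE)                # a list of data.frames
length(all_tables)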

Why does readHTMLTable fail to read Premier League tables for May?

The official Premier League website provides data with various statistics for the league's teams over seasons (e.g. this one). I used the readHTMLTable function from the XML R package to retrieve those tables. However, I noticed that the function cannot read the tables for May, while it works well for other months. Here is an example:
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
april2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
april2014.df <- readHTMLTable(april2014.url, which = 1)
april2014.df[complete.cases(april2014.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function cannot retrieve the data for May.
Please, can someone explain why this happens and how it can be fixed?
EDIT after @zyurnaidi's answer:
Below is the code that can do the job without manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file(url)
raw <- readLines(con)
close(con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## this part of the page source seems to mess things up
raw <- gsub(pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable(doc = raw, which = 1)
df[complete.cases(df), ] ## correct table
OK, there were a few hints that led me to the problem here:
1. The issue happens consistently in May, the last month of each season, which means there should be something unique about this particular case.
2. Direct parsing (htmlParse, from both the link and a downloaded file) produces a truncated document. The table and the HTML file are simply cut off right after the first team in the table.
The parsed data always diverges from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the HTML file itself, I found that there are (unencoded?) character issues there. My guess is that this is caused by the cute little trophy icons shown after the team names.
Anyway, to solve this issue, you need to take out these problem characters. Instead of editing the downloaded HTML files, my suggestion is:
1. View the page source of the EPL URL for May's league table.
2. Copy everything into a text editor and save it as an HTML file.
3. You can now use either htmlParse or readHTMLTable on that file.
There might be a better way to automate this, but I hope it helps.
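For reference, the fix from the EDIT above can be wrapped into a small helper (a sketch; the pattern string is the one observed on the May pages and may change if the site changes):
library(XML)
read_epl_table <- function(url) {
  raw <- readLines(url, warn = FALSE)
  ## strip the malformed <span> markup that truncates the parse
  pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague='
  raw <- gsub(pattern = pattern, replacement = '""', x = raw)
  df <- readHTMLTable(doc = raw, which = 1)
  df[complete.cases(df), ]
}
may.df <- read_epl_table(may2007.url)  # using the May 2007 URL from the question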

Convert JSON URL to R Data Frame

I'm having trouble converting a JSON file (from an API) to a data frame in R. An example is the URL http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json
I've tried a few different suggestions from Stack Overflow, including convert json data to data frame in R, and various blog posts such as http://zevross.com/blog/2015/02/12/using-r-to-download-and-parse-json-an-example-using-data-from-an-open-data-portal/
The closest I've come is using the code below, which gives me a large matrix with 4 "rows" and a bunch of "variables" (V1, V2, etc.). I'm assuming that this JSON file is in a different format than "normal" ones.
library(RCurl) # for getURL()
library(RJSONIO)
raw_data <- getURL("http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
data <- fromJSON(raw_data)
final_data <- do.call(rbind, data)
I'm pretty agnostic as to how to get it to work so any R packages/process are welcome. Thanks in advance.
The jsonlite package automatically picks up the data frame:
library(jsonlite)
mydata <- fromJSON("http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
names(mydata$players)
# [1] "id" "esbid" "gsisPlayerId" "name"
# [5] "position" "teamAbbr" "stats" "seasonPts"
# [9] "seasonProjectedPts" "weekPts" "weekProjectedPts"
head(mydata$players)
#        id     esbid gsisPlayerId                name position teamAbbr stats.1
# 1  100029     FALSE        FALSE San Francisco 49ers      DEF       SF      16
# 2     729 ABD660476   00-0025940     Husain Abdullah       DB       KC      15
# 3 2504171 ABR073003   00-0019546        John Abraham       LB               15
# 4 2507266 ADA509576   00-0025668       Michael Adams       DB               13
# 5 2505708 ADA515576   00-0022247          Mike Adams       DB      IND      15
# 6 1037889 ADA534252   00-0027610       Phillip Adams       DB      ATL      11
You can control this using the simplifyVector and simplifyDataFrame arguments of jsonlite::fromJSON().
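For example (a quick sketch), turning data-frame simplification off returns the same payload as nested lists instead:
raw_list <- jsonlite::fromJSON(
  "http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json",
  simplifyDataFrame = FALSE
)
class(raw_list$players)
# [1] "list"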
There's nothing "abnormal" about this JSON; it's just not a rectangular structure that fits trivially into a data frame. JSON can represent much richer data structures.
For example (using the rjson package, you've not said what you've used):
> data = rjson::fromJSON(file="http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
> length(data[[4]][[10]]$stats)
[1] 14
> length(data[[4]][[1]]$stats)
[1] 21
(data[[1]] to data[[3]] look like headers.)
The "stats" of the 10th element of data[[4]] has 14 elements; the "stats" of the first has 21. How is that going to fit into a rectangular data frame? R has stored it in a list because that's R's best way of storing irregular data structures.
Unless you can define a way of mapping the irregular data into a rectangular data frame, you can't store it in a data frame. Do you understand the structure of the data? That's essential.
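As an illustration of defining such a mapping, here is a minimal sketch continuing from the rjson object above (the id and name fields come from the jsonlite output earlier; NA fills anything a record lacks):
players <- data[[4]]  # the per-player records, as noted above
df <- data.frame(
  id   = sapply(players, function(p) if (is.null(p$id)) NA else p$id),
  name = sapply(players, function(p) if (is.null(p$name)) NA else p$name),
  stringsAsFactors = FALSE
)
head(df)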
rjson and jsonlite export similarly named functions, like fromJSON, so depending on the order you load them, one will mask the other. For my purposes, rjson structures the data much better than jsonlite, so I make sure to load the packages in the right order, or only load rjson.
# jsonlite is loaded
library(jsonlite)
# definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
# import the Quandl data
quandl_data <- fromJSON(quandl_url)
# quandl_data is a list
quandl_data
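If you do need both packages loaded at once, a safer pattern (a small sketch) is to qualify each call with its namespace, so load order stops mattering:
parsed_lists <- rjson::fromJSON(file = quandl_url)  # nested lists
parsed_df    <- jsonlite::fromJSON(quandl_url)      # simplified to vectors/data frames where possible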