In the past, I have been able to use readHTMLTable in R to pull some football stats. When trying to do so again this year, the tables aren't showing up, even though they are visible on the webpage. Here is an example: http://www.pro-football-reference.com/boxscores/201609080den.htm
When I view the source for the page, the tables are all commented out (which I suspect is why readHTMLTable didn't find them).
Example: search for "team_stats" in source code...
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_stats">
<table class="stats_table" id="team_stats" data-cols-to-freeze=1><caption>Team Stats Table</caption>
Questions:
How can the table be commented out in the source yet display in the browser?
Is there a way to read the commented out tables using readHTMLTable (or some other method)?
You can, in fact, grab it if you use the XPath comment() selector. (As for how it still displays: the site's JavaScript appears to strip the comment wrappers after the page loads, so the browser renders the tables while a plain HTML fetch only ever sees the comments.)
library(rvest)
url <- 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
url %>% read_html() %>% # parse html
html_nodes('#all_team_stats') %>% # select node with comment
html_nodes(xpath = 'comment()') %>% # select comments within node
html_text() %>% # return contents as text
read_html() %>% # parse text as html
html_node('table') %>% # select table node
html_table() # parse table and return data.frame
## CAR DEN
## 1 First Downs 21 21
## 2 Rush-Yds-TDs 32-157-1 29-148-2
## 3 Cmp-Att-Yd-TD-INT 18-33-194-1-1 18-26-178-1-2
## 4 Sacked-Yards 3-18 2-19
## 5 Net Pass Yards 176 159
## 6 Total Yards 333 307
## 7 Fumbles-Lost 0-0 1-1
## 8 Turnovers 1 3
## 9 Penalties-Yards 8-85 4-22
## 10 Third Down Conv. 9-15 5-10
## 11 Fourth Down Conv. 0-0 1-1
## 12 Time of Possession 32:19 27:41
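The same idea generalizes if you want every table the page hides inside comments: pull all the comment nodes, keep the ones that wrap a <table>, and parse each one. The sketch below is one way to do that with purrr helpers; the grepl() filter and the fill = TRUE argument are assumptions about what the hidden tables look like, not part of the original answer.
library(rvest)
library(purrr)
url <- 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
page <- read_html(url)
hidden_tables <- page %>%
  html_nodes(xpath = '//comment()') %>%                # every comment node on the page
  html_text() %>%
  keep(~ grepl('<table', .x, fixed = TRUE)) %>%        # keep comments that contain a table
  map(~ .x %>% read_html() %>% html_node('table') %>% html_table(fill = TRUE))
length(hidden_tables)                                  # how many commented-out tables were found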
I am currently working on reading the HTML of this web page into R and processing it to extract useful data into a new dataframe.
Visual inspection of the page source shows that the lines containing data values all start with '<td>'. So here is my code so far:
thepage<-readLines('https://www.worldometers.info/world-population/population-by-country/')
dataline <- grep('<td>', thepage)
dataline
This returns:
11
Which tells me all the data is in line 11. So I did this:
data <- thepage[11]
datalines <- grep('<td>', data)
datalines
This returns:
1
Which isn't helpful at all, as "data" is still one massive line. How do I split this massive line into multiple lines? My preferred dataframe would look something like this:
TIA.
How about the following?
library(tidyverse)
library(rvest)
url <- 'https://www.worldometers.info/world-population/population-by-country/'
pg <- xml2::read_html(url) %>%   # parse the page
  rvest::html_table() %>%        # extract every table on the page as a data frame
  .[[1]]                         # keep the first table
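If you would rather stay close to your original readLines() approach, the single massive line can be split on the row tags before grepping. This is only a rough base-R sketch, reusing the thepage object from your readLines() call and assuming the table still sits on line 11 as in your grep() output; it is far more fragile than html_table():
# the whole table arrives as one long line, so break it apart at the <tr> row tags
rows <- unlist(strsplit(thepage[11], "<tr", fixed = TRUE))
datalines <- grep("<td>", rows)
length(datalines)   # roughly one element per country row
Each element of rows would then still need its <td> values pulled out (with gsub() or regmatches(), for example), which is exactly the cleanup html_table() does for you.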
How can I webscrape this page in R to get counts of the images and videos? Sorry, I'm new to webscraping and would like some help.
Here is the link:
https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm
It should yield a video count of 1 and an image count of 9, but I'm only able to get this far:
library('dplyr')
library('rvest')
library('xml2')
library('selectr')
library("httr")
website <- read_html("https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm")
website %>% html_nodes("div.template.asset")
I think the main issue is that the main body of the page, which contains all the images, is only loaded dynamically when the site is opened in a browser. As far as I know, rvest cannot handle this kind of site architecture. Using RSelenium is a bit more involved, but it works fine for your kind of problem.
library(RSelenium)
# open server session and open browser client ----
rD <- rsDriver(browser = c("firefox"), verbose = FALSE)
# adapt this to your browser version
remDr <- rD$client
# navigate to url in browser client ----
url <- "https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm"
remDr$navigate(url)
# access image elements using xpath ----
elements_image <- remDr$findElements(using = "xpath", "//*[@class='template asset']")
# using only "//img" would match every image on the site (12 in total)
# for a while the xpath "//img[@class='fit lazyloaded']" worked as well, but at some point it stopped matching
length(elements_image)
# show number of images on site
# [1] 9
unlist(lapply(elements_image, function(x) {x$getElementAttribute("alt")}))
# print the attribute "alt" to check output
# [1] "Vertical Columns hold 18 heads of greens, fed by a closed circuit of nutrient-rich water. "
# [2] ""
# [3] ""
# [4] "our farms can be found in unused, unloved, and unlikely spaces"
# [5] "Overtown 1 location at Lotus House homeless shelter"
# [6] "led spectrum manipulated to encourage leafy-green growth"
# [7] "vertical columns are harvested and planted in the same day to maximize efficiency "
# [8] "Chef Aaron shows off some young Romaine"
# [9] "Farmer Thomas keeps the whole team moving"
# access video elements using xpath ----
elements_video <- remDr$findElements(using = "xpath", "//video")
# since there is only one video and since its class seems very specific for the video, "//video" seems the better solution here
length(elements_video)
# show number of videos on site
# [1] 1
unlist(lapply(elements_video, function(x) {x$getElementAttribute("class")}))
# print the attribute "class" to check output
# [1] "aspect-ratio--object z1 has_hls hide"
# close browser client and close server session ----
remDr$close()
rD$server$stop()
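If you need these counts for more than one project page, the steps above can be wrapped into a small helper while the browser client is still open (i.e. before the close/stop calls). This is a sketch: the XPath selectors are the same ones used above, and the Sys.sleep() pause is just a crude way to give the dynamic content time to render.
count_assets <- function(remDr, url) {
  remDr$navigate(url)
  Sys.sleep(5)   # crude wait for the JavaScript-rendered content
  images <- remDr$findElements(using = "xpath", "//*[@class='template asset']")
  videos <- remDr$findElements(using = "xpath", "//video")
  data.frame(images = length(images), videos = length(videos))
}
count_assets(remDr, "https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm")
# should report 9 images and 1 video for this project, matching the counts above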
I'm trying to scrape the date and policy type for COVID related announcements from this url: https://covid19.healthdata.org/united-states-of-america/alabama
The first date I'm trying to pull is the "April 4th, 2020" date for Alabama's Stay at Home Order.
As far as I can tell (as I am new to this), it has the xpath:
"//[#id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"
I've been using the following lines to try to retrieve it -
data <- read_html(url) %>%
html_nodes("span.ant-statistic-content-value")
data <- read_html(url) %>%
html_nodes(xpath = "//*[#id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")
Neither are able to pull the information I'm looking for. Any help would be appreciated!
The data for this page is stored in a series of JSON files. If you open your browser's developer tools (Safari was used here, but any browser works) and filter the network requests to type XHR, you will see the list of JSON endpoints the page loads.
Right-click an entry's name to copy its URL.
This script should get you started:
library(jsonlite)
#obtain the list of locations
locations<-fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)
head(locations[, 1:9])
#get list of US locations
US <- locations$children[locations$location_name == "United States of America"]
head(US[[1]])
#Get data frame from interventions
#Create link with desired location_id (569 is Virginia)
#paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)
Interventions
# date_reported covid_intervention_id location_id covid_intervention_measure_id covid_intervention_measure_name
# 1 2020-03-30 00:00:00 110 569 1 People instructed to stay at home
# 2 2020-03-16 00:00:00 258 569 2 Educational facilities closed
# 3 2020-04-19 00:00:00 437 569 7 Assumed_implemented_date
#Repeat for other links of interest
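Since the question was about Alabama, the last step is to look up Alabama's location_id in the US children data frame and build the interventions URL from it. This is a sketch: it assumes US[[1]] carries location_id and location_name columns, as the intervention output above suggests.
us_states <- US[[1]]
alabama_id <- us_states$location_id[us_states$location_name == "Alabama"]
# build the interventions URL for Alabama and fetch it
alabama_url <- paste0("https://covid19.healthdata.org/api/data/intervention?location=", alabama_id)
alabama_interventions <- fromJSON(alabama_url, flatten = TRUE)
alabama_interventions
# date_reported should include the 2020-04-04 stay-at-home date shown on the page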
I am trying to extract the digits of pi from a website using the rvest package in R, but it keeps giving me an xml error.
library(rvest)
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
pitable <- pisite %>%
html_node(xpath = "/html/body/table[2]/tbody/tr/td[1]/pre/text()[1]")
I keep getting the result:
{xml_missing}
NA
Note that I copied the value used for the xpath from the chrome website inspection tool. Although it does look a bit different to the xpaths I have gotten before.
Not sure what to change. Guessing it is something really simple. Any ideas?
Maybe this could help. The XPath copied from Chrome most likely fails because the inspector shows a <tbody> element that the browser inserts itself but that isn't in the raw HTML rvest downloads, so the node comes back as {xml_missing}. Selecting the <pre> blocks directly avoids the problem:
library(rvest)
library(dplyr)
# here the site
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
# here you grab the <pre> blocks and extract their text
pi <- pisite %>% html_nodes("pre") %>% html_text()
# here you replace the \n with nothing, so only the digits remain
pi <- gsub("\n", "", pi)
pi
[1] "3.1415926535897932384626433832795028841971 ...and so on..."
I am using the rvest package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape that second page, move onto the third page, etc.
The following line gives an error:
html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to
##
## ./2/
## Error in parseURI(u) : cannot parse URI
##
## ./2/
Inspecting the HTML shows there is some extra cruft around the "./2/" that rvest apparently doesn't like:
html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## Next
.Last.value %>% html_attrs()
## href
## "\n \n ./2/ "
Question 1:
How can I get rvest::follow_link to treat this link correctly like my browser does? (I could manually grab the "Next" link and clean it up with regex, but prefer to take advantage of the automation provided with rvest.)
At the end of the follow_link code, it calls jump_to. So I tried the following:
html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## <session> http://www.radiolab.org/series/2/
## Status: 404
## Type: text/html; charset=utf-8
## Size: 10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found
Digging into the code, it looks like the issue is with XML::getRelativeURL, which uses dirname to strip off the last part of the original path ("/podcasts"):
XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"
XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"
Question 2:
How can I get rvest::jump_to and XML::getRelativeURL to correctly handle relative paths?
Since this problem still seems to occur with radiolab.org, your best solution is to create a custom function to handle this edge case. If you're only worried about this site, and this particular error, then you can write something like this:
library(rvest)
follow_next <- function(session, text = "Next", ...) {
  link <- html_node(session, xpath = sprintf("//*[text()[contains(.,'%s')]]", text))
  url <- html_attr(link, "href")
  url <- trimws(url)                # strip the stray whitespace around the href
  url <- gsub("^\\.{1}/", "", url)  # drop the leading "./"
  message("Navigating to ", url)
  jump_to(session, url, ...)
}
That would allow you to write code like this:
html_session("http://www.radiolab.org/series/podcasts") %>%
follow_next()
#> Navigating to 2/
#> <session> http://www.radiolab.org/series/podcasts/2/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 61261
This is not, strictly speaking, an error: the URL on the Radiolab page is malformed, and failing to parse a malformed URL is not a bug. If you want to be liberal in how you handle the issue, you need to work around it manually.
Note that you could also use RSelenium to launch an actual browser (e.g. Chrome) and have that perform the URL parsing for you.
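For Question 2 specifically, another lightweight option is to resolve the relative path yourself: trim the stray whitespace off the href and let xml2::url_absolute() do the relative-URL arithmetic before calling jump_to(). A sketch, reusing the .pagefooter-next selector from the question:
library(rvest)
library(xml2)
session <- html_session("http://www.radiolab.org/series/podcasts")
# grab the raw href and strip the surrounding whitespace
href <- session %>%
  html_node(".pagefooter-next a") %>%
  html_attr("href") %>%
  trimws()
# note the trailing slash on the base: without it, "./2/" resolves against
# /series/ instead of /series/podcasts/
next_url <- url_absolute(href, "http://www.radiolab.org/series/podcasts/")
next_page <- jump_to(session, next_url)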