Following "next" link with relative paths using rvest - html

I am using the rvest package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape that second page, move onto the third page, etc.
The following line gives an error:
html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to
##
## ./2/
## Error in parseURI(u) : cannot parse URI
##
## ./2/
Inspecting the HTML shows there is some extra cruft around the "./2/" that rvest apparently doesn't like:
html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## <a href="\n \n ./2/ ">Next</a>
.Last.value %>% html_attrs()
## href
## "\n \n ./2/ "
Question 1:
How can I get rvest::follow_link to treat this link correctly like my browser does? (I could manually grab the "Next" link and clean it up with regex, but prefer to take advantage of the automation provided with rvest.)
At the end of the follow_link code, it calls jump_to. So I tried the following:
html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## <session> http://www.radiolab.org/series/2/
## Status: 404
## Type: text/html; charset=utf-8
## Size: 10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found
Digging into the code, it looks like the issue is with XML::getRelativeURL, which uses dirname to strip off the last part of the original path ("/podcasts"):
XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"
XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"
Question 2:
How can I get rvest::jump_to and XML::getRelativeURL to correctly handle relative paths?

Since this problem still seems to occur on radiolab.org, your best solution is to create a custom function to handle this edge case. If you're only worried about this site, and this particular error, then you can write something like this:
library(rvest)
follow_next <- function(session, text = "Next", ...) {
  # find the link whose text contains "Next" (or whatever `text` is)
  link <- html_node(session, xpath = sprintf("//*[text()[contains(.,'%s')]]", text))
  # clean up the href: drop the surrounding whitespace and the leading "./"
  url <- html_attr(link, "href")
  url <- trimws(url)
  url <- gsub("^\\./", "", url)
  message("Navigating to ", url)
  jump_to(session, url, ...)
}
That would allow you to write code like this:
html_session("http://www.radiolab.org/series/podcasts") %>%
follow_next()
#> Navigating to 2/
#> <session> http://www.radiolab.org/series/podcasts/2/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 61261
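If you want to keep walking through all the pages (the goal stated in the question), a rough sketch is below. It assumes the last page simply has no .pagefooter-next link, so the loop stops when that selector finds nothing:
# walk the pages, keeping one session per page, until there is no "Next" link left
s <- html_session("http://www.radiolab.org/series/podcasts")
all_pages <- list(s)
while (!is.na(html_attr(html_node(s, ".pagefooter-next a"), "href"))) {
  s <- follow_next(s)
  all_pages[[length(all_pages) + 1]] <- s
}
length(all_pages)  # number of podcast pages visited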
This is not per se an error: the URL on Radiolab is malformed, and failing to parse a malformed URL is not a bug. If you want to be liberal in how you handle the issue, you need to work around it manually.
Note that you could also use RSelenium to launch an actual browser (e.g. Chrome) and have that perform the URL parsing for you.
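For completeness, here is a rough RSelenium sketch along those lines (it assumes a working local Selenium setup started via rsDriver; the lookup by partial link text is my choice, not part of the answer). The real browser resolves the sloppy "./2/" href by itself:
library(RSelenium)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("http://www.radiolab.org/series/podcasts")
# click the "Next" link and let the browser normalise the relative href
remDr$findElement(using = "partial link text", "Next")$clickElement()
remDr$getCurrentUrl()
remDr$close()
rD$server$stop()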

Scraping html header with R

My objective
I'm attempting to use R to scrape text from a web page: https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150. For the purposes of this question, my goal is to access the header text that contains the station name and number ("Bridgeport, CT - Station ID: 8467150"), which is visible on the rendered page and highlighted in the browser's inspect element pane.
My old approach was to access the full text of the site with readLines(). A recent update to the website has made the text more difficult to access, and the station name/number is no longer visible to readLines():
url <- "https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150"
stn <- "8467150"
webpage <- readLines(url, warn = FALSE)
### grep indicates that the station number is not present in the scraped text
grep(x = webpage, pattern = stn, value = TRUE)
Potential solutions
I am therefore looking for a new way to access my target text. I have tried using httr, but still cannot get all the html text to be included in what I scrape from the web page. The XML and rvest packages also seem promising, but I am not sure how to identify the relevant CSS selector or XPath expression.
### an attempt using httr
hDat <- httr::RETRY("GET", url, times = 10)
txt <- httr::content(hDat, "text")
### grep indicates that the station number is still not present
grep(x = txt, pattern = stn, value = TRUE)
### a partial attempt using XML
h <- xml2::read_html(url)
h2 <- XML::htmlTreeParse(h, useInternalNodes=TRUE, asText = TRUE)
### this may end up working, but I'm not sure how to identify the correct path
html.parse <- XML::xpathApply(h2, path = "div.span8", XML::xmlValue)
Regardless of the approach, I would welcome any suggestions that can help me access the header text containing the station name/number.
Unless you use Selenium, it will be very hard.
NOAA encourages you to use its free RESTful JSON APIs and goes to great lengths to discourage HTML scraping.
That said, the following code will get what you want from the NOAA JSON into a data frame.
library(tidyverse)
library(jsonlite)
j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)
j1$stations %>% as_tibble() %>% select(name, state, id)
Results
# A tibble: 1 x 3
  name       state id
  <chr>      <chr> <chr>
1 Bridgeport CT    8467150
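As a quick check (my addition, reusing the j1 object from above), you can confirm that the station ID from the question shows up in the API response, mirroring the original grep test:
# does the expected station ID appear in the metadata returned by the API?
stn <- "8467150"
stn %in% j1$stations$id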

Webscrape with R for counting images

How do I web scrape in R to get the counts of images and videos on this page? Sorry, I'm new to web scraping and would like some help.
Here is the link:
https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm
It should yield a video count of 1 and an image count of 9, but I'm only able to get this far.
library('dplyr')
library('rvest')
library('xml2')
library('selectr')
library("httr")
website <- read_html("https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm")
website %>% html_nodes("div.template.asset")
I think the main issue is that the main body of the site, which contains all the images, is loaded dynamically only when you open the page in a browser. As far as I know, rvest cannot handle this kind of website architecture. Using RSelenium is a bit more involved, but it works fine for your kind of problem.
library(RSelenium)
# open server session and open browser client ----
rD <- rsDriver(browser = c("firefox"), verbose = FALSE)
# adapt this to your browser version
remDr <- rD$client
# navitage to url in browser client ----
url <- "https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm"
remDr$navigate(url)
# access image elements using xpath ----
elements_image <- remDr$findElements(using = "xpath", "//*[@class='template asset']")
# using only "//img" would show all images on the site = 12
# for some time the xpath "//img[@class='fit lazyloaded']" worked as well; somehow it stopped doing so
length(elements_image)
# show number of images on site
# [1] 9
unlist(lapply(elements_image, function(x) {x$getElementAttribute("alt")}))
# print the attribute "alt" to check output
# [1] "Vertical Columns hold 18 heads of greens, fed by a closed circuit of nutrient-rich water. "
# [2] ""
# [3] ""
# [4] "our farms can be found in unused, unloved, and unlikely spaces"
# [5] "Overtown 1 location at Lotus House homeless shelter"
# [6] "led spectrum manipulated to encourage leafy-green growth"
# [7] "vertical columns are harvested and planted in the same day to maximize efficiency "
# [8] "Chef Aaron shows off some young Romaine"
# [9] "Farmer Thomas keeps the whole team moving"
# access video elements using xpath ----
elements_video <- remDr$findElements(using = "xpath", "//video")
# since there is only one video and since its class seems very specific for the video, "//video" seems the better solution here
length(elements_video)
# show number of videos on site
# [1] 1
unlist(lapply(elements_video, function(x) {x$getElementAttribute("class")}))
# print the attribute "class" to check output
# [1] "aspect-ratio--object z1 has_hls hide"
# close browser client and close server session ----
remDr$close()
rD$server$stop()

How do I use rvest to extract the digits of pi from the following website?

I am trying to extract the digits of pi from a website using the rvest package in R, but it keeps giving me an xml error.
library(rvest)
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
pitable <- pisite %>%
  html_node(xpath = "/html/body/table[2]/tbody/tr/td[1]/pre/text()[1]")
I keep getting the result:
{xml_missing}
NA
Note that I copied the value used for the XPath from the Chrome inspection tool, although it does look a bit different from the XPaths I have gotten before.
Not sure what to change. I'm guessing it is something really simple. Any ideas?
The XPath copied from Chrome contains a tbody element that the browser inserts when it builds the DOM but that is most likely absent from the raw HTML, so html_node() comes back empty. Selecting the pre element directly avoids the problem. Maybe this could help:
library(rvest)
library(dplyr)
# here the site
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
# here you catch what you need: the <pre> block that holds the digits
pi <- pisite %>% html_nodes("pre") %>% html_text()
# here you replace the \n with nothing, to keep the numbers only
pi <- gsub("\n", "", pi)
pi
[1] "3.1415926535897932384626433832795028841971 ...and so on..."

Why does readHTMLTable cannot successfully read premier league tables for May month?

The official Premier League website provides data with various statistics for the league's teams over seasons (e.g. this one). I used the function readHTMLTable from the XML R package to retrieve those tables. However, I noticed that the function cannot read the tables for May, while for other months it works well. Here is an example:
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
march2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
march.df <- readHTMLTable(march2014.url, which = 1)
march.df[complete.cases(march.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function cannot retrieve the data for May.
Please, can someone explain why this happens and how it can be fixed?
EDIT after @zyurnaidi's answer:
Below is the code that can do the job without manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file (url)
raw <- readLines (con)
close (con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## it seems that this part of the webpage source code mess the things up
raw <- gsub (pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable (doc = raw, which = 1)
df[complete.cases(df), ] ## correct table
OK. There are a few hints that helped me find the problem here:
1. The issue happens consistently for May, the last month of each season, which means there should be something unique about this particular case.
2. Direct parsing (htmlParse, from both the link and a downloaded file) produces a truncated document. The table and the HTML file are suddenly cut off right after the first team in the table.
The parsed data always differs from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the HTML file itself, I found that there are (unencoded?) character issues there. My guess is that this is caused by the cute little trophy icons shown after the team names.
Anyway, to solve this issue, you need to take out these offending characters. Instead of editing the downloaded HTML files, my suggestion is:
1. View the page source of the EPL URL for May's league table
2. Copy everything, paste it into a text editor, and save it as an HTML file
3. You can now use either htmlParse or readHTMLTable
There might be a better way to automate this, but I hope it helps.

Parsing HTML in R using XML and RCurl

I am trying to parse the content of a website but I receive an error message. I don't know how to deal with the error:
require(RCurl)
require(XML)
html <- getURL("http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt")
doc <- htmlParse(html, asText=TRUE)
This is the error message I get:
Error: XML content does not seem to be XML, nor to identify a file name
I am working on a Mac:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.8 rJava_0.9-4 R.utils_1.26.2 R.oo_1.13.9 R.methodsS3_1.4.4 gsubfn_0.6-5 proto_0.3-10 RCurl_1.95-4.1
[9] bitops_1.0-6 splus2R_1.2-0 stringr_0.6.2 foreign_0.8-54 XML_3.95-0.2
loaded via a namespace (and not attached):
[1] tcltk_3.0.1 tools_3.0.1
Any ideas on how to solve this issue?
You don't need RCurl to get the file; the built-in tools can read text from URLs (e.g. scan or read.table).
The reason you're getting this error is that the file isn't valid XML or HTML. Strip out all the lines before the <HTML> tag and you should be good to go.
sec <- scan(file = "http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt", what = "character", sep ="\n", allowEscapes = TRUE)
sec <- sec[56:length(sec)]
secHTML <- htmlParse(sec)
There are other, less ugly ways to get the file, but once you strip the 'text' preamble XML should be able to parse it.
Alternatively, I think there's a parameter to htmlParse that allows you to specify a number of lines to skip.
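Rather than hardcoding line 56, you could also locate the start of the HTML portion programmatically; a small sketch in the same spirit (my addition, not from the answer):
# find the first line that opens the <HTML> document and parse from there
raw <- readLines("http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt")
start <- grep("<HTML>", raw, ignore.case = TRUE)[1]
secHTML <- htmlParse(paste(raw[start:length(raw)], collapse = "\n"), asText = TRUE)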
> txt <- readLines(url("http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt"))
> head(txt)
[1] "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
[2] "Proc-Type: 2001,MIC-CLEAR"
[3] "Originator-Name: webmaster#www.sec.gov"
[4] "Originator-Key-Asymmetric:"
[5] " MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen"
[6] " TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB"
> length(txt)
[1] 5517
The filings stored on the www.sec.gov website are a mixture of different file types. Some are plain text, some are JPG, some are GIF, some are PDF, some are XML, some are XBRL, some are HTML, and so on. The example file you are using is the "RAW Dissemination" type, which is actually a combination of any, or all, of the other types.
The file name "0001193125-06-125763.txt" is a concatenation of the "Accession Number" and the txt extension. This RAW Dissemination file is made up of Header Data and a series of "<DOCUMENT> ... </DOCUMENT>" tag sets. What comes between the start and end DOCUMENT tags is the various "files" within the "filing".
Each of the different files within a filing should be treated separately. The PDF, JPG, and GIF file types are uuencoded and should be uudecoded. Others, like TXT, HTML, XML, and XBRL, should be treated as plain text and, if needed, parsed as the appropriate type.
The Header Data contains tagged information about the companies, people, filers, filer agents, etc. who have submitted the filing.
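To illustrate that structure, here is a small sketch (my addition; it assumes the filing has been read into a character vector such as the txt object above, and that the tags sit at the start of their own lines, as they do in EDGAR raw dissemination files):
# split the raw dissemination file into its <DOCUMENT> ... </DOCUMENT> blocks
starts <- grep("^<DOCUMENT>", txt)
ends   <- grep("^</DOCUMENT>", txt)
docs   <- Map(function(s, e) txt[s:e], starts, ends)
# the <TYPE> line inside each block identifies what kind of file it is
sapply(docs, function(d) sub("^<TYPE>", "", grep("^<TYPE>", d, value = TRUE)[1]))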