Scraping dynamic table in R - html

I am stuck on a simple web scrape.
My goal is to scrape Morningstar.com to retrieve the education of the managers associated with a fund name.
First off, let me say that I am not familiar at all with this operation. However, I did my best to provide some code.
For example, consider the following webpage
http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US
The problem is that the page dynamically loads the section I am targeting, so it doesn't actually get pulled in by read_html()
So what I did was to directly access the data loaded into my section of interest.
Specifically, I did:
# edit: added packages required
library(xml2)
library(rvest)
library(stringi)
# original code
tmp_url <- "http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US"
pg <- read_html(tmp_url)
tmp <- length(html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]"))
html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]") %>%
  html_text() %>%
  stri_split_lines() %>%
  .[[1]] -> js_lines
idx <- which(stri_detect_fixed(js_lines, '\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t='))
start <- nchar("\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t=")+1
id <- substr(js_lines[idx],start, start+9)
tab <- read_html(paste0("http://financials.morningstar.com/oprn/c-managers.action?&t=",id,"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"), options = "HUGE")
The object tab contains the information I need.
What I need to do now is to create a data frame associating each manager's name with his or her education.
I could try to do this by converting my object to a string and then extracting the characters following the word "Education", but this looks extremely inefficient.
I was wondering if anyone can provide some guidance.

This thing really is a mess - nice work getting the links and downloading the info.
After poking around a lot and taking various detours, this is the best I could come up with:
Clean Up
First there is some cleanup to do. Instead of directly downloading and parsing the document in one step we will:
download the document as text
clean up the text a little to get the JSON
parse the JSON
extract the HTML item
do some further cleaning
finally parse the HTML
url <-
  paste0(
    "http://financials.morningstar.com/oprn/c-managers.action?&t=",
    id,
    "&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"
  )

# download the document as text
txt <- readLines(url, warn = FALSE)

# strip the JSONP wrapper to get plain JSON
json <-
  txt %>%
  gsub("^jsonp\\d+\\(", "", .) %>%
  gsub("\\)$", "", .)

# parse the JSON
json_parsed <- jsonlite::fromJSON(json)

# extract the HTML item and do some further cleaning
html_clean <-
  json_parsed$html %>%
  gsub("\t", "", .)

# finally parse the HTML
html_parsed <- read_html(html_clean)
First Round of Node Extraction
Next we use some black magic node extraction trickery. Basically the trick goes like this: If we have a node set (the thing you get when using html_nodes) we can use further XPath queries to drill down.
The first node set (cvs) captures the basic path to the CV entries in the table.
The second node set (info_tmp) drills down a little further, to those parts of the CV entries where further information ("Other Assets Managed", "Education", etc.) is stored.
cvs <-
  html_parsed %>%
  html_nodes(xpath = "/html/body/table/tbody/tr[not(@align='left')]")

info_tmp <-
  cvs %>%
  html_nodes(xpath = "td/table/tbody")
Building up Data.Frame 1
There is one little problem with the table. Each CV entry lives in its own table row. For name, from, to and description there is always exactly one item per CV entry, but for "Other Assets Managed", "Education", etc. this is not true.
Therefore, information extraction is done in two parts.
df <-
  cvs %>%
  lapply(
    FUN =
      function(x){
        tmp <-
          x %>%
          html_nodes(xpath = "th") %>%
          html_text() %>%
          gsub(" +", "", .)
        data.frame(
          name = stri_extract(tmp, regex = "[. \\w]+"),
          from = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}"),
          to   = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}")
        )
      }
  ) %>%
  do.call(rbind, .)

df$description <-
  info_tmp %>%
  html_nodes(xpath = "tr[1]/td[1]") %>%
  html_text()

df$cv_id <- seq_len(nrow(df))
Building Up Data.Frame 2
Now some more html_nodes() trickery ... If we use html_nodes() on the result set of html_nodes(), we get all matching nodes and none of the non-matching nodes. This is a problem, since we might get one, zero or multiple nodes per node-set node, basically destroying any information about where those newly selected nodes came from.
There is a solution, however: we can use lapply() to query each element of a node set independently of the others, thereby preserving information about the original structure.
extract_key_value_pairs <-
  function(i, info_tmp){
    cv_id <- seq_along(info_tmp)
    key <-
      lapply(
        info_tmp,
        function(x){
          tmp <-
            x %>%
            html_nodes(xpath = paste0("tr[", i, "]/td[1]")) %>%
            html_text()
          if ( length(tmp) == 0 ) {
            return("")
          } else {
            return(tmp)
          }
        }
      )
    value <-
      lapply(
        info_tmp,
        function(x){
          tmp <-
            x %>%
            html_nodes(xpath = paste0("tr[", i, "]/td[2]")) %>%
            html_text() %>%
            stri_trim_both() %>%
            stri_split(fixed = "\n") %>%
            lapply(X = ., stri_trim_both)
          if ( length(tmp) == 0 ) {
            return("")
          } else {
            return(unlist(tmp))
          }
        }
      )
    df <-
      mapply(
        cv_id = cv_id,
        key = key,
        value = value,
        FUN =
          function(cv_id, key, value){
            data.frame(
              cv_id = cv_id,
              key = key,
              value = value
            )
          },
        SIMPLIFY = FALSE
      ) %>%
      do.call(rbind, .)
    df[df$key != "", ]
  }

df2 <-
  lapply(
    X = c(3, 5, 7),
    FUN = extract_key_value_pairs,
    info_tmp = info_tmp
  ) %>%
  do.call(rbind, .)
Results
df
## name from to description cv_id
## 1 Kurt J. Lauber 03/20/2013 03/20/2013 Mr. Lauber ... 1
## 2 Noah J. Monsen 02/28/2018 02/28/2018 Mr. Monsen ... 2
## 3 Lauri Brunner 09/30/2018 09/30/2018 Ms. Brunne ... 3
## 4 Darren M. Bagwell 02/29/2016 02/29/2016 Darren M. ... 4
## 5 David C. Francis 10/07/2011 10/07/2011 Francis is ... 5
## 6 Michael A. Binger 04/14/2010 04/14/2010 Binger has ... 6
## 7 David E. Heupel 04/14/2010 04/14/2010 Mr. Heupel ... 7
## 8 Matthew D. Finn 03/30/2007 03/30/2007 Mr. Finn h ... 8
## 9 Scott Vergin 03/30/2007 03/30/2007 Vergin has ... 9
## 10 Frederick L. Plautz 11/01/1995 11/01/1995 Plautz has ... 10
## 11 Clyde E. Bartter 01/01/1994 01/01/1994 Bartter is ... 11
## 12 Wayne C. Stevens 01/01/1994 01/01/1994 Stevens is ... 12
## 13 Julian C. Ball 07/16/1987 07/16/1987 Ball is a ... 13
df2
## cv_id key value
## 1 Other Assets Managed
## 2 Other Assets Managed
## 3 Other Assets Managed
## 4 Certification CFA
## 4 Other Assets Managed
## 5 Certification CFA
## 5 Education M.B.A. University of Pittsburgh, 1978
## 5 Education B.A. University of Pittsburgh, 1977
## 5 Other Assets Managed
## 6 Certification CFA
## 6 Education M.B.A. University of Minnesota, 1991
## 6 Education B.S. University of Minnesota, 1987
## 6 Other Assets Managed
## 7 Other Assets Managed
## 8 Certification CFA
## 8 Education B.A. University of Pennsylvania, 1984
## 8 Education M.B.A. University of Michigan, 1990
## 8 Other Assets Managed
## 9 Certification CFA
## 9 Education M.B.A. University of Minnesota, 1980
## 9 Education B.A. St. Olaf College, 1976
## 9 Other Assets Managed
## 10 Education M.S. University of Wisconsin, 1981
## 10 Education B.B.A. University of Wisconsin, 1979
## 10 Other Assets Managed
## 11 Certification CFA
## 11 Education M.B.A. Western Reserve University, 1964
## 11 Education B.A. Baldwin-Wallace College, 1953
## 11 Other Assets Managed
## 12 Certification CFA
## 12 Education M.B.A. University of Wisconsin,
## 12 Education B.B.A. University of Miami,
## 12 Other Assets Managed
## 13 Certification CFA
## 13 Education B.A. Kent State University, 1974
## 13 Education J.D. Cleveland State University, 1984
## 13 Other Assets Managed
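To get back to the original question (one row per manager with his or her education), a small follow-up sketch that merges the two data frames built above; this assumes df and df2 exactly as produced by the code in this answer:
education <-
  merge(
    df[, c("cv_id", "name")],                          # manager names
    df2[df2$key == "Education", c("cv_id", "value")],  # education rows only
    by = "cv_id",
    all.x = TRUE                                       # keep managers with no education entry (NA)
  )
education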

I don't have a solution, as this is not an area I have worked with before. However, with brute force you can probably get the table, assuming you have a list of rules that can parse the text into a data frame.
Thought I'd share what I have, though:
# get the text
f <- xml_text(tab)
# split up, this bit is tricky..
split_f <- strsplit(f, split="\\\\t", perl=TRUE)[[1]]
split_f <- strsplit(split_f, split="\\\\n", perl=TRUE)
split_f <- unlist(split_f)
split_f <- trimws(split_f)
# find ones to remove
sort(table(split_f), decreasing = T)[1:5]
split_f <- split_f[split_f!="—"]
split_f <- split_f[split_f!=""]
# manually found where to split
keep <- split_f[2:108]
# text looks ok, but would need rules to extract the rows into a data.frame
View(keep)
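As a rough illustration of one such rule (a sketch only, assuming each "Education" label in keep is immediately followed by the corresponding degree text):
# hypothetical rule: pair every "Education" label with the element right after it
edu_idx <- which(keep == "Education")
education <- data.frame(
  key = keep[edu_idx],
  value = keep[edu_idx + 1],
  stringsAsFactors = FALSE
)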

Related

rvest error on form submission "`Form` doesn't contain a `action` attribute"

I am trying to send search requests with rvest, but I always get the same error. I have tried several approaches, including this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following.
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(url)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I always get the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the way forward is to edit the attributes of those buttons. Maybe someone has the answer to this. Thanks in advance.
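One hedged workaround along those lines (a sketch only, not verified against the current site) is to supply the missing action attribute on the parsed form yourself before submitting:
form.filled$action <- url   # assumption: the form posts back to the same search page
session1 <- session_submit(cocorahs, form.filled)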
An alternative method with httr2
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
title = data %>%
html_elements(".document-title") %>%
html_text2(),
report_title = data %>%
html_elements(".info") %>%
html_text2() %>%
str_remove_all("\r") %>%
str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹​report_title
Created on 2023-01-15 with reprex v2.0.2

R - Scrape a number of URLs and save individually

Disclaimer: I'm not a programmer by trade and my knowledge of R is limited to say the least. I've also already searched Stackoverflow for a solution (but to no avail).
Here's my situation: I need to scrape a series of webpages and save the data (not quite sure in what format, but I'll get to that). Fortunately the pages I need to scrape have a very logical naming structure (they use the date).
The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p
I need to scrape everything from August 1st 2018 (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) until yesterday (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).
So far I've figured out to create a list of dates which can be appended to the base URL using the following:
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
I can append these to the base URL with the following:
url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)
However, that's pretty much as far as I've gotten (not very far, I know!). I assume I need to use a for loop, but my own attempts at this have proved futile. Perhaps I'm not approaching this the right way?
In case it's not clear, what I'm trying to do is to visit each URL and save the html as an individual html file (ideally labelled with the relevant date). In truth, I don't need all of the html (just the list of programmes and times) but I can extract that information from the relevant files at a later date.
Any guidance on the best way to approach this would be much appreciated! And if you need any more info, just ask.
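For the saving-to-disk part specifically, here is a minimal sketch (assuming the dates and url vectors built above, and a local folder named schedules/ as a hypothetical destination):
dir.create("schedules", showWarnings = FALSE)
for (i in seq_along(url)) {
  # one file per day, e.g. schedules/2018-08-01.html
  destfile <- file.path("schedules", paste0(gsub("/", "-", dates[i]), ".html"))
  try(download.file(url[i], destfile, quiet = TRUE), silent = TRUE)
  Sys.sleep(1)  # be polite to the server
}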
Have a look at the rvest package and associated tutorials. E.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.
The messy part is extracting the fields the way you want them.
Here is one possible solution:
library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)
get_data <- function(url){
  html <- tryCatch(read_html(url), error = function(e) NULL)
  if(is.null(html)) return(data.table(
    date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
    title = NA, description = NA)) else {
    time <- html %>%
      rvest::html_nodes('body') %>%
      xml2::xml_find_all("//div[contains(@class, 'broadcast__info grid 1/4 1/6@bpb2 1/6@bpw')]") %>%
      rvest::html_text() %>% gsub(".*([0-9]{2}.[0-9]{2}).*", "\\1", .)
    text <- html %>%
      rvest::html_nodes('body') %>%
      xml2::xml_find_all("//div[contains(@class, 'programme__body')]") %>%
      rvest::html_text() %>%
      gsub("[ ]{2,}", " ", .) %>% gsub("[\n|\n ]{2,}", "\n", .) %>%
      gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>%
      gsub("^\n|\n$", "", .) %>%
      str_split_fixed(., "\n", 2) %>%
      as.data.table() %>% setnames(., c("title", "description")) %>%
      .[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
               time = time,
               description = gsub("\n", " ", description))] %>%
      setcolorder(., c("date", "time", "title", "description"))
    text
  }
}
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#> date time
#> 1: 2018/08/01 06:00
#> 2: 2018/08/01 09:15
#> 3: 2018/08/01 10:00
#> 4: 2018/08/01 11:00
#> 5: 2018/08/01 11:45
#> ---
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#> title
#> 1: Breakfast—01/08/2018
#> 2: Wanted Down Under—Series 11, Hanson Family
#> 3: Homes Under the Hammer—Series 21, Episode 6
#> 4: Fake Britain—Series 7, Episode 7
#> 5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#> ---
#> 16760: BBC London—Late News, 17/05/2020
#> 16761: Educating Rita
#> 16762: The Real Marigold Hotel—Series 4, Episode 2
#> 16763: Weather for the Week Ahead—18/05/2020
#> 16764: Joins BBC News—18/05/2020
#> description
#> 1: The latest news, sport, business and weather from the BBC's Breakfast team.
#> 2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#> 3: Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#> 4: 7/10 The fake sports memorabilia that cost collectors thousands. (R)
#> 5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#> ---
#> 16760: The latest news, sport and weather from London.
#> 16761: Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762: 2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763: Detailed weather forecast.
#> 16764: BBC One joins the BBC's rolling news channel for a night of news.
Created on 2020-05-18 by the reprex package (v0.3.0)

Trouble using rvest on nested tables

I'm having an issue trying to get Rankings from the Freeride World Tour website.
I first tried to get a CSS selector for rvest using SelectorGadget in Chrome, but I can only get the riders and their overall scores. What I'm interested in is getting the points a rider scored in each heat. I'm new to web scraping and CSS/HTML, so please hang in there with me.
# Get the website url
url <- read_html("https://www.freerideworldtour.com/rankings-detailed?season=165&competition=2&discipline=38")
Download everything from the page,
(all_text <- url %>%
html_nodes("div") %>%
html_text())
then look for Kristofer Turdell's first score of 2500 pts. with grep("2500 pts.", all_text), but I find... nothing?
When I right-click the 2500 pts. on the website and select "Inspect" I can see that the html code for this section is:
<div class="field__item even">2500 pts.</div>
So I tried to use the div class:
url %>%
  html_nodes(".field__item.even") %>%
  html_text()
This only returns the overall score for the participants (e.g. Kristofer Turdell 7870 pts.).
Next, I tried using the right-click option to save Xpath from "Inspect".
url %>%
  html_nodes(xpath = '//*[@id="page-content"]/div/div/div[2]/div/div/div/div[1]/div[2]/div/div/div[1]/div/div[4]/div/div/div') %>%
  html_text()
I'm not having any luck on this so I'd really appreciate your help.
url %>%
  html_node("div.panel-second") %>%
  html_text() %>%
  gsub("\\s*\\n+\\s*", ";", .) %>%
  gsub("pts.", "\n", .) %>%
  read.table(text = ., fill = T, sep = ";", row.names = NULL) %>%
  subset(select = 3:4) %>%
  na.omit()
V3 V4
1 Kristofer Turdell 7870
2 Markus Eder 7320
3 Mickael Bimboes 6930
4 Loic Collomb-Patton 6660
5 Yann Rausis 6290
6 Berkeley Patterson 5860
7 Leo Slemett 5835
8 Ivan Malakhov 5800
9 Craig Murray 5705
10 Logan Pehota 5655
11 Reine Barkered 5470
12 Grifen Moller 4765
13 Sam Lee 4580
14 Ryan Faye 3210
15 Conor Pelton 3185
16 George Rodney 3115
17 Taisuke Kusunoki 3060
18 Trace Cooke 2905
19 Aymar Navarro 2855
20 Felix Wiemers 2655
21 Fabio Studer 2305
22 Stefan Hausl 2240
23 Drew Tabke 1880
24 Carl Regnér Eriksson 1310
Writing that much code in the comments was awful, so here goes. You can store the scraped data in a data frame rather than being limited to printing it to the console:
library(tidyverse)
library(magrittr)
library(rvest)
url_base <- "https://www.freerideworldtour.com/rider/"
riders <- c("kristofer-turdell", "markus-eder", "mickael-bimboes")
output <- data_frame()

for (i in riders) {
  temp <- read_html(paste0(url_base, i)) %>%
    html_node("div") %>%
    html_text() %>%
    gsub("\\s*\\n+\\s*", ";", .) %>%
    gsub("pts.", "\n", .) %>%
    read.table(text = ., fill = T, sep = ";", row.names = NULL,
               col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
    subset(select = 2:5) %>%
    dplyr::filter(
      !is.na(as.numeric(as.character(Ranking))) &
        as.character(Points) != ""
    ) %>%
    dplyr::mutate(name = i)
  output <- bind_rows(output, temp)
}
I put in parts such as as.character(Points) != "" to exclude the sums of points (such as Mickael Bimboes' 6930 pts) and keep only the individual scores.
Again, much credit goes to @Onyambu; many lines are borrowed from his answer.

How to loop - JSONP / JSON data using R

I thought I had parsed the data correctly using jsonlite & tidyjson. However, I am noticing that only the data from the first page is being parsed. Please advise on how I could parse all the pages correctly. The total number of pages is over 1,300 if I look at the JSON output, so I think the data is available but just not parsed correctly.
Note: I have used tidyjson, but am open to using jsonlite or any other library too.
library(dplyr)
library(tidyjson)
library(jsonlite)
library(httr)

req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1?OPERATION-NAME=findItemsByKeywords&SERVICE-VERSION=1.0.0&SECURITY-APPNAME=xxxxxx&GLOBAL-ID=EBAY-US&RESPONSE-DATA-FORMAT=JSON&callback=_cb_findItemsByKeywords&REST-PAYLOAD&keywords=harry%20potter&paginationInput.entriesPerPage=100")
txt <- content(req, "text")
json <- sub("/**/_cb_findItemsByKeywords(", "", txt, fixed = TRUE)
json <- sub(")$", "", json)

data1 <- json %>% as.tbl_json %>%
  enter_object("findItemsByKeywordsResponse") %>% gather_array %>%
  enter_object("searchResult") %>% gather_array %>%
  enter_object("item") %>% gather_array %>%
  spread_values(
    ITEMID = jstring("itemId"),
    TITLE = jstring("title")
  ) %>%
  select(ITEMID, TITLE) # select only what is needed
############################################################
Note: the JSON output reports "paginationOutput":[{"pageNumber":["1"],"entriesPerPage":["100"],"totalPages":["1393"],"totalEntries":["139269"]}]
No need for tidyjson. You will need to write another function/set of calls to get the total number of pages (it's over 1,400) to use the following, but that should be fairly straightforward. Try to compartmentalize your operations a bit more and use the full power of httr when you can to parameterize things:
library(dplyr)
library(jsonlite)
library(httr)
library(purrr)
get_pg <- function(i) {
  cat(".") # shows progress
  req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1",
                   query = list(`OPERATION-NAME` = "findItemsByKeywords",
                                `SERVICE-VERSION` = "1.0.0",
                                `SECURITY-APPNAME` = "xxxxxxxxxxxxxxxxxxx",
                                `GLOBAL-ID` = "EBAY-US",
                                `RESPONSE-DATA-FORMAT` = "JSON",
                                `REST-PAYLOAD` = "",
                                `keywords` = "harry potter",
                                `paginationInput.pageNumber` = i,
                                `paginationInput.entriesPerPage` = 100))
  dat <- fromJSON(content(req, as = "text", encoding = "UTF-8"))
  map_df(dat$findItemsByKeywordsResponse$searchResult[[1]]$item, function(x) {
    data_frame(ITEMID = flatten_chr(x$itemId),
               TITLE = flatten_chr(x$title))
  })
}
# "10" will need to be the max page number. I wasn't about to
# make 1,400 requests to ebay. I'd probably break them up into
# sets of 30 or 50 and save off temporary data frames as rdata files
# just so you don't get stuck in a situation where R crashes and you
# have to get all the data again.
srch_dat <- map_df(1:10, get_pg)
srch_dat
## Source: local data frame [1,000 x 2]
##
## ITEMID TITLE
## (chr) (chr)
## 1 371533364795 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 2 331128976689 HOT New Harry Potter 14.5" Magical Wand Replica Cosplay In Box
## 3 131721213216 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 4 171430021529 New Harry Potter Hermione Granger Rotating Time Turner Necklace Gold Hourglass
## 5 261597812013 Harry Potter Time Turner+GOLD Deathly Hallows Charm Pendant necklace
## 6 111883750466 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 7 251947403227 HOT New Harry Potter 14.5" Magical Wand Replica Cosplay In Box
## 8 351113839731 Marauder's Map Hogwarts Wizarding World Harry Potter Warner Bros LIMITED **NEW**
## 9 171912724869 Harry Potter Time Turner Necklace Hermione Granger Rotating Spins Gold Hourglass
## 10 182024752232 Harry Potter : Complete 8-Film Collection (DVD, 2011, 8-Disc Set) Free Shipping
## .. ... ...
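As a hedged sketch of the missing piece (getting the total page count), one could read paginationOutput from the first page; the nested structure below follows the JSON fragment quoted in the question, so treat the exact path as an assumption:
first <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1",
                   query = list(`OPERATION-NAME` = "findItemsByKeywords",
                                `SERVICE-VERSION` = "1.0.0",
                                `SECURITY-APPNAME` = "xxxxxxxxxxxxxxxxxxx",
                                `GLOBAL-ID` = "EBAY-US",
                                `RESPONSE-DATA-FORMAT` = "JSON",
                                `keywords` = "harry potter",
                                `paginationInput.entriesPerPage` = 100))
# simplifyVector = FALSE keeps the nesting exactly as in the raw JSON
dat <- fromJSON(content(first, as = "text", encoding = "UTF-8"),
                simplifyVector = FALSE)
total_pages <- as.integer(
  dat$findItemsByKeywordsResponse[[1]]$paginationOutput[[1]]$totalPages[[1]]
)
# then e.g.: srch_dat <- map_df(seq_len(total_pages), get_pg)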

Scraping .asp site with R

I'm scraping http://www.progarchives.com/album.asp?id= and get a warning message:
Warning message:
XML content does not seem to be XML:
http://www.progarchives.com/album.asp?id=2
http://www.progarchives.com/album.asp?id=3 http://www.progarchives.com/album.asp?id=4
http://www.progarchives.com/album.asp?id=5
The scraper works for each page separately but not for the urls b1=2:b2=1000.
library(RCurl)
library(XML)
getUrls <- function(b1, b2){
  root <- "http://www.progarchives.com/album.asp?id="
  urls <- NULL
  for (bandid in b1:b2){
    urls <- c(urls, paste(root, bandid, sep = ""))
  }
  return(urls)
}

prog.arch.scraper <- function(url){
  SOURCE <- getUrls(b1 = 2, b2 = 1000)
  PARSED <- htmlParse(SOURCE)
  album <- xpathSApply(PARSED, "//h1[1]", xmlValue)
  date <- xpathSApply(PARSED, "//strong[1]", xmlValue)
  band <- xpathSApply(PARSED, "//h2[1]", xmlValue)
  return(c(band, album, date))
}

prog.arch.scraper(urls)
Here's an alternate approach with rvest and dplyr:
library(rvest)
library(dplyr)
library(pbapply)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
  pg <- html(sprintf(base_url, id))
  data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
             date = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
             band = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
             stringsAsFactors = FALSE)
}
albums <- bind_rows(pblapply(2:10, get_album_info))
head(albums)
## Source: local data frame [6 x 3]
##
## album date band
## 1 FOXTROT Studio Album, released in 1972 Genesis
## 2 NURSERY CRYME Studio Album, released in 1971 Genesis
## 3 GENESIS LIVE Live, released in 1973 Genesis
## 4 A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
## 6 GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
I didn't feel like barraging the site with a ton of requests, so bump up the sequence for your own use. pblapply() gives you a free progress bar.
To be kind to the site (especially since it doesn't explicitly prohibit scraping), you might want to throw a Sys.sleep(10) at the end of the get_album_info() function.
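For example, a small wrapper (get_album_info_polite is just a hypothetical name) adds the pause without touching the function above:
get_album_info_polite <- function(id) {
  res <- get_album_info(id)
  Sys.sleep(10)   # pause between requests to be kind to the site
  res
}
albums <- bind_rows(pblapply(2:10, get_album_info_polite))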
UPDATE
To handle server errors (in this case 500, but it'll work for others, too), you can use try:
library(rvest)
library(dplyr)
library(pbapply)
library(data.table)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
  pg <- try(html(sprintf(base_url, id)), silent = TRUE)
  if (inherits(pg, "try-error")) {
    data.frame(album = character(0), date = character(0), band = character(0))
  } else {
    data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
               date = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
               band = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
               stringsAsFactors = FALSE)
  }
}
albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))
## album date band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
## 2: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
## 3: AD INFINITUM Studio Album, released in 1998 Ad Infinitum
You won't get any entries for the errant pages (in this case it just returns id 9, 10 and 30's entries).
Instead of xpathApply(), you could subset the first node in the node sets of each path and call xmlValue() on that. Here's what I came up with:
library(XML)
library(RCurl)
## define the urls and xpath queries
urls <- sprintf("http://www.progarchives.com/album.asp?id=%s", 2:10)
path <- c(album = "//h1", date = "//strong", band = "//h2")
## define a re-usable curl handle for the c-level nodes
curl <- getCurlHandle()
## allocate the result list
out <- vector("list", length(urls))
## do the work
for(u in urls) {
  content <- getURL(u, curl = curl)
  doc <- htmlParse(content, useInternalNodes = TRUE)
  out[[u]] <- lapply(path, function(x) xmlValue(doc[x][[1]]))
  free(doc)
}
## structure the result
data.table::rbindlist(out)
# album date band
# 1: FOXTROT Studio Album, released in 1972 Genesis
# 2: NURSERY CRYME Studio Album, released in 1971 Genesis
# 3: GENESIS LIVE Live, released in 1973 Genesis
# 4: A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
# 5: FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
# 6: GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
# 7: GULLIBLES TRAVELS Studio Album, released in 1985 Abel Ganz
# 8: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 9: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
Update: To handle id queries that do not exist, we can write a condition with RCurl::url.exists() that handles the bad ones. So the following function, getAlbums(), returns a character vector of either the fetched XML values or NA, depending on the status of the URL. You can change that if you want, of course. That was just a method that came to mind in the wee hours.
getAlbums <- function(url, id = numeric(), xPath = list()) {
  urls <- sprintf("%s?id=%d", url, id)
  curl <- getCurlHandle()
  out <- vector("list", length(urls))
  for(u in urls) {
    out[[u]] <- if(url.exists(u)) {
      content <- getURL(u, curl = curl)
      doc <- htmlParse(content, useInternalNodes = TRUE)
      lapply(xPath, function(x) xmlValue(doc[x][[1]]))
    } else {
      warning(sprintf("returning 'NA' for urls[%d] ", id[urls == u]))
      structure(as.list(xPath[NA]), names = names(xPath))
    }
    if(exists("doc")) free(doc)
  }
  data.table::rbindlist(out)
}
url <- "http://www.progarchives.com/album.asp"
id <- c(9:10, 23, 28, 29, 30)
path <- c(album = "//h1", date = "//strong", band = "//h2")
getAlbums(url, id, path)
# album date band
# 1: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 2: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
# 3: NA NA NA
# 4: NA NA NA
# 5: NA NA NA
# 6: AD INFINITUM Studio Album, released in 1998 Ad Infinitum
#
# Warning messages:
# 1: In albums(url, id, path) : returning 'NA' for urls[23]
# 2: In albums(url, id, path) : returning 'NA' for urls[28]
# 3: In albums(url, id, path) : returning 'NA' for urls[29]