Scraping HTML headers in R using the XML package

I'm trying to extract the header 1 (h1) from HTML code like this:
<div class="cuerpo-not"><div mod="2323">
<h1>Jamón 5 Jotas, champagne Bollinger y King Alexander III</h1>
I'm using the function xpathSApply() but it returns nothing:
xpathSApply(webpage, "//div[contains(@class, 'cuerpo-not')]/h1", xmlValue)
# list()
But when I use the same function without specifying the h1, it returns all the information below that class in this format:
xpathSApply(webpage, "//div[contains(@class, 'cuerpo-not')]", xmlValue)
# ;\n\t\t}\n\t}\n\t\n\t\n\tenviarNoticiaLeida_Site( 6916437,16 ) ;\n//]]>Jamón 5 Jotas, champagne Bollinger y King Alexander III\n\n\n\tPor J.M.
How can I extract the information as a string? On other web pages the previous code has worked.

I think you just need one more / in your query down to h1, as in //h1 instead of /h1.
library(XML)
x <- '<div class="cuerpo-not"><div mod="2323">
<h1>Jamón 5 Jotas, champagne Bollinger y King Alexander III</h1>'
xpathSApply(htmlParse(x), "//div[contains(@class, 'cuerpo-not')]//h1", xmlValue)
# [1] "Jamón 5 Jotas, champagne Bollinger y King Alexander III"

Related

Scraping dynamic table in R

I am stuck on a simple web scrape.
My goal is to scrape Morningstar.com to retrieve the education of the managers associated with a fund name.
First off, let me say that I am not familiar at all with this operation. However, I did my best to provide some code.
For example, consider the following webpage
http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US
The problem is that the page dynamically loads the section I am targeting, so it doesn't actually get pulled in by read_html().
So what I did was to access the data loaded in my section of interest.
Specifically, I did:
# edit: added packages required
library(xml2)
library(rvest)
library(stringi)
# original code
tmp_url <- "http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US"
pg <- read_html(tmp_url)
tmp <- length(html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]"))
html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]") %>%
html_text() %>%
stri_split_lines() %>%
.[[1]] -> js_lines
idx <- which(stri_detect_fixed(js_lines, '\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t='))
start <- nchar("\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t=")+1
id <- substr(js_lines[idx],start, start+9)
tab <- read_html(paste0("http://financials.morningstar.com/oprn/c-managers.action?&t=",id,"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"), options = "HUGE")
The object tab contains the information I need.
What I need to do now is to create a data frame associating each manager name with his or her education.
I could try to do this by transforming my object into a string, then extracting the characters following the word "Education".
However, this looks extremely inefficient.
I was wondering if anyone can provide some guidance.
This thing really is a mess - nice work getting the links and downloading the info.
Poking around a lot and taking various detours, this is the best I could come up with:
Clean Up
First there is some cleanup to do. Instead of directly downloading and parsing the document in one step we will:
download the document as text
clean up the text a little to get the JSON
parse the JSON
extract the HTML item
do some further cleaning
finally parse the HTML
url <-
paste0(
"http://financials.morningstar.com/oprn/c-managers.action?&t=",
id,
"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"
)
txt <-
readLines(url, warn = FALSE)
json <-
txt %>%
gsub("^jsonp\\d+\\(", "", .) %>%
gsub("\\)$", "", .)
json_parsed <-
jsonlite::fromJSON(json)
html_clean <-
json_parsed$html %>%
gsub("\t", "", .)
html_parsed <-
read_html(html_clean)
First Round of Node Extraction
Next we use some black magic node extraction trickery. Basically the trick goes like this: If we have a node set (the thing you get when using html_nodes) we can use further XPath queries to drill down.
The first node set (cvs) captures the basic path to the CV entries in the table.
The second node set (info_tmp) drills down a little further to get those parts of the CV entries where further information ("Other Assets Managed", "Education", ... etc.) is stored.
cvs <-
html_parsed %>%
html_nodes(xpath = "/html/body/table/tbody/tr[not(#align='left')]")
info_tmp <-
cvs %>%
html_nodes(xpath = "td/table/tbody")
Building up Data.Frame 1
There is a little problem with the table. Each CV entry lives in its own table row. For name, from, to and description there is always exactly one item per CV entry, but for "Other Assets Managed", "Education", ... etc. this is not true.
Therefore, information extraction is done in two parts.
df <-
cvs %>%
lapply(
FUN =
function(x){
tmp <-
x %>%
html_nodes(xpath = "th") %>%
html_text() %>%
gsub(" +", "", .)
data.frame(
name = stri_extract(tmp, regex = "[. \\w]+"),
from = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}"),
to = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}")
)
}
) %>%
do.call(rbind, .)
df$description <-
info_tmp %>%
html_nodes(xpath = "tr[1]/td[1]") %>%
html_text()
df$cv_id <- seq_len(nrow(df))
Building Up Data.Frame 2
Now some more html_nodes trickery ... If we use html_nodes() on the result set of a previous html_nodes() call, we get all matching nodes and none of the non-matching ones. This is a problem, since we might get 1, 0 or multiple nodes per node-set element, basically destroying any information about where those newly selected nodes came from.
There is a solution, however: we can use lapply to query each element of a node set independently from the others, thereby preserving information about the original structure.
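A minimal illustration of the idea (hypothetical lines, not part of the answer's code): querying each node separately keeps empty results aligned with the CV entry they came from, whereas the pooled query lumps all matches together.
# per-entry query: one list element per CV entry, possibly character(0)
lapply(info_tmp, function(x) html_text(html_nodes(x, xpath = "tr[3]/td[1]")))
# pooled query: all matches at once, origin information lost
html_text(html_nodes(info_tmp, xpath = "tr[3]/td[1]"))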
extract_key_value_pairs <-
function(i, info_tmp){
cv_id <-
seq_along(info_tmp)
key <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[1]")) %>%
html_text()
if ( length(tmp) == 0 ) {
return("")
}else{
return(tmp)
}
}
)
value <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[2]")) %>%
html_text() %>%
stri_trim_both() %>%
stri_split(fixed = "\n") %>%
lapply(X = ., stri_trim_both)
if ( length(tmp) == 0 ) {
return("")
}else{
return(unlist(tmp))
}
}
)
df <-
mapply(
cv_id = cv_id,
key = key,
value = value,
FUN =
function(cv_id, key, value){
data.frame(
cv_id = cv_id,
key = key,
value = value
)
},
SIMPLIFY = FALSE
) %>%
do.call(rbind, .)
df[df$key != "",]
}
df2 <-
lapply(
X = c(3, 5, 7),
FUN = extract_key_value_pairs,
info_tmp = info_tmp
) %>%
do.call(rbind, .)
Results
df
## name from to description cv_id
## 1 Kurt J. Lauber 03/20/2013 03/20/2013 Mr. Lauber ... 1
## 2 Noah J. Monsen 02/28/2018 02/28/2018 Mr. Monsen ... 2
## 3 Lauri Brunner 09/30/2018 09/30/2018 Ms. Brunne ... 3
## 4 Darren M. Bagwell 02/29/2016 02/29/2016 Darren M. ... 4
## 5 David C. Francis 10/07/2011 10/07/2011 Francis is ... 5
## 6 Michael A. Binger 04/14/2010 04/14/2010 Binger has ... 6
## 7 David E. Heupel 04/14/2010 04/14/2010 Mr. Heupel ... 7
## 8 Matthew D. Finn 03/30/2007 03/30/2007 Mr. Finn h ... 8
## 9 Scott Vergin 03/30/2007 03/30/2007 Vergin has ... 9
## 10 Frederick L. Plautz 11/01/1995 11/01/1995 Plautz has ... 10
## 11 Clyde E. Bartter 01/01/1994 01/01/1994 Bartter is ... 11
## 12 Wayne C. Stevens 01/01/1994 01/01/1994 Stevens is ... 12
## 13 Julian C. Ball 07/16/1987 07/16/1987 Ball is a ... 13
df2
## cv_id key value
## 1 Other Assets Managed
## 2 Other Assets Managed
## 3 Other Assets Managed
## 4 Certification CFA
## 4 Other Assets Managed
## 5 Certification CFA
## 5 Education M.B.A. University of Pittsburgh, 1978
## 5 Education B.A. University of Pittsburgh, 1977
## 5 Other Assets Managed
## 6 Certification CFA
## 6 Education M.B.A. University of Minnesota, 1991
## 6 Education B.S. University of Minnesota, 1987
## 6 Other Assets Managed
## 7 Other Assets Managed
## 8 Certification CFA
## 8 Education B.A. University of Pennsylvania, 1984
## 8 Education M.B.A. University of Michigan, 1990
## 8 Other Assets Managed
## 9 Certification CFA
## 9 Education M.B.A. University of Minnesota, 1980
## 9 Education B.A. St. Olaf College, 1976
## 9 Other Assets Managed
## 10 Education M.S. University of Wisconsin, 1981
## 10 Education B.B.A. University of Wisconsin, 1979
## 10 Other Assets Managed
## 11 Certification CFA
## 11 Education M.B.A. Western Reserve University, 1964
## 11 Education B.A. Baldwin-Wallace College, 1953
## 11 Other Assets Managed
## 12 Certification CFA
## 12 Education M.B.A. University of Wisconsin,
## 12 Education B.B.A. University of Miami,
## 12 Other Assets Managed
## 13 Certification CFA
## 13 Education B.A. Kent State University, 1974
## 13 Education J.D. Cleveland State University, 1984
## 13 Other Assets Managed
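To get back to the original goal (each manager's name next to his or her education), the two data frames can be joined on cv_id; a small sketch, not part of the code above:
# keep only the Education rows of df2 and attach the manager names from df
education <- merge(df[, c("cv_id", "name")],
                   df2[df2$key == "Education", c("cv_id", "value")],
                   by = "cv_id")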
I don't have a solution, as this is not an area I have worked with before. However, with brute force you can probably get the table, assuming you have a list of rules that can parse the text to a data frame.
Thought I'd share what I have, though:
# get the text
f <- xml_text(tab)
# split up, this bit is tricky..
split_f <- strsplit(f, split="\\\\t", perl=TRUE)[[1]]
split_f <- strsplit(split_f, split="\\\\n", perl=TRUE)
split_f <- unlist(split_f)
split_f <- trimws(split_f)
# find ones to remove
sort(table(split_f), decreasing = T)[1:5]
split_f <- split_f[split_f!="—"]
split_f <- split_f[split_f!=""]
# manually found where to split
keep <- split_f[2:108]
# text looks ok, but would need rules to extract the rows in to a data.frame
View(keep)
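For example, one such rule could pair every "Education" label in keep with the entry that follows it; a rough sketch, assuming labels and values alternate in the cleaned vector (they may not for every row):
edu_idx <- which(keep == "Education")
education <- data.frame(label = keep[edu_idx],
                        value = keep[edu_idx + 1],   # entry right after each label
                        stringsAsFactors = FALSE)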

How to follow a link with data-params using rvest

I'm trying to web scrape a public data provider but I got stuck when I had to click on a button passing a parameter to the JS. Here's my attempt:
require(rvest)
url <- 'https://myterna.terna.it/SunSet/Public/'
page <- url %>% read_html()
node_link <- page %>% html_node('.sub-item:nth-child(1) .postlink')
In node_link I can easily find the target page as the href of this HTML tag:
<a href="/SunSet/Public/Pubblicazioni"
class="postlink"
data-params="filter.IdSezione=52767620567B3077E053A8829B0A9478">
The point is that I cannot easily retrieve the content of the linked page because there are other buttons that point to the same link. The only difference between the various buttons is the data-params attribute which probably has to be given to the JS in order to retrieve the specific content.
Any ideas on how to solve the issue?
Obligatory heads-up:
It's not really clear whether the site allows scraping; the Legal Notice says "Authorization is granted for the reproduction of documents published on this website exclusively for personal use and not for commercial purposes, provided the name of source is properly indicated."
Use this respecting their terms of service.
Inspecting the network activity when clicking on that link, we can see that the webpage makes a POST request to https://myterna.terna.it/SunSet/Public/Pubblicazioni/List. We can find both the request headers and the params sent.
par <- '{"draw":1,"columns":[{"data":0,"name":"","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":1,"name":"","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":2,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":3,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":4,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":5,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":6,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":7,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}}],"order":[],"start":0,"length":10,"search":{"value":"","regex":false},"filter":{"IdSezione":"52767620567B3077E053A8829B0A9478","Titolo":"","Id":"","ExtKey":"","TipoPubblicazione":"","SheetName":"","Anno":"2017","Mese":"7","Giorno":"","DataPubblicazione":"","TipoDatoPubblicazione":""},"details":{}}'
This is JSON; we can parse it and change its values if we want (although I tried a few different filters and it doesn't respond much to them):
par <- jsonlite::fromJSON(par)
par$filter$Mese <- '7'
As for headers, only X-Requested-With: XMLHttpRequest is really needed, so we can cut it down to that.
library(httr)
response <- POST('https://myterna.terna.it/SunSet/Public/Pubblicazioni/List',
                 add_headers('X-Requested-With' = 'XMLHttpRequest'),
                 body = par,
                 encode = 'json')
json_data <- content(response)$data
This returns a list that we can safely transform into a data frame for convenient use:
df <- data.frame(matrix(unlist(json_data), nrow=length(json_data), byrow=TRUE))
head(df, 2)
#> X1
#> 1 SbilanciamentoAggregatoZonale_SegnoGiornaliero_Orario_20170709
#> 2 SbilanciamentoAggregatoZonale_SegnoGiornaliero_QuartoOrario_20170709
#> X2
#> 1 /Date(1499680800000)/
#> 2 /Date(1499680800000)/
#> X3
#> 1 <div class="actions detail-inline export" data-pk="53F4A57FCB70304EE0532A889B0A7758"></div>
#> 2 <div class="actions detail-inline export" data-pk="53F4A57FCB6D304EE0532A889B0A7758"></div>
#> X4 X5 X6
#> 1 53F4A57FCB70304EE0532A889B0A7758 25 SEGNO_MACROZONALE_ORARIO
#> 2 53F4A57FCB6D304EE0532A889B0A7758 25 SEGNO_MACROZONALE_QUARTO_ORARIO
#> X7 X8
#> 1 Segno Giornaliero Orario
#> 2 Segno Giornaliero Quarto Orario
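The X2 column holds Microsoft-style JSON dates (milliseconds since the Unix epoch); if proper timestamps are needed, something like the following should work (a sketch, not part of the answer above):
# strip the non-digits from "/Date(1499680800000)/" and convert from ms
ms <- as.numeric(gsub("\\D", "", df$X2))
df$published <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")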
OK, basically I was missing how the HTTP mechanism works. After some days of study I understood that the correct approach is to use the httr package as shown below.
First of all I retrieved all the settings needed from the public page:
lnkd_url <- paste0(dirname(dirname(url)),
node_link %>%
html_attr('href'))
lnkd_id <- strsplit(node_link %>%
                      html_attr('data-params'), '=')[[1]][2]
Then it is possible to launch the POST request to the target page:
lnkd_page <- POST(lnkd_url,
                  body = list('filter.IdSezione' = lnkd_id))
That's it!
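From there the response can be inspected with the usual httr/rvest tools; a minimal sketch, assuming the endpoint returns an HTML fragment (adjust accordingly if it returns JSON like the /List endpoint in the other answer):
# hypothetical follow-up, not part of the original answer
lnkd_html <- read_html(content(lnkd_page, as = "text", encoding = "UTF-8"))
lnkd_tables <- lnkd_html %>% html_nodes("table") %>% html_table(fill = TRUE)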

Read HTML Table from Greyhound via R

I'm trying to read the HTML data regarding Greyhound bus timings. An example can be found here. I'm mainly concerned with getting the schedule and status data off the table, but when I execute the following code:
library(XML)
url<-"http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
greyhound<-readHTMLTable(url)
greyhound<-greyhound[[2]]
This produces a table, but not the one I'm after. I'm not sure why it's grabbing data that's not even on the page, as opposed to the schedule and status table that is actually displayed.
You cannot retrieve the data using readHTMLTable because the trip results are sent as a JavaScript script. So you should select that script and parse it to extract the right information.
Here is a solution that does this:
Extract the JavaScript script that contains the JSON data
Extract the JSON data from the script using a regular expression
Parse the JSON data into an R list
Reshape the resulting list into a table (a data.table here)
The code may look short, but it is really compact (it took me an hour to produce it)!
library(XML)
library(httr)
library(jsonlite)
library(data.table)
# parse the page (url as defined in the question) and pull out the 5th <script>,
# which holds the stop data pushed via stopArray.push({...})
dc <- htmlParse(GET(url))
script <- xpathSApply(dc, "//script/text()", xmlValue)[[5]]
# one chunk per stopArray.push({...}) call (drop the leading chunk)
res <- strsplit(script, "stopArray.push({", fixed = TRUE)[[1]][-1]
# rebuild each chunk as JSON, parse it, then reshape to one row per stop
dcast(point ~ name, data = rbindlist(Map(function(x, y){
  x <- paste('{', sub(');|);.*docum.*', "", x))
  dx <- unlist(fromJSON(x))
  data.frame(point = y, name = names(dx), value = dx)
}, res, seq_along(res))
, fill = TRUE)[name != "polyline"])
The resulting table:
point category direction id lat linkName lon
1: 1 2 empty 562310 41.878589630127 Chicago_Amtrak_IL -87.6398544311523
2: 2 2 empty 560252 41.8748474121094 Chicago_IL -87.6435165405273
3: 3 1 empty 561627 41.7223281860352 Chicago_95th_&_Dan_Ryan_IL -87.6247329711914
4: 4 2 empty 260337 41.6039199829102 Gary_IN -87.3386917114258
5: 5 1 empty 260447 40.4209785461426 Lafayette_e_IN -86.8942031860352
6: 6 2 empty 260392 39.7617835998535 Indianapolis_IN -86.161018371582
7: 7 2 empty 250305 39.1079406738281 Cincinnati_OH -84.5041427612305
name shortName ticketName
1: Chicago Amtrak: 225 S Canal St, IL 60606 Chicago Amtrak, IL CHD
2: Chicago: 630 W Harrison St, IL 60607 Chicago, IL CHD
3: Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628 Chicago 95th & Dan Ryan, IL CHD
4: Gary: 100 W 4th Ave, IN 46402 Gary, IN GRY
5: Lafayette (e): 401 N 3rd St, IN 47901 Lafayette (e), IN XIN
6: Indianapolis: 350 S Illinois St, IN 46225 Indianapolis, IN IND
7: Cincinnati: 1005 Gilbert Ave, OH 45202 Cincinnati, OH CIN
As @agstudy notes, the data is rendered into the HTML by JavaScript; it's not delivered in the HTML directly from the server. Therefore, you can (a) use something like RSelenium to scrape the rendered content, or (b) extract the data from the <script> tags that contain the data.
To explain @agstudy's work, we observe that the data is contained in a series of stopArray.push() commands in one of the (many) script tags. For example:
stopArray.push({
"id" : "562310",
"name" : "Chicago Amtrak: 225 S Canal St, IL 60606",
"shortName" : "Chicago Amtrak, IL",
"ticketName" : "CHD",
"category" : 2,
"linkName" : "Chicago_Amtrak_IL",
"direction" : "empty",
"lat" : 41.87858963012695,
"lon" : -87.63985443115234,
"polyline" : "elr~Fnb|uOmC##nG?XBdH#rC?f#?P?V#`AlAAn#A`CCzBC~BE|CEdCA^Ap#A"
});
Now, this is JSON data contained inside each function call. I tend to think that if someone has gone to the work of formatting data in a machine-readable format, well golly, we should appreciate it!
The tidyverse approach to this problem is as follows:
Download the page using the rvest package.
Identify the appropriate script tag to use by employing an xpath expression that searches for all script tags that contain the string url =.
Use a regular expression to pull out everything inside each stopArray.push() call.
Fix the formatting of the resulting object by (a) separating each block with commas, (b) surrounding the string by [] to indicate a json list.
Use jsonlite::fromJSON to convert into a data.frame.
Note that I drop the polyline column near the end, since it's too large to preview appropriately.
library(tidyverse)
library(rvest)
library(stringr)
library(jsonlite)
url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
page <- read_html(url)
page %>%
html_nodes(xpath = '//script[contains(text(), "url = ")]') %>%
html_text() %>%
str_extract_all(regex("(?<=stopArray.push\\().+?(?=\\);)", multiline = T, dotall = T), F) %>%
unlist() %>%
paste(collapse = ",") %>%
sprintf("[%s]", .) %>%
fromJSON() %>%
select(-polyline) %>%
head()
#> id name
#> 1 562310 Chicago Amtrak: 225 S Canal St, IL 60606
#> 2 560252 Chicago: 630 W Harrison St, IL 60607
#> 3 561627 Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628
#> 4 260337 Gary: 100 W 4th Ave, IN 46402
#> 5 260447 Lafayette (e): 401 N 3rd St, IN 47901
#> 6 260392 Indianapolis: 350 S Illinois St, IN 46225
#> shortName ticketName category
#> 1 Chicago Amtrak, IL CHD 2
#> 2 Chicago, IL CHD 2
#> 3 Chicago 95th & Dan Ryan, IL CHD 1
#> 4 Gary, IN GRY 2
#> 5 Lafayette (e), IN XIN 1
#> 6 Indianapolis, IN IND 2
#> linkName direction lat lon
#> 1 Chicago_Amtrak_IL empty 41.87859 -87.63985
#> 2 Chicago_IL empty 41.87485 -87.64352
#> 3 Chicago_95th_&_Dan_Ryan_IL empty 41.72233 -87.62473
#> 4 Gary_IN empty 41.60392 -87.33869
#> 5 Lafayette_e_IN empty 40.42098 -86.89420
#> 6 Indianapolis_IN empty 39.76178 -86.16102

How to loop - JSONP / JSON data using R

I thought I had parsed the data correctly using jsonlite & tidyjson. However, I am noticing that only the data from the first page is being parsed. Please advise how I could parse all the pages correctly. The total number of pages is over 1,300 (judging from the JSON output), so I think the data is available but not correctly parsed.
Note: I have used tidyjson, but am open to using jsonlite or any other library too.
library(dplyr)
library(tidyjson)
library(jsonlite)
req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1?OPERATION-NAME=findItemsByKeywords&SERVICE-VERSION=1.0.0&SECURITY-APPNAME=xxxxxx&GLOBAL-ID=EBAY-US&RESPONSE-DATA-FORMAT=JSON&callback=_cb_findItemsByKeywords&REST-PAYLOAD&keywords=harry%20potter&paginationInput.entriesPerPage=100")
txt <- content(req, "text")
json <- sub("/**/_cb_findItemsByKeywords(", "", txt, fixed = TRUE)
json <- sub(")$", "", json)
data1 <- json %>% as.tbl_json %>%
enter_object("findItemsByKeywordsResponse") %>% gather_array %>% enter_object("searchResult") %>% gather_array %>%
enter_object("item") %>% gather_array %>%
spread_values(
ITEMID = jstring("itemId"),
TITLE = jstring("title")
) %>%
select(ITEMID, TITLE) # select only what is needed
Note: the JSON reports "paginationOutput":[{"pageNumber":["1"],"entriesPerPage":["100"],"totalPages":["1393"],"totalEntries":["139269"]}]
No need for tidyjson. You will need to write another function/set of calls to get the total number of pages (it's nearly 1,400; a sketch for retrieving that count follows the example below) in order to use the following, but that should be fairly straightforward. Try to compartmentalize your operations a bit more and use the full power of httr when you can to parameterize things:
library(dplyr)
library(jsonlite)
library(httr)
library(purrr)
get_pg <- function(i) {
cat(".") # shows progress
req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1",
query=list(`OPERATION-NAME`="findItemsByKeywords",
`SERVICE-VERSION`="1.0.0",
`SECURITY-APPNAME`="xxxxxxxxxxxxxxxxxxx",
`GLOBAL-ID`="EBAY-US",
`RESPONSE-DATA-FORMAT`="JSON",
`REST-PAYLOAD`="",
`keywords`="harry potter",
`paginationInput.pageNumber`=i,
`paginationInput.entriesPerPage`=100))
dat <- fromJSON(content(req, as="text", encoding="UTF-8"))
map_df(dat$findItemsByKeywordsResponse$searchResult[[1]]$item, function(x) {
data_frame(ITEMID=flatten_chr(x$itemId),
TITLE=flatten_chr(x$title))
})
}
# "10" will need to be the max page number. I wasn't about to
# make 1,400 requests to ebay. I'd probably break them up into
# sets of 30 or 50 and save off temporary data frames as rdata files
# just so you don't get stuck in a situation where R crashes and you
# have to get all the data again.
srch_dat <- map_df(1:10, get_pg)
srch_dat
## Source: local data frame [1,000 x 2]
##
## ITEMID TITLE
## (chr) (chr)
## 1 371533364795 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 2 331128976689 HOT New Harry Potter 14.5" Magical Wand Replica Cosplay In Box
## 3 131721213216 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 4 171430021529 New Harry Potter Hermione Granger Rotating Time Turner Necklace Gold Hourglass
## 5 261597812013 Harry Potter Time Turner+GOLD Deathly Hallows Charm Pendant necklace
## 6 111883750466 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 7 251947403227 HOT New Harry Potter 14.5" Magical Wand Replica Cosplay In Box
## 8 351113839731 Marauder's Map Hogwarts Wizarding World Harry Potter Warner Bros LIMITED **NEW**
## 9 171912724869 Harry Potter Time Turner Necklace Hermione Granger Rotating Spins Gold Hourglass
## 10 182024752232 Harry Potter : Complete 8-Film Collection (DVD, 2011, 8-Disc Set) Free Shipping
## .. ... ...
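For the page count mentioned above, a first request can be parsed for paginationOutput$totalPages. A rough sketch (not part of the original answer; the exact subsetting may need adjusting depending on how fromJSON simplifies the response):
get_total_pages <- function() {
  req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1",
                   query=list(`OPERATION-NAME`="findItemsByKeywords",
                              `SERVICE-VERSION`="1.0.0",
                              `SECURITY-APPNAME`="xxxxxxxxxxxxxxxxxxx",
                              `GLOBAL-ID`="EBAY-US",
                              `RESPONSE-DATA-FORMAT`="JSON",
                              `REST-PAYLOAD`="",
                              `keywords`="harry potter",
                              `paginationInput.entriesPerPage`=100))
  dat <- jsonlite::fromJSON(httr::content(req, as="text", encoding="UTF-8"))
  pg_out <- dat$findItemsByKeywordsResponse$paginationOutput
  # totalPages comes back as a character value inside nested lists
  as.integer(unlist(pg_out)[["totalPages"]])
}
# then, e.g.: srch_dat <- map_df(seq_len(get_total_pages()), get_pg)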

Scrape values from HTML select/option tags in R

I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.go.ke) using R. I've managed to scrape the HTML and parse it but now I'm a little unsure how to extract the bits I actually need!
Using the httr and XML libraries, I scrape my data with this code:
library(httr)
library(XML)
majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
majidata_html <- htmlTreeParse(content(majidata_get, as="text"))
This leaves me with a (large) XMLDocumentContent. There is a drop-down list on the webpage and I want to scrape the values from it (which relate to the names and ID numbers of different towns). The bits I want to extract are the numbers between <option value="XXX"> and the name following them in capital letters.
<div class="regiondata">
<div id="town_data">
<select id="town" name="town" onchange="town_data(this.value);">
<option value="0" selected="selected">[SELECT TOWN]</option>
<option value="611">AHERO</option>
<option value="635">AKALA</option>
<option value="625">AWASI</option>
<option value="628">AWENDO</option>
<option value="749">BAHATI</option>
<option value="327">BANGALE</option>
Ideally, I'd like to have these in a data.frame where the first column is the number and second column is the name e.g.
ID Name
611 AHERO
635 AKALA
625 AWASI
etc.
I'm not really sure where to go from here. I had thought to use regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it's better/more efficient to use the xpath. I'm not really sure where to start with this, though, other than thinking I need to use xpathApply somehow.
The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.
UPDATED Incorporates the second request (see comments below)
library(rvest)
library(dplyr)
# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {
# make the AJAX URL and grab the data
url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
town_id)
subunits <- html(url)
# reformat into a data frame with the town data
data.frame(town_id=town_id,
town_name=town_name,
area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
area_name=subunits %>% html_nodes("option") %>% html_text(),
stringsAsFactors=FALSE)[-1,]
}
# get data from the first popup and put it into a data frame
majidata <- html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
town_name=majidata %>% html_nodes("#town option") %>% html_text(),
stringsAsFactors=FALSE)[-1,]
# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
mapply(addArea, maji$town_id, maji$town_name,
SIMPLIFY=FALSE, USE.NAMES=FALSE))
# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL
str(combined)
## 'data.frame': 1964 obs. of 4 variables:
## $ town_id : chr "611" "635" "625" "628" ...
## $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
## $ area_id : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
## $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...
head(combined)
## town_id town_name area_id area_name
## 1 611 AHERO 60603030101 AHERO
## 2 635 AKALA 60107050201 AKALA
## 3 625 AWASI 60603020101 AWASI
## 4 628 AWENDO 61103040101 ANINDO
## 5 628 AWENDO 61103050401 SARE
## 6 749 BAHATI 73101010101 BAHATI
Using XPath expressions with HTML is almost always a better choice than regex. Given this data, you can extract what you're after with the following (note that getNodeSet() needs an internal document, so parse with htmlParse() or htmlTreeParse(..., useInternalNodes = TRUE) first):
options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")
ids <- sapply(options, xmlGetAttr, "value")
names <- sapply(options, xmlValue)
data.frame(ID=ids, Name=names)
which returns
ID Name
1 0 [SELECT TOWN]
2 611 AHERO
3 635 AKALA
4 625 AWASI
5 628 AWENDO
6 749 BAHATI
...
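If you don't want the [SELECT TOWN] placeholder in the result, drop the first row, as the rvest answer above does with [-1, ]:
data.frame(ID = ids, Name = names)[-1, ]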