R - Scrape a number of URLs and save individually - html

Disclaimer: I'm not a programmer by trade and my knowledge of R is limited to say the least. I've also already searched Stackoverflow for a solution (but to no avail).
Here's my situation: I need to scrape a series of webpages and save the data (not quite sure in what format, but I'll get to that). Fortunately the pages I need to scrape have a very logical naming structure (they use the date).
The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p
I need to scrape everything from August 1st 2018 (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) until yesterday (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).
So far I've figured out how to create a list of dates which can be appended to the base URL using the following:
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
I can append these to the base URL with the following:
url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)
However, that's pretty much as far as I've gotten (not very far, I know!). I assume I need to use a for loop, but my own attempts at this have proved futile. Perhaps I'm not approaching this the right way?
In case it's not clear, what I'm trying to do is to visit each URL and save the html as an individual html file (ideally labelled with the relevant date). In truth, I don't need all of the html (just the list of programmes and times) but I can extract that information from the relevant files at a later date.
Any guidance on the best way to approach this would be much appreciated! And if you need any more info, just ask.

Have a look at the rvest package and associated tutorials. E.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.
The messy part is extracting the fields the way you want them.
Here is one possible solution:
library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)
get_data <- function(url){
html <- tryCatch(read_html(url), error=function(e) NULL)
if(is.null(html)) return(data.table(
date=gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
title=NA, description=NA)) else {
time <- html %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(@class, 'broadcast__info grid 1/4 1/6@bpb2 1/6@bpw')]") %>%
rvest::html_text() %>% gsub(".*([0-9]{2}.[0-9]{2}).*", "\\1", .)
text <- html %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(@class, 'programme__body')]") %>%
rvest::html_text() %>%
gsub("[ ]{2,}", " ", .) %>% gsub("[\n|\n ]{2,}", "\n", .) %>%
gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>%
gsub("^\n|\n$", "", .) %>%
str_split_fixed(., "\n", 2) %>%
as.data.table() %>% setnames(., c("title", "description")) %>%
.[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
time = time,
description = gsub("\n", " ", description))] %>%
setcolorder(., c("date", "time", "title", "description"))
text
}
}
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#> date time
#> 1: 2018/08/01 06:00
#> 2: 2018/08/01 09:15
#> 3: 2018/08/01 10:00
#> 4: 2018/08/01 11:00
#> 5: 2018/08/01 11:45
#> ---
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#> title
#> 1: Breakfast—01/08/2018
#> 2: Wanted Down Under—Series 11, Hanson Family
#> 3: Homes Under the Hammer—Series 21, Episode 6
#> 4: Fake Britain—Series 7, Episode 7
#> 5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#> ---
#> 16760: BBC London—Late News, 17/05/2020
#> 16761: Educating Rita
#> 16762: The Real Marigold Hotel—Series 4, Episode 2
#> 16763: Weather for the Week Ahead—18/05/2020
#> 16764: Joins BBC News—18/05/2020
#> description
#> 1: The latest news, sport, business and weather from the BBC's Breakfast team.
#> 2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#> 3: Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#> 4: 7/10 The fake sports memorabilia that cost collectors thousands. (R)
#> 5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#> ---
#> 16760: The latest news, sport and weather from London.
#> 16761: Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762: 2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763: Detailed weather forecast.
#> 16764: BBC One joins the BBC's rolling news channel for a night of news.
Created on 2020-05-18 by the reprex package (v0.3.0)
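The question also asked for each page to be saved as its own HTML file, which the code above does not do. A minimal sketch of that step, using only base R, might look like the following. The file naming scheme (`schedule_YYYY-MM-DD.html`) and the helper `save_page()` are assumptions for illustration, and the one-second pause is just a guess at a polite request rate.

```r
# Build one URL per day and a date-stamped file name for each page.
base_url <- "https://www.bbc.co.uk/schedules/p00fzl6p/"
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by = "day")
urls  <- paste0(base_url, format(dates, "%Y/%m/%d"))
files <- paste0("schedule_", format(dates, "%Y-%m-%d"), ".html")  # hypothetical naming scheme

# Hypothetical helper: download one page, return TRUE on success
# and FALSE if download.file() warns or fails (e.g. on a 404).
save_page <- function(url, file) {
  tryCatch({
    download.file(url, destfile = file, quiet = TRUE)
    TRUE
  },
  warning = function(w) FALSE,
  error   = function(e) FALSE)
}

# Not run here: fetch every page, pausing between requests.
# ok <- mapply(function(u, f) { Sys.sleep(1); save_page(u, f) }, urls, files)
```

The saved files can then be parsed later with read_html(), pointing it at the file path instead of the URL.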

rvest error on form submission "`Form` doesn't contain a `action` attribute"

I am trying to send search requests with rvest, but I always get the same error. I have tried several approaches, including this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following.
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(url)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I always get the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the way forward is to edit the attributes of those buttons. Maybe someone has the answer to this. Thanks in advance.
An alternative method with httr2:
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
title = data %>%
html_elements(".document-title") %>%
html_text2(),
report_title = data %>%
html_elements(".info") %>%
html_text2() %>%
str_remove_all("\r") %>%
str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹​report_title
Created on 2023-01-15 with reprex v2.0.2

R: Webscraping a List From Wikipedia

I am working with the R programming language. I am trying to scrape the following website: https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario
I tried the code below:
library(rvest)
url<-"https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario"
page <-read_html(url)
#find the div tab of class=one_third
b = page %>% html_nodes("li")
This seems to have produced some result, but I am not sure what to do with this.
Ideally, I would like the final results to look something like this:
id names
1 Aberdeen
2 Grey County
3 Aberdeen
4 Prescott and Russell County
5 Aberfeldy
... ...
6 Babys Point
7 Baddow
8 Baden
... ......
Can someone please show me how to do this?
Thanks!
You can find the appropriate names as anchor elements nested within list elements using css or xpath selectors. Then, extract these using html_text. Here's a full reprex:
library(rvest)
result <- "https://en.wikipedia.org/wiki/" %>%
paste0("List_of_unincorporated_communities_in_Ontario") %>%
read_html %>%
html_elements(xpath = '//ul/li/a') %>%
html_text() %>%
`[`(-(1:29)) %>%
as.data.frame() %>%
setNames('Community')
head(result, 10)
#> Community
#> 1 10th Line Shore
#> 2 Aberdeen, Grey County
#> 3 Aberdeen, Prescott and Russell County
#> 4 Aberfeldy
#> 5 Aberfoyle
#> 6 Abingdon
#> 7 Abitibi 70
#> 8 Abitibi Canyon
#> 9 Aboyne
#> 10 Acanthus
Created on 2022-09-15 with reprex v2.0.2
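To see what the XPath selector is doing without hitting Wikipedia, the same call can be tried on a small inline document. The HTML below is made up for illustration; only the selector and the rvest calls come from the answer above.

```r
library(rvest)

# Hypothetical inline HTML standing in for the Wikipedia page
doc <- read_html("<html><body>
  <ul>
    <li><a href='/wiki/Aberdeen'>Aberdeen</a></li>
    <li><a href='/wiki/Aberfeldy'>Aberfeldy</a></li>
    <li>No link here</li>
  </ul>
  <p><a href='/wiki/Other'>Not in a list</a></p>
</body></html>")

# Keep only anchors that are direct children of list items
communities <- doc %>%
  html_elements(xpath = "//ul/li/a") %>%
  html_text()

communities
```

Only the two anchors inside list items are returned; the plain list item and the anchor in the paragraph are skipped. On the real page the same selector also picks up navigation links, which is presumably why the answer drops the first 29 matches by position.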

How can I pull all nested URLs from a webpage using rvest and xml2?

I'm trying to pull all nested links from the webpage below. My code below returns an empty character vector.
page1 <- "https://thrivemarket.com/c/condiments-sauces?cur_page=1"
page1 <- read_html(page1)
page1_body <- page1 %>%
html_node("body") %>%
html_children()
page1_urls <- page1 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(@class, 'd85qmy-0 kRbsKs')]") %>%
rvest::html_attr('href')
Thank you in advance for your help with this.
Best,
~Mayra
The links you are looking for do not exist in the html document you are reading with read_html. When you look at the page in a browser, the html document contains Javascript code, which your browser runs. Some of this Javascript code causes your browser to download further information to be inserted into the web page you see on your browser.
In your case, the extra information you are looking for is in the form of a json file, which you can obtain and parse as follows:
library(httr)
library(dplyr)
url <- paste0("https://thrivemarket.com/api/v1/products",
"?page_size=60&multifilter=1&cur_page=1")
content(GET(url))$products %>%
lapply(function(x) data.frame(product = x$title, url = x$url)) %>%
bind_rows() %>%
as_tibble()
#> # A tibble: 60 x 2
#> product url
#> <chr> <chr>
#> 1 Organic Extra Virgin Olive Oil https://thrivemarket.com/p/~
#> 2 Grass-Fed Collagen Peptides https://thrivemarket.com/p/~
#> 3 Grass-Fed Beef Sticks, Original https://thrivemarket.com/p/~
#> 4 Organic Dry Roasted & Salted Cashews https://thrivemarket.com/p/~
#> 5 Organic Vanilla Extract https://thrivemarket.com/p/~
#> 6 Organic Raw Cashews https://thrivemarket.com/p/~
#> 7 Organic Coconut Milk, Regular https://thrivemarket.com/p/~
#> 8 Organic Robust Maple Syrup, Grade A, Value Size https://thrivemarket.com/p/~
#> 9 Organic Coconut Water https://thrivemarket.com/p/~
#> 10 Non-GMO Avocado Oil Potato Chips, Himalayan Salt https://thrivemarket.com/p/~
#> # ... with 50 more rows
Created on 2022-06-04 by the reprex package (v2.0.1)
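If the endpoint changes or is unavailable, the parsing step can still be tried offline. The sketch below assumes a payload shaped like the one the code above expects (a `products` array whose entries have `title` and `url` fields); the JSON string itself is invented for illustration.

```r
library(jsonlite)

# Hypothetical payload; only the field names follow the answer's code
payload <- '{"products":[
  {"title":"Olive Oil","url":"https://example.com/p/1"},
  {"title":"Maple Syrup","url":"https://example.com/p/2"}
]}'

# fromJSON() simplifies an array of records into a data frame
products <- fromJSON(payload)$products
products <- data.frame(product = products$title, url = products$url)

products
```

The endpoint itself is usually found in the Network tab of the browser's developer tools while the page loads.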

Scraping dynamic table in R

I am stuck on a simple web scrape.
My goal is to scrape Morningstar.com to retrieve the education of the managers associated with a fund name.
First off, let me say that I am not familiar at all with this operation. However, I did my best to provide some code.
For example, consider the following webpage
http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US
The problem is that the page dynamically loads the section I am targeting, so it doesn't actually get pulled in by read_html().
So what I did was to access the data loaded into my section of interest.
Specifically, I did:
# edit: added packages required
library(xml2)
library(rvest)
library(stringi)
# original code
tmp_url <- "http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US"
pg <- read_html(tmp_url)
tmp <- length(html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]"))
html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]") %>%
html_text() %>%
stri_split_lines() %>%
.[[1]] -> js_lines
idx <- which(stri_detect_fixed(js_lines, '\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t='))
start <- nchar("\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t=")+1
id <- substr(js_lines[idx],start, start+9)
tab <- read_html(paste0("http://financials.morningstar.com/oprn/c-managers.action?&t=",id,"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"), options = "HUGE")
The object tab contains the information I need.
What I need to do now is to create a data frame associating each manager's name with his or her education.
I could try to do this by transforming my object in a string, then extracting the characters following the word "Education".
Though, this looks extremely inefficient.
I was wondering if anyone can provide some guidance.
This thing really is a mess - nice work getting the links and downloading the info.
After poking around a lot and taking various detours, this is the best I could come up with:
Clean Up
First there is some cleanup to do. Instead of directly downloading and parsing the document in one step we will:
download the document as text
clean up the text a little to get the JSON
parse the JSON
extract the HTML item
do some further cleaning
finally parse the HTML
url <-
paste0(
"http://financials.morningstar.com/oprn/c-managers.action?&t=",
id,
"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"
)
txt <-
readLines(url, warn = FALSE)
json <-
txt %>%
gsub("^jsonp\\d+\\(", "", .) %>%
gsub("\\)$", "", .)
json_parsed <-
jsonlite::fromJSON(json)
html_clean <-
json_parsed$html %>%
gsub("\t", "", .)
html_parsed <-
read_html(html_clean)
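The same clean-up steps can be checked on a tiny made-up JSONP response, which makes it easier to see what each gsub() removes. The payload below is hypothetical; the real response is much larger and nested differently.

```r
library(rvest)

# Hypothetical JSONP response: a JSON object wrapped in a callback call
txt <- 'jsonp123({"html":"<table><tr><td>Education</td><td>M.B.A.</td></tr></table>"})'

# Strip the callback name with its opening parenthesis, then the closing one
json <- gsub("^jsonp\\d+\\(", "", txt)
json <- gsub("\\)$", "", json)

# Parse the JSON, then parse the HTML fragment stored in its `html` field
parsed <- jsonlite::fromJSON(json)
cells <- read_html(parsed$html) %>%
  html_elements("td") %>%
  html_text()

cells
```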
First Round of Node Extraction
Next we use some black magic node extraction trickery. Basically the trick goes like this: If we have a node set (the thing you get when using html_nodes) we can use further XPath queries to drill down.
The first node set (cvs) captures the basic path to the CV entries in the table.
The second node set (info_tmp) drills down a little further to get those parts of the CV entries where further information ("Other Assets Managed", "Education", ... etc) is stored.
cvs <-
html_parsed %>%
html_nodes(xpath = "/html/body/table/tbody/tr[not(@align='left')]")
info_tmp <-
cvs %>%
html_nodes(xpath = "td/table/tbody")
Building up Data.Frame 1
There is a little problem with the table. Each CV entry lives in its own table row. For name, from, to and description there is always exactly one item per CV entry, but for "Other Assets Managed", "Education", ... etc this is not true.
Therefore, information extraction is done in two parts.
df <-
cvs %>%
lapply(
FUN =
function(x){
tmp <-
x %>%
html_nodes(xpath = "th") %>%
html_text() %>%
gsub(" +", "", .)
data.frame(
name = stri_extract(tmp, regex = "[. \\w]+"),
from = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}"),
to = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}")
)
}
) %>%
do.call(rbind, .)
df$description <-
info_tmp %>%
html_nodes(xpath = "tr[1]/td[1]") %>%
html_text()
df$cv_id <- seq_len(nrow(df))
Building Up Data.Frame 2
Now some more html nodes trickery ... If we apply html_nodes() to the result set of a previous html_nodes() call, we get all matching nodes and none of the non-matching ones. This is a problem, since we might get 1, 0 or multiple nodes per node in the node set, which destroys any information about where the newly selected nodes came from.
There is a solution, however: we can use lapply to query each element of a node set independently of the others, thereby preserving information about the original structure.
extract_key_value_pairs <-
function(i, info_tmp){
cv_id <-
seq_along(info_tmp)
key <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[1]")) %>%
html_text()
if ( length(tmp) == 0 ) {
return("")
}else{
return(tmp)
}
}
)
value <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[2]")) %>%
html_text() %>%
stri_trim_both() %>%
stri_split(fixed = "\n") %>%
lapply(X = ., stri_trim_both)
if ( length(tmp) == 0 ) {
return("")
}else{
return(unlist(tmp))
}
}
)
df <-
mapply(
cv_id = cv_id,
key = key,
value = value,
FUN =
function(cv_id, key, value){
data.frame(
cv_id = cv_id,
key = key,
value = value
)
},
SIMPLIFY = FALSE
) %>%
do.call(rbind, .)
df[df$key != "",]
}
df2 <-
lapply(
X = c(3, 5, 7),
FUN = extract_key_value_pairs,
info_tmp = info_tmp
) %>%
do.call(rbind, .)
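The difference between querying a whole node set and querying each node with lapply is easier to see on a small example. The table below is made up; the second row deliberately has no cells.

```r
library(rvest)

# Hypothetical table: row 1 has two cells, row 2 none, row 3 one
doc <- read_html("<table>
  <tr><td>a1</td><td>a2</td></tr>
  <tr></tr>
  <tr><td>c1</td></tr>
</table>")

rows <- html_elements(doc, "tr")

# Querying the whole node set flattens the result:
# there is no way to tell which row each cell came from
flat <- rows %>%
  html_elements(xpath = "td") %>%
  html_text()

# Querying row by row keeps one list element per row,
# including an empty one for the row without cells
per_row <- lapply(rows, function(r) html_text(html_elements(r, xpath = "td")))

flat
lengths(per_row)
```

`flat` has three entries with no row information, while `per_row` has one element per row, so the row index can be carried along as an id (as cv_id is above).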
Results
df
## name from to description cv_id
## 1 Kurt J. Lauber 03/20/2013 03/20/2013 Mr. Lauber ... 1
## 2 Noah J. Monsen 02/28/2018 02/28/2018 Mr. Monsen ... 2
## 3 Lauri Brunner 09/30/2018 09/30/2018 Ms. Brunne ... 3
## 4 Darren M. Bagwell 02/29/2016 02/29/2016 Darren M. ... 4
## 5 David C. Francis 10/07/2011 10/07/2011 Francis is ... 5
## 6 Michael A. Binger 04/14/2010 04/14/2010 Binger has ... 6
## 7 David E. Heupel 04/14/2010 04/14/2010 Mr. Heupel ... 7
## 8 Matthew D. Finn 03/30/2007 03/30/2007 Mr. Finn h ... 8
## 9 Scott Vergin 03/30/2007 03/30/2007 Vergin has ... 9
## 10 Frederick L. Plautz 11/01/1995 11/01/1995 Plautz has ... 10
## 11 Clyde E. Bartter 01/01/1994 01/01/1994 Bartter is ... 11
## 12 Wayne C. Stevens 01/01/1994 01/01/1994 Stevens is ... 12
## 13 Julian C. Ball 07/16/1987 07/16/1987 Ball is a ... 13
df2
## cv_id key value
## 1 Other Assets Managed
## 2 Other Assets Managed
## 3 Other Assets Managed
## 4 Certification CFA
## 4 Other Assets Managed
## 5 Certification CFA
## 5 Education M.B.A. University of Pittsburgh, 1978
## 5 Education B.A. University of Pittsburgh, 1977
## 5 Other Assets Managed
## 6 Certification CFA
## 6 Education M.B.A. University of Minnesota, 1991
## 6 Education B.S. University of Minnesota, 1987
## 6 Other Assets Managed
## 7 Other Assets Managed
## 8 Certification CFA
## 8 Education B.A. University of Pennsylvania, 1984
## 8 Education M.B.A. University of Michigan, 1990
## 8 Other Assets Managed
## 9 Certification CFA
## 9 Education M.B.A. University of Minnesota, 1980
## 9 Education B.A. St. Olaf College, 1976
## 9 Other Assets Managed
## 10 Education M.S. University of Wisconsin, 1981
## 10 Education B.B.A. University of Wisconsin, 1979
## 10 Other Assets Managed
## 11 Certification CFA
## 11 Education M.B.A. Western Reserve University, 1964
## 11 Education B.A. Baldwin-Wallace College, 1953
## 11 Other Assets Managed
## 12 Certification CFA
## 12 Education M.B.A. University of Wisconsin,
## 12 Education B.B.A. University of Miami,
## 12 Other Assets Managed
## 13 Certification CFA
## 13 Education B.A. Kent State University, 1974
## 13 Education J.D. Cleveland State University, 1984
## 13 Other Assets Managed
I don't have a solution, as this is not an area I have worked with before. However, with brute force you can probably get the table, assuming you have a list of rules that can parse the text to a data frame.
Thought I'd share what I have though
# get the text
f <- xml_text(tab)
# split up, this bit is tricky..
split_f <- strsplit(f, split="\\\\t", perl=TRUE)[[1]]
split_f <- strsplit(split_f, split="\\\\n", perl=TRUE)
split_f <- unlist(split_f)
split_f <- trimws(split_f)
# find ones to remove
sort(table(split_f), decreasing = TRUE)[1:5]
split_f <- split_f[split_f!="—"]
split_f <- split_f[split_f!=""]
# manually found where to split
keep <- split_f[2:108]
# text looks ok, but would need rules to extract the rows in to a data.frame
View(keep)

Web-scraping: No matching CSS selector

I just got into web scraping and I'm trying to scrape the data from this web page : https://www.warsofninja.eu/index.php
More precisely I'm trying to get one of the tables. The problem is, the data in that table are not structured in a way that suits my web scraping knowledge right now, so I need your help. I've tried with rvest package from R, but I finally chose the UIpath studio solution, which seemed to be a quicker way to reach my objective. Here's a screenshot of the code of that page, with the element of interest highlighted :
I can't select the text "à pillé" on its own and make it a variable or a column in the output table that I want. What's the trick here? How am I supposed to do that? I looked all over the web for an answer and didn't find anything... I hope my question is understandable.
Have a look at vignette("selectorgadget") from the rvest package. This is a good start. You can then get, for example, the Top 10 Village table with this code:
library(tidyverse)
library(rvest, warn.conflicts = FALSE)
#> Loading required package: xml2
url <- "https://www.warsofninja.eu/index.php"
top_10_village <- read_html(url) %>%
html_nodes(".col:nth-child(1) td") %>%
html_text()
tibble(`#` = 1:10,
Village = top_10_village[seq(1, length(top_10_village), 2)],
Habitants = top_10_village[seq(2, length(top_10_village), 2)])
#> # A tibble: 10 x 3
#> `#` Village Habitants
#> <int> <chr> <chr>
#> 1 1 Number 1 455
#> 2 2 Beaumanoir 448
#> 3 3 L'Astra 446
#> 4 4 Yolo Land 438
#> 5 5 Sexonthebeach 430
#> 6 6 -.- 429
#> 7 7 Konoha- 427
#> 8 8 yuei 410
#> 9 9 Memen 409
#> 10 10 Moulin Huon 408
Created on 2018-09-22 by the reprex package (v0.2.1)
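The odd/even indexing above works because the selector returns the cells row by row as one flat vector. An alternative sketch of the same reshaping, assuming the cells always come in (village, habitants) pairs, is to fill a two-column matrix by row. The values below are copied from the first three rows of the output above, so no request is needed to try it.

```r
# Flat vector of cell texts, as html_text() would return them
cells <- c("Number 1", "455", "Beaumanoir", "448", "L'Astra", "446")

# Fill row by row: column 1 = village name, column 2 = habitants
m <- matrix(cells, ncol = 2, byrow = TRUE)

top <- data.frame(
  rank      = seq_len(nrow(m)),
  Village   = m[, 1],
  Habitants = as.integer(m[, 2])  # convert counts from text to integer
)

top
```

Converting Habitants to integer makes sorting and arithmetic behave as expected, which the character column in the tibble above would not.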