R: Webscraping a List From Wikipedia - html

I am working with the R programming language. I am trying to scrape the following website: https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario
I tried the code below:
library(rvest)
url<-"https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario"
page <-read_html(url)
# select every list item (<li>) on the page
b = page %>% html_nodes("li")
This seems to have produced some result, but I am not sure what to do with this.
Ideally, I would like the final results to look something like this:
id names
1 Aberdeen
2 Grey County
3 Aberdeen
4 Prescott and Russell County
5 Aberfeldy
... ...
6 Babys Point
7 Baddow
8 Baden
... ......
Can someone please show me how to do this?
Thanks!

You can find the appropriate names as anchor elements nested within list elements using css or xpath selectors. Then, extract these using html_text. Here's a full reprex:
library(rvest)
result <- "https://en.wikipedia.org/wiki/" %>%
paste0("List_of_unincorporated_communities_in_Ontario") %>%
read_html %>%
html_elements(xpath = '//ul/li/a') %>%
html_text() %>%
`[`(-(1:29)) %>%
as.data.frame() %>%
setNames('Community')
head(result, 10)
#> Community
#> 1 10th Line Shore
#> 2 Aberdeen, Grey County
#> 3 Aberdeen, Prescott and Russell County
#> 4 Aberfeldy
#> 5 Aberfoyle
#> 6 Abingdon
#> 7 Abitibi 70
#> 8 Abitibi Canyon
#> 9 Aboyne
#> 10 Acanthus
Created on 2022-09-15 with reprex v2.0.2
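The answer mentions that CSS selectors work as well as XPath here. As a minimal offline sketch (the HTML snippet is made up for illustration and is not the Wikipedia page), the CSS selector `ul > li > a` picks the same anchors as `//ul/li/a`, and an `id` column like the one requested in the question can be added with `seq_along()`:

```r
library(rvest)

# Made-up HTML standing in for the real page (assumption, illustration only)
page <- minimal_html('
  <ul>
    <li><a href="/wiki/Aberfeldy">Aberfeldy</a></li>
    <li><a href="/wiki/Aberfoyle">Aberfoyle</a></li>
    <li>No link here</li>
  </ul>')

# The CSS child combinators mirror the XPath steps in //ul/li/a
css_names   <- page %>% html_elements("ul > li > a") %>% html_text()
xpath_names <- page %>% html_elements(xpath = "//ul/li/a") %>% html_text()

# Build the id/names layout asked for in the question
result <- data.frame(id = seq_along(css_names), names = css_names)
result
```

On the real page, the leading navigation links would still need to be dropped, as the answer does with its negative index.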

Related

rvest error on form submission "`Form` doesn't contain a `action` attribute"

I am trying to send search requests with rvest, but I always get the same error. I have tried several approaches, including this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following.
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(url)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I always get the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the solution is to edit the attributes of those buttons. Maybe someone has the answer to this. Thanks in advance.
An alternative method with httr2
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
title = data %>%
html_elements(".document-title") %>%
html_text2(),
report_title = data %>%
html_elements(".info") %>%
html_text2() %>%
str_remove_all("\r") %>%
str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹​report_title
Created on 2023-01-15 with reprex v2.0.2
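The error message itself says the parsed form has no `action`. One possible workaround within rvest, which the answer above does not cover and which is untested against saferproducts.gov, is to set `action` by hand before submitting. A minimal offline sketch follows; the form markup, the field name `q` and the endpoint URL are all hypothetical:

```r
library(rvest)

# Made-up form without an action attribute (assumption: mimics the problem page)
page <- minimal_html('
  <form method="post">
    <input type="text" name="q" value="">
    <input type="submit" name="go" value="Search">
  </form>')

form <- html_form(page)[[1]]
form$action   # missing, which is what the submission step complains about

# Assumed workaround: point the form at the URL it should be submitted to
form$action <- "https://example.com/search"   # hypothetical endpoint
form <- html_form_set(form, q = "amazon")
form$fields$q$value
```

With a live session, the patched form would then be passed to `session_submit()`. Whether the server accepts it depends on how the site handles the search, which is why the httr2 route above may be simpler.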

How can I pull all nested URLs from a webpage using rvest and xml2?

I'm trying to pull all nested links from the webpage below. My code below returns an empty character vector.
page1 <- "https://thrivemarket.com/c/condiments-sauces?cur_page=1"
page1 <- read_html(page1)
page1_body <- page1 %>%
html_node("body") %>%
html_children()
page1_urls <- page1 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(@class, 'd85qmy-0 kRbsKs')]") %>%
rvest::html_attr('href')
Thank you in advance for your help with this.
Best,
~Mayra
The links you are looking for do not exist in the HTML document you are reading with read_html. When you look at the page in a browser, the HTML document contains JavaScript code, which your browser runs. Some of this JavaScript code causes your browser to download further information, which is then inserted into the web page you see.
In your case, the extra information you are looking for is in the form of a JSON file, which you can obtain and parse as follows:
library(httr)
library(dplyr)
url <- paste0("https://thrivemarket.com/api/v1/products",
"?page_size=60&multifilter=1&cur_page=1")
content(GET(url))$products %>%
lapply(function(x) data.frame(product = x$title, url = x$url)) %>%
bind_rows() %>%
as_tibble()
#> # A tibble: 60 x 2
#> product url
#> <chr> <chr>
#> 1 Organic Extra Virgin Olive Oil https://thrivemarket.com/p/~
#> 2 Grass-Fed Collagen Peptides https://thrivemarket.com/p/~
#> 3 Grass-Fed Beef Sticks, Original https://thrivemarket.com/p/~
#> 4 Organic Dry Roasted & Salted Cashews https://thrivemarket.com/p/~
#> 5 Organic Vanilla Extract https://thrivemarket.com/p/~
#> 6 Organic Raw Cashews https://thrivemarket.com/p/~
#> 7 Organic Coconut Milk, Regular https://thrivemarket.com/p/~
#> 8 Organic Robust Maple Syrup, Grade A, Value Size https://thrivemarket.com/p/~
#> 9 Organic Coconut Water https://thrivemarket.com/p/~
#> 10 Non-GMO Avocado Oil Potato Chips, Himalayan Salt https://thrivemarket.com/p/~
#> # ... with 50 more rows
Created on 2022-06-04 by the reprex package (v2.0.1)
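To see how a JSON response of that shape becomes a table without touching the network, here is a small sketch using jsonlite as an alternative to the `lapply()` step. The payload is made up; only the field names `products`, `title` and `url` are taken from the answer's code:

```r
library(jsonlite)

# Made-up payload shaped like the response the answer parses (assumption)
json <- '{"products": [
  {"title": "Olive Oil",   "url": "https://example.com/p/olive-oil"},
  {"title": "Maple Syrup", "url": "https://example.com/p/maple-syrup"}
]}'

# An array of flat objects is simplified to a data frame by default
products <- fromJSON(json)$products
products[, c("title", "url")]
```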

R - Scrape a number of URLs and save individually

Disclaimer: I'm not a programmer by trade and my knowledge of R is limited, to say the least. I've also already searched Stack Overflow for a solution (but to no avail).
Here's my situation: I need to scrape a series of webpages and save the data (not quite sure in what format, but I'll get to that). Fortunately the pages I need to scrape have a very logical naming structure (they use the date).
The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p
I need to scrape everything from August 1st 2018 (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) until yesterday (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).
So far I've figured out to create a list of dates which can be appended to the base URL using the following:
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
I can append these to the base URL with the following:
url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)
However, that's pretty much as far as I've gotten (not very far, I know!) I assume I need to use a for loop but my own attempts at this have proved futile. Perhaps I'm not approaching this the right way?
In case it's not clear, what I'm trying to do is to visit each URL and save the html as an individual html file (ideally labelled with the relevant date). In truth, I don't need all of the html (just the list of programmes and times) but I can extract that information from the relevant files at a later date.
Any guidance on the best way to approach this would be much appreciated! And if you need any more info, just ask.
Have a look at the rvest package and associated tutorials. E.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.
The messy part is extracting the fields the way you want them.
Here is one possible solution:
library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)
get_data <- function(url){
html <- tryCatch(read_html(url), error=function(e) NULL)
if(is.null(html)) return(data.table(
date=gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
title=NA, description=NA)) else {
time <- html %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(@class, 'broadcast__info grid 1/4 1/6@bpb2 1/6@bpw')]") %>%
rvest::html_text() %>% gsub(".*([0-9]{2}.[0-9]{2}).*", "\\1", .)
text <- html %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(@class, 'programme__body')]") %>%
rvest::html_text() %>%
gsub("[ ]{2,}", " ", .) %>% gsub("[\n|\n ]{2,}", "\n", .) %>%
gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>%
gsub("^\n|\n$", "", .) %>%
str_split_fixed(., "\n", 2) %>%
as.data.table() %>% setnames(., c("title", "description")) %>%
.[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
time = time,
description = gsub("\n", " ", description))] %>%
setcolorder(., c("date", "time", "title", "description"))
text
}
}
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#> date time
#> 1: 2018/08/01 06:00
#> 2: 2018/08/01 09:15
#> 3: 2018/08/01 10:00
#> 4: 2018/08/01 11:00
#> 5: 2018/08/01 11:45
#> ---
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#> title
#> 1: Breakfast—01/08/2018
#> 2: Wanted Down Under—Series 11, Hanson Family
#> 3: Homes Under the Hammer—Series 21, Episode 6
#> 4: Fake Britain—Series 7, Episode 7
#> 5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#> ---
#> 16760: BBC London—Late News, 17/05/2020
#> 16761: Educating Rita
#> 16762: The Real Marigold Hotel—Series 4, Episode 2
#> 16763: Weather for the Week Ahead—18/05/2020
#> 16764: Joins BBC News—18/05/2020
#> description
#> 1: The latest news, sport, business and weather from the BBC's Breakfast team.
#> 2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#> 3: Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#> 4: 7/10 The fake sports memorabilia that cost collectors thousands. (R)
#> 5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#> ---
#> 16760: The latest news, sport and weather from London.
#> 16761: Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762: 2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763: Detailed weather forecast.
#> 16764: BBC One joins the BBC's rolling news channel for a night of news.
Created on 2020-05-18 by the reprex package (v0.3.0)
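The question also asked how to save each page as its own HTML file labelled with the date, which the solution above does not do. A minimal sketch using base R's `download.file()` is shown here. The helper `save_pages()`, the output folder and the one-second pause are assumptions for illustration, and only three dates are used to keep it short:

```r
base_url <- "https://www.bbc.co.uk/schedules/p00fzl6p/"
dates    <- seq(as.Date("2018-08-01"), as.Date("2018-08-03"), by = "day")

# One URL and one date-labelled file name per day
urls    <- paste0(base_url, format(dates, "%Y/%m/%d"))
out_dir <- file.path(tempdir(), "schedules")   # hypothetical output folder
files   <- file.path(out_dir, paste0(format(dates, "%Y-%m-%d"), ".html"))

# Hypothetical helper: download each page, report and skip any that fail
save_pages <- function(urls, files, pause = 1) {
  dir.create(dirname(files[1]), showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(urls)) {
    tryCatch(
      download.file(urls[i], files[i], quiet = TRUE),
      error = function(e) message("Skipping ", urls[i], ": ", conditionMessage(e))
    )
    Sys.sleep(pause)   # be polite to the server
  }
  invisible(files)
}

# save_pages(urls, files)   # uncomment to actually download
```

The saved files can later be read back with `read_html()` and parsed in the same way as the live pages.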

Web-scraping: No matching CSS selector

I just got into web scraping and I'm trying to scrape the data from this web page: https://www.warsofninja.eu/index.php
More precisely, I'm trying to get one of the tables. The problem is that the data in that table are not structured in a way that suits my current web scraping knowledge, so I need your help. I tried the rvest package for R, but I finally chose the UiPath Studio solution, which seemed to be a quicker way to reach my objective. Here's a screenshot of the code of that page, with the element of interest highlighted:
[Screenshot: page source with the element of interest highlighted]
I can't select the text "à pillé" on its own and make it a variable or a column in the output table that I want. What's the trick here? How am I supposed to do that? I looked all over the web for an answer and didn't find anything. I hope my question is understandable.
Have a look at vignette("selectorgadget") from the rvest package. This is a good start. You can then get, for example, the Top 10 Village table with this code:
library(tidyverse)
library(rvest, warn.conflicts = FALSE)
#> Loading required package: xml2
url <- "https://www.warsofninja.eu/index.php"
top_10_village <- read_html(url) %>%
html_nodes(".col:nth-child(1) td") %>%
html_text()
tibble(`#` = 1:10,
Village = top_10_village[seq(1, length(top_10_village), 2)],
Habitants = top_10_village[seq(2, length(top_10_village), 2)])
#> # A tibble: 10 x 3
#> `#` Village Habitants
#> <int> <chr> <chr>
#> 1 1 Number 1 455
#> 2 2 Beaumanoir 448
#> 3 3 L'Astra 446
#> 4 4 Yolo Land 438
#> 5 5 Sexonthebeach 430
#> 6 6 -.- 429
#> 7 7 Konoha- 427
#> 8 8 yuei 410
#> 9 9 Memen 409
#> 10 10 Moulin Huon 408
Created on 2018-09-22 by the reprex package (v0.2.1)
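If the data sit in a regular `<table>` element, rvest's `html_table()` may save the manual odd/even indexing used above. A minimal offline sketch follows. The markup is made up, and whether the real page's table has header cells like these is an assumption:

```r
library(rvest)

# Made-up table standing in for the Top 10 Village table (assumption)
page <- minimal_html('
  <table>
    <tr><th>Village</th><th>Habitants</th></tr>
    <tr><td>Number 1</td><td>455</td></tr>
    <tr><td>Beaumanoir</td><td>448</td></tr>
  </table>')

# html_table() turns rows and cells into a tibble and converts column types
villages <- page %>% html_element("table") %>% html_table()
villages
```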

Trouble using rvest on nested tables

I'm having an issue trying to get Rankings from the Freeride World Tour website.
I first tried to get a CSS selector for rvest using SelectorGadget in Chrome, but I can only get the riders and their overall score. What I'm interested in is getting the points a rider scored in each heat. I'm new to web scraping and CSS/HTML, so please bear with me.
# Get the website url
url <- read_html("https://www.freerideworldtour.com/rankings-detailed?season=165&competition=2&discipline=38")
Download everything from the page,
(all_text <- url %>%
html_nodes("div") %>%
html_text())
then look for Kristofer Turdell's first score of 2500 pts. grep("2500 pts.", all_text) but I find...nothing?
When I right-click the 2500 pts. on the website and select "Inspect" I can see that the html code for this section is:
<div class="field__item even">2500 pts.</div>
So I tried to use the div class:
url %>%
html_nodes(".field__item.even") %>%
html_text()
This only returns the overall score for the participants (e.g. Kristofer Turdell 7870 pts.).
Next, I tried using the right-click option to save Xpath from "Inspect".
url %>%
html_nodes(xpath = '//*[@id="page-content"]/div/div/div[2]/div/div/div/div[1]/div[2]/div/div/div[1]/div/div[4]/div/div/div') %>%
html_text()
I'm not having any luck on this so I'd really appreciate your help.
url %>%
html_node("div.panel-second")%>%
html_text() %>%
gsub("\\s*\\n+\\s*",";",.)%>%
gsub("pts.","\n",.)%>%
read.table(text=.,fill=T,sep=";",row.names = NULL)%>%
subset(select=3:4)%>%na.omit()
V3 V4
1 Kristofer Turdell 7870
2 Markus Eder 7320
3 Mickael Bimboes 6930
4 Loic Collomb-Patton 6660
5 Yann Rausis 6290
6 Berkeley Patterson 5860
7 Leo Slemett 5835
8 Ivan Malakhov 5800
9 Craig Murray 5705
10 Logan Pehota 5655
11 Reine Barkered 5470
12 Grifen Moller 4765
13 Sam Lee 4580
14 Ryan Faye 3210
15 Conor Pelton 3185
16 George Rodney 3115
17 Taisuke Kusunoki 3060
18 Trace Cooke 2905
19 Aymar Navarro 2855
20 Felix Wiemers 2655
21 Fabio Studer 2305
22 Stefan Hausl 2240
23 Drew Tabke 1880
24 Carl Regnér Eriksson 1310
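The pipeline above comes without explanation, so here is a step-by-step sketch of the same idea on a made-up string (the real page text is assumed to look roughly like this). Line breaks become `;` field separators, each `pts.` becomes a row break, and `read.table()` parses the result. Unlike the answer, the dot is escaped here so that it matches only a literal `.`:

```r
# Made-up text resembling what html_text() returns for the ranking block
raw <- "\n  1\n  Kristofer Turdell\n  7870 pts.\n  2\n  Markus Eder\n  7320 pts.\n"

# 1. Collapse each run of line breaks plus surrounding spaces into ";"
step1 <- gsub("\\s*\\n+\\s*", ";", raw)

# 2. Use every "pts." as the end of a row
step2 <- gsub("pts\\.", "\n", step1)

# 3. Parse as ";"-separated text; fill = TRUE pads rows that are too short
tab <- read.table(text = step2, fill = TRUE, sep = ";", row.names = NULL)

# 4. Keep the name and points columns and drop the padded leftover row
res <- na.omit(subset(tab, select = 3:4))
res
```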
Writing that much code in the comments was awful, so here goes. You can store the scraped data into a dataframe and not be limited to printing it to the console:
library(tidyverse)
library(magrittr)
library(rvest)
url_base <- "https://www.freerideworldtour.com/rider/"
riders <- c("kristofer-turdell", "markus-eder", "mickael-bimboes")
output <- tibble()  # data_frame() is deprecated in favour of tibble()
for (i in riders) {
temp <- read_html(paste0(url_base, i)) %>%
html_node("div") %>%
html_text() %>%
gsub("\\s*\\n+\\s*", ";", .) %>%
gsub("pts.", "\n", .) %>%
read.table(text = ., fill = T, sep = ";", row.names = NULL,
col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
subset(select = 2:5) %>%
dplyr::filter(
!is.na(as.numeric(as.character(Ranking))) &
as.character(Points) != ""
) %>%
dplyr::mutate(name = i)
output <- bind_rows(output, temp)
}
I put in parts such as as.character(Points) != "" to exclude the sum of points (such as Mickael Bimboes's 6930 pts) while keeping the individual scores.
Again, much of the credit goes to @Onyambu; many lines are borrowed from his answer.
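As a side note on collecting the results, growing a data frame inside a `for` loop works, but the same result can be built in one step with `lapply()`. A minimal sketch follows, where `scrape_rider()` is a hypothetical stand-in that returns dummy data instead of reading the website:

```r
# Hypothetical stand-in for the scraping step; returns one data frame per rider
scrape_rider <- function(rider) {
  data.frame(
    name   = rider,
    Events = c("Event A", "Event B"),   # dummy values, not real results
    Points = c(100, 200)
  )
}

riders <- c("kristofer-turdell", "markus-eder", "mickael-bimboes")

# Scrape every rider, then stack the per-rider data frames into one
output <- do.call(rbind, lapply(riders, scrape_rider))
output
```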