I just got into web scraping and I'm trying to scrape the data from this web page: https://www.warsofninja.eu/index.php
More precisely, I'm trying to get one of the tables. The problem is that the data in that table are not structured in a way that suits my web scraping knowledge right now, so I need your help. I tried the rvest package from R, but I finally chose the UiPath Studio solution, which seemed to be a quicker way to reach my objective. Here's a screenshot of the code of that page, with the element of interest highlighted:
(screenshot of the page source omitted)
I can't select the text "à pillé" on its own and make it a variable or a column in the output table that I want. What's the trick here? How am I supposed to do that? I looked all over the web for an answer and didn't find anything. I hope my question is understandable.
Have a look at vignette("selectorgadget") from the rvest package; it's a good start. You can then get, for example, the Top 10 Village table with this code:
library(tidyverse)
library(rvest, warn.conflicts = FALSE)
#> Loading required package: xml2
url <- "https://www.warsofninja.eu/index.php"
top_10_village <- read_html(url) %>%
  html_nodes(".col:nth-child(1) td") %>%
  html_text()

tibble(`#` = 1:10,
       Village = top_10_village[seq(1, length(top_10_village), 2)],
       Habitants = top_10_village[seq(2, length(top_10_village), 2)])
#> # A tibble: 10 x 3
#> `#` Village Habitants
#> <int> <chr> <chr>
#> 1 1 Number 1 455
#> 2 2 Beaumanoir 448
#> 3 3 L'Astra 446
#> 4 4 Yolo Land 438
#> 5 5 Sexonthebeach 430
#> 6 6 -.- 429
#> 7 7 Konoha- 427
#> 8 8 yuei 410
#> 9 9 Memen 409
#> 10 10 Moulin Huon 408
Created on 2018-09-22 by the reprex package (v0.2.1)
I am trying to send search requests with rvest, but I always get the same error. I have tried several approaches, including this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following:
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(url)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
  set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I always get the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the solution is to edit the attributes of those buttons. Maybe someone has the answer to this. Thanks in advance.
An alternative method with httr2
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
title = data %>%
html_elements(".document-title") %>%
html_text2(),
report_title = data %>%
html_elements(".info") %>%
html_text2() %>%
str_remove_all("\r") %>%
str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹report_title
Created on 2023-01-15 with reprex v2.0.2
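If you'd rather stay with the rvest session approach from the question, the error itself points at the problem: the <form> on that page has no action attribute, so rvest's submission_build() doesn't know where to send the request. A minimal sketch that sets the action manually before submitting (untested against the live site; the field name is taken from the question, and the assumption is that the endpoint accepts a submission to the same URL):
library(rvest)

url  <- "https://www.saferproducts.gov/PublicSearch"
sess <- session(url)

form <- sess %>% html_element("form") %>% html_form()
form <- html_form_set(form, "searchParameter.AdvancedKeyword" = "amazon")

# The form has no action attribute, so point it back at the page URL
# (assumption: the site accepts the submission at the same address).
form$action <- url

resp <- session_submit(sess, form)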
I am working with the R programming language. I am trying to scrape the following website: https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario
I tried the code below:
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario"
page <- read_html(url)

# find the div tag of class=one_third
b <- page %>% html_nodes("li")
This seems to have produced some result, but I am not sure what to do with this.
Ideally, I would like the final results to look something like this:
id names
1 Aberdeen
2 Grey County
3 Aberdeen
4 Prescott and Russell County
5 Aberfeldy
... ...
6 Babys Point
7 Baddow
8 Baden
... ......
Can someone please show me how to do this?
Thanks!
You can find the appropriate names as anchor elements nested within list elements, using CSS or XPath selectors. Then extract the text with html_text(). Here's a full reprex:
library(rvest)
result <- "https://en.wikipedia.org/wiki/" %>%
paste0("List_of_unincorporated_communities_in_Ontario") %>%
read_html %>%
html_elements(xpath = '//ul/li/a') %>%
html_text() %>%
`[`(-(1:29)) %>%
as.data.frame() %>%
setNames('Community')
head(result, 10)
#> Community
#> 1 10th Line Shore
#> 2 Aberdeen, Grey County
#> 3 Aberdeen, Prescott and Russell County
#> 4 Aberfeldy
#> 5 Aberfoyle
#> 6 Abingdon
#> 7 Abitibi 70
#> 8 Abitibi Canyon
#> 9 Aboyne
#> 10 Acanthus
Created on 2022-09-15 with reprex v2.0.2
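If you also want an explicit id column, as in the desired output shown in the question, you can simply number the rows afterwards, e.g. with dplyr:
library(dplyr)

result %>%
  mutate(id = row_number()) %>%
  select(id, Community) %>%
  head()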
I'm trying to pull all nested links from the webpage below. My code below returns an empty character vector.
page1 <- "https://thrivemarket.com/c/condiments-sauces?cur_page=1"
page1 <- read_html(page1)
page1_body <- page1 %>%
html_node("body") %>%
html_children()
page1_urls <- page1 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(#class, 'd85qmy-0 kRbsKs')]") %>%
rvest::html_attr('href')
Thank you in advance for your help with this.
Best,
~Mayra
The links you are looking for do not exist in the HTML document you are reading with read_html. When you look at the page in a browser, that HTML document contains JavaScript code, which your browser runs. Some of that JavaScript causes your browser to download further information that is then inserted into the page you see.
In your case, the extra information you are looking for comes back as a JSON response, which you can obtain and parse as follows:
library(httr)
library(dplyr)

url <- paste0("https://thrivemarket.com/api/v1/products",
              "?page_size=60&multifilter=1&cur_page=1")

content(GET(url))$products %>%
  lapply(function(x) data.frame(product = x$title, url = x$url)) %>%
  bind_rows() %>%
  as_tibble()
#> # A tibble: 60 x 2
#> product url
#> <chr> <chr>
#> 1 Organic Extra Virgin Olive Oil https://thrivemarket.com/p/~
#> 2 Grass-Fed Collagen Peptides https://thrivemarket.com/p/~
#> 3 Grass-Fed Beef Sticks, Original https://thrivemarket.com/p/~
#> 4 Organic Dry Roasted & Salted Cashews https://thrivemarket.com/p/~
#> 5 Organic Vanilla Extract https://thrivemarket.com/p/~
#> 6 Organic Raw Cashews https://thrivemarket.com/p/~
#> 7 Organic Coconut Milk, Regular https://thrivemarket.com/p/~
#> 8 Organic Robust Maple Syrup, Grade A, Value Size https://thrivemarket.com/p/~
#> 9 Organic Coconut Water https://thrivemarket.com/p/~
#> 10 Non-GMO Avocado Oil Potato Chips, Himalayan Salt https://thrivemarket.com/p/~
#> # ... with 50 more rows
Created on 2022-06-04 by the reprex package (v2.0.1)
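If you need more than the first 60 products, the same endpoint can be called repeatedly by varying the cur_page parameter that already appears in the URL. A sketch (it assumes the API keeps accepting higher page numbers in the same format):
library(httr)
library(dplyr)

# Helper that fetches one page of the products API
get_page <- function(p) {
  url <- paste0("https://thrivemarket.com/api/v1/products",
                "?page_size=60&multifilter=1&cur_page=", p)
  content(GET(url))$products %>%
    lapply(function(x) data.frame(product = x$title, url = x$url)) %>%
    bind_rows()
}

# First three pages, combined into one tibble
all_products <- bind_rows(lapply(1:3, get_page)) %>% as_tibble()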
Disclaimer: I'm not a programmer by trade and my knowledge of R is limited, to say the least. I've also already searched Stack Overflow for a solution (but to no avail).
Here's my situation: I need to scrape a series of webpages and save the data (not quite sure in what format, but I'll get to that). Fortunately the pages I need to scrape have a very logical naming structure (they use the date).
The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p
I need to scrape everything from August 1st 2018 (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) until yesterday (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).
So far I've figured out how to create a list of dates which can be appended to the base URL using the following:
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
I can append these to the base URL with the following:
url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)
However, that's pretty much as far as I've gotten (not very far, I know!). I assume I need to use a for loop, but my own attempts at this have proved futile. Perhaps I'm not approaching this the right way?
In case it's not clear, what I'm trying to do is to visit each URL and save the html as an individual html file (ideally labelled with the relevant date). In truth, I don't need all of the html (just the list of programmes and times) but I can extract that information from the relevant files at a later date.
Any guidance on the best way to approach this would be much appreciated! And if you need any more info, just ask.
Have a look at the rvest package and associated tutorials. E.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.
The messy part is extracting the fields the way you want them.
Here is one possible solution:
library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)
get_data <- function(url){
  html <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(html)) return(data.table(
    date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
    title = NA, description = NA)) else {
    time <- html %>%
      rvest::html_nodes('body') %>%
      xml2::xml_find_all("//div[contains(@class, 'broadcast__info grid 1/4 1/6@bpb2 1/6@bpw')]") %>%
      rvest::html_text() %>% gsub(".*([0-9]{2}.[0-9]{2}).*", "\\1", .)
    text <- html %>%
      rvest::html_nodes('body') %>%
      xml2::xml_find_all("//div[contains(@class, 'programme__body')]") %>%
      rvest::html_text() %>%
      gsub("[ ]{2,}", " ", .) %>% gsub("[\n|\n ]{2,}", "\n", .) %>%
      gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>%
      gsub("^\n|\n$", "", .) %>%
      str_split_fixed(., "\n", 2) %>%
      as.data.table() %>% setnames(., c("title", "description")) %>%
      .[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
               time = time,
               description = gsub("\n", " ", description))] %>%
      setcolorder(., c("date", "time", "title", "description"))
    text
  }
}
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#> date time
#> 1: 2018/08/01 06:00
#> 2: 2018/08/01 09:15
#> 3: 2018/08/01 10:00
#> 4: 2018/08/01 11:00
#> 5: 2018/08/01 11:45
#> ---
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#> title
#> 1: Breakfast—01/08/2018
#> 2: Wanted Down Under—Series 11, Hanson Family
#> 3: Homes Under the Hammer—Series 21, Episode 6
#> 4: Fake Britain—Series 7, Episode 7
#> 5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#> ---
#> 16760: BBC London—Late News, 17/05/2020
#> 16761: Educating Rita
#> 16762: The Real Marigold Hotel—Series 4, Episode 2
#> 16763: Weather for the Week Ahead—18/05/2020
#> 16764: Joins BBC News—18/05/2020
#> description
#> 1: The latest news, sport, business and weather from the BBC's Breakfast team.
#> 2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#> 3: Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#> 4: 7/10 The fake sports memorabilia that cost collectors thousands. (R)
#> 5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#> ---
#> 16760: The latest news, sport and weather from London.
#> 16761: Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762: 2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763: Detailed weather forecast.
#> 16764: BBC One joins the BBC's rolling news channel for a night of news.
Created on 2020-05-18 by the reprex package (v0.3.0)
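The question also asked about saving each page's raw HTML to disk, labelled with the relevant date. A minimal sketch of that loop, using xml2::write_html (the output directory name and the one-second pause are just suggestions):
library(xml2)

dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by = 1)
urls  <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", format(dates, "%Y/%m/%d"))

dir.create("bbc_schedules", showWarnings = FALSE)

for (i in seq_along(urls)) {
  page <- tryCatch(read_html(urls[i]), error = function(e) NULL)
  if (!is.null(page)) {
    # e.g. bbc_schedules/2018-08-01.html
    write_html(page, file.path("bbc_schedules", paste0(dates[i], ".html")))
  }
  Sys.sleep(1)  # be polite to the server
}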
I'm trying to read the HTML data regarding Greyhound bus timings. An example can be found here. I'm mainly concerned with getting the schedule and status data off the table, but when I execute the following code:
library(XML)
url<-"http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
greyhound<-readHTMLTable(url)
greyhound<-greyhound[[2]]
This just produces a table of unrelated data (screenshot omitted). I'm not sure why it's grabbing data that's not even on the page, as opposed to the schedule and status table I'm actually after.
You cannot retrieve the data using readHTMLTable because the trip results are delivered by a JavaScript script, so you need to select that script and parse it to extract the right information.
Here is a solution that does this:
Extract the JavaScript script that contains the JSON data
Extract the JSON data from the script using a regular expression
Parse the JSON data into an R list
Reshape the resulting list into a table (a data.table here)
The code may look short, but it is really compact (it took me an hour to produce it)!
library(XML)
library(httr)
library(jsonlite)
library(data.table)

# url as defined in the question above
dc <- htmlParse(GET(url))
script <- xpathSApply(dc, "//script/text()", xmlValue)[[5]]
res <- strsplit(script, "stopArray.push({", fixed = TRUE)[[1]][-1]

dcast(point ~ name,
      data = rbindlist(
        Map(function(x, y){
          x <- paste('{', sub(');|);.*docum.*', "", x))
          dx <- unlist(fromJSON(x))
          data.frame(point = y, name = names(dx), value = dx)
        }, res, seq_along(res)),
        fill = TRUE)[name != "polyline"])
The resulting table:
point category direction id lat linkName lon
1: 1 2 empty 562310 41.878589630127 Chicago_Amtrak_IL -87.6398544311523
2: 2 2 empty 560252 41.8748474121094 Chicago_IL -87.6435165405273
3: 3 1 empty 561627 41.7223281860352 Chicago_95th_&_Dan_Ryan_IL -87.6247329711914
4: 4 2 empty 260337 41.6039199829102 Gary_IN -87.3386917114258
5: 5 1 empty 260447 40.4209785461426 Lafayette_e_IN -86.8942031860352
6: 6 2 empty 260392 39.7617835998535 Indianapolis_IN -86.161018371582
7: 7 2 empty 250305 39.1079406738281 Cincinnati_OH -84.5041427612305
name shortName ticketName
1: Chicago Amtrak: 225 S Canal St, IL 60606 Chicago Amtrak, IL CHD
2: Chicago: 630 W Harrison St, IL 60607 Chicago, IL CHD
3: Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628 Chicago 95th & Dan Ryan, IL CHD
4: Gary: 100 W 4th Ave, IN 46402 Gary, IN GRY
5: Lafayette (e): 401 N 3rd St, IN 47901 Lafayette (e), IN XIN
6: Indianapolis: 350 S Illinois St, IN 46225 Indianapolis, IN IND
7: Cincinnati: 1005 Gilbert Ave, OH 45202 Cincinnati, OH CIN
As @agstudy notes, the data is rendered into the page by JavaScript; it is not delivered in the HTML that comes directly from the server. Therefore, you can either (a) use something like RSelenium to scrape the rendered content, or (b) extract the data from the <script> tags that contain it.
To explain @agstudy's work, we observe that the data is contained in a series of stopArray.push() commands in one of the (many) script tags. For example:
stopArray.push({
"id" : "562310",
"name" : "Chicago Amtrak: 225 S Canal St, IL 60606",
"shortName" : "Chicago Amtrak, IL",
"ticketName" : "CHD",
"category" : 2,
"linkName" : "Chicago_Amtrak_IL",
"direction" : "empty",
"lat" : 41.87858963012695,
"lon" : -87.63985443115234,
"polyline" : "elr~Fnb|uOmC##nG?XBdH#rC?f#?P?V#`AlAAn#A`CCzBC~BE|CEdCA^Ap#A"
});
Now, this is json data contained inside each function call. I tend to think that if someone has gone to the work of formatting data in a machine-readable format, well golly we should appreciate it!
The tidyverse approach to this problem is as follows:
Download the page using the rvest package.
Identify the appropriate script tag to use by employing an xpath expression that searches for all script tags that contain the string url =.
Use a regular expression to pull out everything inside each stopArray.push() call.
Fix the formatting of the resulting object by (a) separating each block with commas, (b) surrounding the string by [] to indicate a json list.
Use jsonlite::fromJSON to convert into a data.frame.
Note that I hide the polyline column near the end, since it's too large to preview appropriately.
library(tidyverse)
library(rvest)
library(stringr)
library(jsonlite)
url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
page <- read_html(url)
page %>%
  html_nodes(xpath = '//script[contains(text(), "url = ")]') %>%
  html_text() %>%
  str_extract_all(regex("(?<=stopArray.push\\().+?(?=\\);)", multiline = T, dotall = T), F) %>%
  unlist() %>%
  paste(collapse = ",") %>%
  sprintf("[%s]", .) %>%
  fromJSON() %>%
  select(-polyline) %>%
  head()
#> id name
#> 1 562310 Chicago Amtrak: 225 S Canal St, IL 60606
#> 2 560252 Chicago: 630 W Harrison St, IL 60607
#> 3 561627 Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628
#> 4 260337 Gary: 100 W 4th Ave, IN 46402
#> 5 260447 Lafayette (e): 401 N 3rd St, IN 47901
#> 6 260392 Indianapolis: 350 S Illinois St, IN 46225
#> shortName ticketName category
#> 1 Chicago Amtrak, IL CHD 2
#> 2 Chicago, IL CHD 2
#> 3 Chicago 95th & Dan Ryan, IL CHD 1
#> 4 Gary, IN GRY 2
#> 5 Lafayette (e), IN XIN 1
#> 6 Indianapolis, IN IND 2
#> linkName direction lat lon
#> 1 Chicago_Amtrak_IL empty 41.87859 -87.63985
#> 2 Chicago_IL empty 41.87485 -87.64352
#> 3 Chicago_95th_&_Dan_Ryan_IL empty 41.72233 -87.62473
#> 4 Gary_IN empty 41.60392 -87.33869
#> 5 Lafayette_e_IN empty 40.42098 -86.89420
#> 6 Indianapolis_IN empty 39.76178 -86.16102