Web scraping between two tags - HTML

I am trying to scrape the following page(s):
https://mywebsite.com
In particular, I would like to get the name of each entry. I noticed that the text I am interested in (MY TEXT) always sits between these two tags: <div class="title"> MY TEXT
I know how to search for these tags individually:
# load libraries
library(rvest)
library(httr)
library(XML)
# set up page
url <- "https://www.mywebsite.com"
page <- read_html(url)

# option 1
b <- page %>% html_nodes("title")
option1 <- b %>% html_text() %>% strsplit("\\n")

# option 2
b <- page %>% html_nodes("a")
option2 <- b %>% html_text() %>% strsplit("\\n")
Is there some way I could have specified the html_nodes() argument so that it picked up "MY TEXT", i.e. scraped between <div class="title"> and </a>:
<div class="title"> MY TEXT

Scraping of pages 1:10
library(tidyverse)
library(rvest)
my_function <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
                 page_n, "?category=0&services=0&province=55&city=&k=") %>%
    read_html()
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         adress = page %>%
           html_elements(".marker") %>%
           html_text2(),
         page = page_n)
}
df <- map_dfr(1:10, my_function)

You can use the xpath argument inside html_elements to locate each a tag inside a div with class "title".
Here's a complete reproducible example.
library(rvest)
"https://www.mywebsite.ca/extension1/" %>%
paste0("2?extension2") %>%
read_html() %>%
html_elements(xpath = "//div[#class='title']/a") %>%
html_text()
Or to get all entries on the first 10 pages:
library(rvest)
unlist(lapply(1:10, function(page){
  "https://www.mywebsite.ca/extension1/" %>%
    paste0(page, "?extension2") %>%
    read_html() %>%
    html_elements(xpath = "//div[@class='title']/a") %>%
    html_text()
}))
Created on 2022-07-26 by the reprex package (v2.0.1)

Related

Can't extract href link from html_node in rvest

When I use the rvest package with an XPath and try to get the embedded links (football team names) from the site, I get an empty result. Could someone help with this?
The code is as follows:
library(rvest)
url <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
xpath <- as.character('/html/body/div[2]/div[11]/div[1]/div[2]/div[2]/div')
url %>%
  html_node(xpath = xpath) %>%
  html_attr('href')
Your absolute XPath points at a container div rather than at the <a> elements, so html_attr('href') has nothing to return. You can get all the links using:
library(rvest)
url <- 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'
url %>%
  read_html() %>%
  html_nodes('td.hauptlink a') %>%
  html_attr('href') %>%
  .[. != '#'] %>%
  paste0('https://www.transfermarkt.com', .) %>%
  unique() %>%
  head(20)
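If you also want the team names themselves (the text of those links) rather than the URLs, the same selector can be reused with html_text(). This is a hedged sketch, not verified against the live page, so treat the exact output as an assumption:
library(rvest)

url <- 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'

# same td.hauptlink a nodes as above, but keep their text instead of the href
team_names <- url %>%
  read_html() %>%
  html_nodes('td.hauptlink a') %>%
  html_text(trim = TRUE) %>%
  .[. != ''] %>%
  unique()

head(team_names, 20)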

Vivino - Scraping with R

I would like to scrape basic data about wines from Vivino. I have never done any scraping before, but based on some tutorials and a DataCamp lesson I tried some basic code using the rvest library.
However, it does not seem to work and returns zero values.
Could anyone tell me where the problem is? Is the code completely wrong and I should use some other method, or am I just missing something?
Thank you in advance for any answers!
library(rvest)
library(dplyr)
url <- 'https://www.vivino.com/explore?e=eJwNybEOQDAQBuC3ubkG4z-abMQkIqdO00RbuTbF2_OtX1A0FHyEocAPWmPIvhh7suimga5_3YHK6qXwSWmDcvHR5ZWrKDuhhF2ypbvMC5oP96QajA%3D%3D&cart_item_source=nav-explore'
web <- read_html(url)
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()
The page loads dynamically, which is why rvest alone will not work; you also need to use RSelenium.
Assuming you use Firefox, the following code should work:
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(url)

# Scroll down a couple of times to reach the bottom of the page
# so that additional data load dynamically with each scroll.
# Here I scroll 4 times, but perhaps you will need much more than that.
for (i in 1:4) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])

# close RSelenium (the taskkill call below is Windows-specific)
remDr$close()
gc()
rD$server$stop()
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
# now we can go on to our rvest code and scrape the data
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()
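If the selectors do return data after the RSelenium step, you can collect the vectors into one table. A minimal sketch, assuming every wine card yields exactly one value per selector; if some cards lack a rating the vectors will differ in length and you would need to scrape card by card instead:
library(tibble)

wines <- tibble(
  winery = winery_data,
  wine = wine_name,
  rating = wine_rating,      # still character; convert with as.numeric() if needed
  n_ratings = n_ratings
)
head(wines)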

scrape web page pagination with rvest. Pagination path does not appear in the structure

I need your help with a web scraping problem.
I am trying to scrape news from a website, but I'm having problems scraping the total number of pages.
For example, on this page I want to scrape the pagination count (166), but the pagination path is not in the site structure:
url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/'
read_html(url) %>%
  html_nodes("#wrapper .nav-links > a") %>%
  html_attr("href") %>%
  str_trim()

read_html(url) %>%
  html_nodes("#wrapper > #content > .site-content > .container > .row > div > div > div > nav > .nav-links > a") %>%
  html_attr("href") %>%
  str_trim()
I have tried all the nodes, but nothing works. Thank you.
The number is clearly there with class .pages. Use the class of the ellipsis as a preceding anchor point and move to the required node with an adjacent sibling combinator.
library(rvest)
library(magrittr)
url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/'
pages <- read_html(url) %>%
  html_node(".dots + .page-numbers") %>%
  html_text() %>%
  as.integer()
Personally, I would consider looping until there is no match for a node with class next, i.e. until html_node(".next") returns no match.
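A minimal sketch of that approach; the "article h2" selector for the titles is an assumption about the site's markup and is untested:
library(rvest)

base_url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/page/'
page_n <- 1
titles <- list()

repeat {
  page <- read_html(paste0(base_url, page_n))
  # collect whatever you need from the current page
  titles[[page_n]] <- page %>% html_nodes("article h2") %>% html_text()
  # stop once the pagination block no longer contains a "next" link
  if (length(html_nodes(page, ".next")) == 0) break
  page_n <- page_n + 1
}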
Uglier would be something like
pages <- read_html(url) %>% html_nodes(".page-numbers:not(.next)") %>% tail(.,1) %>% html_text() %>% as.integer()
Why do you need to scrape the total number of pages when you already know it's 166?
Just loop through 1:166:
url <- 'https://www.burkina24.com/category/actualite-au-burkina-faso/politique/page/'

data <-
  purrr::map_dfr(
    1:166,
    function(x) {
      articles <- read_html(paste0(url, x)) %>%
        html_nodes(xpath = "//div[@class='posts-lists']/div/article")
      data.frame(
        id = articles %>% html_attr("id"),
        title = articles %>% html_nodes("h2") %>% html_text(),
        link = articles %>% html_nodes("h2 > a") %>% html_attr("href"),
        author = articles %>% html_nodes(xpath = "//a[@rel='author']") %>% html_text()
      )
    }
  )

Remove href and/or deactivate anchored links when printing a PDF from HTML using xml2 and pagedown with R

I'm using R to extract hundreds of articles from a food blog and convert them to PDF. I'm 99% done, but when I print the final PDF, in-line hyperlinks have their URL written right within the text. I do not want every link rendered to text in the PDF, and I believe I need to remove the href attributes from the HTML prior to printing with pagedown. Does anyone know how to do this? My example code below should get you to my PDF creation loop for the first article. The initial portions pull all of the URLs into a vector; the PDF creation loop at the end is the portion that needs this enhancement. Thanks.
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(downloader)
library(pagedown)
library(xml2)
library(htmltools)
#Specifying the url for desired website to be scraped
url1 <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', '1', '/')
#Reading the HTML code from the website
webpage1 <- read_html(url1)
# Pull the links for all articles on George's initial author page
dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
  as_tibble() %>%
  filter(str_detect(value, "([0-9]{4})")) %>%
  unique() %>%
  rename(link = value)
dat <- head(dat, 10)

# Pull the links for all articles on George's 2nd-89th author page
for (i in 2:89) {
  url <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', i, '/')
  # Reading the HTML code from the website
  webpage <- read_html(url)
  links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
    as_tibble() %>%
    filter(str_detect(value, "([0-9]{4})")) %>%
    unique() %>%
    rename(link = value)
  dat <- bind_rows(dat, links) %>%
    unique()
}

dat <- dat %>%
  arrange(link)
dat <- tail(dat, 890)
articleUrls <- dat$link[1]
# Mac
# Windows
setwd("YOUR-WD")
# articleUrls <- articleUrls[1]
for (i in seq_along(articleUrls)) {
  filename <- str_extract(articleUrls[i], "[^/]+(?=/$|$)")
  a <- read_html(articleUrls[i])
  xml_remove(a %>% xml_find_all("aside"))
  xml_remove(a %>% xml_find_all("footer"))
  xml_remove(a %>% xml_find_all(xpath = "//*[contains(@class, 'article-related mb20')]"))
  xml_remove(a %>% xml_find_all(xpath = "//*[contains(@class, 'tags')]"))
  # xml_remove(a %>% xml_find_all("head") %>% xml2::xml_find_all("script"))
  xml_remove(a %>% xml2::xml_find_all("//script"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'ad box')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'newsletter-signup')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer-sidebar')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-footer')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'sticky-newsletter')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-header')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, '.fb_iframe_widget')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, '_8f1i')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'newsletter-toggle')]"))
  # xml_remove(a %>% xml_find_all("//*[contains(@class, 'articleBody')]"))
  # xml_remove(a %>% xml_find_all("//href='([^\"]*)'"))
  xml2::write_html(a, file = paste0("html/", filename, ".html"))
  tryCatch(pagedown::chrome_print(input = paste0("html/", filename, ".html"),
                                  output = paste0("pdf/", filename, ".pdf"),
                                  format = "pdf", timeout = 300, verbose = 0,
                                  wait = 20),
           error = function(e) paste("wrong"))
}
In the printed PDF, the "< >" portion containing the URL should not display; it should only say "King's Brew".
Try something like this:
library(dplyr)
library(xml2)
allHref <- a %>% xml_find_all("//a")

for (l in allHref) {
  cntnt <- l %>% xml_text(trim = TRUE)
  xml_replace(l, read_xml(paste0("<span>", cntnt, "</span>")))
}
First we find all the links. Then, for each of them, we extract its text content and replace the link itself with that content.
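In the question's loop, this replacement would go after the xml_remove() calls and before write_html(), so that chrome_print() renders the modified HTML. A sketch only; note that link text containing '&' or '<' would need escaping before read_xml():
# `a` is the article already read by read_html() inside the loop
for (l in xml_find_all(a, "//a")) {
  xml_replace(l, read_xml(paste0("<span>", xml_text(l, trim = TRUE), "</span>")))
}
xml2::write_html(a, file = paste0("html/", filename, ".html"))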

Reading off links on a site and storing them in a list

I am trying to read off the URLs to data from NRCan as follows:
# 2015
url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"
x1 <- read_html(url) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")

# 2014
url2 <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2014/16993"
x2 <- read_html(url2) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")
Doing so returns two empty lists; I am confused, as this approach worked for this link: https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/18087. Ultimately I want to loop over the list and read off the tables on each page like so:
for (i in 1:length(x2)) {
  out.data <- read_html(x2[i]) %>%
    html_table(fill = TRUE) %>%
    `[[`(1) %>%
    as_tibble()
  write.xlsx(out.data, str_c(destination, i, ".xlsx"))
}
In order to extract all URLs, I recommend using the CSS selector ".field-item li a" and subsetting according to a pattern.
library(rvest)
library(stringr)

links <- read_html(url) %>%
  html_nodes(".field-item li a") %>%
  html_attr("href") %>%
  str_subset("fuel-prices/crude")
Your XPath needs to be fixed. You can use the following one:
//strong[contains(.,"Oil")]/following-sibling::ul//a
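As a sketch, plugged into the question's existing code (untested against the live page), this XPath can be passed to html_nodes():
library(rvest)

x2 <- read_html(url2) %>%
  html_nodes(xpath = '//strong[contains(.,"Oil")]/following-sibling::ul//a') %>%
  html_attr("href")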