R programming, web scraping: I can't get a link from HTML

I am trying to extract the links from this HTML, but the element has no class name.
How could I do that?
Here is my code in R:
library(rvest)

link <- read_html('https://www.coffeedesk.pl/kawa/')
hrefs <- link %>%
  html_nodes('div.product-title a') %>%
  html_attr('href')
link_www <- paste0("https://www.coffeedesk.pl", hrefs)

I prefer using XPath to parse HTML:
library(magrittr)

url <- "https://www.coffeedesk.pl/kawa/"
x <- xml2::read_html(url)
links <- xml2::xml_find_all(x, "//div[@class = 'product-title']//a") %>%
  xml2::xml_attr("href") %>%
  paste0("https://www.coffeedesk.pl", .)


Webscraping between two tags

I am trying to scrape the following page(s):
https://mywebsite.com
In particular, I would like to get the name of each entry. I noticed that the text I am interested in (MY TEXT) always sits between these two tags: <div class="title"> MY TEXT
I know how to search for these tags individually:
# load libraries
library(rvest)
library(httr)
library(XML)

# set up page
url <- "https://www.mywebsite.com"
page <- read_html(url)

# option 1: all <title>-tagged nodes
b <- page %>% html_nodes("title")
option1 <- b %>% html_text() %>% strsplit("\\n")

# option 2: all <a> nodes
b <- page %>% html_nodes("a")
option2 <- b %>% html_text() %>% strsplit("\\n")
Is there some way I could have specified the html_nodes() argument so that it picks up "MY TEXT", i.e. scrapes between <div class="title"> and </a>?
<div class="title"> MY TEXT
Scraping of pages 1:10
library(tidyverse)
library(rvest)

my_function <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
                 page_n, "?category=0&services=0&province=55&city=&k=") %>%
    read_html()
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         address = page %>%
           html_elements(".marker") %>%
           html_text2(),
         page = page_n)
}

df <- map_dfr(1:10, my_function)
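If you scale this up to many pages, it is considerate to pause between requests; a minimal sketch that wraps my_function (the one-second delay is an arbitrary choice of mine, not a site requirement):
library(tidyverse)

polite_function <- function(page_n) {
  Sys.sleep(1)  # wait one second before each request to avoid hammering the server
  my_function(page_n)
}

df <- map_dfr(1:10, polite_function)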
You can use the xpath argument inside html_elements to locate each a tag inside a div with class "title".
Here's a complete reproducible example.
library(rvest)
"https://www.mywebsite.ca/extension1/" %>%
paste0("2?extension2") %>%
read_html() %>%
html_elements(xpath = "//div[#class='title']/a") %>%
html_text()
Or to get all entries on the first 10 pages:
library(rvest)
unlist(lapply(1:10, function(page){
  "https://www.mywebsite.ca/extension1/" %>%
    paste0(page, "?extension2") %>%
    read_html() %>%
    html_elements(xpath = "//div[@class='title']/a") %>%
    html_text()
}))
Created on 2022-07-26 by the reprex package (v2.0.1)

Can't extract href link from html_node in rvest

When I use the rvest package with an XPath and try to get the embedded links (football team names) from the site, I get an empty result. Could someone help with this?
The code is as follows:
library(rvest)
url <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
xpath <- '/html/body/div[2]/div[11]/div[1]/div[2]/div[2]/div'
url %>%
  html_node(xpath = xpath) %>%
  html_attr('href')
You can get all the links using:
library(rvest)

url <- 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'
url %>%
  read_html() %>%
  html_nodes('td.hauptlink a') %>%
  html_attr('href') %>%
  .[. != '#'] %>%                                  # drop placeholder anchors
  paste0('https://www.transfermarkt.com', .) %>%   # make the links absolute
  unique() %>%
  head(20)
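If you also want the team names alongside their links, you can take the anchor text and the href from the same nodes; a small sketch (the team/href/url column names are my own):
library(rvest)
library(tibble)

page <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
anchors <- page %>% html_nodes('td.hauptlink a')

teams <- tibble(
  team = html_text(anchors, trim = TRUE),  # the anchor text, i.e. the team name
  href = html_attr(anchors, 'href')        # the relative link
)
teams <- teams[teams$href != '#', ]        # drop placeholder anchors
teams$url <- paste0('https://www.transfermarkt.com', teams$href)
teams <- unique(teams)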

Reading off links on a site and storing them in a list

I am trying to read off the URLs to data from Natural Resources Canada (NRCan) as follows:
library(rvest)

# 2015
url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"
x1 <- read_html(url) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")

# 2014
url2 <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2014/16993"
x2 <- read_html(url2) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")
Doing so returns two empty lists; I am confused, as this worked for this link: https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/18087. Ultimately I want to loop over the list and read off the tables on each page like so:
library(rvest)
library(tibble)
library(stringr)
library(openxlsx)  # assuming write.xlsx() comes from openxlsx

for (i in 1:length(x2)) {
  out.data <- read_html(x2[i]) %>%
    html_table(fill = TRUE) %>%
    `[[`(1) %>%
    as_tibble()
  write.xlsx(out.data, str_c(destination, i, ".xlsx"))  # destination is a folder path defined elsewhere
}
In order to extract all URLs, I recommend using the CSS selector ".field-item li a" and subsetting according to a pattern.
library(rvest)
library(stringr)

links <- read_html(url) %>%
  html_nodes(".field-item li a") %>%
  html_attr("href") %>%
  str_subset("fuel-prices/crude")
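From there you can feed the links into the loop from the question; note that the hrefs may be relative, in which case you would prepend the domain first (a sketch under that assumption):
full_links <- ifelse(startsWith(links, "http"),
                     links,
                     paste0("https://www.nrcan.gc.ca", links))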
Your XPath needs to be fixed. You can use the following one :
//strong[contains(.,"Oil")]/following-sibling::ul//a
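Plugged into the code from the question, that XPath would be used like this (a sketch following the same pattern as the question's code):
library(rvest)

x1 <- read_html(url) %>%
  html_nodes(xpath = '//strong[contains(.,"Oil")]/following-sibling::ul//a') %>%
  html_attr("href")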

Scraping web table using R and rvest

I'm new to web scraping with R.
I'm trying to scrape the table generated by this link:
https://gd.eppo.int/search?k=saperda+tridentata.
In this specific case there is just one record in the table, but there could be more (I am actually interested in the first column, but the whole table is fine).
I tried to follow the suggestion given by Allan Cameron here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, maybe because of my limited knowledge of how webpages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page".
Where can I get this link? In this specific case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the right one?
Below you have my code.
Thank you in advance!
library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
library(magrittr)  # for extract()

pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=", pest.name, sep = "")
resp <- GET(url) %>% content("text")

json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8")

table_contents <- JSON %>%
  {gsub("\\\\n", "\n", .)} %>%     # unescape newlines
  {gsub("\\\\/", "/", .)} %>%      # unescape slashes
  {gsub("\\\\\"", "\"", .)} %>%    # unescape quotes
  strsplit("html\":\"") %>%
  unlist %>%
  extract(2) %>%
  substr(1, nchar(.) - 2) %>%
  paste0("</tbody>")

new_page <- gsub("</tbody>", table_contents, resp)

read_html(new_page) %>%
  html_nodes("table") %>%
  html_table()
The data comes from another endpoint you can see in the network tab when refreshing the page. You can send a request with your search phrase in the params and then extract the json you need from the response.
library(httr)
library(jsonlite)
library(rvest)

params <- list('k' = 'saperda tridentata', 's' = 1, 'm' = 1, 't' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>% html_text())
print(data[[1]]$e)
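Since the endpoint returns JSON, you can also skip the read_html() step and parse the response text directly; a small sketch, assuming the response body is plain JSON:
library(httr)
library(jsonlite)

params <- list('k' = 'saperda tridentata', 's' = 1, 'm' = 1, 't' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))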

How can I extract selective data from a webpage using rvest?

I have been trying to display the review rating of this album using rvest in R, from Pitchfork: https://pitchfork.com/reviews/albums/us-girls-heavy-light/. In this case, it is 8.5, but somehow my code does not return it.
Here is my code:
library(rvest)
library(dplyr)
library(RCurl)
library(tidyverse)

URL <- "https://pitchfork.com/reviews/albums/us-girls-heavy-light/"
webpage <- read_html(URL)

cat("Review Rating")
webpage %>%
  html_nodes("div span") %>%
  html_text()
We can get the relevant information from the div whose class is "score-circle".
library(rvest)

webpage %>%
  html_nodes('div.score-circle') %>%
  html_text() %>%
  as.numeric()
#[1] 8.5
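A self-contained version, reading the page and extracting the score in one pipeline (assuming Pitchfork still renders the rating in a div with the "score-circle" class):
library(rvest)

"https://pitchfork.com/reviews/albums/us-girls-heavy-light/" %>%
  read_html() %>%
  html_node("div.score-circle") %>%  # the single score element
  html_text() %>%
  as.numeric()
#[1] 8.5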