Problems accessing a certain category using XPath in RStudio - html

I am trying to scrape a particular sports betting site. I want to get the names of the football matches currently being played, but my code returns the names of all events and I do not know why. I am using this code:
library(rvest)
library(tidyverse)
library(xml2)
html="https://www.supermatch.com.uy/live_recargar_menu/"
a=rvest::read_html(html)
b2= a %>% html_node("body") %>% html_node(xpath="//li[@class='sport code_sport-1']") %>%
  html_nodes(xpath="//span[@class='titulo']") %>% html_text()
As you can see, this code returns the names of all the events being played, not only the football matches.

library(tidyverse)
library(rvest)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
  read_html()
tibble(
  title = page %>%
    html_elements(".titulo") %>%
    html_text(),
  score = page %>%
    html_elements(".marcador") %>%
    html_text(),
  time = page %>%
    html_elements(".marcador+ span") %>%
    html_text() %>%
    str_squish()
)
# A tibble: 15 × 3
   title                                      score time
   <chr>                                      <chr> <chr>
 1 CS Emelec - Atletico Mineiro               0:1   1ª parte 18'
 2 Nacional de Montevideo - Union de Santa Fe 1:0   1ª parte 18'
 3 Brusque FC - Esporte Clube Bahia           0:1   PT 34'
 4 Gremio FBPA - Londrina                     1:0   PT 34'
 5 Fulgencio Yegros - Deportivo Santani       2:0   2ª parte 76'
 6 Paraguay - Panama                          2:0   2ª parte 65'
 7 Venezuela U20 - Bolivia U20                26:10 2do cuarto 6'
 8 Liu, L - Parikh, H                         1:1   3er set
 9 Truwit, Teddy - Raab, J                    0:0   1er Set
10 Ngounoue, M - Woog, M                      1:0   2do set
11 Cheng, E - Lopez, Jackeline                1:0   2do set
12 Moore, M - Pratt, S                        0:1   2do set
13 Zhu, Jiayun - Horwood, S                   0:0   1er Set
14 Nguyen, M - White, M                       0:0   1er Set
15 Martincova, T - Pliskova, Karolina         0:0   No iniciado
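A likely reason the original code returned every event: in rvest/xml2, an XPath that starts with `//` searches the whole document even when it is applied to an already selected node, while `.//` searches only beneath that node. The sketch below shows the difference on a small invented page; the markup and team names are hypothetical, and only the class names are taken from the question.

```r
library(rvest)

# Hypothetical markup that mimics the structure implied by the question.
page <- minimal_html('
  <ul>
    <li class="sport code_sport-1"><span class="titulo">Team A - Team B</span></li>
    <li class="sport code_sport-2"><span class="titulo">Player C - Player D</span></li>
  </ul>')

football <- html_element(page, xpath = "//li[@class='sport code_sport-1']")

# "//" ignores the current node and matches spans anywhere in the document.
all_titles <- football %>%
  html_elements(xpath = "//span[@class='titulo']") %>%
  html_text()

# ".//" only matches spans beneath the selected <li>.
football_titles <- football %>%
  html_elements(xpath = ".//span[@class='titulo']") %>%
  html_text()
```

Whether `code_sport-1` really is the football category on the live site is an assumption carried over from the question.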

Related

RSelenium and Rvest to create a table without html_table() from oddsportal.com

Recently, https://www.oddsportal.com/ changed its format, and I can no longer use html_table() to parse the game results table. It seems the only option is to use html_text2() and reconstruct the table manually.
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
url_results <- "https://www.oddsportal.com/basketball/australia/nbl/results/"
rD <- rsDriver(port= sample(7600)[1], browser=c("firefox"), chromever = NULL)
remDr <- rD$client ; remDr$navigate(url_results)
try(remDr$findElement(using = "xpath", '//*[@id="onetrust-accept-btn-handler"]')$clickElement())
page <- remDr$getPageSource() ; remDr$close() ; rD$server$stop()
# R_table <- 0
# pop <- page[[1]] %>%
# read_html() %>%
# html_nodes(xpath='//*[@id="tournamentTable"]') %>%
# html_table()
# try(R_table <- pop[[1]])
# table <- R_table
R_table <- 0
pop <- page[[1]] %>%
read_html() %>%
html_nodes(xpath=paste0('//*[@id="app"]/div/div[1]/div/main/div[2]/div[7]')) %>%
html_text2()
try(R_table <- pop[[1]])
table <- R_table
Does anyone know a good way to reconstruct the table the way the website presents it? This is the output I used to get with html_table() before they changed the format:
V1 V2 V3 V4 V5
Today, 10 Jan 1 2 B's
21:30 Perth – New Zealand Breakers 93:90 1.98 1.79 16
19:30 Illawarra Hawks – Tasmania JackJumpers 89:92 3.95 1.24 16
08 Jan 2023 1 2 B's
16:00 Cairns Taipans – South East Melbourne 94:85 1.54 2.43 16
14:00 Adelaide – New Zealand Breakers 83:85 1.91 1.85 16

Placing "NA" into an Empty Position?

I am trying to scrape name and address information from Yellow Pages (https://www.yellowpages.ca/). I have a function (from "(R) Webscraping Error: arguments imply differing number of rows: 1, 0") that retrieves this information:
library(rvest)
library(dplyr)
scraper <- function(url) {
  page <- url %>%
    read_html()
  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
However, the address is not always present. For example, several barbers are listed on this page, https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON, and most of them have an address, but not all. As a result, when I run this function, I get the following error:
scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
Error:
! Tibble columns must have compatible sizes.
* Size 14: Existing data.
* Size 12: Column `address`.
i Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
My question: can I modify the definition of the "scraper" function so that an NA appears in the row when no address is listed? For example:
barber address
1 barber111 address111
2 barber222 address222
3 barber333 NA
Is there some way I could add a statement similar to CASE WHEN that would grab the address or place an NA when the address is not there?
To match the businesses with their addresses, it is best to find a root node for each listing and get the text from the relevant child node. If the child node is missing, you can return NA for that listing:
library(rvest)
library(dplyr)
scraper <- function(url) {
  nodes <- read_html(url) %>% html_elements(".listing_right_section")
  tibble(
    name = nodes %>% sapply(function(x) {
      x <- html_text2(html_elements(x, css = ".jsListingName"))
      if (length(x)) x else NA
    }),
    address = nodes %>% sapply(function(x) {
      x <- html_text2(html_elements(x, css = ".listing__address--full"))
      if (length(x)) x else NA
    })
  )
}
So now we can do:
scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
#> # A tibble: 14 x 2
#> name address
#> <chr> <chr>
#> 1 Lords'n Ladies Hair Design 1560 Lasalle Blvd, Sudbury, ON P3A~
#> 2 Jo's The Lively Barber 611 Main St, Lively, ON P3Y 1M9
#> 3 Hairapy Studio 517 & Barber Shop 517 Notre Dame Ave, Sudbury, ON P3~
#> 4 Nickel Range Unisex Hairstyling 111 Larch St, Sudbury, ON P3E 4T5
#> 5 Ugo Barber & Hairstyling 911 Lorne St, Sudbury, ON P3C 4R7
#> 6 Gordon's Hairstyling 19 Durham St, Sudbury, ON P3C 5E2
#> 7 Valley Plaza Barber Shop 5085 Highway 69 N, Hanmer, ON P3P ~
#> 8 Rick's Hairstyling Shop 28 Young St, Capreol, ON P0M 1H0
#> 9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3
#> 10 Pat's Hairstylists 33 Godfrey Dr, Copper Cliff, ON P0~
#> 11 WildRootz Hair Studio 911 Lorne St, Sudbury, ON P3C 4R7
#> 12 Sleek Barber Bar 324 Elm St, ON P3C 1V8
#> 13 Faiella Classic Hair <NA>
#> 14 Ben's Barbershop & Hairstyling <NA>
Created on 2022-09-16 with reprex v2.0.2
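A perhaps even simpler solution: select one container per listing with html_elements(), then use html_element() (singular) for each field. html_element() returns exactly one result per input node, so a listing without an address yields NA instead of being dropped.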
library(tidyverse)
library(rvest)
scraper <- function(url) {
  page <- url %>%
    read_html() %>%
    html_elements(".listing_right_top_section")
  tibble(
    name = page %>%
      html_element(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_element(".listing__address--full") %>%
      html_text2()
  )
}
# A tibble: 14 x 2
name address
<chr> <chr>
1 Lords'n Ladies Hair Design 1560 Lasalle Blvd, Sudbury, ON P3A 1Z7
2 Jo's The Lively Barber 611 Main St, Lively, ON P3Y 1M9
3 Hairapy Studio 517 & Barber Shop 517 Notre Dame Ave, Sudbury, ON P3C 5L1
4 Nickel Range Unisex Hairstyling 111 Larch St, Sudbury, ON P3E 4T5
5 Ugo Barber & Hairstyling 911 Lorne St, Sudbury, ON P3C 4R7
6 Gordon's Hairstyling 19 Durham St, Sudbury, ON P3C 5E2
7 Valley Plaza Barber Shop 5085 Highway 69 N, Hanmer, ON P3P 1J6
8 Rick's Hairstyling Shop 28 Young St, Capreol, ON P0M 1H0
9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3
10 Pat's Hairstylists 33 Godfrey Dr, Copper Cliff, ON P0M 1N0
11 WildRootz Hair Studio 911 Lorne St, Sudbury, ON P3C 4R7
12 Sleek Barber Bar 324 Elm St, ON P3C 1V8
13 Faiella Classic Hair NA
14 Ben's Barbershop & Hairstyling NA
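The difference between the plural and singular selectors can be checked without hitting the site. This is a minimal sketch on invented markup: the `.listing` container and the business names are hypothetical, and only the two field class names come from the answers above.

```r
library(rvest)

# Hypothetical listings: the second one has no address element.
page <- minimal_html('
  <div class="listing">
    <a class="jsListingName">Barber A</a>
    <span class="listing__address--full">1 Main St</span>
  </div>
  <div class="listing">
    <a class="jsListingName">Barber B</a>
  </div>')

# Plural selector on the whole page: only existing nodes are returned,
# so the two columns end up with different lengths.
n_names     <- length(html_elements(page, ".jsListingName"))
n_addresses <- length(html_elements(page, ".listing__address--full"))

# Singular selector per listing: one result per listing, NA when missing.
listings  <- html_elements(page, ".listing")
addresses <- listings %>%
  html_element(".listing__address--full") %>%
  html_text2()
```

Here `n_names` and `n_addresses` differ, which is what triggers the tibble size error, while `addresses` has one entry per listing.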

Webscraping Rvest not working from html page, table showing NA'S - Mc Donalds

I am trying to scrape data from https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html to build a data frame with all the nutritional values and the allergens drop-down menu (further information, per 100 g, per portion, contained allergens). However, rvest does not detect the information as a table.
My code does not return any of the required values:
library(rvest)
url4 <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
test <- url4 %>% read_html() %>%
html_nodes(xpath = '//*[@id="collapseOne"]/div/div/div/div[1]') %>%
html_table()
test <- as.data.frame(test)
I also tried this
library(rvest)
library(stringr)
library(tidyr)
url <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)
How can this be done? I am very new to web scraping and do not know whether the HTML tags or the link are correct.
You can request the information from their JSON API:
library(tidyverse)
library(httr2)
"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
.$item %>%
.$nutrient_facts %>%
.$nutrient %>%
as_tibble %>%
select(4:9)
# A tibble: 10 x 6
id name nutrient_~1 uom uom_d~2 value
<int> <chr> <chr> <chr> <chr> <chr>
1 1 Serving Size primary_se~ g grams 302
2 2 Brennwert energy_kJ kJ kiloJo~ 2992
3 3 Brennwert energy_kcal kcal kilo c~ 716
4 4 Fett fat g grams 40
5 5 davon gesättigte Fettsäuren saturated_~ g grams 16
6 6 Kohlenhydrate carbohydra~ g grams 44
7 7 davon Zucker sugar g grams 11
8 8 Ballaststoffe fiber g grams 3.3
9 9 Eiweiß protein g grams 40
10 10 Salz salt g grams 2.4
# ... with abbreviated variable names 1: nutrient_name_id,
# 2: uom_description
Information on the allergens:
"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
.$item %>%
.$item_allergen %>%
str_split(pattern = ", ") %>%
getElement(1)
[1] "Milch (einschl. Laktose)"
[2] "Eier"
[3] "Glutenhaltiges Getreide: Weizen (wie Dinkel und Khorasan-Weizen)"
[4] "Senf"
[5] "Sesamsamen"
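To see how the nested fields are reached without calling the API, here is a small offline sketch. The payload is invented and only mimics the nesting used above (`item`, `nutrient_facts`, `nutrient`, `item_allergen`); the real response has more fields. It uses jsonlite rather than httr2, purely to parse a local string.

```r
library(jsonlite)

# Invented payload with placeholder values; only the nesting follows the answer.
payload <- '{"item": {
  "item_allergen": "Milch, Eier, Senf",
  "nutrient_facts": {"nutrient": [
    {"id": 1, "name": "Fett", "uom": "g", "value": "40"},
    {"id": 2, "name": "Salz", "uom": "g", "value": "2.4"}
  ]}
}}'

parsed <- fromJSON(payload, simplifyVector = TRUE)

# An array of objects is simplified to a data frame, one row per nutrient.
nutrients <- parsed$item$nutrient_facts$nutrient

# The values arrive as character (as in the tibble above), so convert them.
nutrients$value <- as.numeric(nutrients$value)

# The allergens are a single comma-separated string.
allergens <- strsplit(parsed$item$item_allergen, ", ", fixed = TRUE)[[1]]
```

Converting `value` to numeric is an extra step not shown in the answer; it is only needed if you want to compute with the numbers.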

Retrieving data from HTML in RStudio

I want to retrieve a data frame from this HTML page: https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1
Is there a simple way to get the table from this site? I tried the code below, but I don't know what to pass to html_node():
transfermarkt <- xml2::read_html("https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1")
transfermarkt %>%
html_node("responsive-table") %>%
html_text()
You can right-click on the table and choose Inspect to see the relevant selectors:
Use html_node("#yw1 table") since you want the <table> inside id="yw1"
Change html_text() to html_table() since this is tabular data
Add drop_na('#') to remove superfluous rows (rows that have NA values in the # column)
library(rvest)
library(tidyverse)
transfermarkt <- xml2::read_html("https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1")
transfermarkt %>%
html_node("#yw1 > table") %>%
html_table() %>%
drop_na('#')
#  Zawodnik  Narodowość  Wiek (obecny)  Klub  Czas na boisku  Gole na mecz
1  Tomas Pekhart Środkowy napastnik  NA  Tomas Pekhart  Środkowy napastnik  NA  31  19  0  5  1.510'  79'  1,00
2  Jesús Imaz Ofensywny pomocnik  NA  Jesús Imaz  Ofensywny pomocnik  NA  30  19  4  1  1.610'  161'  0,53
3  Flávio Paixão Środkowy napastnik  NA  Flávio Paixão  Środkowy napastnik  NA  36  22  3  4  1.693'  188'  0,41
...
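The three steps (select the table inside `#yw1`, parse it with html_table(), drop rows without a `#` value) can be tried on a small invented table. The markup and player names below are hypothetical; only the `#yw1 > table` selector and the `#` column name come from the answer.

```r
library(rvest)
library(tidyr)

# Hypothetical table: the middle row has an empty "#" cell,
# similar to the superfluous rows in the real page.
page <- minimal_html('
  <div id="yw1"><table>
    <tr><th>#</th><th>Player</th></tr>
    <tr><td>1</td><td>Player A</td></tr>
    <tr><td></td><td>Player A</td></tr>
    <tr><td>2</td><td>Player B</td></tr>
  </table></div>')

scorers <- page %>%
  html_element("#yw1 > table") %>%  # the <table> directly inside id="yw1"
  html_table() %>%                  # empty numeric cells are read as NA
  drop_na("#")                      # keep only rows that have a rank
```

On the real page the rows and columns will differ, so treat this only as a check of the selector and drop_na() logic.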

Scraping wikipedia table r

I am trying to scrape the first 8 tables (very high, high, medium, low) from the Human Development Index page on Wikipedia.
I started with the code below, but I get a list of length zero. What am I doing wrong? I am new to R :(
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index#Complete_list_of_countries"
webpage <- read_html(url)
hdi_tables <- html_nodes(webpage, 'table')
head(hdi_tables, n = 10)
scrape <- url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/div[5]/table/tbody/tr/td[1]/table') %>%
html_table()
head(scrape, n=10)
I think it would be easier to work with the original data source:
Select "Human Development Index (HDI)" in both the drop-down select lists, then click the "Download Data" link to get a CSV file named Human Development Index (HDI).csv.
Read it into R:
library(tidyverse)
Human_Development_Index_HDI_ <- read_csv("path/to/Human Development Index (HDI).csv",
skip = 1)
You can reshape the data, get the values for 2015 and classify countries as low, medium, high or very high:
hdi <- Human_Development_Index_HDI_ %>%
  gather(Year, HDI, -`HDI Rank (2015)`, -Country) %>%
  filter(Year == "2015") %>%
  na.omit() %>%
  mutate(Year = as.numeric(Year),
         classification = cut(HDI,
                              breaks = c(0, 0.549, 0.699, 0.799, 1),
                              labels = c("low", "medium", "high", "very_high")))
hdi
# A tibble: 188 x 5
`HDI Rank (2015)` Country Year HDI classification
<int> <chr> <dbl> <dbl> <fctr>
1 169 Afghanistan 2015 0.479 low
2 75 Albania 2015 0.764 high
3 83 Algeria 2015 0.745 high
4 32 Andorra 2015 0.858 very_high
5 150 Angola 2015 0.533 low
6 62 Antigua and Barbuda 2015 0.786 high
7 45 Argentina 2015 0.827 very_high
8 84 Armenia 2015 0.743 high
9 2 Australia 2015 0.939 very_high
10 24 Austria 2015 0.893 very_high
# ... with 178 more rows
You could change the filter to get values for 2014 too, if you want to replicate the "change from previous year" values in the Wikipedia table.
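The classification step relies on cut(), which assigns each value to a right-closed interval such as (0.699, 0.799]. A minimal check on three HDI values taken from the output above:

```r
# Three 2015 HDI values copied from the tibble shown above.
hdi_values <- c(Afghanistan = 0.479, Albania = 0.764, Andorra = 0.858)

classification <- cut(
  hdi_values,
  breaks = c(0, 0.549, 0.699, 0.799, 1),
  labels = c("low", "medium", "high", "very_high")
)

# Pair each country with its class as plain character values.
result <- setNames(as.character(classification), names(hdi_values))
```

Because the intervals are closed on the right, a value of exactly 0.549 would fall in "low" and exactly 0.799 in "high".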
If you're okay with parsing Wikipedia's markup language (wikitext) instead, you could try using WikipediR to grab the markup of the page (from skimming the documentation, try page_content with as_wikitext = TRUE). Then you'll get some lines that all look like this:
| 1 || {{steady}} ||style="text-align:left"| {{flag|Norway}} || 0.949 || {{increase}} 0.001
These lines should be parseable in R with strsplit() or a similar function.
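As a rough sketch of that idea, the example line above can be split on the `||` cell separator with base R. The assumption that every row has the same five cells in this order comes only from that single example line, so real wikitext may need more handling.

```r
# The example row from above, as a single string.
line <- '| 1 || {{steady}} ||style="text-align:left"| {{flag|Norway}} || 0.949 || {{increase}} 0.001'

# Drop the leading "|", split on the "||" separator, trim whitespace.
cells <- trimws(strsplit(sub("^\\|", "", line), "||", fixed = TRUE)[[1]])

rank    <- as.integer(cells[1])
# Pull the country name out of the {{flag|...}} template in the third cell.
country <- sub(".*\\{\\{flag\\|([^}]+)\\}\\}.*", "\\1", cells[3])
hdi     <- as.numeric(cells[4])
```

Applying the same steps over all matching lines, for example with lapply(), would give one row per country.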