Retrieving data from HTML in RStudio


I want to retrieve a data frame from this HTML page: https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1
Is there a simple way to get the table shown on this site? I tried the approach below, but I don't know what to pass to html_node():
transfermarkt <- xml2::read_html("https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1")
transfermarkt %>%
html_node("responsive-table") %>%
html_text()

You can right-click the table and choose Inspect to see the relevant selectors:
Use html_node("#yw1 > table"), since you want the <table> inside the element with id="yw1"
Change html_text() to html_table(), since this is tabular data
Add drop_na('#') to remove superfluous rows (rows that have NA values in the # column)
library(rvest)
library(tidyverse)
transfermarkt <- xml2::read_html("https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1")
transfermarkt %>%
html_node("#yw1 > table") %>%
html_table() %>%
drop_na('#')
# | Zawodnik | Narodowość | Wiek (obecny) | Klub | Czas na boisku | Gole na mecz
1 | Tomas Pekhart Środkowy napastnik | NA | Tomas Pekhart | Środkowy napastnik | NA | 31 | 19 | 0 | 5 | 1.510' | 79' | 1,00
2 | Jesús Imaz Ofensywny pomocnik | NA | Jesús Imaz | Ofensywny pomocnik | NA | 30 | 19 | 4 | 1 | 1.610' | 161' | 0,53
3 | Flávio Paixão Środkowy napastnik | NA | Flávio Paixão | Środkowy napastnik | NA | 36 | 22 | 3 | 4 | 1.693' | 188' | 0,41
...
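On the original question of what to pass to html_node(): a bare word such as "responsive-table" is read as a tag name, while a class needs a leading dot and an id a leading hash. A minimal sketch on made-up markup (the real page's structure is an assumption here and may differ):

```r
library(rvest)

# Made-up markup; the real page structure may differ
page <- read_html('
  <div id="yw1">
    <table class="responsive-table">
      <tr><th>#</th><th>Player</th></tr>
      <tr><td>1</td><td>A</td></tr>
    </table>
  </div>')

# A bare word is a tag-name selector: this looks for a <responsive-table> tag and finds nothing
by_tag <- html_element(page, "responsive-table")

# A leading dot selects by class, a leading hash selects by id
by_class <- html_element(page, ".responsive-table")
by_id <- html_element(page, "#yw1 > table")

# Once a <table> node is selected, html_table() turns it into a data frame
tbl <- html_table(by_id)
```

html_element() is the current name for html_node() in rvest 1.0 and later; either should behave the same here.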

Related

RSelenium and Rvest to create a table without html_table() from oddsportal.com

Recently, https://www.oddsportal.com/ changed their format, and I can no longer use html_table() to parse the game result table. It seems the only option is to use html_text2() and reconstruct the table manually.
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
url_results <- "https://www.oddsportal.com/basketball/australia/nbl/results/"
rD <- rsDriver(port= sample(7600)[1], browser=c("firefox"), chromever = NULL)
remDr <- rD$client ; remDr$navigate(url_results)
try(remDr$findElement(using = "xpath", '//*[@id="onetrust-accept-btn-handler"]')$clickElement())
page <- remDr$getPageSource() ; remDr$close() ; rD$server$stop()
# R_table <- 0
# pop <- page[[1]] %>%
# read_html() %>%
# html_nodes(xpath='//*[@id="tournamentTable"]') %>%
# html_table()
# try(R_table <- pop[[1]])
# table <- R_table
R_table <- 0
pop <- page[[1]] %>%
read_html() %>%
html_nodes(xpath=paste0('//*[@id="app"]/div/div[1]/div/main/div[2]/div[7]')) %>%
html_text2()
try(R_table <- pop[[1]])
table <- R_table
Does anyone know a good way to reconstruct the table as the website presents it? This is the output I used to get with html_table() before they changed the format:
V1 V2 V3 V4 V5
Today, 10 Jan 1 2 B's
21:30 Perth – New Zealand Breakers 93:90 1.98 1.79 16
19:30 Illawarra Hawks – Tasmania JackJumpers 89:92 3.95 1.24 16
08 Jan 2023 1 2 B's
16:00 Cairns Taipans – South East Melbourne 94:85 1.54 2.43 16
14:00 Adelaide – New Zealand Breakers 83:85 1.91 1.85 16

Webscraping Rvest not working from html page, table showing NA'S - Mc Donalds

I am trying to scrape data from https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html to build a data frame with all the nutritional values and the allergens drop-down menu (further information, per 100 g, per portion, contained allergens). However, rvest does not detect the information as a table.
The result does not contain any of the values I need.
library(rvest)
url4 <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
test <- url4 %>% read_html() %>%
html_nodes(xpath = '//*[@id="collapseOne"]/div/div/div/div[1]') %>%
html_table()
test <- as.data.frame(test)
I also tried this
library(rvest)
library(stringr)
library(tidyr)
url <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)
How can this be done? I'm very new to web scraping and don't know whether the HTML tags or the link are correct.

You can request the information from their JSON API:
library(tidyverse)
library(httr2)
"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
.$item %>%
.$nutrient_facts %>%
.$nutrient %>%
as_tibble %>%
select(4:9)
# A tibble: 10 x 6
id name nutrient_~1 uom uom_d~2 value
<int> <chr> <chr> <chr> <chr> <chr>
1 1 Serving Size primary_se~ g grams 302
2 2 Brennwert energy_kJ kJ kiloJo~ 2992
3 3 Brennwert energy_kcal kcal kilo c~ 716
4 4 Fett fat g grams 40
5 5 davon gesättigte Fettsäuren saturated_~ g grams 16
6 6 Kohlenhydrate carbohydra~ g grams 44
7 7 davon Zucker sugar g grams 11
8 8 Ballaststoffe fiber g grams 3.3
9 9 Eiweiß protein g grams 40
10 10 Salz salt g grams 2.4
# ... with abbreviated variable names 1: nutrient_name_id,
# 2: uom_description
Information on the allergens:
"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
.$item %>%
.$item_allergen %>%
str_split(pattern = ", ") %>%
getElement(1)
[1] "Milch (einschl. Laktose)"
[2] "Eier"
[3] "Glutenhaltiges Getreide: Weizen (wie Dinkel und Khorasan-Weizen)"
[4] "Senf"
[5] "Sesamsamen"
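The chain of .$ steps above simply walks down a nested list. As a sketch of the same idea, here is purrr::pluck() applied to a hand-made list; the list only mimics the fields used above (item, nutrient_facts, nutrient, item_allergen) and its values are shortened stand-ins, not the real API response:

```r
library(purrr)
library(stringr)

# Hand-made stand-in for the parsed JSON; only the fields used above are mimicked
resp <- list(item = list(
  nutrient_facts = list(
    nutrient = data.frame(name = c("Fett", "Salz"), value = c("40", "2.4"))
  ),
  item_allergen = "Eier, Senf, Sesamsamen"
))

# pluck() walks the nested list one name at a time
nutrients <- pluck(resp, "item", "nutrient_facts", "nutrient")

# The allergens arrive as one comma-separated string, so split it into a vector
allergens <- pluck(resp, "item", "item_allergen") %>%
  str_split(pattern = ", ") %>%
  getElement(1)
```

pluck() returns NULL instead of raising an error when a level is missing, which can be convenient if some products lack a field.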

Problems to access certain category using xpath in Rstudio

I am trying to scrape a sports betting site. I want the names of the football matches currently being played, but my code returns the names of all events and I don't know why. I am using this code:
library(rvest)
library(tidyverse)
library(xml2)
html="https://www.supermatch.com.uy/live_recargar_menu/"
a=rvest::read_html(html)
b2= a %>% html_node("body") %>% html_node(xpath="//li[@class='sport code_sport-1']") %>%
html_nodes(xpath="//span[@class='titulo']") %>% html_text()
As you can see, this code gets the name for all the events that are being played.

library(tidyverse)
library(rvest)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
read_html()
tibble(
title = page %>%
html_elements(".titulo") %>%
html_text(),
score = page %>%
html_elements(".marcador") %>%
html_text(),
time = page %>%
html_elements(".marcador+ span") %>%
html_text() %>%
str_squish()
)
# A tibble: 15 × 3
title score time
<chr> <chr> <chr>
1 CS Emelec - Atletico Mineiro 0:1 1ª parte 18'
2 Nacional de Montevideo - Union de Santa Fe 1:0 1ª parte 18'
3 Brusque FC - Esporte Clube Bahia 0:1 PT 34'
4 Gremio FBPA - Londrina 1:0 PT 34'
5 Fulgencio Yegros - Deportivo Santani 2:0 2ª parte 76'
6 Paraguay - Panama 2:0 2ª parte 65'
7 Venezuela U20 - Bolivia U20 26:10 2do cuarto 6'
8 Liu, L - Parikh, H 1:1 3er set
9 Truwit, Teddy - Raab, J 0:0 1er Set
10 Ngounoue, M - Woog, M 1:0 2do set
11 Cheng, E - Lopez, Jackeline 1:0 2do set
12 Moore, M - Pratt, S 0:1 2do set
13 Zhu, Jiayun - Horwood, S 0:0 1er Set
14 Nguyen, M - White, M 0:0 1er Set
15 Martincova, T - Pliskova, Karolina 0:0 No iniciado
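As to why the XPath in the question returned every event: an expression that starts with // searches from the document root, even when it is applied to an already selected node. Prefixing it with a dot (.//) keeps the search inside that node. A minimal sketch on made-up markup that only imitates the class names from the question:

```r
library(rvest)

# Hypothetical markup standing in for the live menu page
page <- read_html('
  <ul>
    <li class="sport code_sport-1"><span class="titulo">A - B</span></li>
    <li class="sport code_sport-2"><span class="titulo">C - D</span></li>
  </ul>')

football <- page %>%
  html_element(xpath = "//li[@class='sport code_sport-1']")

# "//span" restarts from the document root, so titles from both <li> come back
all_titles <- football %>%
  html_elements(xpath = "//span[@class='titulo']") %>%
  html_text()

# ".//span" stays inside the selected <li>
football_titles <- football %>%
  html_elements(xpath = ".//span[@class='titulo']") %>%
  html_text()
```

Whether the live page still uses these class names is not something this sketch can confirm.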

Scrape nested html structure

I would like to scrape the data from this site, without losing the information from the nested structure. Consider the name benodanil, which not only belongs to benzanilide fungicides, but also to anilide fungicides and amide fungicides. It's not necessarily always 3 classes, but at least one and up to many. So, ideally, I'd want a data.frame that looks as such:
name | class1 | class2 | class3 | ...
benodanil | benzanilide fungicides | anilide fungicides | amide fungicides | NA
aureofungin | antibiotic fungicides | NA | NA | NA
... | ... | ... | ...
I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:
require(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
# extract lists
li = html_nodes(site, 'li')
# extract unorder lists
ul = html_nodes(site, 'ul')
# loop idea
l = list()
for (i in seq_along(li)) {
  li1 = html_nodes(li[i], 'a')
  name = na.omit(unique(html_attr(li1, 'href')))
  clas = na.omit(unique(html_attr(li1, 'name')))
  l[[i]] = list(name = name,
                clas = clas)
}
An additional problem is that some names occur more than once, such as bixafen. Hence, I guess the job has to be done iteratively.

library(dplyr)
library(tidyr)
library(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
a <- site %>% html_nodes('li ul a')
tibble(name = a %>% html_attr('href'),
class = a %>% html_attr('name')) %>%
fill(class) %>%
filter(!is.na(name)) %>%
mutate(name = sub('\\.html', '', name)) %>%
group_by(name) %>%
mutate(col = paste0('class', row_number())) %>%
pivot_wider(names_from = col, values_from = class) %>%
ungroup()
# A tibble: 189 x 4
# name class1 class2 class3
# <chr> <chr> <chr> <chr>
# 1 benalaxyl acylamino_acid_fungici… anilide_fungicides NA
# 2 benalaxyl-m acylamino_acid_fungici… anilide_fungicides NA
# 3 furalaxyl acylamino_acid_fungici… furanilide_fungicides NA
# 4 metalaxyl acylamino_acid_fungici… anilide_fungicides NA
# 5 metalaxyl-m acylamino_acid_fungici… anilide_fungicides NA
# 6 pefurazoate acylamino_acid_fungici… NA NA
# 7 valifenalate acylamino_acid_fungici… NA NA
# 8 bixafen anilide_fungicides picolinamide_fungici… pyrazolecarboxamide_fungic…
# 9 boscalid anilide_fungicides NA NA
#10 carboxin anilide_fungicides NA NA
# … with 179 more rows
Extract name and class from the webpage, fill the NA values with the previous non-NA, drop rows with NA values and get the data in wide format.
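To see what each step does without hitting the website, the same pipeline can be run on a small made-up tibble. The values below are invented; the layout (class anchors carry only class, item links carry only name) is an assumption that mirrors what the two html_attr() calls return:

```r
library(dplyr)
library(tidyr)

# Invented stand-in for the scraped links
links <- tibble(
  name  = c(NA, "a.html", "b.html", NA, "a.html"),
  class = c("class_x", NA, NA, "class_y", NA)
)

wide <- links %>%
  fill(class) %>%                                  # carry each class down to the items below it
  filter(!is.na(name)) %>%                         # drop the rows that were only class anchors
  mutate(name = sub("\\.html", "", name)) %>%
  group_by(name) %>%
  mutate(col = paste0("class", row_number())) %>%  # number the classes per name
  pivot_wider(names_from = col, values_from = class) %>%
  ungroup()
```

Here "a" appears under two classes and therefore fills class1 and class2, while "b" gets NA in class2, which is how repeated names such as bixafen end up in a single row.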

How to clean and split HTML tags in R?

My parser creates a data frame that looks like this:
name html
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>
How can I extract useful information from the HTML? For example, I want to use some HTML attributes as features:
name minute second id
1 John 68 37 8028
2 Steve 69 4 132205

If you already have the data frame in your question, you can try the following. Your data frame is called mydf here, with the HTML stored in a column named url (see DATA below). You can extract all numbers with stri_extract_all_regex(). Then you follow the classic method of converting a list to a data frame. Finally, you assign new column names and bind the result to the name column of the original data frame.
library(stringi)
library(dplyr)
stri_extract_all_regex(str = mydf$url, pattern = "[0-9]+") %>%
unlist %>%
matrix(ncol = 4, byrow = T) %>%
data.frame %>%
setNames(c("minute", "second", "ID", "data")) %>%
bind_cols(mydf["name"], .)
# name minute second ID data
#1 John 68 37 8028 68
#2 Steve 69 4 132205 69
DATA
mydf <- structure(list(name = c("John", "Steve"), url = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>",
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>"
)), .Names = c("name", "url"), row.names = c(NA, -2L), class = "data.frame")
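The approach above depends on the numbers appearing in a fixed order. As a variation, each data-* attribute can be targeted by name with a base R regex. This is only a sketch: get_attr is a hypothetical helper written for this example, and it assumes each attribute occurs exactly once per string and is double-quoted:

```r
html <- c(
  '<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>',
  '<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>'
)

# Hypothetical helper: return the value of data-<attr> from each string
get_attr <- function(x, attr) {
  sub(sprintf('.*data-%s="([^"]*)".*', attr), "\\1", x)
}

out <- data.frame(
  name   = c("John", "Steve"),
  minute = get_attr(html, "minute"),
  second = get_attr(html, "second"),
  id     = get_attr(html, "id")
)
```

The columns come back as character; wrap them in as.integer() if numeric features are needed.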

An alternate rvest approach using purrr and dplyr:
library(rvest)
library(purrr)
library(purrrlyr) # by_row() was moved from purrr to purrrlyr
library(dplyr)
df <- read.table(stringsAsFactors=FALSE, header=TRUE, sep=",", text='name,html
John,<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>
Steve,<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>')
by_row(df, .collate="cols",
~read_html(.$html) %>%
html_nodes("span:first-of-type") %>%
html_attrs() %>%
flatten_chr() %>%
as.list() %>%
flatten_df()) %>%
select(-html, -class1) %>%
setNames(gsub("^data-|1$", "", colnames(.)))
## # A tibble: 2 × 4
## name minute second id
## <chr> <chr> <chr> <chr>
## 1 John 68 37 8028
## 2 Steve 69 4 132205

Regex is possible, but I prefer the rvest package for this.
It is easier with data.table or dplyr, but let's do it in base R (on the off-chance that those are new concepts).
# Example data
df <- structure(list(name = c("John", "Steve"), html = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>",
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>"
)), .Names = c("name", "html"), row.names = c(NA, -2L), class = "data.frame")
rvest lets us split this up using the DOM, which can be a lot nicer than working with regex for the same thing.
library(rvest)
# Get span attributes from each row:
spanattrs <-
lapply(df$html,
function(y) read_html(y) %>% html_node('span') %>% html_attrs)
# rbind to get a data.frame with all attributes
final <- data.frame(df, do.call(rbind,spanattrs))
> final
name html class
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> incident-icon
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> incident-icon
data.minute data.second data.id
1 68 37 8028
2 69 4 132205
Let's remove the html column so it's a little nicer in the viewer here:
> final$html <- NULL
> final
name class data.minute data.second data.id
1 John incident-icon 68 37 8028
2 Steve incident-icon 69 4 132205
