How to get HTML element that is before a certain class?

I'm scraping a page and having trouble getting the "th" element that comes before the "th" elements carrying the "type2" class. I'd like to identify it by that relationship, as the "th" that precedes the "th" with class "type2", because my HTML has many "th" elements and this is the only difference I found between the tables.
Using rvest or xml2 (or another R package), how can I get this element?
The content which I want is "text_that_I_want".
Thank you!
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>

A formal and more generalizable way to navigate relative to a given node is XPath, here using the ancestor and preceding-sibling axes:
read_html(htmldoc) %>%
  html_nodes(xpath = "//th[@class = 'string type2']/ancestor::td/preceding-sibling::th") %>%
  html_text()
#> [1] "text_that_I_want"
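As a self-contained sketch of the same axis navigation, the snippet below runs against the question's HTML. The wrapping table and the closing tags are assumptions added so the fragment is well formed, and the class test matches "type2" as a whole class token rather than comparing the full attribute string, so it should also cover the "array type2" header.

```r
library(xml2)

# Question's fragment, closed off so it parses cleanly (closing tags are assumed)
html <- '<table><tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>
</table>
</td>
</tr></table>'

doc <- read_html(html)

# 1. find any <th> whose class list contains the token "type2"
# 2. climb to the nearest enclosing <td>
# 3. step back to the closest preceding <th> sibling of that <td>
xp <- paste0(
  "//th[contains(concat(' ', normalize-space(@class), ' '), ' type2 ')]",
  "/ancestor::td[1]/preceding-sibling::th[1]"
)

result <- xml_text(xml_find_all(doc, xp))
print(result)
```

Both "type2" headers lead to the same outer cell, and an XPath node-set holds each node only once, so a single value comes back.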

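We can look for the "type2" string in all <th> elements, get the index of the first occurrence, and subtract 1 to get the index we want:
library(rvest)
library(stringr)
# `test` is the parsed document, e.g. the result of read_html()
th_nodes <- test %>% html_nodes('th')
# TRUE for every <th> whose markup mentions "type2"
location <- str_detect(as.character(th_nodes), "type2")
# index of the <th> just before the first match
index_want <- min(which(location)) - 1
th_nodes[[index_want]] %>%
  html_text()
#> [1] "text_that_I_want"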
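A variation on the index idea, sketched here under the assumption that the document is small enough to list all "th" nodes, tests the class attribute itself instead of the serialized markup. That avoids false positives when "type2" appears in cell text. The closing tags in the sample are assumed.

```r
library(xml2)

html <- '<table><tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table><thead><tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr></thead></table>
</td>
</tr></table>'

ths <- xml_find_all(read_html(html), "//th")

# class attribute of every <th>, in document order (NA when absent)
classes <- xml_attr(ths, "class")

# TRUE where "type2" is one of the space-separated class tokens
has_type2 <- grepl("(^|\\s)type2(\\s|$)", classes, perl = TRUE)

# position of the first match; the wanted header sits right before it
first_hit <- which(has_type2)[1]
wanted <- if (!is.na(first_hit) && first_hit > 1) {
  xml_text(ths[[first_hit - 1]])
} else {
  NA_character_
}
print(wanted)
```

The guard returns NA when no header carries the class or when the first match has nothing before it, where the plain index arithmetic would fail with a subscript error.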

Related

scraping wikipedia data which looks like a table but is not actually a table

I am trying to scrape some data from Wikipedia. The data I want to collect is the number of cases and the number of deaths from the first "table" on the Wikipedia page. Usually I would get the XPath of the table and use rvest, but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic. If I look at one of the collapsible rows, I get (for the date 2020-04-04):
<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl">​</div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl">​</div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl">​</div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>
The data is here - 8359, 14825, 284692 along with the # of cases - 307,876 and # of deaths - 8,359. I am trying to extract these numbers for each day.
Code:
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"
url %>%
read_html() %>%
html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>%
html_table(fill = TRUE)
You could use nth-child to target the various columns. To get the right number of rows in each column, use a CSS attribute selector with the starts-with operator (^=) so that the id attribute is matched by its leading substring:
library(rvest)
library(tidyverse)
library(stringr)
p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')
covid_info <- tibble(
dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
)%>%
mutate(
case_numbers = str_extract(gsub(',','',cases), '^.*(?=\\()' ) %>% as.integer(),
death_numbers = str_extract(gsub(',','',deaths), '^.*(?=\\()' ) %>% as.integer()
)
print(covid_info)
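The question also asks for the three numbers stored in the title attributes of the bar segments. A minimal offline sketch is given below, using a trimmed copy of the sample row (style attributes dropped and a wrapping table added, both assumptions) and a hypothetical helper to_int. Whether the live page still has this layout is not checked here.

```r
library(xml2)

row_html <- '<table><tr class="mw-collapsible" id="mw-customcollapsible-apr">
<td colspan="2" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" class="bb-fl"></div>
<div title="14825" class="bb-fl"></div>
<div title="284692" class="bb-fl"></div>
</td>
<td class="bb-04em"><span class="cbs-ibr">307,876</span><span class="cbs-ibl">(+12%)</span></td>
<td class="bb-04em"><span class="cbs-ibr">8,359</span><span class="cbs-ibl">(+19%)</span></td>
</tr></table>'

doc <- read_html(row_html)

# one node per daily row, matched by the leading part of its id
rows <- xml_find_all(doc, "//tr[starts-with(@id, 'mw-customcollapsible-')]")

# hypothetical helper: "307,876" -> 307876L
to_int <- function(x) as.integer(gsub(",", "", x, fixed = TRUE))

# the bar segments keep their values in the title attribute
bar_values <- lapply(rows, function(r) {
  to_int(xml_attr(xml_find_all(r, ".//div[@title]"), "title"))
})

# the first <span> of the 3rd and 4th cells holds the count without the percentage
covid_rows <- data.frame(
  date   = as.Date(xml_text(xml_find_first(rows, "./td[1]"))),
  cases  = to_int(xml_text(xml_find_first(rows, "./td[3]/span[1]"))),
  deaths = to_int(xml_text(xml_find_first(rows, "./td[4]/span[1]")))
)
print(covid_rows)
print(bar_values[[1]])
```

Reading the count from its own span avoids the lookahead regex, since the percentage sits in a separate element.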

Scraping ID attribute using rvest

I am trying to check whether the Polish elections were fair, and whether opposition candidates received abnormally low numbers of votes in districts with a higher number of invalid votes. To do so I need to scrape the results of each district.
Link to the official election results for my city: in the bottom table, each row is a different district, and clicking a row redirects you to that district. The link is not in the usual <a ... href = ...> format; instead, the variable part of the district link is encoded in the data-id=... attribute.
My question is: how can I extract the data-id= attribute from a table on a webpage using R?
Sample data: in this example I would like to extract 697773 from the row data.
<div class="proto" style="">
<div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
<thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>
</div>
</div>
I have tried using:
library(dplyr)
library(rvest)
read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
html_nodes('[class="table-responsive"]') %>%
html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
html_nodes('tr') %>%
html_attrs()
But I get named character(0) as a result
I found a not very optimal solution; I bet there is a better way!
I downloaded the webpage, saved it as a txt file, and read it from there:
txt_webpage <- readChar(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"),
file.info(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"))$size)
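# start position of every '<tr data' occurrence in the saved page
positions <- gregexpr(pattern = '<tr data', txt_webpage)
districts_numbers <- c()
for (i in positions[[1]]) {
  print(i)
  # take the characters right after '<tr data' and keep only the digits of data-id
  tmp <- substr(txt_webpage, i + 10, i + 22)
  tmp <- gsub('\\D+', '', tmp)
  districts_numbers <- c(districts_numbers, tmp)
}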
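Once the rows are present in the HTML being parsed (for example in the saved copy), a parser can read data-id directly instead of relying on character offsets. The sketch below runs on a trimmed version of the sample markup. One likely reason the original attempt returned nothing is that [class="..."] compares the whole attribute string, and the table carries more classes than the selector lists. A second possibility, which is only an assumption based on the DataTables ids, is that the live page fills the table with JavaScript, which read_html does not execute.

```r
library(rvest)

# trimmed copy of the sample markup from the question
saved_html <- '<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16">
<thead><tr role="row"><th>Numer</th><th>Siedziba</th><th>Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td><td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>'

page <- read_html(saved_html)

# .class selectors match individual class tokens, so extra classes do not matter;
# tr[data-id] keeps only rows that actually carry the attribute (skips the header row)
district_ids <- page %>%
  html_nodes("div.table-responsive table.table-bordered tr[data-id]") %>%
  html_attr("data-id")

print(district_ids)
```

For a file on disk, read_html() also accepts the file path in place of the string.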

html_nodes to scrape text with R

Actually I'm trying to get the sku number of this code (this number -> 111653240199):
<body>
<div id='a page'>
<div class="spaui-squishy-container" style="display:table:table-row;">
<div class="spaui-squishy-inner-container" style="display:table-row;">
<div class="spaui-squishy-content" style="display:table-cell;">
<div id="myi-table-center" class="a-container Madagascar-main-body">
<div id="miytable" class="mt-container clearfix">
<div class="mt-content clearfix">
::before
<div class="mt-content clearfix">
::before
<table class="a-bordered a-horizontal-stripes mt-table">
<tbody>
<tr id="head-row" class="mt-head">
<tr id="MTExNjUzMjQwMTk5" data-delayed-dependency-data="{"MYIService"(…)
<td id= MTExNjUzMjQwMTk5-sku” data-colum=”sku” data-row=” MTExNjUzMjQwMTk5”>
<div class="mt-combination mt-layout-block">
<div id="MTExNjUzMjQwMTk5-sku-sku" data-column="sku" data-row="ExNjUzMjQwMTk5">
<div class="clamped wordbreak">
<div class="mt-text mt-wrap-bw">
<span class="mt-text-content mt-table-main">
111653240199
</span>
My script in R has this:
dades<-read_html(url)
id<-dades %>% html_nodes("#mt-table-container.clearfix .mt-link.mt-wrap-bw.clamped.wordbreak a") %>% html_text()
But the result is an empty character vector.
What am I doing wrong?
Thanks in advance for the help and your time :-)
One way is the following:
library(rvest)
read_html(text) %>%
html_nodes('div.mt-text') %>%
html_text() %>%
#the following removes whitespaces
trimws()
#[1] "111653240199"
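A possible reason the question's selector returned nothing is that it asks for an id (mt-table-container), a class (mt-link) and an a element that do not appear in the posted markup. The sketch below uses only class names visible in the snippet, on a shortened and closed-off copy of it (the closing tags are assumed). If the real page needs a login or builds the table with JavaScript, read_html(url) would still not see these nodes.

```r
library(rvest)

# shortened copy of the posted markup; closing tags are assumed
snippet <- '<table class="a-bordered a-horizontal-stripes mt-table"><tbody>
<tr id="MTExNjUzMjQwMTk5">
<td>
<div class="mt-combination mt-layout-block">
<div class="clamped wordbreak">
<div class="mt-text mt-wrap-bw">
<span class="mt-text-content mt-table-main">
111653240199
</span>
</div></div></div>
</td></tr></tbody></table>'

# scope the search to the listing table, then take the span that holds the value;
# trim = TRUE strips the surrounding line breaks
sku <- read_html(snippet) %>%
  html_nodes("table.mt-table span.mt-text-content") %>%
  html_text(trim = TRUE)

print(sku)
```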

How to find and bold a series of four letters in an html table

I'm using the R programming language.
I'm hoping to find and make bold a series of four letters (amino acids, if you're curious) in a large html table of letters. I want to do this through html table navigation. If I were using regex on a normal string of letters, it would be "([KR].[ST][ILV])". This would find the letters RSSI or KATV, for instance. Unfortunately, the actual string I'm looking for would look something like this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>
The end result I want is this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt><b>R</b></tt></td>
<td bgcolor=""><tt><b>S</b></tt></td>
<td bgcolor="pink"><tt><b>S</b></tt></td>
<td bgcolor=""><tt><b>I</b></tt></td>
I've written a monster-sized regex to find this sequence (attached below), but it doesn't seem to work. And I realize now that I should be using html commands, but I'm having a good deal of trouble finding websites that tell me how to search-and-replace. What should I be searching for? And/or how would I accomplish what I've described above?
This is my monster-sized regex to find the sequence I want, but it doesn't seem to work. I now realize, of course, that I was going at it from the wrong direction.
`regexp <- '(
[\\<<td bgcolor=""><tt>K</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>K</tt></td>\\>
\\<<td bgcolor=""><tt>R</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>R</tt></td>\\>]
[\\<<td bgcolor=""><tt>.</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>.</tt></td>\\>]
[\\<<td bgcolor=""><tt>S</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>S</tt></td>\\>
\\<<td bgcolor=""><tt>T</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>T</tt></td>\\>]
[\\<<td bgcolor=""><tt>I</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>I</tt></td>\\>
\\<<td bgcolor=""><tt>L</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>L</tt></td>\\>
\\<<td bgcolor=""><tt>V</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>V</tt></td>\\>])'
`
Maybe try this approach instead of regular expressions:
library(xml2)
library(tidyverse)
txt <- '<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>'
needles <- c("RSSI", "KMSV")
doc <- read_html(txt)
doc %>%
xml_find_all("//tr") %>%
keep(xml_text(.) %in% gsub("(.)", "\\1\n", needles)) %>%
xml_find_all("td/tt/text()") %>%
xml_add_parent("b")
write_html(doc, tf <- tempfile(fileext = ".html"))
shell.exec(tf) # open temp file on windows
This wraps each column text into <b>...</b> (and saves the result to a temporary file).
cat(as.character(doc))
# ...
# <center><table class="sequence-table">
# <tr><th align="left">
# </th></tr>
# <tr>
# <td bgcolor="lightgreen"><tt><b>R</b></tt></td>
# <td bgcolor=""><tt><b>S</b></tt></td>
# <td bgcolor="pink"><tt><b>S</b></tt></td>
# <td bgcolor=""><tt><b>I</b></tt></td>
# ...
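The approach above compares each row's full text with a fixed list of strings. To apply the original pattern [KR].[ST][ILV] inside a longer row, one option is to join the cell letters into a string, locate matches with gregexpr, and map the match positions back to cells. This is only a sketch: it assumes one letter per cell, a motif of exactly four residues, and motifs that do not span rows. The sample row, with extra flanking letters A and G, is made up for illustration.

```r
library(xml2)

txt <- '<table class="sequence-table"><tr>
<td bgcolor=""><tt>A</tt></td>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>
<td bgcolor=""><tt>G</tt></td>
</tr></table>'

doc <- read_html(txt)
motif <- "[KR].[ST][ILV]"

rows <- xml_find_all(doc, "//tr")
for (i in seq_along(rows)) {
  cells <- xml_find_all(rows[[i]], "./td/tt")
  if (length(cells) == 0) next

  # one character per cell, so string position n corresponds to cell n
  row_seq <- paste(xml_text(cells), collapse = "")

  starts <- as.integer(gregexpr(motif, row_seq)[[1]])
  if (starts[1] == -1) next          # no motif in this row

  # every match covers four consecutive cells
  hit_idx <- unique(unlist(lapply(starts, function(s) s:(s + 3))))

  # wrap the text node of each matched cell in <b>...</b>
  xml_add_parent(xml_find_all(cells[hit_idx], "./text()"), "b")
}

bolded <- xml_text(xml_find_all(doc, "//td/tt/b"))
print(bolded)
```

Note that gregexpr reports non-overlapping matches only, so overlapping motifs would need a lookahead pattern with perl = TRUE.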

R - How to extract items from XML Nodeset?

I have a list of 438 pitcher names that look like this (in XML Nodeset):
> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
<a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td>
> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
<a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td>
How do I extract the names, like Fernando Abad, and the associated links, like /players/a/abadfe01.shtml?
Since you have a list, an apply function is used to walk through it. Each call uses read_html to parse the HTML fragment, then the CSS selector a to find the anchors (links). The names come from html_text and the links from the href attribute.
library(rvest)
pitcherlinks <- list()
pitcherlinks[[1]] <-
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
<a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td>'
pitcherlinks[[2]] <-
'<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
<a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td>'
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})
names
# [1] "Fernando Abad" "Tim Adleman"
links
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"
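If the cells are still part of a parsed document (a real node set rather than a list of strings), the per-element re-parsing can probably be skipped, since html_text and html_attr are vectorised over node sets. The sketch below rebuilds a small table for illustration. The anchor markup is an assumption inferred from the expected names and links, not copied from the site.

```r
library(rvest)

# illustrative table; the <a> tags are assumed from the expected output
page <- read_html('<table><tr>
<td class="left " data-append-csv="abadfe01" data-stat="player"><a href="/players/a/abadfe01.shtml">Fernando Abad</a>*</td>
</tr><tr>
<td class="left " data-append-csv="adlemti01" data-stat="player"><a href="/players/a/adlemti01.shtml">Tim Adleman</a></td>
</tr></table>')

# all player anchors at once, in document order
anchors <- page %>% html_nodes("td[data-stat='player'] a")

# name and link stay aligned because both come from the same node set
pitchers <- data.frame(
  name = html_text(anchors),
  link = html_attr(anchors, "href"),
  stringsAsFactors = FALSE
)
print(pitchers)
```

Taking the text from the anchor rather than the cell also leaves out the trailing asterisk.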