html_nodes to scrape text with R

I'm trying to get the SKU number (this number -> 111653240199) out of this HTML:
<body>
<div id="a page">
<div class="spaui-squishy-container" style="display:table;">
<div class="spaui-squishy-inner-container" style="display:table-row;">
<div class="spaui-squishy-content" style="display:table-cell;">
<div id="myi-table-center" class="a-container Madagascar-main-body">
<div id="miytable" class="mt-container clearfix">
<div class="mt-content clearfix">
<div class="mt-content clearfix">
<table class="a-bordered a-horizontal-stripes mt-table">
<tbody>
<tr id="head-row" class="mt-head">
<tr id="MTExNjUzMjQwMTk5" data-delayed-dependency-data="{"MYIService"(…)
<td id="MTExNjUzMjQwMTk5-sku" data-column="sku" data-row="MTExNjUzMjQwMTk5">
<div class="mt-combination mt-layout-block">
<div id="MTExNjUzMjQwMTk5-sku-sku" data-column="sku" data-row="ExNjUzMjQwMTk5">
<div class="clamped wordbreak">
<div class="mt-text mt-wrap-bw">
<span class="mt-text-content mt-table-main">
111653240199
</span>
My R script has this:
dades <- read_html(url)
id <- dades %>% html_nodes("#mt-table-container.clearfix .mt-link.mt-wrap-bw.clamped.wordbreak a") %>% html_text()
But the result is an empty character vector (character(0)). What am I doing wrong?
Thanks in advance for your help and time :-)

One way is the following:
library(rvest)
# `text` holds the HTML fragment shown above
read_html(text) %>%
  html_nodes('div.mt-text') %>%
  html_text() %>%
  trimws() # trimws() strips the surrounding whitespace
#[1] "111653240199"
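As an aside, the id on the table row in the question (MTExNjUzMjQwMTk5) looks like the Base64 encoding of the SKU itself, so the value can also be recovered by decoding the attribute instead of reading the cell text. A quick check, sketched here in Python:

```python
import base64

# The id attribute from the <tr> in the question's HTML.
row_id = "MTExNjUzMjQwMTk5"

# Decoding the id yields the SKU digits directly.
sku = base64.b64decode(row_id).decode("ascii")
print(sku)  # 111653240199
```

This would also work from R via html_attr("id") on the row plus a Base64 decoder, if the text node ever proves hard to reach.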

Related

Scraping Wikipedia data which looks like a table but is not actually a table

I am trying to scrape some data from Wikipedia. The data I want to collect is the number of cases and number of deaths from the first "table" on the page. Usually I would get the XPath of the table and use rvest, but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic. If I look at one of the collapsibles, I get (for the date 2020-04-04):
<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl">​</div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl">​</div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl">​</div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>
The data is here - 8359, 14825, 284692 - along with the number of cases (307,876) and the number of deaths (8,359). I am trying to extract these numbers for each day.
Code:
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"
url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>%
  html_table(fill = TRUE)
You could use :nth-child to target the various columns. To get the right number of rows in each column, it helps to use a CSS attribute selector with the starts-with operator (^=) to match only the rows whose id attribute begins with the appropriate substring.
library(rvest)
library(tidyverse)
library(stringr)
p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')
covid_info <- tibble(
  dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
  cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
  deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
) %>%
  mutate(
    case_numbers = str_extract(gsub(',', '', cases), '^.*(?=\\()') %>% as.integer(),
    death_numbers = str_extract(gsub(',', '', deaths), '^.*(?=\\()') %>% as.integer()
  )
print(covid_info)
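To make the regex step concrete: the mutate() above strips the thousands separators and then takes everything before the opening parenthesis. The same pattern can be checked quickly outside R; a minimal sketch in Python, using the sample values quoted in the question:

```python
import re

# Sample cell texts from the question: count plus "(+N%)" suffix.
cells = ["307,876(+12%)", "8,359(+19%)"]

numbers = []
for cell in cells:
    # Remove thousands separators, then grab everything that
    # precedes the "(" via a lookahead, mirroring '^.*(?=\()'.
    cleaned = cell.replace(",", "")
    match = re.search(r"^.*(?=\()", cleaned)
    if match:
        numbers.append(int(match.group()))

print(numbers)  # [307876, 8359]
```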

Scraping ID attribute using rvest

I am trying to check whether the Polish elections are fair and whether candidates from the opposition got an abnormally low number of votes in districts with a higher share of invalid votes. To do so I need to scrape the results of each district.
Here is the link to the official election results for my city - in the bottom table, each row is a different district, and clicking a row redirects you to that district's page. The link is not in the usual <a ... href=...> format; instead, the variable part of the district link is encoded in the data-id=... attribute.
My question is: how do I extract the data-id attribute from the table rows on a webpage using R?
Sample data - in this example I would like to extract 697773 from the row:
<div class="proto" style="">
<div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
<thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>
</div>
</div>
I have tried using:
library(dplyr)
library(rvest)
read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
  html_nodes('[class="table-responsive"]') %>%
  html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
  html_nodes('tr') %>%
  html_attrs()
But I get named character(0) as a result.
I found a not very optimal solution. I bet there is a better way!
I downloaded the webpage, saved it as a txt file, and read it from there:
txt_file <- paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt")
txt_webpage <- readChar(txt_file, file.info(txt_file)$size)
positions <- gregexpr(pattern = '<tr data', txt_webpage)
districts_numbers <- c()
for (i in positions[[1]]) {
  # Take the characters just after '<tr data-id="' and keep only the digits.
  tmp <- substr(txt_webpage, i + 10, i + 22)
  tmp <- gsub('\\D+', '', tmp)
  districts_numbers <- c(districts_numbers, tmp)
}
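One likely reason the rvest attempt returned character(0) is that [class="table table-bordered table-striped table-hover"] requires an exact attribute match, while the table's full class attribute in the sample also contains dataTable no-footer clickable (the results may also be rendered by JavaScript, which would explain why the saved-page route was needed). Once you have the HTML, though, a real parser is more robust than fixed substr offsets. A minimal sketch of the same extraction with Python's standard-library HTMLParser, run on the sample row from the question:

```python
from html.parser import HTMLParser

class DataIdCollector(HTMLParser):
    """Collects the data-id attribute of every <tr> that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            for name, value in attrs:
                if name == "data-id":
                    self.ids.append(value)

# The sample row from the question.
html = ('<tr data-id="697773" role="row" class="odd">'
        '<td class="sorting_1">1</td>'
        '<td>Szkoła Podstawowa nr 63</td></tr>')

parser = DataIdCollector()
parser.feed(html)
print(parser.ids)  # ['697773']
```

In R, the equivalent idea is selecting the rows and calling html_attr("data-id") rather than pattern-matching on raw text.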

How to get HTML element that is before a certain class?

I'm scraping a page and having trouble getting the "th" element that comes immediately before the "th" element with the "type2" class. I want to select it by its position relative to the "th" with class "type2", because my HTML has a lot of "th" elements and that was the only distinguishing feature I found between the tables.
Using rvest or xml2 (or other R package), can I get this parent?
The content which I want is "text_that_I_want".
Thank you!
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>
The formal and more generalizable way to navigate relative to a given node is XPath's ancestor and preceding-sibling axes:
read_html(htmldoc) %>%
  html_nodes(xpath = "//th[@class = 'string type2']/ancestor::td/preceding-sibling::th") %>%
  html_text()
#> [1] "text_that_I_want"
Alternatively, we can look for the "type2" string in all <th>s, get the index of the first occurrence, and subtract 1 to get the index we want:
library(dplyr)
library(rvest)
library(stringr)
location <- test %>%
  html_nodes('th') %>%
  as.character() %>% # coerce the nodeset so str_detect gets a character vector
  str_detect("type2")
index_want <- min(which(location)) - 1
test %>%
  html_nodes('th') %>%
  .[[index_want]] %>%
  html_text()
[1] "text_that_I_want"
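The index-arithmetic idea is language-agnostic: collect the <th>s in document order, find the first one whose class list contains "type2", and step back one position. A small Python sketch of that logic over the simplified fragment from the question (the regex is only for this illustration; a real parser is preferable for arbitrary HTML):

```python
import re

html = """
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>
"""

# Collect (class, text) for every <th> in document order.
ths = re.findall(r'<th class="([^"]*)">([^<]*)</th>', html)

# Index of the first th whose class list contains "type2",
# then step back one position to the th we actually want.
first_type2 = next(i for i, (cls, _) in enumerate(ths) if "type2" in cls)
print(ths[first_type2 - 1][1])  # text_that_I_want
```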

Get elements from table using XPath

I am trying to get information from this website:
https://www.realtypro.co.za/property_detail.php?ref=1736
I have this table, from which I want to take the number of bedrooms:
<div class="panel panel-primary">
<div class="panel-heading">Property Details</div>
<div class="panel-body">
<table width="100%" cellpadding="0" cellspacing="0" border="0" class="table table-striped table-condensed table-tweak">
<tbody><tr>
<td class="xh-highlight">3</td><td style="width: 140px" class="">Bedrooms</td>
</tr>
<tr>
<td>Bathrooms</td>
<td>3</td>
</tr>
I am using this xpath expression:
bedrooms = response.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]/text()").extract_first()
However, I only get None as output. I have tried several combinations and always get None. Any suggestions on what I am doing wrong?
Thanks in advance!
I would use bs4 4.7.1+, where you can search with :contains for the td cell containing the text "Bedrooms", then take the adjacent sibling td. You can add an is None test for error handling. This is less fragile than a long XPath.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.realtypro.co.za/property_detail.php?ref=1736')
soup = bs(r.content, 'lxml')
print(int(soup.select_one('td:contains(Bedrooms) + td').text))
If position was fixed you could use
.table-tweak td + td
Try this and let me know if it works:
import lxml.html
response = [your code above]
beds = lxml.html.fromstring(response)
bedrooms = beds.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
bedrooms
Output:
['3']
EDIT:
Or possibly:
for bed in beds:
    num_rooms = bed.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
    print(num_rooms)

R - How to extract items from XML Nodeset?

I have a list of 438 pitcher entries that look like this (as an XML nodeset):
> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
<a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td>
> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
<a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td>
How do I extract the names, like Fernando Abad, and the associated links, like /players/a/abadfe01.shtml?
Since you have a list, an apply function is used to walk through it. Each function uses read_html to parse the HTML fragment, then the CSS selector a to find the anchors (links). The names come from html_text(), and the link is in the href attribute.
library(rvest)
pitcherlinks <- list()
pitcherlinks[[1]] <-
  '<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
  <a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
  </td>'
pitcherlinks[[2]] <-
  '<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
  <a href="/players/a/adlemti01.shtml">Tim Adleman</a>
  </td>'
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})
names
# [1] "Fernando Abad" "Tim Adleman"
links
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"
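For comparison, the same two extractions (anchor text and href) can be sketched with Python's standard-library HTMLParser, assuming each fragment contains an <a href=...> link as the answer's output implies:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Records the href and text of every <a> across the fragments fed in."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.names = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        # Only keep text that sits inside an anchor (skips the trailing "*").
        if self._in_a:
            self.names.append(data)

fragments = [
    '<td class="left "><a href="/players/a/abadfe01.shtml">Fernando Abad</a>*</td>',
    '<td class="left "><a href="/players/a/adlemti01.shtml">Tim Adleman</a></td>',
]

parser = AnchorCollector()
for fragment in fragments:
    parser.feed(fragment)

print(parser.names)  # ['Fernando Abad', 'Tim Adleman']
print(parser.links)  # ['/players/a/abadfe01.shtml', '/players/a/adlemti01.shtml']
```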