I have a list of 438 pitcher names that look like this (in XML Nodeset):
> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
Fernando Abad*
</td>
> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
Tim Adleman
</td>
How do I extract the names, like Fernando Abad, and the associated links, like /players/a/abadfe01.shtml?
Since you have a list, an apply function is used to walk through it. Each call uses read_html to parse the HTML fragment and the CSS selector a to find the anchor (link); the name comes from html_text() and the link is in the href attribute.
library(rvest)
pitcherlinks <- list()
pitcherlinks[[1]] <-
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
Fernando Abad*
</td>'
pitcherlinks[[2]] <-
'<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
Tim Adleman
</td>'
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})
names
# [1] "Fernando Abad" "Tim Adleman"
links
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"
I am trying to scrape some data from Wikipedia. The data I want to collect is the # of cases and # of deaths from the first "table" on the Wikipedia page. Usually I would get the xpath of the table and use rvest, but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic; if I look at one of the collapsibles, I get (for the date 2020-04-04):
<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl"></div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl"></div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl"></div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>
The data is here (8359, 14825, 284692), along with the # of cases (307,876) and # of deaths (8,359). I am trying to extract these numbers for each day.
Code:
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"
url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>%
  html_table(fill = TRUE)
You could use nth-child to target the various columns. To get the right number of rows in each column, it is useful to use a CSS attribute selector with the starts-with operator (^=) to target the appropriate id attribute by a substring of its value.
library(rvest)
library(tidyverse)
library(stringr)
p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')
covid_info <- tibble(
  dates  = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
  cases  = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
  deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
) %>%
  mutate(
    case_numbers  = str_extract(gsub(',', '', cases), '^.*(?=\\()') %>% as.integer(),
    death_numbers = replace_na(str_extract(gsub(',', '', deaths), '^.*(?=\\()') %>% as.integer(), NA_integer_)
  )
print(covid_info)
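If you also want the three per-day bar values from the graphic (8359, 14825, 284692 in the snippet above), they sit in the title attribute of the div elements inside the td with class bb-lr. A sketch, reusing the parsed page p from above and assuming the live page still uses the bb-lr / bb-fl class names shown in the question:
bar_values <- p %>%
  html_nodes('[id^=mw-customcollapsible-] td.bb-lr') %>%   # one bar cell per day
  lapply(function(cell) {
    cell %>%
      html_nodes('div.bb-fl') %>%    # one div per coloured segment
      html_attr('title') %>%
      as.numeric()
  })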
I am trying to check whether the Polish elections are fair and whether opposition candidates got an abnormally low number of votes in districts with a higher number of invalid votes. To do so I need to scrape the results of each district.
Link to the official results of the elections for my city: in the bottom table, each row is a different district, and clicking it redirects you to that district. The link is not in the usual <a ... href=...> format; instead, the variable part of the district link is encoded in data-id=....
My question is: how do I extract the data-id= attribute of the table rows on a webpage using R?
Sample data: in this example I would like to extract 697773 from the row's data-id attribute.
<div class="proto" style="">
<div id="DataTables_Table_16_wrapper" class="dataTables_wrapper dt-bootstrap no-footer">
<div class="table-responsive">
<table class="table table-bordered table-striped table-hover dataTable no-footer clickable" id="DataTables_Table_16" role="grid">
<thead><tr role="row"><th class="sorting_asc" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-sort="ascending" aria-label="Numer: aktywuj, by posortować kolumnę malejąco">Numer</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Siedziba: aktywuj, by posortować kolumnę rosnąco">Siedziba</th><th class="sorting" tabindex="0" aria-controls="DataTables_Table_16" rowspan="1" colspan="1" aria-label="Granice: aktywuj, by posortować kolumnę rosnąco">Granice</th></tr></thead>
<tbody>
<tr data-id="697773" role="row" class="odd"><td class="sorting_1">1</td><td>Szkoła Podstawowa nr 63</td> <td>Bożego Ciała...</td></tr>
</tbody>
</table>
</div>
</div>
</div>
I have tried using:
library(dplyr)
library(rvest)
read_html("https://wybory.gov.pl/prezydent20200628/pl/wyniki/1/pow/26400") %>%
html_nodes('[class="table-responsive"]') %>%
html_nodes('[class="table table-bordered table-striped table-hover"]') %>%
html_nodes('tr') %>%
html_attrs()
But I get named character(0) as a result
I found a not-very-optimal solution; I bet there is a better way!
I downloaded the webpage, saved it as a txt file, and read it from there:
txt_webpage <- readChar(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"),
                        file.info(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"))$size)

# find the start of every '<tr data' occurrence
positions <- gregexpr(pattern = '<tr data', txt_webpage)

districts_numbers <- c()
for (i in positions[[1]]) {
  print(i)
  tmp <- substr(txt_webpage, i + 10, i + 22)   # grab the characters around the data-id value
  tmp <- gsub('\\D+', '', tmp)                 # keep only the digits
  districts_numbers <- c(districts_numbers, tmp)
}
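A less manual alternative is a sketch that parses the saved page with rvest and pulls the data-id attribute straight off the table rows (assuming the file is the fully rendered page you already saved, since the live page builds the table with JavaScript and read_html on the URL alone returns nothing):
library(rvest)

# parse the locally saved, fully rendered page
page <- read_html(paste0(getwd(), "\\Wyniki pierwszego głosowania _ Wrocław.txt"))

districts_numbers <- page %>%
  html_nodes("table.clickable tbody tr") %>%   # the district rows from the sample markup
  html_attr("data-id")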
I'm scraping and having trouble getting the th element that comes before the other th element that contains the "type2" class. I prefer to select it by identifying that it is the th element before the th with class "type2", because my HTML has a lot of th elements and that was the only difference I found between the tables.
Using rvest or xml2 (or another R package), can I get this parent?
The content I want is "text_that_I_want".
Thank you!
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>
The formal and more generalizable way to navigate XPath relative to a given node is via the ancestor and preceding-sibling axes:
read_html(htmldoc) %>%
  html_nodes(xpath = "//th[@class = 'string type2']/ancestor::td/preceding-sibling::th") %>%
  html_text()
#> [1] "text_that_I_want"
We can look for the "type2" string in all <th>s, get the index of the first occurrence and substract 1 to get the index we want:
library(dplyr)
library(rvest)
library(stringr)

# `test` is the parsed HTML document, e.g. test <- read_html(htmldoc)
location <- test %>%
  html_nodes('th') %>%
  str_detect("type2")

index_want <- min(which(location)) - 1

test %>%
  html_nodes('th') %>%
  .[[index_want]] %>%
  html_text()
[1] "text_that_I_want"
I'm trying to get the SKU number from this code (this number: 111653240199):
<body>
<div id='a page'>
<div class="spaui-squishy-container" style="display:table:table-row;">
<div class="spaui-squishy-inner-container" style="display:table-row;">
<div class="spaui-squishy-content" style="display:table-cell;">
<div id="myi-table-center" class="a-container Madagascar-main-body">
<div id="miytable" class="mt-container clearfix">
<div class="mt-content clearfix">
::before
<div class="mt-content clearfix">
::before
<table class="a-bordered a-horizontal-stripes mt-table">
<tbody>
<tr id="head-row" class="mt-head">
<tr id="MTExNjUzMjQwMTk5" data-delayed-dependency-data="{"MYIService"(…)
<td id="MTExNjUzMjQwMTk5-sku" data-colum="sku" data-row="MTExNjUzMjQwMTk5">
<div class="mt-combination mt-layout-block">
<div id="MTExNjUzMjQwMTk5-sku-sku" data-column="sku" data-row="ExNjUzMjQwMTk5">
<div class="clamped wordbreak">
<div class="mt-text mt-wrap-bw">
<span class="mt-text-content mt-table-main">
111653240199
</span>
My script in R has this:
dades<-read_html(url)
id<-dades %>% html_nodes("#mt-table-container.clearfix .mt-link.mt-wrap-bw.clamped.wordbreak a") %>% html_text()
But the result is an empty character vector.
What am I doing wrong?
Thanks in advance for the help and your time :-)
One way is with the following:
library(rvest)
# `text` holds the HTML snippet above
read_html(text) %>%
  html_nodes('div.mt-text') %>%
  html_text() %>%
  trimws()   # remove surrounding whitespace
#[1] "111653240199"
I am working with the DrugBank database. Please, I need help extracting specific text from the HTML code below:
<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>
I want to have the following as my output, as a list object:
B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS
I have tried the function below, but it's not working:
library(XML)
getATC <- function(id){
  url <- "http://www.drugbank.ca/drugs/"
  dburl <- paste(url, id, sep = "")
  tables <- readHTMLTable(dburl, header = F)
  table <- tables[['atc-drug-tree']]
  table
}
ids <- c("DB00208", "DB00209")
ref <- apply(ids, 1, getATC)
NB: the URL can be used to see the actual page I want to parse; the HTML snippet I provided was just an example.
Thanks
rvest makes web scraping pretty simple. Here's a solution using it.
library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>')
your_name <-
  your_html %>%
  html_nodes(xpath = '//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
  html_text() %>%
  str_extract(".+(?=\n)")

list_elements <-
  your_html %>% html_nodes("li") %>% html_nodes("a") %>% html_text()

your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS"
[4] "B — BLOOD AND BLOOD FORMING ORGANS"
Create the URL strings and sapply over them with the getDrugs function, which parses the HTML, finds each ul node with the indicated class, and returns its parent's text (only the part before the first whitespace) followed by the text of each ./li/a grandchild:
library(XML)
getDrugs <- function(...) {
  doc <- htmlTreeParse(..., useInternalNodes = TRUE)
  xpathApply(xmlRoot(doc), "//ul[@class='atc-drug-tree']", function(node) {
    c(sub("\\s.*", "", xmlValue(xmlParent(node))),  # get text before 1st whitespace
      xpathSApply(node, "./li/a", xmlValue))        # get text in each ./li/a node
  })
}
ids <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)
giving the following list (one component per URL and a component within each for each drug found in that URL):
> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"
[5] "A — ALIMENTARY TRACT AND METABOLISM"
$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"
[4] "G04 — UROLOGICALS"
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"
We could create a 5x3 matrix out of the above like this:
simplify2array(do.call(c, L))
And here is a test using the input in the question:
Lines <- '<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>'
getDrugs(Lines, asText = TRUE)
giving:
[[1]]
[1] "B01AC05"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
readHTMLTable is not working because it can't read the headers in tables 3 and 4.
url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts
td a tr li th span div p strong img table ...
745 399 342 175 159 137 66 49 46 27 27
#errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)
# this works
readHTMLTable(doc, which=3, header=FALSE)
Also, the ATC codes are not within a nearby table tag, so you have to use XPath like the other answers here.
xpathSApply(doc, '//ul[#class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS" "B — BLOOD AND BLOOD FORMING ORGANS"
xpathSApply(doc, '//ul[#class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"