Extracting text from HTML page in R

Extracting text from HTML page in R - html

I am working on drugbank database, please i need help to extract specific text from the below HTML code:
<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>
i want to have the following as my output text as list object:
B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS
I have tried the below function but its not working:
library(XML)
getATC <- function(id){
url <- "http://www.drugbank.ca/drugs/"
dburl <- paste(url, id, sep ="")
tables <- readHTMLTable(dburl, header = F)
table <- tables[['atc-drug-tree']]
table
}
ids <- c("DB00208", "DB00209")
ref <- apply(ids, 1, getATC)
NB:
The url can be use to see the actual page i want to parse, the HTML snippet i provided was just and example.
Thanks

rvest makes web scraping pretty simple. Here's a solution using it.
library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>')
your_name <-
your_html %>%
html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
html_text() %>%
str_extract(".+(?=\n)")
list_elements <-
your_html %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC Â— Platelet aggregation inhibitors excl. heparin"
[2] "B01A Â— ANTITHROMBOTIC AGENTS"
[3] "B01 Â— ANTITHROMBOTIC AGENTS"
[4] "B Â— BLOOD AND BLOOD FORMING ORGANS"

Create the URL strings and sapply them using the getDrugs function which parses the HTML, extracts the root of the HTML tree, finds the ul node with the indicated class and returns its parent's text (but only before the first whitespace) followed by the text in each ./li/a grandchild:
library(XML)
getDrugs <- function(...) {
doc <- htmlTreeParse(..., useInternalNodes = TRUE)
xpathApply(xmlRoot(doc), "//ul[#class='atc-drug-tree']", function(node) {
c(sub("\\s.*", "", xmlValue(xmlParent(node))), # get text before 1st whitespace
xpathSApply(node, "./li/a", xmlValue)) # get text in each ./li/a node
})
}
ids <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)
giving the following list (one component per URL and a component within each for each drug found in that URL):
> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"
[5] "A — ALIMENTARY TRACT AND METABOLISM"
$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"
[4] "G04 — UROLOGICALS"
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"
We could create a 5x3 matrix out of the above like this:
simplify2array(do.call(c, L))
And here is a test using the input in the question:
Lines <- '<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>'
getDrugs(Lines, asText = TRUE)
giving:
[[1]]
[1] "B01AC05"
[2] "B01AC Â— Platelet aggregation inhibitors excl. heparin"
[3] "B01A Â— ANTITHROMBOTIC AGENTS"
[4] "B01 Â— ANTITHROMBOTIC AGENTS"
[5] "B Â— BLOOD AND BLOOD FORMING ORGANS"

readHTMLTable is not working because it can't read the headers in tables 3 and 4.
url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts
td a tr li th span div p strong img table ...
745 399 342 175 159 137 66 49 46 27 27
#errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)
# this works
readHTMLTable(doc, which=3, header=FALSE)
Also, ATC codes is not within a nearby table tag, so you have to use xpath like the other answers here.
xpathSApply(doc, '//ul[#class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS" "B — BLOOD AND BLOOD FORMING ORGANS"
xpathSApply(doc, '//ul[#class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"

Related

scraping wikipedia data which looks like a table but is not actually a table

I am trying to scrape some data from Wikiepedia. The data I want to collect is the # of cases and # of deaths from the first "table" on the Wikipedia page. Usually I would get the xpath of the table and use rvest but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic, if I look at one of the collapsible's I get (for the date 2020-04-04):
<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl"></div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl"></div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl"></div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>
The data is here - 8359, 14825, 284692 along with the # of cases - 307,876 and # of deaths - 8,359. I am trying to extract these numbers for each day.
Code:
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"
url %>%
read_html() %>%
html_node(xpath = '//*[#id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>%
html_table(fill = TRUE)

You could use nth-child to target the various columns. To get the right number of rows in each column it is useful to use a css attribute selector with starts with operator to target the appropriate id attribute and substring of attribute value
library(rvest)
library(tidyverse)
library(stringr)
p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')
covid_info <- tibble(
dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
)%>%
mutate(
case_numbers = str_extract(gsub(',','',cases), '^.*(?=\\()' ) %>% as.integer(),
death_numbers = replace_na(str_extract(gsub(',','',deaths), '^.*(?=\\()' ) %>% as.integer(), NA_integer_)
)
print(covid_info)

How to get HTML element that is before a certain class?

I'm scraping and having trouble getting the element of the “th” tag that comes before the other “th” element that contains the “type2” class. I prefer to take it by identifying that it is the element "th" before the "th" with class "type2" because my HTML has a lot of "th" and that was the only difference I found between the tables.
Using rvest or xml2 (or other R package), can I get this parent?
The content which I want is "text_that_I_want".
Thank you!
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>

The formal and more generalizable way to navigate xpath relative to a given node is via ancestor preceding-sibling:
read_html(htmldoc) %>%
html_nodes(xpath = "//th[#class = 'string type2']/ancestor::td/preceding-sibling::th") %>%
html_text()
#> [1] "text_that_I_want"

We can look for the "type2" string in all <th>s, get the index of the first occurrence and substract 1 to get the index we want:
library(dplyr)
library(rvest)
location <- test%>%
html_nodes('th') %>%
str_detect("type2")
index_want <- min(which(location == TRUE) - 1)
test%>%
html_nodes('th') %>%
.[[index_want]] %>%
html_text()
[1] "text_that_I_want"

How to find and bold a series of four letters in an html table

I'm using the R programming language.
I'm hoping to find and make bold a series of four letters (amino acids, if you're curious) in a large html table of letters. I want to do this through html table navigation. If I were using regex on a normal string of letters, it would be "([KR].[ST][ILV])". This would find the letters RSSI or KATV, for instance. Unfortunately, the actual string I'm looking for would look something like this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>
The end result I want is this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt><b>R</b></tt></td>
<td bgcolor=""><tt><b>S</b></tt></td>
<td bgcolor="pink"><tt><b>S</b></tt></td>
<td bgcolor=""><tt><b>I</b></tt></td>
I've written a monster-sized regex to find this sequence (attached below), but it doesn't seem to work. And I realize now that I should be using html commands, but I'm having a good deal of trouble finding websites that tell me how to search-and-replace. What should I be searching for? And/or how would I accomplish what I've described above?
This is my monster-sized regex to find the sequence I want, but it doesn't seem to work. I now realize, of course, that I was going at it from the wrong direction.
`regexp <- '(
[\\<<td bgcolor=""><tt>K</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>K</tt></td>\\>
\\<<td bgcolor=""><tt>R</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>R</tt></td>\\>]
[\\<<td bgcolor=""><tt>.</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>.</tt></td>\\>]
[\\<<td bgcolor=""><tt>S</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>S</tt></td>\\>
\\<<td bgcolor=""><tt>T</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>T</tt></td>\\>]
[\\<<td bgcolor=""><tt>I</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>I</tt></td>\\>
\\<<td bgcolor=""><tt>L</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>L</tt></td>\\>
\\<<td bgcolor=""><tt>V</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>V</tt></td>\\>])'
`

Maybe try this approach instead of regular expressions:
library(xml2)
library(tidyverse)
txt <- '<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>'
needles <- c("RSSI", "KMSV")
doc <- read_html(txt)
doc %>%
xml_find_all("//tr") %>%
keep(xml_text(.) %in% gsub("(.)", "\\1\n", needles)) %>%
xml_find_all("td/tt/text()") %>%
xml_add_parent("b")
write_html(doc, tf <- tempfile(fileext = ".html"))
shell.exec(tf) # open temp file on windows
This wraps each column text into <b>...</b> (and saves the result to a temporary file).
cat(as.character(doc))
# ...
# <center><table class="sequence-table">
# <tr><th align="left">
# </th></tr>
# <tr>
# <td bgcolor="lightgreen"><tt><b>R</b></tt></td>
# <td bgcolor=""><tt><b>S</b></tt></td>
# <td bgcolor="pink"><tt><b>S</b></tt></td>
# <td bgcolor=""><tt><b>I</b></tt></td>
# ...

R - How to extract items from XML Nodeset?

I have a list of 438 pitcher names that look like this (in XML Nodeset):
> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
FernandoÂ Abad*
</td>
> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
TimÂ Adleman
</td>
How do I extract the names like FernandoÂ Abad and the associated links like /players/a/abadfe01.shtml

Since you have a list, an apply function is used to walk through the list. Each function uses read_html to parse the hmtl fragment in the list using the CSS selector a to find the anchors (links). The names come from the html_text and the link is in the attribute href
library(rvest)
pitcherlinks <- list()
pitcherlinks[[1]] <-
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
FernandoÂ Abad*
</td>'
pitcherlinks[[2]] <-
'<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
TimÂ Adleman
</td>'
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})
names
# [1] "FernandoÂ Abad" "TimÂ Adleman"
links
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"

Regex \n doesn't work

I'm trying to parse text out of two lines of HTML.
Dim PattStats As New Regex("class=""head"">(.+?)</td>"+
"\n<td>(.+?)</td>")
Dim makor As MatchCollection = PattStats.Matches(page)
For Each MatchMak As Match In makor
ListView3.Items.Add(MatchMak.Groups(1).Value)
Next
I added the \n to match the next line, but for some reason it won't work. Here's the source I'm running the regex against.
<table class="table table-striped table-bordered table-condensed">
<tbody>
<tr>
<td class="head">Health Points:</td>
<td>445 (+85 / per level)</td>
<td class="head">Health Regen:</td>
<td>7.25</td>
</tr>
<tr>
<td class="head">Energy:</td>
<td>200</td>
<td class="head">Energy Regen:</td>
<td>50</td>
</tr>
<tr>
<td class="head">Damage:</td>
<td>53 (+3.2 / per level)</td>
<td class="head">Attack Speed:</td>
<td>0.694 (+3.1 / per level)</td>
</tr>
<tr>
<td class="head">Attack Range:</td>
<td>125</td>
<td class="head">Movement Speed:</td>
<td>325</td>
</tr>
<tr>
<td class="head">Armor:</td>
<td>16.5 (+3.5 / per level)</td>
<td class="head">Magic Resistance:</td>
<td>30 (+1.25 / per level)</td>
</tr>
<tr>
<td class="head">Influence Points (IP):</td>
<td>3150</td>
<td class="head">Riot Points (RP):</td>
<td>975</td>
</tr>
</tbody>
</table>
I'd like to match the first <td class...> and the following line in one regex :/

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Extracting text from HTML page in R - html

Related

scraping wikipedia data which looks like a table but is not actually a table

How to get HTML element that is before a certain class?

How to find and bold a series of four letters in an html table

R - How to extract items from XML Nodeset?

Regex \n doesn't work

Categories

Resources