Extracting text from HTML page in R - html

I am working on drugbank database, please i need help to extract specific text from the below HTML code:
<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>
i want to have the following as my output text as list object:
B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS
I have tried the below function but its not working:
library(XML)
getATC <- function(id){
url <- "http://www.drugbank.ca/drugs/"
dburl <- paste(url, id, sep ="")
tables <- readHTMLTable(dburl, header = F)
table <- tables[['atc-drug-tree']]
table
}
ids <- c("DB00208", "DB00209")
ref <- apply(ids, 1, getATC)
NB:
The url can be use to see the actual page i want to parse, the HTML snippet i provided was just and example.
Thanks

rvest makes web scraping pretty simple. Here's a solution using it.
library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>')
your_name <-
your_html %>%
html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
html_text() %>%
str_extract(".+(?=\n)")
list_elements <-
your_html %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS"
[4] "B — BLOOD AND BLOOD FORMING ORGANS"

Create the URL strings and sapply them using the getDrugs function which parses the HTML, extracts the root of the HTML tree, finds the ul node with the indicated class and returns its parent's text (but only before the first whitespace) followed by the text in each ./li/a grandchild:
library(XML)
getDrugs <- function(...) {
doc <- htmlTreeParse(..., useInternalNodes = TRUE)
xpathApply(xmlRoot(doc), "//ul[#class='atc-drug-tree']", function(node) {
c(sub("\\s.*", "", xmlValue(xmlParent(node))), # get text before 1st whitespace
xpathSApply(node, "./li/a", xmlValue)) # get text in each ./li/a node
})
}
ids <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)
giving the following list (one component per URL and a component within each for each drug found in that URL):
> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"
[5] "A — ALIMENTARY TRACT AND METABOLISM"
$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"
[4] "G04 — UROLOGICALS"
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"
We could create a 5x3 matrix out of the above like this:
simplify2array(do.call(c, L))
And here is a test using the input in the question:
Lines <- '<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>'
getDrugs(Lines, asText = TRUE)
giving:
[[1]]
[1] "B01AC05"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"

readHTMLTable is not working because it can't read the headers in tables 3 and 4.
url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts
td a tr li th span div p strong img table ...
745 399 342 175 159 137 66 49 46 27 27
#errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)
# this works
readHTMLTable(doc, which=3, header=FALSE)
Also, ATC codes is not within a nearby table tag, so you have to use xpath like the other answers here.
xpathSApply(doc, '//ul[#class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS" "B — BLOOD AND BLOOD FORMING ORGANS"
xpathSApply(doc, '//ul[#class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"

Related

scraping wikipedia data which looks like a table but is not actually a table

I am trying to scrape some data from Wikiepedia. The data I want to collect is the # of cases and # of deaths from the first "table" on the Wikipedia page. Usually I would get the xpath of the table and use rvest but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic, if I look at one of the collapsible's I get (for the date 2020-04-04):
<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl">​</div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl">​</div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl">​</div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>
The data is here - 8359, 14825, 284692 along with the # of cases - 307,876 and # of deaths - 8,359. I am trying to extract these numbers for each day.
Code:
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"
url %>%
read_html() %>%
html_node(xpath = '//*[#id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>%
html_table(fill = TRUE)
You could use nth-child to target the various columns. To get the right number of rows in each column it is useful to use a css attribute selector with starts with operator to target the appropriate id attribute and substring of attribute value
library(rvest)
library(tidyverse)
library(stringr)
p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')
covid_info <- tibble(
dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
)%>%
mutate(
case_numbers = str_extract(gsub(',','',cases), '^.*(?=\\()' ) %>% as.integer(),
death_numbers = replace_na(str_extract(gsub(',','',deaths), '^.*(?=\\()' ) %>% as.integer(), NA_integer_)
)
print(covid_info)

How to get HTML element that is before a certain class?

I'm scraping and having trouble getting the element of the “th” tag that comes before the other “th” element that contains the “type2” class. I prefer to take it by identifying that it is the element "th" before the "th" with class "type2" because my HTML has a lot of "th" and that was the only difference I found between the tables.
Using rvest or xml2 (or other R package), can I get this parent?
The content which I want is "text_that_I_want".
Thank you!
<tr>
<th class="array">text_that_I_want</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string type2">name</th>
<th class="array type2">answers</th>
</tr>
</thead>
The formal and more generalizable way to navigate xpath relative to a given node is via ancestor preceding-sibling:
read_html(htmldoc) %>%
html_nodes(xpath = "//th[#class = 'string type2']/ancestor::td/preceding-sibling::th") %>%
html_text()
#> [1] "text_that_I_want"
We can look for the "type2" string in all <th>s, get the index of the first occurrence and substract 1 to get the index we want:
library(dplyr)
library(rvest)
location <- test%>%
html_nodes('th') %>%
str_detect("type2")
index_want <- min(which(location == TRUE) - 1)
test%>%
html_nodes('th') %>%
.[[index_want]] %>%
html_text()
[1] "text_that_I_want"

How to find and bold a series of four letters in an html table

I'm using the R programming language.
I'm hoping to find and make bold a series of four letters (amino acids, if you're curious) in a large html table of letters. I want to do this through html table navigation. If I were using regex on a normal string of letters, it would be "([KR].[ST][ILV])". This would find the letters RSSI or KATV, for instance. Unfortunately, the actual string I'm looking for would look something like this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>
The end result I want is this:
<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt><b>R</b></tt></td>
<td bgcolor=""><tt><b>S</b></tt></td>
<td bgcolor="pink"><tt><b>S</b></tt></td>
<td bgcolor=""><tt><b>I</b></tt></td>
I've written a monster-sized regex to find this sequence (attached below), but it doesn't seem to work. And I realize now that I should be using html commands, but I'm having a good deal of trouble finding websites that tell me how to search-and-replace. What should I be searching for? And/or how would I accomplish what I've described above?
This is my monster-sized regex to find the sequence I want, but it doesn't seem to work. I now realize, of course, that I was going at it from the wrong direction.
`regexp <- '(
[\\<<td bgcolor=""><tt>K</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>K</tt></td>\\>
\\<<td bgcolor=""><tt>R</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>R</tt></td>\\>]
[\\<<td bgcolor=""><tt>.</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>.</tt></td>\\>]
[\\<<td bgcolor=""><tt>S</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>S</tt></td>\\>
\\<<td bgcolor=""><tt>T</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>T</tt></td>\\>]
[\\<<td bgcolor=""><tt>I</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>I</tt></td>\\>
\\<<td bgcolor=""><tt>L</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>L</tt></td>\\>
\\<<td bgcolor=""><tt>V</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>V</tt></td>\\>])'
`
Maybe try this approach instead of regular expressions:
library(xml2)
library(tidyverse)
txt <- '<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>'
needles <- c("RSSI", "KMSV")
doc <- read_html(txt)
doc %>%
xml_find_all("//tr") %>%
keep(xml_text(.) %in% gsub("(.)", "\\1\n", needles)) %>%
xml_find_all("td/tt/text()") %>%
xml_add_parent("b")
write_html(doc, tf <- tempfile(fileext = ".html"))
shell.exec(tf) # open temp file on windows
This wraps each column text into <b>...</b> (and saves the result to a temporary file).
cat(as.character(doc))
# ...
# <center><table class="sequence-table">
# <tr><th align="left">
# </th></tr>
# <tr>
# <td bgcolor="lightgreen"><tt><b>R</b></tt></td>
# <td bgcolor=""><tt><b>S</b></tt></td>
# <td bgcolor="pink"><tt><b>S</b></tt></td>
# <td bgcolor=""><tt><b>I</b></tt></td>
# ...

R - How to extract items from XML Nodeset?

I have a list of 438 pitcher names that look like this (in XML Nodeset):
> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
Fernando Abad*
</td>
> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
Tim Adleman
</td>
How do I extract the names like Fernando Abad and the associated links like /players/a/abadfe01.shtml
Since you have a list, an apply function is used to walk through the list. Each function uses read_html to parse the hmtl fragment in the list using the CSS selector a to find the anchors (links). The names come from the html_text and the link is in the attribute href
library(rvest)
pitcherlinks <- list()
pitcherlinks[[1]] <-
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
Fernando Abad*
</td>'
pitcherlinks[[2]] <-
'<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
Tim Adleman
</td>'
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})
names
# [1] "Fernando Abad" "Tim Adleman"
links
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"

Regex \n doesn't work

I'm trying to parse text out of two lines of HTML.
Dim PattStats As New Regex("class=""head"">(.+?)</td>"+
"\n<td>(.+?)</td>")
Dim makor As MatchCollection = PattStats.Matches(page)
For Each MatchMak As Match In makor
ListView3.Items.Add(MatchMak.Groups(1).Value)
Next
I added the \n to match the next line, but for some reason it won't work. Here's the source I'm running the regex against.
<table class="table table-striped table-bordered table-condensed">
<tbody>
<tr>
<td class="head">Health Points:</td>
<td>445 (+85 / per level)</td>
<td class="head">Health Regen:</td>
<td>7.25</td>
</tr>
<tr>
<td class="head">Energy:</td>
<td>200</td>
<td class="head">Energy Regen:</td>
<td>50</td>
</tr>
<tr>
<td class="head">Damage:</td>
<td>53 (+3.2 / per level)</td>
<td class="head">Attack Speed:</td>
<td>0.694 (+3.1 / per level)</td>
</tr>
<tr>
<td class="head">Attack Range:</td>
<td>125</td>
<td class="head">Movement Speed:</td>
<td>325</td>
</tr>
<tr>
<td class="head">Armor:</td>
<td>16.5 (+3.5 / per level)</td>
<td class="head">Magic Resistance:</td>
<td>30 (+1.25 / per level)</td>
</tr>
<tr>
<td class="head">Influence Points (IP):</td>
<td>3150</td>
<td class="head">Riot Points (RP):</td>
<td>975</td>
</tr>
</tbody>
</table>
I'd like to match the first <td class...> and the following line in one regex :/
Description
This regex will find td tags and return them in groups of two.
<td\b[^>]*>([^<]*)<\/td>[^<]*<td\b[^>]*>([^<]*)<\/td>
Summary
<td\b[^>]*> find the first td tag and consume any attributes
([^<]*) capture the first inner text, this can be greedy but we assume the cell has no nested tags
<\/td> find the close tag
[^<]* move past all the rest of the text until you, this assumes there are no additional tags between the first and second td tag
<td\b[^>]*> find the second td tage and consume any attributes
([^<]*) capture the second inner text, this can be greedy but we assume the cell has no nested tags
<\/td> find the close tag
Groups
Group 0 will get the entire string
will have the first td group
will have the second td group
VB.NET Code Example:
Imports System.Text.RegularExpressions
Module Module1
Sub Main()
Dim sourcestring as String = "replace with your source string"
Dim re As Regex = New Regex("<td\b[^>]*>([^<]*)<\/td>[^<]*<td\b[^>]*>([^<]*)<\/td>",RegexOptions.IgnoreCase OR RegexOptions.Singleline)
Dim mc as MatchCollection = re.Matches(sourcestring)
Dim mIdx as Integer = 0
For each m as Match in mc
For groupIdx As Integer = 0 To m.Groups.Count - 1
Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames(groupIdx), m.Groups(groupIdx).Value)
Next
mIdx=mIdx+1
Next
End Sub
End Module
$matches Array:
(
[0] => Array
(
[0] => <td class="head">Health Points:</td>
<td>445 (+85 / per level)</td>
[1] => <td class="head">Health Regen:</td>
<td>7.25</td>
[2] => <td class="head">Energy:</td>
<td>200</td>
[3] => <td class="head">Energy Regen:</td>
<td>50</td>
[4] => <td class="head">Damage:</td>
<td>53 (+3.2 / per level)</td>
[5] => <td class="head">Attack Speed:</td>
<td>0.694 (+3.1 / per level)</td>
[6] => <td class="head">Attack Range:</td>
<td>125</td>
[7] => <td class="head">Movement Speed:</td>
<td>325</td>
[8] => <td class="head">Armor:</td>
<td>16.5 (+3.5 / per level)</td>
[9] => <td class="head">Magic Resistance:</td>
<td>30 (+1.25 / per level)</td>
[10] => <td class="head">Influence Points (IP):</td>
<td>3150</td>
[11] => <td class="head">Riot Points (RP):</td>
<td>975</td>
)
[1] => Array
(
[0] => Health Points:
[1] => Health Regen:
[2] => Energy:
[3] => Energy Regen:
[4] => Damage:
[5] => Attack Speed:
[6] => Attack Range:
[7] => Movement Speed:
[8] => Armor:
[9] => Magic Resistance:
[10] => Influence Points (IP):
[11] => Riot Points (RP):
)
[2] => Array
(
[0] => 445 (+85 / per level)
[1] => 7.25
[2] => 200
[3] => 50
[4] => 53 (+3.2 / per level)
[5] => 0.694 (+3.1 / per level)
[6] => 125
[7] => 325
[8] => 16.5 (+3.5 / per level)
[9] => 30 (+1.25 / per level)
[10] => 3150
[11] => 975
)
)
Disclaimer
Parsing html with a regex is really not the best solution as there a ton of edge cases what we can't predict. However in this case if input string is always this basic, and you're willing to accept the risk of the regex not working 100% of the time, then this solution would probably work for you.