Extracting html text using R - can't access some nodes

I have a large number of water take permits that are available online and I want to extract some data from them. For example
url <- "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1"
I don't know html at all, but have been plugging away with help from google and a friend. I can get to some of the nodes without any issues using the xpath or css selector, for instance to get to the title:
library(rvest)
url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="main"]/div/h1') %>%
html_text()
[1] "Details for CRC000002.1"
Or using the css selectors:
url %>%
read_html() %>%
html_nodes(css = "#main") %>%
html_nodes(css = "div") %>%
html_nodes(css = "h1") %>%
html_text()
[1] "Details for CRC000002.1"
So far, so good, but the information I actually want is buried a bit deeper and I can't seem to get to it. For instance, the client name field ("Killermont Station Limited", in this case) has this xpath:
clientxpath <- '//*[@id="main"]/div/div[1]/div/table/tbody/tr[1]/td[2]'
url %>%
read_html() %>%
html_nodes(xpath = clientxpath) %>%
html_text()
character(0)
The CSS selector route gets quite convoluted, but I get the same result. The help file for html_nodes() says:
# XPath selectors ---------------------------------------------
# chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root node
# regardless of where you currently are in the doc
But it doesn't give me any clues on what to use as an alternative prefix in the xpath (it might be obvious if I knew HTML).
My friend pointed out that some of the document is in javascript (ajax), which may be part of the problem too. That said, the bit I'm trying to get to above shows up in the html, but it is within a node called 'div.ajax-block'.
css selectors: #main > div > div.ajax-block > div > table > tbody > tr:nth-child(1) > td:nth-child(4)
Can anyone help? Thanks!

It's super disconcerting that most if not all SO R contributors default to "use a heavyweight third-party dependency" in curt "answers" when it comes to scraping. 99% of the time you don't need Selenium. You just need to exercise the little gray cells.
First, a big clue that the page loads content asynchronously is the wait-spinner that appears. The second is right in your snippet, where the div's class name contains ajax. Both are tell-tale signs that XHR requests are in play.
If you open Developer Tools in your browser, reload the page, and look at the XHR tab of the Network panel, you'll see that most of the "real" data on the page is loaded dynamically. We can write httr calls that mimic the browser's calls.
However…
We first need to make one GET call to the main page to prime some cookies (which will be carried over for us) and then find a pre-generated session token that's used to prevent abuse of the site. It's defined in JavaScript, so we'll use the V8 package to evaluate it. We could have just used regular expressions to find the string. Do whatever you like.
library(httr)
library(rvest)
library(dplyr)
library(V8)
ctx <- v8() # we need this to eval some javascript
# Prime Cookies -----------------------------------------------------------
res <- httr::GET("https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1")
httr::cookies(res)
## domain flag path secure expiration name
## 1 .ecan.govt.nz TRUE / FALSE 2019-11-24 11:46:13 visid_incap_927063
## 2 .ecan.govt.nz TRUE / FALSE <NA> incap_ses_148_927063
## value
## 1 +p8XAM6uReGmEnVIdnaxoxWL+VsAAAAAQUIPAAAAAABjdOjQDbXt7PG3tpBpELha
## 2 nXJSYz8zbCRj8tGhzNANAhaL+VsAAAAA7JyOH7Gu4qeIb6KKk/iSYQ==
pg <- httr::content(res)
html_node(pg, xpath=".//script[contains(., '_monsido')]") %>%
html_text() %>%
ctx$eval()
## [1] "2"
monsido_token <- ctx$get("_monsido")[1,2]
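For reference, the "skip V8" route would look roughly like this. The pattern below is an assumption about how the token appears inside that script block as a quoted string; inspect the actual script text and adjust it to what you see.
# Hypothetical regex alternative to V8 -- the '"token", "..."' pattern is a
# guess about the script's structure, so tweak it to match the real text.
script_txt <- html_node(pg, xpath = ".//script[contains(., '_monsido')]") %>%
  html_text()
m <- regmatches(script_txt, regexec('"token",\\s*"([^"]+)"', script_txt, perl = TRUE))[[1]]
monsido_token_alt <- if (length(m) == 2) m[2] else NA_character_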
Here's the searchlist (which is, indeed, empty):
httr::VERB(
verb = "POST", url = "https://www.ecan.govt.nz/data/document-library/searchlist",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
`X-Requested-With` = "XMLHttpRequest",
TE = "Trailers"
), httr::set_cookies(
monsido = monsido_token
),
body = list(
name = "CRC000002.1",
pageSize = "999999"
),
encode = "form"
) -> res
httr::content(res)
## NULL ## <<=== this is OK as there is no response
Here's the "Consent Overview" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentoverview/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
html_table() %>%
glimpse()
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ X1: chr [1:5] "RMA Authorisation Number" "Consent Location" "To" "Commencement Date" ...
## ..$ X2: chr [1:5] "CRC000002.1" "Manuka Creek, KILLERMONT STATION" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X3: chr [1:5] "Client Name" "State" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X4: chr [1:5] "Killermont Station Limited" "Issued - Active" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
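If you want that overview as a tidy field/value table, here's one option. It's a sketch that assumes the four-column name/value/name/value layout shown above; rows where the right-hand pair merely repeats the left can be dropped afterwards.
overview <- httr::content(res) %>% html_table() %>% .[[1]]
dplyr::bind_rows(
  setNames(overview[, c("X1", "X2")], c("field", "value")),  # left name/value pair
  setNames(overview[, c("X3", "X4")], c("field", "value"))   # right name/value pair
)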
Here are the "Consent Conditions":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentconditions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="consentDetails">
## <ul class="unstyled-list">
## <li>
##
##
## <strong class="pull-left">1</strong> <div class="pad-left1">The rate at which wa
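A sketch for pulling the numbered conditions out of that fragment. The selectors are inferred from the truncated snippet above, and the two node sets are assumed to line up one-to-one, so verify both against the full response.
cond_pg <- httr::content(res)
conditions <- data.frame(
  number = html_nodes(cond_pg, "li strong.pull-left") %>% html_text(trim = TRUE),
  text   = html_nodes(cond_pg, "li div.pad-left1") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)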
Here's the "Consent Related":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentrelated/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body>
## <p>There are no related documents.</p>
##
##
##
##
##
## <div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead><tr>
## <th>Relationship</th>
## <th>Recor
Here's the "Workflow:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentworkflow/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res)
## {xml_document}
## <html>
## [1] <body><p>No workflow</p></body>
Here are the "Consent Flow Restrictions":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentflowrestrictions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead>
## <th colspan="2">Low Flow Site</th>
## <th>Todays Flow <span class="lower">(m3/s)</span>
## </th>
You still need to parse HTML but now you can do it all with just plain R packages.
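Since the five section endpoints above differ only in one path segment, you can fold the repeated GET into a small helper once the cookies and monsido_token have been primed as above. This is an untested sketch, only modeled on this one consent number:
get_consent_section <- function(section, consent_id, token) {
  httr::GET(
    url = sprintf("https://www.ecan.govt.nz/data/consent-search/%s/%s", section, consent_id),
    httr::add_headers(
      Referer = sprintf("https://www.ecan.govt.nz/data/consent-search/consentdetails/%s", consent_id),
      Authority = "www.ecan.govt.nz",
      `X-Requested-With` = "XMLHttpRequest"
    ),
    httr::set_cookies(monsido = token)
  ) %>%
    httr::content()
}

sections <- c("consentoverview", "consentconditions", "consentrelated",
              "consentworkflow", "consentflowrestrictions")

consent_html <- lapply(sections, function(s) {
  Sys.sleep(1)  # be kind to the server if you loop over many consents
  get_consent_section(s, "CRC000002.1", monsido_token)
})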

Related

R Webscraping RCurl and httr Content

I'm learning a bit about web scraping and have a question about two packages (httr and RCurl). I'm trying to get a journal's ISSN from the ResearchGate website and ran into a situation: when extracting the page content, I get the ISSN with RCurl, but with httr my function returns NULL. Could anyone tell me why? In my opinion, both approaches should work. My code is below.
library(rvest)
library(httr)
library(RCurl)
library(stringr) # str_to_title() and str_split() below come from stringr
url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"
########
# httr #
########
conexao <- GET(url)
conexao_status <- http_status(conexao)
conexao_status
content(conexao, as = "text", encoding = "utf-8") %>% read_html() -> webpage1
ISSN <- webpage1 %>%
html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
html_text %>%
str_to_title() %>%
str_split(" ") %>%
unlist
ISSN
########
# RCurl #
########
options(RCurlOptions = list(verbose = FALSE,
capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
ssl.verifypeer = FALSE))
webpage <- getURLContent(url) %>% read_html()
ISSN <- webpage %>%
html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
html_text %>%
str_to_title() %>%
str_split(" ") %>%
unlist
ISSN
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testit_0.7      dplyr_0.7.4     progress_1.1.2  readxl_1.1.0    stringr_1.3.0   RCurl_1.95-4.10 bitops_1.0-6
[8] httr_1.3.1      rvest_0.3.2     xml2_1.2.0      jsonlite_1.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16      bindr_0.1.1       magrittr_1.5      R6_2.2.2          rlang_0.2.0       tools_3.5.0
 [7] yaml_2.1.19       assertthat_0.2.0  tibble_1.4.2      bindrcpp_0.2.2    curl_3.2          glue_1.2.0
[13] stringi_1.1.7     pillar_1.2.2      compiler_3.5.0    cellranger_1.1.0  prettyunits_1.0.2 pkgconfig_2.0.1
Because the content type is JSON and not HTML, you can't use read_html() on it:
> conexao
Response [https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics]
Date: 2018-06-02 03:15
Status: 200
Content-Type: application/json; charset=utf-8
Size: 328 kB
Use fromJSON() instead to extract issn:
library(jsonlite)
result <- fromJSON(content(conexao, as = "text", encoding = "utf-8") )
result$result$data$journalFullInfo$data$issn
result:
> result$result$data$journalFullInfo$data$issn
[1] "0730-0301"

Read HTML Table Into Data Frame with Hyperlinks in R

I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.
The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.
The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.
Here's what I have so far:
library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
tableNodes <- getNodeSet( page, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )
However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?
I find using the rvest package easier than XML.
Here is a solution to obtain a list of the links:
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
library(rvest)
page<-read_html(webURL)
links<-page %>% html_nodes("a") %>% html_attr("href")
This code lets you target either the XML files or the CSV files, and it gives you the filename as well as the URL, so you can iterate over both and save the downloads with names you'll recognize later on.
library(rvest)
library(dplyr)
pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")
csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")
data_frame(
fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df
glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...
xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")
data_frame(
fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df
glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...
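From there it's just a download loop. A sketch: the site root is prefixed manually here since the hrefs are relative, and the files land in the working directory.
base_url <- "http://mis.ercot.com"
for (i in seq_len(nrow(csv_df))) {
  download.file(paste0(base_url, csv_df$url[i]),
                destfile = csv_df$fil_name[i], mode = "wb")
  Sys.sleep(1)  # be polite to the server
}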

With rvest, how to extract html contents from the object returned by submit_form()

I am trying to download some traffic data from pems.dot.ca.gov, following this topic.
rm(list=ls())
library(rvest)
library(xml2)
library(httr)
url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"
pgsession <- html_session(url)
pgform <-html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
'username' = 'omitted',
'password' = 'omitted')
resp = submit_form(pgsession, filled_form)
resp_2 = resp$response
cont = resp_2$content
I checked the class() of these items and found that resp is a "session", resp_2 is a "response", and cont is "raw". My question is: how can I extract the HTML content correctly so that I can proceed with XPath to pick out the actual data I want from this page? My intuition is that I should parse resp_2, which is a response, but I just cannot make it work. Your help is highly appreciated!
This should do it:
pg <- content(resp$response)
html_nodes(pg, "table.inlayTable") %>%
html_table() -> tab
head(tab[[1]])
## X1 X2 X3 X4
## 1 Data Quality Data Quality
## 2 Hour 8 Lanes % Observed % Estimated
## 3 05/24/2013 00:00 1,311 50 0
## 4 05/24/2013 01:00 729 50 0
## 5 05/24/2013 02:00 399 50 0
## 6 05/24/2013 03:00 487 50 0
(you'll obviously need to modify the column names)
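One way to do that cleanup, using the header rows visible in the printout above (a sketch; adjust if the full table differs):
df <- tab[[1]]
names(df) <- c("Hour", "8 Lanes", "% Observed", "% Estimated")
df <- df[-(1:2), ]  # drop the two embedded header rows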
You need httr::content(), which extracts and parses the body of a response; in this case that body is HTML, which rvest can easily parse:
resp_2 %>% content()
## {xml_document}
## <html style="height: 100%">
## [1] <head>\n <!-- public -->\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/ ## ...
## [2] <body class="yui-skin-sam public">\n <div id="maincontainer" style="height: 100%">\n\n \n\ ## ...

When scraping with rvest expected html_node not appearing

The ITTO website produces a table of timber products and flows directly under the search form once the query is submitted (on the same page). Based on what I got from Chrome's SelectorGadget, I expect the table cells to show up as "td" elements. Using rvest to scrape information on Albania for 2014...
library(rvest)
session <- html_session("http://www.itto.int/annual_review_output/?mode=searchdata")
form <- html_form(session)[[2]]
form <- set_values(form, "countries[]" = "8", "products[]" = "1" ,"flows[]" = "1", "years[]" = "2014")
query <- submit_form(session, form, submit = NULL)
page <- read_html(query) %>% html_nodes("td")
page
Which shows that the expected "td" nodes are absent:
{xml_nodeset (0)}
Examining other elements of the page with html_nodes() suggests that submit_form() performed otherwise as expected.
So my question is where is the expected table?
It might be easier (in the long run) to scrape the select box options and just feed the POST call directly:
library(httr)
library(rvest)
res <- POST(url = "http://www.itto.int/annual_review_output/?mode=searchdata",
body = list(`countries[]` = "76",
`products[]` = "1", `flows[]` = "1",
`years[]` = "2014"),
encode = "form")
pg <- content(res, as="parsed")
html_nodes(pg, "td")
## {xml_nodeset (7)}
## [1] <td>Brazil</td>
## [2] <td>Ind. roundwood</td>
## [3] <td>Exports Quantity</td>
## [4] <td>1000 m3</td>
## [5] <td>2014</td>
## [6] <td style="text-align:right;">204.59</td>
## [7] <td>I</td>
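A hedged sketch for harvesting the select-box options so you can map the country (or product/flow/year) labels to the codes the form expects; the field name is taken from the form values used above:
form_pg <- read_html("http://www.itto.int/annual_review_output/?mode=searchdata")
country_opts <- html_nodes(form_pg, xpath = ".//select[@name='countries[]']/option")
countries <- data.frame(
  name = html_text(country_opts, trim = TRUE),
  code = html_attr(country_opts, "value"),
  stringsAsFactors = FALSE
)
head(countries)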

web scraping html in R

I want to get the URL list from scraping http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm like this:
[1] "P-Obama-Inaugural-Speech-Inauguration.htm"
[2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
and this is my code:
library(XML)
url = "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc = htmlTreeParse(url, useInternalNodes = T)
url.list = xpathSApply(doc, "//a[contains(@href, 'htm')]")
The problem is that I want to unlist() url.list so I can strsplit() it, but it doesn't unlist.
One more step ought to do it (just need to get the href attribute):
library(XML)
url <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)
url.list <- xpathSApply(doc, "//a[contains(@href, 'htm')]")
hrefs <- gsub("^/", "", sapply(url.list, xmlGetAttr, "href"))
head(hrefs, 6)
## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"
## [2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [3] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [4] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"
## [5] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [6] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
free(doc)
UPDATE: Obligatory rvest + dplyr way:
library(rvest)
library(dplyr)
speeches <- html("http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm")
speeches %>% html_nodes("a[href*=htm]") %>% html_attr("href") %>% head(6)
## same output as above
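Those hrefs are relative; if you want full URLs, xml2::url_absolute() resolves them against the page's base URL (assumed here to be the site root):
library(xml2)
speeches %>%
  html_nodes("a[href*=htm]") %>%
  html_attr("href") %>%
  url_absolute("http://obamaspeeches.com/") %>%
  head(6)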