Extracting href attr or converting node to character list - html

I am trying to extract some information from this website:
library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
html <- html(url)
nodes <- html_nodes(html, ".listItemSolr")
nodes
I get "list" of 30 parts of HTML code. I want from each element of the "list" extract last href attribute, so for the 30. element it would be
<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobotę prezentacja hasła i programu wyborczego Komorowskiego">
so I want to get the string
"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"
The problem is that html_attr(nodes, "href") doesn't work (I get a vector of NAs). So I thought about a regex, but the problem is that nodes isn't a character list.
class(nodes)
[1] "XMLNodeSet"
I tried
xmlToList(nodes)
but it doesn't work either.
So my question is: how can I extract this URL with some function designed for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a character list?

Try searching inside nodes' children:
nodes <- html_nodes(html, ".listItemSolr")
sapply(html_children(nodes), function(x){
  html_attr(x$a, "href")
})
Update
Hadley suggested using elegant pipes:
html %>%
  html_nodes(".listItemSolr") %>%
  html_nodes(xpath = "./a") %>%
  html_attr("href")

The XML package's getHTMLLinks() function can do virtually all the work for us; we just have to write the XPath query. Here we query all node attributes to determine whether any contains "listItemSolr", then select the parent node for the href query.
getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")
In xpQuery we are doing the following:
//@*[contains(., 'listItemSolr')] query all node attributes for listItemSolr
/.. select the parent node
/a/@href get the href links
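If you specifically want the last <a> anywhere inside each .listItemSolr item (rather than the direct-child anchors), a small rvest variation with an XPath last() predicate should also work. This is only a sketch: it uses read_html(), which replaced the old html() call shown in the question, and the original page may no longer exist, so it is untested against it.
library(rvest)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
page <- read_html(url)   # read_html() supersedes the html() call from the question
page %>%
  html_nodes(".listItemSolr") %>%
  html_node(xpath = "(.//a)[last()]") %>%   # last <a> descendant of each item
  html_attr("href")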

Related

Extracting innerHTML using rvest

I would like to extract the html content of a tag in R. For instance, in the following HTML,
<html><body>Hi <b>name</b></body></html>
suppose I'd like to extract the content of the <body> tag, which would be:
Hi <b>name</b>
In this question, the answer (using as.character()) includes the enclosing tag, which is not what I want, e.g.
library(rvest)
html = '<html><body>Hi <b>name</b></body></html>'
read_html(html) |>
  html_element('body') |>
  as.character()
returns outerHTML:
[1] "<body>Hi <b>name</b>\n</body>"
...but I want the innerHTML. How can I get the content of a HTML tag in R, without the enclosing tag?
I could not find an inbuilt function to do this so here's a custom one.
library(rvest)
html = '<html><body>Hi <b>name</b></body></html>'
turn_to_character_keeping_inner_tags <- function(x, tag) {
  gsub(sprintf('^<%s.*?>|</%s>$', tag, tag), '', as.character(x))
}
read_html(html) |>
  html_element('body') |>
  turn_to_character_keeping_inner_tags('body')
[1] "Hi <b>name</b>\n"

How to extract a table, convert it to a data frame, write it as a CSV file, and deal with child tables?

I can't write the file as a CSV file; there is an error. I want to extract bike sharing system data from a Wiki page and convert the data to a data frame, but when I use the head() or str() functions to inspect the result, I can't tell which table I need because the output contains so many unorganized details.
Also note that this HTML page contains at least three table child nodes under the root HTML node, so you will need to use the html_nodes(root_node, "table") function to get all of them:
<html>
<table>(table1)</table>
<table>(table2)</table>
<table>(table3)</table>
...
</html>
url<- "https://en.wikipedia.org/wiki/List_of_bicycle-sharing_systems"
root_node<-read_html(url)
table_nodes <- html_nodes(root_node,"table")
Bicycle_sharing <- html_table(table_nodes, fill = TRUE )
head(Bicycle_sharing)
summary(Bicycle_sharing)
str(Bicycle_sharing)
## Exporting the date frame as csv. file.
write.csv(mtcars,"raw_bike_sharing_systems.csv",row.names = FALSE)
library(tidyverse)
library(rvest)
data <- "https://en.wikipedia.org/wiki/List_of_bicycle-sharing_systems" %>%
read_html() %>%
html_table() %>%
getElement(2) %>%
janitor::clean_names()
data %>%
write_csv(file = "bike_sharing.csv")
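If you prefer to stay closer to the base-R code from the question, the key fix is to write out the table you extracted rather than mtcars. A minimal sketch follows; the [[1]] index is an assumption, so inspect the html_table() list to find the table you actually want:
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_bicycle-sharing_systems"
root_node <- read_html(url)
table_nodes <- html_nodes(root_node, "table")
Bicycle_sharing <- html_table(table_nodes, fill = TRUE)
# Pick one table from the list (index 1 is a guess -- check str(Bicycle_sharing)),
# coerce it to a plain data frame, and write that instead of mtcars.
bike_df <- as.data.frame(Bicycle_sharing[[1]])
write.csv(bike_df, "raw_bike_sharing_systems.csv", row.names = FALSE)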

Create a loop within a function so that URLs return a dataframe

I was provided with a list of identifiers (in this case the identifier is called an NPI). These identifiers can be copied and pasted to this website (https://npiregistry.cms.hhs.gov/registry/?). I want to return the name of the NPI number, name of the physician, address, phone number, and specialty.
I have over 3,000 identifiers so a copy and paste is not efficient and not easily repeatable for future use.
If possible, I would like to create a list of URLs, pass them into a function, and receive a dataframe that provides me with the variables mentioned above (NPI, NAME, ADDRESS, PHONE, SPECIALTY).
I was able to write a function that produces the URLs needed:
Here are some NPI numbers for reference: 1417024746, 1386790517, 1518101096, 1255500625.
This is my code for reading in the file that contains my NPIs
npiList <- c("1417024746", "1386790517", "1518101096", "1255500625")
npiList <- as.list(npiList)
npiList <- unlist(npiList, use.names = FALSE)
This is the function to return the list of URLs:
npiaddress <- function(x){
  url <- paste("https://npiregistry.cms.hhs.gov/registry/search-results-table?number=",
               x, "&addressType=ANY", sep = "")
  return(url)
}
I saved the list to a variable and perhaps this is my downfall:
npi_urls <- npiaddress(npiList)
From here I wrote a function that can accept a single URL, retrieves the data I want and turns it into a dataframe. My issue is that I cannot pass multiple URLs:
npiLookup <- function(x){
  url <- x
  webpage <- read_html(url)
  npi_html <- html_nodes(webpage, "td")
  npi <- html_text(npi_html)
  npi[4] <- gsub("\r?\n|\r", " ", npi[4])
  npi[4] <- gsub("\r?\t|\r", " ", npi[4])
  npiFinal <- npi[c(1:2, 4:6)]
  npiFinal <- as.data.frame(npiFinal)
  npiFinal <- t(npiFinal)
  npiFinal <- as.data.frame(npiFinal)
  names(npiFinal) <- c("NPI", "NAME", "ADDRESS", "PHONE", "SPECIALTY")
  return(npiFinal)
}
For example:
If I want to get a dataframe for the following identifier (1417024746), I can run this and it works:
x <- npiLookup("https://npiregistry.cms.hhs.gov/registry/search-results-table?number=1417024746&addressType=ANY")
View(x)
My output for the example returns the NPI, NAME, ADDRESS, PHONE, SPECIALTY as desired, but again, I need to do this for several thousand NPI identifiers. I feel like I need a loop within npiLookup. I've also tried to put npi_urls into the npiLookup function but it does not work.
Thank you for any help and for taking the time to read.
You're most of the way there. The final step uses this useful R idiom:
do.call(rbind, lapply(npiList, function(npi) {
  url <- npiaddress(npi)
  npiLookup(url)
}))
do.call is a base R function that applies a function (in this case rbind) to the list produced by lapply. That list is the result of running your npiLookup function on the url produced by your npiaddress for each element of npiList.
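An equivalent one-liner with purrr, if you prefer the tidyverse style (this assumes purrr and dplyr are installed; it is a restatement of the same idea, not a different method):
library(purrr)
# map_dfr() runs npiLookup() on each generated URL and row-binds the results.
npi_df <- map_dfr(npiList, function(npi) npiLookup(npiaddress(npi)))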
A few further comments for future reference should anyone else come upon this question: (1) I don't know why you're doing the as.list, unlist sequence at the beginning; it's redundant and probably unnecessary. (2) The NPI registry provides a programming interface (API) that avoids the need to scrape data from the HTML pages; this might be more robust in the long run. (3) The NPI registry provides the entire dataset as a downloadable file; this might have been an easier way to go.
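To illustrate comment (2), here is a rough sketch of querying the registry's JSON API with jsonlite instead of scraping the HTML table. The endpoint and parameter names are assumptions on my part, so check the current NPPES API documentation before relying on them:
library(jsonlite)
# Hypothetical helper: query the NPI registry API for one identifier (endpoint assumed).
npi_api_lookup <- function(npi) {
  url <- paste0("https://npiregistry.cms.hhs.gov/api/?version=2.1&number=", npi)
  fromJSON(url)   # returns a nested list with the provider's details
}
res <- npi_api_lookup("1417024746")
str(res, max.level = 2)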

R loop with html_nodes (rvest) isn't catching all data

I would like to make a loop with html_nodes to capture the value of some nodes (the nodes themselves, not their text). I have these values:
library(rvest)
country <- c("Canada", "US", "Japan", "China")
With those values ("Canada", "US", ...), I've written a loop which creates a URL by pasting each value onto "https://en.wikipedia.org/wiki/", then applies read_html() to each URL and a sequence of calls to finally catch a node with html_nodes('a.page-link') (yes, a node, not text) and save that html_nodes(...) result as.character in a data.frame (or it could be a list).
dff <- NULL
for (i in country) {
  url <- paste0("https://en.wikipedia.org/wiki/", i)
  page <- read_html(url)
  b <- page %>%
    html_nodes('h2.flow-title') %>%
    html_nodes('a.page-link') %>%
    as.character()
  dff <- data.frame(b)
}
The problem is that this code only saves the data from the last country: it runs the first country and obtains its html_nodes (saving them), but when it runs the next country the first data is erased and replaced by the new data, and so on, so the final result contains only the data from the last country.
I would be grateful for your help!
As the comment mentioned, the line dff <- data.frame(b) is overwriting dff on each loop iteration. One solution is to create an empty list and append the data to it.
In this example the list items are named for the country queried.
library(rvest)
country <- c("Canada", "US", "Japan", "China")
# initialize the empty list
dff <- list()
for (i in country) {
  url <- paste0("https://en.wikipedia.org/wiki/", i)
  page <- read_html(url)
  b <- page %>%
    html_nodes('h2.flow-title') %>%
    html_nodes('a.page-link') %>%
    as.character()
  # append new data onto the list
  dff[[i]] <- data.frame(b)
}
To access the data, one can use dff$Canada, or lapply to process the entire list.
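For example, here is a small sketch that stacks the per-country results into a single data frame, tagging each row with its country (it assumes each list element is a data frame as built above; empty results are skipped):
all_links <- do.call(rbind, lapply(names(dff), function(cn) {
  if (nrow(dff[[cn]]) == 0) return(NULL)   # skip countries with no matches
  cbind(country = cn, dff[[cn]])
}))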
Note: I ran your example and it returned no results; better double-check the node IDs.

Download hidden json array in HTML using R

I'm trying to scrape data from Transfermarkt using mainly the XML and httr packages.
page.doc <- content(GET("http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889"))
After downloading, there is a hidden array named 'series':
'series':[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'verein':'CF América','age':21,'mw':'600 miles €','datum_mw':'02/12/2011','x':1322780400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/3631.png?lm=1403472558)'}},{'y':850000,'verein':'Jaguares de Chiapas','age':21,'mw':'850 miles €','datum_mw':'02/06/2012','x':1338588000000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'03/12/2012','x':1354489200000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'29/05/2013','x':1369778400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1250000,'verein':'Querétaro FC','age':23,'mw':'1,25 mill. €','datum_mw':'27/12/2013','x':1388098800000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1500000,'verein':'Querétaro FC','age':24,'mw':'1,50 mill. €','datum_mw':'01/09/2014','x':1409522400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1800000,'verein':'Querétaro FC','age':25,'mw':'1,80 mill. €','datum_mw':'01/10/2015','x':1443650400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}}]}]
Is there a way to download this array directly? I want to scrape 600+ pages.
So far, I have tried:
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']")
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']", xmlAttrs)
No, there is no way to download just the JSON data: the JSON array you’re interested in is embedded inside the page’s source code, as part of a script.
You can use conventional XPath or CSS selectors to find the script elements. However, finding and extracting just the JSON part is harder without a library that evaluates the JavaScript code. A better option would definitely be to use an official API, should one exist.
library(rvest) # Better suited for web scraping than httr & XML.
library(rjson)
doc = read_html('http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889')
script = doc %>%
  html_nodes('script') %>%
  html_text() %>%
  grep(pattern = "'series':", value = TRUE)
# Replace JavaScript quotes with JSON quotes
json_content = gsub("'", '"', gsub("^.*'series':", '', script))
# Truncate characters from the end until the result is parseable as valid JSON …
while (nchar(json_content) > 0) {
  json = try(fromJSON(json_content), silent = TRUE)
  if (! inherits(json, 'try-error'))
    break
  json_content = substr(json_content, 1, nchar(json_content) - 1)
}
However, there’s no guarantee that the above will always work: it is JavaScript after all, not JSON; the two are similar but not every valid JavaScript array is valid JSON.
It could be possible to evaluate the JavaScript fragment instead but that gets much more complicated. As a start, take a look at the V8 interface for R.
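To make that last suggestion concrete, here is a minimal, self-contained V8 sketch on a hard-coded stand-in fragment (not the live page): a JavaScript engine happily evaluates the single-quoted literal that fromJSON() would reject, and ctx$get() hands the result back as R data.
library(V8)
# A tiny stand-in for the kind of array embedded in the page's script tag.
fragment <- "[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'age':21}]}]"
ctx <- v8()
ctx$eval(paste0("var series = ", fragment, ";"))
series <- ctx$get("series")   # comes back as nested R lists / data frames
str(series)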