Extracting innerHTML using rvest

I would like to extract the HTML content of a tag in R. For instance, in the following HTML,
<html><body>Hi <b>name</b></body></html>
suppose I'd like to extract the content of the <body> tag, which would be:
Hi <b>name</b>
In another question, the answer (using as.character()) includes the enclosing tag, which is not what I want. E.g.,
library(rvest)
html = '<html><body>Hi <b>name</b></body></html>'
read_html(html) |>
  html_element('body') |>
  as.character()
returns outerHTML:
[1] "<body>Hi <b>name</b>\n</body>"
...but I want the innerHTML. How can I get the content of an HTML tag in R without the enclosing tag?

I could not find a built-in function to do this, so here's a custom one.
library(rvest)
html = '<html><body>Hi <b>name</b></body></html>'
turn_to_character_keeping_inner_tags <- function(x, tag) {
  # strip the opening <tag ...> and the closing </tag> from the serialized node
  gsub(sprintf('^<%s.*?>|</%s>$', tag, tag), '', as.character(x))
}
read_html(html) |>
  html_element('body') |>
  turn_to_character_keeping_inner_tags('body')
[1] "Hi <b>name</b>\n"

Related

Scrape all div tag ids (not their values) with a similar format

I have an internal company HTML webpage with div tags in the following format:
<div id="B4_6_2019">
<div id="B3_6_2019">
I would like to extract all the id names so the end result would be
B4_6_2019
B3_6_2019
How would I do that? (the id names are all dates)
Try doing:
library(dplyr)
library(rvest)
url %>%               # url holds the page address (assumed already defined)
  read_html() %>%
  html_nodes("div") %>%
  html_attr("id") %>%
  grep("^B\\d+_\\d+_\\d+", ., value = TRUE)
You can also try an attribute = value CSS selector with the 'ends with' operator ($=) to substring-match the end of the id value:
library(rvest)
page <- read_html("url")
id <- page %>%
  html_nodes("[id$='_2019']") %>%
  html_attr("id")
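If instead the target ids share a common prefix (here they all start with "B"), the analogous 'starts with' operator works the same way; a small sketch under that assumption:
ids <- page %>%
  html_nodes("[id^='B']") %>%
  html_attr("id")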

R - have xtable interpret HTML tags within strings as HTML tags rather than literals

Assuming I don't want to or cannot modify the stylesheet or the HTML internally, how can I make xtable interpret a section of a string in R as an HTML tag rather than a literal? For example, I have:
df <- as.data.frame(c("<b>Foo</b>", "Bar", "Box"), byrow = TRUE)
library(xtable)
print(xtable(df), type = "html", include.rownames = FALSE)
I want "Foo" to be bold. Nevertheless, when xtable creates the table, it prints "<b>Foo</b>" (i.e. it interprets the string literally) rather than "Foo". Is there an option or workaround to custom-defining a tag within a string and ensuring that it is interpreted as a tag?
I'm just going to post an answer to my own question, as, after a bit of fiddling around, I do have a workable solution.
df <- as.data.frame(c("<b>Foo</b>", "Bar", "Box"), byrow = TRUE)
library(xtable)
print(xtable(df), type = "html", include.rownames = FALSE,
      sanitize.text.function = function(x) {x})
This works, but it can have unintended consequences, since you are overriding the default sanitize.text.function. It seems you cannot apply sanitize.text.function = function(x) {x} to just one part of the table; it has to apply to the whole table. It works for something simple like this, but it might not work for everything.
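One possible middle ground, sketched here as my own assumption rather than taken from the answer: pass a sanitize.text.function that escapes everything except a whitelisted <b> tag, so stray angle brackets elsewhere in the table stay safe:
library(xtable)
keep_b_tags <- function(x) {
  x <- gsub("&", "&amp;", x)
  x <- gsub("<(?!/?b>)", "&lt;", x, perl = TRUE)    # escape < unless it opens <b> or </b>
  gsub("(?<!<b)(?<!</b)>", "&gt;", x, perl = TRUE)  # escape > unless it closes <b> or </b>
}
df <- data.frame(x = c("<b>Foo</b>", "Bar", "a < b"))
print(xtable(df), type = "html", include.rownames = FALSE,
      sanitize.text.function = keep_b_tags)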

Crawl data from an "angular.callbacks" web page

I want to use R to crawl the news from this URL (http://www.foxnews.com/search-results/search?q="AlphaGo"&ss=fn&start=0). Here is my code:
url <- "http://api.foxnews.com/v1/content/search?q=%22AlphaGo%22&fields=date,description,title,url,image,type,taxonomy&section.path=fnc&start=0&callback=angular.callbacks._0&cb=2017719162"
html <- str_c(readLines(url,encoding = "UTF-8"),collapse = "")
content_fox <- RJSONIO:: fromJSON(html)
However, the JSON could not be parsed, and this error showed up:
Error in file(con, "r") : cannot open the connection
I notice that the JSON starts with angular.callbacks._0, which I think might be the problem.
Any idea how to fix this?
Following the answer in Parse JSONP with R, I adjusted my code with two new lines and it worked:
url <- "http://api.foxnews.com/v1/content/search?q=%22AlphaGo%22&fields=date,description,title,url,image,type,taxonomy&section.path=fnc&start=0&callback=angular.callbacks._0&cb=2017719162"
html <- str_c(readLines(url,encoding = "UTF-8"),collapse = "")
html <- sub('[^\\{]*', '', html) # remove function name and opening parenthesis
html <- sub('\\)$', '', html) # remove closing parenthesis
content_fox <- RJSONIO:: fromJSON(html)
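For what it's worth, the same stripping works with jsonlite (my substitution; any JSON parser is fine once the JSONP padding is gone):
library(jsonlite)
txt <- paste(readLines(url, encoding = "UTF-8"), collapse = "")
txt <- sub("^[^(]*\\(", "", txt)      # drop the "angular.callbacks._0(" prefix
txt <- sub("\\)\\s*;?\\s*$", "", txt) # drop the trailing ")"
content_fox <- fromJSON(txt)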

Download hidden json array in HTML using R

I'm trying to scrape data from transfermarkt using mainly the XML and httr packages.
library(httr)
library(XML)
page.doc <- content(GET("http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889"))
After downloading, there is a hidden array named 'series':
'series':[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'verein':'CF América','age':21,'mw':'600 miles €','datum_mw':'02/12/2011','x':1322780400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/3631.png?lm=1403472558)'}},{'y':850000,'verein':'Jaguares de Chiapas','age':21,'mw':'850 miles €','datum_mw':'02/06/2012','x':1338588000000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'03/12/2012','x':1354489200000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'29/05/2013','x':1369778400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1250000,'verein':'Querétaro FC','age':23,'mw':'1,25 mill. €','datum_mw':'27/12/2013','x':1388098800000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1500000,'verein':'Querétaro FC','age':24,'mw':'1,50 mill. €','datum_mw':'01/09/2014','x':1409522400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1800000,'verein':'Querétaro FC','age':25,'mw':'1,80 mill. €','datum_mw':'01/10/2015','x':1443650400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}}]}]
Is there a way to download it directly? I want to scrape 600+ pages.
So far, I have tried:
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']")
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']", xmlAttrs)
No, there is no way to download just the JSON data: the JSON array you’re interested in is embedded inside the page’s source code, as part of a script.
You can then use conventional XPath or CSS selectors to find the script elements. However, finding and extracting just the JSON part is harder without a library that evaluates the JavaScript code. A better option would definitely be to use an official API, should one exist.
library(rvest) # Better suited for web scraping than httr & XML.
library(rjson)
doc = read_html('http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889')
script = doc %>%
  html_nodes('script') %>%
  html_text() %>%
  grep(pattern = "'series':", value = TRUE)
# Replace JavaScript quotes with JSON quotes
json_content = gsub("'", '"', gsub("^.*'series':", '', script))
# Truncate characters from the end until the result parses as valid JSON …
while (nchar(json_content) > 0) {
  json = try(fromJSON(json_content), silent = TRUE)
  if (! inherits(json, 'try-error'))
    break
  json_content = substr(json_content, 1, nchar(json_content) - 1)
}
However, there’s no guarantee that the above will always work: it is JavaScript after all, not JSON; the two are similar but not every valid JavaScript array is valid JSON.
It could be possible to evaluate the JavaScript fragment instead but that gets much more complicated. As a start, take a look at the V8 interface for R.
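As a rough illustration (my own sketch, not part of the original answer), the V8 package can evaluate the single-quoted JavaScript array literal directly, which makes the quote rewriting above unnecessary. It assumes js_fragment is a hypothetical variable holding the raw text after 'series':, trimmed to the array with the same truncation trick (catching V8 errors instead of JSON errors):
library(V8)
ctx <- v8()
ctx$eval(paste0("var series = ", js_fragment))  # js_fragment: hypothetical, see lead-in
series <- ctx$get("series")
str(series, max.level = 1)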

Extracting href attr or converting node to character list

I am trying to extract some information from a website:
library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
html <- html(url)
nodes <- html_nodes(html, ".listItemSolr")
nodes
I get "list" of 30 parts of HTML code. I want from each element of the "list" extract last href attribute, so for the 30. element it would be
<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobotę prezentacja hasła i programu wyborczego Komorowskiego">
so I want to get string
"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"
The problem is that html_attr(nodes, "href") doesn't work (I get a vector of NAs). So I thought about regex, but the problem is that nodes isn't a character list.
class(nodes)
[1] "XMLNodeSet"
I tried
xmlToList(nodes)
but it doesn't work either.
So my question is: how can I extract this URL with some function created for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a character list?
Try searching inside nodes' children:
nodes <- html_nodes(html, ".listItemSolr")
sapply(html_children(nodes), function(x) {
  html_attr(x$a, "href")
})
Update
Hadley suggested using elegant pipes:
html %>%
  html_nodes(".listItemSolr") %>%
  html_nodes(xpath = "./a") %>%
  html_attr("href")
The XML package's function getHTMLLinks() can do virtually all the work for us; we just have to write the XPath query. Here we query all the node attributes to determine whether any contains "listItemSolr", then select the parent node for the href query.
getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")
In xpQuery we are doing the following:
//@*[contains(., 'listItemSolr')] query all node attributes for listItemSolr
/.. select the parent node
/a/@href get the href links