R parses incomplete text from webpages (HTML)

I am trying to parse the plain text of multiple scientific articles for subsequent text analysis. So far I have used an R script by Tony Breyal based on the packages RCurl and XML. It works fine for all targeted journals except those published on http://www.sciencedirect.com. When I parse articles from SD (and this is consistent across all SD journals I tested), the text object in R only stores the first part of the whole document. Unfortunately, I am not very familiar with HTML, but I suspect the problem lies in the SD HTML code, since everything works in all other cases.
I am aware that some journals are not open access, but I have the necessary access authorisations, and the problem also occurs in open access articles (check the example).
This is the code from Github:
htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)

  ###--- LOCAL FUNCTIONS ---###
  # determine how to grab the html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if (file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }
    # if input is html text
    if (grepl("</html>", input, fixed = TRUE)) return(input)
    # if input is a URL, probably should use a regex here instead?
    if (!grepl(" ", input)) {
      # download the SSL certificate bundle in case of https problems
      if (!file.exists("cacert.pem")) download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.pem"))
    }
    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format the text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: evaluate input
  html.list <- lapply(input, evaluate_input)
  # STEP 2: extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)
  # STEP 3: return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}
This is now my code and an example article:
target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)
The unformatted text stops somewhere in the Method section:
DNA was extracted using the MasterPure™ Yeast DNA Purification Kit
(Epicentre, Madison, Wisconsin, USA) following the manufacturer's
instructions.
Any suggestions/ideas?
P.S. I also tried html_text based on rvest with the same outcome.

You can probably use your existing code and just add ?np=y to the end of the URL.
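For example, keeping the htmlToText() function above unchanged, the only change relative to the question's code is the target URL:
target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"
temp.text <- htmlToText(target)
That said, the rvest version below is a bit more compact: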
library(rvest)
library(stringi)
target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"
pg <- read_html(target)
pg %>%
  html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>%
  stri_trim() %>%
  paste0(collapse=" ") %>%
  write(file="output.txt")
A bit of the output (total for that article was >80K):
Fungal Ecology Volume 22 , August 2016, Pages 61–72 175394|| Species richness
influences wine ecosystem function through a dominant species Primrose J. Boynton a , , ,
Duncan Greig a , b a Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany
b The Galton Laboratory, Department of Genetics, Evolution, and Environment, University
College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016,
Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
Davey Abstract Increased species richness does not always cause increased ecosystem function.
Instead, richness can influence individual species with positive or negative ecosystem effects.
We investigated richness and function in fermenting wine, and found that richness indirectly
affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae .
While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich
communities, probably because antagonistic species prevent it from growing. It is also diluted
from species-poor communities,

Related

Loop through multiple links from an Excel file, open and download the corresponding webpages

I downloaded an Excel file from MediaCloud with 1719 links to different newspaper articles. I am trying to use R to loop through each link, open it, and download the corresponding online articles into a single searchable file (HTML, CSV, TXT, PDF - it doesn't matter) that I can read and analyze later.
I went through all similar questions on Stack Overflow and a number of tutorials for downloading files and managed to assemble this code (I am very new to R):
express <-read.csv("C://Users//julir//Documents//Data//express.csv")
library(curl)
for (express$url in 2:1720)
destfile <- paste0("C://Users//julir//Documents//Data//results.csv")
download.file(express$url, destfile, method = "auto", quiet = TRUE, cacheOK=TRUE)
Whenever I try to run it though I get the following error:
Error in download.file(express$url, destfile = express$url, method = "auto", : 'url' must be a length-one character vector
I tried also this alternative method suggested online:
library(httr)
url <- express$url
response <- GET(express$url)
html_document <- content(response, type = "text", encoding = "UTF-8")
But I get the same kind of error:
Error in parse_url(url) : length(url) == 1 is not TRUE
So I guess there is a problem with how the URLs are stored - but I can't understand how to fix it.
I am also not certain about the downloading process - I would ideally want all the text on the HTML page - it seems impractical to use selectors and rvest in this case - but I might be very wrong.
You need to loop through the URLs and read/parse each one individually. You are essentially passing a whole vector of URLs into a single request, which is why you see that error.
I don't know your content/URLs, but here's an example of how you would approach this:
library(xml2)
library(jsonlite)
library(stringi)
library(dplyr)
df <- data.frame(page_n = 1:5, urls = sprintf('https://www.politifact.com/factchecks/list/?page=%s', 1:5))
result_info <- lapply(df$urls, function(i){
  raw <- read_html(i)
  a_tags <- raw %>% xml_find_all(".//a[contains(@href, 'factchecks/2021')]")
  urls <- xml2::url_absolute(xml_attr(a_tags, "href"), xml_url(raw))
  titles <- xml_text(a_tags) %>% stri_trim_both()
  data.frame(title = titles, links = urls)
}) %>% rbind_pages()
result_info %>% head()
title: Says of UW-Madison, "It cost the university $50k (your tax dollars) to remove" a rock considered by some a symbol of racism.
links: https://www.politifact.com/factchecks/2021/aug/14/rachel-campos-duffy/no-taxpayer-funds-were-not-used-remove-rock-deemed/

title: “Rand Paul’s medical license was just revoked!”
links: https://www.politifact.com/factchecks/2021/aug/13/facebook-posts/no-rand-pauls-medical-license-wasnt-revoked/

title: Every time outgoing New York Gov. Andrew Cuomo “says the firearm industry ‘is immune from lawsuits,’ it's false.”
links: https://www.politifact.com/factchecks/2021/aug/13/elise-stefanik/refereeing-andrew-cuomo-elise-stefanik-firearm-ind/

title: The United States' southern border is "basically open" and is "a super spreader event.”
links: https://www.politifact.com/factchecks/2021/aug/13/gary-sides/north-carolina-school-leader-repeats-false-claims-/

title: There is a “0.05% chance of dying from COVID.”
links: https://www.politifact.com/factchecks/2021/aug/13/tiktok-posts/experts-break-down-numbers-catching-or-dying-covid/

title: The Biden administration is “not even testing these people” being released by Border Patrol into the U.S.
links: https://www.politifact.com/factchecks/2021/aug/13/ken-paxton/biden-administration-not-even-testing-migrants-rel/
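Applied to your express.csv file, a minimal sketch could look like the following. Apart from the paths taken from your question, everything here is an assumption: it grabs all text on each page (crudely, script contents included) and writes one .txt file per article, skipping links that fail to load:
library(xml2)
express <- read.csv("C://Users//julir//Documents//Data//express.csv", stringsAsFactors = FALSE)
for (i in seq_along(express$url)) {
  # fetch one article at a time; tryCatch keeps a single bad link from stopping the whole loop
  page <- tryCatch(read_html(express$url[i]), error = function(e) NULL)
  if (is.null(page)) next
  # crude "all text on the page" extraction (script/style text included)
  txt <- xml_text(page)
  # one .txt file per article; the output folder and filenames are arbitrary choices
  writeLines(txt, file.path("C://Users//julir//Documents//Data", paste0("article_", i, ".txt")))
}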

Scraping WordPress reviews

I am just learning R programming. For an exercise, I want to scrape reviews of a WordPress plugin that seems to have been discontinued (linked below).
I start by specifying the URL
> url <- 'https://wordpress.org/plugins/demo-data-creator/#reviews'
Scraping the HTML content from the URL
> url <- read_html('https://wordpress.org/plugins/demo-data-creator/#reviews')
Extract the title of each review using its CSS class
> reviews <- html_nodes(url, 'h3.review-title')
Strip out the HTML tags, leaving only the content of the title
> titletext <- html_text(reviews)
Print the titles scraped
> head(titletext)
> [1] "Good for development"
> [2] "Used it for creating test users"
> [3] "Excelent! negative comments come from people who doesn't read!"
> [4] "Thanks"
> [5] "Does EXACTLY what it says it will – thanks! Very Handy"
> [6] "Dangerous plugin"
I repeat the same for the contents of the reviews
> reviewcontent <- html_nodes(url, 'div.review-content')
> reviewtext <- html_text(reviewcontent)
And prints out
> head(reviewcontent)
> {xml_nodeset (6)} [1] <div class="review-content">Good and handy tool
> for deve ... [2] <div class="review-content">This plugin came in very
> han ... [3] <div class="review-content">Does exactly what it offers!
> ... [4] <div class="review-content">Thanks</div> [5] <div
> class="review-content">Very handy for a test system ... [6] <div
> class="review-content">I have to agree with viesli ...
However, I realized it didn't scrape all the reviews, as there are more on the plugin's support/reviews page.
Is there a way to tell R to go through each listed review, extract the title and review content, and populate them into a table?
You can use the same approach to extract the reviews from the second link. The main difference is that the content of each review is in its own page. Hence, you need two steps:
Extract the list of review page URLs from the main page.
For each URL, fetch the page, and extract the title and content of the review.
For step 1, it is very similar to what you already did, except that you are now trying to extract the URLs for the review pages. If you inspect that page, you'll see that these links (<a> elements) have CSS class bbp-topic-permalink.
So, we can extract them using:
links <- html_nodes(page, css='a.bbp-topic-permalink')
Now, we don't want the text part of the tag, but rather the href attribute value (where the link is pointing). We can extract that using
reviewurls <- html_attr(links, 'href')
For step 2, we will loop over the list of reviewurls, and for each one, fetch the page using read_html, extract the title and the content using html_node and html_text, then add them to a table/matrix/data.frame.
The loop can be done using:
for (u in reviewurls) {
}
Inside the loop, u is the variable that holds the current review URL. We will use read_html to read the page, then extract the title and content.
Inspecting the review page, the title is in an <h1> tag with CSS class page-title. Similarly, the content of the review is inside a <div> with CSS class bbp-topic-content.
So, inside the loop, you can do this:
page = read_html(u)
reviewT = html_text(html_node(page, css='h1.page-title'))
reviewC = html_text(html_node(page, css='div.bbp-topic-content'))
Now you will have both the title and the content for that particular review. You can add them to a list, so that by the end of the loop, you will have titles and contents of all the reviews.
The final code will look like this:
library(rvest)

url <- 'https://wordpress.org/support/plugin/demo-data-creator/reviews/'
page <- read_html(url)
links <- html_nodes(page, css='a.bbp-topic-permalink')
reviewurls <- html_attr(links, 'href')
# two empty vectors, to be populated inside the loop
titles = c()
contents = c()
for (u in reviewurls) {
  page = read_html(u)
  reviewT = html_text(html_node(page, css='h1.page-title'))
  reviewC = html_text(html_node(page, css='div.bbp-topic-content'))
  titles = c(titles, reviewT)
  contents = c(contents, reviewC)
}
Once it's done, you will get:
> length(titles)
[1] 21
> head(titles)
[1] "Good for development"
[2] "Used it for creating test users"
[3] "Excelent! negative comments come from people who doesn't read!"
...
> head(contents)
[1] "\n\n\t\t\t\t\n\t\t\t\tGood and handy tool for development.\n\n\n\n\t\tThis topic was modified 2 years, 10 months ago by Subrata Sarkar.\n\t\n\n"
[2] "\n\n\t\t\t\t\n\t\t\t\tThis plugin came in very handy during development of my own plugin. I used it to create a lot of users and it did exactly what it should.\nNot sure where all the negativity about wiping the database comes from. Are they users that didn’t read all the warnings? Or did older versions of the plugin not warn about wiping all data? Anyway, now it does \U0001f642\n\n\t\t\t\t\n\t\t\t"
[3] "\n\n\t\t\t\t\n\t\t\t\tDoes exactly what it offers! Nothing less.\nPeople complaining is too lazy to read the SEVERAL warnings about the usage of this plugin.\n\n\t\t\t\t\n\t\t\t"
...
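If you would rather have the result as a table than as two separate vectors, you can combine them into a data frame at the end; trimws() simply strips the surrounding whitespace visible in the raw contents above:
reviews <- data.frame(title = titles, content = trimws(contents), stringsAsFactors = FALSE)
head(reviews)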

How to get descriptive table for both continuous and categorical variables?

I want to get a descriptive table in HTML format for all variables in a data frame. For continuous variables I need the mean and standard deviation; for categorical variables, the frequency (absolute count) and the percentage of each category. I also need the count of missing values to be included.
Let's use this data:
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
I want to get a table in HTML format that looks like this:
----------------------------------------------------------------------
Variables N (missing) Mean (SD) / %
----------------------------------------------------------------------
len 59 (1) 18.9 (7.65)
supp
OJ 30 50%
VC 29 48.33%
NA 1 1.67%
dose 60 1.17 (0.629)
I also need to be able to set the number of digits shown after the decimal point.
If you know a better way to display that information in HTML, please provide your solution.
Here's a programmatic way to create separate summary tables for the numeric and factor columns. Note that this doesn't flag the NAs in the table as you requested, but it does ignore NAs when calculating the summary stats, as you did. It's a starting point, anyway. From here you could combine the tables and format the headers however you want.
If you knit this code within an R Markdown document with HTML output, kable will automatically generate the HTML table and CSS will format it nicely with horizontal rules. Note that there is also a booktabs option to kable that makes prettier tables, like the LaTeX booktabs package. Otherwise, see the documentation for knitr::kable for options.
library(dplyr)
library(tidyr)
library(knitr)
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
numeric_cols <- dplyr::select_if(df, is.numeric) %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable) %>%
  summarize(count = n(),
            mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))

factor_cols <- dplyr::select_if(df, is.factor) %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable, value) %>%
  summarize(count = n()) %>%
  mutate(p = count / sum(count, na.rm = TRUE))
knitr::kable(numeric_cols)
knitr::kable(factor_cols)
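To address the decimal-places requirement, kable has a digits argument (2 here is just an example value):
knitr::kable(numeric_cols, digits = 2)
knitr::kable(factor_cols, digits = 2)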
I found the R package table1, which does what I want. Here is the code:
library(table1)
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
table1(reformulate(colnames(df)), data=df)

Navigate to a link using html session in R

I am trying to navigate to a link on a website. All the links work except for one single link. Here are the results.
> mcsession<-html_session("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
> mcsession<-mcsession %>% follow_link("Previous Years »")
Error: No links have text 'Previous Years »'
In addition: Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
> mcsession<-mcsession %>% follow_link("Balance Sheet")
Navigating to /financials/tataconsultancyservices/balance-sheetVI/TCS#TCS
Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
Any idea why this happens?
It is not a normal link - it is JavaScript. I don't know of a way of doing it with rvest, but you could use RSelenium, which basically automates a normal browser window. It is slower than scraping directly, but you can automate just about anything that you can do by hand. This works for me (using Chrome on Windows 10)...
library(RSelenium)
rD <- rsDriver(port=4444L,browser="chrome")
remDr <- rD$client
remDr$navigate("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
firstpage <- remDr$getPageSource() #you can use this to get the first table
#(1)
webElem1 <- remDr$findElement(using = 'partial link text', value = "Previous Years")
webElem1$clickElement()
nextpage <- remDr$getPageSource() #you can use this to get the next page for previous years
#repeat from #(1) to go back another page etc
remDr$closeall() #when you have finished.
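If you need several previous-year pages, you can wrap the click-and-grab steps in a loop and parse the stored page sources afterwards. A rough sketch, with rvest used for the parsing and three extra pages as an arbitrary choice:
library(rvest)
pages <- list(firstpage)
for (i in 1:3) {                        # 3 earlier pages is an arbitrary choice
  webElem1 <- remDr$findElement(using = 'partial link text', value = "Previous Years")
  webElem1$clickElement()
  Sys.sleep(2)                          # give the page a moment to load
  pages[[length(pages) + 1]] <- remDr$getPageSource()
}
# getPageSource() returns a one-element list holding the HTML as a string,
# so parse pages[[i]][[1]] and pull out any tables found on each page
tables <- lapply(pages, function(p) html_table(read_html(p[[1]]), fill = TRUE))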

Edit map with "R for leaflet"

I have a script which allows me to generate a map with "R for leaflet":
library(htmlwidgets)
library(raster)
library(leaflet)
# PATHS TO INPUT / OUTPUT FILES
projectPath = "path"
#imgPath = paste(projectPath,"data/cea.tif", sep = "")
#imgPath = paste(projectPath,"data/o41078a1.tif", sep = "") # bigger than standard max size (15431804 bytes is greater than maximum 4194304 bytes)
imgPath = paste(projectPath,"/test.tif", sep = "")
outPath = paste(projectPath, "/leaflethtmlgen.html", sep="")
# load raster image file
r <- raster(imgPath)
# reproject the image, if necessary
#crs(r) <- sp::CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# color palette, which is interpolated ?
pal <- colorNumeric(c("#FF0000", "#666666", "#FFFFFF"), values(r),
                    na.color = "transparent")
# create the leaflet widget
m <- leaflet() %>%
  addTiles() %>%
  addRasterImage(r, colors = pal, opacity = 0.9, maxBytes = 123123123) %>%
  addLegend(pal = pal, values = values(r), title = "Test")
# save the generated widget to html
# contains the leaflet widget AND the image.
saveWidget(m, file = outPath, selfcontained = FALSE, libdir = 'leafletwidget_libs')
My problem is that this generates a static HTML file, and I need the map to be dynamic. For example, when a user clicks on some HTML button which is not integrated into the map, I want to add a rectangle to the map. Any solutions would be welcome...
Leaflet itself does not provide the interactive functionality you are looking for. One solution is to use shiny, which is a web application framework for R. From simple R code, it generates a web page, and runs R on the server-side to respond to user interaction. It is well documented, has a gallery of examples, and a tutorial to get new users started.
It works well with leaflet. One of the examples on the shiny web site uses it, and also includes a link to the source code.
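As a rough illustration of the pattern (not your raster map; the button id and rectangle coordinates are placeholder values), a minimal shiny app that adds a rectangle when an HTML button outside the map is clicked could look like this:
library(shiny)
library(leaflet)

ui <- fluidPage(
  leafletOutput("map"),
  actionButton("add_rect", "Add rectangle")   # a plain HTML button outside the map
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet() %>% addTiles()
  })
  # when the button is clicked, modify the existing map instead of redrawing it
  observeEvent(input$add_rect, {
    leafletProxy("map") %>%
      addRectangles(lng1 = 9.9, lat1 = 54.1, lng2 = 10.1, lat2 = 54.3, color = "red")
  })
}

shinyApp(ui, server)
leafletProxy() is the key piece here: it lets you add or remove layers on an already-rendered map in response to input events, rather than rebuilding the whole widget.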
Update
Actually, if simple showing/hiding of elements is enough, leaflet alone will suffice with the use of groups. From the question it's not very clear how dynamic you need it to be.
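For example, a sketch of the groups approach with the widget alone (the rectangle coordinates, group name and output file are placeholders); the layers control gives the reader a checkbox to show or hide the rectangle, with no server needed:
m <- leaflet() %>%
  addTiles() %>%
  addRectangles(lng1 = 9.9, lat1 = 54.1, lng2 = 10.1, lat2 = 54.3,
                color = "red", group = "Rectangle") %>%
  addLayersControl(overlayGroups = "Rectangle",
                   options = layersControlOptions(collapsed = FALSE)) %>%
  hideGroup("Rectangle")   # start hidden; the user can tick the box to show it
saveWidget(m, file = "leaflet_groups.html", selfcontained = FALSE)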