Scraping WordPress reviews - html

I am just learning R programming. For an exercise, I want to scrape reviews of a WordPress plugin that seems to have been discontinued here
I start by specifying the URL
> url <- 'https://wordpress.org/plugins/demo-data-creator/#reviews'
Scraping the HTML content from the URL
> url <- read_html('https://wordpress.org/plugins/demo-data-creator/#reviews')
Extract the title of each review using its CSS class
> reviews <- html_nodes(url, 'h3.review-title')
Strip out the HTML tags, leaving only the content of the title
> titletext <- html_text(reviews)
Print the titles scraped
> head(titletext)
> [1] "Good for development"
> [2] "Used it for creating test users"
> [3] "Excelent! negative comments come from people who doesn't read!"
> [4] "Thanks"
> [5] "Does EXACTLY what it says it will – thanks! Very Handy"
> [6] "Dangerous plugin"
I repeat the same for the contents of the reviews
> reviewcontent <- html_nodes(url, 'div.review-content')
> reviewtext <- html_text(reviewcontent)
And it prints out:
> head(reviewcontent)
> {xml_nodeset (6)}
> [1] <div class="review-content">Good and handy tool for deve ...
> [2] <div class="review-content">This plugin came in very han ...
> [3] <div class="review-content">Does exactly what it offers! ...
> [4] <div class="review-content">Thanks</div>
> [5] <div class="review-content">Very handy for a test system ...
> [6] <div class="review-content">I have to agree with viesli ...
However, I realized it didn't scrape all the reviews, as there are more listed here.
Is there a way to tell R to go through each review listed, extract the title and review content, and populate them into a table?

You can use the same approach to extract the reviews from the second link. The main difference is that the content of each review is in its own page. Hence, you need two steps:
1. Extract the list of review page URLs from the main page.
2. For each URL, fetch the page and extract the title and content of the review.
For step 1, it is very similar to what you already did, except that you are now trying to extract the URLs for the review pages. If you inspect that page, you'll see that these links (<a> elements) have CSS class bbp-topic-permalink.
So, we can extract them using:
links <- html_nodes(page, css='a.bbp-topic-permalink')
Now, we don't want the text part of the tag, but rather the href attribute value (where the link is pointing). We can extract that using
reviewurls <- html_attr(links, 'href')
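Before moving on, it is worth a quick check that the links were actually picked up (the exact count depends on how many reviews are listed on that page):
length(reviewurls)
head(reviewurls)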
For step 2, we will loop over the list of reviewurls, and for each one, fetch the page using read_html, extract the title and the content using html_node and html_text, then add them to a table/matrix/data.frame.
The loop can be done using:
for (u in reviewurls) {
}
Inside the loop, u is the variable that holds the current review URL. We will use read_html to read the page, then extract the title and content.
Inspecting the review page, the title is in an <h1> tag with CSS class page-title. Similarly, the content of the review is inside a <div> with CSS class bbp-topic-content.
So, inside the loop, you can do this:
page = read_html(u)
reviewT = html_text(html_node(page, css='h1.page-title'))
reviewC = html_text(html_node(page, css='div.bbp-topic-content'))
Now you will have both the title and the content for that particular review. You can add them to a list, so that by the end of the loop, you will have titles and contents of all the reviews.
The final code will look like this:
url <- 'https://wordpress.org/support/plugin/demo-data-creator/reviews/'
page <- read_html(url)
links <- html_nodes(page, css='a.bbp-topic-permalink')
reviewurls <- html_attr(links, 'href')
# Two empty vectors, to be populated inside the loop
titles = c()
contents = c()
for (u in reviewurls) {
page = read_html(u)
reviewT = html_text(html_node(page, css='h1.page-title'))
reviewC = html_text(html_node(page, css='div.bbp-topic-content'))
titles = c(titles, reviewT)
contents = c(contents, reviewC)
}
Once it's done, you will get:
> length(titles)
[1] 21
> head(titles)
[1] "Good for development"
[2] "Used it for creating test users"
[3] "Excelent! negative comments come from people who doesn't read!"
...
> head(contents)
[1] "\n\n\t\t\t\t\n\t\t\t\tGood and handy tool for development.\n\n\n\n\t\tThis topic was modified 2 years, 10 months ago by Subrata Sarkar.\n\t\n\n"
[2] "\n\n\t\t\t\t\n\t\t\t\tThis plugin came in very handy during development of my own plugin. I used it to create a lot of users and it did exactly what it should.\nNot sure where all the negativity about wiping the database comes from. Are they users that didn’t read all the warnings? Or did older versions of the plugin not warn about wiping all data? Anyway, now it does \U0001f642\n\n\t\t\t\t\n\t\t\t"
[3] "\n\n\t\t\t\t\n\t\t\t\tDoes exactly what it offers! Nothing less.\nPeople complaining is too lazy to read the SEVERAL warnings about the usage of this plugin.\n\n\t\t\t\t\n\t\t\t"
...
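Since you asked about populating a table: once the loop finishes, you can combine the two vectors into a data frame. The contents still carry the leading and trailing \n and \t characters visible above, which trimws() strips; collapsing the internal runs of whitespace as well is optional. A minimal sketch using base R:
reviews <- data.frame(title = titles, content = trimws(contents), stringsAsFactors = FALSE)
reviews$content <- gsub("\\s+", " ", reviews$content)
head(reviews)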

Related

Loop through multiple links from an Excel file, open and download the corresponding webpages

I downloaded from MediaCloud an Excel file with 1719 links to different newspaper articles. I am trying to use R to loop through each link, open it and download all the corresponding online articles in a single searchable file (HTML, CSV, TXT, PDF - doesn't matter) that I can read and analyze later.
I went through all similar questions on Stack Overflow and a number of tutorials for downloading files and managed to assemble this code (I am very new to R):
express <-read.csv("C://Users//julir//Documents//Data//express.csv")
library(curl)
for (express$url in 2:1720)
destfile <- paste0("C://Users//julir//Documents//Data//results.csv")
download.file(express$url, destfile, method = "auto", quiet = TRUE, cacheOK=TRUE)
Whenever I try to run it though I get the following error:
Error in download.file(express$url, destfile = express$url, method = "auto", : 'url' must be a length-one character vector
I tried also this alternative method suggested online:
library(httr)
url <- express$url
response <- GET(express$url)
html_document <- content(response, type = "text", encoding = "UTF-8")
But I get the same mistake:
Error in parse_url(url) : length(url) == 1 is not TRUE
So I guess there is a problem with how the URLs are stored - but I can't understand how to fix it.
I am also not certain about the downloading process - I would ideally want all the text on the HTML page - it seems impractical to use selectors and rvest in this case, but I might be very wrong.
You need to loop through the URLs and read/parse each one individually. You are essentially passing a whole vector of URLs into a single request, which is why you see that error.
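For the immediate error, a minimal sketch of a corrected version of your own loop could look like this (assuming express.csv really has a column named url; the per-file naming scheme is just illustrative, and you need one destination file per URL or each download will overwrite the previous one):
express <- read.csv("C://Users//julir//Documents//Data//express.csv", stringsAsFactors = FALSE)
for (i in seq_along(express$url)) {
  # one output file per article, numbered by row
  destfile <- paste0("C://Users//julir//Documents//Data//article_", i, ".html")
  download.file(express$url[i], destfile, method = "auto", quiet = TRUE, cacheOK = TRUE)
}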
I don't know your content/urls, but here's an example of how you would approach this:
library(xml2)
library(jsonlite)
library(dplyr)
library(stringi)
df <- data.frame(page_n = 1:5, urls = sprintf('https://www.politifact.com/factchecks/list/?page=%s', 1:5))
result_info <- lapply(df$urls, function(i){
  raw <- read_html(i)
  a_tags <- raw %>% xml_find_all(".//a[contains(@href,'factchecks/2021')]")
  urls <- xml2::url_absolute(xml_attr(a_tags, "href"), xml_url(raw))
  titles <- xml_text(a_tags) %>% stri_trim_both()
  data.frame(title = titles, links = urls)
}) %>% rbind_pages()
result_info %>% head()
result_info %>% head()
(title, links)
[1] Says of UW-Madison, "It cost the university $50k (your tax dollars) to remove" a rock considered by some a symbol of racism.
    https://www.politifact.com/factchecks/2021/aug/14/rachel-campos-duffy/no-taxpayer-funds-were-not-used-remove-rock-deemed/
[2] “Rand Paul’s medical license was just revoked!”
    https://www.politifact.com/factchecks/2021/aug/13/facebook-posts/no-rand-pauls-medical-license-wasnt-revoked/
[3] Every time outgoing New York Gov. Andrew Cuomo “says the firearm industry ‘is immune from lawsuits,’ it's false.”
    https://www.politifact.com/factchecks/2021/aug/13/elise-stefanik/refereeing-andrew-cuomo-elise-stefanik-firearm-ind/
[4] The United States' southern border is "basically open" and is "a super spreader event.”
    https://www.politifact.com/factchecks/2021/aug/13/gary-sides/north-carolina-school-leader-repeats-false-claims-/
[5] There is a “0.05% chance of dying from COVID.”
    https://www.politifact.com/factchecks/2021/aug/13/tiktok-posts/experts-break-down-numbers-catching-or-dying-covid/
[6] The Biden administration is “not even testing these people” being released by Border Patrol into the U.S.
    https://www.politifact.com/factchecks/2021/aug/13/ken-paxton/biden-administration-not-even-testing-migrants-rel/

Trying to scrape business website from webpage with R

I am attempting to scrape the website link:
https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
from the Contact for Appointment Popup on
https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984 using RStudio.
I have tried
page <- read_html("https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984")
Website <- html_attr(html_nodes(page,xpath = '//*[@id="appointment-vendors-categories"]/div/div[3]/div[1]/span/a'),"href")
This is my output: character(0).
The desired output is: https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
I successfully scraped info from the contact section at the bottom of the page using the code below, but the same method doesn't seem to be working for the link.
Name_of_Vendor2 <- substr((page %>% html_nodes("h3") %>% html_text),18,70)
Phone_of_Vendor <- html_text(html_nodes(page, xpath = "//div[#class = 'contact-info--900c8 body1--711dc']/span[2]"))
Address_of_Vendor <- html_text(html_nodes(page, xpath = "//div[#class = 'contact-info--900c8 body1--711dc']/span[1]"))
I took a look at the downloaded HTML by writing it to a file using rvest::write_xml(page, file="temp.html"). I then searched for the URL and found it in one of the last script tags as a JSON object, where the key for the URL is websiteUrl. So I selected that script tag and ran a regular expression on its content to get the URL:
scripts <- html_nodes(page,xpath = '//script[@type="text/javascript"]')
script <- html_text(scripts[[length(scripts)-1]])
stringr::str_extract(script,'(?<="websiteUrl":").+?(?=")')
#> [1] "http://www.AnElegantAffairBridal.com?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot"
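If you need the same thing for other vendor pages, the steps can be wrapped into a small helper. This is only a sketch: the function name get_vendor_website is made up, and it assumes every theknot.com vendor page embeds the same websiteUrl key in one of its trailing script tags.
library(rvest)
library(stringr)
get_vendor_website <- function(vendor_url) {
  page <- read_html(vendor_url)
  scripts <- html_nodes(page, xpath = '//script[@type="text/javascript"]')
  # the JSON blob with the vendor details sits in the second-to-last script tag
  script <- html_text(scripts[[length(scripts) - 1]])
  str_extract(script, '(?<="websiteUrl":").+?(?=")')
}
get_vendor_website("https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984")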

Navigate to a link using html session in R

I am trying to navigate to a link on a website. All the links work except for one single link. Here are the results.
> mcsession<-html_session("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
> mcsession<-mcsession %>% follow_link("Previous Years »")
Error: No links have text 'Previous Years »'
In addition: Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
> mcsession<-mcsession %>% follow_link("Balance Sheet")
Navigating to /financials/tataconsultancyservices/balance-sheetVI/TCS#TCS
Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
Any idea why this happens?
It is not a normal link - it is JavaScript. I don't know of a way of doing it with rvest, but you could use RSelenium, which basically automates a normal browser window. It is slower than scraping directly, but you can automate just about anything that you can do by hand. This works for me (using Chrome on Windows 10):
library(RSelenium)
rD <- rsDriver(port=4444L,browser="chrome")
remDr <- rD$client
remDr$navigate("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
firstpage <- remDr$getPageSource() #you can use this to get the first table
#(1)
webElem1 <- remDr$findElement(using = 'partial link text', value = "Previous Years")
webElem1$clickElement()
nextpage <- remDr$getPageSource() #you can use this to get the next page for previous years
#repeat from #(1) to go back another page etc
remDr$closeall() #when you have finished.
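Once you have the page source, you can hand it back to rvest to pull the tables out. A sketch, under the assumption that getPageSource() returns a list whose first element is the HTML string; which table index holds the balance sheet depends on the page layout.
library(rvest)
tables <- read_html(nextpage[[1]]) %>% html_table()
length(tables)     # how many tables were found on the page
head(tables[[1]])  # inspect the first one; pick the index that matches the balance sheet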

R Parses incomplete text from webpages (HTML)

I am trying to parse the plain text from multiple scientific articles for subsequent text analysis. So far I use an R script by Tony Breyal based on the packages RCurl and XML. This works fine for all targeted journals except for those published by http://www.sciencedirect.com. When I try to parse articles from SD (and this is consistent for all tested journals I need to access from SD), the text object in R only stores the first part of the whole document. Unfortunately, I am not too familiar with HTML, but I think the problem lies in the SD HTML code, since it works in all other cases.
I am aware that some journals are not open access, but I have access authorisations, and the problems also occur in open-access articles (check the example).
This is the code from Github:
htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)

  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if(file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }
    # if input is html text
    if(grepl("</html>", input, fixed = TRUE)) return(input)
    # if input is a URL, probably should use a regex here instead?
    if(!grepl(" ", input)) {
      # download SSL certificate in case of https problem
      if(!file.exists("cacert.perm")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.perm")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
    }
    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: Evaluate input
  html.list <- lapply(input, evaluate_input)
  # STEP 2: Extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)
  # STEP 3: Return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}
This is now my code and an example article:
target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)
The unformatted text stops somewhere in the Method section:
DNA was extracted using the MasterPure™ Yeast DNA Purification Kit
(Epicentre, Madison, Wisconsin, USA) following the manufacturer's
instructions.
Any suggestions/ideas?
P.S. I also tried html_text based on rvest with the same outcome.
You can probably use your existing code and just add ?np=y to the end of the URL, but this is a bit more compact:
library(rvest)
library(stringi)
target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"
pg <- read_html(target)
pg %>%
html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>%
stri_trim() %>%
paste0(collapse=" ") %>%
write(file="output.txt")
A bit of the output (total for that article was >80K):
Fungal Ecology Volume 22 , August 2016, Pages 61–72 175394|| Species richness
influences wine ecosystem function through a dominant species Primrose J. Boynton a , , ,
Duncan Greig a , b a Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany
b The Galton Laboratory, Department of Genetics, Evolution, and Environment, University
College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016,
Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
Davey Abstract Increased species richness does not always cause increased ecosystem function.
Instead, richness can influence individual species with positive or negative ecosystem effects.
We investigated richness and function in fermenting wine, and found that richness indirectly
affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae .
While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich
communities, probably because antagonistic species prevent it from growing. It is also diluted
from species-poor communities,
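Since the goal is to parse multiple articles, the same extraction can be run over a vector of URLs. This is only a sketch: article_urls stands in for your own list of links, and the centerContent XPath is specific to ScienceDirect's layout shown above.
library(rvest)
library(stringi)
article_urls <- c("http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y")  # placeholder for your own links
texts <- vapply(article_urls, function(u) {
  read_html(u) %>%
    html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>%
    stri_trim() %>%
    paste0(collapse=" ")
}, character(1))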

Putting hyperlinks into an HTML table in R

I am a biologist trying to do computer science for research, so I may be a bit naïve. But I would like to make a table containing information from a data frame, with a hyperlink in one of the columns. I imagine this needs to be an HTML document (?). I found this post describing how to put a hyperlink into a data frame and write it as an HTML file using googleVis. I would like to use this approach (it is the only one I know and seems to work well), except I would like to replace the actual URL with a description. The real motivation is that I would like to include many of these hyperlinks, and the links have long addresses which are difficult to read.
To be verbose, I essentially want to do what I did here where we read 'here' but 'here' points to
http://stackoverflow.com/questions/8030208/exporting-table-in-r-to-html-with-hyperlinks
From your previous question, you can have another vector which contains the titles of the URLs:
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x = gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
Building on Jack's answer but consolidating from different threads:
library(googleVis)
library(R2HTML)
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- data.frame(a=c(1,2,3), b=c(4,5,6), url=url)
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x <- gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
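If you prefer, the link column can also be built with sprintf, which keeps the quoting a bit easier to read. A sketch with the same example data; the resulting HTML anchors are equivalent to the transform()/shQuote() version above.
library(googleVis)
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6),
                  url = sprintf('<a href="%s">%s</a>', url, urlTitles))
x <- gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)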