How can I web-scrape in R to count the images and videos on this page? Sorry, I'm new to web scraping and would like some help.
Here is the link:
https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm
It should yield a video count of 1 and an image count of 9, but I've only been able to get this far:
library('dplyr')
library('rvest')
library('xml2')
library('selectr')
library("httr")
website<-read_html("https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm")
website%>%html_nodes("div.template.asset")
I think the main issue is that the main body of the page, which contains all the images, loads only once the page is opened in a browser. As far as I know, rvest cannot handle this kind of site architecture. RSelenium is a bit more complicated to use, but it works fine for this kind of problem.
library(RSelenium)
# open server session and open browser client ----
rD <- rsDriver(browser = c("firefox"), verbose = FALSE)
# adapt this to your browser version
remDr <- rD$client
# navigate to url in browser client ----
url <- "https://www.kickstarter.com/projects/urban-farm-florida/hammock-greens-vertical-hydroponic-urban-farm"
remDr$navigate(url)
# access image elements using xpath ----
elements_image <- remDr$findElements(using = "xpath", "//*[@class='template asset']")
# using only "//img" would show all images on the site = 12
# for some time the xpath "//img[@class='fit lazyloaded']" worked as well, somehow it stopped doing so
length(elements_image)
# show number of images on site
# [1] 9
unlist(lapply(elements_image, function(x) {x$getElementAttribute("alt")}))
# print the attribute "alt" to check output
# [1] "Vertical Columns hold 18 heads of greens, fed by a closed circuit of nutrient-rich water. "
# [2] ""
# [3] ""
# [4] "our farms can be found in unused, unloved, and unlikely spaces"
# [5] "Overtown 1 location at Lotus House homeless shelter"
# [6] "led spectrum manipulated to encourage leafy-green growth"
# [7] "vertical columns are harvested and planted in the same day to maximize efficiency "
# [8] "Chef Aaron shows off some young Romaine"
# [9] "Farmer Thomas keeps the whole team moving"
# access video elements using xpath ----
elements_video <- remDr$findElements(using = "xpath", "//video")
# since there is only one video and since its class seems very specific for the video, "//video" seems the better solution here
length(elements_video)
# show number of videos on site
# [1] 1
unlist(lapply(elements_video, function(x) {x$getElementAttribute("class")}))
# print the attribute "class" to check output
# [1] "aspect-ratio--object z1 has_hls hide"
# close browser client and close server session ----
remDr$close()
rD$server$stop()
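As a rough sketch of the counting step on its own, the same two XPath expressions can be applied to any HTML string with xml2. The helper name count_media and the sample HTML are assumptions made up for illustration; they are not part of the Kickstarter page or the answer above.

```r
library(xml2)

# Hypothetical helper: count image assets and videos in an HTML string,
# using the same XPath expressions as the RSelenium code above.
count_media <- function(html_source) {
  page <- read_html(html_source)
  c(
    images = length(xml_find_all(page, "//*[@class='template asset']")),
    videos = length(xml_find_all(page, "//video"))
  )
}

# Made-up sample markup, only to show the helper in action.
sample_html <- '<html><body>
  <div class="template asset"><img src="a.jpg" alt="first"></div>
  <div class="template asset"><img src="b.jpg" alt="second"></div>
  <video class="aspect-ratio--object"></video>
</body></html>'

counts <- count_media(sample_html)
print(counts)
```

With a live RSelenium session, the rendered source should be available via remDr$getPageSource()[[1]], which could be passed to the helper before the browser is closed.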
My objective
I'm attempting to use R to scrape text from a web page: https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150. For the purposes of this question, my goal is to access the header text that contains the station number ("Bridgeport, CT - Station ID: 8467150"). Below is a screenshot of the page. I've highlighted the text that I'm trying to verify is present, and the text is also highlighted in the inspect element pane.
My old approach was to access the full text of the site with readLines(). A recent update to the website has made the text more difficult to access, and the station name/number is no longer visible to readLines():
url <- "https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150"
stn <- "8467150"
webpage <- readLines(url, warn = FALSE)
### grep indicates that the station number is not present in the scraped text
grep(x = webpage, pattern = stn, value = TRUE)
Potential solutions
I am therefore looking for a new way to access my target text. I have tried using httr, but still cannot get all the html text to be included in what I scrape from the web page. The XML and rvest packages also seem promising, but I am not sure how to identify the relevant CSS selector or XPath expression.
### an attempt using httr
hDat <- httr::RETRY("GET", url, times = 10)
txt <- httr::content(hDat, "text")
### grep indicates that the station number is still not present
grep(x = txt, pattern = stn, value = TRUE)
### a partial attempt using XML
h <- xml2::read_html(url)
h2 <- XML::htmlTreeParse(h, useInternalNodes=TRUE, asText = TRUE)
### this may end up working, but I'm not sure how to identify the correct path
html.parse <- XML::xpathApply(h2, path = "div.span8", XML::xmlValue)
Regardless of the approach, I would welcome any suggestions that can help me access the header text containing the station name/number.
Unless you use Selenium, it will be very hard.
NOAA encourages you to use their free RESTful JSON APIs, and it goes to great lengths to discourage HTML scraping.
That said, the following code retrieves what you want from a NOAA JSON endpoint as a data frame.
library(tidyverse)
library(jsonlite)
j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)
j1$stations %>% as_tibble() %>% select(name, state, id)
Results
# A tibble: 1 x 3
  name       state id
  <chr>      <chr> <chr>
1 Bridgeport CT    8467150
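If the goal is to rebuild the header text itself, the three fields can be pasted together. The sketch below uses a small inline JSON string that only mimics the name, state and id fields selected above; the real response contains many more fields, and the header format is simply copied from the question.

```r
library(jsonlite)

# Inline stand-in for the API response (an assumption for illustration only).
sample_json <- '{"stations":[{"id":"8467150","name":"Bridgeport","state":"CT"}]}'
j <- fromJSON(sample_json)

# Rebuild the header text in the format quoted in the question.
header <- sprintf("%s, %s - Station ID: %s",
                  j$stations$name, j$stations$state, j$stations$id)
print(header)

# Check that the station number is present, as the original grep() intended.
stn <- "8467150"
found <- grepl(stn, header, fixed = TRUE)
```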
I'm trying to scrape the date and policy type for COVID related announcements from this url: https://covid19.healthdata.org/united-states-of-america/alabama
The first date I'm trying to pull is the "April 4th, 2020" date for Alabama's Stay at Home Order.
As far as I can tell (as I am new to this), it has the xpath:
"//*[@id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"
I've been using the following lines to try to retrieve it -
data <- read_html(url) %>%
  html_nodes("span.ant-statistic-content-value")
data <- read_html(url) %>%
  html_nodes(xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")
Neither is able to pull the information I'm looking for. Any help would be appreciated!
The data for this page is stored in a series of JSON files. If you open your browser's developer tools and filter the Network tab for requests of type XHR, you should see a list of these files (I used Safari).
Right-click a name to copy its URL.
This script should get you started:
library(jsonlite)
#obtain the list of locations
locations<-fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)
head(locations[, 1:9])
#get list of US locations
US <- locations$children[locations$location_name =="United States of America"]
head(US[[1]])
#Get data frame from interventions
#Create link with desired location_id (569 is Virginia)
#paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)
Interventions
#         date_reported covid_intervention_id location_id covid_intervention_measure_id covid_intervention_measure_name
# 1 2020-03-30 00:00:00                   110         569                             1 People instructed to stay at home
# 2 2020-03-16 00:00:00                   258         569                             2 Educational facilities closed
# 3 2020-04-19 00:00:00                   437         569                             7 Assumed_implemented_date
#Repeat for other links of interest
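To repeat this for other locations and pull out a single date, something along these lines might work. The helper name intervention_url is hypothetical, and the data frame below is typed in by hand from the output shown above rather than fetched from the API.

```r
# Hypothetical helper: build the intervention URL for a given location_id.
intervention_url <- function(location_id) {
  paste0("https://covid19.healthdata.org/api/data/intervention?location=", location_id)
}
virginia_url <- intervention_url(569)

# Hand-typed copy of the relevant columns from the output above.
interventions <- data.frame(
  date_reported = c("2020-03-30 00:00:00", "2020-03-16 00:00:00", "2020-04-19 00:00:00"),
  covid_intervention_measure_name = c("People instructed to stay at home",
                                      "Educational facilities closed",
                                      "Assumed_implemented_date"),
  stringsAsFactors = FALSE
)

# Keep the stay-at-home row and convert its timestamp to a Date.
is_stay_home <- interventions$covid_intervention_measure_name ==
  "People instructed to stay at home"
stay_home_date <- as.Date(substr(interventions$date_reported[is_stay_home], 1, 10))
print(stay_home_date)
```

In real use, the data frame would come from fromJSON(intervention_url(id), flatten = TRUE), as in the script above.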
I am trying to extract the digits of pi from a website using the rvest package in R, but it keeps giving me an xml error.
library(rvest)
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
pitable <- pisite %>%
html_node(xpath = "/html/body/table[2]/tbody/tr/td[1]/pre/text()[1]")
I keep getting the result:
{xml_missing}
NA
Note that I copied the value used for the xpath from the chrome website inspection tool. Although it does look a bit different to the xpaths I have gotten before.
Not sure what to change. Guessing it is something really simple. Any ideas?
Maybe this could help:
library(rvest)
library(dplyr)
# here the site
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
# here you catch what you need
pi <- pisite %>% html_nodes("pre") %>% html_text()
# here you replace the \n with nothing, to have the numbers only
pi <-gsub("\n", "", pi)
pi
[1] "3.1415926535897932384626433832795028841971 ...and so on..."
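One possible reason the copied XPath returned {xml_missing} (an assumption, not verified against this page) is that browsers insert a tbody element into tables in their DOM, while the raw HTML that read_html() parses may not contain one. The sketch below uses made-up inline HTML to show the difference.

```r
library(rvest)

# Made-up table without a <tbody>, standing in for the raw page source.
sample_html <- "<html><body><table><tr><td><pre>3.14159\n26535\n</pre></td></tr></table></body></html>"
page <- read_html(sample_html)

# An XPath that includes tbody, like one copied from the browser inspector.
with_tbody <- html_node(page, xpath = "//table/tbody/tr/td/pre")

# Selecting the <pre> element directly avoids depending on tbody.
digits <- page %>% html_nodes("pre") %>% html_text()
digits <- gsub("\\s", "", digits)  # drop newlines and any other whitespace
print(digits)
```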
I am using the rvest package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape that second page, move onto the third page, etc.
The following line gives an error:
html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to
##
## ./2/
## Error in parseURI(u) : cannot parse URI
##
## ./2/
Inspecting the HTML shows there is some extra cruft around the "./2/" that rvest apparently doesn't like:
html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## Next
.Last.value %>% html_attrs()
## href
## "\n \n ./2/ "
Question 1:
How can I get rvest::follow_link to treat this link correctly like my browser does? (I could manually grab the "Next" link and clean it up with regex, but prefer to take advantage of the automation provided with rvest.)
At the end of the follow_link code, it calls jump_to. So I tried the following:
html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## <session> http://www.radiolab.org/series/2/
## Status: 404
## Type: text/html; charset=utf-8
## Size: 10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found
Digging into the code, it looks like the issue is with XML::getRelativeURL, which uses dirname to strip off the last part of the original path ("/podcasts"):
XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"
XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"
Question 2:
How can I get rvest::jump_to and XML::getRelativeURL to correctly handle relative paths?
Since this problem still seems to occur with RadioLab.com, your best solution is to create a custom function to handle this edge case. If you're only worried about this site - and this particular error - then you can write something like this:
library(rvest)
follow_next <- function(session, text = "Next", ...) {
  link <- html_node(session, xpath = sprintf("//*[text()[contains(.,'%s')]]", text))
  url <- html_attr(link, "href")
  url <- trimws(url)
  url <- gsub("^\\.{1}/", "", url)
  message("Navigating to ", url)
  jump_to(session, url, ...)
}
That would allow you to write code like this:
html_session("http://www.radiolab.org/series/podcasts") %>%
  follow_next()
#> Navigating to 2/
#> <session> http://www.radiolab.org/series/podcasts/2/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 61261
This is not per se an error - the URL on RadioLab is malformed, and failing to parse a malformed URL is not a bug. If you want to be liberal in how you handle the issue you need to manually work around it.
Note that you could also use RSelenium to launch an actual browser (e.g. Chrome) and have that perform the URL parsing for you.
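The cleanup step can be looked at in isolation with base R. This is only a sketch: the raw href is copied from the html_attrs() output in the question, and building the final URL with paste0 assumes the base URL ends in a slash.

```r
# href exactly as returned by html_attrs() in the question
raw_href <- "\n \n ./2/ "

# Strip surrounding whitespace, then the leading "./"
clean_href <- sub("^\\./", "", trimws(raw_href))

# Assumed base with a trailing slash, so the relative part can simply be appended
base_url <- "http://www.radiolab.org/series/podcasts/"
next_url <- paste0(base_url, clean_href)
print(next_url)
```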
I would like to create a Corpus for the collection of downloaded HTML files, and then read them in R for future text mining.
Essentially, this is what I want to do:
Create a Corpus from multiple html files.
I tried to use DirSource:
library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))
but it returns "invalid directory parameters"
Read in html files from the Corpus all at once.
Not sure how to do it.
Parse them, convert them to plain text, remove tags.
Many people suggested using XML, however, I didn't find a way to process multiple files. They are all for one single file.
Thanks very much.
This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO) and I've made a corpus out of them, then a document term matrix and then done a few trivial text mining tasks.
# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))
# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))
# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs) # apply all cleaning functions to the corpus
a <- tm_map(a, PlainTextDocument) # convert the results back to plain text documents
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,]
inspect(a.dtm2)
# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3)) # as.matrix returns the full matrix; newer tm versions of inspect() only return a sample
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D" in R 3.1.0
plot(fit)
# just for fun...
library(wordcloud)
library(RColorBrewer)
m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
This will correct the error.
b <- Corpus(a, # pass the DirSource object a directly instead of DirSource(a)
            readerControl = list(language = "eng", reader = readPlain))
But I think you need an XML reader to read your HTML files. Something like:
r <- Corpus(DirSource('c:/test'),
            readerControl = list(reader = readXML(spec = spec, doc = PlainTextDocument())))
But you need to supply the spec argument, which depends on your file structure.
See, for example, readReut21578XML; it is a good example of an XML/HTML parser.
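For reference, a minimal sketch of the corrected DirSource call on a throwaway folder. The folder and file contents are made up for illustration, and the documents are read as plain text, so they would still contain their HTML tags.

```r
library(tm)

# Throwaway folder with two tiny HTML files (illustrative only).
folder <- file.path(tempdir(), "corpus_demo")
dir.create(folder, showWarnings = FALSE)
writeLines("<html><body><p>First document.</p></body></html>",
           file.path(folder, "a.html"))
writeLines("<html><body><p>Second document.</p></body></html>",
           file.path(folder, "b.htm"))

# DirSource takes the directory path once; the pattern keeps .htm/.html files only.
src <- DirSource(folder, pattern = "\\.html?$")
corp <- VCorpus(src, readerControl = list(reader = readPlain, language = "en"))
print(corp)
```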
To read all the html files into an R object you can use
# Set variables
folder <- 'C:/test'
extension <- '\\.html?$' # regular expression matching .htm and .html
# Get the names of the .htm/.html files in the folder
files <- list.files(path=folder, pattern=extension)
# Read all the files into a list
htmls <- lapply(X=files,
                FUN=function(file){
                  .con <- file(description=paste(folder, file, sep='/'))
                  .html <- readLines(.con)
                  close(.con)
                  .html
                })
names(htmls) <- files # label each element with its file name
That will give you a list, and each element is the HTML content of each file.
I'll post later on parsing it, I'm in a hurry.
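In the meantime, one possible way to handle the parsing step for several files at once is sketched below with xml2. The temporary folder and sample files are assumptions for illustration, and this simple approach would also pick up text inside script or style tags if the body contained any.

```r
library(xml2)

# Throwaway folder with two tiny HTML files (illustrative only).
folder <- file.path(tempdir(), "html_demo")
dir.create(folder, showWarnings = FALSE)
writeLines("<html><body><h1>Title one</h1><p>First document.</p></body></html>",
           file.path(folder, "a.html"))
writeLines("<html><body><p>Second document.</p></body></html>",
           file.path(folder, "b.htm"))

files <- list.files(folder, pattern = "\\.html?$", full.names = TRUE)

# For each file: parse it, collect the text nodes under <body>,
# join them with spaces and squeeze runs of whitespace.
texts <- vapply(files, function(f) {
  nodes <- xml_find_all(read_html(f), "//body//text()")
  trimws(gsub("\\s+", " ", paste(xml_text(nodes), collapse = " ")))
}, character(1))
names(texts) <- basename(files)
print(texts)
```

The resulting character vector could then be passed to Corpus(VectorSource(texts)), as in the longer answer above.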
I found the package boilerpipeR particularly useful to extract only the "core" text of an html page.