My objective
I'm attempting to use R to scrape text from a web page: https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150. For the purposes of this question, my goal is to access the header text that contains the station number ("Bridgeport, CT - Station ID: 8467150"). Below is a screenshot of the page. I've highlighted the text that I'm trying to verify is present, and the text is also highlighted in the inspect element pane.
My old approach was to access the full text of the site with readLines(). A recent update to the website has made the text more difficult to access, and the station name/number is no longer visible to readLines():
url <- "https://tidesandcurrents.noaa.gov/stationhome.html?id=8467150"
stn <- "8467150"
webpage <- readLines(url, warn = FALSE)
### grep indicates that the station number is not present in the scraped text
grep(x = webpage, pattern = stn, value = TRUE)
Potential solutions
I am therefore looking for a new way to access my target text. I have tried httr, but the full HTML text still isn't included in what I scrape from the page. The XML and rvest packages also seem promising, but I am not sure how to identify the relevant CSS selector or XPath expression.
### an attempt using httr
hDat <- httr::RETRY("GET", url, times = 10)
txt <- httr::content(hDat, "text")
### grep indicates that the station number is still not present
grep(x = txt, pattern = stn, value = TRUE)
### a partial attempt using XML
h <- xml2::read_html(url)
h2 <- XML::htmlTreeParse(as.character(h), useInternalNodes = TRUE, asText = TRUE)
### this may end up working, but I'm not sure how to identify the correct path
html.parse <- XML::xpathApply(h2, path = "div.span8", XML::xmlValue)
Regardless of the approach, I would welcome any suggestions that can help me access the header text containing the station name/number.
Unless you use Selenium, it will be very hard.
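If you really did want to render the page, a rough, untested sketch with the RSelenium package (this assumes a locally installed browser and Selenium driver) might look like this, reusing url and stn from the question:
# sketch only: drive a real browser so the JavaScript-built header exists in the HTML
library(RSelenium)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate(url)
Sys.sleep(5)                               # give the page time to render
rendered <- remDr$getPageSource()[[1]]     # full rendered HTML as one string
grepl(stn, rendered)                       # the station ID should now be present
remDr$close(); rD$server$stop()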
NOAA encourages you to use their free RESTful JSON APIs, and it goes to great lengths to discourage HTML scraping.
That said, the following code will get what you want from NOAA's station JSON into a data frame.
library(tidyverse)
library(jsonlite)
j1 <- fromJSON(txt = 'https://api.tidesandcurrents.noaa.gov/mdapi/prod/webapi/stations/8467150.json', simplifyDataFrame = TRUE, flatten = TRUE)
j1$stations %>% as_tibble() %>% select(name, state, id)
Results
# A tibble: 1 x 3
  name       state id
  <chr>      <chr> <chr>
1 Bridgeport CT    8467150
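Since the original goal was the header text itself, you can paste it back together from those fields (a small follow-up using the same j1 object):
# rebuild the header string from the JSON fields
with(j1$stations, paste0(name, ", ", state, " - Station ID: ", id))
## "Bridgeport, CT - Station ID: 8467150"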
Related
I am currently working on reading an HTML file of this web page into R and processing it to extract useful data into a new data frame.
Visual inspection of the page source shows that the lines containing data values all start with '<td>'. So here is my code so far:
thepage<-readLines('https://www.worldometers.info/world-population/population-by-country/')
dataline <- grep('<td>', thepage)
dataline
This returns:
11
Which tells me all the data is in line 11. So I did this:
data <- thepage[11]
datalines <- grep('<td>', data)
datalines
This returns:
1
Which isn't helpful at all, as "data" is still one massive line. How do I split this massive line into multiple lines? My preferred data frame would look something like this:
TIA.
How about the following?
library(tidyverse)
library(rvest)
url <- 'https://www.worldometers.info/world-population/population-by-country/'
pg <- xml2::read_html(url) %>%
  rvest::html_table() %>%
  .[[1]]
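html_table() returns a list of every table on the page, and .[[1]] keeps the first one, which is the population table. A quick sanity check on the result:
# confirm the table came through as a data frame
dim(pg)
head(pg, 3)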
I am trying to extract the digits of pi from a website using the rvest package in R, but it keeps giving me an xml error.
library(rvest)
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
pitable <- pisite %>%
html_node(xpath = "/html/body/table[2]/tbody/tr/td[1]/pre/text()[1]")
I keep getting the result:
{xml_missing}
NA
Note that I copied the value used for the XPath from the Chrome inspection tool, although it does look a bit different from the XPaths I have used before.
Not sure what to change. Guessing it is something really simple. Any ideas?
Maybe this could help:
library(rvest)
library(dplyr)
# here the site
pisite <- read_html("http://www.eveandersson.com/pi/digits/1000000")
# here you catch what you need
pi <- pisite %>% html_nodes("pre") %>% html_text()
# here you replace the \n with nothing, to keep the numbers only
pi <- gsub("\n", "", pi)
pi
[1] "3.1415926535897932384626433832795028841971 ...and so on..."
The official Premier League website provides tables with various statistics for the league's teams over seasons (e.g. this one). I used the function readHTMLTable from the XML package to retrieve those tables. However, I noticed that the function cannot read the tables for May, while for other months it works well. Here is an example:
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
march2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
march.df <- readHTMLTable(march2014.url, which = 1)
march.df[complete.cases(march.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function cannot retrieve the data for May.
Can someone please explain why this happens and how it can be fixed?
EDIT after @zyurnaidi's answer:
Below is the code that can do the job without manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file(url)
raw <- readLines(con)
close(con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## it seems that this part of the page source messes things up
raw <- gsub(pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable(doc = raw, which = 1)
df[complete.cases(df), ] ## correct table
OK, there are a few hints that point to the problem here:
1. The issue happens consistently in May, the last month of each season, which means there should be something unique about this particular case.
2. Direct parsing (htmlParse, from both the link and a downloaded file) produces a truncated document: the table and the HTML file are suddenly closed right after the first team in the table is reported.
The parsed data always differs from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the HTML file itself, I found that there are (unencoded?) character issues there. My guess is that this is caused by the little trophy icons shown after the team names.
Anyway, to solve this issue you need to take out these bad characters. Instead of editing the downloaded HTML files, my suggestion is:
1. View the page source of the EPL URL for May's league table
2. Copy it all, paste it into a text editor, and save it as an HTML file
3. You can now use either htmlParse or readHTMLTable on that file
There might be a better way to automate this, but I hope it helps.
I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the "headlines" and collect them in Excel. Unfortunately I cannot find and select the HTML nodes corresponding to the headlines once I download the source file to R.
Let me show the problem with an example.
I started with
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.cvs"
download.file(url = source, destfile = file)
x = scan(file, what = "", sep = "\n")
producing the file finance_file.csv and, most importantly, the character vector x.
Using x, I would like to collect the headlines and write them into a column of a second file, called headlines.csv.
My problem now is the following: if I pick any headline I can find it in the HTML code of the web page itself, but I lose track of it in x. Therefore, I do not know how to extract it.
For the extraction I was thinking of
x = x[grep("some string of characters to do the job", x)]
but I am no expert in web scraping.
Any ideas/suggestions?
I thank you very much!
You can use the XML package and write the XPath query needed to extract the headlines.
Since the web page looks like:
...
<ul class="newsheadlines"/>
<ul>
<li>First headline</li>
...
you get the following query.
library(XML)
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
d <- htmlParse(source)
xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
free(d)
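If you prefer rvest, roughly the same extraction can be written as below and saved to the CSV file mentioned in the question; the selector is taken from the snippet above, so treat it as an assumption about the page layout (untested sketch).
# rvest sketch of the same extraction, then one headline per row in a CSV
library(rvest)
d <- read_html("http://finance.yahoo.com/q/h?s=AAPL+Headlines")
headlines <- d %>%
  html_nodes(xpath = "//ul[contains(@class,'newsheadlines')]/following::ul/li/a") %>%
  html_text()
write.csv(data.frame(headline = headlines), "headlines.csv", row.names = FALSE)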
I would like to create a Corpus for the collection of downloaded HTML files, and then read them in R for future text mining.
Essentially, this is what I want to do:
1. Create a Corpus from multiple HTML files.
I tried to use DirSource:
library(tm)
a <- DirSource("C:/test")
b <- Corpus(DirSource(a), readerControl = list(language = "eng", reader = readPlain))
but it returns "invalid directory parameters"
2. Read the HTML files from the Corpus all at once.
Not sure how to do it.
3. Parse them, convert them to plain text, and remove tags.
Many people suggested using XML; however, I didn't find a way to process multiple files. The examples are all for one single file.
Thanks very much.
This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO) and I've made a corpus out of them, then a document term matrix and then done a few trivial text mining tasks.
# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))
# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))
# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a <- tm_map(a, PlainTextDocument)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,]
inspect(a.dtm2)
# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit)
# just for fun...
library(wordcloud)
library(RColorBrewer)
m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
This will correct the error.
b <- Corpus(a, ## I changed DirSource(a) to a
            readerControl = list(language = "eng", reader = readPlain))
But I think that to read your HTML you need to use an XML reader. Something like:
r <- Corpus(DirSource('c:/test'),
            readerControl = list(reader = readXML), spec)
But you need to supply the spec argument, which depends on your file structure.
See, for example, readReut21578XML; it is a good example of an XML/HTML reader.
To read all the HTML files into an R object you can use:
# Set variables
folder <- 'C:/test'
extension <- '.htm'
# Get the names of *.html files in the folder
files <- list.files(path=folder, pattern=extension)
# Read all the files into a list
htmls <- lapply(X = files,
                FUN = function(file){
                  .con <- file(description = paste(folder, file, sep = '/'))
                  .html <- readLines(.con)
                  close(.con)
                  names(.html) <- file
                  .html
                })
That will give you a list, and each element is the HTML content of each file.
I'll post later on parsing it; I'm in a hurry.
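In the meantime, one way to continue is sketched below: it uses the XML package to strip the tags from each file's HTML and keep only the text (a rough, untested sketch building on the htmls list above).
# sketch: convert each file's HTML lines to plain text with the XML package
library(XML)
texts <- lapply(htmls, function(lines) {
  doc <- htmlParse(paste(lines, collapse = "\n"), asText = TRUE)
  txt <- xpathSApply(doc, "//body//text()", xmlValue)
  free(doc)
  paste(txt, collapse = " ")
})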
I found the package boilerpipeR particularly useful for extracting only the "core" text of an HTML page.