R web scraping function getPageNumber error - html

I am building a web scraper and trying to understand why my getPageNumber function does not work. The function worked last night, but tonight I have not been able to get the right output.
library(rvest)
library(RCurl)
library(XML)
library(stringr)
getPageNumber <- function(URL) {
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[,2] %>%
    as.integer()
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  return(pageNumber)
}
getPageNumber("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2010%20TO%2012%2F31%2F2010&fl_SiteID=5275&page=")
The output I am getting is NA, when it should be a numeric value.

Do this to troubleshoot your problems:
getPageNumber <- function(URL) {
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[,2] %>%
    as.integer()
  browser() ## <-- inject this here to stop when you get to this point.
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  return(pageNumber)
}
getPageNumber("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2010%20TO%2012%2F31%2F2010&fl_SiteID=5275&page=")
This is what my R session looks like (I typed total_results and results_per_page there myself to see what they contained):
> getPageNumber("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2010%20TO%2012%2F31%2F2010&fl_SiteID=5275&page=")
Called from: getPageNumber("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2010%20TO%2012%2F31%2F2010&fl_SiteID=5275&page=")
Browse[1]> debug at #10: pageNumber <- tryCatch(ceiling(total_results/results_per_page),
error = function(e) {
1
})
Browse[2]> total_results
[1] 176
Browse[2]> results_per_page
[1] 20
Browse[2]>
debug at #11: return(pageNumber)
Browse[2]>
[1] 9
>
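Here the arithmetic works out: ceiling(176 / 20) = ceiling(8.8) = 9, which is exactly the page count the function should return.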
Other things you can do while you are inside the running function with browser():
put the html somewhere you can inspect it later: cat( toString(parsedDocument), file="~/foo.html")
look at: parsedDocument %>% toString() %>% str_match('num_results":"(.*?)"')
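If the str_match() output looks unfamiliar: it returns a character matrix whose first column is the full match and whose second column is the first capture group, which is why the function pipes into .[,2]. A toy example (the input string here is made up):
str_match('{"num_results":"176"}', 'num_results":"(.*?)"')
# a 1-row matrix: column 1 holds the whole match, column 2 holds the captured "176"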
If you still experience NA as output, try changing your function code to this:
getPageNumber <- function(URL) {
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[,2] %>%
    as.integer()
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  if( is.na(pageNumber) ) {
    cat( "PAGENUMBER IS NA!\n" )
    cat( "total_results is: ", total_results, "\n" )
    cat( "results_per_page is: ", results_per_page, "\n" )
    cat( toString(parsedDocument), file="~/foo.html")
    cat( "The HTML is stored here: ", normalizePath("~/foo.html") )
    cat( "Open it in a text editor and investigate why it went wrong\n" )
  }
  return(pageNumber)
}
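One detail worth spelling out: tryCatch() never fires in the NA case, because dividing NA by a number is not an error in R, it just silently propagates NA, which is why the explicit is.na() check above is needed. For example:
ceiling(NA_integer_ / 20)
# [1] NA  -- no error is raised, so the error handler never runs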

Related

R program has stopped working possibly because of Rdata file or R package

I built a web scraper that was working fine until it randomly stopped working. I thought it was because of my Rdata files, but I deleted the ones that I found. I now get an error in my first function because I can't access the URL correctly anymore.
#Getting the number of Page
getPageNumber <- function(URL) {
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[,2] %>%
    as.integer()
  browser()
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  return(pageNumber)
}
getPageNumber("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2010%20TO%2012%2F31%2F2010&fl_SiteID=5275&page=")
The output of getPageNumber("academic.oup.com/dnaresearch/…) should be [1] 9, but instead I get NA.
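(Not from the original post, just a suggestion: a quick way to rule out the site blocking you is to fetch the URL with httr and check the HTTP status code before parsing.)
library(httr)
resp <- GET("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2010%20TO%2012%2F31%2F2010&fl_SiteID=5275&page=")
status_code(resp)  # 200 means the page itself is reachable; 403/503 would point to blocking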

webscraping a pdf file using R

I've been web scraping articles in R from the Oxford journals and want to grab the full text of specific articles. All articles have a PDF link, so I've been trying to pull the PDF link and scrape the entire text into a CSV. The full text should all fit into one row, but the output in the CSV file shows one article spread across 11 rows. How can I fix this issue?
The code is below:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink,endLink,sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, sep = "\n")
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")
This is how I would approach this task.
library(tidyverse)
library(rvest)
df <- data.frame(
  # you have a data.frame with a column where there are links to html research articles
  links_to_articles = c("https://doi.org/10.1093/dnares/dsm026", "https://doi.org/10.1093/dnares/dsm027")
) %>%
  # telling R to process each row separately (it is useful because functions such as read_html process one link rather than a vector of links)
  rowwise() %>%
  mutate(
    pdf_link = read_html(links_to_articles) %>%
      html_node('.article-pdfLink') %>%
      html_attr('href') %>%
      paste0("https://academic.oup.com", .),
    articles_txt = pdf_text(pdf_link) %>%
      paste0(collapse = " ")
  ) %>%
  ungroup()
# writing the csv
df %>%
  write_csv(file = "DNAresearch.csv")
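The rowwise() call is what lets read_html() and pdf_text() see one link at a time inside mutate(); as the comment above notes, those functions expect a single URL rather than a vector. A roughly equivalent purrr-based sketch, if you prefer explicit mapping (same assumptions about the page structure and selectors as above):
library(tidyverse)
library(rvest)
library(pdftools)
df <- tibble(
  links_to_articles = c("https://doi.org/10.1093/dnares/dsm026", "https://doi.org/10.1093/dnares/dsm027")
) %>%
  mutate(
    # one pdf link per article link
    pdf_link = map_chr(links_to_articles, ~ paste0("https://academic.oup.com",
                                                   read_html(.x) %>%
                                                     html_node(".article-pdfLink") %>%
                                                     html_attr("href"))),
    # pdf_text() returns one element per page, so collapse into a single string
    articles_txt = map_chr(pdf_link, ~ paste0(pdf_text(.x), collapse = " "))
  )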
Using your code, I would do:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink,endLink,sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, collapse = " ") # here I changed sep to collapse
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument) %>% str_squish(), stringsAsFactors = FALSE) # here I used str_squish to remove extra spaces
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")

stop web scraper from showing 404 error in R

Trying to web scrape journal articles from the Oxford Academic website.
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#Getting the number of Page
getPageNumber <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[,2] %>%
    as.integer()
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  return(pageNumber)
}
#Getting all articles based off of their DOI
getAllArticles <- function(URL){
  parsedDocument = read_html(URL)
  findLocationDiv <- html_nodes(parsedDocument,'div')
  foundClass <- findLocationDiv[which(html_attr(findLocationDiv, "class") == "al-citation-list")]
  ArticleDOInumber = trimws(gsub(".*10.1093/dnares/","",html_text(foundClass)))
  DOImain <- "https://doi.org/10.1093/dnares/"
  fullDOI <- paste(DOImain, ArticleDOInumber, sep = "")
  return(fullDOI)
}
#Get Title of journals
Title <- function(parsedDocument) {
  Title <- parsedDocument %>%
    html_node(".article-title-main") %>%
    html_text() %>%
    gsub("\\r\\n\\s+", "", .) %>%
    trimws(.)
  Title <- ifelse(is.na(Title), "No", Title)
  return(Title)
}
#Getting Authors of Journals
Authors <- function(parsedDocument){
  Authors <- parsedDocument %>%
    html_node("a.linked-name") %>%
    html_text()
  return(Authors)
}
#main function with input as parameter year
findURL <- function(year_chosen){
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage-1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    }
    DNAresearch <- data.frame()
    for (i in 1:firstPage) {
      URLallArticles <- getAllArticles(paste(URL, i, sep = ""))
      print(URLallArticles)
      for (j in 1:(length(URLallArticles))) {
        parsedDocument <- read_html(URLallArticles[j])
        paste(parsedDocument)
        #need work on getting Full Text
        #allData <- data.frame("Full text"=FullText(parsedDocument),stringsAsFactors = FALSE)
        #scraped items that are good
        #"Authors" = Authors(parsedDocument),"Author Affiliations" = AuthorAffil(parsedDocument),"Corresponding Authors" = CorrespondingAuthors(parsedDocument),"CoAuthor Email" = CoAuthorEmail(parsedDocument),"Publish Date" = PublicationDate(parsedDocument),"Abstract" = Abstract(parsedDocument),"Keywords" = Keywords(parsedDocument)
        allData <- data.frame("Title" = Title(parsedDocument),stringsAsFactors = FALSE)
        DNAresearch <- rbind(DNAresearch, allData)
      }
    }
    write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
  } else {
    print("The Year you provide is out of range, this journal only contain articles from 1994 to present")
  }
}
##################### Main function test
findURL(2015)
The code is showing a 404 error. I believe the problem is with getAllArticles: the last output has a bad URL. I've tried using a tryCatch to stop the error from surfacing, but I haven't been successful. It may also be my logic.
The output for the year 2015 is:
[1] "https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F2015%20TO%2012%2F31%2F2015&fl_SiteID=5275&page="
[1] "https://doi.org/10.1093/dnares/dsv028"
[2] "https://doi.org/10.1093/dnares/dsv027"
[3] "https://doi.org/10.1093/dnares/dsv029"
[4] "https://doi.org/10.1093/dnares/dsv030"
[1] "https://doi.org/10.1093/dnares/dsv022"
[1] "https://doi.org/10.1093/dnares/dsv024"
[2] "https://doi.org/10.1093/dnares/dsv025"
[3] "https://doi.org/10.1093/dnares/dsv026"
[4] "https://doi.org/10.1093/dnares/dsv021"
[5] "https://doi.org/10.1093/dnares/dsv023"
[1] "https://doi.org/10.1093/dnares/dsv020"
[2] "https://doi.org/10.1093/dnares/dsv019"
[3] "https://doi.org/10.1093/dnares/dsv017"
[1] "https://doi.org/10.1093/dnares/dsv018"
[2] "https://doi.org/10.1093/dnares/dsv015"
[1] "https://doi.org/10.1093/dnares/dsv013"
[2] "https://doi.org/10.1093/dnares/dsv016"
[3] "https://doi.org/10.1093/dnares/dsv014"
[1] "https://doi.org/10.1093/dnares/dsv012"
[2] "https://doi.org/10.1093/dnares/dsv010"
[1] "https://doi.org/10.1093/dnares/dsv011"
[2] "https://doi.org/10.1093/dnares/dsv009"
[3] "https://doi.org/10.1093/dnares/dsv005"
[1] "https://doi.org/10.1093/dnares/dsv008"
[2] "https://doi.org/10.1093/dnares/dsv007"
[3] "https://doi.org/10.1093/dnares/dsv004"
[1] "https://doi.org/10.1093/dnares/dsv006"
[2] "https://doi.org/10.1093/dnares/dsv002"
[3] "https://doi.org/10.1093/dnares/dsv003"
[4] "https://doi.org/10.1093/dnares/dsv001"
[1] "https://doi.org/10.1093/dnares/dsu047"
[2] "https://doi.org/10.1093/dnares/dsu045"
[3] "https://doi.org/10.1093/dnares/dsu046"
[1] "https://doi.org/10.1093/dnares/dsu044"
[2] "https://doi.org/10.1093/dnares/dsu041"
[3] "https://doi.org/10.1093/dnares/dsu038"
[4] "https://doi.org/10.1093/dnares/dsu040"
[5] "https://doi.org/10.1093/dnares/dsu042"
[6] "https://doi.org/10.1093/dnares/dsu043"
[1] "https://doi.org/10.1093/dnares/"
Error in open.connection(x, "rb") : HTTP error 404.
In addition: Warning message:
In for (i in seq_along(specs)) { :
Error in open.connection(x, "rb") : HTTP error 404.
A year like 1994, for example, runs without an error, but years like 2015 and 2016 have this error.
You can check for a valid URL and add an exception -
if (url.exists(URLallArticles[j])){
  parsedDocument <- read_html(URLallArticles[j])
  paste(parsedDocument)
  #need work on getting Full Text
  #allData <- data.frame("Full text"=FullText(parsedDocument),stringsAsFactors = FALSE)
  #scraped items that are good
  #"Authors" = Authors(parsedDocument),"Author Affiliations" = AuthorAffil(parsedDocument),"Corresponding Authors" = CorrespondingAuthors(parsedDocument),"CoAuthor Email" = CoAuthorEmail(parsedDocument),"Publish Date" = PublicationDate(parsedDocument),"Abstract" = Abstract(parsedDocument),"Keywords" = Keywords(parsedDocument)
  allData <- data.frame("Title" = Title(parsedDocument),stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
}
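Note that url.exists() comes from RCurl, which is already loaded at the top of the script. If you would rather not depend on it, an alternative (not from the original answer) is to wrap read_html() itself in tryCatch() and skip the article when the request fails:
parsedDocument <- tryCatch(read_html(URLallArticles[j]), error = function(e) NULL)
if (!is.null(parsedDocument)) {
  allData <- data.frame("Title" = Title(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
}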

Page number function is wrong in webscraper in R

library(rvest)
library(RCurl)
library(XML)
library(stringr)
#Getting the number of Page
getPageNumber <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  pageNumber <- parsedDocument %>%
    html_nodes(".al-pageNumber") %>%
    html_text() %>%
    as.integer()
  return(ifelse(length(pageNumber) == 0, 0, max(pageNumber)))
}
findURL <- function(year_chosen){
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    paste(firstPage)
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage-1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    } else {
      print("The Year you provide is out of range, this journal only contain articles from 1994 to present")
    }
  }
}
findURL(2018)
The above code is part of my web scraper. Mainly, what I want to do is get the pages of all the journals for a given year. I believe my getPageNumber is wrong, as I am only able to get the number of pages visible from the first page instead of all the pages available in a year.
My main function is then incorrectly grabbing the URLs based off those pages.
I would like to add that the most pages I would want to grab for a year is 5.
I would really appreciate any help! Thank you in advance.
Looks like the page count needs to be calculated as total results / number of results per page, as some pages are hidden behind the "next" control. You may need to evolve this for wrong URLs, or for URLs with no results where that is not indicated within the script tag currently being scraped (via regex). Perhaps wrap it within an outer tryCatch (a minimal sketch follows the example call below).
getPageNumber <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[,2] %>%
    as.integer()
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  return(pageNumber)
}
getPageNumber("https://academic.oup.com/dnaresearch/search-results?fl_SiteID=5275&rg_IssuePublicationDate=01%2f01%2f2018+TO+12%2f31%2f2018&page=1")

read html table in R

I'm trying to read head-to-head data from a Tennis Abstract webpage in R using the XML package.
I want the big h2h table at the bottom,
css selector: html > body > div#main > table#maintable > tbody > tr > td#stats > table#matches.tablesorter
I have tried following suggestions from scraping html into r data frame.
I believe the difficulty is caused by a table nested within another table.
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
library(RCurl)
library(XML)
webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc) # doesn't have the h2h table
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
results <- xpathSApply(pagetree, "//*/table[@class='tablesorter']/tr/td", xmlValue) # gives NULL
tables <- readHTMLTable( url,stringsAsFactors=T) # has 4 tables, not the desired one
I'm new to HTML parsing, so please bear with me.
This is not the most efficient but it will do the job.
library(rvest)
library(RSelenium)
tennis.url <- "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
checkForServer(); startServer()
remDrv <- remoteDriver()
remDrv$open()
remDrv$navigate(tennis.url)
tennis.html <- html(remDrv$getPageSource()[[1]])
remDrv$close()
H2Hs <- tennis.html %>% html_nodes(".h2hclick") %>% html_text %>% as.numeric
Opponent <- tennis.html %>% html_nodes("#matches a") %>% html_text
Country <- tennis.html %>% html_nodes("a+ span") %>% html_text %>% gsub("[^(A-Z)]", "", .)
W <- tennis.html %>% html_nodes("#matches td:nth-child(3)") %>% .[-1] %>% html_text %>% as.numeric
L <- tennis.html %>% html_nodes("#matches td:nth-child(4)") %>% .[-1] %>% html_text %>% as.numeric
Win.Prc <- tennis.html %>% html_nodes("#matches td:nth-child(5)") %>% .[-1] %>% html_text
And so on for the rest. You just need to increment the # in nth-child(#) and then create a data frame.
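For example, once the remaining columns have been scraped the same way, the final step might look like this (a sketch, assuming every vector ends up with one entry per opponent):
h2h <- data.frame(Opponent, Country, W, L, Win.Prc, H2Hs, stringsAsFactors = FALSE)
head(h2h)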