webscraping a pdf file using R - html

I've been web scraping articles in R from the Oxford journals and want to grab the full text of specific articles. All articles have a pdf link to them so I've been trying to pull the pdf link and scrape the entire text onto a csv. The full text should all fit into 1 row however the output in the csv file shows one article of 11 rows. How can I fix this issue?
The code is below:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
endLink <- parsedDocument %>%
html_node('.article-pdfLink') %>% html_attr('href')
frontLink <- "https://academic.oup.com"
#link of pdf
pdfLink <- paste(frontLink,endLink,sep = "")
#extract full text from pdfLink
pdfFullText <- pdf_text(pdfLink)
fulltext <- paste(pdfFullText, sep = "\n")
return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
parsedDocument <- read_html(DOIurl)
DNAresearch <- data.frame()
allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
DNAresearch <- rbind(DNAresearch, allData)
write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")

This is how I would approach this task.
library(tidyverse)
library(rvest)
df <- data.frame(
# you have a data.frame with a column where there are links to html research articles
links_to_articles = c("https://doi.org/10.1093/dnares/dsm026", "https://doi.org/10.1093/dnares/dsm027")
) %>%
# telling R to process each row separately (it is useful because functions such as read_html process one link rather than a vector of links)
rowwise() %>%
mutate(
pdf_link = read_html(links_to_articles) %>%
html_node('.article-pdfLink') %>%
html_attr('href') %>%
paste0("https://academic.oup.com", .),
articles_txt = pdf_text(pdf_link) %>%
paste0(collapse = " ")
) %>%
ungroup()
# writing the csv
df %>%
write_csv(file = "DNAresearch.csv")
Using your code, I would do:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
endLink <- parsedDocument %>%
html_node('.article-pdfLink') %>% html_attr('href')
frontLink <- "https://academic.oup.com"
#link of pdf
pdfLink <- paste(frontLink,endLink,sep = "")
#extract full text from pdfLink
pdfFullText <- pdf_text(pdfLink)
fulltext <- paste(pdfFullText, collapse = " ") # here I changed sep to collapse
return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
parsedDocument <- read_html(DOIurl)
DNAresearch <- data.frame()
allData <- data.frame("Full Text" = fullText(parsedDocument) %>% str_squish(), stringsAsFactors = FALSE) # here I used str_squish to remove extra spaces
DNAresearch <- rbind(DNAresearch, allData)
write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")

Related

Scraping several webpages from a website (newspaper archive) using RSelenium

I managed to scrape one page from a newspaper archive according to explanations here.
Now I am trying to automatise the process to access a list of pages by running one code.
Making a list of URLs was easy as the newspaper's archive has a similar pattern of links:
https://en.trend.az/archive/2021-XX-XX
The problem is with writing a loop to scrape such data as title, date, time, category. For simplicity, I tried to work only with article headlines from 2021-09-30 to 2021-10-02.
## Setting data frames
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
list_of_url <- character() # or str_c()
## Generating subpage list
for (i in format(seq(d1, d2, by="days"), format="%Y-%m-%d")) {
list_of_url[i] <- str_c ("https://en.trend.az", "/archive/", i)
# Launching browser
driver <- rsDriver(browser = c("firefox")) #Version 93.0 (64-bit)
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(list_of_url[i])
remDr0$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
page <- read_html(remDr$getPageSource()[[1]])
# Scraping article headlines
get_headline <- page %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
get_time <- str_sub(get_time, start= -5)
length(get_time)
}
}
In total length should have been 157+166+140=463. In fact, I did not manage to collect all data even from one page (length(get_time) = 126)
I considered that after the first set of commands in the loop, I obtained three remDr for the 3 dates specified, but they were not recognised later independently.
Because of that I tried to initiate a second loop inside the initial one before or after page <- by
for (remDr0 in remDr) {
page <- read_html(remDr0$getPageSource()[[1]])
# substituted all remDr-s below with remDr0
OR
page <- read_html(remDr$getPageSource()[[1]])
for (page0 in page)
# substituted all page-s below with page0
However, these attempts ended with different errors.
I would appreciate the help of specialists as it is my first time using R for such purposes.
Hope it will be possible to correct the existing loop that I made or maybe even suggest a shorter pathway, by making a function, for example.
Slight broadening for scraping multiple categories
library(RSelenium)
library(dplyr)
library(rvest)
Mention the date period
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt = seq(d1, d2, by="days")#contains the date sequence
#launch browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
### `get_headline` Function for newspaper headlines
get_headline = function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
headlines = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
headlines
return(headlines)
}
get_time Function for the time of publishing
get_time <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selector of time on the website
time <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-date') %>%
html_text() %>%
str_sub(start= -5)
time
return(time)
}
Numbering of all articles from one page/day
get_number <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of headlines on the website
headline <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
number <- seq(1:length(headline))
return(number)
}
Collection of all functions into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
headline <- get_headline(x)
time <- get_time(x)
headline_number <- get_number(x)
# Combine into a tibble
combined_data <- tibble(Num = headline_number,
Article = headline,
Time = time)
}
Used lapply to loop through all the dates in dt
df = lapply(dt, get_data_table)

Web scraping in R with Selenium to click new pages

I am trying to enter the different pages of this dynamic web (https://es.gofundme.com/s?q=covid). In this search engine, my intention is to enter each project. There are 12 projects per page.
Once you have entered each of these projects and have obtained the desired information (that is, if I get it), I want you to continue to the next page. That is, once you have obtained the 12 projects on page 1, you must obtain the 12 projects on page 2 and so on.
How can it be done? You help me a lot. Thanks!
This is my code:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of
library(purrr) # for 'map_chr' to get reply
library(tidyr) #extract_numeric(years)
library(stringr)
df_0<-data.frame(project=character(),
name=character(),
location=character(),
dates=character(),
objective=character(),
collected=character(),
donor=character(),
shares=character(),
follow=character(),
comments=character(),
category=character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project=project,
name= gsub("\\Organizador.*","",info[7]),
location=str_remove(location[7], "Organizador"),
dates = dates,
collected = unlist(strsplit(money, " "))[1],
objective = unlist(strsplit(money, " "))[8],
donor = popularity[1],
shares = popularity[2],
follow = popularity[3],
comments = extract_numeric(comments),
category = category[17],
stringsAsFactors = F)
The page does a POST request that you can mimic/simplify. To keep dynamic you need to first grab an api key and application id from a source js file, then pass those in the subsequent POST request.
In the following I simply extract the urls from each request. I set the querystring for the POST to have the max of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function across the page numbers, extracting urls from the POST response for each; altering the page param.
You end up with a list of urls for all the projects you can then visit to extract info from; or, potentially make xmlhttp requests to.
N.B. Code can be re-factored a little as tidy up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- c(1:num_pages-1)
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)
#David Perea, see this page for differentiation of scraping methods, including Selenium. The method proposed by QHarr is very good, but doesn't use Selenium and also requires good knowledge of HTTP.

Remove href and/or deactivate anchored links when printing a PDF from HTML using xml2 and pagedown with R

I'm using R to extract 100s of articles from a food blog and convert them to PDF. I'm 99% done, but when I print the final PDF, in-line hyperlinks have their URL written right within the text. I do not want every link rendered to text in the PDF, and believe I need to remove the href attributes from the HTML prior to printing with pagedown. Does anyone know how to do this? My example code below should get you to my pdf creation loop for the first article. The initial portions pull all of the URLs into a vector. The portion that needs this enhancementThanks.
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(downloader)
library(pagedown)
library(xml2)
library(htmltools)
#Specifying the url for desired website to be scraped
url1 <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', '1', '/')
#Reading the HTML code from the website
webpage1 <- read_html(url1)
# Pull the links for all articles on George's initial author page
dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
as_tibble() %>%
filter(str_detect(value, "([0-9]{4})")) %>%
unique() %>%
rename(link=value)
dat <- head(dat, 10)
# Pull the links for all articles on George's 2nd-89th author page
for (i in 2:89) {
url <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', i, '/')
#Reading the HTML code from the website
webpage <- read_html(url)
links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
as_tibble() %>%
filter(str_detect(value, "([0-9]{4})")) %>%
unique() %>%
rename(link=value)
dat <- bind_rows(dat, links) %>%
unique()
}
dat <- dat %>%
arrange(link)
dat <- tail(dat, 890)
articleUrls <- dat$link[1]
# Mac
# Windows
setwd("YOUR-WD")
# articleUrls <- articleUrls[1]
for(i in seq_along(articleUrls)) {
filename <- str_extract(articleUrls[i], "[^/]+(?=/$|$)")
a <- read_html(articleUrls[i])
xml_remove(a %>% xml_find_all("aside"))
xml_remove(a %>% xml_find_all("footer"))
xml_remove(a %>% xml_find_all(xpath = "//*[contains(#class, 'article-related mb20')]"))
xml_remove(a %>% xml_find_all(xpath = "//*[contains(#class, 'tags')]"))
#xml_remove(a %>% xml_find_all("head") %>% xml2::xml_find_all("script"))
xml_remove(a %>% xml2::xml_find_all("//script"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'ad box')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'newsletter-signup')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'article-footer')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'article-footer-sidebar')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'site-footer')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'sticky-newsletter')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'site-header')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, '.fb_iframe_widget')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, '_8f1i')]"))
xml_remove(a %>% xml_find_all("//*[contains(#class, 'newsletter-toggle')]"))
# xml_remove(a %>% xml_find_all("//*[contains(#class, 'articleBody')]"))
# xml_remove(a %>% xml_find_all("//href='([^\"]*)'"))
xml2::write_html(a, file = paste0("html/", filename, ".html"))
tryCatch(pagedown::chrome_print(input = paste0("html/", filename, ".html"),
output=paste0("pdf/", filename, ".pdf"),
format="pdf", timeout = 300, verbose=0,
wait=20), error=function(e) paste("wrong"))
}
You can see a screenshot of what I'm seeing below. The "< >" portion containing the URL should not display. It should only say "King's Brew".
Try something like this:
library(dplyr)
library(xml2)
allHref <- a %>% xml_find_all("//a")
for (l in allHref) {
cntnt <- l %>% xml_text(trim = T)
xml_replace(l, read_xml(paste0("<span>", cntnt, "</span>")))
}
First of all we find all links. Then, for each one of them we extract its content and replace the link itself with this content.

Scraping table with multiple headers in R using any package? (XML, rCurl, rlist htmltab, rvest etc)

I am attempting to scrape this table
http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1
Here are all my attempts. None of them get even close to extracting any information. Am i missing something?
library("rvest")
library("tidyverse")
# METHOD 1
url <- "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1"
data <- url %>%
read_html() %>%
html_nodes(xpath='//*[#id="t1"]/tbody/tr[1]') %>%
html_table()
data <- data[[1]]
# METHOD 2
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
# METHOD 3
library(htmltab)
tab <- htmltab("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",
which = '//*[#id="t1"]/tbody/tr[4]',
header = '//*[#id="t1"]/tbody/tr[3]',
rm_nodata_cols = TRUE)
# METHOD 4
website <-read_html("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1")
scraped <- website %>%
html_nodes("table") %>%
.[(2)] %>%
html_table(fill = TRUE) %>%
`[[`(1)
# METHOD 5
getHrefs <- function(node, encoding)
if (!is.null(xmlChildren(node)$a)) {
paste(xpathSApply(node, './a', xmlGetAttr, "href"), collapse = ",")
} else {
return(xmlValue(xmlChildren(node)$text))
}
data <- ( readHTMLTable("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1", which = 1, elFun = getHrefs) )
The expected results should be the 12 colnames in the table & the data below

read html table in R

Im trying to read head2head data from tennis abstract webpage in R using package XML.
I want the big h2h table at the bottom,
css selector: html > body > div#main > table#maintable > tbody > tr > td#stats > table#matches.tablesorter
I have tried following suggestions from scraping html into r data frame.
I believe the difficulty is caused by table within table
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
library(RCurl)
library(XML)
webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc) #doesnt have the h2h table
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
results <- xpathSApply(pagetree, "//*/table[#class='tablesorter']/tr/td", xmlValue) # gives NULL
tables <- readHTMLTable( url,stringsAsFactors=T) # has 4 tables, not the desired one
I'm new to html parsing, so please bear with.
This is not the most efficient but it will do the job.
library(rvest)
library(RSelenium)
tennis.url <- "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
checkForServer(); startServer()
remDrv <- remoteDriver()
remDrv$open()
remDrv$navigate(tennis.url)
tennis.html <- html(remDrv$getPageSource()[[1]])
remDrv$close()
H2Hs <- tennis.html %>% html_nodes(".h2hclick") %>% html_text %>% as.numeric
Opponent <- tennis.html %>% html_nodes("#matches a") %>% html_text
Country <- tennis.html %>% html_nodes("a+ span") %>% html_text %>% gsub("[^(A-Z)]", "", .)
W <- tennis.html %>% html_nodes("#matches td:nth-child(3)") %>% .[-1] %>% html_text %>% as.numeric
L <- tennis.html %>% html_nodes("#matches td:nth-child(4)") %>% .[-1] %>% html_text %>% as.numeric
Win.Prc <- tennis.html %>% html_nodes("#matches td:nth-child(5)") %>% .[-1] %>% html_text
And so on for the rest. You just need to increment the # in nth-child(#) and then create a data frame.