R: Webscraping: XML content does not seem to be XML: Using HTMLParse

I am trying to web-scrape data over numerous years (each year is a separate web page). My 2019 data works exactly like I want it to, but I get an error when I try to prep my 2016 data the same way as my 2019 data.
url19 <- 'https://www.pro-football-reference.com/draft/2019-combine.htm'
get_pfr_HTML_file19 <- GET(url19)
combine.parsed19 <- htmlParse(get_pfr_HTML_file19)
page.tables19 <- readHTMLTable(combine.parsed19, stringsAsFactors = FALSE)
data19 <- data.frame(page.tables19[1])
cleanData19 <- data19[!rowSums(data19 == "")> 0,]
cleanData19 <- filter(cleanData19, cleanData19$combine.Pos == 'CB' | cleanData19$combine.Pos == 'S')
cleanData19 is exactly what I want, but when I try to run it with 2016 data, I get the error: XML content does not seem to be XML: ''
url16 <- 'https://www.pro-football-reference.com/draft/2016-combine.htm'
get_pfr_HTML_file16 <- GET(url16)
combine.parsed16 <- htmlParse(get_pfr_HTML_file16)
page.tables16 <- readHTMLTable(combine.parsed16, stringsAsFactors = FALSE)
data16 <- data.frame(page.tables16[1])
cleanData16 <- data16[!rowSums(data16 == "")> 0,]
cleanData16 <- filter(cleanData16, cleanData16$combine.Pos == 'CB' | cleanData16$combine.Pos == 'S')
I get the error when I try to run combine.parsed16 <- htmlParse(get_pfr_HTML_file16)

I am not 100% sure of your desired output, since you did not include your library calls in your example. Anyway, using this code you can get the table:
library(rvest)
library(dplyr)
url <- 'https://www.pro-football-reference.com/draft/2016-combine.htm'
read_html(url) %>%
  html_nodes(".stats_table") %>%
  html_table() %>%
  as.data.frame() %>%
  filter(Pos == 'CB' | Pos == "S")
Several years at once:
library(rvest)
library(magrittr)
library(dplyr)
library(purrr)
years <- 2013:2019
urls <- paste0(
  'https://www.pro-football-reference.com/draft/',
  years,
  '-combine.htm')
map(
  urls,
  ~ read_html(.x) %>%
    html_nodes(".stats_table") %>%
    html_table() %>%
    as.data.frame()
) %>%
  set_names(years) %>%
  bind_rows(.id = "year") %>%
  filter(Pos == 'CB' | Pos == "S")
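As a side note on the original error: one thing worth trying (this is an assumption about the cause, not a confirmed diagnosis) is to check the HTTP status of the 2016 request and hand htmlParse() the response body as text explicitly, rather than the httr response object:
library(httr)
library(XML)
url16 <- 'https://www.pro-football-reference.com/draft/2016-combine.htm'
resp16 <- GET(url16)
# a non-200 status or an empty body would explain "XML content does not seem to be XML: ''"
status_code(resp16)
# pass the body text explicitly instead of the response object
combine.parsed16 <- htmlParse(content(resp16, as = "text"), asText = TRUE)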

Related

Tabulizing data off website PDFs (w/ various formats) assigning each event values according to HTML link titles

I've been trying to automate the process of manually typing down the data from the ATF's trace data site (see "URL"), but it's been a fairly big pain, as I've only been able to collect each URL link that holds the PDFs and assign it to its correct State/Territory and Year. The newer files (2017-2019) have data tables that are relatively easy to pull data from compared to 2014-2016; e.g., I'm referring to page 10 of the Trace Data report 2019 and the Trace Data report 2014.
It's the latter that I'm having the most trouble with, as the data is not stored in something that looks like a table but is arranged around a pie chart. There have been some promising R packages such as "pdftools" and "tesseract", but I'm very much an amateur when it comes to troubleshooting advanced analytical packages such as these.
It's my guess that I'm still a ways off from where I want to be with the final product, as I would also need to mine the bottom text of page 10 to find how many "other" weapons were traced to a city, as well as the number of weapons where a recovery city couldn't be determined. But if anyone has suggestions on what I could try next, or on how to make the working code more efficient, I'd appreciate it.
URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))
library(xml2)
library(tidyverse)
library(rvest)
library(stringr)
x <- c('\t\t\t<div>([^<]*)</div>','\t\t</tr><tr><td>([^<]*)</td>','\t\t\t<td>([^<]*)</td>')
r <- read_html(URL) %>% html_nodes("a") %>% map_df(~{
  Link <- .x %>% html_attr("href")
  Title <- .x %>% html_text()
  data_frame(Link, Title)
}) %>%
  dplyr::filter(grepl('node', Link, fixed = T))
r <- as.data.frame(r)
x <- c('<ul><li>([^<]*)</li>','\t<li>([^<]*)</li>')
states <- c('Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','District of Columbia','Florida','Georgia','Guam & Northern Mariana Islands','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Puerto Rico','Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming')
s <- list()
for (i in 1:nrow(r)) {
  s[[i]] <- read_html(r$Link[i]) %>% html_nodes("a") %>% map_df(~{
    Link <- .x %>% html_attr("href")
    Title <- .x %>% html_text()
    data_frame(Link, Title)
  }) %>%
    mutate(Year = r$Title[i]) %>%
    dplyr::filter(Title %in% states | str_detect(Title, "Virgin Islands")) %>%
    dplyr::filter(grepl('download', Link, fixed = T))
  trace_list <- do.call(rbind, s)
}
names(trace_list)[3] <- "Year"
Progress so far...
library(pdftools)
pdf_file <- "https://www.atf.gov/file/146951/download"
text <- pdf_text(pdf_file)
cat(text[10])
vtext <- as.list(str_split(text[10],"\n"))
x <- data.frame(matrix(unlist(vtext), nrow=length(vtext), byrow=TRUE),stringsAsFactors=FALSE)
x1 <- pivot_longer(x, cols = 1:length(x),names_to="X1",values_to="X2")
x1$X2 <- trimws(x1$X2)
x1 <- x1[c(8,12),]
x1[1,2] <- sub(" ","_",x1[1,2],fixed=T)
library(splitstackshape)
x1 <- as.data.frame(cSplit(x1, 'X2', sep=" ", type.convert=FALSE))
x1 <- x1[,c(2:length(x1))]
colnames(x1) <- x1[1,]
x1 <- x1[-1, ]
x2 <- pivot_longer(x1, cols = 1:length(x1),names_to="city",values_to="count")
Mixing both pdftools and tesseract...
library(tesseract)
img_file <- pdftools::pdf_convert("https://www.atf.gov/file/89621/download", format = 'tiff', dpi = 400)
text <- ocr(img_file)
strsplit(text[10],"\n")
Expected output:
year   state   city          count
2019   AL      Birmingham    100
2018   CA      Los Angeles   200
2017   CA      None          30
2017   CA      Other         400
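On the point about mining the text below the pie chart for the "other" and "not determined" counts, here is a rough sketch with pdftools and stringr; the regex patterns are hypothetical and would need to be adjusted to the exact wording on page 10:
library(pdftools)
library(stringr)
page10 <- pdf_text("https://www.atf.gov/file/146951/download")[10]
# hypothetical patterns -- tune them to the actual phrasing in the report
other_count <- str_match(page10, regex("([0-9,]+)\\s+other", ignore_case = TRUE))[, 2]
none_count  <- str_match(page10, regex("([0-9,]+)[^.]*could not be determined", ignore_case = TRUE))[, 2]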

webscraping a pdf file using R

I've been web scraping articles in R from the Oxford journals and want to grab the full text of specific articles. All articles have a PDF link, so I've been trying to pull the PDF link and scrape the entire text into a CSV. The full text should all fit into 1 row; however, the output in the CSV file shows one article spread over 11 rows. How can I fix this issue?
The code is below:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink, endLink, sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, sep = "\n")
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")
This is how I would approach this task.
library(tidyverse)
library(rvest)
df <- data.frame(
  # you have a data.frame with a column where there are links to html research articles
  links_to_articles = c("https://doi.org/10.1093/dnares/dsm026", "https://doi.org/10.1093/dnares/dsm027")
) %>%
  # telling R to process each row separately (it is useful because functions such as read_html process one link rather than a vector of links)
  rowwise() %>%
  mutate(
    pdf_link = read_html(links_to_articles) %>%
      html_node('.article-pdfLink') %>%
      html_attr('href') %>%
      paste0("https://academic.oup.com", .),
    articles_txt = pdf_text(pdf_link) %>%
      paste0(collapse = " ")
  ) %>%
  ungroup()
# writing the csv
df %>%
write_csv(file = "DNAresearch.csv")
Using your code, I would do:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink, endLink, sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, collapse = " ") # here I changed sep to collapse
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument) %>% str_squish(), stringsAsFactors = FALSE) # here I used str_squish to remove extra spaces
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")

Page number function is wrong in webscraper in R

library(rvest)
library(RCurl)
library(XML)
library(stringr)
#Getting the number of Page
getPageNumber <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  pageNumber <- parsedDocument %>%
    html_nodes(".al-pageNumber") %>%
    html_text() %>%
    as.integer()
  return(ifelse(length(pageNumber) == 0, 0, max(pageNumber)))
}
findURL <- function(year_chosen){
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    paste(firstPage)
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage - 1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    } else {
      print("The Year you provide is out of range, this journal only contain articles from 1994 to present")
    }
  }
}
findURL(2018)
The above code is part of my web scrape. Mainly, what I want to do is get all the result pages for the journal given the parameter year. I believe my getPageNumber is wrong, as I am only able to get the number of pages visible from the first page instead of all the pages available for that year.
My main function then grabs the wrong URLs because it relies on that page count.
I would like to add that the most pages I would want to grab for a year is 5.
I would really appreciate any help! Thank you in advance.
Looks like the page count needs to be calculated as total results divided by the number of results per page, as some pages are hidden behind the Next button. You may need to extend this for bad URLs, or for URLs with no results where that isn't indicated within the script tag currently being scraped (via regex). Perhaps wrap it within an outer tryCatch.
getPageNumber <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  results_per_page <- length(parsedDocument %>% html_nodes(".sr-list"))
  total_results <- parsedDocument %>%
    toString() %>%
    str_match(., 'num_results":"(.*?)"') %>%
    .[, 2] %>%
    as.integer()
  pageNumber <- tryCatch(ceiling(total_results / results_per_page), error = function(e) {1})
  return(pageNumber)
}
getPageNumber("https://academic.oup.com/dnaresearch/search-results?fl_SiteID=5275&rg_IssuePublicationDate=01%2f01%2f2018+TO+12%2f31%2f2018&page=1")

How can I get the data of the second web page?

I am trying to get the data from a website in R using the rvest package: https://etfdb.com/stock/AAPL/
But no matter what I try, I can only get the table on the first page. Could anybody help me with this? Thank you so much.
See the code below. tb1 and tb2 are the same! That's weird.
url1 <- "https://etfdb.com/stock/AAPL/#etfs&sort_name=weighting&sort_order=desc&page=1"
url2 <- "https://etfdb.com/stock/AAPL/#etfs&sort_name=weighting&sort_order=desc&page=2"
tbs1 <- rvest::html_nodes(xml2::read_html(url1), "table")
tbs2 <- rvest::html_nodes(xml2::read_html(url2), "table")
tb1 <- rvest::html_table(tbs1[1])[[1]]
tb2 <- rvest::html_table(tbs2[1])[[1]]
This website sends GET requests to fetch JSON data that populates the table. After some attempts, this is the code I came up with to deal with the JSON data (not beautiful code, but it works):
library(rjson)
library(rvest)
library(writexl)
lastpage <- 9;
df <- data.frame();
for (i in 1:lastpage){
  x <- fromJSON(file = paste("https://etfdb.com/data_set/?tm=40274&cond={%22by_stock%22:25}&no_null_sort=&count_by_id=true&limit=25&sort=weighting&order=desc&limit=25&offset=", 25 * (i-1), sep = ""));
  x <- x[2][[1]];
  pg_df <- data.frame(matrix(unlist(x), nrow=length(x), byrow=T), stringsAsFactors=FALSE);
  df <- rbind(df, pg_df);
}
for (i in 1:nrow(df)){
  df$X1[i] <- read_html(df$X1[i]) %>% html_text(trim = TRUE);
  df$X3[i] <- read_html(df$X3[i]) %>% html_text(trim = TRUE);
  df$X5[i] <- read_html(df$X5[i]) %>% html_text(trim = TRUE);
}
df <- data.frame(df$X1, df$X3, df$X5, df$X7, df$X9);
colnames(df) <- c("Ticker", "ETF", "ETFdb.com Category", "Expense Ratio", "Weighting");
write_xlsx(
df,
path = "stock.xlsx",
col_names = TRUE,
format_headers = TRUE,
use_zip64 = FALSE
)
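Incidentally, the reason tb1 and tb2 in the question are identical is that everything after # in those URLs is a client-side fragment that is never sent to the server, so both requests fetch exactly the same page. A quick way to see this (a small sketch using httr::parse_url):
library(httr)
url1 <- "https://etfdb.com/stock/AAPL/#etfs&sort_name=weighting&sort_order=desc&page=1"
url2 <- "https://etfdb.com/stock/AAPL/#etfs&sort_name=weighting&sort_order=desc&page=2"
# the path and query are identical; only the fragment (dropped by the server) differs
parse_url(url1)$path
parse_url(url1)$fragment
parse_url(url2)$fragment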
Update:
You can see the data source in the data-url attribute of the table element (inspect the table in your browser's developer tools).
I'll update the code to make it easier for you:
library(rjson)
library(rvest)
library(writexl)
stock_ticket <- "AAPL";
url <- paste("https://etfdb.com/stock/", stock_ticket, sep = "");
lastpage <- 9;
df <- data.frame();
data_url <- read_html(url) %>% html_node(xpath = "//table[@id='etfs']") %>% html_attr("data-url");
for (i in 1:lastpage){
  x <- fromJSON(file = paste("https://etfdb.com", data_url, "&offset=", 25 * (i-1), sep = ""));
  x <- x[2][[1]];
  pg_df <- data.frame(matrix(unlist(x), nrow=length(x), byrow=T), stringsAsFactors=FALSE);
  df <- rbind(df, pg_df);
}
for (i in 1:nrow(df)){
  df$X1[i] <- read_html(df$X1[i]) %>% html_text(trim = TRUE);
  df$X3[i] <- read_html(df$X3[i]) %>% html_text(trim = TRUE);
  df$X5[i] <- read_html(df$X5[i]) %>% html_text(trim = TRUE);
}
df <- data.frame(df$X1, df$X3, df$X5, df$X7, df$X9);
colnames(df) <- c("Ticker", "ETF", "ETFdb.com Category", "Expense Ratio", "Weighting");
write_xlsx(
df,
path = "stock.xlsx",
col_names = TRUE,
format_headers = TRUE,
use_zip64 = FALSE
)
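If you would rather not hard-code lastpage <- 9, one option (a sketch that assumes the endpoint keeps the same JSON shape and simply returns fewer than 25 rows on the final page) is to keep requesting pages until the data runs out; data_url here is the value read from the table's data-url attribute above:
offset <- 0
all_rows <- list()
repeat {
  x <- fromJSON(file = paste0("https://etfdb.com", data_url, "&offset=", offset))
  page_rows <- x[[2]]
  if (length(page_rows) == 0) break            # nothing left
  all_rows <- c(all_rows, page_rows)
  if (length(page_rows) < 25) break            # a short page means we reached the end
  offset <- offset + 25
}
df <- data.frame(matrix(unlist(all_rows), nrow = length(all_rows), byrow = TRUE), stringsAsFactors = FALSE)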

Scraping table with multiple headers in R using any package? (XML, rCurl, rlist htmltab, rvest etc)

I am attempting to scrape this table
http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1
Here are all my attempts. None of them get even close to extracting any information. Am I missing something?
library("rvest")
library("tidyverse")
# METHOD 1
url <- "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1"
data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="t1"]/tbody/tr[1]') %>%
  html_table()
data <- data[[1]]
# METHOD 2
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
# METHOD 3
library(htmltab)
tab <- htmltab("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",
               which = '//*[@id="t1"]/tbody/tr[4]',
               header = '//*[@id="t1"]/tbody/tr[3]',
               rm_nodata_cols = TRUE)
# METHOD 4
website <-read_html("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1")
scraped <- website %>%
  html_nodes("table") %>%
  .[(2)] %>%
  html_table(fill = TRUE) %>%
  `[[`(1)
# METHOD 5
getHrefs <- function(node, encoding) {
  if (!is.null(xmlChildren(node)$a)) {
    paste(xpathSApply(node, './a', xmlGetAttr, "href"), collapse = ",")
  } else {
    return(xmlValue(xmlChildren(node)$text))
  }
}
data <- readHTMLTable("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1", which = 1, elFun = getHrefs)
The expected result is the 12 column names in the table and the data below them.
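Since none of the five methods return anything, a first diagnostic worth running (a sketch, not a confirmed explanation) is to check whether the static HTML contains the table at all; if it does not, the table is being built client-side by JavaScript, so none of the static scrapers above will ever see it and the data would have to come from whatever URL the page's JavaScript fetches (visible in the browser's Network tab) or via a headless browser such as RSelenium:
library(rvest)
page <- read_html("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1")
# how many <table> nodes does the raw HTML contain, and is #t1 among them?
length(html_nodes(page, "table"))
length(html_nodes(page, "#t1"))   # 0 would mean the table is injected by JavaScript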