Scraping online PDFs with rvest - html

I want to access the data from this train timetable web page. Using rvest on the URL doesn't give a useful answer:
> read_html("https://www.scotrail.co.uk/sites/default/files/assets/download_ct/_sr1705_glasgow-edinburgh_via_falkirk_highv2.pdf")
{xml_document}
<html>
[1] <body><p>%PDF-1.5\r%\xe2ãÏÓ\r\n22 0 obj\r<>\rendobj\r \rxref\r22 97\r0000000 ...
[2] <html><p>C*ÐsO\u0086ZFWM\u0086X H$\u0083>\u0083-Ïs\u0086O=Ì\u008c"Lí½/1\u009c\u009fõ\u008e\u0 ...
However, when I save the source code locally as an HTML file, I can scrape the contents just fine:
> read_html("/path/to/this/file/_sr1705_glasgow-edinburgh_via_falkirk_highv2.html")
{xml_document}
<html dir="ltr" mozdisallowselectionprint="" moznomarginboxes="">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf- ...
[2] <body tabindex="1" class="">\n <div id="outerContainer">\n\n <div id="sidebarContainer"> ...
I'd like to scrape using the URL rather than manually downloading and saving it as an HTML file. It feels like I'm missing something fundamental about PDFs. I'm confused that the file extension in the URL is .pdf, yet F12 reveals HTML.
Is there a way to scrape directly from this URL? If not, why does saving locally 'fix' the issue?

If you have all the URLs saved in a vector called my_urls, you can then iterate through it and tell R to download each file.
my_urls <- c("www.pdf995.com/samples/pdf.pdf",
"che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf",
"www.africau.edu/images/default/sample.pdf")
save_here <- paste0("document_", 1:3, ".pdf")
for(i in seq_along(my_urls)){
download.file(my_urls[i], save_here[i])
}
Or perhaps a bit more elegantly, using mapply():
mapply(download.file, my_urls, save_here)
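A further caveat: on Windows, binary files such as PDFs are safer downloaded with mode = "wb". One way to do that (a sketch, not part of the answer above) is to pass it through mapply()'s MoreArgs argument:
mapply(download.file, my_urls, save_here, MoreArgs = list(mode = "wb"))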
After execution, you will see that there are three PDFs called document_1.pdf, document_2.pdf and document_3.pdf saved in your working directory.
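If the goal is the timetable data itself rather than the PDF files, a possible next step (a sketch using the pdftools package, which the answer above does not mention) is to extract each page's text from the downloaded files:
# A minimal sketch, assuming the pdftools package is installed;
# pdf_text() returns one character string per page.
library(pdftools)

pages <- pdf_text("document_1.pdf")
cat(substr(pages[1], 1, 500))  # inspect the start of the first page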

Related

How can I render an HTML file using URL in Shiny App without breaking other interactive plots and tables?

I've made a Shiny app that connects to a remote database (PostgreSQL) to pull tables to display in the app, and to use the values from those tables to generate interactive plots (box plots, scatter plots, histograms).
Another part of the app renders HTML files that are hosted on a website, such as GitHub, inside the app.
Here is a portion of the code I used:
output$genotyping <- renderUI({
  if (creds_reactive()$user == "Admin") {
    fluidPage(
      useShinyjs(),
      titlePanel("Github HTML file"),
      mainPanel(
        h3("Click this button to display/hide the html file"),
        br(),
        br(),
        actionButton("genotype", "Click me!"),
        hidden(
          div(id = 'text_div',
              htmlOutput("includeHTML")
          )
        )
      )
    )
  } else {
    mainPanel(
      fluidRow(
        align = "center",
        h3("Sorry! You don't have the permissions required to view this content.")
      )
    )
  }
})

request <- GET("link to html file")
github.html <- content(request, as = "text")

observeEvent(input$genotype, {
  toggle('text_div')
  output$includeHTML <- renderText({ github.html })
})
I'm using code from here.
However, that code is quite outdated: Dropbox has since removed HTML rendering support and disabled some Java features, so the HTML file doesn't load, and using Dropbox links in my code would just generate a blank page in my app. Therefore, I'm using GitHub to store my file. The file itself is rather large, so GitHub itself can't render it. I've used the link to the raw code generated by GitHub, as well as raw.githack, to render the file from a URL. Both options are able to render the HTML file in my app after the screen quickly refreshes. The problem is that none of the other interactive tables and plots can be used after loading in the HTML file. Any tabs (from tabsetPanel) have their font turned red and can't load any of the tables and plots created before.
I would store the report on a local drive/folder, but that's not very practical, as this is just one report (there will be many more, and the EC2 instance I'm using to run the shiny-server would get too full).
EDIT: I should note that even having it in a local folder and using includeHTML has the same problems as using the URL, so maybe it's just a problem with rendering HTML within a Shiny app?
I'm also not too sure how the HTTP request works with the GET function. Is it creating a new connection that replaces the original one? Why aren't my plots/tables loading anymore after rendering the HTML file via URL? Should I be disconnecting from the GET connection?
I'm also using a package called shinymanager to create a secure app login with credentials and passwords, so I have that "if" statement there to generate a customized UI. shinyjs is also used here to create the hidden effect: pressing a button will display/hide the report.
Let me know if there's anything else I can provide. Thanks.
Have you tried using an iframe instead of includeHTML?
Something like this should do it:
library(shiny)

addResourcePath("html", file.path(R.home(), "doc", "html"))

ui <- fluidPage(
  tags$iframe(src = "html/about.html", height = "200px", width = "100%",
              frameborder = "0", scrolling = "yes")
)

server <- function(input, output, session) {}

shinyApp(ui, server)
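For the remote-report case (GitHub / raw.githack), a possible variation on the same iframe idea, sketched here with a placeholder URL and file names, is to download the report once at startup into a folder served via addResourcePath() and point the iframe at that local copy:
library(shiny)

# Placeholder URL: download the report once at startup into a served folder
report_dir <- file.path(tempdir(), "reports")
dir.create(report_dir, showWarnings = FALSE)
download.file("https://raw.githubusercontent.com/user/repo/main/report.html",
              file.path(report_dir, "report.html"), mode = "wb")
addResourcePath("reports", report_dir)

ui <- fluidPage(
  actionButton("genotype", "Click me!"),
  uiOutput("report_frame")
)

server <- function(input, output, session) {
  output$report_frame <- renderUI({
    req(input$genotype)  # show the iframe only after the button is clicked
    tags$iframe(src = "reports/report.html", height = "600px",
                width = "100%", frameborder = "0")
  })
}

shinyApp(ui, server)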

Is there a way to render an HTML page from Ruby?

I am developing an application that takes in the address of a web page and generates an HTML file with the source of that page. I have successfully generated the file. I can't figure out how to launch that file in a new tab.
This is running in Repl.it, a web-based code editor. Here's what I have:
def run
  require 'open-uri'
  puts "enter a URL and view the source"
  puts "don't include the https:// at the beginning"
  url = gets.chomp
  fh = open("https://" + url)
  html = fh.read
  puts html
  out_file = File.new("out.html", "w")
  out_file.puts(html)
  out_file.close
  run
end
Then I'm running that code.
As I understand it, you just want to save the HTML of a site and open the new file in your browser.
You can do it this way (I use Firefox):
require 'net/http'
require 'uri'
uri = URI.parse('https://bla-bla-bla.netlify.com/')
response = Net::HTTP.get_response(uri)
file_name = 'out.html'
File.write(file_name, response.body)
system("firefox #{file_name}")
Note: Keep in mind that site owners often block parsers, so you may have to use torify.
Now check the file
$ cat out.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Bla-bla-bla</title>
</head>
<body>
<p>Bla-bla</p>
</body>
</html>
Everything worked out.
Hope it helps you.
If all you need is to open this file locally on your computer, I would perform a system call.
For example on my macOS the following would open the HTML page on my default browser:
system("open #{out_file.path}")
If you want to supply the rendered HTML to other users on your network, then you will need an HTTP server; I suggest Sinatra to start with.

Find absolute html path given relative href using R

I'm new to HTML, but I'm playing with a script to download all the PDF files that a given web page links to (for fun and to avoid boring manual work), and I can't find where in the HTML document I should look for the data that completes relative paths. I know it is possible, since my web browser can do it.
Example: I'm trying to scrape lecture notes linked from this page on ocw.mit.edu using the R package rvest. Looking at the raw HTML, or accessing the href attribute of the nodes, I only get relative paths:
library(rvest)

url <- paste0("https://ocw.mit.edu/courses/",
              "electrical-engineering-and-computer-science/",
              "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Read webpage and extract all links
links_all <- read_html(url) %>%
  html_nodes("a") %>%
  html_attr("href")

# Extract only href ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)

links_pdf[1]
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"
The easiest solution that I have found as of today is the url_absolute(x, base) function from the xml2 package. For the base parameter, you use the URL of the page you retrieved the source from.
This seems less error-prone than trying to extract the base URL from the address via regex.
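For example, continuing from the code in the question, a minimal sketch of how url_absolute() could be applied to the scraped hrefs (links_pdf and url are the objects from the question):
library(xml2)

# Resolve the relative hrefs against the page they were scraped from
links_abs <- url_absolute(links_pdf, base = url)
links_abs[1]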

Include HTML files in R Markdown file?

Quick Summary
How do I embed HTML files within an R Markdown file?
Details
I have created some nice animated choropleth maps via choroplethr.
As the link demonstrates, the animated choropleths work by creating a set of PNG images, which are then rolled into an HTML file that cycles through the images to show the animation. Works great, looks great.
But now I want to embed / incorporate these pages within the .Rmd file, so that I have a holistic report including these animated choropleths, along with other work.
It seems to me there should be an easy way to do the equivalent of
Links:
[please click here](http://this.is.where.you.will.go.html)
or
Images:
![cute cat image](http://because.that.is.what.we.need...another.cat.image.html)
The image syntax is precisely what I want: a reference that is "blown up" to put the information in place, instead of just a link. How can I do this with a full HTML file instead of just an image? Is there any way?
Explanation via Example
Let's say my choropleth HTML file lives in my local path at './animations/demographics.html', and I have an R Markdown file like:
---
title: 'Looking at the demographics issue'
author: "Mike"
date: "April 9th, 2016"
output:
  html_document:
    number_sections: no
    toc: yes
    toc_depth: 2
fontsize: 12pt
---
# Introduction
Here is some interesting stuff that I want to talk about. But first, let's review those earlier demographic maps we'd seen.
!![demographics map]('./animations/demographics.html')
where I have assumed / pretended that !! is the antecedent that will do precisely what I want: allow me to embed that HTML file in-line with the rest of the report.
Updates
Two updates. Most recently, I still could not get things to work, so I pushed it all up to a GitHub repository, in case anyone is willing to help me sort out the problem. Further details can be found in that repo's Readme file.
It seems that being able to embed HTML into an R Markdown file would be incredibly useful, so I keep trying to sort it out.
(Older comments)
As per some of the helpful suggestions, I tried the following in the R Markdown file, without success:
Shiny method:
```{r showChoro1}
shiny::includeHTML("./animations/demographics.html")
```
(I also added runtime: shiny up in the YAML portion.)
htmltools method:
```{r showChoro1}
htmltools::includeHTML("./animations/demographics.html")
```
(In this case, I made no changes to the YAML.)
In the former case (Shiny), it did not work at all. In fact, including the HTML seemed to muck up the functionality of the document altogether, such that the runtime seemed perpetually not fully functional. (In short, while it appeared to load everything, the "loading" spinner never went away.)
In the latter case, nothing else got messed up, but it produced a broken image. Strangely, there was a "choropleth player" ribbon at the top of the document which would work; it's just that none of the images would appear.
For my own sanity, I also provided simple links, which worked fine.
[This link](./animations/demographics.html) worked without a problem, except that it is not embedded, as I would prefer.
So it is clearly a challenge with the embedding.
Here is a hack (probably inelegant)... the idea is to insert the HTML code directly into the Rmd programmatically and then render the Rmd.
temp.Rmd file:
---
title: "Introduction"
author: "chinsoon12"
date: "April 10, 2016"
output: html_document
---
<<insertHTML:[test.html]
etc, etc, etc
```{r, echo=FALSE}
htmltools::includeHTML("test.html")
```
etc, etc, etc
test.html file:
<html>
<head>
<title>Title</title>
</head>
<body>
<p>This is an R HTML document. When you click the <b>Knit HTML</b> button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:</p>
<p>test test</p>
</body>
</html>
Verbose code to replace the Rmd placeholder with the HTML code and then render (can probably be shortened by a lot):
library(stringi)

subHtmlRender <- function(mdfile, htmlfile) {
  # replace <<insertHTML:[htmlfile] with the actual html code,
  # but without beginning white space
  lines <- readLines(mdfile)
  toSubcode <- paste0("<<insertHTML:[", htmlfile, "]")
  location <- which(stri_detect_fixed(lines, toSubcode))
  htmllines <- stri_trim(readLines(htmlfile))

  # render html doc
  newRmdfile <- tempfile("temp", getwd(), ".Rmd")
  newlines <- c(lines[1:(location - 1)],
                htmllines,
                # be careful when insertHTML is the last line in the .Rmd file
                lines[min(location + 1, length(lines)):length(lines)])
  write(newlines, newRmdfile)
  rmarkdown::render(newRmdfile, "html_document")
  shell(gsub(".Rmd", ".html", basename(newRmdfile), fixed = TRUE))
} # end subHtmlRender

subHtmlRender("temp.Rmd", "test.html")
EDIT: htmltools::includeHTML also works with the sample files that I provided. Could it be that your particular HTML file is not UTF-8 encoded?
EDIT: taking @MikeWilliamson's comments into account, I tried the following:
1. copied and pasted animated_choropleth.html into a blank .Rmd
2. removed the references to cloudflare.com, as I had access issues while rendering (see below)
3. knit to HTML
4. put back those cloudflare weblinks
5. put the graphs in the same folder as the rendered html
6. opened the HTML
I appear to get back the HTML, but I am not sure whether the result is what you expect.
Are you also facing the same issue in step 2? You might want to post the error message and ask for fixes :). This was my error message:
pandoc.exe: Failed to retrieve http://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.1.1/css/bootstrap.min.css
FailedConnectionException2 "cdnjs.cloudflare.com" 80 False getAddrInfo: does not exist (error 11001)
Error: pandoc document conversion failed with error 61
Did you try the includes: option in your YAML header?
https://rmarkdown.rstudio.com/html_document_format.html#includes
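For reference, a rough sketch of what that option looks like in the YAML header (the file names here are placeholders):
output:
  html_document:
    includes:
      in_header: header.html     # injected into <head>
      before_body: before.html   # injected just after <body>
      after_body: after.html     # injected just before </body>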
But maybe you'll have the same problem I have: I'd like to include the HTML file in a specific section in my RMarkdown document, not in the header or before/after body.
You can try putting this line in the R Markdown document and then knitting.
(YAML header "output: html_document"; if "runtime: shiny" is used, somehow it does not work.)

How to download files from an HTML form in R

When I press Ctrl+S and save this page in my web browser:
http://www.kegg.jp/kegg-bin/show_pathway?zma00944+default%3dred+cpd:C01514+cpd:C05903+cpd:C01265+cpd:C01714
I get the HTML form and a folder with some PNG files. I'm interested in the PNG files that have a known pattern.
Is there a way to download them in the same way from R?
I'm trying:
download.file("http://www.kegg.jp/kegg-bin/show_pathway?zma00944+default%3dred+cpd:C01514+cpd:C05903+cpd:C01265+cpd:C01714","form.html", mode = "wb")
but I only get the HTML form, not the associated PNGs.
Thanks
This will get you part of the way there:
source("http://bioconductor.org/biocLite.R")
biocLite("KEGGREST")
library(png)
library(KEGGREST)
png <- keggGet(c("zma00944","default=red","cpd:C01514","cpd:C05903","cpd:C01265","cpd:C01714"), "image")
t <- tempfile()
writePNG(png, t)
browseURL(t)
Unfortunately it does not do the red highlighting which you probably want. I'm not sure if that can be done through the REST API.
So instead you could probably just download the URL as you have, parse it for the PNG link, and then download that:
download.file("http://www.kegg.jp/kegg-bin/show_pathway?zma00944+default%3dred+cpd%3aC01514+cpd%3aC05903+cpd%3aC01265+cpd%3aC01714", "form.html")
lines <- readLines("form.html")
imgUrl <- lines[grep('img src="/', lines)]
url <- paste0("http://www.kegg.jp/", strsplit(imgUrl, '"')[[1]][2])
download.file(url, "file.png")
browseURL("file.png")