Is there a way to render an HTML page from Ruby?

I am developing an application that takes the address of a web page and generates an HTML file containing the source of that page. I have successfully generated the file, but I can't figure out how to launch that file in a new tab.
This is running in Repl.it, a web-based code editor. Here's what I have:
require 'open-uri'

def run
  puts "enter a URL and view the source"
  puts "don't include the https:// at the beginning"
  url = gets.chomp
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
  fh = URI.open("https://" + url)
  html = fh.read
  puts html
  out_file = File.new("out.html", "w")
  out_file.puts(html)
  out_file.close
end

run
Then I'm running that code.

As I understand it, you just want to save the HTML of a site and open the new file in your browser.
You can do it this way (I use Firefox):
require 'net/http'
require 'uri'
uri = URI.parse('https://bla-bla-bla.netlify.com/')
response = Net::HTTP.get_response(uri)
file_name = 'out.html'
File.write(file_name, response.body)
system("firefox #{file_name}")
Note: Keep in mind that site owners often block scrapers, so you may have to route the request through Tor (for example with torify).
Now check the file:
$ cat out.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Bla-bla-bla</title>
</head>
<body>
<p>Bla-bla</p>
</body>
</html>
Everything worked out.
Hope it helps you.
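The answer above hardcodes Firefox. As a minimal sketch, assuming the usual platform openers (`open` on macOS, `xdg-open` on most Linux desktops, `start` via `cmd` on Windows), the launch step could pick a command from the host OS instead. The helper name `browser_command` is hypothetical and not part of the original answer:

```ruby
require 'rbconfig'

# Return the argument list that opens +path+ with the platform's default
# handler. host_os is a parameter so the choice can be checked without
# actually being on that platform (hypothetical helper, for illustration).
def browser_command(path, host_os = RbConfig::CONFIG['host_os'])
  case host_os
  when /darwin/
    ['open', path]
  when /mswin|mingw|cygwin/
    # "start" is a cmd builtin; the empty string is the window title argument
    ['cmd', '/c', 'start', '', path]
  else
    # Assumption: a Linux/BSD desktop with xdg-utils installed
    ['xdg-open', path]
  end
end

# Usage (not executed here):
#   system(*browser_command('out.html'))
```

Passing the command to `system` as separate arguments avoids shell interpolation of the file name. This only helps on a machine with a desktop browser; in a hosted editor such as Repl.it there is likely no local browser to launch.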

If all you need is to open this file locally on your computer, I would use a system call.
For example, on macOS the following opens the HTML page in the default browser:
system("open #{out_file.path}")
If you want to serve the rendered HTML to other users on your network, you will need an HTTP server. I suggest starting with Sinatra.
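To illustrate what "an HTTP server" means here, the following is a rough sketch that uses only Ruby's standard `socket` library rather than Sinatra. It returns the same HTML file for every request. The names `html_response` and `serve_file` and the port 8080 are assumptions made for this example, and a real application would be better served by a proper framework:

```ruby
require 'socket'

# Build a complete HTTP/1.1 response that carries an HTML body.
def html_response(body)
  "HTTP/1.1 200 OK\r\n" \
    "Content-Type: text/html; charset=utf-8\r\n" \
    "Content-Length: #{body.bytesize}\r\n" \
    "Connection: close\r\n" \
    "\r\n" \
    "#{body}"
end

# Answer every request with the contents of +path+ until the process stops.
def serve_file(path, port: 8080)
  server = TCPServer.new('0.0.0.0', port)
  loop do
    client = server.accept
    # Read and discard the request line and headers up to the blank line
    while (line = client.gets) && line != "\r\n"; end
    client.write(html_response(File.read(path)))
    client.close
  end
end

# Usage (not executed here):
#   serve_file('out.html')   # then browse to http://localhost:8080/
```

The sketch ignores the request path and method, handles one connection at a time, and does no error handling, so it is only suitable for quick local experiments.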

Related

Scraping online PDFs with rvest

I want to access the data from this train timetable web page. Using rvest on the URL doesn't give a useful answer:
> read_html("https://www.scotrail.co.uk/sites/default/files/assets/download_ct/_sr1705_glasgow-edinburgh_via_falkirk_highv2.pdf")
{xml_document}
<html>
[1] <body><p>%PDF-1.5\r%\xe2ãÏÓ\r\n22 0 obj\r<>\rendobj\r \rxref\r22 97\r0000000 ...
[2] <html><p>C*ÐsO\u0086ZFWM\u0086X H$\u0083>\u0083-Ïs\u0086O=Ì\u008c"Lí½/1\u009c\u009fõ\u008e\u0 ...
However when I save the source code locally as an html file I can scrape the contents just fine:
> read_html("/path/to/this/file/_sr1705_glasgow-edinburgh_via_falkirk_highv2.html")
{xml_document}
<html dir="ltr" mozdisallowselectionprint="" moznomarginboxes="">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf- ...
[2] <body tabindex="1" class="">\n <div id="outerContainer">\n\n <div id="sidebarContainer"> ...
I'd like to scrape using the URL rather than manually downloading and saving as a html file. It feels like I'm missing something fundamental about PDFs. I'm confused that the file extension in the URL is .pdf but F12 reveals html.
Is there a way to scrape directly from this URL? If not why does saving locally 'fix' the issue?
If you have all URLs saved in the vector called my_urls, you can then iterate through it and tell R to download those files.
my_urls <- c("www.pdf995.com/samples/pdf.pdf",
"che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf",
"www.africau.edu/images/default/sample.pdf")
save_here <- paste0("document_", 1:3, ".pdf")
for (i in seq_along(my_urls)) {
  # mode = "wb" writes the PDF as binary, which matters on Windows
  download.file(my_urls[i], save_here[i], mode = "wb")
}
Or perhaps a bit more elegantly, using mapply():
mapply(download.file, my_urls, save_here)
After execution, you will see that there are three PDFs called document_1.pdf, document_2.pdf and document_3.pdf saved in your working directory.

Can I run a batch file when I open my html document

I am trying to get a batch file to run from my head tag when I open my document.
I have tried these and they don't seem to work:
<head onLoad="shutdown.bat">
<head onLoad="window.open('shutdown.bat')">
I've also tried putting it into a function and then calling that function on load:
<head onLoad="shut">
<script>
function shut() {
}
</script>
</head>
I have also made sure the batch file works by opening it directly.
This is only possible to do when you are opening your HTML file through file:/// on your local computer. As soon as you host it on a server it will throw an error.
What you want to do though is this:
window.open("file:///C:/Path/To/File/some_bat_file.bat");

What am I doing wrong with trying to get my Disqus plugin to work?

I'm slowly building a website and it's just saved on my desktop right now. I'm trying to place Disqus on one of my pages. I pasted the code into the HTML document, but I'm not getting anything on my page. I was able to get my Twitter widget to work on a different page just by pasting the code that was given to me. Disqus gave the same kind of instruction, which was to paste the universal code into my site, but nothing is showing up.
Do I have to do something with a CSS file to get it showing? I was searching through the settings in Disqus and one of the settings allows me to set the website URL but my website is not live and is just located in a folder in my desktop containing my html and CSS files.
I created a test HTML document in the folder containing all my HTML documents but I only get the sentence contained in the paragraph tag.
<! DOCTYPE html>
<html>
<head>
<title>Test-Disqus</title>
</head>
<body>
<p> Testing Disqus.</p.>
<div id="disqus_thread"></div>
<script type="text/javascript">
/* * * CONFIGURATION VARIABLES * * */
var disqus_shortname = 'myusername'; /***changed for this question*///
/* * * DON'T EDIT BELOW THIS LINE * * */
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments powered by Disqus.</noscript>
</body>
</html>
There are a couple of issues with your HTML, although they are not the root cause of your problem.
1) It's <!DOCTYPE html> not <! DOCTYPE html>
2) Your closing paragraph tag has a ".", it should be </p> and not </p.>
You said, "I'm slowly building a website and it's just saved on my desktop right now." If you're opening this test file directly from your desktop in a web browser, the Disqus module will not load. Disqus restricts the comment module so that it loads only on trusted domains that you set. You can check the trusted domains by logging in to Disqus and going to admin -> settings -> advanced.
You can add additional trusted domains if you need. However if your trusted domain is "xyz.com" and you load your test page from your desktop, the trusted domain will not match.
You need to run a web server to get it working. I recommend MAMP for local development. MAMP will most likely start up on port 8888 or 8080, which lets you access your test file at an address such as http://localhost:8080/test.html. After that, you can try adding localhost as a trusted domain, or create an entry in your hosts file.

Classic ASP Content-type XLS Chrome downloads as .asp file for some users

I have a Classic ASP website which includes the ability to export/download table data as an Excel file (.xls). This is done by redirecting the user to a new page with this block of code in place of the usual HTML headers:
sub PutInTopOfXLS(FileName)
Response.Buffer = TRUE
Response.CharSet="UTF-8"
Response.CodePage=65001
Response.ContentType = "application/vnd.ms-excel"
Response.AddHeader "Content-Disposition", "attachment;filename=" & FileName%>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:x="urn:schemas-microsoft-com:office:excel">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style>
<!--table
br {mso-data-placement:same-cell;}
tr {vertical-align:top;}
-->
</style>
</head>
<body>
<%end sub
This works fine for all users (or at least no problems have been reported) on Firefox, Internet Explorer and Safari (for both Mac and Windows), and it works fine on my development machine using Chrome (20.0.1132.57). However, my QA person reports that for one particular report out of several on the site, Chrome always downloads the file under the actual code page name, ReportFileName.asp, and he then gets a Windows error about no file association for .asp files. If he selects Open with Excel, it turns out the correct file has been downloaded. I asked another person in our office to download Chrome and she has no problem; the file downloads as {filename}.xls and opens normally.
I'm very confused. For the QA person, only this one report is affected, which would suggest that the problem is in that specific report. However, two other users with the same version of Chrome are not experiencing the problem, which would suggest it is something in his Chrome settings.
I haven't had any luck googling for a solution or searching on SO so I thought I'd throw the question out there and see if anyone has any ideas.
Thanks very much for your assistance.
I'd recommend streaming the file; this will also let you handle large files more gracefully.
Here's my working example. I use it to pick up some automated CSV files that are stored under rather unfriendly filenames, and the script also presents a nicer filename to the end user:
' Set some variables
strSourceFilename = "00008446.dat"
strNiceFilename = "Nice Name Here.csv"
' Read the file and stream it back to the client
Const adTypeBinary = 1
Set BinaryStream = CreateObject("ADODB.Stream")
BinaryStream.Type = adTypeBinary
BinaryStream.Open
BinaryStream.LoadFromFile "C:\path\to\the\folder\" & strSourceFilename
ReadBinaryFile = BinaryStream.Read
Response.ContentType = "application/x-unknown"
Response.AddHeader "Content-Disposition", "attachment; filename=" & strNiceFilename
Response.BinaryWrite ReadBinaryFile
Hope this helps!

How to get the HTML source of a webpage in Ruby [duplicate]

In browsers such as Firefox or Safari, with a website open, I can right click the page, and select something like: "View Page Source" or "View Source." This shows the HTML source for the page.
In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:
source = view_source("http://stackoverflow.com")
where source would be this text:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc
Use Net::HTTP:
require 'net/http'
source = Net::HTTP.get('stackoverflow.com', '/index.html')
require "open-uri"
source = open(url){ |f| f.read }
UPD: Ruby >= 1.9 allows this shorter syntax:
require "open-uri"
source = open(url, &:read)
UPD: Ruby >= 3.0 requires this syntax:
require "open-uri"
source = URI(url).open(&:read)
require 'open-uri'
source = URI.open(url).read  # Kernel#open stopped accepting URLs in Ruby 3.0
Short, simple, sweet.
Yes, like this:
require 'open-uri'
# URI.open replaces the bare open call, which no longer accepts URLs on Ruby 3.0+
URI.open('http://stackoverflow.com') do |file|
  # use the source, Eric
  # e.g. file.each_line { |line| puts line }
end
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com/')
puts page.body
You can then do a lot of other cool stuff with Mechanize as well.
You could use the built-in Net::HTTP:
>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'
Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".
Another thing you might be interested in is Nokogiri. It is an HTML, XML, etc. parser that is very easy to use. Their front page has some example code that should get you started and see if it's right for what you need.
If you have cURL installed, you could simply:
url = 'http://stackoverflow.com'
html = `curl #{url}`
If you want to use pure Ruby, look at the Net::HTTP library:
require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body
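Pulling the Net::HTTP answers together, the hypothetical `view_source` from the question could be sketched as a thin wrapper like the one below. The `fetcher:` keyword and the `DEFAULT_FETCHER` constant are assumptions added for this example so that the network call can be swapped out. The default uses `Net::HTTP.get`, which does not follow redirects:

```ruby
require 'net/http'
require 'uri'

# Default fetcher: a plain GET through Net::HTTP (no redirect handling).
DEFAULT_FETCHER = ->(uri) { Net::HTTP.get(uri) }

# Return the HTML source of +url+ as a String.
# fetcher is injectable (hypothetical parameter) so the function can be
# exercised without touching the network.
def view_source(url, fetcher: DEFAULT_FETCHER)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass this check
  raise ArgumentError, "not an HTTP(S) URL: #{url}" unless uri.is_a?(URI::HTTP)

  fetcher.call(uri)
end

# Usage (not executed here):
#   source = view_source('http://stackoverflow.com')
```

Rejecting input without a scheme catches a bare host name such as `stackoverflow.com` early, instead of failing later inside Net::HTTP.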