How to get the HTML source of a webpage in Ruby [duplicate] - html

This question already has answers here:
Equivalent of cURL for Ruby?
(13 answers)
Closed 7 years ago.
In browsers such as Firefox or Safari, with a website open, I can right click the page, and select something like: "View Page Source" or "View Source." This shows the HTML source for the page.
In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:
source = view_source("http://stackoverflow.com")
where source would be this text:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc

Use Net::HTTP:
require 'net/http'
source = Net::HTTP.get('stackoverflow.com', '/index.html')
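Note that Net::HTTP does not follow redirects on its own, so for a page that has moved you get the body of the redirect response rather than the page itself. A minimal sketch of a redirect-following wrapper, using only the standard library; fetch_source and its limit parameter are hypothetical names for illustration, not part of Net::HTTP:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Net::HTTP): return the body of `url`,
# following at most `limit` redirects.
def fetch_source(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    fetch_source(URI.join(url, response['location']).to_s, limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end

# Usage (needs network access):
#   source = fetch_source('http://stackoverflow.com/')
```

The redirect limit is an arbitrary choice; it only exists to stop a redirect loop from recursing forever.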

require "open-uri"
source = open(url){ |f| f.read }
UPD: Ruby >= 1.9 allows the shorter syntax:
require "open-uri"
source = open(url, &:read)
UPD: Ruby >= 3.0 requires this syntax, because Kernel#open no longer accepts URLs:
require "open-uri"
source = URI(url).open(&:read)
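Whichever variant matches your Ruby version, it may be worth checking that the string really is an http(s) URL before opening it, especially when it comes from user input. A minimal sketch, assuming only open-uri from the standard library; page_source is a hypothetical name:

```ruby
require 'open-uri'

# Hypothetical helper: read the source of an http(s) page.
# Anything that does not parse as an http(s) URL is rejected up front.
def page_source(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  raise ArgumentError, "not an http(s) URL: #{url}" unless uri.is_a?(URI::HTTP)

  uri.open(&:read)
end

# Usage (needs network access):
#   source = page_source('http://stackoverflow.com')
```

Calling open on the URI object rather than through Kernel#open keeps local paths and other non-web input out of the picture.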

require 'open-uri'
source = open(url).read
Short, simple, sweet. (On Ruby 3.0 and later, use URI(url).open.read instead, since Kernel#open no longer accepts URLs.)

Yes, like this:
require 'open-uri'
open('http://stackoverflow.com') do |file|
  # use the source, Eric
  # e.g. file.each_line { |line| puts line }
end

require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com/')
puts page.body
you can then do a lot of other cool stuff with mechanize as well.

You could use the builtin Net::HTTP:
>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'
Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".

Another thing you might be interested in is Nokogiri. It is an HTML, XML, etc. parser that is very easy to use. Their front page has some example code that should get you started and see if it's right for what you need.

If you have cURL installed, you could simply:
url = 'http://stackoverflow.com'
html = `curl #{url}`
If you want to use pure Ruby, look at the Net::HTTP library:
require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body
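One caveat about the backtick form of the cURL call: it passes the URL through a shell, so a URL containing shell metacharacters can do surprising things. A sketch of the same idea with Open3 from the standard library, which runs curl with an argument list and no shell. curl_command and curl_source are hypothetical helper names, and the sketch assumes curl is on the PATH:

```ruby
require 'open3'

# Hypothetical helper: build the argument list for curl.
# --silent hides the progress meter, --show-error still reports failures,
# --location follows redirects.
def curl_command(url)
  ['curl', '--silent', '--show-error', '--location', url]
end

# Hypothetical helper: run curl without a shell and return its stdout.
def curl_source(url)
  html, status = Open3.capture2(*curl_command(url))
  raise "curl exited with status #{status.exitstatus}" unless status.success?

  html
end

# Usage (needs curl and network access):
#   html = curl_source('http://stackoverflow.com')
```

Because each array element is handed to curl as a single argument, the URL is never interpreted by a shell.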

Related

Is there a way to render an HTML page from Ruby?

I am developing an application that takes the address of a web page and generates an HTML file containing the source of that page. I have successfully generated the file, but I can't figure out how to launch that file in a new tab.
This is running in Repl.it, a web-based code editor. Here's what I have:
def run
  require 'open-uri'
  puts "enter a URL and view the source"
  puts "don't include the https:// at the beginning"
  url = gets.chomp
  fh = open("https://"+url)
  html = fh.read
  puts html
  out_file = File.new("out.html", "w")
  out_file.puts(html)
  out_file.close
  run
end
Then I'm running that code.
As I understand it, you just want to save the site's HTML and open the new file in your browser.
You can do it this way (I use Firefox):
require 'net/http'
require 'uri'
uri = URI.parse('https://bla-bla-bla.netlify.com/')
response = Net::HTTP.get_response(uri)
file_name = 'out.html'
File.write(file_name, response.body)
system("firefox #{file_name}")
Note: keep in mind that site owners often block scrapers, so you may have to route your requests through something like torify.
Now check the file:
$ cat out.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Bla-bla-bla</title>
</head>
<body>
<p>Bla-bla</p>
</body>
</html>
Everything worked out.
Hope it helps you.
If all you need is to open this file locally on your computer, I would make a system call.
For example, on my macOS machine the following opens the HTML page in my default browser:
system("open #{out_file.path}")
If you want to serve the rendered HTML to other users on your network, then you will need an HTTP server; I suggest starting with Sinatra.
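The system calls in the answers above hard-code one launcher each (firefox, open). A rough sketch that writes the file and picks a launcher by platform; browser_opener and save_and_open are hypothetical names, and it assumes xdg-open is installed on Linux/BSD desktops:

```ruby
require 'rbconfig'

# Hypothetical helper: choose a command that opens a file in the default
# browser. Only macOS and Linux/BSD are covered in this sketch.
def browser_opener(host_os = RbConfig::CONFIG['host_os'])
  case host_os
  when /darwin/ then 'open'
  when /linux|bsd/ then 'xdg-open' # assumption: xdg-utils is installed
  else raise NotImplementedError, "no opener known for #{host_os}"
  end
end

# Hypothetical helper: write the HTML to `path` and optionally launch it.
def save_and_open(html, path = 'out.html', launch: true)
  File.write(path, html)
  # The multi-argument form of system runs the command without a shell.
  system(browser_opener, path) if launch
  path
end

# Usage:
#   save_and_open('<p>Hello</p>')
```

In a hosted editor such as Repl.it there may be no desktop browser to launch, in which case only the file-writing half applies.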

Get linked CSS files via a regexp in an HTML page

I am trying to parse an HTML page that I have loaded with Perl. I need to extract, for example, src="asd/jkl/xyz.css" from the HTML response so that I can rewrite the path to an absolute one.
The reason I want to do this is that I need the CSS inline in the head of an e-mail.
My plan is:
load the page via Perl
get the src of the linked CSS
load the CSS files via Perl
parse the CSS and put the contents of the CSS files in the head tag of my generated e-mail.
Does anyone have a better idea or a working regex?
Try something like this:
#!/usr/bin/env perl
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->load_html(location => "http://mywebsite.com", recover => 2);
# <link> elements carry the stylesheet URL in href rather than src
print $doc->findnodes('//link[@rel="stylesheet"]/@href');
Reference: http://metacpan.org/pod/XML::LibXML

Parsing RSS in a Ruby on Rails project shows the html as inline html

I'm calling my feed with <%= blog_feed %> inside the view and have a little snippet in my helper:
require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
def blog_feed
  source = "http://www.domain.com/.rss" # url or local file
  content = "" # raw content of rss feed will be loaded here
  open(source) do |s| content = s.read end
  rss = RSS::Parser.parse(content, false)
  html = "<ul>"
  rss.items.first(3).each do |i|
    html << "<li><a href='#{i.link}'>#{i.title}</a></li>"
  end
  html << "</ul>"
  html
end
It mostly runs the way I want, but the HTML is escaped and displayed as text, so I see the li, ul and a href tags on the website.
Any ideas or suggestions?
best regards
denym
The way you're processing and displaying the RSS in a view isn't the way I'd do it, but the quick answer is that you need to call html_safe for any HTML strings you build in this way.
This may be unsafe, as the incoming RSS data may have code in it that causes cross site security issues. You can handle that by using the sanitize helper. I think the sanitize helper automatically calls html_safe for you.
So, at the end of your blog_feed method, replace the html return value with:
sanitize html
Check out the documentation for sanitize here: http://api.rubyonrails.org/classes/ActionView/Helpers/SanitizeHelper.html#method-i-sanitize
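To illustrate the escaping concern outside of Rails, here is a sketch that builds the same list in plain Ruby and escapes each feed value with ERB::Util.html_escape from the standard library. FeedItem and feed_list_html are hypothetical names standing in for the RSS items and the helper; inside a Rails view, sanitize or helpers such as link_to would be the more usual route:

```ruby
require 'erb'

# Hypothetical stand-in for an RSS item (anything with title and link works).
FeedItem = Struct.new(:title, :link)

# Hypothetical helper: build a <ul> from the first three items, escaping
# every value that comes from the feed before it is interpolated.
def feed_list_html(items)
  rows = items.first(3).map do |item|
    title = ERB::Util.html_escape(item.title)
    link  = ERB::Util.html_escape(item.link)
    %(<li><a href="#{link}">#{title}</a></li>)
  end
  "<ul>#{rows.join}</ul>"
end

# Usage:
#   feed_list_html([FeedItem.new('Hello <world>', 'http://example.com/?a=1&b=2')])
```

Escaping only neutralises markup in the values; it does not vet the link itself, so a feed you do not trust may still call for checking the URL scheme.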

How can I call a Perl script inside HTML page?

I have a single HTML file. How can I use a Perl script (date/hour) in the HTML code?
My goal: show the date/hour in the HTML.
Note: on their own, both files work fine.
Example:
HTML File:
<html>
<body>
code or foo.pl script
</body>
</html>
Perl script(foo.pl):
#!/usr/local/bin/perl
use CGI qw/:push -nph/;
$| = 1;
print multipart_init(-boundary=>'----here we go!');
for (0 .. 4) {
    print multipart_start(-type=>'text/plain'),
          "The current time is ",scalar(localtime),"\n";
    if ($_ < 4) {
        print multipart_end;
    } else {
        print multipart_final;
    }
    sleep 1;
}
Perl is a server-side language, so it must be run on the server. The HTML code is generated by the server and displayed in the browser. So you would have to run the Perl script on the server to generate the date/hour, and embed that into the HTML code that you serve to the browser.
Here is a tutorial on how to do this.
It sounds like you want Ajax. Your HTML page uses JavaScript to call your Perl program. Your JavaScript gets the response and replaces the part of the page where you want the data to go. Alternatively, you can just skip the Perl bit altogether and just do it all in JavaScript.
You can either generate the entire HTML page via a CGI script (as per Chetan's answer) - or as an alternative you can use one of the templating modules (EmbPerl, Mason, HTML::Template, or many others).
The templating solution is better for real software development, where separation of HTML and the Perl logic is more important. E.g. for EmbPerl, your code would look like:
<html>
<body>
[- my $date_hour= my_sub_printing_date_and_hour(); # Logic to generate -]
[+ $date_hour # print into HTML - could be combined with last statement +]
</body>
</html>

make my file readable as either Perl or HTML

In the spirit of the "Perl Preamble" where a script works properly whether executed by a shell script interpreter or the Perl interpreter...
I have a Perl script which contains an embedded HTML document (as a "heredoc"), i.e.:
#!/usr/bin/perl
... some Perl code ...
my $html = <<'END' ;
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
... more HTML ...
</HTML>
END
... Perl code that processes $html ...
I would like to be able to work on the HTML that's inside the Perl script and check it out using a web browser, and only run the script when the HTML is the way I want. To accomplish this, I need the file to be openable both as an HTML file and as a Perl script.
I have tried various tricks with Perl comments and HTML comments but can't get it quite perfect. The file as a whole doesn't have to be "strictly legal" HTML (although the embedded document should be)... just displayable in a browser with no (or minimal) Perl garbage visible.
EDIT: Solved! See my own answer
Read it and weep, Mr. @Axeman... I now present to you the empty set:
</dev/fd/0 eval 'exec perl -x -S $0 ${1+"$@"}' #> <!--
#!perl
... some Perl code ...
my $html = << '<!-- END' ; # -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
... more HTML ...
</HTML>
<!-- END
... Perl code that processes $html ...
# -->
This sounds like a path to pain. Consider storing the HTML in a separate file and reading it in within the script.
Maybe this is a job for Markup::Perl:
# don't write this...
print "Content-type: text/html;\n\n";
print "<html>\n<body>\n";
print "<p>\nYour \"lucky number\" is\n";
print "<i>", int rand 10, "</i>\n</p>\n";
print "</body>\n</html>\n";
# write this instead...
use Markup::Perl;
<html><body><p>
Your "lucky number" is
<i><perl> print int rand 10 </perl></i>
</p></body></html>
You could also drop the use Markup::Perl line and run your script like
perl -MMarkup::Perl my_page_with_embedded_perl.html
Then the page should render pretty well.
Sounds to me like you want a templating solution, such as Template::Toolkit or HTML::Template. Embedding HTML in your code or embedding code in your HTML is a recipe for pain.
Have you considered putting Perl inside of HTML?
Like ASP4 does?
It's a lot easier that way - trust me ;-)