I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format?
Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way of doing this.
I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.
Unfortunately Stack Overflow claims to be serving XHTML, but the markup isn't actually well-formed XML. Hpricot, however, can parse this tag soup into a tree of elements for you.
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))
# select the summary count cell and keep only the digits
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i
And so forth.
Hpricot is no longer maintained. Use Nokogiri now.
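For example, here is a rough Nokogiri equivalent of the Hpricot snippet above (a sketch only; the CSS selectors assume the old user-page layout):

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://stackoverflow.com/users/19990/armin-ronacher"))
# same selector as the Hpricot version, digits only
reputation = doc.css("td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i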
Try Hpricot, it's, well... awesome.
I've used it several times for screen scraping.
I always really like what Ilya Grigorik writes, and he wrote up a nice post about using hpricot.
I also read this post a while back and it looks like it would be useful for you.
Haven't done either myself, so YMMV but these seem pretty useful.
Something I ran into trying to do this before is that few web pages are well-formed XML documents. Hpricot may be able to deal with that (I haven't used it), but when I was doing a similar project in the past (using Python and its library's built-in parsing functions) it helped to have a pre-processor to clean up the HTML. I used the Python bindings for HTML Tidy as this pre-processor and it made life a lot easier. Ruby bindings are here, but I haven't tried them.
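A rough sketch in Ruby of that kind of pre-processing, shelling out to the tidy command-line tool instead of using bindings (assumes the tidy executable is installed; flag spellings vary slightly between versions):

require 'open-uri'

raw = open("http://stackoverflow.com/users/19990/armin-ronacher").read
# pipe the raw markup through tidy and read back well-formed XHTML
clean = IO.popen("tidy -q -asxhtml --show-warnings no", "r+") do |io|
  io.write(raw)
  io.close_write
  io.read
end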
Good luck!
This seems to be an old topic, but here is a newer example. Getting reputation:
#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'

user = "619673/100kg"
template = "http://stackoverflow.com/users/%s?tab=reputation"
page = template % user
puts page

doc = Hpricot(open(page))
# the reputation total sits in span.count inside the tab header
reputation = doc.search("div[@class='subheader user-full-tab-header']/h1/span[@class='count']").text
puts "reputation " + reputation
Related
I have a text file with a list of websites that goes something like this:
http://sb9.astro.ulb.ac.be/ProcessMainform.cgi?Catalog=HD&Id=58872++&Coord=&Epoch=2000&radius=10&unit=arc+min
http://sb9.astro.ulb.ac.be/ProcessMainform.cgi?Catalog=HD&Id=58515++&Coord=&Epoch=2000&radius=10&unit=arc+min
...
What I need to do is go to each website one by one, perhaps by just hitting enter and having it open in a new tab. I don't know if I'd build a script for this, or if there is already a tool available to do something like this.
I have some experience in Java, Python, and Fortran, but I am by no means a professional programmer.
I appreciate any help you may be able to provide.
Are you trying to test these websites for traffic, or actually view them in a browser? If you just need to hit them, Python is easy enough for that. This requires curl to be installed on the system path.
import os

file_in = open("input_list.txt", "r")
file_lines = file_in.readlines()
file_in.close()

# strip the trailing \n character from each line
file_lines = [line.strip() for line in file_lines]

for url in file_lines:
    os.system("curl " + url)
I'm trying to minify my HTML. I've just discovered and started using the gulp-htmlmin plugin.
My gulp task...
gulp.task('build-html', function() {
  return gulp.src(appDev + 'test.html')
    .pipe(htmlmin({collapseWhitespace: true}))
    .pipe(gulp.dest(appProd));
});
fails when applied to this valid HTML, or any document with a bare < character:
<div> < back </div>
The error is:
Error: Parse Error: < back </div>
at new HTMLParser (html-minifier\src\htmlparser.js:236:13)
I can think of two solutions:
Replace < with &lt; in all my templates. Doing so manually won't be fun, but that's life. Any idea how to do it in a gulp task?
Ditch this plugin in search of one that can parse and minify my templates. Any suggestions?
Guess I'm wondering how someone more experienced at building for deployment (I'm new) would handle this.
I'm afraid you're going to have to change them to &lt;. The html-minifier team has specifically stated they won't support bare <s.
You want to do this anyway, both to avoid tripping up parsers and to protect against certain XSS attacks. See When Should One Use HTML Entities, the W3C's recommendations, and OWASP's XSS prevention cheat sheet for more info.
The good news is any text editor worth its coding salt supports project-wide or at least multi-file search and replace. Assuming all your HTML <tags> don't have whitespace after the <, you should be able to just replace "< " with "&lt; ".
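If you'd rather script the replacement than trust an editor, a one-off pass over the templates works too. A minimal sketch in Ruby (the glob path is an assumption; point it at your template directory and run it under version control so you can review the diff):

# replace bare "< " with "&lt; " in every template
Dir.glob("src/**/*.html").each do |path|
  text = File.read(path)
  File.write(path, text.gsub("< ", "&lt; "))
end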
I decided to replace all < with &lt; since I figured this will probably save me some grief down the road. Besides, @henry made great points in his answer.
I was too big of a chicken though to trust my IDE to do a find and replace without breaking my code all kinds of ways. Instead I followed these steps:
Run gulp task from the OP
Notice the file that threw the parse error and go fix it
Run gulp task again
A new file throws the parse error. Go fix it
...
Eventually I fixed all the files.
I am creating a customized wiki markup parser/interpreter. One big task, however, is interpreting functions like these:
{{convert|500|ft|m|0}}
which is converted like so:
500 feet (152 m)
I'd like to avoid having to manually code interpretations of these functions, and would rather employ a method where I query a string
akiva@akiva-ThinkPad-X230:~$ wiki-to-text "{{convert|3|to(-)|6|ft|abbr=on}}"
and get a return of:
"3 to 6 ft (0.91–1.83 m)"
Is there a tool to do this? An offline solution would be ideal by far, but I could live with having to query a server.
You could query the MediaWiki API to get parsed text from wikitext. E.g. to parse the template Template:Done from the English Wikipedia you could use: https://en.wikipedia.org/w/api.php?action=parse&text={{Template:done}}&title=Test (see the online docs for parse). However, you need a MediaWiki instance that provides the template you want to parse and that works in exactly the same way. If you install a web server locally, you can install your own MediaWiki instance and parse wikitext locally, too.
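For instance, a rough Ruby sketch of that API call (the parameters follow the api.php parse docs; expanding {{convert}} only works against a wiki, like the English Wikipedia, that actually defines that template):

require 'open-uri'
require 'json'
require 'cgi'

wikitext = "{{convert|500|ft|m|0}}"
url = "https://en.wikipedia.org/w/api.php?action=parse&format=json" +
      "&contentmodel=wikitext&text=" + CGI.escape(wikitext)
# the rendered HTML sits under parse -> text -> *
html = JSON.parse(open(url).read)["parse"]["text"]["*"]
puts html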
Incidentally, there's also the Parsoid project, which implements a Node.js-based wikitext -> HTML -> wikitext parser. However, if I recall correctly, it still needs to query the wiki's API to expand templates.
Does anyone have a good way to escape HTML tags in Erlang (like CGI.escapeHtml does in Ruby)?
Thanks
Well, I would tell you to roll your own method using string and list processing. But I would also say that if you have the Yaws web server source, there is a method I have used and copied into my own libraries: yaws_api:url_encode(HtmlString). See it here in action.
1> Html = "5 > 4 = true".
"5 > 4 = true"
2> yaws_api:url_encode(Html).
"5%20%3E%204%20%3D%20true"
3>
I hope this is somehow what you needed. If so, you could just browse the Yaws web server source code, copy out this function, and use it in your own projects. Note that within the module yaws_api.erl you will have to make sure you copy out all the dependencies of this function, as klacke did a lot of pattern matching, function clauses, recursion, etc. Just copy the whole function and its small support functions from that source file and paste them somewhere in your projects. The other way would be to do it on your own by manipulating strings and lists. Those are my suggestions :)
When I run my Ruby script, I want the output to be rendered as HTML, preferably in a browser (e.g. Chrome). However, I would much prefer not to have to start a web service, because I'm not making a website. I've tried Sinatra, and the problem with it is that I have to restart the server every time I change my code; plus it is built around requests (like GET/POST arguments), which I don't really need.
I simply prefer the output from my Ruby program to appear as HTML rather than console text, since HTML allows for more creative/expressive output. Is there a good/simple/effective way to do this? (I'm using Notepad++ to edit my code, so if it's possible to combine the above with it somehow, that would be awesome.)
Thanks a lot :)
Using the shotgun gem, you can run a Sinatra app that automatically reloads changes without restarting the server.
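For example, assuming your app lives in a file called myapp.rb: install with gem install shotgun, start it with shotgun myapp.rb, and browse to http://localhost:9393. Shotgun reloads your code on each request, so edits show up on the next refresh.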
Alternatively, using a library like awesome_print, which has HTML formatting, you could write a function that takes the output and saves it to a file, then open the file in Chrome.
If you don't want to have to manually refresh the page in Chrome, you could take a look at guard-livereload (https://github.com/guard/guard-livereload), which will monitor a given file using the guard gem and reload Chrome. Ryan Bates has a screencast on Guard here: http://railscasts.com/episodes/264-guard.
Here's a function that overrides Kernel#puts to print the string to STDOUT and write the HTML-formatted version of it to output.html.
require 'awesome_print'

module Kernel
  alias :old_puts :puts

  def puts(string)
    old_puts string
    # append, so repeated puts calls don't overwrite earlier output
    File.open("output.html", "a") do |file|
      file.puts string.ai(:html => true)
    end
  end
end

puts "test"