Parse HTML using Ruby core libraries? (i.e., no gems required)

Some friends and I have been working on a set of scripts that make it easier to do work on the machines at uni. One of these tools currently uses Nokogiri, but for these tools to run on all machines with as little setup as possible, we've been trying to find a 'native' HTML parser instead of requiring users to install RVM and custom gems (due to disk space limitations for most users).
Are we pretty much restricted to Nokogiri/Hpricot/etc.? Or should we look at just writing our own custom parser that fits our needs?
Cheers.
EDIT: If there's posts on here that I've missed in my searches, let me know! S.O. is sometimes just too large to find things effectively...

There is no HTML parser in the Ruby stdlib.
HTML parsers have to be much more forgiving of bad markup than XML parsers.
You could run the HTML through tidy (http://tidy.sourceforge.net) to clean it up and produce valid markup.
That output can then be read via REXML :-) which is in the stdlib.
REXML is much slower than Nokogiri, though (as of 2009, when I last checked); Sam Ruby had been working on making REXML faster.
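A minimal sketch of that tidy-then-REXML approach (assuming the tidy binary is on the PATH; the file name and flags are just for illustration):

require "rexml/document"

# Shell out to the `tidy` command-line tool to turn messy HTML into
# well-formed XHTML, then parse the result with REXML from the stdlib.
def parse_html(path)
  xhtml = `tidy -q -asxhtml --numeric-entities yes #{path} 2>/dev/null`
  REXML::Document.new(xhtml)
end

doc = parse_html("page.html")
doc.elements.each("//a") { |a| puts a.attributes["href"] }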
A better way would be to have a better deployment.
Take a look at http://gembundler.com/bundle_package.html and use Capistrano (or some such) to provision machines.
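Roughly, the Bundler workflow would be: declare your gems in a Gemfile, run `bundle package` once on a machine with network access to vendor the .gem files into vendor/cache, then run `bundle install --local` on each target machine. A hypothetical Gemfile:

# Gemfile - the gem names here are just for illustration
source "http://rubygems.org"

gem "nokogiri"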

Related

Which technology should I use to transform my LaTeX documents into HTML documents?

I want to write a little program that transforms my TeX files into HTML. I want to parse the documents and turn the macros (the built-in ones and of course my own) into HTML pieces. Here are my requirements:
predefined rules (e.g. \begin{itemize} \item text \end{itemize} => <br> <p>text </p> <br/>) - see the sketch after this list
defining my own CSS style
ability to convert formulas (extract the formulas, render them with an image creator, and then save the jpg/png)
easy to maintain and concise
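For illustration, here is a minimal Ruby sketch of the "predefined rules" idea; the rule table and macro choices are hypothetical, not the asker's actual rules:

# Hypothetical rule table mapping TeX macros to HTML fragments.
RULES = {
  /\\textbf\{(.*?)\}/ => '<strong>\1</strong>',
  /\\emph\{(.*?)\}/   => '<em>\1</em>'
}

# Apply every rule to the source text in turn.
def tex_to_html(source)
  RULES.inject(source) { |text, (pattern, html)| text.gsub(pattern, html) }
end

puts tex_to_html('\textbf{bold} and \emph{italic}')
# => <strong>bold</strong> and <em>italic</em>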
I know there are several technologies out there, but I don't know exactly which is best for me. Here are the technologies that come to mind:
Ruby (I/O is easy, formula loading via Webrat),
XML/XSLT (I don't think I need it; it seems like just overhead),
Perl (there are many libs out there, but I'm not quite familiar with it),
bash (I worked with sed and was surprised how easy it was to work with regular expressions),
latex2html ... (these converters won't work for me, and they don't give me freedom in parsing).
Any suggestions, hints and comments are welcome.
Thanks for your time, folks.
Have a look at pandoc. It can also be installed on Linux or OS X, though it won't handle your custom macros. The only thing I've seen that can do a decent job with custom macros is tex4ht, but to really work well you need to be producing .dvi files. If you have a ton of custom macros, writing your own converter is going to take a huge amount of time; even if you only have a few, it's still going to be a pain. Good luck!
Six: TeX
Seven: Haskell
(I gave up trying to persuade SO to start numbering my list from 6).

JSP/.NET XML - simple client to display HTML

I am fairly proficient in PHP, but just starting out in ASP.NET and JSP/Java.
I would like to learn JSP/ASP.NET XML-to-HTML transformation with some simple practical examples. I'm not looking to learn how to edit XML, just how to display it, but I'm having trouble finding definitive examples/tutorials.
I've spent quite a while studying JSP/ASP.NET but quickly found how vast they are and how many different ways there are to do this (quite frankly, I'm a bit overwhelmed). I would be really grateful for advice before I embark upon this journey (and perhaps I will be saved from going in completely the wrong direction). If there are any tutorials or especially example apps you could point me towards, that would really help (I like to do hands-on learning).
For this I expect I need to do the following:
1) Set up a server for each technology (I'm using Tomcat and IIS at the moment - are these the best?)
2) Use some parameter-based routing system (MVC? But this is most surely overkill for me)
3) Parse the XML and create some variables/objects
4) Display the HTML (use template libraries - JSTL? Not sure for ASP.NET)
Any tutorials/example apps you could point me towards to help me through the above steps will be truly appreciated.
Thank you,
Ke
By the sounds of your skillset, carefully working through this developerWorks tutorial on JSTL looks like a good place for you to start. It covers the XML handling libs around part 4, and it'll also help you avoid the mistake of using scriptlets where JSTL would give cleaner, less error-prone and much more readable code.
You'll also most likely want IDE support, so that you get documentation, syntax checking and autocomplete. I personally use Eclipse (the EE download will have everything you need and more), but NetBeans might be the most straightforward to get you started.
Tomcat will be fine to get you started, but these IDEs tend to have built-in web containers to save you time in deploying and testing.

CGI language choice

Ok, I've asked a few related questions here and only ended up with more questions and I realized now it's because I don't have enough background info. So I'll make it more generic:
I need to make a simple web application. Static HTML/JQuery pages will send AJAX POST requests to some server side code, which will:
Read the POST variables passed in
Run some very simple logic
Hit a MySQL database for simple CRUD ops
Return a plain string of data to be consumed by the JavaScript on the page
I was assuming Ruby was a good choice for this as everyone is raving about how well it's designed, and I've been playing with it - not RoR, just Ruby for simple scripting tasks - and I kind of like it.
My question is, I'm hopelessly confused by the trillion helper libraries and frameworks out there. I don't know what these are, and therefore whether I need any or all of them: Rack, Sinatra, Camping, mod_ruby, FastCGI, etc.
Would it be easier to just learn PHP and use that? Or can I get away with just dropping my .rb files into the cgi-bin folder (I'm using Apache for hosting) and using the Ruby cgi library to get my variables?
EDIT: As for Rails, I'm just assuming it's overkill for what I want, but I might be wrong. I looked at it, and it seemed cool for generating data-backed web sites quickly, but that's not what I'm trying to do. I don't want any forms pages for the user. I don't want them entering data or viewing records. I don't even want to return any HTML. I just want a Ruby script to sit on the server, get passed a few variables in a POST request, and return a JSON string in response. I will need some basic cookie/session/state management.
This is a really easy thing to do in C# and ASP.NET with webservices, but it seems very confusing with the open source technologies.
You don't want to use any features of a full-blown framework, so don't use one. Less code = fewer bugs = fewer security nightmares.
CGI
CGI has some performance drawbacks compared to other methods, but it is still (in my opinion) the simplest and easiest one to use. This is how you use the built-in cgi library:
require "cgi"
cgi= CGI.new
answer= evaluate(cgi.params)
cgi.out do
answer
end
Rack
Another low-tech, easy-to-use variant would be Rack. Rack is an abstraction layer which works with many webserver interfaces (CGI, FastCGI, WEBrick, …). Its simplicity is comparable to plain CGI. Put the following into a file whose name ends with .ru in your cgi directory:
#!/usr/bin/rackup
require "rack/request"

run(lambda do |env|
  request = Rack::Request.new(env)
  # `evaluate` again stands in for your application logic.
  answer = evaluate(request.params)
  [200, {"Content-Type" => "text/plain"}, [answer]]
end)
This does not look very different from CGI, but it gives you many more possibilities. If you just execute this file on your local machine, rackup will start the WEBrick webserver, which will serve the pages you described in your .ru file.
Other interfaces
FastCGI
FastCGI works almost like CGI. The difference is that with CGI, your script gets started anew for every request it has to handle; with FastCGI, your script starts only once and then serves all requests. There is a library available for writing FastCGI scripts in Ruby.
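A minimal sketch with the Ruby fcgi bindings (a separate install; `evaluate` is the same placeholder application logic as in the CGI example above):

require "fcgi"

# Unlike plain CGI, this loop keeps the process alive and serves many
# requests with a single interpreter start-up.
FCGI.each_cgi do |cgi|
  answer = evaluate(cgi.params)
  cgi.out { answer }
end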
mod_ruby
mod_ruby embeds a Ruby interpreter in Apache. It works analogously to mod_php.
Mongrel
Mongrel is a standalone webserver for Ruby applications. Here is a simple hello-world example with it:
require 'mongrel'

class SimpleHandler < Mongrel::HttpHandler
  def process(request, response)
    response.start(200) do |head, out|
      head["Content-Type"] = "text/plain"
      out.write("hello world!\n")
    end
  end
end

h = Mongrel::HttpServer.new("0.0.0.0", "3000")
h.register("/hello", SimpleHandler.new)
h.run.join
Mongrel is often used for Rails and other Ruby frameworks. Most people run Apache or something else on port 80, and that webserver then distributes the requests to several Mongrel servers running on other ports. I think this is totally overkill for your needs.
Phusion Passenger
Passenger is also called mod_rails or mod_rack. It is a module for Apache and nginx that hosts Rails and Rack applications. According to their website, Rails with Passenger uses one third less RAM than Rails alone. If you write your software for Rack, you can make it a little faster by using Passenger instead of CGI or FastCGI.
Use jQuery and PHP.
Both technologies are well documented, and you should be able to get an application up and running in a matter of hours. You sound like you know a thing or two - talking about CRUD operations and so on - so I won't bore you with examples. And as far as JSON goes, there are probably a million PHP libraries out there for outputting JSON objects.
Sinatra is very simple to learn and use. It's also easy to deploy with Phusion Passenger (which is like mod_php for Ruby frameworks such as Rails and Sinatra). Instructions here: http://blog.squarefour.net/2009/03/06/deploying-sinatra-on-passenger/
If you find that you need more than what Sinatra gives you, I recommend Rails. Setting that up with Passenger is even easier, since hardly any configuration is required (see modrails.com).
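For the use case described (POST variables in, a JSON string out), a minimal Sinatra sketch might look like this; the route and parameter names are made up for illustration:

require 'rubygems'
require 'sinatra'
require 'json'

# Hypothetical endpoint: read POST variables, run some logic,
# and return a plain JSON string for the page's JavaScript to consume.
post '/lookup' do
  name = params[:name]            # illustrative parameter name
  content_type 'application/json'
  { :greeting => "Hello, #{name}" }.to_json
end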
PHP is very easy to use because it's designed specifically for this sort of thing. Want to read POST variables? They're in $_POST. Want to query MySQL? mysql_query("SELECT `something` FROM `table`");. And if you ever need help, Google searches for "php what_you_need_to_do" almost always return results on php.net, which is very helpful.
And for what you're doing, you don't need any additional frameworks.
I am curious about what I perceive to be your resistance to trying Rails. You say that you want "to spend more time on the scripting itself and less on configuration", and yet you seem to dismiss Rails out of hand. Rails is all about convention over configuration. If you take the time to learn how Rails does things, you can get an incredible amount of functionality "for free" just by following the conventions of the framework.
If you want to make a simple web app, Rails is really a very painless and good way to start. You can use a SQLite database and not even mess with MySQL (it won't scale, but for learning or simple apps it's fine). It's true that there are simpler frameworks, but since you seem new to web programming, I'd recommend that you start with what will get you the most support in terms of documentation and knowledgeable folks. Follow the old adage: get it working first, then optimize later.
The only sticking point I can see is the Apache integration... The consensus on Rails deployment these days seems to be focused on using lightweight httpds in place of Apache. There is mod_fcgid, which seems to be the best way to do it with Apache (mod_ruby is deprecated, buggy and slow, last I read) if you can install custom mods. Or there's Phusion Passenger, which seems to be the latest and greatest way to do it. Running Rails in a standard CGI environment will yield awful performance (as will any framework under plain CGI) due to the overhead of starting the interpreter + framework for every request. You'll get much better performance with something that keeps the interpreter + framework in memory.
I personally like Django. I had a problem with Ruby on Rails where I just got overwhelmed by everything when I wanted to do something simple, which it sounds like you want to do (since you said RoR feels like overkill). The cool thing I found with Django is that if you WANT everything, you can get it by plugging it in, but if you want less, you just don't plug in that technology, and it's that much more lightweight.
Take "views", for example. Django, like RoR, uses MVC. But if you just want to return a string of data and don't need a view, then you don't need to plug in the view layer. If you decide later on that it would be more organized in a view, you can easily plug one in at that time.
Here's their website: http://www.djangoproject.com/

Repository of BNF Grammars?

Is there a place I can find Backus–Naur Form or BNF grammars for popular languages? Whenever I do a search I don't turn up much, but I figure they must be published somewhere. I'm most interested in seeing one for Objective-C and maybe MySQL.
You have to search for the tools used to create grammars: "lex/yacc grammar", "antlr grammar", "railroad diagram".
http://www.antlr3.org/grammar/list.html
Here are some grammar files:
Objective-C
http://www.omnigroup.com/mailman/archive/macosx-dev/2001-March/022979.html
http://www.cilinder.be/docs/next/NeXTStep/3.3/nd/Concepts/ObjectiveC/B_Grammar/Grammar.htmld/index.html
https://github.com/pornel/objc2grammar
Python
http://www.python.org/dev/summary/2006-04-16_2006-04-30/#the-grammar-file-and-syntaxerrors
JavaScript
http://tomcopeland.blogs.com/EcmaScript.html
http://www.ccs.neu.edu/home/dherman/javascript/
Ruby
http://www.ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/yacc.html
FWIW, the MySQL grammar file (mysql-server/sql/sql_yacc.y) is open source and browsable at launchpad.net (though it's a bit slow, and I got an error when I tried to pull up the specific file).
Also, a snapshot of the whole MySQL Server source is downloadable from dev.mysql.com.
There are some links in the Wikipedia article on BNF, under "Language Grammars".
BNF Grammars for SQL-92, SQL-99 and SQL-2003
I also found a page that lists grammars for Objective C.
Objective-C grammar for Lex/Yacc Flex/Bison
Reference Manual for the Objective-C Language
IIRC, BNF grammars are just different enough from what yacc/bison want as input to be really annoying :) If you intend to feed these files into a parser generator, you may want to look for files in the appropriate format. I recall seeing such files for Java, JavaScript and C++ at one point. Probably as part of Eclipse, Firefox and GCC, respectively, but I can't remember for sure. I would assume you can find pretty much any parser input file by finding an open source project that uses that language.
I also searched for this, and I collected this repository:
http://slps.github.io/zoo/

best library to do web-scraping [closed]

I would like to get data from different webpages, such as addresses of restaurants or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?
If using Python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).
An extremely capable library, it makes scraping a breeze.
The HTML Agility Pack is awesome for .NET programmers. It turns webpages into XML docs that can be queried with XPath.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// select every anchor that carries an href attribute
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link["href"];
    att.Value = FixLink(att); // FixLink is the poster's own helper
}
doc.Save("file.htm");
You can find it here. http://www.codeplex.com/htmlagilitypack
I think the general answer here is: use any language + HTTP library + HTML/XPath parser. I find that using Ruby + Hpricot gives a nice clean solution:
require 'rubygems'
require 'hpricot'
require 'open-uri'

sites = %w(http://www.google.com http://www.stackoverflow.com)

sites.each do |site|
  doc = Hpricot(open(site))
  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end
For more on Hpricot see http://code.whytheluckystiff.net/hpricot/
I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser (i.e., you can follow links, fill out forms, or use the "back button" by calling methods on it).
For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
I think Watir or Selenium are the best choices. Most of the other libraries mentioned are actually HTML parsers, and that is not what you want... You are scraping; if the owner of the website wanted you to get at his data, he'd put a dump of his database or site on a torrent and avoid all the HTTP requests and expensive traffic.
Basically, you need to parse HTML, but more importantly, automate a browser - to the point of being able to move the mouse and click, basically really mimicking a user. You need to use a screen-capture program to grab the captchas and send them off to decaptcha.com (which solves them for a fraction of a cent) to circumvent that. Forget about saving the captcha file by parsing the HTML without rendering it in a browser 'as it is supposed to be seen'. You are screen scraping, not HTTP-request scraping.
Watir did the trick for me in combination with AutoItX (for moving the mouse and entering keys in fields - sometimes this is necessary to set off the right JavaScript events) and a simple screen-capture utility for the captchas. This way you will be most successful; it's quite useless writing a great HTML parser only to find out that the owner of the site has turned some of the text into graphics. (Problematic? No, just get an OCR library and feed it the jpeg; text will be returned.) Besides, I have rarely seen them go that far, although on Chinese sites there is a lot of text in graphics.
XPath saved my day all the time; it's a great domain-specific language (IMHO, I could be wrong) and you can get to any tag in the page, although sometimes you need to tweak it.
What I did miss was 'reverse templates' (the robot framework of Selenium has this). Perl had this in the CPAN module Template::Extract - very handy.
The HTML parsing, or the creation of the DOM, I would leave to the browser; yes, it won't be as fast, but it'll work all the time.
Also, libraries that pretend to be user agents are useless; sites are protected against scraping nowadays, and rendering the site on a real screen is often necessary to get beyond the captchas, but also to trigger the JavaScript events needed for information to appear, etc.
Watir if you're into Ruby, Selenium for the rest, I'd say. The 'Human Emulator' (or 'Web Emulator' in Russia) is really made for this kind of scraping, but then again it's a Russian product from a company that makes no secret of its intentions.
I also think that one of these weeks Wiley has a new book out on scraping; that should be interesting. Good luck...
I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.
The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.
I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html
It can be useful as a base; you'd probably want to create your own module that fits your restaurant-mining needs.
LWP would give you a basic crawler for you to build on.
There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends which run on top of Ruby Mechanize which make things even easier.
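A minimal sketch of Ruby Mechanize in action (the site, form, and field names are made up for illustration; depending on the version, the class may be WWW::Mechanize rather than Mechanize):

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get("http://www.example.com")   # hypothetical target site
form = page.forms.first
form["q"] = "restaurants"                    # illustrative field name
results = agent.submit(form)
results.links.each { |link| puts link.href } # list the result links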
What language do you want to use?
curl with awk might be all you need.
You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
I'd recommend BeautifulSoup. It isn't the fastest, but it performs really well with the non-well-formedness of (X)HTML pages, which most parsers choke on.
What someone said:
use ANY LANGUAGE.
As long as you have a good parser library and an HTTP library, you are set.
The tree stuff is slower than just using a good parsing library.