What language/tool should I use for HTML parsing?

I have a couple of websites that I want to extract data from and, based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).
Considering that I have no constraints regarding the technology, language, or tool that I can use, what are your suggestions for easily parsing and extracting data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, but even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).

You can use pretty much any language you like; just don't try to parse HTML with regular expressions.
So let me rephrase that and say: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.
If you're having issues with particular pages, I suggest you look into repairing them with HTML Tidy.
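For example, you could shell out to Tidy before handing the page to a stricter parser. A minimal Ruby sketch, assuming the tidy command-line tool is installed (file names are hypothetical):
require 'open3'

# Read the malformed page (hypothetical file name).
raw_html = File.read('broken_page.html')

# -q suppresses non-document output; -asxhtml emits well-formed XHTML.
cleaned, status = Open3.capture2('tidy', '-q', '-asxhtml', stdin_data: raw_html)

# Tidy exits 0 on success and 1 when it only emitted warnings.
File.write('repaired_page.html', cleaned) if status.exitstatus < 2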

I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great HTML scraping and browsing interface with the text-matching power of Ruby: http://scrubyt.org/
Here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb:
require 'rubygems'
require 'scrubyt'

# Simple example for scraping basic
# information from a public Twitter
# account.
# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml

If you choose Java as the language, the open-source library Jsoup will be a pretty good solution for you.

hpricot may be what you are looking for.
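For a taste of the API, here is a minimal hpricot sketch (the URL and selector are made up):
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://example.com'))

# The / operator searches by CSS selector and returns matching elements.
(doc/'td.price').each do |cell|
  puts cell.inner_text
end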

You may try PHP's DOMDocument class. It has a couple of methods for loading HTML content. I usually make use of this class. My advice is to prepend a DOCTYPE element to the HTML in case it doesn't have one, and to inspect in Firebug the HTML that results after parsing. In some cases, where invalid markup is encountered, DOMDocument does a bit of rearrangement of the HTML elements. Also, if there's a meta tag specifying the charset inside the source, be aware that it will be used internally by libxml when parsing the markup. Here's a little example:
$html = file_get_contents('http://example.com');

$dom = new DOMDocument;

// Silence libxml warnings about malformed markup while parsing
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($oldValue);

echo $dom->saveHTML();

Any language which works with HTML on the DOM level is good.
For Perl, it is the HTML::TreeBuilder module.
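To illustrate what working at the DOM level buys you, here is a minimal Ruby/Nokogiri sketch (the URL and selector are made up); HTML::TreeBuilder gives you the same style of traversal in Perl:
require 'nokogiri'
require 'open-uri'

# Nokogiri tolerates missing closing tags and still builds a usable DOM.
doc = Nokogiri::HTML(URI.open('http://example.com'))

# Query the repaired tree with CSS selectors (or XPath) instead of regexes.
doc.css('table td').each do |cell|
  puts cell.text.strip
end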


Ruby gem to quickly validate partial HTML snippets?

I'm making a customized quasi-CMS in Rails, and we'd like to have one field that is editable as an HTML fragment in code (the admin interface will be using CodeMirror on the frontend). When it's presented to the end user, it will just be html_safe'd and inserted into a div. We trust our content editors not to be malicious, but it would be helpful to ensure they're creating valid HTML so they don't break the page, especially since they're relatively new to coding!
As a first attempt, I'm using Hash.from_xml and rescuing exceptions as a custom validator. But is there a better and/or more-optimized way (i.e. a gem) to check that it is valid HTML?
Thanks!
You can use the Nokogiri library (and gem) to create a validator in your model. Using Nokogiri on fragments isn't perfect (so you might want to add the ability to override the validator) but it will catch many obvious errors that might break the page.
Example (assuming your model attribute/field is called content):
validate :invalid_html?

def invalid_html?
  doc = Nokogiri::HTML(self.content) do |config|
    config.strict
  end
  if doc.errors.any?
    errors.add(:base, "Custom Error Message")
  end
end
Instead of validation, perhaps it's worth using Nokogiri, which is capable of fixing markup:
require 'nokogiri'
html = '<div><b>Whoa</i>'
Nokogiri::HTML::DocumentFragment.parse(html).to_html
#=> "<div><b>Whoa</b></div>"
You probably want https://github.com/libc/tidy_ffi or http://apidock.com/rails/v4.0.2/HTML/WhiteListSanitizer (class method sanitize).
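If you go the Rails route, the sanitize view helper (which delegates to the white-list sanitizer) covers the common case. A minimal ERB sketch with a made-up model and whitelist:
<%# Strips all tags and attributes except the whitelisted ones %>
<%= sanitize @page.body, tags: %w(p b i a ul ol li), attributes: %w(href) %>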
I think this may be what you're looking for: be_valid_asset.

How can I modify a local HTML file in Perl?

Is there a CPAN module or code snippet that I can use to modify local HTML files without using regexes?
What I want to do:
Change a start tag (example: <div> to <div id="newtag">)
Add a tag before another (example: insert <script type="text/javascript">...</script> before </head>)
Remove tags
Read the content of a given tag (OK, this one can be done with an XML/HTML parser)
If you have HTML, and not XHTML, then you don't want to be using an XML parser.
HTML::Parser is the standard HTML parser for Perl. Pretty much everything else is built on top of it.
HTML::TokeParser is an alternative interface to HTML::Parser. It returns things on demand instead of passing everything to callbacks.
HTML::TreeBuilder builds a DOM-like tree from the HTML, which you can then modify.
HTML::TreeBuilder::XPath extends HTML::TreeBuilder with XPath support.
HTML::Query extends HTML::TreeBuilder with jQuery-like selectors.
pQuery is another module that brings more complete jQuery compatibility to HTML::TreeBuilder.
On CPAN: XML::XPath and XML::Xerces. A simple CPAN search returns more results: an XML search, XPath, and an XPath tutorial.
It sounds like you are not familiar with XPath. Here is a quick tutorial to get you familiar. It's not Perl, but it will explain the concepts.

How to sanitize user generated html code in ruby on rails

I am storing user-generated HTML code in the database, but some of the code is broken (missing end tags), and this messes up the rendering of the whole page.
How can I prevent this sort of behaviour with Ruby on Rails?
Thanks
It's not too hard to do this with a proper HTML parser like Nokogiri, which can perform clean-up as part of the parsing process:
bad_html = '<div><p><strong>bad</p>'
puts Nokogiri::HTML.fragment(bad_html).to_s
# <div><p><strong>bad</strong></p></div>
Once parsed properly, you should have fully balanced tags.
My google-fu reveals surprisingly few hits, but here is the top one :)
Valid Well-formed HTML
Try using the h() escape function in your ERB templates to sanitize. That should do the trick.
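For instance, a minimal ERB sketch (the model and attribute are made up):
<%# h() escapes <, >, & and quotes, so broken or malicious markup renders as inert text %>
<div class="user-content">
  <%= h(@post.body) %>
</div>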
Check out Loofah, an HTML sanitization library based on Nokogiri. This will also remove potentially unsafe HTML that could inject malicious script or embed objects on the page. You should also scrub out style blocks, which might mess up the markup on the page.
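A rough sketch of what that looks like (the input string is made up; see the Loofah docs for the full set of scrubbers):
require 'loofah'

unsafe = '<p>Hello</p><script>alert("xss")</script><style>p { display: none }</style>'

# :prune removes unsafe elements such as script and style along with their content.
puts Loofah.fragment(unsafe).scrub!(:prune).to_s
#=> "<p>Hello</p>"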

Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA_1 and DATA_2) kept between sets of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash.
Since it is HTML, I think this could work for you: https://metacpan.org/pod/XML::XPath
XPath is the way.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($content);
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
    my $data = $node->findvalue('.');  # the text content of the node
    print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
Use HTML parsing modules as described in the answers to this question: HTML::TreeBuilder or HTML::Parser.
Purely theoretically you could try using regular expressions to do this, but as noted in the linked question's answers and countless other times on SO, parsing HTML with regexes is a Bad Idea with capital letters: too easy to get wrong, too hard to do well, and impossible to get 100% right, since HTML is not a regular language.
You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

parse html in adobe air

I am trying to load and parse HTML in Adobe AIR. The main purpose is to extract the title, meta tags, and links. I have been trying HTMLLoader, but I get all sorts of errors, mainly uncaught JavaScript exceptions.
I also tried loading the HTML content directly (using URLLoader) and pushing the text into HTMLLoader (using loadString(...)), but got the same errors. As a last resort I tried to load the text into XML and then use E4X queries or XPath; no luck there, because the HTML is not well formed.
My questions are:
Is there a simple and reliable (AIR/ActionScript) DOM component out there (I do not need to display the page; headless mode will do)?
Is there any library to convert (crappy) HTML into well-formed XML so I can use XPath/E4X?
Any other suggestions on how to do this?
Thanks
ActionScript is supposed to be a superset of JavaScript, and thankfully, there's...
Pure JavaScript/ActionScript HTML Parser
created by JavaScript guru and jQuery creator John Resig :-)
One approach is to run the HTML through HTMLtoXML() then use E4X as you please :)
Afaik:
No :-(
No :-(
I think the easiest way to grab the title and meta tags is to write some regular expressions. You can load the page's HTML code into a string and then read out whatever you need, like this:
var str:String = ""; // put HTML code in here
var pattern:RegExp = /<title>(.+)<\/title>/i;
trace(pattern.exec(str));