QT HTML Parser (+XQuery) - html

I'm looking for a QT HTML parser tool.
I have some html source code and I'd like to use XQuery on it.
I already tried using QWebPage + QWebElement, but I don't like this solution cause firstly it doesn't works on non-gui thread (because of QWebPage) and because we can't apply XPath but CSS Path.
The other solution I tried is QXmlQuery, it works great, but the only problem is that it doesn't works if there is an error on the page. For example, the first page I tried was missing systemId (in the DOCTYPE tag), so the parsing was aborted.
I heard we can use gecko for parsing but I have no idea how to use it with QT.
Have you some suggestions ?
Thanks

I recommend that you use tidy on your HTML page and then process it with XQuery.
Zorba is a C++ XQuery processor that provides a tidy module.
You can find a live example at http://www.zorba-xquery.com/html/demo#tQZu6aq1K4KoGJm9m0oIPwKRt04=

BaseX got a QT client and can use TagSoup for cleaning up HTML documents.
I'm sorry I cannot provide you with an QT example as I don't know QT at all.

Related

Get HTML of page with awesomium

How do I get the HTML of a web page in awesomium with C++?
I've searched and apparently you can only do it with webcontrol in C# or in Java. Using the sample hello I tried doing:
JSValue theVal( view->ExecuteJavascriptWithResult(WSLit("document.getElementsByTagName('html')[0].innerHTML"),WSLit("")));
but it does not work. any ideas? and please in c++ as i am aware that you can do this in C# and Java.
Using Javascript you can do it like this:
web_view->ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
also you can use:
web_view->CopyHTML();
and then get HTML from the clipboard. I am not sure if there is another way of getting HTML without using Javascript.

Fixing malformed html that html tidy doesn't fix

Okay, so I've been utilizing HTML tidy to convert regular HTML webpages into XHTML suitable for parsing. The problem is the test page I saved in firefox had its html apparently somewhat precleaned by firefox during saving, call this File F. Html tidy works fine on file F, but fails on the raw data written to a file via .NET (file N). Html tidy is complaining about form tags being intermixed with table tags. The Html isn't mine so I can't just fix the source.
How do I clean up file N enough so that it can be run through Html tidy? Is there a standard way of hooking into Firefox (completely programmically without having to use mouse or keyboard) or another tool that will apply extra fixes to the html?
I had been using HTML tidy for some time, but then found that I was getting better results from TagSoup.
It can be used as a JAXP parser, converting non-wellformed HTML on the fly. I usually let it parse the input for Saxon XQuery transformations.
But it can also be used as a stand-alone utility, as an executable jar.
I wound up using SendKeys in C# and importing functions from user32.dll to set Firefox as the active window after launching it to the website I wanted (file:///myfilepathhere/).
SendKeys seemed to require running a windowed program, so I also added another executable which performs actions in its form_load() method.
By using alt+f, down six times, enter, wait for a bit, type full path file name, enter (twice) and then killing firefox, I was able to automate firefox's ability to clean some html up.

Rails - close html tags when you enter html in a form

Is there any way to close html tags if a user forgets to? E.g. when the user input is:
<b>small</b><i>test
Is there a way in Rails to automatically add the closing </i> tag, so that all the following html won't be italic?
I used .html_safe to interpret everything as html, but I would like to terminate <i> too.
Rails doesn't have any built in capability to do this however you have a couple of options:
Nokogiri - easy to install on pretty much all platforms (gem install nokogiri)
Tidy - The second post has the details to use it in linux and windows
Using nokogori you can simply do:
html = "<b>small</b><i>test"
clean = Nokogiri::HTML::DocumentFragment.parse(html).to_html
# clean = "<b>small</b><i>test</i>"
Rails can't do that to you.
You can use some template system like Haml or Slim to don't forget that.
You can do it by your own editor too.
A decent way to do this would be to feed the input to a dom parser and then have that parser output correct HTML.
Note that this isn't fool proof and if the user makes too many mistakes, the parser won't know what to do.
I'd suggest Nokogiri

parse html in adobe air

I am trying to load and parse html in adobe air. The main purpose being to extract title, meta tags and links. I have been trying the HTMLLoader but I get all sort of errors, mainly javascript uncaught exceptions.
I also tried to load the html content directly (using URLLoader) and push the text into HTMLLoader (using loadString(...)) but got the same error. Last resort was to try and load the text into xml and then use E4X queries or xpath, no luck there cause the html is not well formed.
My questions are:
Is there simple and reliable (air/action script) DOM component there (I do not need to display the page and headless mode will do)?
Is there any library to convert (crappy) html into well formed xml so I can use xpath/E4X
Any other suggestions on how to do this?
thx
ActionScript is supposed to be a superset of JavaScript, and thankfully, there's...
Pure JavaScript/ActionScript HTML Parser
created by Javascript guru and jQuery creator John Resig :-)
One approach is to run the HTML through HTMLtoXML() then use E4X as you please :)
Afaik:
No :-(
No :-(
I think the easiest way to grab title and meta tags is writing some regular expressions. You can load the page's HTML code into a string and then read out whatever you need like this:
var str:String = ""; // put HTML code in here
var pattern:RegExp = /<title>(.+)<\/title>/i;
trace(pattern.exec(str));

What language/tool should I use for HTML parsing?

I have a couple of websites that I want to extract data from and based on previous experiences, this isn't as easy as it sound. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tag, etc.).
Considering that I have no constraints regarding the technology, language or tool that I can use, what are your suggestions to easily parse and extract data from HTML pages? I have tried HTML Agility Pack, BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup parsing engine doesn't work with the pages I am passing to it).
You can use pretty much any language you like just don't try and parse HTML with regular expressions.
So let me rephrase that and say: you can use any language you like that has a HTML parser, which is pretty much everything invented in the last 15-20 years.
If you're having issues with particular pages I suggest you look into repairing them with HTML Tidy.
I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great html scraping and browsing interface with the text matching power of Ruby http://scrubyt.org/
here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb
require 'rubygems'
require 'scrubyt'
# Simple exmaple for scraping basic
# information from a public Twitter
# account.
# Scrubyt.logger = Scrubyt::Logger.new
twitter_data = Scrubyt::Extractor.define do
fetch 'http://www.twitter.com/scobleizer'
profile_info '//ul[#class="about vcard entry-author"]' do
full_name "//li//span[#class='fn']"
location "//li//span[#class='adr']"
website "//li//a[#class='url']/#href"
bio "//li//span[#class='bio']"
end
end
puts twitter_data.to_xml
As language Java and as a open source library Jsoup will be a pretty solution for you.
hpricot may be what you are looking for.
You may try PHP's DOMDocument class. It has a couple of methods for loading HTML content. I usually make use of this class. My advises are to prepend a DOCTYPE element to the HTML in case it hasn't one and to inspect in Firebug the HTML that results after parsing. In some cases, where invalid markup is encountered, DOMDocument does a bit of rearrangement of the HTML elements. Also, if there's a meta tag specifying the charset inside the source be careful that it will be used internally by libxml when parsing the markup. Here's a little example
$html = file_get_contents('http://example.com');
$dom = new DOMDocument;
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($oldValue);
echo $dom->saveHTML();
Any language which works with HTML on DOM level is good.
for perl it is HTML::TreeBuilder module.