Get HTML of page with awesomium - html

How do I get the HTML of a web page in awesomium with C++?
I've searched and apparently you can only do it with webcontrol in C# or in Java. Using the sample hello I tried doing:
JSValue theVal( view->ExecuteJavascriptWithResult(WSLit("document.getElementsByTagName('html')[0].innerHTML"),WSLit("")));
but it does not work. Any ideas? Please answer in C++; I am aware that you can do this in C# and Java.

Using Javascript you can do it like this:
web_view->ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
You can also use:
web_view->CopyHTML();
and then get HTML from the clipboard. I am not sure if there is another way of getting HTML without using Javascript.

Extend Markdown Parser to render custom code blocks

I am building a static blog, which uses Marked to parse markdown. I want to be able to have code blocks with tabs.
I want to parse code that looks like this:
```JavaScript
var geolocation = require("nativescript-geolocation");
```
```TypeScript
import geolocation = require("nativescript-geolocation");
```
To something like this (from the angular2 docs), where the tab names would be JavaScript and TypeScript.
I am programming in JavaScript (nodeJs), so I could manually render this if required? What would a custom implementation of a code block tab look like?
I am not sure if there is a special name for these, as I can't really seem to find any examples or templates.
I think the answer is: 'Marked' does not support custom tags. I spent a few hours trying to find some way to extend it and finally switched to showdown.
With showdown it appears to be really easy to implement one (here is an expandable section tag example).
Extension 'showdownjs/prettify-extension' implements code highlighting using Google Prettify.
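As for what a hand-rolled code block tab could look like: below is a minimal sketch in plain Node.js that does not use Marked or showdown at all. It pre-processes the markdown source and turns each run of directly adjacent fenced blocks into one HTML tab group. The function names (renderCodeTabs, escapeHtml), the CSS classes, and the data-tab attributes are all made up for illustration, and it assumes every fence carries a language label. How the result is then fed through your markdown parser, and the small script that switches panes on click, are left open.

```javascript
// Hypothetical helper: escape text so it can be placed inside HTML safely.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// One fenced block: opening fence with a language label, body, closing fence.
const FENCE = "`{3}([^\\s`]+)\\n([\\s\\S]*?)\\n`{3}";
// A run of fenced blocks separated by nothing but a single newline.
const GROUP_RE = new RegExp("(?:^" + FENCE + "$\\n?)+", "gm");
const BLOCK_RE = new RegExp("^" + FENCE + "$", "gm");

// Hypothetical entry point: replaces runs of two or more adjacent fenced
// blocks with a tab group; single blocks are returned untouched.
function renderCodeTabs(markdown) {
  return markdown.replace(GROUP_RE, (group) => {
    const blocks = [...group.matchAll(BLOCK_RE)].map((m) => ({
      lang: m[1],
      code: m[2],
    }));
    if (blocks.length < 2) return group;

    const tabs = blocks
      .map((b, i) => `<button class="tab" data-tab="${i}">${escapeHtml(b.lang)}</button>`)
      .join("");
    const panes = blocks
      .map(
        (b, i) =>
          `<pre class="pane" data-tab="${i}"><code class="language-${escapeHtml(
            b.lang.toLowerCase()
          )}">${escapeHtml(b.code)}</code></pre>`
      )
      .join("");
    return `<div class="code-tabs">${tabs}${panes}</div>\n`;
  });
}
```

Calling renderCodeTabs on the two blocks from the question would give one div with a JavaScript tab and a TypeScript tab, each paired with its own pre/code pane.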

HTML parsing in Clojure

I'm looking for a good way to parse HTML in Clojure.
Specifically, I want to fetch the content of a web page with a crawler and then extract the content of some HTML tags or their attributes.
So I have the URL of the page and I get the HTML as a String, but how do I get the data I need?
Use https://github.com/cgrand/enlive
It allows you to select and retrieve with CSS-alike selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with tag-soup but I can tell that enlive works well for most scraping.

Scrape Data within HTML Tags Perl

I'm writing a web scraper, and am a Perl novice. I'm using HTML::TreeBuilder to get the data I need, but I've run into a case I'm not sure how to handle. Here's some sample HTML:
<div class="anything" val="20" name="matchup">someUniqueData</div>
I want to extract the val from this HTML tag. I've been using findvalues() to do most of my work, but I don't know if this can pull data from inside tags. I've glossed over the documentation unsuccessfully. Is there a simple solution for this type of scrape?
You need (using HTML::TreeBuilder::XPath):
my ($val) = $tree->findvalues('//div[@class="anything"]/@val');

QT HTML Parser (+XQuery)

I'm looking for a QT HTML parser tool.
I have some html source code and I'd like to use XQuery on it.
I already tried using QWebPage + QWebElement, but I don't like this solution: first, it doesn't work on a non-GUI thread (because of QWebPage), and second, it only supports CSS selectors, not XPath.
The other solution I tried is QXmlQuery. It works great, but it fails if there is an error in the page. For example, the first page I tried was missing the systemId (in the DOCTYPE tag), so parsing was aborted.
I heard we can use Gecko for parsing, but I have no idea how to use it with QT.
Do you have any suggestions?
Thanks
I recommend that you use tidy on your HTML page and then process it with XQuery.
Zorba is a C++ XQuery processor that provides a tidy module.
You can find a live example at http://www.zorba-xquery.com/html/demo#tQZu6aq1K4KoGJm9m0oIPwKRt04=
BaseX has a QT client and can use TagSoup for cleaning up HTML documents.
I'm sorry I cannot provide you with a QT example, as I don't know QT at all.

parse html in adobe air

I am trying to load and parse HTML in Adobe AIR. The main purpose is to extract the title, meta tags and links. I have been trying HTMLLoader, but I get all sorts of errors, mainly uncaught JavaScript exceptions.
I also tried to load the HTML content directly (using URLLoader) and push the text into HTMLLoader (using loadString(...)), but got the same error. My last resort was to load the text into XML and then use E4X queries or XPath; no luck there, because the HTML is not well formed.
My questions are:
1. Is there a simple and reliable (AIR/ActionScript) DOM component out there (I do not need to display the page; headless mode will do)?
2. Is there any library to convert (crappy) HTML into well-formed XML so I can use XPath/E4X?
3. Any other suggestions on how to do this?
thx
ActionScript is supposed to be a superset of JavaScript, and thankfully, there's...
Pure JavaScript/ActionScript HTML Parser
created by Javascript guru and jQuery creator John Resig :-)
One approach is to run the HTML through HTMLtoXML() then use E4X as you please :)
Afaik:
1. No :-(
2. No :-(
3. I think the easiest way to grab title and meta tags is writing some regular expressions. You can load the page's HTML code into a string and then read out whatever you need like this:
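var str:String = ""; // put HTML code in here
// Non-greedy match, so only the text of the first <title> element is captured
var pattern:RegExp = /<title[^>]*>([\s\S]*?)<\/title>/i;
var match:Object = pattern.exec(str); // null when there is no <title>
if (match != null) trace(match[1]); // group 1 holds the title text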
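The same regular-expression idea stretches to meta tags and links. Here is a rough sketch, written in JavaScript rather than ActionScript and not tried inside AIR; extractPageInfo and its return shape are made up for illustration. It assumes attribute values are quoted, and like any regex-based approach it will trip over comments, scripts, and unusual markup.

```javascript
// Hypothetical helper: pull title, named meta tags and link targets
// out of an HTML string using regular expressions only.
function extractPageInfo(html) {
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = titleMatch ? titleMatch[1].trim() : null;

  // Read one quoted attribute value out of a single tag's source text.
  // The backreference \1 makes the closing quote match the opening one.
  const attr = (tag, name) => {
    const re = new RegExp("\\b" + name + "\\s*=\\s*([\"'])(.*?)\\1", "i");
    const m = re.exec(tag);
    return m ? m[2] : null;
  };

  // Collect <meta name="..." content="..."> pairs; other meta tags are skipped.
  const meta = {};
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const key = attr(tag, "name");
    if (key !== null) meta[key] = attr(tag, "content");
  }

  // Collect the href of every <a> tag that has one.
  const links = [];
  for (const [tag] of html.matchAll(/<a\b[^>]*>/gi)) {
    const href = attr(tag, "href");
    if (href !== null) links.push(href);
  }

  return { title, meta, links };
}
```

For anything beyond quick extraction, running the page through an HTML-to-XML cleanup step first, as suggested above, is likely to be more reliable.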