Selective HTML rendering heuristics for crawler - html

I use C++ and the obsolete Qt WebKit 5.5.1 (in order to support Windows XP) in a crawler. I render HTML in order to get the html/text content, but I'd like to minimize how often I render and skip downloading irrelevant resources in order to speed up crawling.
First, I use QNetworkAccessManager to fetch the web page, then pass it via the setContent() method to a QWebFrame instance (I handle redirects manually). I also have a QNetworkAccessManager descendant, which can be used to skip certain GET requests.
Which web page attributes definitely indicate that no rendering is required for simple text extraction?
Which GET requests generated by WebKit during rendering can be safely omitted if we just want to grab the text/links in the rendered HTML? For example, *.css?
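As a starting point for the request filter described above, here is a minimal sketch of an extension-based check. The helper name shouldSkipRequest and the extension list are assumptions for illustration, not a verified answer to which requests are safe to drop: scripts are deliberately not skipped because they can change the text, and a stylesheet can still matter if later scripts read computed styles.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical helper: returns true when the URL path ends in an extension
// that is ASSUMED not to affect the extracted text or links.
bool shouldSkipRequest(const std::string& url) {
    // Drop the query string and fragment so "style.css?v=3" still matches.
    std::string path = url.substr(0, url.find_first_of("?#"));
    // Compare case-insensitively by lowercasing the path.
    std::transform(path.begin(), path.end(), path.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    // Assumed list of presentation-only resource types; adjust to taste.
    static const std::array<const char*, 8> skipped = {
        ".css", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".woff", ".woff2"};
    for (const char* ext : skipped) {
        const std::string e(ext);
        if (path.size() >= e.size() &&
            path.compare(path.size() - e.size(), e.size(), e) == 0)
            return true;
    }
    return false;
}
```

In a QNetworkAccessManager descendant, a check like this would typically be called from the overridden createRequest(), using the string form of the request URL. Servers that deliver styles or images from extensionless URLs would need a Content-Type based check instead.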

Related

Where to find entire HTML content in Chromium source code

I am currently trying to do this: once the webpage loads, find out whether the URL matches a certain pattern (say www.wikipedia.com/*); if so, parse the HTML content of that webpage, as one can do with BeautifulSoup, and check whether the webpage has a div with class foo and id boo. Any idea where I can write this code? That is: where can I get access to the URL, what do I need to listen to in order to know that the webpage has finished loading (after which I can look at the URL and HTML content), and where and how can I parse the HTML?
I tried going through the code in src/chrome/browser/tab_contents, I could not find any reasonable place where I can do all this.
Take a look at the following conceptual application layers which represent how Chromium displays web pages:
Image Source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit
The different layers are described as:
WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.
Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).
Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.
WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.
Browser: Represents the browser window, it contains multiple WebContentses.
Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).
Since your goal is to access and interpret the HTML content of a web page by element and/or class, you can look to the rendering process which uses Blink:
The renderers use the Blink open-source layout engine for interpreting and laying out HTML.
Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page:
WebDocument document = GetMainFrame()->GetDocument();
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// document.Url();
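The URL-pattern part of the question can be handled independently of Blink. The following is a minimal sketch of a glob-style matcher in plain C++; the function name matchesUrlPattern and the choice of '*' as the only wildcard are assumptions for illustration, not part of any Chromium API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: glob-style match in which '*' matches any run of
// characters (including none) and every other character matches itself.
bool matchesUrlPattern(const std::string& pattern, const std::string& url) {
    std::size_t p = 0, u = 0;
    std::size_t star = std::string::npos;  // position of the last '*' seen
    std::size_t resume = 0;                // where in url that '*' started
    while (u < url.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;      // remember the wildcard and first try matching nothing
            resume = u;
        } else if (p < pattern.size() && pattern[p] == url[u]) {
            ++p;             // literal characters agree, advance both
            ++u;
        } else if (star != std::string::npos) {
            p = star + 1;    // backtrack: let the last '*' absorb one more char
            u = ++resume;
        } else {
            return false;    // mismatch with no wildcard to fall back on
        }
    }
    // Any trailing '*' in the pattern can match the empty string.
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}
```

The string form of the document URL would be the input here; scheme and host normalization (for example stripping "https://") is left out and would need to be decided before matching.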
The cleanest approach would be via the Chrome remote debugging protocol.
Use the DOM methods to get the root DOM node and then walk, search, or query the DOM.
This would make testing simpler as well: you can implement the logic in your favourite scripting language using an existing client library (there are many) and, once that works, implement it in C++.
If for some reason this has to run in-process within Chromium, as a next step start a thread that connects to the debugging endpoint and performs the operations.
You need to use a server-side library to parse the contents of a requested HTML page. In Java, for example, there is the "jsoup" library; there may be alternatives for other server-side languages. The main problem you could run into is "forbidden access" due to security restrictions, but since you are not trying to access REST services or similar, only parsing plain HTML to find string patterns, it should be easy to do with "jsoup". I have seen a project where similar things were programmed to access web site pages and parse the response HTML string.
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
See: https://jsoup.org/

Is it ok to place Microdata attributes on vuejs tags?

In our site we have product detail pages built with vuejs which include some non-standard HTML elements. e.g. we have the following to show the average rating for a product
<rating :stars="5"
:rating="#Model.Rating"
size="'large'"></rating>
Vuejs then transforms the above markup within the browser to show a number of star icons, producing quite different HTML markup for the browser.
We need to add support for Schema.org. Is it ok to add the itemprop="ratingValue" attribute to the above element e.g.:
<rating :stars="5"
:rating="#Model.Rating"
size="'large'"
itemprop="#Model.Rating"></rating>
or does Microdata require that the attributes are placed on standard HTML elements?
Take a look at https://ssr.vuejs.org/en/
Note that as of now, Google and Bing can index synchronous JavaScript applications just fine. Synchronous being the key word there. If your app starts with a loading spinner, then fetches content via Ajax, the crawler will not wait for you to finish. This means if you have content fetched asynchronously on pages where SEO is important, SSR might be necessary.
If you render your content server side your schema.org tags will work, but if the content which contains your tag is rendered after an API request or other asynchronous actions then a search engine will not be able to see your tags.
Note: itemprop="#Model.Rating" is probably not going to do what you want. You probably want :itemprop="#Model.Rating", or else you're going to get the literal #Model.Rating as the value of itemprop.

Reliable method of scraping page source i.e the tv at the beginning of each line?

When extracting data you can use CSS selectors/XPaths. But is there a similar, reliable method of doing this in the page source?
www.amazon.com/Best-Sellers-Electronics-Televisions/zgbs/electronics/172659
You could get the page source and then parse it using regex, but that would probably not be reliable if, for instance, a TV did not load on the page. I have looked up various solutions, but I have yet to find one that mentions getting every TV at the start of each line (1, 4, 7, etc. in the source) or using a reliable method, e.g. CSS selectors/XPaths, on the source of a page.
What is the gold standard for reliably doing what I am after?
To get the page source you can use CURL if the page is rendered entirely on server side (most pages won't be), or headless chrome to get the actual DOM that will render in the browser (https://developers.google.com/web/updates/2017/04/headless-chrome).
For scraping the content, I've used cheerio (https://github.com/cheeriojs/cheerio) which will allow you to read in HTML to an object and then scrape your data off that using jQuery expressions. (Headless chrome allows you to execute JS on the pages you visit, so you don't necessarily need cheerio).
In your specific example you could get the TV on each line by combining the right class selectors to get the divs containing TVs, and using an attribute selector with 'margin-left=0px', which would get the first item on each line. That is obviously very much bound to the structure of the page and will likely be broken by the smallest of changes in the page source. (And it is not really any different from using XPaths. Still better than regex, though.)
With certain elements loading / not loading on the page (if that was what you meant by TV not being there), no golden solutions that I know of, except allowing sufficient time for the page to load and handling your scraper failing gracefully.
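If the items have already been extracted in document order, the "1, 4, 7" positions from the question can also be picked by index rather than by style attributes. This is a minimal sketch of that positional alternative, not the selector approach described above; the function name firstOfEachRow and the assumption of a fixed number of items per line are illustrative, and the result is just as fragile if the page layout changes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the items at 1-based positions 1, 1+perRow, 1+2*perRow, ...
// i.e. the first item of each line when items are laid out perRow per line.
std::vector<std::string> firstOfEachRow(const std::vector<std::string>& items,
                                        std::size_t perRow) {
    std::vector<std::string> firsts;
    if (perRow == 0) return firsts;  // guard against an endless loop on bad input
    for (std::size_t i = 0; i < items.size(); i += perRow)
        firsts.push_back(items[i]);
    return firsts;
}
```

With three items per line, this selects the same positions as the CSS selector :nth-child(3n+1) would, provided the items are consecutive siblings.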

How to test the case that HTML <object> (or a similar feature) is unsupported?

I'm writing a web page that has an HTML <object> in it, like
<object [...]>Your browser does not support this.</object>
On all my machines I only have up-to-date browsers installed and don't want to clutter my machines with old browsers (this is actually not easily possible in most cases without depending on third-party-software and/or doing hours of configuration tweaking).
I know of pages like https://www.browserstack.com/ that let you render websites, but this is rather time consuming when I frequently need to check loads of small changes. And honestly I actually don't want to give my data to external companies just for a simple rendering.
How can I easily check how my page would look on old browsers?
Just found it out. The content between the <object></object> tags is not only triggered in unsupporting browsers, but also when the data attribute holds an invalid target (like an unavailable file).
So, to test how it looks on unsupporting browsers, one can simply set the data attribute to something unavailable. But keep in mind that the web designer then has to define a more meaningful message than just "Your browser does not support SVG", because the fallback is also shown when the object to display is simply missing (for example when the data attribute is set dynamically via PHP, like data=<?php echo getFile(); ?>, and the function returns something undefined).

WPF: Display HTML-based content stored in resource assembly

In my WPF project I need to render HTML-based content, where the content is stored in a resource assembly referenced by my WPF project.
I have looked at the WPF Frame and WebBrowser controls. Unfortunately, they both only expose Navigation events (Navigating, Navigated), but not any events that would allow me, based on the requested URL, to return HTML content retrieved from the resource assembly.
I can intercept navigation requests and serve up HTML content using the Navigating event and the NavigateToString() method. But that doesn't work for intercepting load calls for images, CSS files, etc.
Furthermore, I am aware of an HTML to Flowdocument SDK sample application that might be useful, but I would probably have to extend the sample considerably to deal with images and style sheets.
For what it is worth, we also generate the HTML content to be rendered (via Wiki pages), so the source HTML is somewhat predictable (e.g., maybe no JavaScript) in terms of referenced image locations and CSS style sheets used. We are not looking to display random HTML content from the internet.
Update:
There is also the possibility to create an MHT file for each HTML page, which would 'inline' all images as MIME-types and alleviate the need to have finer-grained callbacks.
If you're okay with using a 28 meg DLL, you may want to take a look at BerkeliumSharp, which is a managed wrapper around the awesome Berkelium library. Berkelium uses the chromium browser at its core to provide offscreen rendering and a delegated eventing model. There are tons of really cool things you can do with this, but for your particular problem, in Berkelium there is an interface called ProtocolHandler. The purpose of a protocol handler is to take in a URL and provide the HTTP headers and body back to the underlying rendering engine.
In the BerkeliumSharp test app (one of the projects available in the source), you can see one particular use of this is the FileProtocolHandler -- it handles all the file IO for the "file://" protocol using .NET managed classes (System.IO). You could do the same thing for a made up protocol like "resource://". There's really only one method you have to override called HandleRequest that looks like this:
bool HandleRequest (string url, ref byte[] responseBody, ref string[] responseHeaders)
So you'd take a URL like "resource://path/to/my/html" and do all the necessary Assembly.GetManifestResourceStream etc. in that method. It should be pretty easy to look at how FileProtocolHandler is used and adapt it for your own handler.
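The URL-to-resource mapping inside such a handler is mostly string manipulation. Below is a minimal, language-neutral sketch of that step, written in C++; the function name toResourceName, the "resource://" scheme, and the root namespace parameter are assumptions for illustration. It relies on the .NET convention that embedded resource names join the default namespace, folder names, and file name with dots.

```cpp
#include <algorithm>
#include <string>

// Hypothetical mapping from "resource://a/b/c.html" to "<rootNamespace>.a.b.c.html".
// Returns an empty string when the URL does not use the assumed scheme.
std::string toResourceName(const std::string& url, const std::string& rootNamespace) {
    const std::string scheme = "resource://";
    if (url.compare(0, scheme.size(), scheme) != 0) return std::string();
    std::string path = url.substr(scheme.size());
    // Embedded resource names use '.' where the URL path uses '/'.
    std::replace(path.begin(), path.end(), '/', '.');
    return rootNamespace + "." + path;
}
```

The resulting name is what would be passed to the resource lookup; handling of query strings, percent-encoding, and folder names that are not valid identifiers is left out of this sketch.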
Both Berkelium and BerkeliumSharp are open source under a BSD license.
The WebBrowser exposes a NavigateToStream(Stream) method that might work for you:
If your content is then stored as an embedded resource, you could use:
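// Requires: using System.Reflection; using System.Windows.Controls;
var browser = new WebBrowser();
var source = Assembly.Load("ResourceAssemblyName");
// NavigateToStream is the WebBrowser method that accepts a Stream.
browser.NavigateToStream(source.GetManifestResourceStream("ResourceNamespace.ResourceName"));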
There is also a NavigateToString(string) method that expects the string content of the document.
Note: I have never used this in anger, so I have no idea how much help it will be!