Where to find entire HTML content in Chromium source code

I am currently trying to do this: once a webpage loads, find out if its URL matches a certain pattern (say www.wikipedia.com/*); if so, parse the HTML content of that webpage the way one can with BeautifulSoup, and check whether the page has a div with class foo and id boo. Any idea where I can write this code? That is, where do I get access to the URL, what do I need to listen to so I know the webpage has finished loading (after which I can inspect the URL and HTML content), and where and how can I parse the HTML?
I tried going through the code in src/chrome/browser/tab_contents, but I could not find a reasonable place to do all this.

Take a look at the following conceptual application layers which represent how Chromium displays web pages:
Image Source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit
The different layers are described as:
WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.
Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).
Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.
WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.
Browser: Represents the browser window, it contains multiple WebContentses.
Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).
Since your goal is to access and interpret the HTML content of a web page by element and/or class, you can look to the rendering process, which uses Blink:
The renderers use the Blink open-source layout engine for interpreting and laying out HTML.
Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page. A natural renderer-side place to hook in is a content::RenderFrameObserver subclass, whose DidFinishLoad() override runs once the frame has finished loading:
WebDocument document = GetMainFrame()->GetDocument();
// The question's "div with id boo"; check element.IsNull() before use.
WebElement element = document.GetElementById(WebString::FromUTF8("boo"));
// document.Url() returns the page URL for pattern matching, and
// element.GetAttribute("class") lets you check for the "foo" class.

The cleanest approach would be via the Chrome remote debugging protocol (the DevTools protocol).
Use the DOM methods to get the root DOM node and then walk, search, or query the DOM.
This would make testing simpler as well: you can implement the logic in your favourite scripting language using an existing client library (there are many) and, once that works, reimplement it in C++.
If this for some reason has to be in-process within Chromium, as a next step start a thread that connects to the debugging port and performs the same operations.
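A minimal sketch of that flow, assuming Chrome was started with --remote-debugging-port=9222 and that TARGET_ID below is replaced with the page's webSocketDebuggerUrl target (listed at http://localhost:9222/json); it sends the protocol's DOM.getDocument and DOM.querySelector commands, with the selector from the original question, over the JDK's built-in WebSocket client (Java 11+):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class CdpDomQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URI: substitute the page's webSocketDebuggerUrl,
        // listed at http://localhost:9222/json once Chrome is started
        // with --remote-debugging-port=9222.
        URI uri = URI.create("ws://localhost:9222/devtools/page/TARGET_ID");

        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(uri, new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket webSocket,
                                                     CharSequence data,
                                                     boolean last) {
                        System.out.println(data); // raw JSON responses
                        return WebSocket.Listener.super.onText(webSocket, data, last);
                    }
                })
                .join();

        // Fetch the document root, then query it; real code would parse
        // root.nodeId out of the first response instead of assuming 1.
        ws.sendText("{\"id\":1,\"method\":\"DOM.getDocument\"}", true);
        ws.sendText("{\"id\":2,\"method\":\"DOM.querySelector\","
                + "\"params\":{\"nodeId\":1,\"selector\":\"div.foo#boo\"}}", true);

        Thread.sleep(2000); // crude wait for the responses in this sketch
    }
}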

You need to use a server-side library to parse the contents of the requested HTML page. In Java, for example, there is the library jsoup; there are alternatives for other server-side languages. The main problem you could run into is forbidden access due to security restrictions, but since you are not trying to access REST services or similar things, only parsing pure HTML to find string patterns, it should be easy to do with jsoup. I worked on a project where we programmed similar things: accessing web pages and parsing the response HTML string.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href");     // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
See: https://jsoup.org/
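Applied to the original question (check for a div with class foo and id boo), here is a short sketch; it assumes jsoup 1.11+ for selectFirst() and uses www.wikipedia.org purely as a stand-in URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivChecker {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
        // CSS selector for a <div> with class "foo" and id "boo".
        Element div = doc.selectFirst("div.foo#boo");
        System.out.println(div != null ? "found: " + div.text() : "not found");
    }
}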

Related

Ruby on Rails - How to find/manipulate DOM elements

I started learning Ruby/Rails about a month ago, but haven't been able to find many resources specific to my issue.
I understand that in HTML/JS you can do something like:
let elements = document.getElementsByName('name')
Is there a way in rails to get elements that share the class/id/name?
How can we interact with those elements? For example: if a div with a specific name already exists, append some data from our Rails application to that div instead of creating a new one.
Thank you in advance.
Unless I'm really missing the point, what you're asking for isn't possible.
DOM manipulation (using Javascript) is something that happens client-side, in the browser; the browser requests a page, the server responds with an HTML document, and then the browser builds the DOM and we go from there, running Javascript, and potentially inspecting and manipulating the DOM.
Ruby on Rails is server-side; in the above description, it would be involved in the "the server responds" step, but there is no DOM at that point; the server is simply generating an HTML document using models, views, and controllers.

Reliable method of scraping page source, i.e. the TV at the beginning of each line?

When extracting data you can use CSS selectors or XPath. But is there a similarly reliable method for doing this on the raw page source?
www.amazon.com/Best-Sellers-Electronics-Televisions/zgbs/electronics/172659
You could get the page source and then parse it using regex, but that would probably not be reliable if, for instance, the TV did not load on the page. I have looked up various solutions, but I have yet to find one that mentions getting every TV at the start of each line (items 1, 4, 7, etc. in the source) or one using a reliable method such as CSS selectors/XPath on the page source.
What is the gold-standard, reliable method of doing what I am after?
To get the page source you can use cURL if the page is rendered entirely on the server side (most pages won't be), or headless Chrome to get the actual DOM that will render in the browser (https://developers.google.com/web/updates/2017/04/headless-chrome).
For scraping the content, I've used cheerio (https://github.com/cheeriojs/cheerio) which will allow you to read in HTML to an object and then scrape your data off that using jQuery expressions. (Headless chrome allows you to execute JS on the pages you visit, so you don't necessarily need cheerio).
In your specific example you could get the TV on each line by combining the right class selectors to get the divs containing TVs, and using an attribute selector matching 'margin-left: 0px', which would get the first item on each line. That is obviously very much bound to the structure of the page and will likely be broken by the smallest of changes in the page source. (And it is not really any different from using XPath. Still better than regex, though.)
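As an illustration of that selector-based approach: the answer uses cheerio in Node, but the same idea in Java with jsoup (a deliberate swap of library) looks like the sketch below, where the div class name and the style test are hypothetical stand-ins for whatever the live markup actually uses:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstItemPerRow {
    public static void main(String[] args) throws Exception {
        // Works only if the markup arrives server-rendered; for
        // JS-rendered pages, feed in HTML captured via headless Chrome.
        Document doc = Jsoup.connect(
            "https://www.amazon.com/Best-Sellers-Electronics-Televisions/zgbs/electronics/172659")
            .userAgent("Mozilla/5.0").get();
        // Hypothetical class name; the attribute selector mirrors the
        // 'margin-left: 0px' idea described above.
        for (Element item : doc.select("div.zg-item[style*=margin-left]")) {
            System.out.println(item.text());
        }
    }
}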
As for certain elements loading or not loading on the page (if that was what you meant by the TV not being there), there are no golden solutions that I know of, except allowing sufficient time for the page to load and handling scraper failures gracefully.

Selective HTML rendering heuristics for crawler

I use C++ and the obsolete Qt WebKit 5.5.1 (in order to support Windows XP) in a crawler. I use HTML rendering in order to get the HTML/text content, but I'd like to minimize rendering frequency and skip downloading irrelevant content in order to speed up crawling.
First of all, I use QNetworkAccessManager to get the web page, then I pass it via the setContent() method to a QWebFrame instance (I handle redirects manually). I also have a QNetworkAccessManager descendant which can be used to skip certain GET requests.
What web page attributes definitely indicate that no rendering is required for simple text extraction?
Which GET requests, generated by WebKit during rendering, can be safely omitted if we just want to grab the text/links from the rendered HTML? For example, *.css?

dynamic HTML page to pdf

I know there is a list of similar questions, but they all handle pages without user interaction (static pages, even if some JS is present).
Let's say we have a page the user can interact with (e.g. an SVG that changes, or HTML tables with drill-down; the content changes). Those interactions will change the page. The same happens on Stack Overflow when entering a question...
The idea is to add a button, "convert to PDF", that takes the current state of the HTML and sends back to the user a PDF version (we have a Java server).
Using the browser's print function is not the answer I'm looking for :-).
Is this asking for the moon?
You would have to store the parameters that generate the HTML view (i.e. what the user clicks on, what selections they make, etc). If you can have a list of parameters that generate the HTML view, you can have a method which accepts that list of parameters (a JSON post?), generates the HTML view, and passes it to your PDF generating routine. I'm not too familiar with Java libraries for this purpose, but PHP has TCPDF, which can take HTML output and basically generate a PDF for you. Certainly there are Java libraries which will allow you to do the same thing, or you can use the parameters to get a list of rows/arrays which can be iterated over and output using the PDF library of your choice.
Both iTextPDF and Aspose.PDF would allow you to do that (I've seen them used in two different projects), but there is no magic and you will have to do some work.
The steps are roughly:
Get (as a string) the part of the document which you want to print with jQuery or innerHTML
Call a service on the server side to convert this to PDF
[Serverside] Use a whitelist-based tool to clean up the HTML (unless you want to be hacked); jsoup is great for that (see the sketch after this list).
[Serverside] Use the iText or Aspose API to create the PDF from the HTML (this is not trivial; you will have to read the docs).
Download the document
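A minimal sketch of steps 3 and 4, assuming jsoup 1.14+ (for Safelist) and iText's pdfHTML add-on (com.itextpdf:html2pdf) are on the classpath; the input string stands in for the HTML snapshot posted by the client:

import java.io.FileOutputStream;

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

import com.itextpdf.html2pdf.HtmlConverter;

public class SnapshotToPdf {
    public static void main(String[] args) throws Exception {
        // Stand-in for the HTML fragment posted by the client (steps 1-2).
        String rawHtml = "<div><b>current</b> state of the page</div>";
        // Step 3: whitelist-based cleanup; note that relaxed() drops
        // inline styles, so tune the safelist to keep what you need.
        String safeHtml = Jsoup.clean(rawHtml, Safelist.relaxed());
        // Step 4: iText's pdfHTML add-on turns the markup into a PDF.
        HtmlConverter.convertToPdf(safeHtml, new FileOutputStream("snapshot.pdf"));
    }
}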
I'd also recommend DocRaptor, an HTML to PDF API built by my company, Expected Behavior.
DocRaptor uses Prince XML to generate PDFs, and thus produces higher quality results than similar products.
Adding PDF generation to your own web application using our service is as simple as making an HTTP POST request to our server.
Here's a link to DocRaptor's home page:
DocRaptor
And a link to our API documentation:
DocRaptor API documentation
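To illustrate the shape of such an integration only: the endpoint, authentication, and payload format below are placeholders rather than DocRaptor's actual API (see the documentation linked above for the real parameters). The sketch posts HTML and saves the returned PDF bytes using the JDK's built-in HTTP client (Java 11+):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class PdfServiceCall {
    public static void main(String[] args) throws Exception {
        String html = "<h1>Report</h1><p>Captured page state goes here.</p>";
        // Placeholder endpoint and headers; the real URL, authentication,
        // and payload format come from the provider's API documentation.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://pdf-service.example.com/convert"))
                .header("Content-Type", "text/html")
                .POST(HttpRequest.BodyPublishers.ofString(html))
                .build();
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("out.pdf")));
        System.out.println("PDF saved to " + response.body());
    }
}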

WPF: Display HTML-based content stored in resource assembly

In my WPF project I need to render HTML-based content, where the content is stored in a resource assembly referenced by my WPF project.
I have looked at the WPF Frame and WebBrowser controls. Unfortunately, they both only expose Navigation events (Navigating, Navigated), but not any events that would allow me, based on the requested URL, to return HTML content retrieved from the resource assembly.
I can intercept navigation requests and serve up HTML content using the Navigating event and the NavigateToString() method. But that doesn't work for intercepting load calls for images, CSS files, etc.
Furthermore, I am aware of an HTML to Flowdocument SDK sample application that might be useful, but I would probably have to extend the sample considerably to deal with images and style sheets.
For what it is worth, we also generate the HTML content to be rendered (via Wiki pages), so the source HTML is somewhat predictable (e.g., maybe no JavaScript) in terms of referenced image locations and CSS style sheets used. We are not looking to display random HTML content from the internet.
Update:
There is also the possibility to create an MHT file for each HTML page, which would 'inline' all images as MIME-types and alleviate the need to have finer-grained callbacks.
If you're okay with using a 28 meg DLL, you may want to take a look at BerkeliumSharp, which is a managed wrapper around the awesome Berkelium library. Berkelium uses the chromium browser at its core to provide offscreen rendering and a delegated eventing model. There are tons of really cool things you can do with this, but for your particular problem, in Berkelium there is an interface called ProtocolHandler. The purpose of a protocol handler is to take in a URL and provide the HTTP headers and body back to the underlying rendering engine.
In the BerkeliumSharp test app (one of the projects available in the source), you can see one particular use of this is the FileProtocolHandler -- it handles all the file IO for the "file://" protocol using .NET managed classes (System.IO). You could do the same thing for a made up protocol like "resource://". There's really only one method you have to override called HandleRequest that looks like this:
bool HandleRequest (string url, ref byte[] responseBody, ref string[] responseHeaders)
So you'd take a URL like "resource://path/to/my/html" and do all the necessary Assembly.GetManifestResourceStream calls etc. in that method. It should be pretty easy to take a look at how FileProtocolHandler works and adapt it for your own.
Both berkelium and berkelium sharp are open source with a BSD license.
The WebBrowser exposes a NavigateToStream(Stream) method that might work for you:
If your content is stored as an embedded resource, you could use:
var browser = new WebBrowser();
var source = Assembly.Load("ResourceAssemblyName");
browser.NavigateToStream(source.GetManifestResourceStream("ResourceNamespace.ResourceName"));
There is also a NavigateToString(string) method that expects the string content of the document.
Note: I have never used this in anger, so I have no idea how much help it will be!