Reliable method of scraping page source, i.e. the TV at the beginning of each line? - html

When extracting data you can use CSS selectors or XPath. But is there a similarly reliable method of doing this on the page source?
www.amazon.com/Best-Sellers-Electronics-Televisions/zgbs/electronics/172659
You could get the page source and then parse it with a regex, but that would probably not be reliable if, for instance, a TV did not load on the page. I have looked up various solutions, but I have yet to find one that covers getting the TV at the start of each line (items 1, 4, 7, etc. in the source) or using a reliable method, e.g. CSS selectors or XPath, on the source of a page.
What is the gold standard for a reliable method of doing what I am after?

To get the page source you can use curl if the page is rendered entirely on the server side (most pages won't be), or headless Chrome to get the actual DOM that will render in the browser (https://developers.google.com/web/updates/2017/04/headless-chrome).
For scraping the content, I've used cheerio (https://github.com/cheeriojs/cheerio), which lets you read HTML into an object and then scrape your data off that using jQuery expressions. (Headless Chrome allows you to execute JS on the pages you visit, so you don't necessarily need cheerio.)
In your specific example you could get the TV on each line by combining the right class selectors to get the divs containing TVs, and using an attribute selector with 'margin-left=0px', which would get the first item on each line. That is obviously very much bound to the structure of the page and will likely be broken by the smallest of changes in the page source. (And it is not really any different from using XPath. Still better than regex, though.)
As for certain elements loading or not loading on the page (if that is what you meant by the TV not being there), there are no golden solutions that I know of, except allowing sufficient time for the page to load and making your scraper fail gracefully.
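As a rough illustration of the selector idea above, here is a minimal Python sketch that uses only the standard library's HTML parser instead of cheerio. The class name "item", the ids, and the inline "margin-left: 0px" style are assumptions made up for the example, not the real structure of the Amazon page.

```python
from html.parser import HTMLParser


class RowStartParser(HTMLParser):
    """Collect the ids of item divs whose inline style has a zero left margin."""

    def __init__(self, item_class):
        super().__init__()
        self.item_class = item_class  # hypothetical class marking a product div
        self.first_in_row = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Normalise the style so "margin-left: 0px" and "margin-left:0px" both match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        if self.item_class in classes and "margin-left:0px" in style.split(";"):
            self.first_in_row.append(attr.get("id"))


def first_items_per_row(html, item_class="item"):
    """Return the ids of the divs that start a row, in document order."""
    parser = RowStartParser(item_class)
    parser.feed(html)
    return parser.first_in_row
```

If an expected element is missing, the function simply returns a shorter list, so the caller can check the length and fail gracefully rather than crash.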

Related

Ruby on Rails - How to find/manipulate DOM elements

I started learning Ruby/Rails about a month ago, but haven't been able to find many resources specific to my issue.
I understand that in HTML/JS you can do something like:
let elements = document.getElementsByName('name')
Is there a way in Rails to get elements that share a class/id/name?
How can we interact with those elements? For example: if a div with a specific name already exists, append some data from our Rails application to that div instead of creating a new one.
Thank you in advance.
Unless I'm really missing the point, what you're asking for isn't possible.
DOM manipulation (using Javascript) is something that happens client-side, in the browser; the browser requests a page, the server responds with an HTML document, and then the browser builds the DOM and we go from there, running Javascript, and potentially inspecting and manipulating the DOM.
Ruby on Rails is server-side; in the above description, it would be involved in the "the server responds" step, but there is no DOM at that point; it's simply generating an HTML document, using models / a view / a controller.
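To make that concrete, here is a tiny sketch (in Python rather than Ruby, and not how Rails views are actually written) of what "generating an HTML document" means on the server. The function name and the "notes" id are invented for the example. Anything that should end up in the same div has to be put there while the text is being built.

```python
from html import escape


def render_page(notes):
    """Build the whole HTML document as text, the way a server-side view does."""
    # All notes go into one div while the document is generated, so there is
    # never an "existing" div to look up afterwards: there is no DOM yet.
    items = "".join(f"<p>{escape(note)}</p>" for note in notes)
    return f'<html><body><div id="notes">{items}</div></body></html>'
```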

Where to find entire HTML content in Chromium source code

I am currently trying to do this: once the webpage loads, find out whether the URL matches a certain pattern (say www.wikipedia.com/*); if so, parse the HTML content of that webpage, as one can with BeautifulSoup, and check whether the page has a div with class foo and id boo. Any idea where I can write this code? That is, where can I get access to the URL, what do I need to listen to in order to know that the webpage has finished loading (after which I can look at the URL and HTML content), and where and how can I parse the HTML?
I tried going through the code in src/chrome/browser/tab_contents, but I could not find any reasonable place where I can do all this.
Take a look at the following conceptual application layers which represent how Chromium displays web pages:
Image Source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit
The different layers are described as:
WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.
Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).
Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.
WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.
Browser: Represents the browser window, it contains multiple WebContentses.
Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).
Since your goal is to access and interpret the HTML content of a web page by element and/or class, you can look to the rendering process which uses Blink:
The renderers use the Blink open-source layout engine for interpreting and laying out HTML.
Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page:
WebDocument document = GetMainFrame()->GetDocument();
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// document.Url();
The cleanest way would be via the Chrome remote debugging protocol.
Use the DOM methods to get the root DOM node and walk, search, or query the DOM.
This would make testing simpler as well: you can implement the logic in your favourite scripting language using an existing client library (there are many) and, once that works, implement it in C++.
If this for some reason has to be in-process within Chromium, as a next step start a thread that connects to this and performs the operations.
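A minimal sketch of the messages involved, assuming a client library (not shown) handles the WebSocket connection to the debugging port. DOM.getDocument and DOM.querySelector are methods of the protocol's DOM domain; the helper names, message ids, and the URL pattern below are made up for illustration.

```python
import json
from fnmatch import fnmatchcase


def url_matches(url, pattern="*://www.wikipedia.com/*"):
    """Shell-style match of the page URL against a pattern (pattern is an example)."""
    return fnmatchcase(url, pattern)


def get_document_message(msg_id):
    """Command asking the browser for the root DOM node."""
    return json.dumps({"id": msg_id, "method": "DOM.getDocument"})


def root_node_id(response_text):
    """Pull the root nodeId out of the DOM.getDocument response."""
    return json.loads(response_text)["result"]["root"]["nodeId"]


def query_selector_message(msg_id, node_id, selector):
    """Command that runs a CSS selector below the given node."""
    return json.dumps({
        "id": msg_id,
        "method": "DOM.querySelector",
        "params": {"nodeId": node_id, "selector": selector},
    })
```

For the div with class foo and id boo from the question, the selector would be "div#boo.foo".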
You need to use a server-side library to parse the contents of a requested HTML page. In Java, for example, there is the library "jsoup"; there may be alternatives for other server-side languages. The main problem you could run into is "forbidden access" due to security restrictions, but since you are not trying to access REST services or similar things, only to parse plain HTML to find string patterns, it should be easy to do with "jsoup". There was a project where similar things were programmed: it accessed website pages and parsed the response HTML string.
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
See: https://jsoup.org/

Selective HTML rendering heuristics for crawler

I use C++ with the obsolete Qt WebKit 5.5.1 (in order to support Windows XP) in a crawler. I use HTML rendering in order to get the html/text content. But I'd like to minimize rendering frequency and skip downloading irrelevant stuff in order to speed up crawling.
First of all, I use QNetworkAccessManager to get the web page, then I pass it via the setContent() method to a QWebFrame instance (I handle redirects manually). I also have a QNetworkAccessManager descendant, which can be used to skip certain GET requests.
What web page attributes will definitely say that no rendering is required for simple text extraction?
What GET requests, generated by webkit during rendering, can be safely omitted if we just wanna grab text/links in the rendered html? For example, *.css?
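For illustration only, the kind of predicate such a QNetworkAccessManager descendant could consult might look like the Python sketch below. Which extensions are actually safe to skip is exactly the open question here, so the default set is an assumption, and *.css is deliberately left out of it because stylesheets can hide or reveal text.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of resource types that rarely affect extracted text or links.
SKIPPABLE_EXTENSIONS = frozenset(
    {".png", ".jpg", ".jpeg", ".gif", ".ico", ".woff", ".woff2", ".ttf", ".mp4"}
)


def should_skip(url, skippable=SKIPPABLE_EXTENSIONS):
    """Return True if the request URL's path ends in a skippable extension."""
    path = urlsplit(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in skippable
```

Passing a different set, e.g. one that adds ".css", makes it easy to compare the extracted text with and without a given resource type.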

Way To Modify HTML Before Display using Cocoa Webkit for Internationalization

I'm building a Mac OS X (Cocoa) application in Objective-C, and I'm using the native WebKit widget to display local files via file:// URLs, pulling from this folder:
MyApp.app/Contents/Resources/lang/en/html
This is all well and good until I start to need a German version. That means I have to copy en/html as de/html, then have someone replace the wording in the HTML (and some in the Javascript (like with modal dialogs)) with German phrasing. That's quite a lot of work!
Okay, that might seem doable until this creates a headache where I have to constantly maintain multiple versions of the html folder for each of the languages I need to support.
Then the thought came to me...
Why not just replace the phrasing with template tags like %CONTINUE%
and then, before the page is rendered, intercept it and swap it out
with strings pulled from a language plist file?
Through some API with this widget, is it possible to intercept HTML before it is rendered and replace text?
If it is possible, would it be noticeably slow such that it wouldn't be worth it?
Or do you recommend a strategy where I build a generator that I keep on my workstation, which builds each of the HTML folders for me from a main template, and then I deploy the already completed folders with my setup application once it has determined the user's language?
Through a lot of experimentation, I found an ugly way to do templating. Like I said, it's not desirable and has some side effects:
You'll see a flash on the first window load. On first load of the application window that has the WebKit widget, you'll want to hide the window until the second time the page content is displayed. I guess you'll have to use a property for that.
When you navigate, each page loads twice. It's almost not noticeable, but not good enough for good development.
I found an odd quirk with Bootstrap CSS where it made my table grid rows very large and didn't apply CSS properly for some strange reason. I might be able to tweak the CSS to fix that.
Unfortunately, I found no other event I could intercept on this except didFinishLoadForFrame. However, by then, the page has already downloaded and rendered at least once for a microsecond. It would be great to intercept some event before then, where I have the full HTML, and do the swap there before display. I didn't find such an event. However, if someone finds such an event -- that would probably make this a great templating solution.
- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame
{
    DOMHTMLElement *htmlNode =
        (DOMHTMLElement *)[[[frame DOMDocument] getElementsByTagName:@"html"] item:0];
    NSString *s = [htmlNode outerHTML];
    // Skip pages that were already rewritten; otherwise the reload below would loop forever.
    if ([s containsString:@"<!-- processed -->"]) {
        return;
    }
    NSURL *oBaseURL = [[[frame dataSource] request] URL];
    s = [s stringByReplacingOccurrencesOfString:@"%EXAMPLE%" withString:@"ZZZ"];
    // Mark the page as processed so the second load is left alone.
    s = [s stringByReplacingOccurrencesOfString:@"</head>" withString:@"<!-- processed -->\n</head>"];
    [frame loadHTMLString:s baseURL:oBaseURL];
}
The above will look at HTML that contains %EXAMPLE% and replace it with ZZZ.
In the end, I realized that this is inefficient because of the page flash, and on long bits of text that need a lot of replacing there may be a quite noticeable delay. The better way is to create a compile-time generator. That means making one HTML folder with %PARAMETERIZED_TAGS% inside instead of English text. Then, create a "Run Script" in your "Build Phase" that runs some program/script, written in whatever language you want, that generates each HTML folder from all the available lang-XX.plist files you have in a directory, where XX is a language code like 'en', 'de', etc. It reads each HTML file, finds the matching parameterized tag in the lang-XX.plist file, and replaces the tag with the text for that language. That way, after compilation, you have one HTML folder per language, already using your translated strings. This is efficient because you keep a single HTML folder where you handle your code, and you don't have to go through the extremely tedious process of creating each HTML folder in each language, nor maintain that mess. The compile-time generator does that for you. However, you'll have to build that compile-time generator.
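A minimal sketch of such a generator in Python, under a few assumptions that are mine rather than part of the setup described above: each lang-XX.plist is a flat dictionary keyed by tag name without the percent signs, tags consist of uppercase letters, digits, and underscores, and only *.html files are processed (JavaScript files would need the same treatment).

```python
import plistlib
import re
from pathlib import Path

# %TAG% placeholders: uppercase letters, digits, underscores (assumed convention).
TAG = re.compile(r"%([A-Z0-9_]+)%")


def fill_template(html, strings):
    """Replace each %TAG% with its translation; unknown tags are left as-is."""
    return TAG.sub(lambda m: strings.get(m.group(1), m.group(0)), html)


def generate(template_dir, plist_dir, out_dir):
    """Write out_dir/XX/html/... for every plist_dir/lang-XX.plist."""
    template_dir, out_dir = Path(template_dir), Path(out_dir)
    for plist_path in sorted(Path(plist_dir).glob("lang-*.plist")):
        lang = plist_path.stem[len("lang-"):]  # "lang-de" -> "de"
        with plist_path.open("rb") as fh:
            strings = plistlib.load(fh)
        for src in template_dir.rglob("*.html"):
            dest = out_dir / lang / "html" / src.relative_to(template_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            text = src.read_text(encoding="utf-8")
            dest.write_text(fill_template(text, strings), encoding="utf-8")
```

Leaving unknown tags untouched makes missing translations easy to spot in the generated pages; a stricter generator could fail the build instead.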

Supplying arguments to an image-generation program

We have a web application that creates a web page. In one section of the page, a graph is displayed. The graph is created by calling a graphing program from an "img src=..." tag in the HTML body. The graphing program takes a number of arguments (height, width, legends, etc.) plus the data to be graphed. The only way we have found so far to pass the arguments to the graphing program is the GET method. This works, but in some cases the length of the query string passed to the grapher approaches the URL length limit in Internet Explorer (about 2,083 characters). I've included an example of the tag below. If the query string is too long, it is truncated, and either the program bombs or, even worse, it displays a graph that is not correct (depending on where the truncation occurs).
The POST method with an auto submit does not work for our purposes, because we want the image inserted on the page where the grapher is invoked. We don't want the graph displayed on a separate web page, which is what the POST method does with the URL in the "action=" attribute.
Does anyone know a way around this problem, or do we just have to stick with the GET method and inform users to stay away from Internet Explorer when they're using our application?
Thanks!
One solution is to have the page put data into the session, then have the img generation script pull from that session information. For example page stores $_SESSION['tempdata12345'] and creates an img src="myimage.php?data=tempdata12345". Then myimage.php pulls from the session information.
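A rough sketch of that flow, written in Python rather than PHP. The plain dict stands in for the server-side session, and the function names are invented for the example; only the "tempdata" key prefix and the myimage.php URL come from the description above.

```python
import secrets
from urllib.parse import urlencode


def stash_graph_params(session, params):
    """Store the graph parameters server-side and return a short image URL."""
    token = "tempdata" + secrets.token_hex(8)  # unguessable key, fixed length
    session[token] = params
    return "myimage.php?" + urlencode({"data": token})


def load_graph_params(session, token):
    """What the image script does: look the parameters up by token."""
    return session[token]
```

The URL length no longer depends on how much data is graphed. Expiring old entries is left out here and would need handling in a real application.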
Another solution is to have the web application that generates the entire page pre-emptively call the actual graphing program with all the necessary parameters. Perhaps store the generated image in a /tmp folder.
Then have the web application create the web page and send it to the browser with an "img src=..." tag that, instead of referring to the graphing program, refers to the pre-generated image.
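A minimal sketch of the pre-generation step in Python. The render callable stands in for the real graphing program, and the file naming scheme and the "/graphs/" URL prefix are assumptions for the example.

```python
import hashlib
import json
from pathlib import Path


def pregenerate(params, render, image_dir, url_prefix="/graphs/"):
    """Render the graph to a file up front and return the img tag for the page."""
    # Name the file after a hash of the parameters so equal requests share one image.
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode("utf-8"))
    path = Path(image_dir) / f"graph-{digest.hexdigest()[:16]}.png"
    if not path.exists():
        path.write_bytes(render(params))  # render() must return the image bytes
    return f'<img src="{url_prefix}{path.name}">'
```

Because the parameters never travel through the URL, the query string limit stops mattering. Cleaning up old image files is not covered by this sketch.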