How to parse dynamically generated HTML code using HTML Agility Pack?

I need to parse dynamically generated HTML code using the HTML Agility Pack.
For example this code:
<div class="navigation_noClass"> There are 43 articles </div>
is not displayed in the web browser's View Source output, i.e. this code is only visible using inspection tools such as Firebug or Inspect Element.

At the moment, it sounds like you're feeding the HTML you receive directly into the Agility Pack, and thus skipping a few of the (vital?) steps a regular browser would perform, i.e. the execution of JavaScript and/or CSS.
There are numerous options for executing JavaScript, but most of the reasonably "self-contained" ones require you to recreate the DOM and its associated functionality. Not trivial.
And then there are those occasions where the CSS itself contains content (such as the Before / After pseudo-elements). As far as I know, there aren't many libraries around for simulating CSS behaviour on an HTML source outside of a browser.
All of this means that if you really need to capture the output of JavaScript and/or CSS execution, it might be easiest to wire a browser directly into your app's processing pipeline (such as one of the Chromium-based offerings) and interrogate its DOM, in a similar manner to the many functional web-testing suites.
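For illustration, a minimal sketch of that approach using Puppeteer (a Node.js driver for headless Chromium; PuppeteerSharp offers a similar API if you're in .NET with the Agility Pack) - the URL here is a placeholder, and the selector is the one from the question above:
const puppeteer = require('puppeteer');

(async () => {
  // Launch headless Chromium and let the page's JavaScript run to completion.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://example.com/articles', { waitUntil: 'networkidle0' });

  // Interrogate the rendered DOM rather than the raw HTTP response.
  const text = await page.$eval('div.navigation_noClass', el => el.textContent);
  console.log(text); // e.g. " There are 43 articles "

  await browser.close();
})();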
NB: If this is a seriously sized, server-style processing task, you may want to hive such processing off onto dedicated servers / app pools / processes to give your app a fighting chance at decent uptime and/or memory usage.

Related

Node-RED HTML node to see Inspect instead of View Source

I am trying to have Node-RED go to my router's IP and search through the HTML code to see whether a certain device is on the list. When I right-click and inspect, I can hover over the list I am interested in and see the HTML information I am looking for. When I use the HTML node, it seems to only look through the View Page Source information, which does not have what I am looking for. Is there a way to point the HTML node at a more specific element instead of the page source as a whole?
It sounds like the data in the page on your router might be dynamically generated using JavaScript.
This means that when the page is loaded it only has the outline, and the rest is filled in by code using XHR requests to a different URL that supplies the information.
In order for Node-RED to be able to extract the information from the page, it would need to load the outline and then effectively run all the JavaScript. Libraries like PhantomJS can do this.
There is a contrib node that might be able to help, node-red-contrib-nbrowser, but the better approach would probably be to work out what URL the JavaScript is calling and call that directly, as the data is most likely to be in a format that is easier to process (e.g. JSON).
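For illustration only - the endpoint and field names below are invented, since the real URL has to be discovered from the browser's Network tab while the router page loads - calling such an endpoint directly from Node.js might look like this:
const http = require('http');

// Hypothetical endpoint: substitute the real URL found in the Network tab.
http.get('http://192.168.1.1/api/devices', (res) => {
  res.setEncoding('utf8');
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const devices = JSON.parse(body); // assuming the endpoint returns a JSON array
    const found = devices.some((d) => d.name === 'my-device'); // hypothetical field name
    console.log(found ? 'device is on the list' : 'device not found');
  });
});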

Scraping with Cheerio, text is not visible

So I've been web scraping with Cheerio, and I'm able to find the particular HTML element that I'm looking for, but for some reason the text is not there.
For example, in my web browser, when I inspect the element I see the text "Why Him?".
But when I print out the object while scraping, the element is empty, so when I call the .text() function, it doesn't return anything. Why does this happen?
Inspect Element is not a valid test of whether Cheerio will be able to see something. You must use View Source instead.
Inspect Element is a live view of how the browser has rendered an element after applying all of the various technologies that exist in a browser, including CSS and JavaScript. View Source, on the other hand, is the raw code that the server sent to the browser, which you can generally expect to be the same as what Cheerio will receive. That is, assuming you ensure the HTTP headers are identical, particularly the ones relevant to content negotiation.
It is important to understand that while Cheerio is a DOM parser, it does not simulate a browser. So if the text is added via JavaScript, for example, then the text will not be there because that JavaScript will not have run.
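A small sketch of that difference, with invented markup: the element exists in the server-sent HTML, but its text is only filled in by a script that Cheerio never runs:
const cheerio = require('cheerio');

// What the server sent: an empty element plus the script that would fill it in.
const html = '<div id="title"></div>' +
  '<script>document.getElementById("title").textContent = "Why Him?";</script>';

const $ = cheerio.load(html);
console.log($('#title').text()); // "" - empty, because the script never ran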
If browser simulation is important to you, you should look into using PhantomJS. If you need a highly realistic browser rendering setup, then look into WebDriver and Leadfoot.

Get the "real" source code from a website

I've got a problem getting the "real" source code from a website:
http://sirius.searates.com/explorer
Trying it the normal way (view-source:) in Chrome, I get a different result than when using the Inspect Element function. And the code which I can see using that function is the one that I would like to have... How is it possible to get this code?
This usually happens because the UI is actually generated by a client-side JavaScript utility.
In this case, most of the screen is generated by HighCharts, and a few elements are generated/modified by Bootstrap.
The DOM inspector will always give you the "current" view of the HTML, while view source gives you the "initial" view. Since view source does not run the JavaScript utilities, much of the UI is never generated.
To get the most up-to-date (HTML) source, you can use the DOM inspector to find the root html node, right-click and select "Edit as HTML". Then select-all and copy/paste into your favorite text editor.
Note, though, that this will only give you a snapshot of the page. Most modern web pages are really browser applications and the HTML is just one part of the whole. Copy/pasting the HTML will not give you a fully functional page.
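Alternatively, assuming Chrome or Firefox DevTools, where copy() is a built-in console helper, you can serialize the current DOM straight from the console instead of copy/pasting from the inspector:
copy(document.documentElement.outerHTML); // puts the rendered HTML on the clipboard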
You can get the real-time HTML with this bookmarklet; bookmark this URL:
javascript:document.write('<textarea width="400">'+document.body.innerHTML+'</textarea>');

Render a full web page in node.js code

I am running a node.js server, and it is rendering a web page wonderfully. When I look at this in a browser, it runs exactly as I expect.
However, what I actually want to do is make a call that fully generates the HTML page - exactly as it is in the browser - from within the node.js code. Currently, I have tried this:
http.request("http://localhost:8000/").end();
(with a few variants). This does exactly what it says, which is to make the single call to the server for the page - what it doesn't do is actually render the page, pulling in all of the other script files, and running the code on the page.
I have tried exploring express and ejs, and I think I need to use one of these, but I cannot find out how to do this fairly straightforward task. All it needs to do is render an HTML page, but it seems to be a whole lot more complex than it should be.
What output do you want? A string of HTML? Maybe you want PhantomJS the headless browser. You could use it to render the page, then get the rendered DOM as a string of HTML.
Use Mikeal's Request module to make HTTP requests; once you've captured the response, you can then inspect the HTML however you like.
To make that easier, though, you should use cheerio, which will give you a jQuery-style API for manipulating the HTML.
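A minimal sketch combining the two, using the localhost URL from the question (note this only sees the HTML the server sent, not anything JavaScript adds afterwards):
const request = require('request');
const cheerio = require('cheerio');

request('http://localhost:8000/', (error, response, body) => {
  if (error) throw error;
  const $ = cheerio.load(body); // jQuery-style API over the raw HTML
  console.log($('title').text());
});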
Perhaps you are looking for wkhtmltopdf?
In a nutshell, it will render an entire web page (including images and JavaScript) to a PDF document.
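Basic command-line usage, with the localhost URL from the question and an output filename of your choosing:
wkhtmltopdf http://localhost:8000/ page.pdf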

Is it possible to get the generated source of a webpage programmatically?

As the title states, I am wondering if there is a method to obtain the generated HTML code of a page. Obviously I can inspect the page with web developer tools (browser built-in, or an external program) and get it, but I would really like to do it automatically. Perhaps it would be possible using Fiddler's API?
Thanks!
"Source" doesn't get altered by JavaScript after page load, it's the document object model (DOM) generated from the source that gets altered. It is this DOM that is then translated to the GUI, and is altered with every change as long as the page is not re-loaded.
The DOM is not a string of HTML code, it is an in-memory hierarchical object representation of the page. The browser does not maintain an up-to-date, flat-file representation of the DOM as it gets altered, which is why when you "view source" you only ever see what was originally sent to the browser over HTTP.
The node-for-node representation of the page/DOM in developer tools such as Firebug is the closest you'll get to a re-generation of the source code (AFAIK) without building some new tool yourself.
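That said, a headless browser such as PhantomJS (mentioned elsewhere on this page) can do this programmatically: load the page, run its JavaScript, and hand back the generated markup. A minimal sketch - the URL is a placeholder; save it as dump.js and run phantomjs dump.js:
var page = require('webpage').create();

page.open('http://example.com/', function (status) {
  if (status === 'success') {
    // page.content is the live DOM serialized back to an HTML string
    console.log(page.content);
  }
  phantom.exit();
});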
You may be able to write a script in Python that takes a variable (the URL) and passes it to a command that downloads the webpage, such as wget.
Googling it, I found this for parsing HTML files: maybe you could wget the index.html and use one of these:
How do you parse and process HTML/XML in PHP?