So I've been web scraping with Cheerio, and I'm able to find the particular HTML element I'm looking for, but for some reason the text is not there.
For example, in my web browser, when I inspect the element I see "Why Him?".
But when I print out the object while scraping, the text is missing, and when I call the .text() function it doesn't return anything. Why does this happen?
Inspect Element is not a valid test of whether Cheerio will be able to see something. You must use View Source instead.
Inspect Element is a live view of how the browser has rendered an element after applying all of the various technologies that exist in a browser, including CSS and JavaScript. View Source, on the other hand, is the raw code that the server sent to the browser, which you can generally expect to be the same as what Cheerio will receive. That is, assuming you ensure the HTTP headers are identical, particularly the ones relevant to content negotiation.
It is important to understand that while Cheerio is a DOM parser, it does not simulate a browser. So if the text is added via JavaScript, for example, then the text will not be there because that JavaScript will not have run.
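To make that concrete, here's a minimal sketch (the markup and the .title selector are made up purely for illustration) showing why .text() comes back empty when the text is injected by a script:

var cheerio = require('cheerio');

// Simplified, hypothetical example of what a server might actually send:
var html = '<h1 class="title"></h1>' +
           '<script>document.querySelector(".title").textContent = "Why Him?";</script>';

var $ = cheerio.load(html);
console.log($('.title').text()); // "" - Cheerio parses the markup but never runs the script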
If browser simulation is important to you, you should look into using PhantomJS. If you need a highly realistic browser rendering setup, then look into WebDriver and Leadfoot.
Related
I'm trying to load a page's content with:
HttpResponse response2 = HttpRequest.get(_PAGE_URL).cookies(response.cookies()).send();
In a browser, the page source is full of JavaScript that generates the DOM, but in the browser's Web Inspector I can see the generated source.
The question is: can I somehow retrieve the generated page content with Jodd's utilities?
You can't. You can only download the static HTML content (as you did); you would then need to render it using some other tool.
Since Java 8, you can use JavaFX's WebView component (as far as I remember), but please search for other tools as well (maybe CEF?).
EDIT
See: https://github.com/igr/web-scraper (based on Selenium WebDriver). One thing I miss is better control over request/response.
There is also HtmlUnit, but from the reviews, it seems Selenium is a better choice.
I've got a problem getting the "real" source code from a website:
http://sirius.searates.com/explorer
Trying it the normal way (view-source:) in Chrome, I get a different result than when using the Inspect Element function, and the code I can see using that function is the one I would like to have... How can I get this code?
This usually happens because the UI is actually generated by a client-side JavaScript utility.
In this case, most of the screen is generated by Highcharts, and a few elements are generated/modified by Bootstrap.
The DOM inspector will always give you the "current" view of the HTML, while View Source gives you the "initial" view. Since View Source does not run the JavaScript utilities, much of the UI is never generated.
To get the most up-to-date (HTML) source, you can use the DOM inspector to find the root html node, right-click and select "Edit as HTML". Then select-all and copy/paste into your favorite text editor.
Note, though, that this will only give you a snapshot of the page. Most modern web pages are really browser applications and the HTML is just one part of the whole. Copy/pasting the HTML will not give you a fully functional page.
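If you want that snapshot without the manual copy/paste, one rough alternative is to serialize the live DOM from the DevTools console (note that copy() is a DevTools-only console helper, not standard JavaScript):

// Run in the browser's DevTools console on the fully rendered page:
copy(document.documentElement.outerHTML); // puts the current, generated markup on the clipboard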
You can get the current HTML with this URL; bookmark it:
javascript:document.write('<textarea width="400">'+document.body.innerHTML+'</textarea>');
I am running a node.js server, and it is rendering a web page wonderfully. When I look at this in a browser, it runs exactly as I expect.
However, what I actually want to do is make a call, within the Node.js code, that fully generates the HTML page exactly as it appears in the browser. Currently, I have tried this:
http.request("http://localhost:8000/").end();
(with a few variants). This does exactly what it says, which is to make a single call to the server for the page; what it doesn't do is actually render the page, pull in all of the other script files, and run the code on the page.
I have tried exploring Express and EJS, and I think I need to use one of these, but I cannot work out how to do this fairly straightforward task. All it needs to do is render an HTML page, but it seems to be a whole lot more complex than it should be.
What output do you want? A string of HTML? Maybe you want PhantomJS the headless browser. You could use it to render the page, then get the rendered DOM as a string of HTML.
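For example, a rough PhantomJS sketch of that idea, pointed at the local server from the question:

// save as render.js and run with: phantomjs render.js
var page = require('webpage').create();

page.open('http://localhost:8000/', function (status) {
    if (status !== 'success') {
        console.log('Failed to load the page');
    } else {
        // page.content is the serialized DOM after the page's scripts have run;
        // for content that loads asynchronously you may need to wait before reading it.
        console.log(page.content);
    }
    phantom.exit();
});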
Use Mikeal's Request module to make HTTP requests; once you have captured the response, you can inspect the HTML however you like.
To make that easier, though, you should use Cheerio; this will give you a jQuery-style API to manipulate the HTML.
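A small sketch of that combination (the URL is the local server from the question; keep in mind this still only sees the HTML as served, not anything a client-side script adds later):

var request = require('request');
var cheerio = require('cheerio');

request('http://localhost:8000/', function (err, res, body) {
    if (err) throw err;
    var $ = cheerio.load(body);      // body is the raw HTML string the server returned
    console.log($('title').text());  // query/manipulate it with a jQuery-style API
});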
Perhaps you are looking for wkhtmltopdf?
In a nutshell, it will render an entire web page (including images and JavaScript) to a PDF document.
As the title states, I am wondering if there is a method to obtain the generated HTML code of a page. Obviously I can inspect the page with web developer tools (browser built-in, or an external program) and get it, but I would really like to do it automatically. Perhaps it would be possible using Fiddler's API?
Thanks!
"Source" doesn't get altered by JavaScript after page load, it's the document object model (DOM) generated from the source that gets altered. It is this DOM that is then translated to the GUI, and is altered with every change as long as the page is not re-loaded.
The DOM is not a string of HTML code, it is an in-memory hierarchical object representation of the page. The browser does not maintain an up-to-date, flat-file representation of the DOM as it gets altered, which is why when you "view source" you only ever see what was originally sent to the browser over HTTP.
The node-for-node representation of the page/DOM in developer tools such as Firebug is the closest you'll get (AFAIK) to a regeneration of the source code without building some new tool yourself.
You may be able to write a script in Python that takes a variable (the URL) and passes it to a command that downloads the web page, such as wget.
Googling it, I found these for parsing HTML files: maybe you could wget the index.html and use one of them:
How do you parse and process HTML/XML in PHP?
I need to parse dynamically generated HTML code using the HTML Agility Pack.
For example this code:
<div class="navigation_noClass"> There are 43 articles </div>
is not displayed in the web browser's Page Source view, i.e. this code is only visible using inspection tools such as Firebug, Inspect Element ...
Right at the moment, it sounds like you're feeding the HTML you received directly into the Agility Pack and thus missing a few of the (vital?) steps which a regular browser would perform,
i.e. the execution of JavaScript and/or CSS.
There are numerous options for executing JavaScript, but most of the reasonably "self-contained" options require you to recreate the DOM and the associated functionality. Not trivial.
And then there are those occasions where CSS contains content (such as the Before/After pseudo-elements). As far as I know, there aren't a whole lot of libraries around for simulating CSS behaviour on an HTML source outside of a browser.
All of this means that if you really need to capture the output of JavaScript and/or CSS execution, it might be easiest to wire a browser directly into your app's processing pipeline (such as one of the Chromium-based offerings) and interrogate its DOM (in a similar manner to the many functional web-testing suites).
NB: If this is a serious-sized, server-style processing task, you may want to hive off such processing onto dedicated servers / app pools / processes to give your app a fighting chance at decent uptime and/or memory usage.
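The question itself is about .NET and the Agility Pack, but as a rough illustration of the "let a real browser engine build the DOM, then query it" approach, here is a PhantomJS sketch (the URL is a placeholder; from C#, a comparable route would be driving a browser with something like Selenium WebDriver):

// run with: phantomjs scrape.js
var page = require('webpage').create();

page.open('http://example.com/articles', function (status) {
    var text = page.evaluate(function () {
        // This runs inside the rendered page, after its JavaScript has executed,
        // so dynamically generated elements are present in the DOM.
        var el = document.querySelector('div.navigation_noClass');
        return el ? el.textContent : null;
    });
    console.log(text); // e.g. "There are 43 articles"
    phantom.exit();
});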