How is it possible to use jodd.http.HttpRequest to load page content that is generated by JavaScript? - jodd

I'm trying to load a page's content with:
HttpResponse response2 = HttpRequest.get(_PAGE_URL).cookies(response.cookies()).send();
In a browser, the page source is full of JavaScript that generates the DOM, but in the browser's Web Inspector I can see the generated source.
The question is: can I somehow retrieve the generated page content using Jodd's utilities?

You can't. You can only download the static HTML content (as you did); you would then need to render it using some other tool.
Since Java 8 you can use JavaFX's WebView component (as far as I remember), but please look into other tools as well (maybe CEF?).
EDIT
See: https://github.com/igr/web-scraper (based on Selenium WebDriver). One thing I miss is better control over request/response.
There is also HtmlUnit, but from the reviews, it seems Selenium is a better choice.

Related

Get generated source of an HTML page programmatically

What is the easiest way to get the generated web page of a website programmatically, in any programming language?
The generated web page I need is the one you get in Firefox if you press Ctrl-A, then right-click and choose "View Selection Source".
One way that comes to mind is to dig into the Chromium open-source browser code, get the rendered page, and use it in our service.
But I believe that there may be another solution out there that I am not aware of.
In JavaScript, running in the browser, you can get the full document content with:
var html = document.documentElement.innerHTML;
If you want to do this server-side, you can use PHP's file_get_contents(), though note that it returns the raw source as sent by the server; no JavaScript is executed.
Ex:
file_get_contents($url);
For reference:
http://php.net/manual/en/function.file-get-contents.php
https://www.w3schools.com/php/func_filesystem_file_get_contents.asp

Scraping with Cheerio, text is not visible

So I've been web scraping with Cheerio, and I'm able to find the particular HTML element that I'm looking for, but for some reason the text is not there.
For example, in my web browser, when I inspect the element I see "Why Him?".
But when I print out the object while scraping, the element is empty, so when I call the .text() function, it doesn't return anything. Why does this happen?
Inspect Element is not a valid test that Cheerio will be able to see something. You must use View Source instead.
Inspect Element is a live view of how the browser has rendered an element after applying all of the various technologies that exist in a browser, including CSS and JavaScript. View Source, on the other hand, is the raw code that the server sent to the browser, which you can generally expect to be the same as what Cheerio will receive. That is, assuming you ensure the HTTP headers are identical, particularly the ones relevant to content negotiation.
It is important to understand that while Cheerio is a DOM parser, it does not simulate a browser. So if the text is added via JavaScript, for example, then the text will not be there because that JavaScript will not have run.
If browser simulation is important to you, you should look into using PhantomJS. If you need a highly realistic browser rendering setup, then look into WebDriver and Leadfoot.

Get the "real" source code from a website

I've got a problem getting the "real" source code from a website:
http://sirius.searates.com/explorer
Trying it the normal way (view-source:) in Chrome, I get a different result than when using the Inspect Elements function. And the code which I can see using that function is the one I would like to have... How is it possible to get this code?
This usually happens because the UI is actually generated by a client-side Javascript utility.
In this case, most of the screen is generated by HighCharts, and a few elements are generated/modified by Bootstrap.
The DOM inspector will always give you the "current" view of the HTML, while the view source gives you the "initial" view. Since view source does not run the Javascript utilities, much of the UI is never generated.
To get the most up-to-date (HTML) source, you can use the DOM inspector to find the root html node, right-click and select "Edit as HTML". Then select-all and copy/paste into your favorite text editor.
Note, though, that this will only give you a snapshot of the page. Most modern web pages are really browser applications and the HTML is just one part of the whole. Copy/pasting the HTML will not give you a fully functional page.
You can get the real-time HTML with this URL; bookmark it:
javascript:document.write('<textarea rows="30" cols="100">'+document.body.innerHTML+'</textarea>');

render a full web page in node.js code

I am running a node.js server, and it is rendering a web page wonderfully. When I look at this in a browser, it runs exactly as I expect.
However, what I actually want to do is fully generate the HTML page, exactly as it appears in the browser, from within the Node.js code. Currently, I have tried this:
http.request("http://localhost:8000/").end();
(with a few variants). This does exactly what it says, which is to make a single call to the server for the page. What it doesn't do is actually render the page, pulling in all of the other script files and running the code on the page.
I have tried exploring Express and EJS, and I think I need to use one of these, but I cannot figure out how to do this fairly straightforward task. All it needs to do is render an HTML page, but it seems to be a whole lot more complex than it should be.
What output do you want? A string of HTML? Maybe you want PhantomJS the headless browser. You could use it to render the page, then get the rendered DOM as a string of HTML.
Use Mikeal's Request module to make HTTP requests; once you have captured the response, you can inspect the HTML however you like.
To make that easier, though, you should use Cheerio, which will give you a jQuery-style API to manipulate the HTML.
Perhaps you are looking for wkhtmltopdf?
In a nutshell, it will render an entire web page (including images and JavaScript) to a PDF document.

Is it possible to get the generated source of a webpage programmatically?

As the title states, I am wondering if there is a method to obtain the generated HTML code of a page. Obviously I can inspect the page with web developer tools (browser built-in, or external program) and get it, but I would really like to do it automatically. Perhaps using Fiddler's API it could be possible?
Thanks!
"Source" doesn't get altered by JavaScript after page load, it's the document object model (DOM) generated from the source that gets altered. It is this DOM that is then translated to the GUI, and is altered with every change as long as the page is not re-loaded.
The DOM is not a string of HTML code, it is an in-memory hierarchical object representation of the page. The browser does not maintain an up-to-date, flat-file representation of the DOM as it gets altered, which is why when you "view source" you only ever see what was originally sent to the browser over HTTP.
The node-for-node representation of the page/DOM in developer tools such as Firebug is the closest you'll get (AFAIK) to a re-generation of the source code without building some new tool yourself.
You may be able to write a script in Python that takes a variable (the URL) and passes it to a command that downloads the web page, such as wget.
Googling it, I found these for parsing HTML files: maybe you could wget the index.html and use one of these:
How do you parse and process HTML/XML in PHP?