Is it possible to get the generated source of a webpage programmatically? - html

As the title states, I am wondering if there is a method to obtain the generated HTML code of a page. Obviously I can inspect the page with web developer tools (browser built-in, or external program) and get it, but I would really like to do it automatically. Perhaps using Fiddler's API it could be possible?
Thanks!

"Source" doesn't get altered by JavaScript after page load, it's the document object model (DOM) generated from the source that gets altered. It is this DOM that is then translated to the GUI, and is altered with every change as long as the page is not re-loaded.
The DOM is not a string of HTML code; it is an in-memory, hierarchical object representation of the page. The browser does not maintain an up-to-date, flat-file representation of the DOM as it gets altered, which is why "view source" only ever shows what was originally sent to the browser over HTTP.
The node-for-node representation of the page/DOM in developer tools such as Firebug is the closest you'll get to a regeneration of the source code (AFAIK) without building some new tool yourself.
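To make the source/DOM distinction concrete, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for a browser DOM, which is an assumption made purely for illustration (a real browser uses an HTML parser and its own DOM implementation). The sample markup and the simulate_script helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The "source": the string originally delivered over HTTP (sample markup).
SOURCE = '<html><body><ul id="items"></ul></body></html>'


def simulate_script(root: ET.Element) -> None:
    """Stand-in for page JavaScript: add a list item to the in-memory tree."""
    items = root.find(".//ul[@id='items']")
    new_item = ET.SubElement(items, "li")
    new_item.text = "added after load"


# Parsing builds a tree of objects; the original string is not kept in sync.
tree_root = ET.fromstring(SOURCE)
simulate_script(tree_root)

# "View source" corresponds to the untouched string ...
original_source = SOURCE
# ... while the "generated source" only exists once the tree is serialized.
generated_source = ET.tostring(tree_root, encoding="unicode")

print(original_source)
print(generated_source)
```

The second line printed contains the added list item; the first does not, which mirrors the difference between "view source" and the inspector view.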

You may be able to write a script in Python that takes a variable (the URL) and passes it to a command that downloads the webpage, such as wget.
After some Googling, I found this on parsing HTML files; maybe you could wget the index.html and use one of these:
How do you parse and process HTML/XML in PHP?
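A rough sketch of the download step in Python, using urllib from the standard library in place of shelling out to wget (that substitution is an assumption, not something the answer specifies). The function name fetch_raw_html is hypothetical. Note that this, like wget, returns only the markup the server sent; no JavaScript runs, so script-generated content will be missing.

```python
import urllib.request


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return the markup as the server sent it.

    No JavaScript is executed, so the result matches "view source",
    not the DOM shown in the browser's inspector.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (placeholder URL):
#     html_text = fetch_raw_html("https://example.com/")
```

The returned string can then be handed to whichever HTML parser you prefer.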

Related

Saving static HTML page generated with ReactJS

Background:
I need to allow users to create web pages for various products, with each page having a standard overall appearance. So basically, I will have a template, and based on the input data I need the HTML page to be generated for each product. The input data will be submitted via a web form, following which the data should be merged with the template to produce the output.
I initially considered using a pure templating approach such as Nunjucks, but moved to ReactJS as I have prior experience with the latter.
Problem:
Once I display the output page (by adding the user input to the template file with placeholders), I am getting the desired output page displayed in the browser. But how can I now obtain the HTML code for this specific page?
When I tried to view the source code of the page, I see the contents of 'public/index.html' stating:
This HTML file is a template.
If you open it directly in the browser, you will see an empty page.
Expectedly, the same happens when I try to save (Save As...) the html page via the browser. I understand why the above happens.
But I cannot find a solution to my requirement. Can anyone tell me how I can download/save the static source code for the output page displayed on the browser.
I have read possible solutions such as installing 'React/Redux Development Extension' etc... but these would not work as a solution for external users (who cannot be expected to install these extensions to use my tool). I need a way to do this on production environment.
p.s. Having read the "background" info of my task, do let me know if you can think of any better ways of approaching this.
Edit note:
My app is currently just a single page that accepts user data via a form and displays the output (in a full-screen dialog). I don't wish to have these output pages 'published' on the website; they are simply to be saved/downloaded for internal use. So simply being able to get the "source code" for the displayed view/page in the browser and saving it to a file would solve my problem. But I am not sure if there is a way to do this?
It's recommended that you use a well-known site generator such as Gatsby or Next for static sites, since "npx create-react-app my-app" is for single-page apps.
(ref: https://reactjs.org/docs/create-a-new-react-app.html#recommended-toolchains)
If I'm understanding correctly, you need to generate a new page link for each user. Each of your users will have their own link (http/https) to share with their users.
For example, a scheduling tool will need each user to create their own "booking page", which is a generated link (could be on your domain --> www.yourdomain.com/bookinguser1).
You'll need user profiles to store each user's custom page, a database, and so on. If you're not comfortable with that, I'd use something like an e-commerce tool that will do it for you.
You can open the developer tools (F12) and go to "Elements".
Then right-click the HTML tag and choose "Edit as HTML".
Then select everything (Ctrl+A) and copy it.
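For the background task in the question (merging form input into a fixed template), a pure templating step produces the static HTML directly, with no browser involved. Below is a minimal sketch in Python using string.Template from the standard library; the template, the field names "name" and "description", and the function render_product_page are all hypothetical, and this is a different technique from the React setup described in the question.

```python
import html
from string import Template

# Hypothetical product page template with $placeholders.
PAGE_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head><title>$name</title></head>\n"
    "<body><h1>$name</h1><p>$description</p></body>\n"
    "</html>\n"
)


def render_product_page(form_data: dict) -> str:
    """Merge form input into the template, escaping it for use in HTML."""
    escaped = {key: html.escape(str(value)) for key, value in form_data.items()}
    return PAGE_TEMPLATE.substitute(escaped)


page = render_product_page({"name": "Widget <Pro>", "description": "Fast & light"})
print(page)
```

The resulting string is already the complete page, so it can be written to a file or offered as a download without going through "view source".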

Node Red HTML node to see Inspect instead of View Source

I am trying to have Node-RED go to my router's IP and search through the HTML code to see whether a certain device is on the list. When I right-click and choose Inspect, I can hover over the list I am interested in and see the HTML information I am looking for. When I use the HTML node, it seems to only look through the "view page source" information, which does not have what I am looking for. Is there a way to point the HTML node at a more specific element instead of the page source as a whole?
It sounds like the data in the page on your router might be dynamically generated using JavaScript.
This means that when the page is loaded it only has the outline, and the rest is filled in by code using XHR requests to a different URL that supplies the information.
In order for Node-RED to be able to extract the information from the page, it would need to load the outline and then effectively run all the JavaScript. Libraries like PhantomJS can do this.
There is a contrib node that might be able to help, node-red-contrib-nbrowser, but the better approach would probably be to work out what URL the JavaScript is calling and call that directly, as the data is most likely in a format that is easier to process (e.g. JSON).
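If the router page does load its device list from a separate URL, the check reduces to parsing that response. The sketch below is in Python and assumes the endpoint returns a JSON array of objects with "name" and "mac" fields; that shape, the field names, and the sample data are hypothetical, and a real router will very likely differ.

```python
import json


def device_is_listed(payload: str, mac_address: str) -> bool:
    """Return True if a device with the given MAC appears in the JSON payload.

    Assumes (hypothetically) a JSON array of objects with a "mac" field.
    """
    wanted = mac_address.strip().lower()
    devices = json.loads(payload)
    return any(str(device.get("mac", "")).lower() == wanted for device in devices)


# Sample response body standing in for what the router's data URL returns.
sample = (
    '[{"name": "laptop", "mac": "AA:BB:CC:00:11:22"},'
    ' {"name": "phone", "mac": "AA:BB:CC:33:44:55"}]'
)

print(device_is_listed(sample, "aa:bb:cc:33:44:55"))
print(device_is_listed(sample, "aa:bb:cc:99:99:99"))
```

In Node-RED itself the same logic would sit in a function node placed after an http request node pointed at the data URL.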

NaCl Module HTML Interface

I'm developing a Chrome packaged app which displays a certain kind of document as HTML. I have the app working to some degree, but would like to add a feature allowing the user to open a file by clicking on a link to an applicable file.
I am able to launch the app by MIME type as per the docs here, and am familiar with the pp::Instance::HandleDocumentLoad method to handle the clicked link's source, but am unsure how to display HTML I'm generating from the parsed document.
This is easy enough to do when the user manually launches the app and selects a file using an input element and the HTML file system since the HTML GUI is specified in the app manifest, but as far as I can tell, launching based on MIME type just embeds the NMF.
TL;DR: Is there a way to specify an HTML interface for (or a simple way to render HTML from) a NaCl module instance created by a nacl_modules manifest entry?
This is possible, but it's a bit of a hack. I copied the trick from here:
https://groups.google.com/d/msg/native-client-discuss/UJu7VXvV_bw/pLc19D50gbwJ
You can see how I did it here and here:
Basically, you listen on chrome.tabs.onCreated and chrome.tabs.onUpdated, then you inject a small bit of JavaScript that checks for the embed element with the correct mimetype. If it finds the element, it sends a message (via chrome.runtime.sendMessage) to your extension. When your extension gets that message, it injects the rest of your JavaScript into the page using chrome.tabs.executeScript. At this point you can display whatever you want.
You could do it earlier, by injecting your code into every page, but I found this was a bit nicer, as it only injects a small bit of code.

Get the "real" source code from a website

I've got a problem getting the "real" source code from a website:
http://sirius.searates.com/explorer
Trying it the normal way (view-source:) in Chrome, I get a different result than when using the Inspect Element function. The code I can see using that function is the one I would like to have. How can I get this code?
This usually happens because the UI is actually generated by a client-side JavaScript utility.
In this case, most of the screen is generated by Highcharts, and a few elements are generated/modified by Bootstrap.
The DOM inspector will always give you the "current" view of the HTML, while view source gives you the "initial" view. Since view source does not run the JavaScript utilities, much of the UI is never generated.
To get the most up-to-date (HTML) source, you can use the DOM inspector to find the root html node, right-click and select "Edit as HTML". Then select-all and copy/paste into your favorite text editor.
Note, though, that this will only give you a snapshot of the page. Most modern web pages are really browser applications and the HTML is just one part of the whole. Copy/pasting the HTML will not give you a fully functional page.
You can get the current HTML with a bookmarklet. Save this as the URL of a bookmark:
javascript:document.write('<textarea width="400">'+document.body.innerHTML+'</textarea>');
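The same snapshot can be taken from a script by driving a real browser and asking it to serialize the live DOM. The sketch below is in Python and assumes a Selenium WebDriver, which is a third-party package not mentioned in this thread; the helper name get_rendered_html is hypothetical. The helper only relies on the driver's get and execute_script methods.

```python
def get_rendered_html(driver, url: str) -> str:
    """Load a URL in a browser and return the serialized current DOM.

    `driver` is assumed to be a Selenium WebDriver (or any object that
    exposes `get` and `execute_script` in the same way).
    """
    driver.get(url)
    # outerHTML serializes the live DOM, including script-generated nodes.
    return driver.execute_script("return document.documentElement.outerHTML;")


# Usage with Selenium (assumed to be installed, with a matching browser):
#
#     from selenium import webdriver
#
#     driver = webdriver.Chrome()
#     try:
#         html_text = get_rendered_html(driver, "http://sirius.searates.com/explorer")
#     finally:
#         driver.quit()
```

As with the manual copy, this is a snapshot: content that arrives through later requests may need an explicit wait before the DOM is serialized.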

PdfSharp, GDI+ and HTML printing

I currently have a "PrintingWebService" that I call from an AJAX page with all the information needed to construct a highly customized PDF printout using PdfSharp in its GDI+ mode, which takes DrawString and other commands that work basically just like GDI+, only they are drawn to the PDF.
I then save the PDF file to a location on the webserver and return the file name from the web service, and the AJAX page opens a new window with the pdf file.
So far it works well. However, there is one part of my AJAX page that I want to print out and for which I haven't come up with a solution yet: I've got a string of the HTML content of a TinyMCE editor that I want to display in the bottom part of the PDF page.
I'm looking for some sort of tool I could use for this purpose. Even something opensource that prints to GDI+ I could use by taking the source code and translating it to use PdfSharp's GDI+ (the class names are like XGraphics, with each class having X before the GDI+ name).
If I have to I will limit what HTML can be generated by TinyMCE and write my own renderer, but that will be a big challenge, so I'm looking for other solutions first.
I've stayed away from a printer-friendly page approach because I wanted to construct a page that was a near-identical copy of an existing WinForms printout, using my existing code. With PdfSharp I was able to convert all the code except the text area stuff (which used the RichTextBox and RTF in the WinForms version).
Tony,
I personally have used WebSupergoo's ABCpdf library with much success. You can actually render HTML directly to the PDF, and it does fairly well with regard to accuracy.
Another option, which is free and which I have used in the past with much success, is iTextSharp; it also gives you the flexibility of writing HTML to PDF.
Otherwise, I think you'll have to write something to render HTML to GDI.
Either way, you may want to consider using an HttpHandler, mapped in your web.config, to generate the PDF file. This will allow you to render the PDF to a byte stream and then send it directly to the user (as opposed to having to save each PDF receipt to the web server). It will also allow you to use the .pdf extension in the URL that returns the receipt (PurchaseReceipt.pdf could be mapped to an HttpHandler), making it more cross-browser friendly. Older versions of Adobe Reader and of some browsers will not display the document correctly if you start sending a PDF byte stream from an ASPX page.
Hope this helps.