Get generated source of an HTML page programmatically - html

What is the easiest way to get the generated web page of a website programmatically, in any programming language?
The generated web page I need is the one you get if you open a page in Firefox, press Ctrl+A, then right-click and choose "View Selection Source".
The one way that comes to mind is to understand the chromium open source web browser code and get the rendered page and use it in our service.
But I believe that there may be another solution out there that I am not aware of.

In JavaScript, you can get the current content of the document (everything inside the <html> element, including changes made by scripts) with
var html = document.documentElement.innerHTML;

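If you want to do this server-side in PHP, you can use file_get_contents(). Note that this returns the HTML as the server delivers it; changes made by JavaScript in the browser are not included.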
Ex:
file_get_contents(path_to_webpage);
For reference:
http://php.net/manual/en/function.file-get-contents.php
https://www.w3schools.com/php/func_filesystem_file_get_contents.asp
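For comparison, here is a rough Python counterpart to the file_get_contents() call, using only the standard library. The function name fetch_html and the UTF-8 fallback are assumptions made for illustration. Like the PHP version, it returns the markup as delivered by the server, not the DOM after scripts have run.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the markup the server sends for `url` (no JavaScript is run)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header when the server declares one;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://example.com/") would return that page's delivered HTML as a string.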

Related

Saving static HTML page generated with ReactJS

Background:
I need to allow users to create web pages for various products, with each page having a standard overall appearance. So basically, I will have a template, and based on the input data I need the HTML page to be generated for each product. The input data will be submitted via a web form, following which the data should be merged with the template to produce the output.
I initially considered using a pure templating approach such as Nunjucks, but moved to ReactJS as I have prior experience with the latter.
Problem:
Once I display the output page (by adding the user input to the template file with placeholders), I am getting the desired output page displayed in the browser. But how can I now obtain the HTML code for this specific page?
When I tried to view the source code of the page, I see the contents of 'public/index.html' stating:
This HTML file is a template.
If you open it directly in the browser, you will see an empty page.
Expectedly, the same happens when I try to save (Save As...) the html page via the browser. I understand why the above happens.
But I cannot find a solution to my requirement. Can anyone tell me how I can download/save the static source code for the output page displayed in the browser?
I have read possible solutions such as installing 'React/Redux Development Extension' etc... but these would not work as a solution for external users (who cannot be expected to install these extensions to use my tool). I need a way to do this on production environment.
p.s. Having read the "background" info of my task, do let me know if you can think of any better ways of approaching this.
Edit note:
My app is currently just a single page that accepts user data via a form and displays the output (in a full-screen dialog). I don't wish to have these output pages 'published' on the website; they are simply to be saved/downloaded for internal use. So simply being able to get the "source code" for the displayed view/page in the browser and save it to a file would solve my problem. But I am not sure if there is a way to do this?
It's recommended that you use a well-known site generator such as Gatsby or Next for your static sites, since "npx create-react-app my-app" is for single-page apps.
(ref: https://reactjs.org/docs/create-a-new-react-app.html#recommended-toolchains)
If I'm understanding correctly, you need to generate a new page link for each user. Each of your users will have their own link (http/https) to share with their users.
For example, a scheduling tool will need each user to create their own "booking page", which is a generated link (could be on your domain --> www.yourdomain.com/bookinguser1).
You'll need user profiles to store each user's custom page, a database, and such. If you're not comfortable with that, I'd use something like an e-commerce tool that will do it for you.
You can open the developer tools (F12) and go to "Elements".
Then right-click on the HTML tag and choose "Edit as HTML".
Then select everything (Ctrl+A) and copy it.

How to retrieve the HTML code of a particular site's homepage

I want to get the HTML code of a particular site. It asks me to register myself first so that I can be redirected to their home page. Now, my question is: is it possible to retrieve the HTML code of the desired page just by choosing option ‘View Page Source’ which appears on right click? Is there any other way to fetch the HTML code?
There are multiple ways of getting the HTML source code of a page.
One way, as you already know, is by viewing the page's source code.
If you right-click and choose View Page Source, or just press Ctrl+U, you will see the source code in your browser.
If you are using Linux, you can use wget to get the source code.
Just open a console and type wget www.somewebsite.com and you will get the HTML source code, including any links to CSS and JS files.
However, you cannot get the PHP code using any of these methods unless you have FTP access to the server.
Yes, it is possible to view the HTML via 'View Page Source', or you could use PHP as mentioned in the comments:
'using PHP, yes: php.net/manual/en/function.file-get-contents.php – Vitorino fernandes'
You could also let a website or a program do it for you, but how far you can trust the result depends on that site or program.
Do note it is NOT possible to view the PHP source, since that is server-side.
Using any browser, the "View Page Source" option will show you the source of the page as received by the browser (which may be different from the source currently displayed). You also have the option of using File > Save Page As (or a similar menu option) to save a copy of the page's HTML from the browser.
It is also possible to use command line tools like curl and wget to download the page to your local machine. Those tools provide options to send data (such as cookies or headers to identify yourself) along with the request.
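As a minimal sketch of the same idea in Python (standard library only), the request below carries a Cookie header so the server treats it as an already logged-in session. The helper names, the cookie names, and the default User-Agent value are hypothetical; a real site's session cookies would have to be copied from the browser after registering.

```python
from urllib.request import Request, urlopen


def build_request(url, cookies=None, user_agent="Mozilla/5.0"):
    """Build a GET request that identifies itself with cookies and a User-Agent."""
    headers = {"User-Agent": user_agent}
    if cookies:
        # Same wire format as curl --cookie "name=value; other=value".
        headers["Cookie"] = "; ".join(
            f"{name}={value}" for name, value in cookies.items()
        )
    return Request(url, headers=headers)


def fetch_with_session(url, cookies, timeout=10):
    """Download the page as the logged-in user would receive it."""
    with urlopen(build_request(url, cookies), timeout=timeout) as response:
        # Decoding as UTF-8 is an assumption of this sketch.
        return response.read().decode("utf-8", errors="replace")
```

As with curl and wget, this returns the HTML as delivered by the server, before any script runs.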

Get the "real" source code from a website

I've got a problem getting the "real" source code from a website:
http://sirius.searates.com/explorer
Trying it the normal way (view-source:) in Chrome, I get a different result than when I use the Inspect Element function. The code I can see with Inspect Element is the one I would like to have. How can I get this code?
This usually happens because the UI is actually generated by a client-side JavaScript utility.
In this case, most of the screen is generated by HighCharts, and a few elements are generated/modified by Bootstrap.
The DOM inspector will always give you the "current" view of the HTML, while view source gives you the "initial" view. Since view source does not run the JavaScript utilities, much of the UI is never generated.
To get the most up-to-date (HTML) source, you can use the DOM inspector to find the root html node, right-click and select "Edit as HTML". Then select-all and copy/paste into your favorite text editor.
Note, though, that this will only give you a snapshot of the page. Most modern web pages are really browser applications and the HTML is just one part of the whole. Copy/pasting the HTML will not give you a fully functional page.
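You can get the live HTML with a bookmarklet. Save this as the URL of a bookmark and click it on the page you want to inspect:
javascript:document.write('<textarea style="width:400px">'+document.body.innerHTML.replace(/&/g,'&amp;').replace(/</g,'&lt;')+'</textarea>');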

What is DOM generated code?

I am obviously new to HTML and Web Browsers and python too. I installed the Web Developer extension in Firefox and noticed that in addition to the "View Source" option there are two additional "View Generated Source" and "View Frame Source" options. What are these? Why should they be different?
I have no idea what a generated source is.
Aren't frames part of the page? If so why do I need a separate "View Frame Source" option? Does it mean that the regular "View Page Source" will not show source for all the elements in the page?
If I want to see the code that is executed/used to show me a page which option should I look at and why?
If I want to get this code in python using the requests module how do I get these various sources?
HTML code can be modified dynamically by JavaScript. "View Generated Source" shows you the HTML in its current state, which may have been modified by JavaScript and so differ from the HTML delivered by the server. This makes it useful for debugging JavaScript applications.
"View Frame Source" is for websites that use HTML framesets. Such sites are a composite of several separate HTML documents displayed together on one page. This is an older approach to web design but is still widely deployed. A site like this can look like a simple page with a menu on the left and the content beside it; with framesets there would be a menu.html and a content.html. Each document's source can be shown separately in the Web Developer toolbar by right-clicking the frame and selecting "Show frame source".
Questions 1 and 2 should now be answered. Question 3:
If I want to see the code that is executed/used to show me a page which option should I look at and why?
Answer: use "View Generated Source...", as this gives you the HTML you are actually seeing displayed in the browser, regardless of whether it was generated by JavaScript.
Unfortunately I'm not a Python expert, so question 4 remains open.
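For the frame part of question 4, one possible Python sketch is to parse the outer page, collect the src of each frame or iframe, and then fetch those URLs separately. The class and function names below are made up for illustration, and only the standard library is used; the same parsing would apply to HTML downloaded with the requests module.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the src of every <frame> and <iframe> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Frame sources are often relative to the page that embeds them.
                self.frame_urls.append(urljoin(self.base_url, src))


def find_frame_urls(html, base_url):
    """Return the absolute URLs of all frames referenced by `html`."""
    collector = FrameCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.frame_urls
```

Each returned URL is a separate HTML document, which is why the browser offers a separate "View Frame Source" for it.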
The generated source is what results when the browser fetches the frame source and then executes the page's JavaScript, which modifies the page.
To understand more how browsers get an html page compared to a program check my answer here:
https://stackoverflow.com/a/15775702/707949
Then, to get the source HTML page, check this answer:
https://stackoverflow.com/a/15799102/707949
And to get the generated HTML source, check the end of the first answer.
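One way to get the generated source from Python without third-party packages is to let a headless browser render the page and print the resulting DOM. The sketch below assumes a Chrome or Chromium build is installed, that it accepts the --headless and --dump-dom flags, and that it is reachable as google-chrome; the executable name and both function names are assumptions and will differ between systems.

```python
import subprocess


def build_dump_dom_command(url, browser="google-chrome"):
    """Command line asking headless Chrome to print the rendered DOM of `url`."""
    # `browser` is an assumed executable name; adjust it for your installation.
    return [browser, "--headless", "--dump-dom", url]


def fetch_generated_html(url, browser="google-chrome", timeout=60):
    """Return the DOM serialized by the browser after the page's scripts have run."""
    result = subprocess.run(
        build_dump_dom_command(url, browser),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        timeout=timeout,
        check=True,           # raise CalledProcessError if the browser exits with an error
    )
    return result.stdout
```

Unlike a plain HTTP download, the output here reflects changes made by JavaScript, though content that loads late may still be missing from the snapshot.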

Can we generate a URL from a page source?

I know this question is weird, but I want to know anyway.
In web browsers, we can view the page source of a URL. But I have some page source (HTML code) and I don't know the URL it came from. Can we generate a URL from that page source, or is there anything else we can do to get a URL from the page source?
When I searched, I only found how to get the page source from a URL, so I am asking here.
If @Wex is headed in the right direction with his answer, and based on your comment, then I'll answer with this.
You can get the "Web Developer Toolbar" add-on for Firefox, which has an option to "View Generated Source".
This is the same as selecting the whole page and using "View Selection Source" in Firefox. It will give you the DOM of that page, including anything rendered by JavaScript.
If you're asking if there is a way to get the original url from source code the answer is no. Why? Because Google doesn't search the source code, it searches content. There also could be a thousand different websites that use pieces of that code.
In Chrome, you can view the source of any webpage by preceding the URL with view-source:. As far as I know, this is the only browser that allows you to do this; Safari and Firefox, for instance, show the source in a popup window rather than in a regular window.