Capture a copy of a website and then serve it - html

I'm aware of HTTrack, but it doesn't fit my needs.
Is there a program I can run to capture a complete copy of a dynamic site so that I can then serve it from a static server? I know JavaScript-heavy pages could make that complicated.
Basically I can see this working as follows:
Step 1. Fetch all the linked assets starting from a particular URL and recreate the directory structure.
Step 2. Open each fetched HTML file in a headless browser, serialize the resulting DOM back to HTML, and overwrite the original HTML file.
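A minimal sketch of the bookkeeping behind Step 1, in Python using only the standard library: collecting the links found in a fetched page and deciding where each URL should live on disk. The names LinkCollector and local_path_for, the "mirror" root folder, and the choice of index.html for directory-style URLs are all assumptions made for illustration. The downloading itself and the headless-browser pass from Step 2 are not covered here.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import unquote, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from href and src attributes (hypothetical helper)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                absolute = urljoin(self.base_url, value)
                # Skip mailto:, javascript: and other non-fetchable schemes.
                if urlsplit(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def local_path_for(url: str, root: str = "mirror") -> str:
    """Map a URL to a file path under `root`, mirroring the site's directory layout.

    Assumption: the query string is ignored, so URLs that differ only in
    their query map to the same file.
    """
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Directory-style URLs ("/", "/blog/") need a file name to hold their HTML.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result cannot escape `root`.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return str(PurePosixPath(root, parts.netloc, *segments))
```

A crawler built on this would feed each downloaded page to LinkCollector, queue the collected links that share the start URL's host, and write each response body to the path returned by local_path_for.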

Related

Opening certain files in certain folders in a HTML page

I am trying to learn a bit about web technologies, so I am trying to create a catalogue for my files.
The situation is the following:
I have a folder with N sub folders;
in each of these sub folders there is an image that always has the same name (i.e. image.jpg);
in each of these sub folders there is also a certain swg file that always has the same name (i.e. test.swg).
I would like to create an HTML file which reads all the sub folders and creates a preview using image.jpg; when one clicks on a preview, test.swg should be launched (not in the browser, if possible).
The HTML file should contain all these previews, like a catalogue.
How can I do this? Should I run a local web server on my machine? Is it possible to do this with non-web technologies?
Thank you!
As far as I know, JavaScript and HTML running in the browser don't have access to the filesystem, so a page can't iterate through your folders on its own; allowing that would be a security breach.
If you are asking whether it's possible without a server: it should be, but it will need some other technology. For example:
Using a command-line interface on a Linux- or Windows-based OS, you could write a shell script that iterates through the folders and files and writes the result out as JSON. Note that a plain data.json file loaded through a script tag is not exposed to the page's JavaScript, so have the script write a small JavaScript file that assigns the JSON to a variable (data.js is used here as an example name). The page could then load it like below.
<script type="text/javascript" src="data.js"></script>
<script type="text/javascript" src="javascript.js"></script>
Do note that you would need to rerun the shell script whenever the folders change, either periodically with something like a scheduler or manually.
If you want to do it the usual way, you could use one of many server-side languages, for example Node.js or PHP, as I think both of them require only a little configuration.
You could post a follow-up question once you've decided which language you want to use.
Below are some references that you can use to start working on reading the directories:
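As a rough sketch of the script-generated data file described above, here is a Python version (used in place of a shell script purely for illustration). It assumes the layout from the question, with image.jpg and test.swg in every sub folder. The function names, the output name data.js, and the variable name catalogue are hypothetical. The output is written as a JavaScript assignment rather than bare JSON so that a page opened straight from disk can read it through a script tag.

```python
import json
from pathlib import Path


def build_catalogue(root: str) -> list[dict]:
    """Return one entry per sub folder that contains both image.jpg and test.swg."""
    entries = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        image = folder / "image.jpg"
        target = folder / "test.swg"
        if image.is_file() and target.is_file():
            entries.append({
                "name": folder.name,
                # Paths are stored relative to the root with forward slashes,
                # so the HTML page can use them directly as URLs.
                "image": image.relative_to(root).as_posix(),
                "file": target.relative_to(root).as_posix(),
            })
    return entries


def write_data_js(root: str, out_name: str = "data.js") -> Path:
    """Write the catalogue next to the sub folders as `var catalogue = [...];`."""
    out_path = Path(root) / out_name
    payload = json.dumps(build_catalogue(root), indent=2)
    out_path.write_text(f"var catalogue = {payload};\n", encoding="utf-8")
    return out_path
```

With this in place, javascript.js could loop over the global catalogue array and create one linked preview image per entry.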
NodeJS
Node.js fs.readdir recursive directory search
Get all files recursively in directories NodejS
PHP
List all the files and folders in a Directory with PHP recursive function
How to recursively iterate through files in PHP?
After reading the directories and files, you just need to pass the data to the "rendering" part and use some JavaScript to invoke the .swg file when the image is clicked.
I'm not sure whether the .swg file can be opened in the desktop app directly, though; you could do some research on it:
Open online file with desktop applications?

Node.js/Express.js - How to render a remote HTML file to the client?

I am currently in the process of making a blog website. Writers for this website have the ability to upload AMP HTML files and the assets required for those files to work. Both the HTML files and their assets get sent to our CDN.
Now, when a client visits the website at a link such as example.com/13215, the server takes the parameter 13215, checks which post it refers to in the database, and retrieves the link to the HTML file. How can I send this HTML file to the front end with Node.js/Express.js even though it is a remote file?
Just copy-pasting the URL into the response.sendFile() and response.render() functions throws errors. I thought about reading the file contents, writing them to a local file, and then sending that to the client, but I don't think that's a good idea performance-wise.
Is there a way to achieve this?

Can Node-RED HTML be edited elsewhere?

I'm developing a Node-RED application right now that uses an HTML response. The HTML uses Google Maps, visual indicators and websockets. It is very hard to debug this system through Node-RED's little HTML editor. Is there a way to edit the HTML file in a normal editor (e.g. VS Code) and then deploy the application again to see the effect?
One solution that came to my mind was to read from an external file using the file node and return it as HTML, but I don't know if that works. Is there a better way?
You can create and edit static resources (html/css files etc) however you'd like and then serve them from Node-RED.
You have two options for serving static content:
create corresponding HTTP In -> File In -> HTTP Response flows for each file you want to serve
or use the httpStatic property in your settings.js file to identify a directory whose content should be automatically served by the runtime.

html-minifier: Recursive but copying-over invalid files

I first met html-minifier today after running a small site I've created using Hugo through Google PageSpeed.
The first thing I noticed is that although it does have recursion capabilities, it stops working on unsupported files like images (my speakers started beeping and I freaked a little).
I've found this stack showing an apparently undocumented command-line option, --file-ext.
That worked perfectly, but in the output directory I noticed that the folders with the unmatching contents were gone.
Looking at the directory root, I saw these were Hugo's folders for CSS, JS and images, plus GitHub Pages' CNAME file. Not only can I not be sure that there isn't a single static file in any of the folders Hugo generated (you may know that Hugo is sometimes unpredictable), but I would also like to keep the language-specific XML sitemaps I've created for some specific folders.
Long story short, is there a way to copy-over unmatching files "as is", keeping input directory ready for a commit/push?
After analyzing the whole directory structure, I could be sure that the directories Hugo creates contain nothing more than HTML and XML files, so Occam's razor applied.
Since my Hugo source code and the output contents are in totally different directories, it was a simple matter of pointing the output directory to the same path as the input directory.
All HTML files are minified, overwriting those Hugo generated.

save html page from the server by URL with no changes - get the exact copy, the clone

Let's say I have a URL http://example.com/path/to/document.html
That's the html document, the file, that has no external css or js.
If I open it in Google Chrome and save it locally with Ctrl+S, the content is changed. The saved HTML file starts with <!-- saved from url= which is not what I want at all. I need to get the exact HTML document; even spaces count.
The second option is to open it with Ctrl+U (View Source), Select All, paste it into a new document, save it and rename it. This is better; however, spaces, tabs and the end of the file will differ depending on which operating system I'm using.
I need the exact copy of that html file - byte to byte.
How can I do that?
This is a practical question, as I need to slightly modify that document.
I'm sorry there is no source code in my question, but this question is about web development.
Any ideas?
Thank you.
P.S. Of course that document could be generated by php or whatever, the part of the code can be even extracted from the db, but not in my case. I know that's a plain file.
I'd delete the comment after saving from Chrome, use wget in a Linux environment, or open the page as an InputStream in Java. Do all three, run a diff, and if two of the results are identical, assume that's the file on the server.
Why do you need a byte-for-byte copy of the file on the server anyway, and why can't you FTP the file? There is always the chance that the server will serve different HTML files depending on your user-agent, but there are other tools which may be better than Chrome for getting your copy, and many can spoof a user-agent as well.
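The same idea can be sketched in Python with only the standard library: read the response body as raw bytes, write it in binary mode so no newline or encoding translation happens, and compare hashes between copies obtained in different ways. The function names and the default User-Agent string below are assumptions for illustration. This only captures what the server sends to this particular client, so it cannot prove that the bytes match the file stored on the server.

```python
import hashlib
import urllib.request


def fetch_bytes(url: str, user_agent: str = "Mozilla/5.0") -> bytes:
    """Download the response body exactly as delivered, without decoding it as text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare two copies byte for byte."""
    return hashlib.sha256(data).hexdigest()


def save_exact(url: str, destination: str) -> str:
    """Save the body in binary mode and return its SHA-256 digest."""
    data = fetch_bytes(url)
    # "wb" avoids any platform-specific line-ending conversion.
    with open(destination, "wb") as handle:
        handle.write(data)
    return sha256_hex(data)
```

Calling save_exact twice with different user_agent values (by way of fetch_bytes) and comparing the digests is one way to check whether the server varies its output by user-agent.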