wget downloads the same html for every version of the website

I'm attempting to download the html using wget for this website:
https://cxcfps.cfa.harvard.edu/cda/footprint/cdaview.html#Footprints|filterText%3D%24filterTypes%3D|query_string=&posfilename=&poslocalname=&inst=ACIS-S&inst=ACIS-I&inst=HRC-S&inst=HRC-I&RA=210.905648&Dec=39.609177&Radius=0.0006&Obsids=&preview=1&output_size=256&cutout_size=12.8|ra=&dec=&sr=&level=&image=&inst=ACIS-S%2CACIS-I%2CHRC-S%2CHRC-I&ds=
Which is a version of the main website:
https://cxcfps.cfa.harvard.edu/cda/footprint/cdaview.html
The only difference from the main website is that the first link takes you to the version that has already searched through a database and displayed the results, which you can see in a table. But when I use wget to download the html for the longer link, it gives me exactly the same text as for the main/short link. I'm confused, but maybe I just don't understand enough about html. I thought they should be slightly different, with the longer one containing the html for the database results, etc.
I also used the --mirror option to download all the necessary files, but they all look the same, too. I've tried cURL as well, with the same result. Can someone please explain why this is happening and whether it's fixable?

The problem is that everything after the # in the long URL is a fragment, and a fragment is never sent to the server; it is only read by the page's JavaScript in the browser, which then runs the search and builds the results table on the client side. So wget (and cURL) request exactly the same document for both links and get back the same static HTML; the table you see in the browser is never part of that file. The --mirror option will download all the supporting files (scripts, CSS, and so on), but those are only the code, not the output it produces, so every copy looks identical, and grepping the downloaded HTML for the table won't find it either. To capture the results you need something that actually executes the JavaScript, rather than a plain HTTP client like wget or cURL.
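A quick way to convince yourself of this, using only the Python standard library (the fragment is abridged here):

# Everything after '#' is a fragment; clients strip it before making the
# HTTP request, so both URLs fetch the same document.
from urllib.parse import urldefrag

short_url = "https://cxcfps.cfa.harvard.edu/cda/footprint/cdaview.html"
long_url = short_url + "#Footprints|filterText%3D%24filterTypes%3D|..."  # abridged

print(urldefrag(long_url).url == short_url)  # True: same request target
print(urldefrag(long_url).fragment)          # the part only the page's JavaScript sees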

Related

Including images in a Genshi/Trac template

I am trying to include some images in a Genshi template for my Trac plugin, but it always shows only the alternative text because it cannot find the images.
I have the following (X)HTML code:
<div>
<img src="file://c:/path/to/image.png" alt="asdf" />
</div>
When I use this code with a simple html file and open it in the browser, the image is displayed correctly, which means that both the path and syntax are correct.
But when I insert the code snippet into a Genshi template and use it within Trac, the image cannot be found. However, when I look at the HTML source code in the web browser and copy the URLs into a new browser tab, it is again displayed correctly. This means that only the server cannot find the image.
The images are in a directory inside the python-egg file, and the path points directly to the directory created by Trac, which also contains my CSS and HTML files, both of which are loaded correctly. The images are correctly referenced in the setup script which creates the egg.
How do I have to reference images in (X)HTML documents when using them with a server?
Is there a special way to include images in Genshi documents? (I haven't found one.)
Thanks to RjOllos' comment and this site, I was able to fix it by trying all of the URL types. Although it says the URL for a plugin should be /chrome/<pluginname>, it was actually just /chrome that worked. See the edit below! So the full URL is then <ip>:<port>/chrome/path/to/image.png.
EDIT: I discovered I actually did use the /chrome/pluginname version, just that I did not use the name of my plugin as "pluginname". See my comment below. It seems /chrome/pluginname should really be /chrome/htdocsname or something like that, in case you use a different name than the plugin name when implementing ITemplateProvider. In my case I called it images, which was the same name as the folder. END OF EDIT
Another mistake I made was forgetting the initial slash (chrome/path/to/image.png), which caused Trac to assemble the URL to <ip>:<port>/<current page>/chrome/path/to/image.png.
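For reference, the piece that decides what comes after /chrome/ is the prefix returned by ITemplateProvider.get_htdocs_dirs(). A minimal sketch (the class name and package layout here are made up; only the 'images' prefix matches my setup):

from pkg_resources import resource_filename

from trac.core import Component, implements
from trac.web.chrome import ITemplateProvider

class MyImagePlugin(Component):  # hypothetical plugin class
    implements(ITemplateProvider)

    def get_htdocs_dirs(self):
        # The first tuple element is the URL prefix used after /chrome/,
        # not the plugin name. With ('images', ...) a file packaged under
        # the plugin's images/ directory is served at /chrome/images/image.png.
        return [('images', resource_filename(__name__, 'images'))]

    def get_templates_dirs(self):
        return [resource_filename(__name__, 'templates')]

The template then references the image with a leading slash, e.g. <img src="/chrome/images/image.png" alt="asdf" />, so Trac resolves it from the site root instead of relative to the current page.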

Grab resource URL from a web page

I am sometimes frustrated when I want to go to a page from my bookmarks and the page has disappeared.
I also want to build a tool which downloads the entire webpage when I bookmark it.
For that, I have to grab the URLs of all resources linked from the page: JavaScript, CSS, images, ...
Here are all the XPath selectors I thought of:
//img[@src]
//link[@href]
//script[@src]
//object[@data]
//iframe[@src]
//video[@src]
//audio[@src]
and also the background images contained in css files.
Could you tell me if I forgot something?
wget does that very well:
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://domain.tld/webpage.html
https://superuser.com/questions/55040/save-a-single-web-page-with-background-images-with-wget
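If you would rather collect the URLs yourself (for example to feed your own downloader), here is a minimal sketch along the lines of the selectors above, assuming requests and lxml are available. It does not cover srcset, inline styles, or url() references inside CSS files, which you would still have to parse separately:

from urllib.parse import urljoin

import requests        # assumption: requests and lxml are installed
from lxml import html

# (element, attribute) pairs matching the selectors listed in the question
RESOURCE_ATTRS = [("img", "src"), ("link", "href"), ("script", "src"),
                  ("object", "data"), ("iframe", "src"),
                  ("video", "src"), ("audio", "src")]

def resource_urls(page_url):
    """Return the absolute URLs of resources referenced by the page."""
    tree = html.fromstring(requests.get(page_url).content)
    urls = set()
    for tag, attr in RESOURCE_ATTRS:
        for value in tree.xpath("//%s/@%s" % (tag, attr)):
            urls.add(urljoin(page_url, value))
    return sorted(urls)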

Is there a way to export a page with CSS/images/etc using relative paths?

I work on a very large enterprise web application - and I created a prototype HTML page that is very simple - it is just a list of CSS and JS includes with very little markup. However, it contains a total of 57 CSS includes and 271 javascript includes (crazy right??)
In production these CSS/JS files will be minified and combined in various ways, but for dev purposes I am not going to bother.
The HTML is being served by a simple Apache HTTP server and I am hitting it with a URL like this: http://localhost/demo.html. I share this link with others, but you must be behind the firewall to access it.
I would like to package up this one HTML file with all referenced JS and CSS files into a ZIP file and share this with others so that all one would need to do is unzip and directly open the HTML file.
I have 2 problems:
The CSS files reference images using URLs like url(/path/to/image.png), which are not relative, so if you unzip and view the HTML these links will be broken
There are literally thousands of other JS/CSS files/images that are also in these same folders that the demo doesn't use, so just zipping up the entire folder will result in a very bloated zip file
Anyway -
I create these types of demos on a regular basis. Is there some easy way to create a ZIP that will:
Have updated CSS files that use relative URLs instead
Only include the JS/CSS that this html references, plus only those images which the specific CSS files reference as well
If I could do this without a bunch of manual work, if it could be automatic somehow, that would be so awesome!
As an example, one CSS file might have the following path and file name.
/ui/demoapp/css/theme.css
In this CSS file you'll find many image references like this one:
url(/ui/common/img/background.png)
I believe for this to work the relative image path should look like this:
url(../../common/img/background.png)
I am going to answer my own question because I have solved the problem for my own purposes. There are 2 options that I have found useful:
Modern browsers have a "Save Page As..." option under the File menu (in Chrome it is on the single browser menu). However, this does not always work properly when the page is generated by JavaScript.
I created my own custom application that can parse out all of the CSS/Javascript resources and transform the CSS references to relative URLs; however, this is not really a good answer for others.
If anyone else is aware of a commonly available utility, or something like that, which is better than the browser's built-in "Save page as..." option - feel free to post another answer.
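For anyone who wants to script the CSS rewrite from option 2 themselves, the core transformation can be sketched in a few lines of Python. This is only a rough sketch: it handles nothing but root-relative url(...) references, and the file paths come from the theme.css example above:

import posixpath
import re

def relativize_css_urls(css_path, css_text):
    """Rewrite root-relative url(/...) references relative to the CSS file."""
    css_dir = posixpath.dirname(css_path)

    def make_relative(match):
        return "url(%s)" % posixpath.relpath(match.group(1), css_dir)

    # Only touches references starting with '/'; relative and http(s) URLs
    # are left alone.
    return re.sub(r"url\(\s*(/[^)\s]+)\s*\)", make_relative, css_text)

css = "body { background: url(/ui/common/img/background.png); }"
print(relativize_css_urls("/ui/demoapp/css/theme.css", css))
# body { background: url(../../common/img/background.png); }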

How can I save a "complete" HTML file as single file?

Are there any utilities or web browsers that can save a file and referenced resources as a single HTML file?
With most web browsers or wget, there's the option to download the required CSS and images as separate files. Is there a way to automatically inline the CSS and images?
I have made a Python script for this. Up to now, it covers my own needs perfectly. Hope it's useful.
https://github.com/zTrix/webpage2html
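The general idea behind tools like that is usually to fetch every referenced resource and embed it in the page, for example images as base64 data: URIs and stylesheets as inline <style> blocks. A rough sketch of just the image part, assuming lxml is installed and the images sit next to the local HTML file:

import base64
import mimetypes
import os

from lxml import html  # assumption: lxml is installed

def inline_local_images(page_path, output_path):
    """Rough sketch: embed local <img> files as base64 data: URIs."""
    tree = html.parse(page_path)
    base_dir = os.path.dirname(os.path.abspath(page_path))
    for img in tree.xpath("//img[@src]"):
        src = img.get("src")
        if "://" in src or src.startswith("data:"):
            continue  # leave remote or already-inlined images alone
        path = os.path.join(base_dir, src)
        mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
        with open(path, "rb") as f:
            data = base64.b64encode(f.read()).decode("ascii")
        img.set("src", "data:%s;base64,%s" % (mime, data))
    tree.write(output_path, method="html")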
MHTML is the format for this.
http://en.wikipedia.org/wiki/MHTML
This web extension might help you.
https://github.com/gildas-lormeau/SingleFile
"It helps you to save a complete web page into a single HTML file."
It is available for almost all popular browsers.
Safari (on both Windows and Mac) can create .webarchive files.
Link:
http://en.wikipedia.org/wiki/Webarchive
If you have access to wget, then you likely have access to a tar utility too. While it won't give you a browser-readable single file, if you wget a page and then tar up all of the downloaded artifacts, you effectively have a 1-file version of everything needed for that page.
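If you prefer to stay in Python rather than call tar directly, the bundling step is just the standard tarfile module (the directory name here is hypothetical):

import tarfile

# Bundle everything wget downloaded into a single compressed archive.
with tarfile.open("page-bundle.tar.gz", "w:gz") as tar:
    tar.add("downloaded-page-dir", arcname="page")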

mass change link in html website

I took over an old HTML-based site with all hard-coded links, no frames, etc. There's who knows how many pages that have a link to abc.html (<-- example).
I've been asked to go through the pages and change the abc.html link to 123.html (<--another example).
I could download the entire site via FTP, then use find and replace to go through all the files, then upload the changes.
Problem is the site is poorly organized and heavily nested, so there's probably several hundred MB of junk I'd have to download just to be sure.
The other option is to change the html code of abc.html and put in something like
We've moved, you are currently being redirected.
And use some sort of redirect.
Anyone have any other ideas on how to do this?
Why not use software such as Actual Search and Replace?
You will need to return HTTP 301 Moved Permanently on old links so that the search engines know that the content has moved and not just disappeared.
I made a list of all the files that contained the old link using
grep -lir "some text" *
(above taken from commandlinefu.com)
I then used the following command to replace all the matching text accordingly.
find . -name '*.html' -exec sed -ir 's/old/new/g' {} \;
(also taken from commandlinefu.com)
I used the sed version as it created backups of the original html files, named *.htmlr (GNU sed reads -ir as -i with the backup suffix r, not as -i -r).
Not ideal as I now have more junk, but I can easily delete them with
rm *.htmlr