Grab resource URL from a web page - html

I am sometimes frustrated when I want to go to a page from my bookmarks and the page has disappeared.
So I want to build a tool which downloads the entire webpage when I bookmark it.
For that, I have to grab the URLs of all resources linked to the page: JavaScript, CSS, images, ...
Here are all the XPath selectors I thought of:
//img[@src]
//link[@href]
//script[@src]
//object[@data]
//iframe[@src]
//video[@src]
//audio[@src]
and also the background images contained in css files.
Could you tell me if I forgot something?

wget does that very well:
wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://domain.tld/webpage.html
https://superuser.com/questions/55040/save-a-single-web-page-with-background-images-with-wget
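If you still want to collect the URLs yourself, the selectors from the question can be sketched with Python's standard library. This is only a minimal sketch: the names RESOURCE_ATTRS, ResourceCollector and collect_resources are made up for illustration, the sample HTML is invented, and background images referenced from CSS files are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute pairs mirroring the XPath selectors in the question.
RESOURCE_ATTRS = {
    "img": "src",
    "link": "href",
    "script": "src",
    "object": "data",
    "iframe": "src",
    "video": "src",
    "audio": "src",
}


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of resources referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    parser = ResourceCollector(base_url)
    parser.feed(html)
    return parser.urls


# Invented sample page, used only to exercise the sketch.
sample = """
<html><head>
<link rel="stylesheet" href="css/site.css">
<script src="/js/app.js"></script>
</head><body>
<img src="logo.png">
<a href="other.html">not a resource</a>
</body></html>
"""

print(collect_resources(sample, "http://domain.tld/dir/webpage.html"))
```

Note that //link[@href] also matches links that are not stylesheets (for example rel="canonical"), so you may want to filter on the rel attribute.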

Related

wget downloads the same html for every version of the website

I'm attempting to download the html using wget for this website:
https://cxcfps.cfa.harvard.edu/cda/footprint/cdaview.html#Footprints|filterText%3D%24filterTypes%3D|query_string=&posfilename=&poslocalname=&inst=ACIS-S&inst=ACIS-I&inst=HRC-S&inst=HRC-I&RA=210.905648&Dec=39.609177&Radius=0.0006&Obsids=&preview=1&output_size=256&cutout_size=12.8|ra=&dec=&sr=&level=&image=&inst=ACIS-S%2CACIS-I%2CHRC-S%2CHRC-I&ds=
Which is a version of the main website:
https://cxcfps.cfa.harvard.edu/cda/footprint/cdaview.html
The only difference from the main website is that the first link takes you to a version that has already searched through a database and displayed the results, which you can see in a table. But when I use wget to download the HTML for the longer link, it gives me exactly the same text as for the main/short link. I'm confused, but maybe I just don't understand enough about HTML. I thought they should be slightly different, with the longer one containing the HTML for the database results, etc.
I also used the --mirror option to download all the necessary files, but they all look the same, too. I've also tried cURL, with the same result. Can someone please explain why this is happening and if it's fixable?
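The two URLs differ only in the part after the #, which is the URL fragment. The fragment is never sent to the server, so wget and cURL request the same cdaview.html in both cases and get the same HTML back. The results table is not part of that HTML: it is built in the browser by the page's JavaScript, which reads the fragment and runs the search. wget and cURL do not execute JavaScript, so the --mirror option will download the static files but the table will not be in them, and searching the saved HTML with grep will not find it. To get the results you need a tool that actually runs the page's JavaScript (for example a headless browser), or you need to find the request the script sends to the server and call that directly.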
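A quick way to see why both downloads are identical is to strip the fragment (everything after the #) from each URL, since HTTP clients do not send the fragment to the server. A minimal Python sketch follows; long_url is a shortened stand-in for the long link in the question, and requested_url is a hypothetical helper name.

```python
from urllib.parse import urldefrag

short_url = "https://cxcfps.cfa.harvard.edu/cda/footprint/cdaview.html"
# Shortened stand-in for the long URL in the question (assumption):
# only the part after '#' differs from short_url.
long_url = short_url + "#Footprints|RA=210.905648&Dec=39.609177&Radius=0.0006"


def requested_url(url):
    """Return the URL without its fragment, i.e. what the server is asked for."""
    return urldefrag(url).url


print(requested_url(long_url) == requested_url(short_url))  # True
print(urldefrag(long_url).fragment)  # the part that only the browser sees
```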

why can my browser still open an html file not served through a static file server?

Just wondering how/why this works: when I make a simple HTML file, link in some CSS, and drag the HTML file into the browser, no static web server is needed for me to view the file.
Why is that so?
I'm looking at my browser's network tab, and no request is made for the CSS file, yet my browser still displays it perfectly.
Is there a way to do without a static file server on the web for HTML, CSS and JS files, like when dragging and dropping a file into a browser?
Just going back and re-questioning the basics here.
Thanks in advance!
Because the link to your CSS file is relative, and your CSS file is accessible locally. Browsers can be used to access local files, not just files on the Internet.
When working with links, you may see just the name of the file referenced, like this:
<a href="file.html">Link</a>
This is known as a relative link. file.html is relative to wherever the document is that is linking to it. In this case, the two files would be in the same folder.
There's a second type of link, known as an absolute URL, where the full path is specified.
Consider a typical absolute website link:
<a href="http://[YOUR WEBSITE]/file.html">Link</a>
With a local file, this would essentially be:
<a href="file://[YOUR WEBSITE]/file.html">Link</a>
The file protocol can be used to access local files.
Since the homepage (presumably index.html) and file.html would live in the same folder both on a web server and on your local machine, <a href="file.html">Link</a> would work in either scenario. With a relative link, the location of the second file is determined automatically from the location of the first file. In my example, index.html would live at file://[YOUR WEBSITE]/index.html, so your browser is smart enough to know to look in file://[YOUR WEBSITE]/ when resolving any relative URLs.
Note that the same scenario applies to any other file! <link> and <script> tags will look for files in the exact same way -- that includes your stylesheet :)
Hope this helps!
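The resolution rule described above can be tried out with Python's urljoin, which applies the same relative-URL logic a browser uses. This is just an illustrative sketch: the paths and the example.com host are made up, and resolve is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve(page_url, reference):
    """Resolve a reference found in a page against that page's own URL."""
    return urljoin(page_url, reference)


# Made-up locations of the same page, opened locally and served over HTTP.
local_page = "file:///home/me/site/index.html"
served_page = "http://example.com/site/index.html"

print(resolve(local_page, "style.css"))   # file:///home/me/site/style.css
print(resolve(served_page, "style.css"))  # http://example.com/site/style.css
```

The same relative reference resolves correctly in both cases, which is why the page works without a server as long as the linked files sit next to it on disk.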
Sounds like you are new to HTML and web development.
It all has to do with relative versus absolute file paths.
Check out these articles and have fun coding! Always remember that Google is your friend; improve your search-fu and you will not have to ask questions like this.
Godspeed.
http://www.geeksengine.com/article/absolute-relative-path.html
http://www.coffeecup.com/help/articles/absolute-vs-relative-pathslinks/
How to properly reference local resources in HTML?

Hiding page names in the browser

When we launch a website, we usually see the page name (menu.php or admin.aspx), but I would like to hide that name and show only the virtual path or just the website name. I don't need it for the first page because I already did that with default.aspx, but I want to implement it for the whole website.
Showing www.abcd.com/faq/ instead of www.abcd.com/faq/faq.html
Note: My code is not MVC code and the server is Apache.
Use .htaccess to rewrite the URL. Millions of tutorials are out there for that ;)
What you are asking for is achieved (on XAMPP, WAMP, LAMP or any other Apache-powered web server setup) using .htaccess rewrite rules. The rules take the URL and break it into parts that can be modified or used as variables to feed other pages, whilst still keeping the URL you typed. Neat, huh!
Showing www.abcd.com/faq/ instead of www.abcd.com/faq/faq.html
Name the file placed in the faq folder simply index.html (not faq.html); then www.abcd.com/faq/
will display the page without the filename. (Make sure you have defined index.html as a valid DirectoryIndex.)
There are more options using mod_rewrite etc., but since you seem to use a pretty static, directory-based navigation layout, that would be the easiest way.

How can I add a downloadable file to my Github.io page?

I have set up my professional website/homepage using GitHub Pages. I know that if this were just HTML being served up from somewhere, my downloadable file would need to be in the directory of my .html file, and then I could reference it in the .html file and link it up. However, since this is served by GitHub through a repository, I am unsure how to do this.
Do I put my downloadable file in my repo under version control like the rest of the project?
If so, what path do I use in the .html file?
Also, I am aware that the Automatic Page Generator makes it possible to hardly touch the HTML, but it seems pretty restrictive as far as customizing where links and other content appears on your page...
You could just link it normally in your html. Commit it to your repository and have users right click to save.
I just tried this on one of my repositories where I put a link to my CSS file.
style.css
I was able to right click the link and download the file.
If you wanted to create a download from the root you would do:
Download File
I'm pushing my repositories manually instead of using the Automatic Page Generator. The steps are pretty straightforward; see "Creating Project Pages Manually" in GitHub Help.
Since this is GitHub Pages, it can also be done like this (in Markdown fashion): [download]({{ site.baseurl }}{% link file.txt %}). It has the advantage of working locally without pushing the file to the repo.

iphone uiwebview download complete page with CSS and Images

In my app there's a UIWebView that loads a URL.
I am using the following line to save the HTML of the loaded page locally, to be able to view it offline:
NSString *html = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementsByTagName('html')[0].innerHTML"];
The problem is that only the HTML of the document gets saved. I also want to save the images and the CSS along with the HTML so that the user sees the page as if they were online.
Just like "save web page complete" or something like that, which we're used to in browsers.
There is no easy way. Regex the HTML using RegexKitLite (http://regexkit.sourceforge.net/RegexKitLite/index.html) and snag all the URLs to .jpg, .gif, .png, .css and .js files and whatever else you need.
Alternatively, call:
NSString *imgUrls = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementsByTagName('img')"];
or something like that, I'm no JavaScript whizz... and then deal with whatever all that returns ;)
Sorry. It's a pain in the rearheinie.
edit:
Save all the images on the iPhone, and also save the HTML file. When you want to reload the page, load the HTML from the file into a string, and then use
- (void)loadHTMLString:(NSString *)string baseURL:(NSURL *)baseURL
to load the HTML string. baseURL specifies the directory or site where the web view will pretend the HTML string you hand it is located. All relative URLs will be resolved against it.
Note, of course, that this only helps for relative URLs, not for absolute ones. So this, in your HTML file, will monkey things up:
<img src="http://google.com/f/r/i/g/img.gif">
while this would be ok:
<img src="f/r/i/g/img.gif">
Again, this whole solution is mucky.
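To make the saved page work with a local baseURL, the absolute image URLs would have to be rewritten to local file names before the HTML is stored. Here is a minimal sketch of that rewriting step, written in Python for brevity rather than Objective-C. It assumes double-quoted src attributes and that file names do not collide; localize_images and IMG_SRC are hypothetical names.

```python
import re
from posixpath import basename
from urllib.parse import urlparse

# Matches the src attribute of <img> tags (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def localize_images(html):
    """Rewrite every img src to a bare file name.

    Returns the rewritten HTML and a dict mapping each original URL to the
    local file name it should be downloaded to.
    """
    downloads = {}

    def replace(match):
        url = match.group(2)
        local_name = basename(urlparse(url).path)
        downloads[url] = local_name
        return match.group(1) + local_name + match.group(3)

    return IMG_SRC.sub(replace, html), downloads


page = '<p><img src="http://google.com/f/r/i/g/img.gif"></p>'
new_html, downloads = localize_images(page)
print(new_html)   # <p><img src="img.gif"></p>
print(downloads)  # {'http://google.com/f/r/i/g/img.gif': 'img.gif'}
```

Each URL in downloads would then be fetched and saved under its local name in the directory that baseURL points to.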
You might look into a pre-existing open source recursive HTML spider. I think wget does what you want, but I doubt it can be compiled for iPhone without a lot of hassle.
I didn't have time to check, but ASIWebPageRequest seems very promising. It states it can "Store a complete web page in a single string, or with each external resource in a separate file referenced from the page"
ASIWebPageRequest
I will be using it on one of my projects, and then update thread.
Gonso