Serve different resource depending on full URL of requesting page - html

Let's say that we have two pages:
https://www.example.com/first/firstpage.html
https://www.example.com/second/secondpage.html
that both load the resource https://www.example.com/resource.js
If I want the server that serves resource.js to be able to serve a different version of resource.js depending on which page the request is coming from, is there a reliable header upon which the full URL of the requesting page can be determined (or maybe there is some other way to determine this)?
I know that there is an Origin header, but from my understanding this just represents the domain (and any subdomains) without the full URL and query string. Is there any way for the server to know the full URL and query string that the request for the resource is coming from?
If this isn't possible, I know it would be easy to include that info in the JS script tag as follows:
<script src="/resource.js?origin=/first/firstpage.html"></script>
But I don't want to have to modify the script tag for each page. Is there some other way to have the page automatically include it's own URL in the query string of the resource request (without having to dynamically load the resource using my own JS script - HTML only please!), or just any unique identifier so that the script tag doesn't have to be modified individually on each page?

There's the Referer header that you can use.
Make sure that your response uses Vary: Referer. Otherwise, browsers are going to cache this resource as if the referring page URL didn't matter.
I'd plead of you not to do this at all though. You're going to create a rabbit hole of problems, as not all browsers or proxy servers are well behaved. Some are going to aggressively cache this anyway, no matter what you do with the Vary header.

Related

Why does View Source issue a new HTTP request?

I've noticed that both Firefox and Chrome issue a new HTTP request when you view the source for a web page that you've already loaded. It's particularly annoying when the page itself is slow to load or if it won't load at all.
Why is that? Wouldn't they have the existing source for the originally received page cached already? Is it based on Cache-Control headers?
This has been on my mind for a while (usually, comes up when looking at what's behind slow web apps).
In the context of Chrome, according to this link it is indeed base on Cache-Control headers.
...view-source grabs the html source from the http cache
and pretty-prints it, but for the page NOT in http cache, it's 'forced to'
make a new request.
To me, this makes sense. You wouldn't want to use what is currently rendered as the source of truth as obviously the HTML can be manipulated dynamically. If you can't use this, then the http cache would be the next likely candidate for the source. If the source in unavailable from cahce, a subsequent GET of the source seems to be the only alternative.
This does, however, introduce another interesting delima raised here.
Requesting the URL again doesn't make sense as there is no guarantee that source received during the second request will match what was received during the first request.
I would imagine this was a conscious trade-off that was made to ensure that a view-source request is always satisfied in some form or another.
You need to do "Inspect Element" for the live web page. Show-code reloads the page to show the source code without modification.

What does #/ means in url?

I am working on ROR web apps. My webpage url looks like below-
http://dev.ibiza.jp:3000/facebook/report?advertiser_id=2102#/dashboard
Here I understood that advertiser_id is 2102 but I couldn't understand what #/dashboard is pointing to?
The portion of the URL which follows the # symbol is not normally sent to the server in the request for the page. If you open your web inspector and watch the request for the page, you will see that the #/dashboard portion is not included in the request at all.
On a normal (basic HTML) web page, the # symbol can be used to link to a section within the page, so that the browser jumps down to that section after the page loads.
In fancy javascript-heavy web applications, the # symbol is commonly used followed by more URL paths, for example www.example.com/some-path#/other-path/etc the other-path/etc portion of the URL is not seen by the server, but is available for Javascript to read in the browser and presumably display something different based on that URL path.
So in your case, the first part of the URL is a request to the server:
http://dev.ibiza.jp:3000/facebook/report?advertiser_id=2102
and the second part of the URL could be for Javascript to display a specific view of the page once it has loaded:
#/dashboard
The # symbol is also used to create a Fragment Identifier and is also typically used to link to a specific piece of content within a web page (such as to cause the browser to jump down to a particular section on the page).
As others have mentioned, this has SEO implications. In order to index pages such as this, you may have to employ different techniques to allow the content that is "behind the # symbol" to be accessible to search engines.
# symbol is called anchor, it redirects to a specific position on the html page.
It's a crawling technique , you could read more Here
Providing another example
Here's a request to github for the sourcecode of a java class
https://github.com/spring-cloud/spring-cloud-consul/blob/master/spring-cloud-consul-discovery/src/main/java/org/springframework/cloud/consul/serviceregistry/ConsulServiceRegistry.java
By appending this with "#L90" the web browser will make the same request, and then scroll to line 90 and highlight the code.
https://github.com/spring-cloud/spring-cloud-consul/blob/master/spring-cloud-consul-discovery/src/main/java/org/springframework/cloud/consul/serviceregistry/ConsulServiceRegistry.java#L90
Your web browser made the same request to the github server, but in the anchored case, performed the additional action of highlighting the selected line after the response was received.
after # is the hash of the location; the ! the follows is used by search engines to help index AJAX content. After that can be anything, but is usually rendered to look as a path (hence the /)

Exclude page self by appcache

I have an appcache (with NETWORK *). So now I visit my page with <html manifest="/cache.appcache">. Then the page itself is cached as all the images are. But I want the page self to not be cached. How can I do this? I thought NETWORK * would do the trick.
Regards,
Kevin
The appcache manifest always caches the master page.
If you are using Chrome check the cached files for your page here: chrome://appcache-internals
A workaround could be to put a hidden iframe somewhere on your page, which contains the appcache file to cache offline content. (take a look at "Preventing the application cache from storing masters with an iframe" here: http://labs.ft.com/2012/11/using-an-iframe-to-stop-app-cache-storing-masters/ )
A better solution could be to write your page to fetch new content from your server when it is opened - if the server cannot be reached, it can serve the last known content from the HTML5 local storage.
I have tried the iframe work around, and find it ripe with errors. Most browsers cache the data for the iframe where the page cannot get it.
Instead make the page's content load via AJAX. Basically have a blank html page with the manifest and javascript which pulls and adds its content from the server. This way only the blank html is cached, and content is always updated from the server.
Converting a page to this method can be very difficult, but it works. Making sure the appropriate javascript gets run at the correct time, probably requires some detangling. Moving around server code which won't be called when pulling from cache to the new ajax method.
Note: no need to pull conditional content from the server if the condition is in the query string, different query strings make a separate cache

Gap.com is redirecting me when I try to Screen Scrape

We are building a site that allows users to collect and store their favorite products from all over the Internet to one spot. We have an algorithm that filters out and finds the correct image by reading the source code. 80% of the sites work correctly but 2 large companies are blocking redirecting us from a product page to their homepage.
For example this product http://www.gap.com/browse/product.do?pid=741123&kwid=1&sem=false&sdReferer=http://www.gap.com/products/graphic-ts-toddler-boy-clothing-C35792.jsp# picks up the header for gap.com main page and not for the product at hand.
How do we get around this redirect and allows our algorithm to collect the correct image by reading the correct source code?
First, you might ask a lawyer to study the terms of service of your target web sites, and make sure that you won't run into legal problems.
On the technical side, set the Referer [sic] header when requesting the image. The referrer for an image should be the page in which it is embedded. The server may check that to ensure that the image is being requested to satisfy a page render by a browser, rather than a image-harvesting screen scraper.
After a bit of testing with the image in question, it doesn't look the Referer header is required. Perhaps it is simply rejecting an unfamiliar user-agent, or is keying off some other oddity in the request, like a missing Accept header, etc.
I'd imagine you need to change your scraper's user agent string to something that looks like a normal browser (you're probably sending a string like curl or wget by default).
There's a good chance, though, that if you're sending enough traffic their way they'll eventually notice and shut you down in a harder-to-circumvent manner.

Problem with modifying a page with ajax, and the browser keeping the unmodified page in cache

I have a situation where my page loads some information from a database, which is then modified through AJAX.
I click a link to another page, then use the 'back' button to return to the original page.
The changes to the page through AJAX I made before don't appear, because the browser has the unchanged page stored in the cache.
Is there a way of fixing this without setting the page not to cache at all?
Thanks :)
Imagine that each request to the server for information, including the initial page load and each ajax request, are distinct entities. Each one may or may not be cached anywhere between the server and the browser.
You are modifying the initial page that was served to you (and cached by the browser, in most cases) with arbitrary requests to the server and dynamic DOM manipulation. The browser has to capacity to track these changed.
You will have to maintain state, maybe using a cookie, in order to reconstruct the page. In fact, it seems to me that a dynamically generated document that you may wish to move to and from should definitely have a workflow defined that persists and retrieves it's state.
Perhaps set a cookie for each manipulated element with the key that was sent to the server to get the data?