I know a 301 redirect is for a permanent change, and 302 is for a temporary change.
What code should I use when the page is offline for a number of weeks and I am redirecting to the homepage in the meantime? The page should be back up in a few weeks.
If you want to redirect, use a 302. If you don't want to redirect, you could send 503 Service Unavailable and set a Retry-After header (which should hopefully prevent search engines from coming back before that time).
If you still want the end-user experience to be a redirect to the homepage, you might, with a heavy heart, consider adding that to the content of your 503 error page with a meta refresh or something JavaScript-based, and hope for the best in terms of what a search engine crawler makes of that.
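For illustration, here is a minimal PHP sketch of that approach (this assumes the page is served through PHP; the retry window and the homepage URL are placeholders, not values from the question):

<?php
// Sketch: answer with 503 and hint at when the page should be back.
http_response_code(503);
header('Retry-After: ' . (14 * 24 * 60 * 60)); // seconds until the page is expected back
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
  <title>Temporarily unavailable</title>
  <!-- Optional meta refresh so human visitors end up on the homepage -->
  <meta http-equiv="refresh" content="5; url=/">
</head>
<body>
  <p>This page is offline for a few weeks. You will be taken to the homepage shortly.</p>
</body>
</html>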
Previous answers suggest that browsers might honour cache and Expires headers set on a 301 response, but since that fails unsafe (a browser that ignores those headers is left with a permanently cached redirect), I wouldn't rely on it. (The standard says the response is "cacheable unless indicated otherwise"; its definition of 302 Found suggests a 302 that is explicitly marked cacheable might be cached, but it wouldn't be the first time browsers don't implement what could be read out of the letter of the RFCs.)
I have a page /data.txt, which is cached in the client's browser. Based on data which might be known only to the server, I now know that this page is out of date and should be refreshed. However, since it is cached, they will not re-request it for a long time (until the cache expires).
The client is now requesting a different page /foo.html. How can I make the client's browser re-request /data.txt and update its cache?
This should be done using HTTP or HTML (not all clients have JS).
(I specifically want to avoid the "cache-busting" pattern of appending version numbers to the /data.txt URL, like /data.txt?v=2. This fills the cache with useless entries rather than replacing expired ones.)
Edit for clarity: I specifically want to cache /data.txt for a long time, so telling the client not to cache it is unfortunately not what I'm looking for (for this question). I want /data.txt to be cached forever until the server chooses to invalidate it. But since the user never re-requests /data.txt, I need to invalidate it as a side effect of another request (for /foo.html).
To expand on my comment:
You can use If-Modified-Since and ETag, and to invalidate a resource that has already been downloaded you can look at the different approaches suggested in "Clear the cache in JavaScript" and "fetch(), how do you make a non-cached request?". Most of the suggestions there amount to fetching the resource from JavaScript with caching disabled, e.g. fetch(url, {cache: "no-store"}).
Or you can try sending a Clear-Site-Data header, if your clients' browsers support it.
Or maybe give up, this time only, and use the cache-busting solution. And if it's possible for you, rename the file to something else rather than adding a query string, as suggested in "Revving Filenames: don’t use querystring".
Update after clarification:
If you are not maintaining a legacy implementation with users that already have /data.txt cached, the use of the ETag and If-Modified-Since headers should help.
And for the users with the cached version, you may redirect from /foo.html to /newFile.txt or /data.txt?v=1. The new requests will have the newly added headers.
The first step is to fix your cache headers on the data.txt resource so it uses your desired cache policy (perhaps Cache-Control: no-cache in conjunction with an ETag for conditional validation). Otherwise you're just going to have this problem over and over again.
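As a rough sketch of that first step (this assumes data.txt is routed through a PHP script rather than served as a static file; the path is a placeholder):

<?php
// Sketch: serve data.txt with an ETag and conditional validation.
$path = __DIR__ . '/data.txt';
$etag = '"' . md5_file($path) . '"';

header('Cache-Control: no-cache');   // cache it, but revalidate before each reuse
header('ETag: ' . $etag);

// If the client's cached copy is still current, answer 304 with no body.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    http_response_code(304);
    exit;
}

header('Content-Type: text/plain');
readfile($path);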
The next step is to get clients who have it in their cache already to re-request it. In general there's no automatic way to achieve this, but if you know they're accessing foo.html then it should be possible. On that page you can make an AJAX request to data.txt with the Cache-Control: no-cache request header. That should force the browser to bypass the cache and get a fresh version, and the cache should then be repopulated with the new version.
(At least, that's how it's supposed to work. I've never tried this, and I've seen reports here that browsers don't handle Cache-Control request headers properly.)
I'm trying to get data from this site: [1] https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
I found this link where I can get the data in JSON format: [2] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
But there is a problem:
The JSON link doesn't work every time; in fact, sometimes I get a 404 error.
I noticed that if I open the first link [1] before opening the second [2] it works perfectly.
This error is also more frequent when I try to scrape other data on the same site: [3] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio/piu-giocate/u-o-goal?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
With link [3] I try to get all the "u-o-goal" odds, but it works only if, before starting my scraping program, I press the "U/O GOAL" button on the main page [1] -> https://i.stack.imgur.com/Nei5u.png
In my code, I'm using Java and htmlunit to scrape the data.
My question is: how does this webpage work, and why can't I open links [2]/[3] directly? I know there is some sort of request-and-approval system behind it, but I can't see where.
You cannot open these URLs directly, since the website (and many like it) uses cookies and bot-prevention techniques/session tracking so it can gather data about usage of its website, e.g. it expects a "Referer" header to be set.
I'm not going to code a solution for you but I can at least help you understand what you need to do to get to where you want...
I've attempted to summarise how I'd typically unpick a request like this in order to recreate it, but in essence, you need to understand the sequence of HTTP requests being made (this is how the web works - HTTP requests).
First you typically start with no session cookies and you access the site directly (no referer).
Once you access a website, the server typically responds with a session cookie (a unique session ID that your browser communicates back to the server) so it has some sort of record of your browser having already been in contact.
Your browser may make more requests (asynchronously), and in doing so it typically sends the cookies and the referring URL (usually the base URL will work; just don't use anything that starts with something other than "https://www.eurobet.it").
Anything else you're going to need to figure out yourself. Lots of headers are optional, and lots of query params have defaults.
https://stackoverflow.com/a/64671815/7619034 - here's an answer I've given before that answers this type of question which comes up often enough.
So, to explain a bit further for your specific scenario...
When you access https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI, the server responds with HTTP headers:
...
set-cookie: __cfduid=dd38d***********41125; ...
...
The rest doesn't look that relevant:
Going straight to the other request: https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
This HTTP request takes (as input):
cookie: __cfduid=dd38d***********41125; mbox=session#6661556c.....b6e8cc1fa6f03#1608242987; at_check=true; s_ecid=MCMID%***********2021453010; AMCVS_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1; AMCV_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1075005958%7CMCIDTS%7C18614%7CMCMID%7C91883906030825914429183258312021453010%7CMCAID%7CNONE%7CMCOPTOUT-1608248327s%7CNONE%7CvVersion%7C4.4.1; s_cc=true
...
referer: https://www.eurobet.it/it/scommesse/
...
x-eb-accept-language: it_IT
x-eb-marketid: 5
x-eb-platformid: 1
Cookies are set in the response to an initial request (typically) using the Set-Cookie header, and are then passed back to the server in subsequent requests using the Cookie header.
I'm not certain how many of these values are relevant, but you'd need to figure out where each one came from in the chain of HTTP requests between the initial one and this one, and you'd need to replicate them (see the URL of my previous answer above; warning: this can be time consuming).
The other headers can be set statically most likely since they probably aren't due to change.
If you have access to curl on the command line, you can attempt to reconstruct some of these requests by hand. Some will be time sensitive since cookies do expire after an amount of time (see set-cookie header details for exactly when). Once you've reconstructed a working request, you can then start coding it in your application.
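As a rough illustration of reconstructing that chain in code (a sketch only; exactly which cookies and headers the site insists on is an assumption you would need to verify), a PHP cURL version might look like this:

<?php
// Sketch: replay the chain with cURL, reusing cookies between requests.
$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

// 1. Hit the main page first so the server can set its session cookies.
$ch = curl_init('https://www.eurobet.it/it/scommesse/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEJAR      => $cookieJar,  // store Set-Cookie values here
    CURLOPT_COOKIEFILE     => $cookieJar,
]);
curl_exec($ch);
curl_close($ch);

// 2. Request the JSON endpoint, sending the cookies plus a plausible Referer
//    and the custom x-eb-* headers seen in the browser (values assumed static).
$ch = curl_init('https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $cookieJar,
    CURLOPT_COOKIEFILE     => $cookieJar,
    CURLOPT_HTTPHEADER     => [
        'Referer: https://www.eurobet.it/it/scommesse/',
        'x-eb-accept-language: it_IT',
        'x-eb-marketid: 5',
        'x-eb-platformid: 1',
    ],
]);
$json = curl_exec($ch);
curl_close($ch);

echo $json;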
If you can work all this out you should be able to re-construct the chain of HTTP GET requests to get the JSON data you want. Good luck!
This is not a question like "what happens if I enter a URL in a browser".
My point is, to be specific: if I enter www.google.com or google.com or https://google.com or http://google.com in a browser, the URL is automatically changed to https://www.google.com and the HTML content is loaded. What is the reason?
There are URLs called canonical URLs. Google's "Consolidate duplicate URLs" documentation clearly explains why a website should consolidate duplicate URLs. So, for the same reason, the web application will be written in such a way. Take my website, https://praveen.science/, for example; I personally coded it this way:
<?php
// If the requested URL (scheme + host + path) is not the canonical one,
// redirect to the canonical URL and stop executing the page.
if ((isset($_SERVER['HTTPS']) ? "https" : "http") . "://$_SERVER[HTTP_HOST]$_SERVER[REQUEST_URI]" != "https://praveen.science/") {
    header("Location: https://praveen.science/");
    die();
}
?>
No matter what URL you put, it always goes to https://praveen.science/. This is the "under the hood" reason. Try out the following variations:
http://praveen.science/
https://praveen.science/
http://www.praveen.science/
https://www.praveen.science/
All the above URLs will just go to https://praveen.science/ because of the code above. And that's live code from my website. It's not DNS, just the individual application's routing configuration. This helps search engines and other places avoid duplicate URLs.
The other way to implement this is using .htaccess. (Example: htaccess, Redirect all requests to https://www (Random Question on Stack Overflow)).
It is important to have your users and requests uniformly routed to one single URL. You can read about why this is implemented on many websites here:
Duplicate Content
Consolidate duplicate URLs
Another way to answer your question is to check the request data. When you send a request to any of the URLs other than https://www.google.com/, the response is an HTTP 301 Moved Permanently.
This is one way of having all the requests routed to one single domain (or path) and making sure there are no duplicates.
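For completeness, if you want the redirect in the snippet above to be an explicit 301 Moved Permanently (PHP's Location header sends a 302 by default), a minimal variation would be:

<?php
// Sketch: same canonical-URL check, but with an explicit 301 status code.
$canonical = "https://praveen.science/";
$requested = (isset($_SERVER['HTTPS']) ? "https" : "http")
           . "://$_SERVER[HTTP_HOST]$_SERVER[REQUEST_URI]";

if ($requested != $canonical) {
    // The third argument to header() sets the response status code.
    header("Location: " . $canonical, true, 301);
    exit;
}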
According to the HTTP spec, upon loading a resource that results in a 302 redirect:
...the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.
However, within a single page load, I'm seeing current Chrome and Firefox both resolving subsequent requests to the initial Request-URI to the resolved value from the first request, even when the redirect specifies no caching.
I've set up a minimal repro case here:
http://chrome-302-broke.herokuapp.com/test.html
It's on a free heroku dyno (in case you reach it while it's offline).
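(For reference, the kind of endpoint involved is roughly the following; this is only a sketch of the shape of the repro, not the actual code behind the test page:)

<?php
// Sketch: a 302 that explicitly forbids caching; requesting the same URI
// twice in one page load should therefore hit the server both times.
http_response_code(302);
header('Cache-Control: no-store, no-cache');
header('Location: /target-' . mt_rand()); // a different target each time makes reuse visible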
Am I missing something? It seems like caching the redirect from the initial response, even within the same page load, is taking liberties with the description from the spec. A strict interpretation shouldn't cache this request at all.
Especially with a growing number of web applications that don't navigate between pages for a considerable amount of time, this seems like it would cause problems for an increasing number of use cases.
Is this something I should submit as a bug to Chrome/Firefox?
I'm trying to find the correct HTTP status code for a page whose content is temporarily unavailable; however, there is no redirect. Instead, a message is displayed on the page informing the user that the content is temporarily unavailable.
307 Temporary Redirect isn't applicable as there is no redirect.
404 Not Found might possibly be applicable; however, I'm not sure this is the correct response to give, as the content is found, just not available.
410 Gone isn't applicable as the content will be available again some time in the future.
None of the other codes seemed even remotely applicable. Does anyone know the correct code to use and can explain why?
It sounds like the 4XX series of responses are appropriate here. From the RFC:
The 4xx class of status code is intended for cases in which the
client seems to have erred. Except when responding to a HEAD request,
the server SHOULD include an entity containing an explanation of the
error situation, and whether it is a temporary or permanent
condition.
With this in mind, I think 403 Forbidden is the most appropriate:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it.
Authorization will not help and the request SHOULD NOT be repeated.
If the request method was not HEAD and the server wishes to make
public why the request has not been fulfilled, it SHOULD describe the
reason for the refusal in the entity. If the server does not wish to
make this information available to the client, the status code 404
(Not Found) can be used instead.
I suggest this for three reasons:
It's not an exotic code, so it will work fine in the browser. This, to me, is the most important reason - you will be able to serve a page that explains why the content isn't available, and you can be fairly certain it will be displayed correctly.
It's appropriate for the server to say "I understand your request, but I won't serve you that content at this time", and that's exactly what the first two lines of the description say.
It doesn't explicitly say "forget you ever knew about this content" to any robots (or, for that matter, people).
For completeness, here's why I ruled out the other response code categories:
2XX Success: This class of status code indicates that the client's request was
successfully received, understood, and accepted.
But, we're not accepting the request in this case. I don't think 2XX is right.
3XX Redirection: This class of status code indicates that further action needs to be
taken by the user agent in order to fulfill the request.
I suppose that you could argue "further action" to mean "please wait until the content is available before trying again", but reading the other 3XX codes, "further action" usually means "immediate redirect", which as you've already pointed out, isn't appropriate.
5XX Server error: Response status codes beginning with the digit "5" indicate cases in
which the server is aware that it has erred or is incapable of
performing the request.
Nothing has gone wrong on the server, you just don't want to serve the content right now.
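If it helps, a minimal PHP sketch of serving such a response (the wording of the explanation page is just a placeholder) could be:

<?php
// Sketch: refuse to serve the content for now, but explain why in the body.
http_response_code(403);
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<body>
  <h1>Content temporarily unavailable</h1>
  <p>This content is not available right now, but it has not been removed. Please check back later.</p>
</body>
</html>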
HTTP status code 204, i.e. No Content.
Read more about it here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html, section 10.2.5.
In other words:
The server successfully processed the request, but is not returning any content.
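For reference, a minimal PHP sketch of sending it:

<?php
// Sketch: acknowledge the request but return no body at all.
http_response_code(204);
exit;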