I'm trying to scrape products from a site (e.g. https://www.violetgrey.com/en-us/shopping/the-rich-cream-18105401). While the page loads normally in the browser, copying the initial request as curl and replaying it gives me Access Denied. This is all done in a local environment. So far, before copying the curl request from the browser dev tools, I have:
Disabled JS for the site
Cleared all my cache and cookies
Tried different browsers
Still, it's the same result: blocked via curl, even though the exact same request works in my browser. Could anyone please point me in the right direction?
If you look at the response headers you can see the response comes from Cloudflare.
Cloudflare is evil. IMHO.
The HTTP status is 403 (HTTP/2 403), which means Forbidden.
The body is the text:
error code: 1020
Error 1020 can be roughly translated to "take your curl and go elsewhere. You and your curl are not wanted here."
Cloudflare profiles and fingerprints browsers. For example, they monitor the SSL/TLS handshake, and if your curl does not do the handshake exactly like the browser named in your User-Agent, they give you a 403 Forbidden and error code 1020.
And your request does not reach violetgrey.com. They do not even know you tried.
Cloudflare is political and blocks whatever traffic they want to. If it is in their best interest not to let you through, they block you. For example, Cloudflare blocked me from accessing the US Patent and Trademark Office site. Not only that, but they sent out 3 XHR beacon requests to YouTube and Google Play. My Firefox blocked those requests. Cloudflare and Google are closely related. I do not trust either one of them.
There is no shortage of articles about your problem and possible fixes. Just search for "Cloudflare 403 forbidden 1020 error", and maybe don't use Google to do the search.
Here is my attempt to scrape your URL. I tried a few things, like various User-Agents, and I also tried wget.
Request headers:
GET /en-us/shopping/the-rich-cream-18105401 HTTP/2
Host: www.violetgrey.com
user-agent: Mozilla/5.0 (X11; NetBSD amd64; rv:16.0) Gecko/20121102 Firefox/16.0
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
accept-language: en-US,en;q=0.5
accept-encoding: gzip, deflate, br
dnt: 1
alt-used: www.violetgrey.com
connection: keep-alive
upgrade-insecure-requests: 1
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: cross-site
sec-fetch-user: ?1
te: trailers
Response headers:
HTTP/2 403
date: Thu, 27 Oct 2022 23:56:19 GMT
content-type: text/plain; charset=UTF-8
content-length: 16
x-frame-options: SAMEORIGIN
referrer-policy: same-origin
cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
expires: Thu, 01 Jan 1970 00:00:01 GMT
server-timing: cf-q-config;dur=4.9999998736894e-06
vary: Accept-Encoding
server: cloudflare
cf-ray: 760f5e1ced6e8dcc-MIA
alt-svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400
Response Body:
error code: 1020
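For comparison, here is a minimal sketch of the same request replayed from Python with the requests package (my substitution, not something from the question). Even with the browser's headers copied over, it will most likely still come back as a 403 with "error code: 1020", because the TLS handshake of a scripted HTTP stack gets fingerprinted just like curl's:

# Minimal sketch: replay the browser's headers from a script. Matching the headers
# is NOT enough -- Cloudflare also fingerprints the TLS handshake, so expect the
# same 403 / error code 1020 as with curl.
import requests

url = "https://www.violetgrey.com/en-us/shopping/the-rich-cream-18105401"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; NetBSD amd64; rv:16.0) Gecko/20121102 Firefox/16.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Fetch-User": "?1",
    "DNT": "1",
}

resp = requests.get(url, headers=headers, timeout=30)
print(resp.status_code)   # most likely 403
print(resp.text[:100])    # most likely "error code: 1020"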
Related
I am trying to inspect an HTTP request made in my browser with the Chrome dev tools. I want to see the response, but it seems to be empty ("failed to load data"), even though the Content-Length is 4464326.
Below are the HTTP response headers:
HTTP/1.1 200 OK
Last-Modified: Tue, 21 Jan 2014 12:08:30 GMT
ETag: "3ba731138c01f1ad6536bc1d4030cfdd"
Content-Type: audio/mpeg
Server: AmazonS3
Content-Length: 4464326
Accept-Ranges: bytes
Date: Wed, 04 May 2016 13:02:20 GMT
Via: 1.1 varnish
Age: 153069
Connection: keep-alive
Cache-Control: no-cache, no-store, private
X-Served-By: cache-fra1226-FRA
X-Cache: HIT
X-Cache-Hits: 2
X-Timer: S1462366940.863431,VS0,VE0
Plus, in the timeline, I see a download time of 4.61s, so I guess the response was not actually empty.
I also tried with Fiddler, but the response is still unavailable. Does someone have an explanation, or, even better, know how I can read the response?
As you can see,
Content-Type: audio/mpeg
Can the Chrome dev tools read this type of content?
When you are streaming a video, the Response panel shows as empty because the request hasn't completed yet. Once completed, it displays "Failed to load response data", since the data is likely to be too large to display as text in the panel.
You could try Wireshark to capture the HTTP response data.
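If you mainly want the bytes rather than a dev-tools view, you can also fetch the resource outside the browser and inspect the file on disk. A minimal sketch with Python's requests package; the URL is a placeholder, since the real one isn't shown in the question:

# Minimal sketch: download the audio response to disk and compare the byte count
# with the Content-Length header (4464326 in the question).
import requests

url = "https://example.com/path/to/audio.mp3"  # placeholder URL
resp = requests.get(url, stream=True, timeout=60)
resp.raise_for_status()

size = 0
with open("audio.mp3", "wb") as f:
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        f.write(chunk)
        size += len(chunk)

print("Content-Length header:", resp.headers.get("Content-Length"))
print("Bytes actually received:", size)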
I'm experiencing an issue with Chrome that I can't seem to fully understand, and I'm curious whether folks here have dealt with it before. It doesn't reproduce in Firefox. The steps are as follows:
Start incognito Chrome, navigate to https://foo.mysite.com and have the JS on the page make a GET ajax request to S3 for https://s3.amazonaws.com/mystuff/file.json . You get back a 200 response with:
HTTP/1.1 200 OK
x-amz-id-2: somestuffhere
x-amz-request-id: somestuffhere
Date: Tue, 14 Oct 2014 03:06:41 GMT
Access-Control-Allow-Origin: https://foo.mysite.com
Access-Control-Allow-Methods: GET
Access-Control-Max-Age: 3000
Access-Control-Allow-Credentials: true
Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
Cache-Control: max-age=86400
Content-Encoding: gzip
Last-Modified: Sun, 05 Oct 2014 00:29:53 GMT
ETag: "fe76607baa40a793eb3b3cbd373a3fb8"
Accept-Ranges: bytes
Content-Type: application/json
Content-Length: 5609
Server: AmazonS3
Open a second tab, navigate to https://bar.mysite.com and have its JS make a GET ajax request to S3 for the same file https://s3.amazonaws.com/mystuff/file.json . Get back the following 304 response:
HTTP/1.1 304 Not Modified
x-amz-id-2: somestuffhere
x-amz-request-id: somestuffhere
Date: Tue, 14 Oct 2014 03:06:58 GMT
Access-Control-Allow-Origin: https://bar.mysite.com
Access-Control-Allow-Methods: GET
Access-Control-Max-Age: 3000
Access-Control-Allow-Credentials: true
Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
Cache-Control: max-age=86400
Last-Modified: Sun, 05 Oct 2014 00:29:53 GMT
ETag: "fe76607baa40a793eb3b3cbd373a3fb8"
Server: AmazonS3
Open a third tab, navigate to https://foo.mysite.com (the first site) and repeat the same steps as in 1. Chrome kills the response for CORS reasons and reports the following:
XMLHttpRequest cannot load https://s3.amazonaws.com/mystuff/file.json. The 'Access-Control-Allow-Origin' header has a value 'https://bar.mysite.com' that is not equal to the supplied origin. Origin 'https://foo.mysite.com' is therefore not allowed access.
What's the story here? This doesn't reproduce in Firefox. In Firefox I'm happily getting a 304 in both steps 2 and 3, which I would expect to see in Chrome as well.
A temporary workaround for this issue in Chrome is to set Cache-Control: no-cache on the file in S3, but then I'm forcing our clients to be re-downloading that file for no good reason, so it's not a real solution.
Is this intended and documented behavior? Is this a bug with Chrome? Any other thoughts?
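Regarding the temporary workaround mentioned above (setting Cache-Control: no-cache on the file in S3), here is a minimal boto3 sketch of how that can be done on an existing object, assuming the bucket and key from the question's URLs; copy_object with MetadataDirective="REPLACE" rewrites the object's metadata in place:

# Minimal sketch of the temporary workaround (not a real fix): set
# Cache-Control: no-cache on the existing S3 object so the browser stops reusing
# a cached response whose Access-Control-Allow-Origin was stored for another origin.
import boto3

s3 = boto3.client("s3")
bucket, key = "mystuff", "file.json"  # taken from the question's URLs

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    MetadataDirective="REPLACE",   # required to overwrite metadata on a self-copy
    CacheControl="no-cache",
    ContentType="application/json",
)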
Looks like this is caused by Chromium issue 260239
I found this blog post that helped: Add Vary headers to S3.
It helped by adding Vary headers to all XHR requests.
I did run into a problem with HTML-initiated requests (e.g. an <img> tag), but I was able to overcome that by using hack #2 described here: https://serverfault.com/a/856948
The TL;DR of hack #2 is to use a "dummy" query string parameter that differs between HTML and XHR requests, or is absent from one of them. Example:
<img src="https://s3.png?x-request=html">
I just add a timestamp to the request URL to force loading the asset from S3 again rather than from the cache, such as xxxx?timestamp=yyyy
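The last two suggestions boil down to the same idea: give each context its own URL so the browser cannot reuse a cache entry whose Access-Control-Allow-Origin was stored for a different origin. A minimal sketch of the URL construction (in Python purely for illustration; in practice you would build the URL in the page's JavaScript):

# Minimal sketch of the "different URL per context" idea behind the last two answers.
# A dummy query parameter (or a timestamp) makes the cache key unique, so a stale
# CORS response cached for another origin is never reused.
import time
from urllib.parse import urlencode

BASE = "https://s3.amazonaws.com/mystuff/file.json"

def cache_busted_url(context: str) -> str:
    # context could be "html" vs "xhr", the requesting origin, or just a timestamp
    return BASE + "?" + urlencode({"x-request": context, "ts": int(time.time())})

print(cache_busted_url("xhr"))
# e.g. https://s3.amazonaws.com/mystuff/file.json?x-request=xhr&ts=1700000000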
We have an ASP.NET website (https://website.example.com) which loads external libraries (CSS and JavaScript) from a different sub-domain (https://library.example.com).
The resources we are loading via the library project are only CSS files and JavaScript plugins, which themselves don't make any requests (via AJAX).
Testing the website in normal environments, everything works fine.
However, opening it in Internet Explorer 8 returns an error:
internet explorer has modified this page to prevent cross site scripting
Could the fact that we are referencing external resources cause the error?
If yes, what would be the solution to fix this problem?
I think 90% of websites load resources from external domains (CDN servers, for example).
Here's one way: configure the X-XSS-Protection header on your server.
Sending X-XSS-Protection: 0 tells IE to disable its XSS filter for your site (the Google response below instead enables it in blocking mode with 1; mode=block).
It looks something like this:
GET / HTTP/1.1
HTTP/1.1 200 OK
Date: Wed, 01 Feb 2012 03:42:24 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=6ddbc0a0342e7e63:FF=0:TM=1328067744:LM=1328067744:S=4d4farvCGl5Ww0C3; expires=Fri, 31-Jan-2014 03:42:24 GMT; path=/; domain=.google.com
Set-Cookie: NID=56=PgRwCKa8EltKnHS5clbFuhwyWsd3cPXiV1-iXzgyKsiy5RKXEKbg89gWWpjzYZjLPWTKrCWhOUhdInOlYU56LOb2W7XpC7uBnKAjMbxQSBw1UIprzw2BFK5dnaY7PRji; expires=Thu, 02-Aug-2012 03:42:24 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Please read here for more details
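The headers above are just Google's response; on your own server you have to emit the header yourself. On the asker's ASP.NET/IIS stack that would normally be a customHeaders entry in web.config; purely as an illustration, here is a minimal sketch in Python/Flask (an assumed stack, not the asker's) that sends X-XSS-Protection: 0 to switch the IE filter off:

# Minimal Flask sketch (assumed stack, not the asker's ASP.NET site): attach
# X-XSS-Protection to every response. "0" switches IE's XSS filter off for the site;
# "1; mode=block" (as in Google's response above) enables it in blocking mode.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "<html><body>hello</body></html>"

@app.after_request
def add_xss_header(resp):
    resp.headers["X-XSS-Protection"] = "0"
    return resp

if __name__ == "__main__":
    app.run()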
I have Amazon S3 objects, and for each object, I have set
Cache-Control: public, max-age=3600000
That is roughly 41 days.
And I have Amazon CloudFront Distribution set with Minimum TTL also with 3600000.
This is the first request after clearing cache.
GET /1.0.8/web-atoms.js HTTP/1.1
Host: d3bhjcyci8s9i2.cloudfront.net
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
And Response is
HTTP/1.1 200 OK
Content-Type: application/x-javascript
Content-Length: 226802
Connection: keep-alive
Date: Wed, 28 Aug 2013 10:37:38 GMT
Cache-Control: public, max-age=3600000
Last-Modified: Wed, 28 Aug 2013 10:36:42 GMT
ETag: "124752e0d85461a16e76fbdef2e84fb9"
Accept-Ranges: bytes
Server: AmazonS3
Age: 342557
Via: 1.0 6eb330235ca3971f6142a5f789cbc988.cloudfront.net (CloudFront)
X-Cache: Hit from cloudfront
X-Amz-Cf-Id: 92Q2uDA4KizhPk4TludKpwP6Q6uEaKRV0ls9P_TIr11c8GQpTuSfhw==
Even though Amazon clearly sends Cache-Control, Chrome still makes a second request instead of reading the file from its cache.
GET /1.0.8/web-atoms.js HTTP/1.1
Host: d3bhjcyci8s9i2.cloudfront.net
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
If-None-Match: "124752e0d85461a16e76fbdef2e84fb9"
If-Modified-Since: Wed, 28 Aug 2013 10:36:42 GMT
Question:
Why does Chrome make a second request?
Expires
This behavior changes when I put an explicit Expires header on the object. The browser will not send a subsequent request when Expires is set, but with Cache-Control: public it does. All of my S3 objects are immutable and will never change; when we change a file, we upload it as a new object with a new URL.
In Page Script Reference
Chrome makes subsequent requests only sometimes. I did this test by actually typing the URL in the browser. When the script is referenced by an HTML page, Chrome loads the cached script for a few subsequent requests, but after a while it once again sends a request to the server. There is no disk-size issue here; Chrome has sufficient cache space.
The problem is that we get charged for every request, and I want the S3 objects to be cached forever, loaded from the cache, and never to hit the server again.
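As a point of reference, here is a minimal boto3 sketch (an assumption about the upload step, not the asker's actual code) that uploads a versioned object with both Cache-Control and an explicit Expires date, since the question notes that an Expires header stops the revalidation requests:

# Minimal sketch: upload an immutable, versioned object with both Cache-Control and
# an explicit Expires date. Bucket and file names are placeholders.
import datetime
import boto3

s3 = boto3.client("s3")

with open("web-atoms.js", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",
        Key="1.0.8/web-atoms.js",   # version lives in the key: change the URL, not the object
        Body=f,
        ContentType="application/javascript",
        CacheControl="public, max-age=31536000, immutable",
        Expires=datetime.datetime(2030, 1, 1),
    )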
When you press F5 in Chrome, it will always send requests to the server. These are made with the Cache-Control: max-age=0 header. The server will usually respond with a 304 (Not Modified) status code.
When you press Ctrl+F5 or Shift+F5, the same requests are performed, but with the Cache-Control:no-cache header, thus forcing the server to send an uncached version, usually with a 200 (OK) status code.
If you want to make sure that you're utilizing the local browser cache, simply press Enter in the address bar.
If the HTTP response contains an ETag entry, the conditional request will always be made. The ETag is a cache validator: the client sends it back to the server (in If-None-Match) to check whether the element has been modified.
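That conditional request can be reproduced by hand. A minimal sketch with Python's requests against the CloudFront URL from the question (the distribution may no longer exist):

# Minimal sketch of the conditional request a browser makes when it has an ETag:
# send If-None-Match with the stored validator and expect 304 Not Modified (no body)
# if the object is unchanged.
import requests

url = "https://d3bhjcyci8s9i2.cloudfront.net/1.0.8/web-atoms.js"

first = requests.get(url, timeout=30)
etag = first.headers.get("ETag")

if etag:
    second = requests.get(url, headers={"If-None-Match": etag}, timeout=30)
    print(second.status_code)  # 304 if unchanged, 200 with a full body otherwise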
If Chrome Developer Tools are open (F12), Chrome usually disables caching.
It is controllable in the Developer Tools settings - the Gear icon to the right of the dev-tools top bar.
If you hit the refresh button to load a particular page or resource, an If-Modified-Since request is sent every time. If you instead request the page/resource in a new tab, or via a link in a script or HTML page, it is loaded from the browser cache itself.
This is what happened in my case; maybe it is the general case. I am not completely sure, but this is what I gathered from my digging.
Chrome adds a Cache-Control: max-age=0 header when you use a self-signed certificate. Switching from HTTPS to HTTP will remove this header.
Firefox doesn't add this header.
I'm still facing a weird issue with Google Chrome. I have a text/html page generated by a PHP script. The page loads and displays fine in every popular browser but Chrome. When the source code has just been saved (even if I simply added or removed a space character), Chrome loads and renders the page correctly. But if I then refresh the page, Chrome displays a blank page and shows an error with a "failed" status in the Developer Tools panel, even though the HTTP response headers look fine, including the HTTP status: 200 OK.
HTTP/1.1 200 OK
Date: Mon, 17 Sep 2012 08:37:03 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.3.14
Expires: Mon, 17 Sep 2012 08:52:03 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Below are the HTTP response headers just after saving the source and getting a correct rendering. There is no change (except time-related information):
HTTP/1.1 200 OK
Date: Mon, 17 Sep 2012 08:56:06 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.3.14
Expires: Mon, 17 Sep 2012 09:11:07 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
I also checked the HTTP request headers; they are the same in both cases:
Working case:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cache-Control:max-age=0
Connection:keep-alive
Cookie:PHPSESSID=qn01olb0lkgh3qh7tpmjbdbct1
Host:(hidden here, but correct, looks like subsubdomain.subdomain.domain.tld)
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
Failing case:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cache-Control:max-age=0
Connection:keep-alive
Cookie:PHPSESSID=qn01olb0lkgh3qh7tpmjbdbct1
Host:(also hidden)
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
I noticed that even when the page fails, some other resources (JavaScript, style sheets) are successfully loaded or retrieved from the local cache. I can also view the HTML source successfully every time, whether the page renders or not (and the HTML is what it is expected to contain).
I also ran Wireshark to see whether something was going wrong while transferring the data, but everything seems to be OK on that side too.
I read something (via Google) about Content-Length making Chrome fail if the value provided in the HTTP headers differs from the actual size of the delivered file. That does not seem to be the case here, as Content-Length is not provided.
Any help would be welcome! Thanks!
To those who are still encountering the issue:
add
SetEnv downgrade-1.0
to your .htaccess; for me it works as a temporary fix. It disables chunked transfers (by forcing an HTTP/1.0 response, which cannot use chunked encoding).
Another solution is to set the user agent to IE; apparently that does the same thing...
The actual issue is somehow related to the server configuration; I am also seeing segmentation fault errors.
The site where I am encountering this behaviour runs on PHP 5.3.27 over HTTPS.
Edit: It doesn't matter whether the site is accessed over HTTPS or HTTP.
Edit 2: The cause of this error was one extremely long line in the output; splitting it into multiple lines solved the problem.
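In case it helps anyone with the same root cause, here is a minimal sketch (my own illustration in Python, not the original PHP) of post-processing generated HTML so that no single output line becomes extremely long:

# Minimal sketch: break up extremely long lines in generated HTML before sending it.
# Note: the inserted newlines become whitespace text nodes, which can slightly change
# the spacing between adjacent inline elements, so apply with care.
def split_long_html_lines(html: str, max_len: int = 4096) -> str:
    out = []
    for line in html.splitlines():
        if len(line) > max_len:
            line = line.replace("><", ">\n<")  # crude: break between adjacent tags
        out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    page = "<html><body>" + "<p>x</p>" * 5000 + "</body></html>"
    longest = max(len(l) for l in split_long_html_lines(page).splitlines())
    print("longest line after splitting:", longest)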