HTTPS connections duplicated - google-compute-engine

Additional, unexpected HTTPS connections are being made to GCE servers.
This started on 2nd October and is affecting europe-west1-b and us-central1-b.
We have the same codebase running on servers in Amazon EC2 that are not affected.
Is anyone else seeing issues with HTTPS traffic to GCE?
UPDATE: Clarification of duplicated connections:
A single HTTPS request from a web browser, for example:
GET /favicon.ico HTTP/1.1
results in 5 HTTPS connections being opened; no HTTP request is sent, and they are then closed (before the timeout period).
Then a final connection is opened and the request is sent as it should be.
Note:
This would usually go undetected. However, we only allow 10 SSL connections from a single IP within the space of 1 second (a velocity restriction).
I have temporarily increased this to 20 and everything is working OK.
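For context, the check is conceptually something like this (a simplified Python sketch; the real restriction is enforced at the firewall level, and the names here are illustrative):

import time
from collections import defaultdict, deque

WINDOW = 1.0  # seconds
LIMIT = 10    # max TLS connections per source IP per window (temporarily raised to 20)

recent = defaultdict(deque)  # source IP -> timestamps of recent connection opens

def allow_connection(src_ip):
    now = time.monotonic()
    q = recent[src_ip]
    # Discard timestamps that have fallen out of the 1-second window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False  # reject: too many connections from this IP in the window
    q.append(now)
    return True

With the browser behaviour described above, the five aborted connections plus the real one burn through more than half of the original allowance in well under a second.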
What I don't understand is why this would suddenly start happening and only on GCE servers.
I will update this again when I have looked into the raw SSL traffic.
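In the meantime, counting TCP SYNs to port 443 per source IP is enough to see the duplicated opens. A rough sketch using the third-party scapy package (assumed installed; sniffing requires root):

from collections import Counter
from scapy.all import IP, TCP, sniff

syn_counts = Counter()

def count_syn(pkt):
    flags = pkt[TCP].flags
    if flags.S and not flags.A:  # SYN without ACK = a new inbound connection attempt
        src = pkt[IP].src
        syn_counts[src] += 1
        print(src, syn_counts[src])

sniff(filter="tcp dst port 443", prn=count_syn, store=False)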

60 Second Timeout on Elastic Beanstalk

I have a single-instance (NO load balancer) Docker container (NO proxy server) that times out at exactly sixty seconds no matter what I do.
Yes, I'm aware of the many seemingly "duplicate" questions. I've been trying to solve this problem for 40+ hours. I've seen them all.
Every single answer to these questions informs the user that they must change the settings of NGINX or the load balancer.
However, I have NEITHER NGINX NOR a load balancer for the environment, yet it still times out. I am mostly convinced that this is an AWS bug.
I have an endpoint titled time_test for the mini server I created. When I make a POST request to the endpoint, I get a timeout at exactly 60 seconds (the request throws an exception on my end).
Here's the Python code to make the request.
import requests

# Client-side timeout is deliberately huge; the server still cuts off at 60s.
url = "http://...us-east-1.elasticbeanstalk.com/"  # full domain elided
time_to_sleep = 65  # ask the endpoint to sleep past the 60-second mark
url += f"time_test?time_to_sleep={time_to_sleep}"
response = requests.post(url=url, timeout=10000)
The request raises an exception indicating that the server terminated the response, always at exactly 60 seconds.
However, the logs show a successful response.
My logs (specifically eb-docker/containers/eb-current-app/eb-blahblah-stdouterr.log) show:
[01/Jun/2022 22:05:49] "POST /time_test?time_to_sleep=65 HTTP/1.1" 200 -
Note the 200 successful status code.
I'm going to keep looking for an answer to this problem, which seemingly has none, and will report back if I find one. Any help with how to change the environment to accept requests longer than 60 seconds would be greatly appreciated. Please don't reply "you should have shorter request times"; that is neither helpful nor applicable here.
(Platform = Docker running on 64bit Amazon Linux 2/3.4.10)
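(For reference, the endpoint behaves conceptually like this hypothetical Flask sketch; the actual mini server code isn't shown in this question.)

import time
from flask import Flask, request

app = Flask(__name__)

@app.route("/time_test", methods=["POST"])
def time_test():
    # Sleep for the requested number of seconds, then answer.
    time_to_sleep = float(request.args.get("time_to_sleep", 0))
    time.sleep(time_to_sleep)  # anything over 60s triggers the client-side failure
    return "slept", 200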
Related:
How to increase FastAPI timeout in Docker to be deployed on AWS EB?
Elastic Beanstalk WebSocket Connection Dropped
PHP beanstalk application giving 504 errors
Blazor Server Side - Frequent 504 errors in AWS environment
504 error on aws elastic beanstalk
Deploying ebextensions on Elastic beanstalk and EC2
AWS bug. It magically started working after I reported this issue to support, with no changes on my end. Considering it had stopped working just as magically, that's the conclusion I've come to.

Safari failing to load images - 421 Misdirected Request [duplicate]

Error:
The client needs a new connection for this request as the requested host name does not match the Server Name Indication (SNI) in use for this connection.
I recently purchased an EV SSL certificate from Comodo, installed it on my VPS (cPanel/WHM), and everything worked great. I then upgraded to HTTP/2 and am now receiving the error when switching between the websites on the certificate. The 3 websites share the same IP address, and from what I can tell this may be the issue. I do not want to reissue an SSL cert for each domain, as I paid for the EV multi-domain cert. Is the answer to purchase 2 additional IPs and make sure each domain has its own IP? Or is there a way I can edit the virtual hosts so that I can keep the setup I have now?
I should mention, this is only happening in Safari, not Chrome.
SSL Labs Report
https://www.ssllabs.com/ssltest/analyze.html?d=www.deschutesdesigngroup.com&s=142.4.0.142&hideResults=on
EasyApache HTTP vhost configuration
https://pastebin.com/dNeFRGWJ
EasyApache HTTPS vhost configuration
https://pastebin.com/vgWAD5mg
You have enabled HTTP/2 on only two of the three sites.
HTTP/2 will try to reuse a connection for multiple domains if the IP address matches and the certificate covers all the necessary domains. That is the case here, and so the connection is reused.
However if you run SSLLabs on all three domains you see a slight difference in the protocol used for Chrome (for example):
Chrome 70 / Win 10 RSA 2048 (SHA256) TLS 1.2 > h2
Chrome 70 / Win 10 RSA 2048 (SHA256) TLS 1.2 > http/1.1
Chrome 70 / Win 10 RSA 2048 (SHA256) TLS 1.2 > h2
And similarly further down in the ALPN setting:
ALPN Yes h2 http/1.1
ALPN Yes http/1.1
ALPN Yes h2 http/1.1
So going to the middle domain first will work as it will connect via HTTP/1.1 and so not reuse the connection. However going to the middle domain after initiating a request to either the first or last domain will attempt to reuse the HTTP/2 connection and fail as the middle domain doesn't support HTTP/2.
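You can reproduce the SSLLabs ALPN observation from Python with the standard library ssl module (a minimal sketch; the host names are placeholders for your three domains):

import socket
import ssl

hosts = ["site-one.example", "site-two.example", "site-three.example"]  # placeholders

ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])  # offer both, as a browser does

for host in hosts:
    with socket.create_connection((host, 443), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # None means the server negotiated no ALPN protocol at all.
            print(host, "->", tls.selected_alpn_protocol())

The middle domain should print http/1.1 (or None) while the other two print h2, matching the SSLLabs output above.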
Web servers should return a 421 Misdirected Request status code for any request where the browser attempts to reuse a connection when it shouldn't, to say "Yeah, you really shouldn't be attempting to reuse the connection here! Can you try again on another connection please?". The same thing happens if the SSL/TLS setup differs (e.g. the cipher suite used for the connection is not accepted on the other domain).
Chrome and Firefox correctly handle the 421 response and transparently resend the request over a new connection, which in this case then uses HTTP/1.1 (check the developer tools in the browser and you'll see this is true). Other browsers, including Safari on iOS, have not implemented support for the relatively new 421 status code yet and so fail with an error like the one below:
Misdirected Request
The client needs a new connection for this request as the requested
host name does not match the Server Name Indication (SNI) in use for
this connection.
I presume there is no reason not to enable HTTP/2 on all domains and this was a misconfiguration? If so, enable HTTP/2 on all domains and your issue should be sorted.
If you do not want HTTP/2 on all domains, then ensure the browser doesn't think it can reuse the connection. That means either using a separate IP address for that domain, or getting the certificate reissued for only two domains, with a separate certificate for the one that shouldn't share connections.

Django ERR_EMPTY_RESPONSE

I am currently running a Django site on EC2. The site sends a CSV back to the client, and the CSV varies in size. If it is small, the site works fine and the client is able to download the file. However, if the file gets large, I get an ERR_EMPTY_RESPONSE. I am guessing this is because the connection is aborted without giving the process adequate time to run fully. Is there a way to increase this time span?
Here's what my site is returning to the client.
# Inside the Django view: read the generated CSV and return it as an attachment.
with open('//home/ubuntu/Fantasy-Fire/website/optimizer/lineups.csv') as myfile:
    response = HttpResponse(myfile, content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename=lineups.csv'
    return response
Is there some other argument that can allow me to ignore this error and keep generating the file, even if it takes a while or is large?
I believe you have some sort of proxy server in front of the Django backend which resets the connection, producing ERR_EMPTY_RESPONSE in this case. You should re-configure the timeouts on that proxy. Usually it is nginx or Apache acting as a reverse proxy server.
What is Reverse Proxy Server
A reverse proxy server is an intermediate connection point positioned at a network’s edge. It receives initial HTTP connection requests, acting like the actual endpoint.
Essentially your network’s traffic cop, the reverse proxy serves as a gateway between users and your application origin server. In so doing it handles all policy management and traffic routing.
A reverse proxy operates by:
Receiving a user connection request
Completing a TCP three-way handshake, terminating the initial connection
Connecting with the origin server and forwarding the original request
More info at https://www.imperva.com/learn/performance/reverse-proxy/
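To make those steps concrete, here is a toy version of the accept-then-forward pattern in Python (a simplified sketch, not production code; real reverse proxies such as nginx add HTTP parsing, buffering, and the configurable timeouts discussed in this answer):

import socket
import threading

BACKEND = ("127.0.0.1", 8000)  # hypothetical address of the Django/gunicorn origin

def pipe(src, dst):
    # Copy bytes in one direction until either side closes.
    try:
        while True:
            chunk = src.recv(4096)
            if not chunk:
                break
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            dst.close()
        except OSError:
            pass

def handle(client):
    # Step 3: connect to the origin server and forward the original request.
    # The timeout here plays the role of the nginx/Apache proxy timeouts: if
    # the origin stays silent longer than this, the client gets an empty response.
    upstream = socket.create_connection(BACKEND, timeout=30)
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()
    pipe(client, upstream)

# Steps 1 and 2: receive the user's connection, completing the TCP handshake.
listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen()
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()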
One more possible case: your reverse proxy server doesn't have enough free space to buffer the response from Django and aborts the request. You can also check the free space on your reverse proxy balancer.
Within gunicorn, there is a timeout argument, -t. When you run gunicorn, the default timeout is 30 seconds. Increase it to something you're comfortable with, like 90 or 120 seconds, whatever fits your application.
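For example, via a gunicorn.conf.py (gunicorn configuration files are plain Python; 120 is just an illustrative value):

# gunicorn.conf.py -- read by gunicorn at startup; same effect as -t/--timeout
timeout = 120  # seconds a worker may stay silent before it is killed and restarted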

Azure Traffic Manager Browser Caching Issue

In Azure's Traffic Manager, I am doing some testing with TWO failover URLs: two different endpoints are configured for the Traffic Manager profile (failover1.mysite.com, failover2.mysite.com). However, my local browser (Chrome, for example) seems to be caching the DNS record on its own and redirecting to what it thinks is still the destination, rather than letting Azure Traffic Manager re-route. Trying the request in a new browser or an Incognito session results in the request reaching the correct site. But for existing sessions, failover updates are not registered, and requests still hit the site we are trying to redirect traffic away from. Does anyone have any experience with this?
I had the same issue while dealing with Azure Traffic Manager and AWS CloudFront.
Every DNS record carries a TTL value. It is not something wrong with Azure Traffic Manager; it is the TTL value that lets the DNS client cache the IP address.
How to check the TTL value of a DNS record:
If you are using Windows,
https://support.rackspace.com/how-to/nslookup-checking-dns-records-on-windows/
If you are using Linux, follow the detailed instructions here:
https://www.cyberciti.biz/faq/howto-use-dig-to-find-dns-time-to-live-ttl-values/
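Or check it directly in Python with the third-party dnspython package (a small sketch; the profile name below is a placeholder):

import dns.resolver  # pip install dnspython

# Ask for the A record and print how long resolvers may cache it.
answer = dns.resolver.resolve("myprofile.trafficmanager.net", "A")
print("TTL (seconds):", answer.rrset.ttl)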
Hope it helps.
From Microsoft's overview of their load balancing services:
Traffic Manager is a DNS-based traffic load balancer [...] it load balances only at the domain level. For that reason, it can't fail over as quickly as Front Door, because of common challenges around DNS caching and systems not honoring DNS TTLs.
With Front Door you can route requests to different backends based on rules and/or the health of the backends themselves, so it doesn't have the issue you describe.

Some 502 errors in GCP HTTP Load Balancing

Our load balancer is returning 502 errors for some requests. It is a very low percentage of the total: we have around 36,000 requests per hour and about 40 errors per hour, so roughly 0.1% of requests return an error.
The instances are healthy when the errors occur, and we have added this firewall rule for the load balancer: 130.211.0.0/22 tcp:1-5000, applied to all targets.
It is not a very serious problem because the application tolerates such errors, but I would like to know why they happen.
Any help will be appreciated.
It seems that there is no easy solution for this.
As Mike Fotinakis explains in this blog post (thank you for this info, JasonG :)):
It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.
In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keep-alive timeout to 650s, but this is not possible because each kept-alive connection holds one process open (so this would be a great waste of resources).
UPDATE:
There is now some new documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/
They recommend setting the keep-alive timeout to 620 seconds in both cases, i.e. KeepAliveTimeout 620 for Apache and keepalive_timeout 620s; for nginx.
I had an issue with 502s that was unexplainable after recreating a load balancer and backend config. I recreated my backend and instance group for unmanaged instances, and this seemed to fix the issue for me; I wasn't able to identify any problem in my GCP configuration.
But I had a lot more errors, about 1 in 10 requests. There are load balancer logs that will tell you the cause, and the docs explain the causes.
E.g. mine were:
jsonPayload: { statusDetails: "failed_to_pick_backend" #type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry" }
If you're using nginx, the failures are on POSTs, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your nginx timeouts. See this excellent blog post:
https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.btzyusgi6