Amazon SQS Worker Tier Auto Scaling fails

We have an SQS Worker Tier app subscribed to a queue. When it is running it works fine; however, when it gets busy and scales up, the new instance starts getting messages almost immediately, before it is actually ready. This results in 500 responses and the messages being discarded to the dead-letter queue.
We have our queue configured with a maximum of 1 attempt; because of the database changes a message makes during consumption, we can't simply put it back on the queue in case of error.
I have tried using the health check URL as I would with a normal web app, but this doesn't seem to work: messages continue to be sent regardless.
Is there any way of setting a delay on a new auto-scaled instance before it starts receiving messages from the queue?

I am not sure how the instance is 'getting messages' before it's ready, unless you are actually using SNS to PUSH the messages to the endpoint, as opposed to having the endpoint (instance) PULL the messages from the queue.
If you are pushing messages via SNS, then the easiest solution is to have the instance POLL the SQS queue for messages when it's ready to process them - much safer and more reliable, and obviously the instance can decide for itself when it's ready to do work.
It also sounds to me like your solution is not architected properly. If accidentally processing the same message twice would cause you problems in your database, then you're not using SQS in the correct manner. The work your SQS messages trigger should be idempotent - i.e. it should be possible to process a message more than once without causing problems. Even when everything is running 100% correctly, on your end and at AWS, it's possible that the same message will be sent more than once to your workers - you can't prevent that - and your processing needs to be able to handle that gracefully.
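To make that concrete, here's a minimal sketch in Python with boto3 (the queue URL and the has_been_processed / mark_processed / process helpers are hypothetical placeholders for your own logic) of a worker that pulls messages only when it's ready and treats duplicate deliveries as no-ops:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-work-queue"  # placeholder

def poll_once():
    # Long-poll for messages; nothing is pushed to us,
    # we only receive work when we explicitly ask for it.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        message_id = msg["MessageId"]

        # Idempotency guard: skip work already recorded as done,
        # so a duplicate delivery is harmless.
        if not has_been_processed(message_id):   # assumed helper backed by your DB
            process(msg["Body"])                 # assumed helper with your business logic
            mark_processed(message_id)           # assumed helper backed by your DB

        # Delete only after the work (and its record) is committed.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```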

You can set the HTTP connections setting (Configuration > Worker Configuration) to limit the number of concurrent connections to your worker. If you set it to 1, you're sure that a worker instance won't receive another request until it has responded to the previous one.
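If you'd rather set this in code than in the console, here's a rough sketch with boto3 (the environment name is a placeholder; the option is assumed to be HttpConnections in the aws:elasticbeanstalk:sqsd namespace):

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Limit the worker daemon to a single concurrent request per instance,
# so a freshly scaled instance only gets one message at a time.
eb.update_environment(
    EnvironmentName="my-worker-env",  # placeholder environment name
    OptionSettings=[
        {
            "Namespace": "aws:elasticbeanstalk:sqsd",
            "OptionName": "HttpConnections",
            "Value": "1",
        }
    ],
)
```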

Related

MySQL Aurora connection management with SQS and Lambda

I have a use case in my system where I need to process hundreds of user records nightly. Currently, I have a scheduled Lambda function which pulls all the users to be processed and places each onto an SQS queue. I then have another Lambda function that reads from this queue and handles the processing. Each user requires quite a lot of processing, which uses quite a few connections. I use a MySQL transaction in as many places as I can to cut down on the connections used. I'm running into issues with my Aurora MySQL database hitting the connection limit (1000 currently). I have tried playing around with the batch sizes as well as the Lambda concurrency, but I still seem to run into issues. Currently, the batch size is 10 and concurrency is 1. The Lambda function does not use a connection pool, as I found that caused more issues with connections. Am I missing something here, or is this just an issue with MySQL and Lambda scaling?
Thanks
Amazon RDS Proxy is the solution AWS provides for this: it pools and shares database connections, so a large number of Lambda functions running at the same time don't overwhelm the connection limit of the database instance.
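From the Lambda side the change is small: you point your MySQL client at the proxy endpoint instead of the Aurora cluster endpoint. A rough sketch (the proxy endpoint, database name and credentials are placeholders):

```python
import os
import pymysql

# Connect to the RDS Proxy endpoint rather than the cluster endpoint.
# The proxy multiplexes many Lambda connections onto a small pool against
# the database, so concurrent invocations no longer exhaust max_connections.
conn = pymysql.connect(
    host="my-proxy.proxy-abc123.us-east-1.rds.amazonaws.com",  # placeholder proxy endpoint
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database="users",                                          # placeholder schema
    connect_timeout=5,
)
```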
Alternatively, you could use this trick to throttle the rate of lambdas:
Create another SQS queue and fill it with a finite set of elements. Say 100 elements, for instance. The values you put into this queue don't matter. It's the quantity that is important.
Lambdas are activated by this queue.
When the lambdas are activated, they request the next value from your first SQS queue, with the users to be processed.
If there are no more users to process, i.e. if the first queue is empty, then the lambda exits without connecting to Aurora.
Each lambda invocation processes the user. When it is done, it disconnects from Aurora and then pushes a new element onto the second SQS queue as its last step, which activates another lambda.
This way there are never more than 100 lambdas running at a time. Adjust this value to however many lambdas you want to allow concurrently.
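A rough Python sketch of that pattern (the queue URLs and the process_user helper are placeholders; the Lambda is triggered by the token queue, pulls real work from the first queue, and re-posts a token as its last step):

```python
import boto3

sqs = boto3.client("sqs")
USER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/users-to-process"  # placeholder
TOKEN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/worker-tokens"    # placeholder

def handler(event, context):
    # This Lambda is triggered by the token queue, so at most as many
    # invocations run concurrently as there are tokens in circulation.
    resp = sqs.receive_message(QueueUrl=USER_QUEUE_URL, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])

    if not messages:
        # First queue is empty: exit without touching Aurora and without
        # re-posting a token, so the fleet winds down naturally.
        return

    msg = messages[0]
    process_user(msg["Body"])  # assumed helper: opens, uses, and closes its own DB connection
    sqs.delete_message(QueueUrl=USER_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    # Hand the token back as the last step, which activates the next invocation.
    sqs.send_message(QueueUrl=TOKEN_QUEUE_URL, MessageBody="token")
```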

How does Amazon SQS take care of not sending the same message to different instances of the same service?

I have a queue (in this case Amazon SQS) and there are N nodes of the same service running which are consuming messages from SQS.
How can I make sure that at any point in time, no more than one node has read the same message from the queue?
In the case of Kafka, we know that no more than one consumer from the same consumer group can be assigned to a single topic partition. How do we make sure the same thing is handled by Amazon SQS, or is it?
The Amazon mechanism for preventing a message from being delivered to multiple consumers is the visibility timeout:
Immediately after a message is received, it remains in the queue. To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, a period of time during which Amazon SQS prevents other consumers from receiving and processing the message. The default visibility timeout for a message is 30 seconds. The minimum is 0 seconds. The maximum is 12 hours.
After the message is received, SQS starts the timeout and, for its duration, doesn't deliver the message to other consumers. After the timeout ends, if the message has not been deleted, SQS makes it available again to other consumers.
But as the note says:
For standard queues, the visibility timeout isn't a guarantee against receiving a message twice. For more information, see At-Least-Once Delivery.
If you need absolute guarantees of processing a message only once, you have two options:
Design your application to be idempotent, so that the result is the same if it processes the same message one or more times.
Use an SQS FIFO queue, which provides exactly-once processing.
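To make the mechanics concrete, here's a small Python/boto3 sketch (the queue URL and the handle helper are placeholders): the received message stays invisible to other consumers for the visibility timeout, and is only gone for good once you delete it.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    VisibilityTimeout=60,   # hidden from other consumers for 60 seconds
    WaitTimeSeconds=20,     # long polling
)

for msg in resp.get("Messages", []):
    handle(msg["Body"])  # assumed helper: your processing logic, ideally idempotent

    # If we crash before this call, the message reappears after 60 seconds
    # and another consumer may receive it - hence the at-least-once caveat.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```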

SSIS - Script Component pulling information from RabbitMQ

A question that might be mostly theoretical, but I'd love to have my concerns put to rest (or confirmed).
I built a Script Component to pull data from RabbitMQ. On RabbitMQ, we basically set up a durable queue. This means messages will continue to be added to the queue, even when the server reboots. This construction allows us to periodically execute the package and grab all "new" messages since the last time we did so.
(We know RabbitMQ isn't set up to accommodate this kind of scenario, but rather expects there to be a constant listener processing messages. However, we are not comfortable having some task start when SQL Server starts and run pretty much 24/7 to handle that, so we built something we can schedule to run every n minutes and empty the queue that way. If we're not able to run the task, we are most likely dealing with a failed SQL Server and have different priorities.)
The component sets up a connection, and then connects to the specific exchange + queue we are pulling messages from. Messages are in JSON format, so we deserialize the message into a class we defined in the script component.
For every message found, we disable auto-acknowledge, so we can process it and only acknowledge it once we're done with it (which ensures the message will be processed, and doesn't slip through). Then we de-serialize the message and push it onto the output buffer of the script component.
There are a few places where things can go wrong, so we built a bunch of Try/Catch blocks into the code. However, since we're dealing with the queue aspect and we need this information available to us, I'm wondering if someone can explain how/when a message that is sent to the output buffer is processed.
Is it batched up and then pushed? Is it sent straight away, with the script component perhaps not reporting information back to SSIS in a timely fashion?
Would there be a chance for us to acknowledge a message, but have it somehow end up not getting committed to our database, even though it was popped from the queue (as I think happens once a message is acknowledged)?
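Not an SSIS/C# answer, but as a sketch of the ordering being asked about - acknowledge only after the data is safely written - here's the equivalent flow in Python with pika (the host, queue name and write_to_database helper are placeholders):

```python
import json
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq.example.local")  # placeholder host
)
channel = connection.channel()

def on_message(ch, method, properties, body):
    record = json.loads(body)   # deserialize the JSON message
    write_to_database(record)   # assumed helper: commit to the database first

    # Acknowledge only after the commit succeeded; if the process dies before
    # this line, the unacked message is redelivered rather than lost.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(
    queue="durable-events",          # placeholder queue name
    on_message_callback=on_message,
    auto_ack=False,                  # manual acknowledgement, as described above
)
channel.start_consuming()
```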

Sending email with Flask errors with SMTPHandler

I saw in the documentation an extremely easy way to send emails on Flask errors. My question is whether this will considerably affect performance of the app? As in, is the process running my app actually sending the email?
My current hunch is that because SMTP is a server running on another process, it will enqueue the email properly and send it when it can, meaning it won't affect the performance of the app.
Well, SMTPHandler inherits from logging.Handler. Looking at logging.Handler, while it does several things to handle being called from multiple threads, it doesn't do anything to spawn additional threads. The logging calls happen in the thread they are called on. So, if I am reading the code correctly, a logging call will block the thread it is running on until it completes (which means that if your SMTP server takes 30 seconds to respond, your erroring thread will take time_to_error + 30 seconds + time_to_send + time_to_respond_to_request_with_500).
That said, I could be mis-reading the code. However, you'd be better off using SysLogHandler and letting syslog handle sending you messages out of band.
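For reference, the documented setup looks roughly like this (the mail host and addresses are placeholders); note that the SMTP conversation happens inline, in whichever thread logs the error:

```python
import logging
from logging.handlers import SMTPHandler

from flask import Flask

app = Flask(__name__)

mail_handler = SMTPHandler(
    mailhost="127.0.0.1",                    # placeholder SMTP host
    fromaddr="server-error@example.com",     # placeholder sender
    toaddrs=["admin@example.com"],           # placeholder recipients
    subject="Application Error",
)
mail_handler.setLevel(logging.ERROR)

# Attach the handler to the Flask app logger. When an error is logged,
# SMTPHandler.emit() talks to the mail server synchronously in the request
# thread, so a slow mail server delays that request.
app.logger.addHandler(mail_handler)
```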

Bi-directional communication with 1 socket - how to deal with collisions?

I have one app. that consists of "Manager" and "Worker". Currently, the worker always initiates the connection, says something to the manager, and the manager will send the response.
Since there is a LOT of communication between the Manager and the Worker, I'm considering keeping a socket open between the two and doing the communication over it. I'm also hoping to initiate the interaction from both sides - enabling the manager to say something to the worker whenever it wants.
However, I'm a little confused as to how to deal with "collisions". Say, the manager decides to say something to the worker, and at the same time the worker decides to say something to the manager. What will happen? How should such situation be handled?
P.S. I plan to use Netty for the actual implementation.
"I'm also hoping to initiate the interaction from both sides - enabling the manager to say something to the worker whenever it wants."
Simple answer. Don't.
Learn from existing protocols: Have a client and a server. Things will work out nicely. Worker can be the server and the Manager can be a client. Manager can make numerous requests. Worker responds to the requests as they arrive.
Peer-to-peer can be complex, with no real value to justify the complexity.
I'd go for a persistent bi-directional channel between server and client.
If all you'll have is one server and one client, then there's no collision issue... If the server accepts a connection, it knows it's the client and vice versa. Both can read and write on the same socket.
Now, if you have multiple clients and your server needs to send a request specifically to client X, then you need handshaking!
When a client boots, it connects to the server. Once this connection is established, the client identifies itself as being client X (the handshake message). The server now knows it has a socket open to client X and every time it needs to send a message to client X, it reuses that socket.
Lucky you, I've just written a tutorial (sample project included) on this precise problem. Using Netty! :)
Here's the link: http://bruno.linker45.eu/2010/07/15/handshaking-tutorial-with-netty/
Notice that in this solution, the server does not attempt to connect to the client. It's always the client who connects to the server.
If you were thinking about opening a socket every time you wanted to send a message, you should consider persistent connections instead, as they avoid the overhead of connection establishment and consequently speed up the data transfer rate N-fold.
I think you need to read up on sockets... you don't really get these kinds of problems. Other than deciding how to responsively handle both receiving and sending - generally this is done by threading your communications - there isn't much to it; depending on the app you can take a number of approaches to this.
The correct link to the Handshake/Netty tutorial mentioned in brunodecarvalho's response is http://bruno.factor45.org/blag/2010/07/15/handshaking-tutorial-with-netty/
I would add this as a comment to his question but I don't have the minimum required reputation to do so.
If you feel like reinventing the wheel and don't want to use middleware...
Design your protocol so that the other peer's answers to your requests are always easily distinguishable from requests from the other peer. Then, choose your network I/O strategy carefully. Whatever code is responsible for reading from the socket must first determine if the incoming data is a response to data that was sent out, or if it's a new request from the peer (looking at the data's header, and whether you've issued a request recently). Also, you need to maintain proper queueing so that when you send responses to the peer's requests it is properly separated from new requests you issue.
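One way to make that distinction explicit, sketched in Python (the frame layout - a one-byte kind plus a correlation id and length - is an illustrative choice, not a standard):

```python
import struct

REQUEST, RESPONSE = 0, 1

def encode_frame(kind: int, correlation_id: int, payload: bytes) -> bytes:
    # Frame layout: 1-byte kind, 4-byte correlation id, 4-byte length, then payload.
    return struct.pack("!BII", kind, correlation_id, len(payload)) + payload

def decode_frame(data: bytes):
    kind, correlation_id, length = struct.unpack("!BII", data[:9])
    return kind, correlation_id, data[9:9 + length]

# The reader can now route unambiguously: RESPONSE frames are matched to the
# pending request with the same correlation id, while REQUEST frames go to the
# code that serves the peer's requests.
frame = encode_frame(REQUEST, 42, b"do-something")
print(decode_frame(frame))  # (0, 42, b'do-something')
```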