I have a batch of n messages in an SQS queue and a number of workers. These workers take messages from the queue, process them, and then delete them if they are successful. Once all the workers finish this batch of n messages, I want to perform an additional action. The only problem is figuring out when a batch is complete.
One way to do it is to check that the queue is empty. When I take a look at the SQS API, the only thing that seems close is the ApproximateNumberOfMessages attribute you get from GetQueueAttributes. However, the word "approximate" suggests that it really isn't intended for what I have in mind, and that its purpose is more for scaling up and down the number of workers based on roughly how many messages are in the queue.
What would be the standard way to achieve what I want? Or is SQS a poor fit for this purpose?
SQS doesn't really have any built-in mechanisms for grouping messages. Furthermore, SQS doesn't guarantee that a particular message won't be processed more than once[1], so you can't simply count the number of messages processed.
Instead, you'll probably need to track each message individually in an external datastore, and then after each message is processed, check to see if there are any remaining messages.
For instance:
As you enqueue each message in the group to the original queue, record the message ID in an external database along with a group number of your own invention.
After a worker processes a message, the worker should get the group number for that message from the database (or just include the group number as an attribute in the original message), and delete the message ID from the database (if it wasn't already deleted by another worker, which could happen if two workers got the same message from the queue). The worker should then enqueue a new message containing the group number into a second queue.
Another worker reads messages containing the group number from the second queue, and checks the database to see if any of the original messages for that group number remain. If there are any, this worker does nothing. If there are no more messages for the group, this worker performs your additional action. Be mindful that due to SQS' distributed nature, this final message could also be processed more than once, so the additional action should be idempotent (or at least somehow check to see if it has been performed already).
With this setup, you'll be able to run multiple unrelated batches through the system simultaneously.
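Here is a minimal sketch of those three steps in Python, assuming boto3 for SQS and a Redis set as the external datastore; the queue URLs, key names, and the process()/perform_additional_action() functions are placeholders, not anything prescribed by SQS:

```python
import json

import boto3
import redis

sqs = boto3.client("sqs")
db = redis.Redis()  # stand-in external datastore; any store with atomic ops works

# Hypothetical queue URLs for illustration.
WORK_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work"
CHECK_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/check"


def enqueue_group(group_id, payloads):
    """Step 1: enqueue each message and record its ID under the group."""
    for payload in payloads:
        resp = sqs.send_message(
            QueueUrl=WORK_QUEUE_URL,
            MessageBody=json.dumps({"group": group_id, "payload": payload}),
        )
        db.sadd(f"group:{group_id}", resp["MessageId"])


def worker_loop():
    """Step 2: process a message, remove its ID, then signal the check queue."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=WORK_QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            process(body["payload"])  # your actual work (placeholder)
            db.srem(f"group:{body['group']}", msg["MessageId"])
            sqs.send_message(
                QueueUrl=CHECK_QUEUE_URL,
                MessageBody=json.dumps({"group": body["group"]}),
            )
            sqs.delete_message(
                QueueUrl=WORK_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )


def checker_loop():
    """Step 3: when no IDs remain for a group, run the (idempotent) action."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=CHECK_QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            group_id = json.loads(msg["Body"])["group"]
            if db.scard(f"group:{group_id}") == 0:
                perform_additional_action(group_id)  # must be idempotent
            sqs.delete_message(
                QueueUrl=CHECK_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```

One wrinkle worth noting: send_message only returns the MessageId after the message is already visible, so a fast worker could remove the ID before the producer records it. In practice you'd generate your own per-message IDs, record the whole set in the database first, and carry the ID in the message body.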
You could consider adding a bit of code to your worker process(es) that starts a timer of some sort when it asks for a message to process and gets nothing back. If your workers ask for messages, process them, and then delete them, and, as you say, the 'batch' is just a collection of messages received around the same time, then presumably if 5 minutes (or some other user-defined period) go by and no new messages are returned after repeated requests, you can kick off your 'after batch' process. This will be more accurate if you can scale your workers down to just one by the time you get to the end of the queue (so you can be sure that other nodes are not still processing).
This is by no means perfect, and it will depend on the flow and timing of your messages and on how critical it is to define exactly what belongs in a 'batch' and what does not.
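A rough sketch of that idle-timer idea, assuming boto3; the queue URL, the 5-minute limit, and the process()/after_batch() functions are placeholders:

```python
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work"  # hypothetical
IDLE_LIMIT = 300  # seconds of silence before declaring the batch done

last_seen = time.time()
while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    messages = resp.get("Messages", [])
    if messages:
        last_seen = time.time()
        for msg in messages:
            process(msg["Body"])  # your actual work (placeholder)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    elif time.time() - last_seen > IDLE_LIMIT:
        after_batch()  # kick off the 'after batch' step (placeholder)
        last_seen = time.time()  # avoid firing repeatedly for the same batch
```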
Alternatively, if at the front end you know the precise number of messages that get put into a batch, you could count down the number of processed messages and know you are done when you get to zero.
When I look over the tutorials for the Robot Operating System (ROS), I find that most example code sets the publisher's queue size to a large value such as 1000. I think this makes the node lose its real-time response.
For what purpose do people set it to such a large value?
From ROS docs (http://wiki.ros.org/ROS/Tutorials/WritingPublisherSubscriber):
Message publisher (producer):
"The second parameter to advertise() is the size of the message queue
used for publishing messages. If messages are published more quickly
than we can send them, the number here specifies how many messages to
buffer up before throwing some away."
Message subscriber:
"The second parameter to the subscribe() function is the size of the message queue. If messages are arriving faster than they are being processed, this is the number of messages that will be buffered up before beginning to throw away the oldest ones."
Possible explanation:
Think of the producer-consumer problem.
You can't guarantee that you will consume messages at the rate they arrive. So you create a queue that is filled as messages come in from the sender (some sensor, for instance).
Bad case: if your program is delayed in some other part and you can't read messages at the rate they arrive, the queue grows.
Good case: as soon as your other processing load diminishes, you can read from the queue faster and start to drain it. If you have spare time, you will eventually reduce the queue size to zero.
So, as for your question: if you set the queue size to a large value, you can be fairly sure you will not lose messages. In a simple example you have no memory constraints, so you can do whatever you want, like using many gigabytes of RAM for a huge queue so it always works. And if you are writing a toy example to explain a concept, you don't want your program to crash for unrelated reasons.
A real-life example: a waiter and a dishwasher in a kitchen.
Suppose the customers finish their meals and the waiter takes their dirty dishes to the kitchen to be washed. He puts them on a table. Whenever the dishwasher can, he goes to the table, picks up dishes, and washes them. In normal operation the table never fills up. But if someone gives the dishwasher another task, the table starts to fill, until at some point the waiter can't put dishes down anymore and has to leave tables dirty (a problem in the system). But if the table is artificially large (say, 1000 square units), the waiter will likely manage to do his job even while the dishwasher is busy, since the dishwasher will eventually be able to return to cleaning dishes.
OK, long answer, but it may help in understanding queues.
I have a table which stores the location of my user very frequently. I want to query this table frequently and return only the newest rows, the ones I haven't read yet.
What would be the best-practice way to do this? My ideas are:
Add a boolean read flag, query all results where this is false, return them, and then update them ALL. This might slow things down with the extra writes.
Save the ID of the last read row on the client side and query for rows greater than this. The only issue here is that my client could lose its place.
Some stream of data.
There will eventually be multiple users and readers of the locations, so this will need to scale somewhat.
If what you have is a SQL database storing rows of things, I'd suggest something like option 2.
What I would probably do is keep a timestamp rather than an ID, with an index on it (a clustered index on MSSQL, or a similar construct, so that new rows are physically sorted by time). Then just query for anything newer than that.
That does have the "losing their place" issue. If the client MUST read every row published, then I'd either delete them after processing, or have a flag in the database to indicate that they have been processed. If the client just needs to restart reading current data, then I would do as above, but initialize the time with the most recent existing row.
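A minimal sketch of that approach, using sqlite3 for illustration (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect("locations.db")
# Hypothetical schema; the index on the timestamp is what keeps the
# "newer than X" range scan cheap.
conn.execute(
    """CREATE TABLE IF NOT EXISTS locations (
           id INTEGER PRIMARY KEY,
           user_id INTEGER,
           lat REAL,
           lon REAL,
           recorded_at TEXT)"""
)
conn.execute("CREATE INDEX IF NOT EXISTS ix_recorded ON locations(recorded_at)")

def fetch_newer_than(last_seen):
    """Return rows newer than the client's last-seen timestamp."""
    return conn.execute(
        "SELECT id, user_id, lat, lon, recorded_at"
        " FROM locations WHERE recorded_at > ? ORDER BY recorded_at",
        (last_seen,),
    ).fetchall()

# On (re)start, initialize last_seen from the most recent existing row
# so the client only follows data from now on.
row = conn.execute("SELECT MAX(recorded_at) FROM locations").fetchone()
last_seen = row[0] or "1970-01-01T00:00:00"
```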
If you MUST process every record and aren't limited to a database, what you're really talking about is a message queue. If you need to be able to access the individual data points after processing, then one step of the message handling could be to insert them into a database for later querying (in addition to whatever else the handler does with the data).
Edit per comments:
If there's no processing that needs to be done on receipt, and you just want to periodically update data, then you'd be fine with the solution of keeping the last received time or ID and not deleting the data. In that case I would recommend not persisting the last known ID/timestamp across restarts/reconnects, since you might end up inadvertently loading a bunch of data. Just reset it to the max when you restart.
On another note, when I did stuff like this I had good success using MQTT to transmit the data and provide the "live" updates. That is a pub/sub messaging protocol. You could have a process subscribing on the back end and forwarding data to the database, while the thing that wants the data frequently subscribes directly to the stream for live updates. There's also a feature to hold onto the last published message and forward it to new subscribers, so you don't start out completely empty.
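A small sketch of that setup with the paho-mqtt client (1.x constructor style; the broker address, topic, and save_to_database() are placeholders):

```python
import json

import paho.mqtt.client as mqtt

BROKER = "broker.example.com"  # hypothetical broker and topic
TOPIC = "users/42/location"

# Publisher side: retain=True tells the broker to hand this last
# message to any new subscriber immediately, so nobody starts empty.
pub = mqtt.Client()
pub.connect(BROKER)
pub.publish(TOPIC, json.dumps({"lat": 1.23, "lon": 4.56}), retain=True)

# Subscriber side: e.g. the back-end process that forwards into the DB.
def on_message(client, userdata, msg):
    point = json.loads(msg.payload)
    save_to_database(point)  # placeholder for the insert-for-later-querying step

sub = mqtt.Client()
sub.on_message = on_message
sub.connect(BROKER)
sub.subscribe(TOPIC)
sub.loop_forever()
```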
I am using the database queue driver in Laravel to run jobs in the background.
One of my jobs creates a given number (thousands to hundreds of thousands) of records in the database. I wrapped the code for this job in a transaction so that if the job failed, the database writes would not be committed.
Initially, to track the progress of the job, I thought I would count the number of created records, divide by the total number of expected records, and then display that in a UI as a percentage against each job, so that users can know how much longer they have to wait.
This doesn't work because the tables are locked during the transaction.
I am wondering if anybody knows how to track progress on a queued job.
For the ones who stumble on this question, there is a package which allows that: https://github.com/imTigger/laravel-job-status
As given in http://laravel.com/docs/5.1/queues#job-events
The Queue::after method lets you register a callback that is called once a job has completed successfully
As given in http://laravel.com/docs/5.1/queues#failed-job-events
The Queue::failing method lets you register a callback that is called when a queued job fails
Hope this is helpful :)
In Kafka, I have a producer queuing up work of clients. Each piece of work has a client ID on it. Work of different clients can be processed out of order, but work of one client must be processed in order.
To do this, I intend to have (for example) 20 topics to achieve parallelism. The producer will queue up a client's work into topic[client ID mod 20]. I then intend to have many consumers, each capable of processing work for any client, but I still want the work processed in order. This means that the next piece of work in the topic can't begin to be processed before the previous piece has completed. In case of consumer failure it's OK to process work twice, but it means that the offset of that topic can't progress to the next piece of work.
Note: the number of messages per second is rather small (10s-100s messages).
To sum up:
'At least once' processing of every message (=work)
In-order processing of work for one topic
Multiple consumers for every topic to support consumer failure
Can this be done using Kafka?
Yes, you can do this with Kafka. But you shouldn't do it quite the way you've described. Kafka already supports semantic partitioning within a topic if you provide a key with each message. In this case you'd create a topic with 20 partitions, then make the key for each message the client ID. This guarantees all messages with the same key end up in the same partition, i.e. it will do the partitioning that you were going to do manually.
When consuming, use the high level consumer, which automatically balances partitions across available consumers. If you want to absolutely guarantee at least once processing, you should commit the offsets manually and make sure you have fully processed messages you have consumed before committing them. Beware that consumers joining or leaving the group will cause rebalancing of partitions across the instances, and you'll have to make sure you handle that correctly (e.g. if your processing is stateful, you'll have to make sure that state can be moved between consumers upon rebalancing).
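A sketch of both sides with the kafka-python client, assuming a topic named work created with 20 partitions (the broker address, group id, and handle() are placeholders):

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer side: the key (client ID) determines the partition, so all
# work for one client lands in the same partition, in order.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("work", key=b"client-42", value=b"do something")
producer.flush()

# Consumer side: disable auto-commit and commit only after the work is
# fully done, so a crash means reprocessing, never skipping.
consumer = KafkaConsumer(
    "work",
    bootstrap_servers="localhost:9092",
    group_id="workers",  # the group rebalances partitions across instances
    enable_auto_commit=False,
)
for record in consumer:
    handle(record.key, record.value)  # your processing (placeholder); may run twice
    consumer.commit()  # mark progress only after success
```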
I am looking for an efficient way to identify account brute forcing.
My log database contains authentication log entries. Each entry has:
timestamp
username
IP address
login attempt result (success / fail)
I want to produce a report which indicates which logins have been attacked. "Attacked" is defined as: unsuccessful login attempts not followed by a successful login attempt within N minutes (e.g. 10) from the same IP address. The test cases are:
A user/IP combo has attempted to log in unsuccessfully twice and succeeded on the third time (no attack).
A user/IP combo has attempted to log in unsuccessfully twice and succeeded on the third time, while the same user, from a different IP, has tried unsuccessfully to log in (attack on the second user/IP combo).
I can imagine an O(n·log(n)) solution: a cursor goes over each record and then does lookups with another cursor over later records to determine activity. Quite inefficient.
The DB doesn't matter (SQL, MySQL, NoSQL, etc.), as the data can be converted easily.
Group log items into 5-minute time intervals. For all groups which exceed half your alerting threshold, perform a more expensive but entirely correct check.
That will probably filter out almost all log items which are not a real attack. And a grouping operation is easy to program and quick to execute.
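A rough sketch of that two-pass filter in Python (the field names, the threshold value, and treating the log as an in-memory list are all assumptions for illustration):

```python
from collections import Counter
from datetime import timedelta

WINDOW = timedelta(minutes=5)  # grouping interval
N = timedelta(minutes=10)      # "followed by a success within N minutes"
THRESHOLD = 10                 # hypothetical alerting threshold

def candidate_combos(entries):
    """Cheap first pass: count failures per (user, ip, 5-minute bucket)."""
    counts = Counter()
    for e in entries:  # e: {"ts": datetime, "user": str, "ip": str, "success": bool}
        if not e["success"]:
            bucket = int(e["ts"].timestamp() // WINDOW.total_seconds())
            counts[(e["user"], e["ip"], bucket)] += 1
    return {(u, ip) for (u, ip, _), n in counts.items() if n >= THRESHOLD // 2}

def is_attack(entries, user, ip):
    """Exact second pass: any failure with no success from the same
    user/IP within N minutes afterwards counts as an attack."""
    fails = [e["ts"] for e in entries
             if e["user"] == user and e["ip"] == ip and not e["success"]]
    wins = [e["ts"] for e in entries
            if e["user"] == user and e["ip"] == ip and e["success"]]
    return any(not any(f <= w <= f + N for w in wins) for f in fails)
```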
Depending on how much work you're willing to spend on this, Complex Event Processing (CEP) could be an option. There are frameworks such as Esper if you use Java.
The idea is to create an event stream based on your server logs (or SQL results) and have Esper check for correlations. See the example queries in Esper's documentation.