If you have continuous messages generated with different deadlines, how would you process these messages in order of deadlines.
I have implemented by saving message to persistent store. Later schedule jobs processing the most recent expired messages.
If I have to implement this in pub/sub mechanism, how should my queue give priority to deadline time? Is there any queuing solution that delivers messages based on ttl or deadline?
Related
I am reading this writing: https://medium.com/#narengowda/system-design-dropbox-or-google-drive-8fd5da0ce55b. In the Synchronization Service part, it writes that:
The Response Queues that correspond to individual subscribed clients are responsible for delivering the update messages to each client. Since a message will be deleted from the queue once received by a client, we need to create separate Response Queues for each client to be able to share an update message which should be sent to multiple subscribed clients.
The context is that we need a response queue to send the file updates from one client to other clients. I am confused by this statement. If Dropbox has 100 million clients, we need to create 100 million queues, based on the statement. It is unimaginable to me. For example, a Kafka cluster can support up to 5K topics (https://stackoverflow.com/questions/32950503/can-i-have-100s-of-thousands-of-topics-in-a-kafka-cluster#:~:text=The%20rule%20of%20thumb%20is,5K%20topics%20should%20be%20fine.). We need 20K Kafka clusters in this case. Which queuing system can do 100 million "topics"?
Not sure but I expect such notification to clients via web-sockets only.
Additionally as this medium blog states that if client is not online then messages might have to be persisted in DB. After that when client comes online they can request for all updates after certain timestamp after which web sockets can be setup to facilitate future communication.
Happy to know your thoughts on this.
P.S : Most of dropbox system design blogs/vlogs have just copied from each other without going into low level detail.
I have a queue ( in this case Amazon SQS ) and there are N nodes of same service running which are consuming messages from SQS.
How can I make sure that during any point of time, not more than one nodes has read a same message from queue.
In case of Kafka, we know that, not more than one consumer from the same consumer group can be assigned to a single topic partition. How do we make sure the same thing is handled inside Amazon SQS or not ?
The Amazon mechanism to prevent that a message is delivered to multiple consumers is the Visibility Timeout:
Immediately after a message is received, it remains in the queue. To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, a period of time during which Amazon SQS prevents other consumers from receiving and processing the message. The default visibility timeout for a message is 30 seconds. The minimum is 0 seconds. The maximum is 12 hours.
After the message is received, SQS starts the timeout and for its duration, it doesn't send it again to other consumers. After the timeout ends, if the message has not been deleted, SQS makes it available again for other consumers.
But as the note says:
For standard queues, the visibility timeout isn't a guarantee against receiving a message twice. For more information, see At-Least-Once Delivery.
If you need absolute guarantees of only once delivery, you have to option:
Design your application to be idempotent so that the result is
the same if it process the same message one or more time.
Try
using a SQS FIFO queue that provides exactly once processing.
I have a cron service running in the background every minute. The cron service is responsible to send realtime notifications to users based on a complex logic. It is mandatory for the cron to complete processing and deliver push notifications to all the users within the given minute. I use a third party push notification service to manage the delivery of the push notifications.
At any given minute, I have 50,000 users (increasing with time) who might be possible candidates to receive these notifications. I iterate through 50,000 users in a for loop in batches of 1000 to perform my application logic. The application logic involves 20+ simple database queries per user.
Is this a recommended architecture?
When the cron executes, the CPU utilisation shoots upto 100% for a span 3-4 seconds, 4-5 times a minute. The repeated 100% utilisation is because I have added delays between subsequent heavy computations to even the utilisation over a minute. Should I be concerned about this?
Is there any example where, we can trigger an event to send messages to JMS Queue when a table is updated/inserted ect for MYSQL/Postgre?
This sounds like a good task for pg_message_queue (which you can get off Google Code or PGXN), which allows you to queue requests. pg_message_queue doesn't do a great job of parallelism yet (in terms of parallel queue consumers), but I don't think you need that.
What you really want to do (and what pg_message_queue provides) is a queue table to hold the jms message, and then a trigger to queue that message. Then the question is how you get it from there to jms. You have basically two options (both of which are supported):
LISTEN for notifications, and when those come in handle them.
Periodically poll for notifications. You might do this if you have a lot of notifications coming in, so you can batch them every minute or so, or if you have few notifications coming in and you want to process them at midnight.
Naturally that is PostgreSQL only. Doing the same on MySQL? I don't know how to do that. I think you would be stuck with polling the table, but you could use pg_message_queue to understand basically how to do the rest. Note that in all cases this is fully transactional so the message would not be sent until after transaction commit, which is probably what you want.
I've got something like a job queue over RabbitMQ and, upon a request to cancel a job, I'd like to retract the tasks that have not yet started processing (their messages have not been ack'd), which corresponds to retracting these messages from the queues that they've been routed to.
I haven't found this functionality in AMQP or in the RabbitMQ API; perhaps I haven't searched well enough? Or will I have to use a workaround (it's not hard, but still)?
I would solve this scenario by having the worker check some sort of authoritative data source to determine if the the job should proceed or not. For example, the worker would check the job's status in a database to see if the job was canceled already.
For scenarios where the speed of processing jobs may be faster than the speed with which the authoritative store can be updated and read, a less guaranteed data store that trades speed for other characteristics may be useful.
An example of this would be to use Redis as the store for canceling processing of a message instead of a relational DB like MySQL. Redis is very fast, but makes fewer guarantees regarding the data it holds, whereas MySQL is much slower, but offers more guarantees about the data it holds.
In the end, the concept of checking with another source for whether or not to process a message is the same, but the way you implement that depends on your particular scenario.
RabbitMQ doesn't let you modify or delete messages after they've been enqueued. For that, you want some kind of database to hold the state of each job, and to use RabbitMQ to notify interested parties of changes in that state.
For lowish volumes, you can kludge it together with a queue per job. Create the queue, post the job description to the queue, announce the name of the queue to the workers. If the job needs to be cancelled before it is processed, deleted the job's queue; when the workers come to fetch the job description, they'll notice the queue has vanished.
Lighterweight and generally better would be to use redis or another key/value store to hold the job state (with a deleted or absent record meaning a cancelled or nonexistent job) and to use rabbitmq to notify about new/removed/changed records in the key/value store.
At least two ways to achieve your target:
basic.reject will requeue message if requeue=true is set (otherwise it will reject message).
(supported since RabbitMQ 2.0.0; see http://www.rabbitmq.com/blog/2010/08/03/well-ill-let-you-go-basicreject-in-rabbitmq/).
basic.recover will ask broker to redeliver unacked messages on channel.
You need to subscribe to all the queues to which messages have been routed, and consume them with ack.
For instance if you publish to a topic exchange with "test" as the routing key, and there are 3 persistent queues which subscribe to "test" you would need to consume those three queues. It might be better to add another queue which your consumer processes would also listen too, and tell them to ignore those messages.
An alternative, since you are using RabbitMQ, is to write a custom exchange plugin that will accept some out of band instruction to clear all queues. For instance you might have that exchange read a special message header that tells it to clear all queues to which this message is destined. This does require writing Erlang code, but there are 4 different exchange types implemented so you would only need to copy the most similar one and write the code for the new bahaviours. If you only use custom headers for this, then the body of the message can be a normal message for the consumers.
To sum up:
1) the publisher needs to consume the messages itself
2) the publisher can send a special message in a special queue to tell consumers to ignore the message
3) the publisher can send a special message to a custom exchange that will clear any existing messages from the queues before sending this special message to consumers.