Does Queue services like Kafka, Amazon SQS use some datastore to store messages? - message-queue

Do you know how does any queue services work behind the scenes? Do they store the messages in some datastore? Eg. Amazon SQS keeps the messages in its queue until any workers pulls those messages from SQS.
Where does it store behind the scenes? I think in some data store like Dynamodb or RDS. Is that correct to say? Are there any resources which can give me more insight on how the design of general Queue service looks like?

I designed the SQS replicated datastore. It was a completely custom implementation tailored for messaging use cases. Relational databases are not a good fit for high throughput queues.
Unfortunately, I cannot share the details of the datastore or replication algorithm as it was never made public by Amazon.
Kafka is open source, so you can get information about its store design.

Related

How does dropbox' response queue work, exactly?

I am reading this writing: https://medium.com/#narengowda/system-design-dropbox-or-google-drive-8fd5da0ce55b. In the Synchronization Service part, it writes that:
The Response Queues that correspond to individual subscribed clients are responsible for delivering the update messages to each client. Since a message will be deleted from the queue once received by a client, we need to create separate Response Queues for each client to be able to share an update message which should be sent to multiple subscribed clients.
The context is that we need a response queue to send the file updates from one client to other clients. I am confused by this statement. If Dropbox has 100 million clients, we need to create 100 million queues, based on the statement. It is unimaginable to me. For example, a Kafka cluster can support up to 5K topics (https://stackoverflow.com/questions/32950503/can-i-have-100s-of-thousands-of-topics-in-a-kafka-cluster#:~:text=The%20rule%20of%20thumb%20is,5K%20topics%20should%20be%20fine.). We need 20K Kafka clusters in this case. Which queuing system can do 100 million "topics"?
Not sure but I expect such notification to clients via web-sockets only.
Additionally as this medium blog states that if client is not online then messages might have to be persisted in DB. After that when client comes online they can request for all updates after certain timestamp after which web sockets can be setup to facilitate future communication.
Happy to know your thoughts on this.
P.S : Most of dropbox system design blogs/vlogs have just copied from each other without going into low level detail.

Real time communication between clients via websocket server on Google App Engine

This article describes how a websocket server for a chat application can look. We are planning to implement something similar; when a message is sent to the server it is sent to the correct recipient based on an authentication token and the message gets saved in a mysql database.
We will eventually host the server on Google App Engine, and I suspect that that will cause some issues with the above described approach, since that depends on all clients being connected to the same server, and that probably won't be the case since multiple instances will be created as needed. Is there a way to connect all instances so that this won't be a problem (Pub/Sub maybe? (That will cause additional costs though)), or should we find a different solution?
One idea I had was to use mysql-events to monitor the binlog from the websocket server for the creation of new rows in the messages table, but I read somewhere that that wasn't recommend. But I can't find where I read that, and maybe that is the best solution.
Since you asked about other solutions, I would recommend looking at Firebase and specifically the Realtime Database. Out of the box it provides all of the functionality that you need for realtime communication between connected clients and Cloud Messaging for clients who aren't.
Here's a tutorial that uses Firestore to create a realtime chat web app, but it can all be applied to the Realtime Database with minor modification. I say that because Firestore has expensive writes, which in my opinion make it unsuitable for a chat backend.

Sending a message from Azure Service bus to AWS SQS

We have a requirement to implement Azure Service Bus as Integration point to various Applications (including apps hosted in AWS). Each application will have its own SQS. So the idea is to have Azure Service Bus with Topics and Subscription filters to route messages to each SQS accordingly. However I am not sure as to how we can pick messages from a subscription filter and push the message to AWS SQS. I am not able to see any solution for this.
These two are inherently two different messages services and you will either need to find a third party connector/bridge between the two or create your own. This would be a process that would be retrieving messages from one broker and forwarding it to another.
When it comes to a third party, there's an example that you could have a look at. NServiceBus has a community extension called Router. The router allows achieving exactly what you're looking for.
Disclaimer: I contribute and work on NServiceBus

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system and we'd like to stream every write event (insert or updated) from that table to kinesis. The idea is to create a pipe to warmup caches and update search engines.
Currently we do that using a rudimentar polling architecture, basically using SQL, but the ideal would be to have a push architecture reading the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?
I've worked with some customers doing that already, in Oracle. Seems also that LinkedIn uses a lot that technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish that in an agnostic way - https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project in Github, following LinkedIn principles that is already streaming binlog from Mysql to Kinesis Streams - https://github.com/cmerrick/plainview
If you want to get into the nitty gritty details of LinkedIn approach, there is a really nice (and extensive) blog post available - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Not getting into the basics of Kinesis Streams, for the sake of brevity, if we bring Kinesis Streams to the game, I don't see why it shouldn't work. As a matter of fact, it was built for that - your database transaction log is a stream of events. Borrowing an excerpt from Amazon Web Services public documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
Hope this helps.
aws DMS service offers data migration from SQL db to kinesis.
Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/

Couchbase Sync Gateway

I am learning about couchbase. It's my first experience with NoSQL databases.
In the case of a central server and many users with mobile devices. I want every user on your database has different data.
I have doubts about the sync.
To synchronize, the server will have a database per user? A database per user sounds to many databases ... otherwise I do not understand how to differentiate user data.
Is it possible to communicate between server and device through couchdatabase?.
Is it good strategy for communication with the device to write to the replica that is in the server and Couchbase is responsible for making the communication? Where I can find an example of this?
Some of these questions are unclear, but I will attempt to answer:
The server will not have a database per user. It only has the databases (or "buckets") that you set up ahead of time. User data is separated by a mechanism called channels.
Communication between server and device is the exact point of Couchbase Lite. In this scenario, you must make all changes go through Sync Gateway in order for replication to function properly. This is handled for you by Couchbase Lite, so you just need to point it to a Sync Gateway instance and let it take care of replicating between various devices.