Assume I am building Netflix and I want to log each view by the user ID and the movie ID.
The format would be viewID, userID, timestamp.
However, in order to scale this, assume we're getting 1,000 views a second. Would it make sense to queue these views to SQS and then have our queue readers dequeue them one by one and write them to the MySQL database? This way the database is not overloaded with write requests.
Does this look like something that would work?
Faisal,
This is a reasonable architecture; however, you should know that writing to SQS is going to be many times slower than writing to something like RabbitMQ (or any locally hosted message queue).
By default, SQS FIFO queues support up to 3,000 messages per second with batching, or up to 300 messages per second (300 send, receive, or delete operations per second) without batching. To request a limit increase, you need to file a support request.
That being said, starting with SQS wouldn't be a bad idea since it is easy to use and debug.
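To make the SQS idea concrete, here is a rough sketch (assuming a standard, non-FIFO queue; the queue URL, table, and credentials are placeholders, not a definitive implementation). The producer batches up to 10 views per SendMessageBatch call, and a consumer long-polls, bulk-inserts into MySQL, and then deletes the processed messages:

import json
import boto3
import pymysql

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/view-events"  # placeholder
db = pymysql.connect(host="localhost", user="app", password="secret", database="views")  # placeholder

# Producer: batch up to 10 views per request to cut per-call overhead.
def enqueue_views(views):
    entries = [{"Id": str(i), "MessageBody": json.dumps(v)} for i, v in enumerate(views)]
    sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

# Consumer: long-poll, bulk-insert into MySQL, then delete the processed messages.
def drain_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    messages = resp.get("Messages", [])
    if not messages:
        return
    rows = [json.loads(m["Body"]) for m in messages]
    with db.cursor() as cur:
        cur.executemany(
            "INSERT INTO views (view_id, user_id, movie_id, ts) VALUES (%s, %s, %s, %s)",
            [(r["view_id"], r["user_id"], r["movie_id"], r["ts"]) for r in rows])
    db.commit()
    sqs.delete_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]} for m in messages])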
Additionally, you may want to investigate MongoDB for logging...check out the following references:
MongoDB is Fantastic for Logging
http://blog.mongodb.org/post/172254834/mongodb-is-fantastic-for-logging
Capped Collections
http://blog.mongodb.org/post/116405435/capped-collections
Using MongoDB for Real-time Analytics
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
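If you do look at MongoDB, a capped collection gives you a fixed-size, insert-ordered log with very cheap writes, which fits view logging well. A minimal sketch with pymongo (collection name and size limits are made up for illustration):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
db = client.view_logging

# Create the capped collection once: fixed size on disk, oldest documents dropped automatically.
if "view_logs" not in db.list_collection_names():
    db.create_collection("view_logs", capped=True,
                         size=512 * 1024 * 1024,  # ~512 MB
                         max=5_000_000)           # optional document-count cap

# Writes are append-only and fast; no indexes are needed for pure logging.
db.view_logs.insert_one({"view_id": "v-123", "user_id": "u-456",
                         "movie_id": "m-789", "ts": 1700000000})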
I am reading this write-up: https://medium.com/#narengowda/system-design-dropbox-or-google-drive-8fd5da0ce55b. In the Synchronization Service part, it says:
The Response Queues that correspond to individual subscribed clients are responsible for delivering the update messages to each client. Since a message will be deleted from the queue once received by a client, we need to create separate Response Queues for each client to be able to share an update message which should be sent to multiple subscribed clients.
The context is that we need a response queue to send the file updates from one client to other clients. I am confused by this statement. If Dropbox has 100 million clients, we need to create 100 million queues, based on the statement. It is unimaginable to me. For example, a Kafka cluster can support up to 5K topics (https://stackoverflow.com/questions/32950503/can-i-have-100s-of-thousands-of-topics-in-a-kafka-cluster#:~:text=The%20rule%20of%20thumb%20is,5K%20topics%20should%20be%20fine.). We need 20K Kafka clusters in this case. Which queuing system can do 100 million "topics"?
Not sure, but I would expect such notifications to be delivered to clients via WebSockets only.
Additionally, as this Medium blog states, if a client is not online then the messages might have to be persisted in a DB. When the client comes back online, it can request all updates after a certain timestamp, after which WebSockets can be set up to facilitate future communication.
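Roughly, the catch-up step could look like this (just a sketch; SQLite stands in for whatever metadata store is actually used, and the table layout is invented):

import sqlite3
import time

# Stand-in metadata store: every file update destined for a client is recorded with a timestamp.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE updates (client_id TEXT, file_id TEXT, version INTEGER, ts REAL)")

def record_update(client_id, file_id, version):
    db.execute("INSERT INTO updates VALUES (?, ?, ?, ?)", (client_id, file_id, version, time.time()))
    db.commit()

def updates_since(client_id, last_sync_ts):
    # Called when a client reconnects: fetch everything it missed, then switch to a live WebSocket.
    cur = db.execute(
        "SELECT file_id, version, ts FROM updates WHERE client_id = ? AND ts > ? ORDER BY ts",
        (client_id, last_sync_ts))
    return cur.fetchall()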
Happy to know your thoughts on this.
P.S.: Most Dropbox system-design blogs/vlogs have just copied from each other without going into low-level detail.
I've been developing a high-traffic ad-serving platform for some years now, using a master-master MariaDB cluster with HAProxy in front for balancing relational-data queries (read queries go to all of the servers, but writes only go to one, to prevent the servers from going out of sync). By relational data I mean things like campaign settings, user details, and payments. I'm also using Redis for caching some of the less dynamic MySQL data, but I believe there are a lot of opportunities to make better use of it, since as soon as traffic increases I frequently hit bottlenecks like:
too many connections to MySQL
deadlocks (possibly because writes start coming on multiple servers when the main one gets overloaded).
My goal is to move as much of the writes away from MySQL and into Redis, but I'm having a hard time filtering MySQL data based on the counts/budgets stored in Redis, especially in places where a traditional JOIN would be used.
A simplified example of such a MySQL query, which gets the campaign with the highest bid within the user's budget:
SELECT campaigns.id, campaigns.url FROM campaigns
JOIN users ON campaigns.user_id = users.id
ORDER BY LEAST(users.credits, campaigns.bid) DESC
LIMIT 1;
After a click is delivered to that campaign, a budget reduction is immediately needed. Of course, reducing the credits in MySQL is trivial, but as soon as a user starts sending multiple clicks per second, the problems start appearing (mainly deadlocks in a cluster or reaching the maximum number of connections).
Applying the credit reduction in Redis would be preferred, but I have trouble connecting the dots between a bunch of credit records in Redis and filtering and sorting MySQL records based on them.
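Just to show what I mean by applying the credit reduction in Redis, this is roughly the atomic check-and-decrement I have in mind (a sketch; I'd store credits as integer cents under keys like credits:user:<id>):

import redis

r = redis.Redis()

# The Lua script runs atomically inside Redis, so concurrent clicks cannot overspend a budget.
CHARGE_LUA = """
local credits = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
if credits >= cost then
    redis.call('DECRBY', KEYS[1], cost)
    return 1
end
return 0
"""
charge = r.register_script(CHARGE_LUA)

def charge_click(user_id, bid_cents):
    # Returns True if the user still had budget and it was deducted.
    return charge(keys=["credits:user:%s" % user_id], args=[bid_cents]) == 1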
What would be a good approach to this problem that will let me touch MySQL as little as possible? Or maybe there is an entirely different approach I need to take for this to work.
Any advice or links will be much appreciated.
I would not recommend moving all write requests to Redis, especially for data that needs strong consistency (like payments).
Redis is an in-memory database and does not have the ACID transaction guarantees that MySQL has, so your data still has some chance of being lost after a write to Redis, even with AOF enabled, which can leave your data inconsistent.
For your case, I think you can integrate a message queue (Kafka, RabbitMQ) to avoid the connection issues and deadlocks:
When a transaction occurs, serialize the request with the data to write and send it to the message queue.
A MySQL writer listens on the MQ at a fixed consume rate (based on your needs) and writes the data into MySQL sequentially (and refreshes Redis if you need the cache).
On the client side, you can have a thread that polls for the result in a loop until the write finishes. This makes the asynchronous write behave like a synchronous one.
In this way you avoid resource contention (like deadlocks) and also smooth the write rate thanks to the fixed consume rate.
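A minimal sketch of this pattern, assuming RabbitMQ via pika and PyMySQL (queue name, table, credentials, and the fixed consume rate are placeholders, not a definitive implementation):

import json
import time
import pika
import pymysql

db = pymysql.connect(host="localhost", user="app", password="secret", database="ads")  # placeholder
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="click_writes", durable=True)

# Producer side: serialize the write request instead of hitting MySQL directly.
def publish_click(user_id, cost):
    ch.basic_publish(exchange="", routing_key="click_writes",
                     body=json.dumps({"user_id": user_id, "cost": cost}),
                     properties=pika.BasicProperties(delivery_mode=2))  # persist the message

# Consumer side: one dedicated writer drains the queue and owns all MySQL writes.
def handle(channel, method, properties, body):
    evt = json.loads(body)
    with db.cursor() as cur:
        cur.execute("UPDATE users SET credits = credits - %s WHERE id = %s",
                    (evt["cost"], evt["user_id"]))
    db.commit()
    channel.basic_ack(delivery_tag=method.delivery_tag)
    time.sleep(0.01)  # crude throttle: roughly 100 writes per second

ch.basic_qos(prefetch_count=1)
ch.basic_consume(queue="click_writes", on_message_callback=handle)
# ch.start_consuming()  # run this in the dedicated writer process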
How do I limit the number of transactions per second on a table in MySQL?
For example, to prevent brute-force login via an API.
As David says, do this on the API. You cannot and should not limit your database. There's no way to distinguish the origin of the query, so you'll just shut down the database for everyone if one person decides to flood it, making a denial-of-service attack easier.
As for a solution there are many examples.
Nginx has a rate-limiting feature built in that can limit requests per interval of time, and is very flexible. This can be focused on particular endpoints, paths, or other criteria, making it easy to protect whatever parts of your system are vulnerable.
You'll also need to block clients that are trying to attack your system. Consider something like fail2ban which can read logs and automatically block source traffic from offenders. Log every failed attempt and this tool can do the rest.
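If you also want a safety net inside the API process itself, a tiny sliding-window limiter per source IP is easy to add (a sketch; the window and threshold are arbitrary):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_ATTEMPTS = 5  # failed logins allowed per IP per window

_attempts = defaultdict(deque)

def allow_login_attempt(ip):
    # Drop timestamps outside the window, then check how many attempts remain inside it.
    now = time.time()
    q = _attempts[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_ATTEMPTS:
        return False
    q.append(now)
    return True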
I read the official AWS Kinesis Firehose documentation, but it doesn't mention how to handle duplicated events. Does anyone have experience with this? Googling around, I saw someone use ElastiCache to do the filtering; does that mean I need to use AWS Lambda to encapsulate such filtering logic? Is there any simple way, like Firehose, to ingest data into Redshift that also gives "exactly once" semantics? Thanks a lot!
You can have duplication on both sides of the Kinesis Stream. You might put the same events twice into the Stream, and you might read the event twice by the consumers.
The producer side can happen if you try to put an event into the Kinesis stream but, for some reason, you are not sure whether it was written successfully, and you decide to put it again. The consumer side can happen if you get a batch of events, start processing them, and crash before you manage to checkpoint your location, so the next worker picks up the same batch of events from the Kinesis stream based on the last checkpointed sequence ID.
Before you start solving this problem, you should evaluate how often you have such duplication and what the business impact of the duplicates is. Not every system handles financial transactions that can't tolerate duplication. Nevertheless, if you decide that you need such de-duplication, a common way to solve it is to use an event ID and track whether you have already processed that event ID.
ElastiCache with Redis is a good place to track your event IDs. Every time you pick up an event for processing, check whether it is already in the hash table in Redis: if you find it, skip it; if you don't, add it to the table (with a TTL based on the possible time window for such duplication).
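For example (a sketch; the key prefix and TTL are illustrative), SET with NX succeeds only the first time an event ID is seen, so a failed SET means a duplicate:

import redis

r = redis.Redis()
DEDUP_TTL = 6 * 60 * 60  # keep seen IDs for the expected duplication window, e.g. 6 hours

def is_duplicate(event_id):
    # SET ... NX EX returns a truthy value only when the key did not exist yet.
    first_time = r.set("seen:%s" % event_id, 1, nx=True, ex=DEDUP_TTL)
    return not first_time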
If you choose to use Kinesis Firehose (instead of Kinesis Streams), you no longer have control over the consumer application and you can't implement this process. Therefore, you either need to run such de-duplication logic on the producer side, switch to Kinesis Streams and run your own code in Lambda or the KCL, or settle for the de-duplication functions in Redshift (see below).
If you are not too sensitive to duplication, you can use some functions in Redshift, such as COUNT DISTINCT or LAST_VALUE in a WINDOW function.
Not sure if this could be a solution, but to handle duplicates you need to write your own KCL consumer; Firehose cannot guarantee the absence of duplicates. You can get rid of Firehose once you have your own KCL consumers that process your data from the Kinesis Data Stream.
If you do so, you can follow the linked article (full disclosure: author here), which stores events in S3 after deduplicating and processing them through a KCL consumer.
Store events by grouping them based on the minute they were received by the Kinesis data stream, using their ApproximateArrivalTimestamp. This lets you always save a given batch of records under the same key prefix, no matter when it is processed. For example, all events received by Kinesis at 2020/02/02 15:55 will be stored at /2020/02/02/15/55/*. Therefore, if the key is already present for the given minute, it means that the batch has already been processed and stored to S3.
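A rough sketch of that existence check with boto3 (bucket name and helper names are illustrative; the real logic lives in the KCL consumer from the linked article):

import boto3

s3 = boto3.client("s3")
BUCKET = "processed-events"  # illustrative bucket name

def minute_prefix(approximate_arrival_ts):
    # A record that arrived at 2020-02-02 15:55 maps to the prefix "2020/02/02/15/55/".
    return approximate_arrival_ts.strftime("%Y/%m/%d/%H/%M/")

def already_processed(approximate_arrival_ts):
    # If any object exists under the minute prefix, this batch was already stored.
    resp = s3.list_objects_v2(Bucket=BUCKET,
                              Prefix=minute_prefix(approximate_arrival_ts),
                              MaxKeys=1)
    return resp.get("KeyCount", 0) > 0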
You can implement your own ISequenceStore, which in your case would be backed by Redshift (in the article it is done against S3). Read the full article below.
https://www.nabin.dev/avoiding-duplicate-records-with-aws-kcl-and-s3
I'll describe the application I'm trying to build and the technology stack I'm considering at the moment, and I'd like to hear your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all of their information: id, image URLs, description, etc. The API is only available in one datacenter, so in order to avoid the latency (for example in China), the tasks are stored in a queue.
So you'll have different queues depending on the country, and once you finish your task it will be sent to another queue, which will later write this information back to the original datacenter.
The list of tasks is quite huge; that's why there is an API call to get the tasks (~10k rows) and store them in a queue, and users work on them depending on the queue of the country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task-list requests (e.g. get 5k rows from the China queue, write 500 rows to the write queue, etc.).
The API responses come as a list of JSON objects. These 10k rows, for example, need to be stored somewhere. Because you need to be able to filter within this queue, MySQL isn't an option unless I store every field of the JSON object as a separate row. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and an API response doesn't change too much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database and it gives you the ability to store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension from PostgreSQL to store JSON, or dynamic columns from MariaDB (a MySQL fork).
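If you end up on PostgreSQL, the jsonb type works in a similar spirit and lets you filter on fields inside the stored document. A sketch with psycopg2 (table and field names are just guesses at your data):

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=tasks_db user=app")  # placeholder connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS tasks (
        id      bigserial PRIMARY KEY,
        queue   text  NOT NULL,   -- e.g. the country queue
        payload jsonb NOT NULL    -- the raw task object from the API
    )
""")

task = {"id": 123, "description": "label image", "image_urls": ["https://example.com/1.jpg"]}
cur.execute("INSERT INTO tasks (queue, payload) VALUES (%s, %s)", ("china", Json(task)))
conn.commit()

# Filter on a field inside the JSON document.
cur.execute("SELECT id, payload FROM tasks WHERE queue = %s AND payload->>'description' LIKE %s",
            ("china", "%image%"))
rows = cur.fetchall()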
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would let you store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching the JSON content slowly.