I'm just starting to evaluate Service Broker to determine whether it can perform as a reliable queue in a very specific context. Here is the scenario:
(1) need to pre-calculate a large (several million) population of computationally expensive values and store them in a queue.
(2) multiple processes will attempt to read/dequeue these values at run time on an as-needed basis. Could be several hundred or more reads per second.
(3) a monitor process will occasionally poll the queue and determine if the population minimum threshold has been reached, and will then re-populate the queue.
Due to some infrastructure/cost constraints, an industrial-strength queue (WebSphere) might not be an option. What I have seen thus far of Service Broker is not encouraging, because it seems to be built around a "conversation" with 2 endpoints, and in my scenario my reads happen completely independently of my writes. Does anyone have any insight as to whether this is possible with SQL Service Broker?
Although Service Broker has not been designed for such scenarios, I think with a little tweaking it could work in your case.
One approach would be to pre-create a pool of conversations and then have the calculating process round-robin between these conversations when storing values. Since receiving from a queue takes a lock on the conversation, the number of conversations essentially sets an upper bound on how many processes may dequeue values concurrently. I'm not sure about this, but you may need some logic on the receiver side to say explicitly which conversation to receive from (in order to achieve better load balancing than the default receive behavior would give you).
If performance is not a concern, you may even drop the conversation-pool idea and send each message on a separate dialog, which would make the implementation much simpler at the cost of a significant performance hit.
All of the above assumes the values may be dequeued in a random order; otherwise you need to guarantee receive ordering by using a single conversation.
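For illustration, here is a rough sketch of the conversation-pool idea driven from a Python client via pyodbc. All of the broker object names (ValueService, the //precalc contract and message type, dbo.ValueQueue) and the DSN are hypothetical placeholders you would have to create yourself, not anything Service Broker gives you out of the box:

import itertools
import pyodbc

conn = pyodbc.connect("DSN=PrecalcDb")

def create_conversation_pool(size):
    # Pre-create a fixed pool of self-addressed dialogs to round-robin over.
    cur = conn.cursor()
    handles = []
    for _ in range(size):
        cur.execute("""
            DECLARE @h UNIQUEIDENTIFIER;
            BEGIN DIALOG CONVERSATION @h
                FROM SERVICE [ValueService]
                TO SERVICE 'ValueService'
                ON CONTRACT [//precalc/Contract]
                WITH ENCRYPTION = OFF;
            SELECT @h;
        """)
        handles.append(cur.fetchone()[0])
    conn.commit()
    return handles

def store_values(values, handles):
    # The calculating process round-robins the pre-computed values across the pool.
    cur = conn.cursor()
    for value, handle in zip(values, itertools.cycle(handles)):
        cur.execute("""
            DECLARE @h UNIQUEIDENTIFIER = ?;
            SEND ON CONVERSATION @h
                MESSAGE TYPE [//precalc/Value] (CAST(? AS VARBINARY(MAX)));
        """, handle, value)
    conn.commit()

def dequeue_one():
    # Each receiver locks one conversation while it receives, so the pool size
    # caps how many processes can dequeue concurrently.
    cur = conn.cursor()
    cur.execute("""
        WAITFOR (
            RECEIVE TOP (1) CAST(message_body AS VARBINARY(MAX)) AS body
            FROM dbo.ValueQueue
        ), TIMEOUT 1000;
    """)
    row = cur.fetchone()
    conn.commit()
    return row[0] if row else None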
I've been developing a high-traffic ad serving platform for some years now, using a master-master MariaDB cluster with HAProxy in front for balancing relational data queries (read queries go to all of the servers, but writes go to only one, to prevent the servers from going out of sync). By relational data I mean things like campaign settings, user details, and payments. I'm also using Redis for caching some of the less dynamic MySQL information, but I believe there are a lot of opportunities to make better use of it, since as soon as the traffic increases, I'm frequently hitting bottlenecks like:
too many connections to MySQL
deadlocks (possibly because writes start going to multiple servers when the main one gets overloaded).
My goal is to move as many of the writes as possible away from MySQL and into Redis, but I'm having a hard time filtering MySQL data based on the counts/budgets stored in Redis, especially in places where a traditional JOIN would be used.
A simplified example of such a MySQL query, which gets the campaign with the highest bid within the user's budget:
SELECT campaigns.id, campaigns.url FROM campaigns
JOIN users ON campaigns.user_id = users.id
ORDER BY LEAST(users.credits, campaigns.bid) DESC
LIMIT 1;
After a click is delivered to that campaign, a budget reduction is immediately needed. Of course, reducing the credits in MySQL is trivial, but as soon as a user starts sending multiple clicks per second, the problems start appearing (mainly deadlocks in a cluster or reaching the maximum number of connections).
Applying the credit reduction in Redis would be preferred, but I have trouble connecting the dots between a bunch of credit records in Redis and filtering and sorting MySQL records based on them.
What would be a good approach to this problem that would allow me to touch MySQL as little as possible? Or maybe there is an entirely different approach I need to take for this to happen.
Any advice or links will be much appreciated.
I would not recommend moving all write requests to Redis, especially for data that requires strong consistency (like payments).
Redis is an in-memory database and does not have the ACID transaction guarantees that MySQL has. So your data still has some chance of being lost after a write to Redis, even with AOF enabled, which can leave your data inconsistent.
For your case, I think you can integrate a message queue (Kafka, RabbitMQ) to avoid the connection issues and deadlocks:
When a transaction occurs, serialize the request with the data to write and send it to the message queue.
A consumer listens on the MQ at a fixed consumption rate (based on your needs) and writes the data into MySQL sequentially (and updates Redis if you need the cache).
On the client side, you can have a thread poll for the result in a loop until the write finishes. This makes the asynchronous write behave like a synchronous one.
This way you avoid resource contention (like deadlocks) and also smooth the write rate to a fixed consumption rate.
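As a rough sketch (not from the original post), assuming a Kafka topic named credit-deductions, kafka-python on both sides, and mysql-connector for the sequential writer; the topic, table and column names are made up for illustration:

import json
from kafka import KafkaProducer, KafkaConsumer
import mysql.connector

# Producer side: the app server serializes the write and returns quickly.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def record_click(user_id, campaign_id, cost):
    producer.send("credit-deductions",
                  {"user_id": user_id, "campaign_id": campaign_id, "cost": cost})

# Consumer side: a single worker (or a few) drains the topic at a controlled
# rate and applies the writes to MySQL sequentially, avoiding lock contention.
def run_consumer():
    db = mysql.connector.connect(host="db", user="app", password="...", database="ads")
    cur = db.cursor()
    consumer = KafkaConsumer(
        "credit-deductions",
        bootstrap_servers="kafka:9092",
        group_id="mysql-writer",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for msg in consumer:
        event = msg.value
        cur.execute(
            "UPDATE users SET credits = credits - %s WHERE id = %s",
            (event["cost"], event["user_id"]),
        )
        db.commit()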
Our mobile app tracks user events (events can have many types).
Each mobile device reports the user's events and can later retrieve them.
I thought of writing to Redis and MySQL.
When a user makes a request:
1. Look it up in Redis.
2. If it's not in Redis, look it up in MySQL.
3. Return the value.
4. Update Redis if the value wasn't there.
5. Set an expiry policy on each key in Redis to avoid running out of memory.
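Roughly, the read path described above looks like this (a hedged sketch using redis-py and mysql-connector; the key naming, table schema and TTL are placeholders):

import json
import redis
import mysql.connector

r = redis.Redis(host="cache", port=6379)
db = mysql.connector.connect(host="db", user="app", password="...", database="events")

CACHE_TTL = 3600  # expiry policy per key, to bound memory use

def get_events(user_id):
    key = f"events:{user_id}"
    cached = r.get(key)                      # 1. look it up in Redis
    if cached is not None:
        return json.loads(cached)            # 3. return the value
    cur = db.cursor(dictionary=True)
    cur.execute("SELECT type, payload, ts FROM events WHERE user_id = %s", (user_id,))
    rows = cur.fetchall()                    # 2. fall back to MySQL
    r.setex(key, CACHE_TTL, json.dumps(rows, default=str))  # 4+5. populate Redis with TTL
    return rows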
Problems:
1. Reads: if many users at once request information that isn't in Redis, MySQL is going to be overloaded with reads (latency).
2. Writes: I am going to have lots of writes into MySQL, since every event is written to both data stores.
Facts:
1. Expecting 10M concurrent users, both writing and reading.
2. Need to serve each request with a maximum latency of one second.
3. Expecting a couple of thousand requests per second.
Any solutions for this kind of mechanism that give good QoS?
Is this in any way a Lambda architecture solution?
Thank you.
Sorry, but such complex issues rarely have a ready answer here. There are too many unknowns: what is your budget, and how much hardware do you have? Since 10 million concurrent clients are using your service, your question is about hardware as much as software.
There is also no word here about several important requirements:
Which is more important: consistency or availability?
What is the read/write ratio?
Read/write ratio requirement
If you have 10,000,000 concurrent users, that is a problem in itself. But if your workload is mostly reads, it's not as terrible as it may seem. In that case you should take care to have the right indexes in MySQL, and buy servers with enough RAM to keep at least the index data in memory. A single server can then handle 3,000-5,000 concurrent SELECT queries without any problem meeting the one-second latency requirement (one of our statistics projects handled up to 7,000 SELECT requests per second per server on four-year-old ordinary hardware).
If your workload is mostly writes, everything becomes more complicated, and consistency becomes the main question.
Consistency vs availability
If consistency is important, go shopping for new servers with SSD drives and modern CPUs, and do not forget to buy as much RAM as possible. Why? If you have many write requests, your SQL server has to update its indexes on every write, and you cannot drop the indexes because your read requests would then fail the latency requirement. By consistency I mean: if you write something, the write completes within one second, and if you read that data right after the write, you get the actual written information within one second.
Your problem 1:
Reads: if many users at once request information that isn't in Redis, MySQL is going to be overloaded with reads (latency).
This is the well-known "cache miss" problem, and it has only a few solutions: horizontal scaling (buy more hardware) or precaching. Precaching in this case can be done in at least three ways:
Use a non-blocking read and wait up to one second for the data to come back from the SQL server. If it doesn't arrive in time, return the data from Redis. Update Redis immediately or through a queue, as you prefer.
Use a blocking/non-blocking read and return the data from Redis as fast as possible, but with every read query push a job to a queue to refresh the cached data in Redis (you may also tell the app to re-query the data after some time).
Always read from and write to Redis, but for every write request register a job in a queue to update the data in SQL.
Each of them is a compromise:
High availability, but consistency suffers; Redis is an LRU cache.
High availability, but consistency suffers; Redis is an LRU cache.
High availability and consistency, but it requires a lot of RAM for Redis.
Your problem 2:
Writes: I am going to have lots of writes into MySQL, since every event is written to both data stores.
This is the field of compromise again. Lots of writes come down to hardware, so buy more of it or use queues for pending writes. Availability vs consistency again.
Event tracking usually means you can return data close to real time, but not in real time. For example, you can have 1-10 seconds of latency for updating data on disk (MySQL) while keeping the one-second latency for serving read/write requests.
So it's a combination of techniques 1/2/3 (or some others) for data processing:
Use LRU eviction in Redis and do not use expiry; lots of expiring keys are a problem in themselves, so we can't rely on them to save RAM.
Use a queue to warm up missing keys in Redis.
Use a queue to write data from the Redis server into the MySQL server.
Use additional requests from the client side to update data when a cache-miss situation occurs.
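A compressed sketch of techniques 1-3 above, assuming Redis is configured as an LRU cache (maxmemory-policy allkeys-lru, no per-key expiry) and using plain Redis lists to stand in for the warm-up and write-behind queues; all key names and the table schema are illustrative only:

import json
import redis
import mysql.connector

r = redis.Redis(host="cache")
db = mysql.connector.connect(host="db", user="app", password="...", database="events")

def read_event(event_id):
    cached = r.get(f"event:{event_id}")
    if cached is not None:
        return json.loads(cached)
    # Cache miss: ask a background worker to warm the key, tell the caller to retry later.
    r.rpush("queue:warm", event_id)
    return None

def write_event(event_id, payload):
    # Serve the write from Redis immediately, persist to MySQL asynchronously.
    r.set(f"event:{event_id}", json.dumps(payload))
    r.rpush("queue:persist", json.dumps({"id": event_id, "payload": payload}))

def persist_worker():
    # A similar worker would drain queue:warm by reading from MySQL into Redis.
    cur = db.cursor()
    while True:
        _, raw = r.blpop("queue:persist")   # blocks until a job arrives
        job = json.loads(raw)
        cur.execute(
            "REPLACE INTO events (id, payload) VALUES (%s, %s)",
            (job["id"], json.dumps(job["payload"])),
        )
        db.commit()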
What performance considerations should I keep in mind when planning a SQL Azure application? Azure Storage and the worker and web roles look very scalable, but if in the end they all use one database... it looks like the bottleneck.
I was trying to find numbers about:
How many concurrent connections does SQL Azure support?
What is the bandwidth?
But no luck.
For example, I'm planning an application that does a very high volume of inserts, but I need to return the result of an aggregate function each time (e.g. the sum of all records with the same key in a column), so I cannot go with Table Storage.
Batching is an option, but response time is critical as well, so I'm afraid the database will be flooded with a lot of connections.
Sharding is another option, but even though the number of inserts is massive, the amount of data is very small: 4 to 6 columns with one PK and no FK. So even a 1 GB database would be overkill (and an overpay :D) for a partition.
What are the performance keys I should keep in mind when facing this kind of application?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question was primarily about scalability, so you may want to design your application in such a way that your data becomes "eventually" consistent, using queues for example. A worker role would listen for incoming insert requests and would perform the insert asynchronously.
To minimize the number of round trips to the database and optimize connection pooling, make sure to batch your inserts as well; you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (multiple active result sets), so you can return multiple SELECTs in a single batch back to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
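For example, a hedged sketch of client-side batching with pyodbc (the table, columns and connection string are placeholders; fast_executemany sends the whole batch in one round trip instead of one statement per row):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net;Database=mydb;"
    "Uid=app;Pwd=...;Encrypt=yes;"
)
cursor = conn.cursor()
cursor.fast_executemany = True

def insert_batch(rows):
    # rows is a list of (key, value) tuples, e.g. 100 at a time.
    cursor.executemany(
        "INSERT INTO dbo.Measurements (MeasureKey, MeasureValue) VALUES (?, ?)",
        rows,
    )
    conn.commit()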
Sharding usually helps for Read operations; not so much for inserts (although I never benchmarked inserts with sharding). So I am not sure sharding will help you that much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenancy environment, where your database is shared with others on the same server. So if you need strong performance with guaranteed response time you may need to reevaluate your hosting choices or indeed test the performance boundaries of Azure for your needs as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load but might also occur when your database is physically moved around). Throttling is non-deterministic, meaning that you cannot predict if and when this happens. When throttling, SQL Azure will drop your connection, requiring you to perform a retry. Number of connections supported and bandwidth is not published "by design" due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
If the bursts happen at a known time, you might consider sharding just during those bursts and consolidating the data after the burst has happened. Another way to handle this is to start queueing/batching writes if and only if throttling occurs. You can use an Azure Queue for that, plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of automatically engaging when throttling occurs.
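A sketch of that overflow mechanism, assuming Python with pyodbc and the azure-storage-queue SDK; the 40501 check (SQL Azure's "service is busy" throttling error), the queue name, table and connection strings are all assumptions for illustration:

import json
import pyodbc
from azure.storage.queue import QueueClient

overflow = QueueClient.from_connection_string("<storage-connection-string>",
                                              "insert-overflow")

def insert_or_overflow(cursor, conn, key, value):
    try:
        cursor.execute(
            "INSERT INTO dbo.Measurements (MeasureKey, MeasureValue) VALUES (?, ?)",
            key, value,
        )
        conn.commit()
    except pyodbc.Error as e:
        if "40501" in str(e):            # throttled: queue the write instead
            overflow.send_message(json.dumps({"key": key, "value": value}))
        else:
            raise

def drain_overflow(cursor, conn):
    # Worker role: replay the queued writes once the throttling has passed.
    for msg in overflow.receive_messages():
        item = json.loads(msg.content)
        cursor.execute(
            "INSERT INTO dbo.Measurements (MeasureKey, MeasureValue) VALUES (?, ?)",
            item["key"], item["value"],
        )
        conn.commit()
        overflow.delete_message(msg)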
As an alternative you could use Azure Table Storage and keep a separate table of running totals that you can report back instead of performing an aggregation over the data to return the required sum of all records (this might be tricky due to the lack of locking on the tables though).
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.
I'm intending to use AMQP to allow a distributed collection of machines to report to a central location asynchronously. The idea is to drop messages into the queue and allow the central logging entity to process the queue in a decoupled fashion; the 'process' is simply to create or update a row in a database table.
A problem that I'm anticipating is the effect of network jitter in the message queuing process - what happens if an update accidentally gets in front of an insert because the time between the two messages being issued is less than the network jitter?
Reading the AMQP spec, it seems that I could just apply a higher priority to inserts so they skip the queue and get processed first. But presumably this only applies if a queue actually exists at the broker to be skipped. Is there a way to impose a buffer or delay at the broker to absorb this jitter and allow priority to be enacted before the messages are passed on to the consumer(s)?
Or do I have to go down the route of a resequencer as ActiveMQ suggests?
The lack of ordering between multiple publishers has nothing to do with network jitter; it's a completely natural thing in distributed applications. Messages from the same publisher will always be ordered. If you really need causal ordering of actions performed by different nodes, then either a resequencer or a global sequence-numbering scheme are your only options. Note that you cannot use sender timestamps for this, which is what everyone seems to try first.
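For instance, a minimal resequencer sketch, assuming every publisher stamps its messages with a number drawn from a shared, monotonically increasing sequence (e.g. a database sequence) rather than a timestamp:

import heapq

class Resequencer:
    def __init__(self, first_expected=1):
        self.next_seq = first_expected
        self.pending = []                 # min-heap of (seq, message)

    def accept(self, seq, message):
        # Buffer an out-of-order message; yield everything that is now deliverable.
        heapq.heappush(self.pending, (seq, message))
        while self.pending and self.pending[0][0] == self.next_seq:
            _, msg = heapq.heappop(self.pending)
            self.next_seq += 1
            yield msg

# Usage: feed messages as they arrive off the queue; process whatever comes out.
rs = Resequencer()
for m in rs.accept(2, "update row 42"):
    pass                                  # nothing yet, seq 1 still missing
for m in rs.accept(1, "insert row 42"):
    print(m)                              # "insert row 42" then "update row 42"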
We have a large project coming up soon with quite a lot of media processing (images, video) as well as email output etc., the sort of stuff we'd normally put into a table called "email_queue" and use a cron job to run a script that processes the queue in the table.
I have been reading a lot about message queue systems like beanstalkd, and have even set one up. It was easy and nice to use; the problem is that I am unsure whether I am missing something.
Could someone detail the benefits of using a queue system rather than a table and a cron job? I really can't seem to see what they are.
Thanks
Differences:
Once a message is put on the queue it can be immediately delivered. So if your cron normally ran every 5 minutes, you could process faster with the queuing.
If your queueing system supports transactions, then it will automatically re-deliver a message if the processing fails.
It can be harder to query what is in your queue. A database table has a nice way to search (SQL).
If you have multiple servers/processes/threads handling messages, the queue system will make sure a message is only delivered to one of them. With a DB table you need to handle this via application code (locking, flags, etc ...)
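To make the last point concrete, here is a hedged sketch of what you would have to hand-roll against an email_queue table so that each message is claimed by only one worker; it assumes MySQL 8.0+ for SELECT ... FOR UPDATE SKIP LOCKED, and the column names are made up:

import mysql.connector

db = mysql.connector.connect(host="db", user="app", password="...", database="jobs")

def claim_next_email():
    cur = db.cursor(dictionary=True)
    db.start_transaction()
    cur.execute(
        "SELECT id, recipient, body FROM email_queue "
        "WHERE status = 'pending' ORDER BY id "
        "LIMIT 1 FOR UPDATE SKIP LOCKED"
    )
    row = cur.fetchone()
    if row is None:
        db.rollback()
        return None
    cur.execute("UPDATE email_queue SET status = 'processing' WHERE id = %s", (row["id"],))
    db.commit()          # a real MQ gives you this claim-exactly-once behavior for free
    return row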
A message queue (a distributed one at least, e.g. RabbitMQ) gives you the ability to distribute work across physical nodes. You still need to have a process on each node to dequeue work and process it.
It ultimately comes down to your requirements, I guess. You can achieve a more manageable solution at scale by using message queues: you can decouple your nodes more easily.
Of course, there is a learning curve... so it again comes back to your target goals.
Note that on each node you can still reuse your cron/DB table until (and if) you wish to change the implementation. That's what's great about decoupling when you can.
First, queues are often backed by actual DB tables and can maintain message durability. That aside, a queue is a natural way to shove off work that needs to be done asynchronously, which is very powerful if you design on that principle from the start.
Other than the fact that a table (entity) has a set of hard columns (attributes), a table composed of a set of records and a queue are both nothing more than lists of stuff. You are employing the queue-as-a-table as a formal queue, it's just that you are polling it on a regular (cron) basis.
MQs add another nifty feature, though: they generally synchronize access to the message itself (you may or may not be doing this in your SQL to get the next item).
I like to consider the cron/table mechanism as POLL-based and the MQ as EVENT-based.
The benefit of a queue, in my opinion, is that it takes care of the synchronizing and status updating. MQs can be set up to "broadcast" (topic) or to make the message available to a group of consumers or listeners.
MQs, though asynchronous, would likely operate within your cron window. How do you know that the messages you process from your table can be handled before the next cron job runs and tries to step on the previous one?
Multiple consumers for the MQ allow you to scale the work as you see fit. If you see that your load average (much like the OS's process queue) is higher than you'd like, you can provision another consumer to handle that load, bringing it online and offline as metrics demand.
MQs can be set up to have different operational parameters such as message priority and performance (some queues can remain in memory, others persist to disk).
The downside is that (as already mentioned) the queue can sometimes be hard to query and to obtain metrics from. I always look for MQ systems that have a DB backing store, so that I can watch the queue myself with SQL.
This gets asked fairly frequently, and there's usually not a compelling reason to go MQ if you're comfortable with databases. Here's one example thread.
My take is that you might want to avoid the learning curve unless your data requirements include exceptionally high volumes, which is unlikely if you're thinking cron rather than a process with a timer (much less multiple processes with timers).