I'm trying to figure out the most efficient and scalable way to implement a processing queue mechanism in a sql database. The short of it is, I have a bunch of 'Domain' objects with associated 'Backlink' statistics. I want to figure out efficiently which Domains need to have their Backlinks processed.
Domain table: id, domainName
Backlinks table: id, domainId, count, checkedTime
The Backlinks table has many records (to keep a history) to one Domain record.
I need to efficiently select domains that are due to have their Backlinks processed. This could mean that the Backlinks record with the most recent checkedTime is far enough in the past, or that there is no Backlinks record at all for a domain record. Domains will need to be ordered for processing by a number of factors, including ordering by the oldest checkedTime first.
There are multiple ‘readers’ processing domains. If the same domain gets processed twice it’s not a huge deal, but it is a waste of cpu cycles.
The worker takes an indeterminate amount of time to process a domain. I would prefer to have some backup in the sense that a checkout would 'expire' rather than require the worker process to explicitly 'checkin' a record when it's finished, in case the worker fails for some reason.
The big issue here is scaling. From the start I’ll easily have about 2 million domains, and that number will keep growing daily. This means my Backlinks history will grow quickly too, as I expect to process in some cases daily, and other cases weekly for each domain.
The question becomes, what is the most efficient way to find domains that require backlinks processing?
Thanks for your help!
I decided to structure things a bit differently. Instead of finding domains that need to be processed based on criteria spread across several tables, I'm assigning a date at which each metric needs to be processed for a given domain. That makes the query to find domains needing processing much simpler.
I ended up using the idea of batches: I find domains to process, mark them as being processed by a batch id, then return those domains to the worker. When the worker is done it returns the results, the batch is deleted, and the domains naturally become ready for processing again in the future.
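Roughly, the checkout looks like this; the dueTime, batchId and batchTime columns here are illustrative, not my exact schema:

-- Claim a batch of due domains; batchTime doubles as a checkout
-- expiry marker so stale batches can be reclaimed if a worker dies.
SET @batch := UUID();

UPDATE Domain
SET batchId = @batch, batchTime = NOW()
WHERE dueTime <= NOW()
  AND (batchId IS NULL OR batchTime < NOW() - INTERVAL 1 HOUR)
ORDER BY dueTime
LIMIT 100;

-- Hand the claimed rows to the worker.
SELECT id, domainName FROM Domain WHERE batchId = @batch;

-- When the worker reports back: insert the new Backlinks row, push
-- dueTime into the future, and release the batch.
UPDATE Domain
SET dueTime = NOW() + INTERVAL 1 DAY, batchId = NULL
WHERE batchId = @batch;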
I am building a simple shopping cart. Currently, to ensure that a customer can never purchase a product that is out of stock, when processing the order I have a loop for each product in their cart:
-- Begin a transaction --
Loop through each product in the cart:
    Select the stock count for that product from the products table
    If it is in stock:
        Reduce the stock count for the product
        Add the product to the order items table
    Otherwise, call a rollback and return an error
-- If there is no call for rollback, everything ends with a commit --
However, if the stock count for a product is updated after the loop has already checked that particular product, there may be inconsistencies.
Question: would it be a good idea to lock the table from writes whenever I am processing an order? So that when the 'loop' above occurs, I can be assured that no one else is able to alter the product count and it will always be accurate.
The idea is that the product count/availability will always be consistent, and there will never be an instance where the stock count goes to -1 (which would be unfulfillable).
However, I have seen so many posts on locks being inefficient/having bad effects. If so, what is the best way to accomplish this?
I have seen alternatives like handling it in a combined update + select query, but have also read that this may not be suitable in some cases.
You have at least three strategies:
1. Pessimistic Locking
If your application will experience low activity then you can lock the tables (or single rows) to make sure no other thread changes the values during the processing of a purchase. It works, but it has performance limitations.
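For illustration, a minimal sketch with MySQL/InnoDB row locks; the table and column names are just assumptions, not your actual schema:

START TRANSACTION;

-- Lock only the rows for the products in the cart; any other
-- transaction touching the same rows blocks until COMMIT/ROLLBACK.
SELECT stock FROM products WHERE id = 42 FOR UPDATE;

-- If the stock is sufficient, decrement it and record the order item.
UPDATE products SET stock = stock - 1 WHERE id = 42;
INSERT INTO order_items (order_id, product_id, quantity) VALUES (1001, 42, 1);

COMMIT;  -- or ROLLBACK if any product in the cart was out of stock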
2. Optimistic Locking
If your application/web site must serve a high load then you can opt for the "optimistic locking" strategy. In this case you add a version number column to your critical tables and then you use it when reading/writing it.
When updating the stock you check that the version number you are writing is still the same one you read. If it is not (another thread modified the row in the meantime), you roll back the transaction and can retry a couple of times until you succeed.
It requires more development effort, since you need to detect the conflict and implement the retry logic (if you want to).
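A sketch of the version check, assuming a version column added to a products table (names are again assumptions):

-- Read the current stock together with its version number.
SELECT stock, version INTO @stock, @version_read FROM products WHERE id = 42;

-- Write back only if nobody else changed the row in the meantime.
UPDATE products
SET stock = stock - 1, version = version + 1
WHERE id = 42 AND version = @version_read;

-- If this UPDATE affects 0 rows, another thread won the race:
-- roll back and repeat the read/compute/update cycle.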
3. Processing Queues
You can implement processing queues. When a thread wants to "purchase an order" it can submit it to a processing queue for purchase orders. This queue can be implemented by one or more threads dedicated to this task; if you choose multiple threads they can be divided by order types, regions, categories, etc. to distribute the load.
This requires more programming effort since you need to manage asynchronous processing, but can sustain much higher levels of load.
You can use this strategy for multiple different tasks: purchasing orders, refilling stock, sending notifications, processing promotions, etc.
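One possible way to back such a queue with a table, sketched with assumed names:

CREATE TABLE purchase_queue (
  id         BIGINT AUTO_INCREMENT PRIMARY KEY,
  order_data JSON NOT NULL,
  status     ENUM('pending','processing','done','failed') NOT NULL DEFAULT 'pending',
  claimed_by VARCHAR(64) NULL,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- A worker claims the oldest pending order, processes it with
-- exclusive access to the stock rows it needs, then marks it done.
UPDATE purchase_queue
SET status = 'processing', claimed_by = 'worker-1'
WHERE status = 'pending'
ORDER BY id
LIMIT 1;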
I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large amount of users on our platform.
For example: for every user in a large set of users, a complex formula, based on variable amounts of "things done", results in a score for that user, used for a match-making-like principle.
In our situation it's based on the number of posts made, connections accepted, messages sent, sessions within a one-month period, and other actions.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns I have about these two:
Real-time: this would be an overkill of queries and calculations for every action a user performs. If, say, 500 users are active and all of them are performing actions, the database would have a hard time keeping up, I think. We would then also need to run a script to re-calculate the score for inactive users (to lower their score).
Once a week: if we have, for example, 5.000 users (for our first phase), that already means running the calculation formula 5.000 times; it could take a long time, and it will only take longer as more users join.
The calculation queries for a single variable in the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few, like counting "all connections of my connections", take several joins.
I started by "logging" every action into a table for this purpose: just the counter values, increased/decreased with every action, and running the formula on these values (one record per week). This works, but can't be applied to every variable (like the connections of connections).
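The counter idea stays cheap per action with an upsert; roughly like this, with made-up table and column names:

-- One row per user per week (PRIMARY KEY on user_id + week_start);
-- every post simply bumps its own counter.
INSERT INTO user_week_stats (user_id, week_start, posts)
VALUES (123, DATE(NOW() - INTERVAL WEEKDAY(NOW()) DAY), 1)
ON DUPLICATE KEY UPDATE posts = posts + 1;

-- The weekly scoring job then reads one small row per user
-- instead of re-counting raw action rows.
SELECT user_id, posts
FROM user_week_stats
WHERE week_start = DATE(NOW() - INTERVAL WEEKDAY(NOW()) DAY);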
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder', which uses a similar algorithm for match making (maybe with less complex data variables, because they're not using groups and communities that you can join).
I'm wondering if they run that in real time on every swipe and every setting change, or if they have a script that runs continuously for a small batch of users each time.
What it all comes down to: what would be the most efficient, least table-locking way to do this, keeping in mind that at some point we'll have, for example, 50.000 users?
The way I would handle this:
1. Implement the realtime algorithm.
2. Measure. Is it actually slow? Try optimizing.
3. Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests, and if it gets busy, it can take more time to catch up.
4. Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.
We have been tracking user login events for a while now in a MongoDB collection. Each event contains the userID, datetime, and a couple other fundamental attributes about the event.
For a new feature, we want to present a graph of these login events, with different groups representing cohorts related to the user who did the event. Specifically, we want to group by the "Graduation Year" attribute of the user.
In our event log, we do not record the Graduation Year of the user who's logging in, so we cannot easily query that directly. We see two ways to go forward, plus a third "in-between" option:
Option 1: Instead of making a single MongoDB query to get the logins, we make that query PLUS a second one to our relational DB to get the secondary user data we require, and merge the two together.
We could optionally query for all the users, load them into memory, and loop through the Events, or we could go through the events and find only the User IDs that logged in and query for those specific User IDs. (Then loop again, merging them in.)
The post-processing could be done on the server-side or we could send all the data to the client. (Currently our plan is to just send the raw event data to the client for processing into the graph.)
Upsides: The event log is made to track events. User "Graduation Year" is not relevant to the event in question; it's relevant to the user who did the event. This seems to separate concerns more properly. As well, if we later decide we want to group on a different piece of metadata (let's say: male vs female), it's easy to just join that data in as well.
Downsides: Part of the beauty of our event log is that it can quickly spit out tons of aggregate data that's ready to use. If there are 10,000 users, we may have 100,000 logins. It seems crazy to need to loop through 100,000 logins every time this data is requested fresh (as in, not cached).
Option 2: We can write a script that does a one-time load of all the events (presumably in batches), then requests the user metadata and merges it in, re-writing the Event Log to include the relevant data.
Upsides: The event log is our single point of interaction when loading the data. Client requests all the logins; gets 100,000 rows; sorts them and groups them according to Graduation Year; [Caches it;] and graphs it. Will have a script ready to re-add more data if it came to that, down the road.
Downsides: We're essentially rewriting history. We're polluting our event log with secondary data that isn't explicitly about the event we claim to be tracking. Need to rewrite or modify the script to add more data that we didn't know we wanted to track, if we had to, down the road.
Option 3: We replicate the Users table in MongoDB, perhaps only as-needed (say when an event's metadata is unavailable), and do a join (I guess that's a "$lookup" in Mongo) to this table.
Upsides: MongoDB does the heavy lifting of merging the data.
Downsides: We need to replicate and keep-up-to-date, somehow, a secondary collection of our Users' relevant metadata. I don't think MongoDB's $lookup works like a join in MySQL, and maybe isn't really any more performant at all? Although I'd look into this before we implemented.
For the sake of estimation, let's just say that any given visitor to our site will never have to load more than 100,000 logins and 10,000 users.
For what it's worth, Option #2 seems most preferable to me, even though it involves rewriting history, for performance reasons. Although I am aware that, at some point, if we were sending a user's browser multiple years of login data (that is, all 100,000 imaginary logins), maybe that's already too much data for their browser to process and render quickly, and perhaps we'd already be better off grouping it and aggregating it as some sort of regularly-scheduled process on the backend. (I don't know!)
As data warehouses go, 100K rows is quite small.
Performance in a DW depends on building and maintaining "Summary Tables". This makes a pre-determined set of possible queries very efficient, without having to scan the entire 'Fact' table. My discussion of Summary Tables (in MySQL): http://mysql.rjweb.org/doc.php/summarytables
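A minimal example of such a summary table for this login data, assuming the raw facts land in a MySQL table (names invented for illustration):

-- One row per graduation-year cohort per day.
CREATE TABLE login_daily_summary (
  login_date      DATE NOT NULL,
  graduation_year SMALLINT NOT NULL,
  login_count     INT NOT NULL,
  PRIMARY KEY (login_date, graduation_year)
);

-- Refreshed periodically (or incrementally) from the raw events.
INSERT INTO login_daily_summary (login_date, graduation_year, login_count)
SELECT DATE(e.logged_in_at), u.graduation_year, COUNT(*)
FROM login_events e
JOIN users u ON u.id = e.user_id
GROUP BY DATE(e.logged_in_at), u.graduation_year
ON DUPLICATE KEY UPDATE login_count = VALUES(login_count);

-- The graph then reads a handful of summary rows instead of 100K events.
SELECT graduation_year, SUM(login_count)
FROM login_daily_summary
GROUP BY graduation_year;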
Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods how to generate the feeds and I would like to ask which is best in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to timeout after 5 minutes. Scaling is achieved by varying this timeout.
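Roughly, the read-time query looks like this (column names are placeholders):

-- Build the feed for user 123 at request time, then cache it.
SELECT e.*
FROM event_log e
JOIN friends f ON f.friend_id = e.user_id
WHERE f.user_id = 123
ORDER BY e.created_at DESC
LIMIT 50;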
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
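The write-time fan-out would look something like this (ids and column names are placeholders):

-- Background task: for one new event (id 9876 by user 4242),
-- insert one feed row per friend of the author.
INSERT INTO user_feed (user_id, event_id)
SELECT f.user_id, 9876
FROM friends f
WHERE f.friend_id = 4242;

-- Reading a feed then becomes a single-table lookup by user_id.
SELECT event_id FROM user_feed WHERE user_id = 123 ORDER BY event_id DESC LIMIT 50;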
The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well: the brief is for feeds to update as close to real-time as possible.
The hypothesised solution in my eyes seems much better; all processing is done offline so no user waits for a page to generate and there are no joins so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
Question:
The question boils down to two points:
Is this worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance and are there any issues with this mass inserting of data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data: firstly, on the global scale, the storage and indexing requirements on user_feed escalate rapidly (roughly events times average friend count) as your user base becomes larger and more interconnected (both presumably desirable for a social network); secondly, consider that if in the course of a minute 1,000 users each posted a new message and each had 100 friends, your background thread would have 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions where a background thread updates a table last_user_feed_update which contains a single row for each user and a timestamp for the last time that users feed was changed.
Then, although the full join and query would still be required to refresh the feed, a quick query to the last_user_feed_update table will tell whether a refresh is needed at all. This seems to mitigate the biggest problems with your standard method as well as avoid the storage-size difficulties, but that background thread still has a lot of work to do.
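A sketch of that compromise, with assumed names:

CREATE TABLE last_user_feed_update (
  user_id    INT NOT NULL PRIMARY KEY,
  changed_at DATETIME NOT NULL
);

-- Background thread: bump the timestamp for every friend of the
-- event's author instead of copying the whole event around.
INSERT INTO last_user_feed_update (user_id, changed_at)
SELECT f.user_id, NOW()
FROM friends f
WHERE f.friend_id = 4242
ON DUPLICATE KEY UPDATE changed_at = NOW();

-- On page load: rebuild the cached feed only if it has gone stale.
SELECT changed_at FROM last_user_feed_update WHERE user_id = 123;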
The hypothesized method works better when you limit the maximum number of friends; a lot of sites set a safe upper boundary, including Facebook IIRC. It limits the 'hiccups' from when your 100K-friend user generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden that these inactive users will cost you.
I've thought about this problem many times - it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached and each user pushes what their latest few status items are to "their key" (and in a feed reading activity you fetch and aggregate all your friend's keys)... but I haven't tested this. I'm not sure of all the pros/cons yet.
I'm building an application that requires extensive logging of user actions, payments, etc.
Am I better off with a monolithic logs table and just log EVERYTHING into that, or is it better to have separate log tables for each type of action I'm logging (log_payment, log_logins, log_acc_changes)?
For example, currently I'm logging users' interactions with a payment gateway: when they sign up for a trial, when a trial becomes a subscription, when it gets rebilled, refunded, whether there was a failure or not, etc.
I'd also like to start logging actions or events that don't interact with the payment gateway (renewal cancellations, bans, payment failures that were intercepted before the data is even sent to the gateway for verification, logins, etc.).
EDIT:
The data will be regularly examined to verify its integrity; since people will need to be paid based on it, accurate data is critical. Read queries will be done by myself and 2 other admins, so 99% of the time it's going to be writes/updates.
I just figured that having multiple tables creates more points of failure during the critical MySQL transactions that insert and update the payment data.
All other things being equal, smaller disjoint tables can have a performance advantage, especially when they're write-heavy (as tables related to logs are liable to be); most DB mechanisms are better tuned for mostly-read, rarely-written tables. In terms of writing (and updating any indices you may have to maintain), small disjoint tables are a clear win, especially if there's any concurrency (depending on what engine you're using for your tables, of course; that's a pretty important consideration in MySQL!).
In terms of reading, it all depends on your pattern of queries -- what queries will you need, and how often. In certain cases for a usage pattern such as you mention there might be some performance advantage in duplicating certain information -- e.g. if you often need an up-to-the-instant running total of a user's credits or debits, as well as detailed auditable logs of how the running total came to be, keeping a (logically redundant) table of running totals by users may be warranted (as well as the nicely-separated "log tables" about the various sources of credits and debits).
Transactional tables should never change, i.e. not be editable, and can serve as log files for that type of information. Design your "billing" tables to have timestamps, and that will be sufficient.
However, where data records are editable, you need to track who-changed-what-when. To do that, you have a couple of choices.
--
For a given table, you can have a table_history table that has a near-identical structure, with NULLable fields and a two-part primary key (the primary key of the original table plus a sequence). If, for every insert or update operation, you write a record to this table, you have a complete log of everything that happened to that table.
The advantage of this method is you get to keep the same column types for all logged data, plus it is more efficient to query.
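As a sketch, for a hypothetical accounts table the history table could look like this (all names invented for illustration):

CREATE TABLE accounts_history (
  account_id INT NOT NULL,          -- primary key of the original accounts row
  seq        INT NOT NULL,          -- 1, 2, 3, ... per original row
  changed_at DATETIME NOT NULL,
  changed_by VARCHAR(64) NULL,
  balance    DECIMAL(12,2) NULL,    -- same columns as accounts, all NULLable
  status     VARCHAR(20) NULL,
  PRIMARY KEY (account_id, seq)
);

-- Application code (or a trigger) writes one row here for every
-- INSERT or UPDATE on accounts, with seq = previous max + 1.
INSERT INTO accounts_history (account_id, seq, changed_at, changed_by, balance, status)
VALUES (42, 7, NOW(), 'admin_tom', 1250.00, 'active');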
--
Alternatively, you can have a single log table that has fields like "table", "key", "date", "who", and a related table that stores the changed fields and values.
The advantage of this method is that you get to write one logging routine and use it everywhere.
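A sketch of that generic pair of tables (again, assumed names):

CREATE TABLE change_log (
  id         BIGINT AUTO_INCREMENT PRIMARY KEY,
  table_name VARCHAR(64) NOT NULL,   -- which table was touched
  row_key    VARCHAR(64) NOT NULL,   -- primary key of the touched row
  changed_at DATETIME NOT NULL,
  changed_by VARCHAR(64) NOT NULL
);

CREATE TABLE change_log_fields (
  change_id  BIGINT NOT NULL,        -- references change_log.id
  field_name VARCHAR(64) NOT NULL,
  old_value  TEXT NULL,
  new_value  TEXT NULL,
  PRIMARY KEY (change_id, field_name)
);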
--
I suggest you evaluate the number of tables, performance needs, change volume, and then pick one and go with it.
It depends on the purpose of logging. For debugging and general monitoring purpose, a single log table with dynamic log level would be helpful so you can chronologically look at what the system is going through.
On the other hand, for audit trail purposes, there's nothing like having a duplicate table for every table, capturing every CRUD action. That way, every piece of information captured in the payment table (or whatever) would also be captured in your audit table.
So, the answer is both.