The scenario is simple to describe, but it might have a complex answer:
Imagine a case where you have one write-only MySQL database and about 5 or 6 read-only replicas. The write database holds the count for a particular inventory item. You have hundreds of thousands of users banging away at this item, but only a limited quantity; for argument's sake, say 10 items.
What's the best way to ensure that only 10 items get sold? If there is even a 200 ms delta before the read-only replicas get updated, can't the count go stale, causing you to sell inventory you do not have?
How would you solve/scale this problem?
The basic solution for concurrent users will probably cover this too. At some point in the "buy" transaction, you need to decrement the inventory (on the write server) and, through whatever method, enforce that inventory can't go below zero.
If there's one item left, and two people trying to buy it, one will be out of luck.
Replication latency is really the same problem: two users see a product as available, but by the time they try to buy it, it's gone. A good solution for that scenario covers both replication latency and one user simply snatching the last item out from under another.
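One common way to enforce that on the write server is a single conditional UPDATE, so the check and the decrement happen atomically. A minimal sketch; the inventory table and column names here are assumptions, not the asker's schema:

    -- Decrement only while stock remains; the affected-row count says whether the sale went through.
    UPDATE inventory
    SET    quantity = quantity - 1
    WHERE  item_id  = 42
      AND  quantity > 0;

If the statement reports zero affected rows, the application treats the purchase as failed, no matter what a read replica showed a moment earlier.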
It all depends on when, and for how long a window, you decide to lock the master table for the update.
A. If you have to be 100% sure an item is only ever offered for purchase when it is genuinely available, you will have to lock the item for that particular user as soon as you list it to them (which means temporarily decrementing the inventory stock).
B. If you are okay with showing the occasional "sorry, we just ran out of stock" message, you should lock the item just before you bill (you could even do it after the transaction completes, but at the cost of a very furious customer).
I would choose approach A for locking, and maybe flag a "selling out soon" warning for items with very little stock left (if it is a very frequent situation, you could probably also count the number of concurrent users hitting the item and give a more accurate warning).
From a business perspective, you wouldn't want stock to be that low (lower than the number of concurrent buyers). That is of course inevitable around Christmas, when it's okay to be out of stock :)
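A rough sketch of approach A, where listing the item to a user temporarily reserves one unit (the reservations table and all names here are assumptions; a periodic job would have to return expired reservations to stock):

    START TRANSACTION;

    -- Take one unit out of stock at listing time, but only if any remain.
    UPDATE inventory
    SET    quantity = quantity - 1
    WHERE  item_id  = 42
      AND  quantity > 0;

    -- The application checks the affected-row count here:
    -- if it is 0, ROLLBACK and show "out of stock";
    -- otherwise record the reservation and COMMIT.
    INSERT INTO reservations (item_id, user_id, reserved_at)
    VALUES (42, 1001, NOW());

    COMMIT;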
I have a project where customers buy a product with platform-based tokens. I have one MySQL table that tracks a customer buying x amount and another tracking customer consumption (-x amount). To display the number of tokens they have left on the platform, and to check remaining funds before a spend, I wanted to query (buys - consumed). But I remembered that people always say space is cheaper than computation (not just in dollars, but in query time as well). Should I have a separate table for querying the amount, updated with each buy or consume?
So far I have always tried to use the smallest number of tables to keep things simple and easy to oversee, but I'm starting to question whether that is right...
There is no single right answer; keep in mind the goal of the application and the updates the software is likely to go through.
If these two tables hold every transaction the user makes, a derived column would only be needed because you have to sum the rows each time. If there is one row per user (likely your case), then 90% of the time you should stick with just those two tables.
I would suggest you not add that extra column. In my experience, the downside in that kind of situation is that the bigger the project becomes, the harder it is for you and the other developers to remember to update the new column, because it is a dependent (derived) value.
Also, whenever the user buys products or consumes tokens, you have to update that derived value too, which costs extra time and work on every write.
You can store (buys - consumed) in the session and update it when needed (if real-time updates are not necessary and there are no multiple devices).
If you need continuous updates, meaning many queries over time, then spending storage to save query time becomes the better trade, and you should add that third table or column.
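For the scale described (one row per buy or consume event), the derived balance is cheap to compute on demand. A minimal sketch, assuming tables named token_buys and token_consumption with an amount column:

    -- Tokens remaining for one customer: total bought minus total consumed.
    SELECT
        COALESCE((SELECT SUM(amount) FROM token_buys        WHERE customer_id = 1001), 0)
      - COALESCE((SELECT SUM(amount) FROM token_consumption WHERE customer_id = 1001), 0)
        AS tokens_left;

With an index on customer_id, these two sums stay fast until the tables reach millions of rows per customer, which is roughly the point where a cached balance column starts to pay off.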
I am building a simple shopping cart. Currently, to ensure that a customer can never purchase a product that is out of stock, when processing the order I have a loop for each product in their cart:
-- Begin a transaction --
Loop through each product in the cart:
    Select the stock count from the products table
    If it is in stock:
        Reduce the stock count for the product
        Add the product to the order items table
    Otherwise:
        Roll back and return an error
-- (If there isn't a call for rollback, everything ends with a commit) --
However, if the stock count for a product is updated by someone else after my transaction has checked that particular product, there may be inconsistencies.
Question: would it be a good idea to lock the table against writes whenever I am processing an order? That way, when the loop above runs, I can be sure that no one else can alter the product count and it will always be accurate.
The idea is that the product count/availability will always be consistent, and there will never be an instance where the stock count goes to -1 (which would be unfulfillable).
However, I have seen so many posts on locks being inefficient/having bad effects. If so, what is the best way to accomplish this?
I have seen alternatives like handling it in a combined update + select query, but apparently that may also not be suitable in some cases.
You have at least three strategies:
1. Pessimistic Locking
If your application will experience low activity then you can lock the tables (or single rows) to make sure no other thread changes the values during the processing of a purchase. It works, but it has performance limitations.
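A minimal sketch of row-level pessimistic locking in MySQL/InnoDB (the products columns are assumptions):

    START TRANSACTION;

    -- Lock only this product's row; competing transactions block here until we commit.
    SELECT stock
    FROM   products
    WHERE  id = 42
    FOR UPDATE;

    -- If the selected stock is positive, decrement it and insert the order item;
    -- otherwise ROLLBACK and report "out of stock".
    UPDATE products
    SET    stock = stock - 1
    WHERE  id = 42;

    COMMIT;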
2. Optimistic Locking
If your application/web site must serve a high load then you can opt for the "optimistic locking" strategy. In this case you add a version number column to your critical tables and use it whenever you read and write them.
When updating stock, you require that the version number of the row you are updating is still the same one you read. If it is not (another thread modified the row), you roll back the transaction and retry, perhaps a couple of times, until you succeed.
It requires more development effort, since you need to detect the conflict case and implement the retry logic (if you want to).
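A minimal sketch of the optimistic variant, assuming a version column has been added to products:

    -- Read the current stock and version; no locks are held.
    SELECT stock, version FROM products WHERE id = 42;

    -- Attempt the write, but only if nobody changed the row in the meantime.
    UPDATE products
    SET    stock   = stock - 1,
           version = version + 1
    WHERE  id      = 42
      AND  version = 7        -- the version read above
      AND  stock   > 0;

    -- Zero affected rows means another transaction won the race: re-read and retry.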
3. Processing Queues
You can implement processing queues. When a thread wants to "purchase an order" it can submit it to a processing queue for purchase orders. This queue can be implemented by one or more threads dedicated to this task; if you choose multiple threads they can be divided by order types, regions, categories, etc. to distribute the load.
This requires more programming effort since you need to manage asynchronous processing, but can sustain much higher levels of load.
You can use this strategy for multiple different tasks: purchasing orders, refilling stock, sending notifications, processing promotions, etc.
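If the queue lives inside MySQL rather than a dedicated broker, one common shape is a pending-orders table that worker threads poll and claim. A rough sketch (all names are assumptions):

    CREATE TABLE purchase_queue (
        id         BIGINT AUTO_INCREMENT PRIMARY KEY,
        cart_id    BIGINT NOT NULL,
        status     ENUM('pending', 'processing', 'done', 'failed') NOT NULL DEFAULT 'pending',
        claimed_by VARCHAR(64) NULL,
        created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- A worker claims the oldest pending order without colliding with other workers.
    UPDATE purchase_queue
    SET    status = 'processing', claimed_by = 'worker-1'
    WHERE  status = 'pending'
    ORDER BY id
    LIMIT 1;

The worker then looks up the row it just claimed (status = 'processing' AND claimed_by = 'worker-1') and performs the stock check and billing there, using either of the two locking strategies above.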
I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large number of users on our platform.
For example: for every user in a large set of users, a complex formula based on variable amounts of "things done" produces a score, used for a matchmaking-like principle.
In our situation, it's based on the number of posts made, connections accepted, messages sent, sessions within a one-month period, and other activity.
I had two ideas to go about doing this:
Real-time: on every post, message, etc., run the formula for that user.
Once a week: run a script to calculate everything for all users.
My concerns about these two:
Real-time: it would be an overkill of queries and calculations for each action a user performs. If, say, 500 users are active and all of them are performing actions, I think the database would have a hard time. A script would also have to run to recalculate the score for inactive users (to lower their score).
Once a week: if we have, for example, 5,000 users (for our first phase), that means running the calculation formula 5,000 times, which could take a long time and will only grow as more users join.
The calculation queries for a single variable in the full formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few, like counting "all connections of my connections", take a few joins.
I started by "logging" every action into a table for this purpose: just the counter values, increased or decreased with every action, and running the formula against those values (one record per week). This works, but it can't be applied to every variable (like the connections of connections).
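For comparison, the "connections of my connections" variable is the kind of thing that resists a simple incremented counter and needs a join each time it is recomputed. A sketch, assuming a connections table of user_id/friend_id pairs:

    -- Distinct second-degree connections for user 1001, excluding the user themselves.
    SELECT COUNT(DISTINCT c2.friend_id) AS friends_of_friends
    FROM   connections c1
    JOIN   connections c2 ON c2.user_id = c1.friend_id
    WHERE  c1.user_id   = 1001
      AND  c2.friend_id <> 1001;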
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder', which uses a similar sort of algorithm for matchmaking (maybe with less complex data variables, because they're not using groups and communities that you can join).
I'm wondering if they run it in real time on every swipe and every settings change, or if they have a script that runs continuously over a small batch of users at a time.
What it all comes down to: what would be the most efficient way to do this, without locking database tables, keeping in mind that at some point we'll have, say, 50,000 users?
The way I would handle this:
1. Implement the realtime algorithm.
2. Measure. Is it actually slow? Try optimizing.
3. Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests, and if it gets busy, it can take more time to catch up.
4. Still slow? Now you might be able to optimize by batching several changes.
If you have 5,000 users right now, make sure it runs well with 5,000 users. You're not going to grow to 50,000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from that dataset when you update the ratings. Even lots of games by 5,000 players should not take more than a few seconds to fetch and compute over.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.
I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but it does not explain why they would be needed; the author simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can (and have) created shopping carts and other complex databases that work just fine without creating any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!
As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.
The answer revolves around granularity, space consumption, speed, and detail.
Inherently different types of data will be more granular than others, as items can always be rolled up to a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would have store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separated tables for their unique purpose.
In this example, let's say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure you're asking where the space savings come in... it is in the detail for each record. The store table has names, addresses, phone numbers, etc. The register has its number, purchase date, the manager who reconciles it, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record in a single table, it would be a massive redundancy of data adding up to far more space consumption than occurs just by linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
Additionally, the amount of space consumed, and with it the size of the overall table, directly hurts the speed of querying that data. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly improve response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table, the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them), storing everything in a single table would make the application unusable.
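As a small illustration of the linking described above (a sketch only, not the asker's schema): each items_sold row carries just a transaction_id, and the store and transaction detail lives once in its own table.

    CREATE TABLE stores (
        store_id INT PRIMARY KEY,
        name     VARCHAR(100),
        address  VARCHAR(200)
    );

    CREATE TABLE transactions (
        transaction_id BIGINT PRIMARY KEY,
        store_id       INT NOT NULL,
        sold_at        DATETIME NOT NULL,
        FOREIGN KEY (store_id) REFERENCES stores (store_id)
    );

    CREATE TABLE items_sold (
        item_sold_id   BIGINT PRIMARY KEY,
        transaction_id BIGINT NOT NULL,
        sku            VARCHAR(40) NOT NULL,
        FOREIGN KEY (transaction_id) REFERENCES transactions (transaction_id)
    );

    -- The full detail is reassembled on demand instead of being copied onto a million item rows.
    SELECT s.name, t.sold_at, i.sku
    FROM   items_sold   i
    JOIN   transactions t ON t.transaction_id = i.transaction_id
    JOIN   stores       s ON s.store_id       = t.store_id;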
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
Short answer: in a relational database like MySQL, yes. See this overview of referential integrity: http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean you have to use a relational database for your project. In fact, there is a trend toward non-relational (NoSQL) databases, like MongoDB, to achieve the same results with better performance for some workloads. More on RDBMS vs. NoSQL: http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think this example will make it clearer:
Say we want to create an online store. We have, at minimum, Users, Payments, and Events (events being the pages the user navigates to, or other actions). In this scenario we want to link Users with Payments in a secure, relational way: we do not want a Payment to be lost or assigned to another User. So we can use an RDBMS like MySQL to create the Users and Payments tables and link them with proper foreign keys. For the events, however, there will be a lot of them per user (maybe millions) and we need to track them quickly without killing the relational database. In that case a NoSQL database like MongoDB makes total sense.
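A minimal sketch of the Users/Payments linkage described above (column names are assumptions):

    CREATE TABLE users (
        user_id INT PRIMARY KEY,
        email   VARCHAR(255) NOT NULL
    );

    CREATE TABLE payments (
        payment_id INT PRIMARY KEY,
        user_id    INT NOT NULL,
        amount     DECIMAL(10,2) NOT NULL,
        -- Referential integrity: a payment cannot exist without its user.
        FOREIGN KEY (user_id) REFERENCES users (user_id)
    );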
To sum up, you can use a hybrid of SQL and NoSQL, but whether you use one, the other, or both kinds of solution, do it properly.
Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods how to generate the feeds and I would like to ask which is best in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to timeout after 5 minutes. Scaling is achieved by varying this timeout.
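A sketch of the join the standard method would run on each request (table and column names beyond event_log and friends are assumptions based on the description):

    -- Feed for user 1001: all events initiated by that user's friends, newest first.
    SELECT e.*
    FROM   event_log e
    JOIN   friends f ON f.friend_id = e.user_id
    WHERE  f.user_id = 1001
    ORDER BY e.created_at DESC
    LIMIT 50;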
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well, and the brief is for feeds to update as close to real time as possible.
The hypothesised solution in my eyes seems much better; all processing is done offline so no user waits for a page to generate and there are no joins so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
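The fan-out step of the hypothesised method could be a single INSERT ... SELECT per unprocessed event (a sketch; column names are assumptions), which is also where the 100,000-friend worst case turns into 100,000 inserted rows per event:

    -- Fan one event out to every friend of the user who initiated it.
    INSERT INTO user_feed (user_id, event_id)
    SELECT f.user_id, e.event_id
    FROM   event_log e
    JOIN   friends f ON f.friend_id = e.user_id
    WHERE  e.event_id = 987654;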
Question:
The question boils down to two points:
Is the worst-case scenario mentioned above problematic? That is, does table size have an impact on MySQL performance, and are there any issues with mass-inserting this much data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data. Firstly, at the global scale, the storage and indexing requirements on user_feed grow very quickly as your user base becomes larger and more interconnected (both presumably desirable for a social network), since every event is duplicated once per friend. Secondly, consider what happens if, in the course of a minute, 1,000 users each post a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update containing a single row for each user and a timestamp for the last time that user's feed was changed.
Then, although the full join and query would still be required to refresh a feed, a quick query against last_user_feed_update will tell you whether a refresh is required at all. This seems to mitigate the biggest problems with your standard method and avoid the storage-size difficulties, though that background thread still has a lot of work to do.
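A sketch of the cheap freshness check that compromise relies on (assuming one row per user in last_user_feed_update; column names are assumptions):

    -- Is the cached feed for user 1001 still current?
    SELECT last_changed_at
    FROM   last_user_feed_update
    WHERE  user_id = 1001;
    -- If last_changed_at is newer than the cached feed's timestamp, rebuild with the full join;
    -- otherwise serve the cache.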
The hypothesised method works better when you limit the maximum number of friends. A lot of sites set a safe upper bound, Facebook included IIRC. It limits the 'hiccups' that occur when your 100K-friend user generates activity.
Another problem with the hypothesised model is that some of the friends you are pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden these inactive users put on you.
I've thought about this problem many times; it's not a problem MySQL is going to be good at solving. I've considered ways to use memcached, where each user pushes their latest few status items to "their key" (and a feed read fetches and aggregates all your friends' keys)... but I haven't tested this, and I'm not sure of all the pros and cons yet.