Should you scale through tables or computation in MySQL?

I have a project where customers buy a product with platform-based tokens. I have one MySQL table that tracks each customer purchase (x amount) and another that tracks customer consumption (-x amount). To display how many tokens a customer has left on the platform, and to check the funds left when they spend, I wanted to query (buys - consumed). But I remembered that people always say space is cheaper than computation (not just in $ but in query time as well). Should I have a separate table for the current amount that gets updated with each buy or consume?
So far I have always tried to use as few tables as possible to keep things simple and easy to oversee, but I am starting to question whether that is right...

There is no single right answer; keep in mind the goal of the application and the software changes that are likely to happen.
If these 2 tables hold many transactions per user, then the new column is justified, because otherwise you have to sum the rows on every read. If there is one row per user (likely your case), then 90% of the time you should use those 2 tables only.
I would suggest you not add that extra column. In my experience, the downside in this kind of situation is that the bigger the project becomes, the harder it is for you and the other developers to remember to update the new column, because it is a dependent (derived) value.
Also, whenever the user buys or consumes tokens, you will have to update the stored amount, which costs time and effort as well.
You can store (buys - consumed) in the session and update it when needed (if real-time updates are not necessary and the user is not on multiple devices).
If you need continuous updates, meaning many queries over time, then the cost in time and computation outweighs the cost in storage, so you should have that third table or column.
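To make the trade-off concrete, here is a minimal sketch of both options side by side. It is not the asker's schema: SQLite stands in for MySQL so it runs anywhere, the table and column names (`buys`, `consumption`, `balances`) are made up, and consumption is stored as a positive amount that gets subtracted.

```python
import sqlite3

# SQLite stands in for MySQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buys        (customer_id INTEGER, amount INTEGER);
    CREATE TABLE consumption (customer_id INTEGER, amount INTEGER);
    CREATE TABLE balances    (customer_id INTEGER PRIMARY KEY,
                              tokens      INTEGER NOT NULL);
""")

def computed_balance(customer_id):
    """Option 1: derive the balance from the two transaction tables on every read."""
    row = conn.execute(
        """
        SELECT (SELECT COALESCE(SUM(amount), 0) FROM buys WHERE customer_id = ?)
             - (SELECT COALESCE(SUM(amount), 0) FROM consumption WHERE customer_id = ?)
        """,
        (customer_id, customer_id),
    ).fetchone()
    return row[0]

def record(table, customer_id, amount):
    """Option 2: write the transaction row and the stored balance together."""
    assert table in ("buys", "consumption")
    delta = amount if table == "buys" else -amount
    with conn:  # one transaction: all three statements commit, or none do
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (customer_id, amount))
        conn.execute("INSERT OR IGNORE INTO balances VALUES (?, 0)", (customer_id,))
        conn.execute(
            "UPDATE balances SET tokens = tokens + ? WHERE customer_id = ?",
            (delta, customer_id),
        )

def stored_balance(customer_id):
    """Option 2 read path: a single primary-key lookup."""
    row = conn.execute(
        "SELECT tokens FROM balances WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    return row[0] if row else 0

record("buys", 1, 100)
record("consumption", 1, 30)
record("buys", 1, 5)
```

Both `computed_balance(1)` and `stored_balance(1)` agree here, but only because every write goes through `record`. That is exactly the maintenance burden described above: any code path that inserts into `buys` or `consumption` directly leaves the stored balance stale.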

Related

MySQL - Partitioning vs multiple table suggestion for a use case

We have around 30,000 customers, and each customer has multiple products. We currently store all the products in a single table partitioned by KEY(customerid). I would like your suggestions on whether separate tables for each customer would be more beneficial than partitioning, or whether we should continue to use partitioning with the current (HASH) type or a different one.
The number of products per customer varies: a few customers have > 1M products, while some have as few as a few hundred. This may result in unevenly sized partitions.
If a customer account is deleted, so are all products of that customer. With separate tables, this would be quite convenient.
All customers are disjointed. So there is no query to access cross-customer products.
The number of customers is quite large (around 30k), and I am not sure it is a good idea to have so many tables.
Is any other partitioning scheme better than what we currently use?
Thank you for your inputs.

Generally I would go with the single table solution that you already have, it's the simple, straight-forward way to go.
You don't mention your motivation for wanting to change your setup.
How many entries do you have in your products table?
Are you experiencing performance issues with your current setup? If not I might be inclined to call this a case of "premature optimization".
If you ARE experiencing performance issues I would start by analyzing those first (profiling) to determine whether they are caused by your single products table design being a bottleneck.
Practical advice I can offer: Make sure you are using InnoDB storage engine and not MyISAM since that will allow for row level locks.
The downside to your proposal of having one table for each customer is maintenance and complexity. If you ever want to change the schema of the product tables, it will be a lot more complicated and error-prone than before. You might have to write a script to batch the changes across all those tables, and what if the script crashes halfway? Then half of your customers have a changed table schema and the other half don't. As I mentioned, if you do not currently have a performance problem, you would be adding this complexity and maintenance without gaining anything.
You state that "All customers are disjointed. So there is no query to access cross-customer products." However, it might not stay that way forever. Imagine that in 2 months you need to extract a list of all customers who own a specific product of type x. That would be a simple SQL query in your current setup; in the multi-table setup you would have to write a script or small program that iterates over all customers and makes a product query for each one. So what was 1 query before is now 30,000 queries.
What you propose is a simple form of sharding. If you decide to go that way you may want to look into sharding since there are other ways to approach than the somewhat aggressive approach of giving every customer a dedicated table. E.g. use a hash of each customer id as sharding key, so every customer is either part of group A or group B. Products owned by A-customers are in ProductTableA, products owned by B-customers are in ProductTableB. (in a real implementation you may want to hash to a value between 0-255 and then keep a reference list saying that 0-127 are table-A, 128-255 are table-B, that way if you ever decide to scale up and add one more table, you don't have to recalculate all your hashes you just update your reference list).
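The bucket-plus-reference-list idea could be sketched roughly like this. The table names follow the example above; the use of CRC32 as the hash and the function names are my own assumptions, and any stable hash would do.

```python
import zlib

NUM_BUCKETS = 256  # every customer hashes to a fixed bucket in 0..255

# Reference list: which bucket range lives in which table.
# Edit this list when adding a table; the hash itself never changes.
BUCKET_RANGES = [
    (range(0, 128), "ProductTableA"),
    (range(128, 256), "ProductTableB"),
]

def bucket_for(customer_id: int) -> int:
    """Stable bucket for a customer id.

    CRC32 is an arbitrary choice here; it is used because it gives the
    same value in every process, unlike Python's salted hash() for strings.
    """
    return zlib.crc32(str(customer_id).encode("ascii")) % NUM_BUCKETS

def table_for(customer_id: int) -> str:
    """Look up the table that holds this customer's products."""
    bucket = bucket_for(customer_id)
    for buckets, table in BUCKET_RANGES:
        if bucket in buckets:
            return table
    raise LookupError(f"no table configured for bucket {bucket}")
```

Adding a third table later means editing `BUCKET_RANGES` (say 0-84, 85-169, 170-255) and moving the rows of the buckets that changed owner. The bucket of each customer stays the same, so nothing has to be rehashed.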

Best way to take the SQL SUM of a large data set in a distributed environment

Problem Scenario
Consider a database design for a supermarket. I have two tables (A & B): A records items added to the inventory and B records sales. To get the running balance of a particular item in the shop, I have to take the sum of that item's quantities from A and subtract the sum of that item's quantities from B. Please consider this as the abstract scenario.
Assume that the number of rows in each table is very high.
My question is what the best practice is for calculating the running balance in this case. Is it OK to write SQL that does exactly what I described above, or is there another approach that is friendlier in terms of performance and resources? I can't calculate the running balance in real time since I am running this in a distributed environment (using SymmetricDS). In my case multiple stores add records to their local databases, and SymmetricDS propagates those records to a master database (in the cloud). However, the balance query will always be executed at the master database.

Is it necessary to link or join tables in MySQL?

I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but does not explain why they would be needed. He just simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can (and have) created shopping carts and other complex databases that work just fine without creating any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!

As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.

The answer revolves around granularity, space consumption, speed, and detail.
Inherently different types of data will be more granular than others, as items can always be rolled up to a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would have store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separated tables for their unique purpose.
In this example, let's say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure you're asking where the space savings come in ... it is in the detail for each record. The store table has names, address, phone, etc. The register has number, purchase date, manager who reconciles, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record in a single table, it would be a massive redundancy of data, adding up to far more space consumption than would occur just by linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
Additionally, the amount of space consumed, as well as the size of the overall table, directly affects how fast you can query that data: the bigger the table, the slower the query. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly reduce the response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table, the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them), storing everything in a single table would make the application unusable.
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
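As a rough illustration of the separate-tables layout, here is a small sketch. SQLite stands in for MySQL, the column names and sample rows are invented, and the register level is left out to keep it short. Store details are written once, and each item row carries only a transaction id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Store details are stored once, not repeated on every item sold.
    CREATE TABLE stores (
        store_id INTEGER PRIMARY KEY,
        name     TEXT,
        address  TEXT
    );
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        store_id       INTEGER REFERENCES stores(store_id),
        sold_on        TEXT
    );
    -- The lowest grain: one narrow row per item sold.
    CREATE TABLE items_sold (
        item_id        INTEGER PRIMARY KEY,
        transaction_id INTEGER REFERENCES transactions(transaction_id),
        sku            TEXT,
        price_cents    INTEGER
    );

    INSERT INTO stores VALUES (3, 'Location 3', '1 Example Street');
    INSERT INTO transactions VALUES (1000, 3, '2024-01-02'), (1001, 3, '2024-01-02');
    INSERT INTO items_sold VALUES
        (1, 1000, 'MILK',  199),
        (2, 1000, 'BREAD', 250),
        (3, 1001, 'MILK',  199);
""")

# Roll items up to store sales per day by following the linking ids.
rows = conn.execute("""
    SELECT s.name, t.sold_on, COUNT(*) AS items, SUM(i.price_cents) AS total_cents
    FROM items_sold AS i
    JOIN transactions AS t ON t.transaction_id = i.transaction_id
    JOIN stores       AS s ON s.store_id = t.store_id
    GROUP BY s.store_id, t.sold_on
""").fetchall()
```

The join rebuilds the "wide" view on demand, so nothing is lost by splitting the data, while the million-row table stays narrow.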

Short answer: in a relational database like MySQL, yes. Check this out about referential integrity: http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean that you have to use a relational database for your project. In fact, the trend is to use non-relational databases (NoSQL), like MongoDB, to achieve the same results with better performance. More about RDBMS vs NoSQL: http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think this example will help you understand:
Let's say we want to create an online store. At minimum we have Users, Payments and Events (events about the pages the user navigates to, or other actions). In this scenario we want to link Users with Payments in a secure, relational way. We do not want a Payment to be lost or assigned to another User. So we can use an RDBMS like MySQL to create the tables Users and Payments and link them with proper foreign keys. For the events, however, there are going to be a lot of them per user (maybe millions), and we need to track them quickly without killing the relational database. In that case a NoSQL database like MongoDB makes total sense.
To sum up, you can use a hybrid of SQL and NoSQL, but whether you use one, the other, or both kinds of solution, do it properly.
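Here is a small sketch of what a foreign key between Users and Payments buys you. SQLite is used as a stand-in so it runs without a server, and the column names and the helper `try_sql` are hypothetical. Note that SQLite only enforces foreign keys after the `PRAGMA` shown, whereas MySQL's InnoDB enforces them unless `foreign_key_checks` is turned off.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL
    );
    CREATE TABLE payments (
        payment_id   INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL REFERENCES users(user_id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO users VALUES (1, 'a@example.com');
    INSERT INTO payments VALUES (10, 1, 4999);
""")

def try_sql(sql):
    """Run one statement; report whether the database accepted it."""
    try:
        with conn:
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# A payment pointing at a user that does not exist is refused ...
orphan_payment = try_sql("INSERT INTO payments VALUES (11, 42, 100)")
# ... and so is deleting a user who still has payments attached.
delete_user = try_sql("DELETE FROM users WHERE user_id = 1")
```

Without the relationship, both statements would succeed silently and the application code would be the only thing standing between you and orphaned payments.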

MySQL architecture for n * (n - 1) / 2 algorithm

I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated via its own algorithm based on the similarity between the 2 given profiles. For example, User A has a "match rating" of 85 with User B and 79 with User C, B and C have a rating of 94, and so on....
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I came up with 2 solutions:
My first solution was to have a nightly batch job, that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I then can join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2 there are 45 possible combinations for 10 users. For 1,000 users I suddenly have to insert 499,500 rating combinations into my rating table.
The second solution was to leave MySQL be and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10,000 users and I want to do a search for every user living in New York sorted by rating, I have to load EVERY user that is living in NY into my app (let's say 3,000), apply the algorithm and then return only the top 100 to the user. This way I have loaded 2,900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with the results.
Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?

If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is to note which users require a change in their profiles (whenever any of the parameters on which the matching is based, changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
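To put numbers on that, here is a minimal sketch; the function names are mine, and users are just integer ids. It counts the pairs a full nightly batch has to rate, and builds the set of pairs that actually need recalculating when only some profiles changed.

```python
def full_recalc_pairs(n_users: int) -> int:
    """Number of unordered user pairs: n * (n - 1) / 2."""
    return n_users * (n_users - 1) // 2

def pairs_to_recalculate(all_users, changed_users):
    """Unordered pairs (low_id, high_id) in which at least one profile changed."""
    changed = set(changed_users)
    pairs = set()
    for u in changed:
        for v in all_users:
            if u != v:
                # Normalise the order so (a, b) and (b, a) count once.
                pairs.add((min(u, v), max(u, v)))
    return pairs

# 10 changed profiles out of 10,000 users.
stale = pairs_to_recalculate(range(10_000), range(10))
```

`len(stale)` comes out just under the 100,000 mentioned above, because pairs where both users changed are only counted once, compared with the roughly 50 million unordered pairs (100,000,000 ordered comparisons) of a full run.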
Other strategies would be to only run the main algorithm for records which have the greater chance of being compared: in your example, "same city". Or, when updating records (this would require storing (user_1, user_2, ranking, last_calculated)), only recalculate those records that have a high ranking, are very old, or were never calculated. Lowest ranked matches aren't likely to change so much that they float to the top in a short time.
UPDATE
The problem is also operating with O(N^2) storage space.
How to reduce this space? I think I can see two approaches. One is to not put some information in the match table at all. The stricter and steeper the "match" function is, the more sense it makes; having ten thousand "good matches" would mean that matching means very little. So we would still need lots of recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.
Storage would still grow quadratically, but less steeply.
Another strategy would be to recalculate the match at query time, in which case we would need to develop some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match; this could entail somehow rewriting the match between User1 and User2 as a very simple function of a subset of DataUser1, DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL capabilities and offload some calculations to the MySQL engine.
To this purpose you might perhaps "map" some data, at input time (therefore in O(k)), to spatial information, or to strings and employ Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.

If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only O(n) rather than O(n^2), and updates should be as well. If someone really wants to see matches past the first 100 (and you want to let them), then you have the option of running the query in real time at that point.
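A minimal sketch of "only keep the best k per user", assuming the rating is some function of two users. `toy_rating` below is an invented stand-in for the site's real algorithm, and users are plain integers.

```python
import heapq

def top_matches(user, candidates, rating, k=100):
    """Return the k best (rating, candidate) pairs for user, best first.

    heapq.nlargest keeps at most k entries in memory while it scans,
    so the full scored list is never materialised.
    """
    scored = ((rating(user, other), other) for other in candidates if other != user)
    return heapq.nlargest(k, scored)

def toy_rating(a, b):
    """Hypothetical rating: closer ids score higher, capped to 0..100."""
    return 100 - min(abs(a - b), 100)

best_three = top_matches(5, range(10), toy_rating, k=3)
```

Only the returned rows would be written to the match table, so storage is at most k rows per user, i.e. n * k overall. Computing them still means scoring every candidate once per user, so this saves space and sorting work, not the comparisons themselves.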

I agree with everything @Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
This should work reasonably well for one user logging in. Passably for two . . . but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.

MySQL: how to ensure integrity in multiple read only architecture

Scenario is simple to describe, but might have a complex answer:
Imagine a case where you have one write only mysql database. Then you have about 5 or 6 read only databases. The write database has a count for a particular inventory. You have hundreds of thousands of users banging away at this particular inventory item, but only limited quantity. For argument's sake, say 10 items.
What's the best way to ensure that only 10 items get sold? If there is even a 200ms delta between the time the read-only slaves get updated, can't the integrity of the count go stale, thus selling inventory you do not have?
How would you solve/scale this problem?

The basic solution to concurrent users will probably cover this too. At some point in the "buy" transaction, you need to decrement the inventory (on the write-server). Through whatever method, enforce that inventory can't go below zero.
If there's one item left, and two people trying to buy it, one will be out of luck.
The replication latency is exactly the same thing. Two users see a product available, but by the time they try to buy it, it's gone. A good solution for that scenario covers both replication latency and a user simply snatching the last item out from under another user.
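One common way to enforce "inventory can't go below zero" on the write server is a conditional decrement, sketched below. SQLite stands in for MySQL here, and the table layout and the function name `try_buy` are assumptions. The `WHERE quantity > 0` guard makes the check and the decrement a single statement, so what a stale read replica showed the user no longer matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # plays the role of the write master
conn.executescript("""
    CREATE TABLE inventory (
        item_id  INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity >= 0)
    );
    INSERT INTO inventory VALUES (1, 10);
""")

def try_buy(item_id):
    """Try to take one unit; True if this buyer got it, False if sold out."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE item_id = ? AND quantity > 0",
            (item_id,),
        )
    # rowcount is 1 only if the guard matched and a unit was actually taken.
    return cur.rowcount == 1

# 15 buyers go after 10 items.
results = [try_buy(1) for _ in range(15)]
```

Exactly ten of the fifteen attempts succeed; the rest get the "out of luck" path. The application decides what to show the user based on the return value, never on a count read from a replica.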

It all depends on when, and for what window, you decide to lock the master table for the update.
A. If you have to be 100% sure that a purchase is only attempted when the item is definitely available, you will have to lock the item for the particular user as soon as you list it to them (which means you temporarily decrement the inventory stock).
B. If you are okay with showing the one-off "sorry, we just ran out of stock" message, you should lock the item just before you bill (well, you could do that after the transaction is complete, but at the cost of a very furious customer).
I would choose approach A for locking, and maybe flag a "selling out soon" warning for items with very low stock left (if it's a very frequent situation, you could probably also count the number of concurrent users hitting the item and give a more accurate warning).
From the business perspective, you wouldn't want to be so low on stock (lower than the number of concurrent buyers). This is inevitable of course at "Christmas" times, when it's okay to be out of stock :)
