We have a MySQL database table for products. We use a cache layer to reduce database load, but we think it would be a good idea to minimize the data that has to be stored in the cache layer, to speed up the application further.
All products in the database that are visible to visitors have a price attached to them.
The prices are stored in a different table, called prices. There are multiple price categories, depending on which discount level each visitor (customer) qualifies for. From time to time there are campaigns, during which a special price is available for each product. The special prices are stored in a table called specials.
Is it a bad idea to create a temp table that binds the tables together?
It would only hold the necessary information and would of course be cached.
productId | hasPrice | hasSpecial
----------|----------|-----------
1         | 1        | 0
2         | 1        | 1
That way it would be very easy to know whether a specific product really has a price, without having to scan the complete prices or specials table every time a product is listed or presented.
Are temp tables a common thing for web applications or is it just bad design?
If you're going to cache this data anyways, does it really need to be in a temp table? You would only incur the overhead of the query when you needed to rebuild the cache, so the temp table might not even be necessary.
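As a rough sketch of what "only run the query when you rebuild the cache" could look like: the summary from the question can be produced by a single query and the result handed to the cache. This uses Python's sqlite3 module as a stand-in for MySQL so that it runs on its own, and the column names (products.id, prices.product_id, specials.product_id) are assumptions, since the question does not show the real schema.

```python
import sqlite3

def make_demo_db():
    """Build a tiny in-memory database with assumed column names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE prices   (product_id INTEGER, category TEXT, amount REAL);
        CREATE TABLE specials (product_id INTEGER, amount REAL);
        INSERT INTO products VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO prices   VALUES (1, 'retail', 10.0), (1, 'gold', 9.0),
                                    (2, 'retail', 20.0);
        INSERT INTO specials VALUES (2, 15.0);
        """
    )
    return conn

def build_price_summary(conn):
    """Return {product_id: (has_price, has_special)} from one query.

    The result is what would be stored in the cache; no temp table is involved.
    """
    rows = conn.execute(
        """
        SELECT p.id,
               EXISTS (SELECT 1 FROM prices   pr WHERE pr.product_id = p.id),
               EXISTS (SELECT 1 FROM specials sp WHERE sp.product_id = p.id)
        FROM products p
        ORDER BY p.id
        """
    ).fetchall()
    return {pid: (bool(has_price), bool(has_special))
            for pid, has_price, has_special in rows}

summary = build_price_summary(make_demo_db())
```

With indexes on the two product_id columns, each EXISTS check is an index lookup rather than a scan of the whole prices or specials table.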
You should approach it like any other performance problem: Decide how much performance is necessary, then iterate doing testing on production-grade hardware in your lab. Do not do needless optimisations.
You should profile your app and find out whether it's doing too many queries or whether the queries themselves are slow; in my experience, most cases of web-app slowness are caused by doing too many queries, even when each query is very cheap.
Normally the best engineering solution is to restructure the database, in some cases denormalising, to make the common read use-cases require fewer queries. Caching may be helpful as well, but refactoring so you need fewer queries is often the best.
Essentially you can increase the amount of work on the write-path to reduce the amount on the read-path, if you are planning to do a lot more reading than writing.
I have a table named transactions, something like this:
id | user_id | business_id | amount | tracking_code | status | created_at | updated_at
As you can see, this table keeps all transactions. It currently has over 50M rows, and about 4k new rows are added every day. I'm worried that over the next year or two, as the business scales up, I will end up with a really huge table.
Currently we have two indexes on this table for better search performance. The engine is InnoDB.
Any idea how it should be handled generally?
On the hardware side, I'm completely OK with upgrading the server when needed. But I guess the issue will be managing the data in the future, not the resources.
I would not worry about application performance with 1 billion rows on a machine that can keep the indexes in memory.
However, I would suggest making the id columns BIGINT if you know the table is growing at a fast pace, as a signed 32-bit integer is limited to 2^31-1 = 2,147,483,647 values.
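To put that limit in perspective, here is a small back-of-the-envelope calculation. It assumes ids are handed out one per row with no gaps, which is optimistic, since auto-increment values can be skipped; the helper name and the second growth rate are made up for illustration.

```python
INT_MAX = 2**31 - 1      # upper bound of a signed 32-bit INT
BIGINT_MAX = 2**63 - 1   # upper bound of a signed 64-bit BIGINT

def years_until_exhausted(current_rows, rows_per_day, max_id=INT_MAX):
    """Years until max_id is reached, assuming one id per row and no gaps."""
    remaining = max_id - current_rows
    return remaining / rows_per_day / 365

# Figures from the question: 50M rows today, about 4k new rows per day.
at_current_rate = years_until_exhausted(50_000_000, 4_000)

# Hypothetical: the business grows to 1000 times the current insert rate.
at_1000x_rate = years_until_exhausted(50_000_000, 4_000_000)
```

At the rate given in the question the signed INT range lasts well over a thousand years, so the switch to BIGINT only starts to matter if the insert rate grows by several orders of magnitude or ids are consumed much faster than rows are kept.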
The performance of your table and searches depends on:
How many joins do the queries perform on this table?
How well are your indexes set up? Apparently well.
How much RAM is in the machine hosting the DB?
What are the speed and number of its processors?
How large are the rows, and how much data do the queries return?
Disk space is cheap. You have, what, 3.2 GB in data and some factor on top of that in indexes. If the application doesn't need all the data to be online, you have the option of archiving old data (dump, then delete from the table). You can also look into compression, possibly in combination with one of the alternative storage engines.
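For what it's worth, the "archive, then delete" idea can be sketched like this. Instead of dumping to a file, this variant copies old rows into a hypothetical transactions_archive table and removes them from the live table in the same transaction. It uses Python's sqlite3 as a stand-in for MySQL, keeps only a few of the columns from the question, and assumes created_at holds ISO-formatted strings.

```python
import sqlite3

def make_demo_db():
    """In-memory stand-in with a trimmed-down transactions table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE transactions
            (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, created_at TEXT);
        CREATE TABLE transactions_archive
            (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, created_at TEXT);
        INSERT INTO transactions VALUES
            (1, 10, 5.0, '2015-03-01 12:00:00'),
            (2, 11, 7.5, '2016-07-15 09:30:00'),
            (3, 10, 2.0, '2018-01-02 18:45:00');
        """
    )
    return conn

def archive_old_rows(conn, cutoff):
    """Move rows created before cutoff into the archive table.

    Both statements run in one transaction, so a failure in either one
    leaves the live table untouched. Returns the number of rows moved.
    """
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO transactions_archive "
            "SELECT * FROM transactions WHERE created_at < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM transactions WHERE created_at < ?", (cutoff,)
        )
    return cur.rowcount

conn = make_demo_db()
moved = archive_old_rows(conn, "2017-01-01 00:00:00")
```

On a table with tens of millions of rows you would probably want to do this in smaller batches rather than in one large delete.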
Don't worry about how many CPUs and their speed. That is rarely the limiting factor in MySQL.
Do you need either created_at or updated_at?
Don't bother with compression; it is unlikely to be worth it.
Do use smaller datatypes where practical (and conservative).
Stick with InnoDB. There are many reasons; I won't repeat them here.
Please show us SHOW CREATE TABLE and some of the critical queries. We may have more tips.
We have around 30,000 customers, and each customer has multiple products. We currently store all the products in a single table partitioned by KEY(customerid). I would like your suggestions on whether separate tables for each customer would be more beneficial than partitioning, or whether we should continue to use partitioning with the current (HASH) type or a different one.
The number of products per customer varies: a few customers have more than 1M products, while some have as few as a few hundred. This may result in unevenly sized partitions.
If a customer account is deleted, so are all products of that customer. With separate tables, this would be quite convenient.
All customers are disjoint, so no query accesses products across customers.
The number of customers is quite large (around 30k), and I am not sure it's a good idea to have so many tables.
Is any other partitioning scheme better than the one we currently use?
Thank you for your inputs.
Generally I would go with the single-table solution that you already have; it's the simple, straightforward way to go.
You don't mention your motivation for wanting to change your setup.
How many entries do you have in your products table?
Are you experiencing performance issues with your current setup? If not I might be inclined to call this a case of "premature optimization".
If you ARE experiencing performance issues I would start by analyzing those first (profiling) to determine whether they are caused by your single products table design being a bottleneck.
Practical advice I can offer: Make sure you are using InnoDB storage engine and not MyISAM since that will allow for row level locks.
The downside to your proposal of having one table for each customer is maintenance and complexity. If you ever want to change the schema of the product tables, it will be a much more complicated and error-prone task than before. You might have to write a script to batch the changes across all those tables, and what if the script crashes halfway? Then half of your customers have a changed table schema and the other half don't. As I mentioned, if you do not currently have a performance problem, you would be adding this complexity and maintenance without gaining anything.
You state that "All customers are disjointed. So there is no query to access cross-customer products." However, it might not stay that way forever. Imagine that in 2 months you need to extract a list of all customers who own a specific product of type x. That would be a simple SQL query in your current setup; in the multi-table setup you would have to write a script or small program that iterates over all customers and makes a product query for each one. So what was 1 query before is now 30,000 queries.
What you propose is a simple form of sharding. If you decide to go that way you may want to look into sharding since there are other ways to approach than the somewhat aggressive approach of giving every customer a dedicated table. E.g. use a hash of each customer id as sharding key, so every customer is either part of group A or group B. Products owned by A-customers are in ProductTableA, products owned by B-customers are in ProductTableB. (in a real implementation you may want to hash to a value between 0-255 and then keep a reference list saying that 0-127 are table-A, 128-255 are table-B, that way if you ever decide to scale up and add one more table, you don't have to recalculate all your hashes you just update your reference list).
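The bucket-plus-reference-list idea could be sketched roughly as follows. The choice of CRC32 as the hash and all function names are assumptions made for illustration; only the table names ProductTableA and ProductTableB and the 0-255 range come from the description above.

```python
import zlib

NUM_BUCKETS = 256

# Reference list: which bucket range lives in which table.
BUCKET_RANGES = [
    (range(0, 128), "ProductTableA"),
    (range(128, 256), "ProductTableB"),
]

def bucket_for(customer_id):
    """Map a customer id to a stable bucket in 0..255 (CRC32 is an assumed choice)."""
    return zlib.crc32(str(customer_id).encode("ascii")) % NUM_BUCKETS

def table_for(customer_id, ranges=BUCKET_RANGES):
    """Look up the table that holds this customer's products."""
    bucket = bucket_for(customer_id)
    for bucket_range, table in ranges:
        if bucket in bucket_range:
            return table
    raise LookupError(f"no table configured for bucket {bucket}")

# Scaling up later: only the reference list changes, the hashes stay the same.
THREE_TABLES = [
    (range(0, 128), "ProductTableA"),
    (range(128, 192), "ProductTableB"),
    (range(192, 256), "ProductTableC"),
]
```

When a third table is added, only customers in buckets 192-255 have to be moved; everyone whose bucket is below 192 stays where they were, which is the point of hashing to a fixed range first instead of directly to a table.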
We currently have a table that contains 90 columns, and as the table grows and the business needs change, we have to alter the table a lot (adding and removing columns and indexes).
|------ (Table name: quotes)
|Column|Type|Null|Default
|------
|//**id**//|int(11)|No|
....
|completed_at|datetime|Yes|NULL
|reviewed_at|datetime|Yes|NULL
|marked_dud_at|datetime|Yes|NULL
|closed_at|datetime|Yes|NULL
|subscribed_at|datetime|Yes|NULL
|admin_checked_at|datetime|Yes|NULL
|priced_at|datetime|Yes|NULL
|number_verified_at|datetime|Yes|NULL
|created_at|datetime|Yes|NULL
|deleted_at|datetime|Yes|NULL
For the application, our staff are constantly querying all sorts of variations on the above data, example being where it has been completed (completed_at), checked (admin_checked_at) and not deleted, reviewed (deleted_at, reviewed_at)
We're thinking it may be easier to offload some of these columns into their own table, which we'll call quotes_actions, and then do some joining when querying.
|------ (Table name: quotes_actions)
|Column|Type|Null|Default
|------
|//**id**//|int(11)|No|
|quote_id|int(11)|No|
|action|varchar(100)|No|
|user_id|int(11)|No|
|time|datetime|Yes|NULL
|created_at|datetime|Yes|NULL
An example would be action = 'completed' using the field, with an index covering quote_id and action.
We've split the data into this format on 150,000 rows and it's not any faster nor slower than querying the original database with correct indexes.
Has anyone got any experience with this and has any recommendations or pitfalls for each approach? It's taking a lot of time to add covering indexes and add columns to the original table as we needed them, whereas the second approach has the indexes set up ready to go but is introducing a lot more joins and more complicated queries.
0.09s
select * from `quotes`
where `completed_at` is not null
and `approved_at` is not null
and deleted_at is null
=>
0.0005s
select * from `quotes_new`
inner join quotes_actions as q1 on q1.action = 'completed' and q1.quote_id = quotes_new.id
inner join quotes_actions as q2 on q2.action = 'approved' and q2.quote_id = quotes_new.id
where quotes_new.deleted_at is null
In addition, if the 2nd approach is better, how do you query for negative results, where a quote hasn't been approved?
Database design will vary from application to application, and things that are great for one implementation will be terrible for another. You've identified a few things that are important to you:
speed of data access (at least no reduction in current performance)
ability to respond to application needs/changes
limiting complexity of queries
Without being able to see the entirety of your database and how you are using it, these are the principles I would follow:
Use Stored Procedures and Views for as much as possible
This is just good design. You create an adapter layer between your application and the data tables, which allows you to make whatever changes you need to in the database (and the views/stored procs) without having to change the application itself. Decoupling your systems makes maintenance significantly easier. Also this is good for security, as if the only way outsiders can access the data is through your stored procs, you've eliminated a few avenues of attack. (There's also debate about whether or not the DBMS will cache execution plans for stored procedures, making them execute faster than similar queries, but I'm not a DBA or DBDev, so I'm not touching that).
Attempt to limit width of tables
One thing I've seen time and time again is that every time a need arises in a production system, a column gets added to a table and they call it a day. That is far easier than rewriting a bunch of queries or reviewing table structures, but it is terrible design. If you've already limited the changes needed to the application layer by following my first piece of advice, you've limited the work needed to actually resolve table changes in the right way. You should always evaluate whether data belongs to the row in question, or if it should be offloaded into its own table. You shouldn't be afraid to radically alter your database, as sometimes it is necessary.
Looking at the data you've provided, I think your second option is okay. You've identified many columns that actually represent the same thing (the "status changes" or as you put it "quote actions" that occur) and offloaded that from the main table to a secondary table. This is perfectly fine, and likely will be effective. You can further "cheat" to make this table faster by offloading status onto its own table, and using an integer to represent it instead of a string (since the string doesn't matter to the database, and integers are far faster to index and search).
This is not to say a wide table is a bad thing, sometimes tables just need to be wide. You just need to evaluate whether the data really belongs to the entity the data row represents.
Approach queries in new ways
You will want to play with the execution plan tools of your DBMS and understand how each query really works. Changing the order of joins can drastically alter the query return speed, and you shouldn't be afraid to use table variables and temp tables in your queries. They are all tools at your disposal.
Querying for Negative Results
Since you asked this question specifically, I'll address it. This requires thinking about your query in a slightly different way (incidentally, if you haven't, you should look into taking a course or working through a textbook on relational algebra; it makes understanding databases so much easier).
Your original query made finding something where the quote was not approved easy. It was all in the table: approved_at is null. Simple, easy peasy, no problems. Now, however, instead of being in a column on the main table, it is in its own table, that also represents all the other actions that could be taken. You need to break the problem down a little.
You want to find the set of quotes for which no action signifies approval. In SQL that looks like:
select distinct quote_id from quotes_actions where quote_id not in
(select quote_id from quotes_actions where action = 'approved');
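One thing to watch out for: a query that only reads the actions table cannot return a quote that has no actions at all. A possible variant, sketched here with Python's sqlite3 as a stand-in for MySQL, starts from the quotes table and uses NOT EXISTS instead. The table names quotes_new and quotes_actions are taken from the question; the trimmed-down columns and the sample rows are assumptions.

```python
import sqlite3

def make_demo_db():
    """In-memory stand-in with only the columns this query needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE quotes_new (id INTEGER PRIMARY KEY, deleted_at TEXT);
        CREATE TABLE quotes_actions
            (id INTEGER PRIMARY KEY, quote_id INTEGER, action TEXT);
        CREATE INDEX idx_actions_quote_action ON quotes_actions (quote_id, action);

        INSERT INTO quotes_new VALUES
            (1, NULL), (2, NULL), (3, NULL), (4, '2020-01-01 00:00:00');
        INSERT INTO quotes_actions (quote_id, action) VALUES
            (1, 'completed'), (1, 'approved'),   -- quote 1: approved
            (2, 'completed'),                    -- quote 2: completed only
            (4, 'completed');                    -- quote 4: deleted
        -- quote 3 has no actions at all
        """
    )
    return conn

def unapproved_quote_ids(conn):
    """Ids of non-deleted quotes with no 'approved' action (anti-join)."""
    rows = conn.execute(
        """
        SELECT q.id
        FROM quotes_new q
        WHERE q.deleted_at IS NULL
          AND NOT EXISTS (
              SELECT 1
              FROM quotes_actions a
              WHERE a.quote_id = q.id AND a.action = 'approved'
          )
        ORDER BY q.id
        """
    ).fetchall()
    return [row[0] for row in rows]

result = unapproved_quote_ids(make_demo_db())
```

Here quote 3, which has no rows in quotes_actions, is still reported as not approved, and the index on (quote_id, action) from the question is the one the subquery can use.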
Final Thoughts
You need to sit down with your team and talk about how you want to move forward with this product. Spend a few days or a couple weeks really thinking deeply about it. Brainstorm....hackathon....do something to find a solution you like and makes your product better and more maintainable. We've all been in the situation where we have an unmaintainable product that could have been fixed at some point, but is beyond that point. Try not to get to that point, and fix it while you have the opportunity.
I'm making my first site with Django, and I'm having a database design problem.
I need to store some of the users history, and I don't know whether it's better to create a table like this for each user every time one signs up:
table: $USERNAME$
id | some_data | some_more | even_more
or have one massive table from the start, with everyone's data in:
table: user_history
id | username | some_data | some_more | even_more
I know how to do the second one, just declare it in my Django models. If I should do the first one, how can I in Django?
The first one organises the data more hierarchically but could potentially create a lot of tables depending on the popularity of the service (is this a bad thing?)
The second one seems to more suit Django's design philosophies (from what I've seen so far), and would be easier to run comparative searches between users, but could get huge (number of users * average items in history). Can MySQL handle, say, 1 billion records? (I won't get that, but it's good to plan ahead)
Definitely the second format is the way you want to go. MySQL is pretty good at handling large numbers of rows (assuming they're indexed and cached as appropriate, of course). For example, all versions of all pages on Wikipedia are stored on one table in their database, and that works absolutely fine.
I don't know what Django is, but I'm sure it's not good practice to create a table per user for logging (or almost anything, for that matter).
Best regards.
You should definitely store all users in one table, one row per user. It's the only way you can filter the data using a WHERE clause. I'm not sure whether MySQL can handle 1 billion records, but I've never found the number of records to be a limiting factor. I wouldn't worry about it for now.
You see, every high-load project started out as something that was simply well designed. A well-designed system has better prospects of being improved to handle huge loads.
Also keep in mind that even the brilliant people at Twitter, Facebook and the like did not know what issues they would run into later on. And you will not know either. Predicting and solving load and scalability challenges is a sort of rocket science.
So the best you can do now is to start with the most normalized database and textbook solutions, and solve the bottlenecks as soon as they appear.
When creating a relational database, you would only want to create a new table if it contains significantly different data than the original table. In this case, all of the tables will be pretty much the same, so you would only want 1 table for all users.
If you want to break it down even further, you may not want to store all of the user's actions in the user table. You may want to have one table for user information and another for user history, i.e.:
table: User
Id | UserName | Password | other data
table: User_history
Id | some_data | timestamp
There's no need to be worried about the speed of your database as long as you define proper indexes on the fields you plan to search. Using those indexes will definitely speed up your response time as more records are put into your table. The database I work on has several tables with 30,000,000+ records and there's no slow-down.
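A small sketch of the "one history table plus an index" setup, again using Python's sqlite3 as a stand-in for MySQL. The user_id column and the index name are assumptions added for illustration (the history table needs some way to point back at its user), and the query-plan output shown by SQLite differs from what MySQL's EXPLAIN prints.

```python
import sqlite3

def make_demo_db():
    """One shared history table for all users, indexed on the lookup column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT);
        CREATE TABLE user_history
            (id INTEGER PRIMARY KEY, user_id INTEGER, some_data TEXT, timestamp TEXT);
        CREATE INDEX idx_user_history_user_id ON user_history (user_id);

        INSERT INTO user VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO user_history (user_id, some_data, timestamp) VALUES
            (1, 'login',  '2020-01-01 10:00:00'),
            (2, 'login',  '2020-01-01 11:00:00'),
            (1, 'logout', '2020-01-01 12:00:00');
        """
    )
    return conn

def history_for_user(conn, user_id):
    """All history rows for one user, filtered with a plain WHERE clause."""
    return conn.execute(
        "SELECT id, some_data FROM user_history WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()

def query_plan(conn, user_id):
    """Text of SQLite's plan for the per-user lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id, some_data FROM user_history WHERE user_id = ?",
        (user_id,),
    ).fetchall()
    return " ".join(str(row[-1]) for row in rows)

conn = make_demo_db()
alice_history = history_for_user(conn, 1)
plan = query_plan(conn, 1)
```

The plan text mentions idx_user_history_user_id, meaning the lookup goes through the index instead of reading every user's rows.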
Definitely DO NOT create a TABLE per user. Create a row per user, and possibly a row per user and smaller tables if some data can be factored.
Definitely stick with one table for all users. Consider that complicated queries may need extra resources when they have to run against multiple tables instead of just one.
Run some tests; as far as resources are concerned, I am sure you will find that one table works best.
Everyone has pointed out that the second option is the way to go, I'll add my +1 to that.
About the first option: in Django, you create tables by declaring subclasses of django.db.models.Model, and when you run the management command syncdb it looks at all the models and creates the missing tables for all "managed" models. It might be possible to invoke this behavior at run time, but it isn't the way things are done.
I have a MySQL DB containing entries for the pages of a website.
Let's say it has fields like:
Table pages:
id | title | content | date | author
Each of the pages can be voted on by users, so I have two other tables:
Table users:
id | name | etc etc etc
Table votes:
id | id_user | id_page | vote
Now, I have a page where I show a list of the pages (10-50 at a time) with various information along with the average vote of the page.
So, I was wondering if it were better to:
a) Run the query to display the pages (note that this is already fairly heavy as it queries three tables) and then for each entry do another query to calculate the mean vote (or add a 4th join to the main query?).
or
b) Add an "average vote" column to the pages table, which I will update (along with the votes table) when a user votes on the page.
nico
Use the database for what it's meant for; option a is by far your best bet. It's worth noting that your query isn't actually particularly heavy, joining three tables; SQL really excels at this sort of thing.
Be cautious of this sort of attempt at premature optimization of SQL; SQL is far more efficient at what it does than most people think it is.
Note that another benefit of using your option a is that there's less code to maintain, and less chance of data diverging as code gets updated; it's a lifecycle benefit, and those are too often ignored in favor of minuscule optimization benefits.
You might "repeat yourself" (violate DRY) for the sake of performance. The trade-offs are (a) extra storage, and (b) extra work in keeping everything self-consistent within your DB.
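To make the trade-off concrete, here is a rough side-by-side of the two options, using Python's sqlite3 as a stand-in for MySQL. For option b this sketch stores a running vote_sum and vote_count on pages rather than the average itself, which is my own substitution (the average is then a cheap division and the update stays a simple increment); those two column names and all function names are assumptions.

```python
import sqlite3

def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT,
                            vote_sum INTEGER DEFAULT 0,
                            vote_count INTEGER DEFAULT 0);
        CREATE TABLE votes (id INTEGER PRIMARY KEY, id_user INTEGER,
                            id_page INTEGER, vote INTEGER);
        INSERT INTO pages (id, title) VALUES (1, 'one'), (2, 'two'), (3, 'three');
        """
    )
    return conn

def cast_vote(conn, user_id, page_id, vote):
    """Option b's extra write work: store the vote AND update the totals."""
    with conn:  # one transaction, so the two tables cannot drift apart
        conn.execute(
            "INSERT INTO votes (id_user, id_page, vote) VALUES (?, ?, ?)",
            (user_id, page_id, vote),
        )
        conn.execute(
            "UPDATE pages SET vote_sum = vote_sum + ?, "
            "vote_count = vote_count + 1 WHERE id = ?",
            (vote, page_id),
        )

def averages_by_join(conn):
    """Option a: compute the mean from the votes table at read time."""
    return conn.execute(
        "SELECT p.id, AVG(v.vote) FROM pages p "
        "LEFT JOIN votes v ON v.id_page = p.id "
        "GROUP BY p.id ORDER BY p.id"
    ).fetchall()

def averages_from_columns(conn):
    """Option b: read the precomputed totals straight from pages."""
    return conn.execute(
        "SELECT id, CASE WHEN vote_count = 0 THEN NULL "
        "ELSE 1.0 * vote_sum / vote_count END "
        "FROM pages ORDER BY id"
    ).fetchall()

conn = make_demo_db()
cast_vote(conn, 1, 1, 4)
cast_vote(conn, 2, 1, 5)
cast_vote(conn, 1, 2, 3)
```

Both functions return the same averages as long as every vote goes through cast_vote; the extra consistency work mentioned above is exactly the second statement in that function, plus similar handling if votes can ever be changed or deleted.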
There are advantages/disadvantages both ways. Optimizing too early has its own set of pitfalls, though.
Honestly, for this issue, I would recommend redundant information. Multiple votes for multiple pages can really create a heavy load on a server, in my opinion. If you foresee having real traffic on your website, of course... :-)