My question is about denormalization. In a database, when should you store derived data in its own column, rather than calculating it every time you need it?
For example, say you have Users who get Upvotes for their Questions. You display a User's reputation on their profile. When a User is Upvoted, should you increment their reputation, or should you calculate it when you retrieve their profile:
SELECT User.id, COUNT(*) AS reputation FROM User
LEFT JOIN Question
ON Question.User_id = User.id
LEFT JOIN Upvote
ON Upvote.Question_id = Question.id
GROUP BY User.id
How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?
To continue our example, suppose an Upvote has a weight that depends on how many Upvotes (not how much reputation) the User who cast it has. The query to retrieve their reputation suddenly explodes:
SELECT
User.id AS User_id,
SUM(UpvoteWeight.weight) AS reputation
FROM User
LEFT JOIN Question
ON User.id = Question.User_id
LEFT JOIN (
SELECT
Upvote.Question_id,
COUNT(Upvote2.id)+1 AS weight
FROM Upvote
LEFT JOIN User
ON Upvote.User_id = User.id
LEFT JOIN Question
ON User.id = Question.User_id
LEFT JOIN Upvote AS Upvote2
ON
Question.id = Upvote2.Question_id
AND Upvote2.date < Upvote.date
GROUP BY Upvote.id
) AS UpvoteWeight ON Question.id = UpvoteWeight.Question_id
GROUP BY User.id
This is far out of proportion with the difficulty of an incremental solution. When is normalization worth it, and when do its benefits lose out to the benefits of denormalization (in this case, simpler queries and/or better performance)?
How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?
There are really two questions here in the guise of one: (1) will this change improve performance, and (2) will the performance improvement be worth the effort?
As for whether there will be a performance improvement, this is basically a standard pros/cons analysis.
The benefits of normalization are basically two-fold:
Easier data integrity
No issues with re-calculation (e.g. if the underlying data changes, the derived column needs to be re-calculated).
If you cover the data integrity with a robustly implemented solution (e.g. triggers, stored-proc-only data changes with direct table change permissions revoked, etc.), then this becomes a straightforward calculation: the cost of detecting whether a source data change warrants re-calculating the derived data vs. the cost of recalculating the derived data on every read. (NOTE: another approach to keeping data integrity is to force the recalculation of derived data on a schedule, where that data can afford to be inaccurate within some time tolerance. StackExchange takes this approach with some of its numbers.)
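For example, the scheduled-recalculation approach can be as simple as a periodic batch update - a minimal sketch, assuming a denormalized User.reputation column and the unweighted count from the first query in the question:
-- recompute every user's cached reputation from the raw Upvote rows
UPDATE User
SET reputation = (
    SELECT COUNT(Upvote.id)
    FROM Question
    LEFT JOIN Upvote ON Upvote.Question_id = Question.id
    WHERE Question.User_id = User.id
);
Run from a scheduled job, this keeps the column within your chosen staleness tolerance without adding any per-write cost.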
In a typical scenario (many more retrievals of the data and far fewer changes to the underlying data), the math pretty obviously skews in favor of keeping the de-normalized derived data in the table.
In the rare cases where the underlying data changes VERY often yet the derived data is not retrieved that often, doing so might be detrimental.
Now, we are onto the far more important question: Will the performance improvement be worth the effort?
Please note that, as with ALL optimizations, the biggest question is "is the optimization even worth it at all?", and as such it is subject to two main considerations:
Measuring exact performance difference and generally profiling.
Context of this specific optimization in the big picture of your system.
E.g. if the difference in query performance - which, as always when optimizing, must be measured first - is 2% between the cached derived data and the computed one, the extra system complexity of implementing the reputation cache column may not be worth it in the first place. But the threshold between caring and not caring about a marginal improvement depends on the big picture of your app. If you can take steps to improve query performance by 10% in a different place, concentrate on that instead of the 2%. If you're Google and an extra 2% of query performance carries a cost of 2 billion dollars in extra hardware, it needs to be optimized anyway.
There is really no clear-cut answer because it depends on a lot of factors, like the volume of the site and how often you display the reputation (i.e. only on the profile page, or next to EVERY instance of the user name, everywhere). The only real answer is "when it gets too slow"; in other words, you would probably need to test both scenarios and get some real-world performance stats.
Personally I'd denormalize in this particular situation and have either an insert trigger on the upvote table or a periodic update query that updates the denormalized reputation column. Would it really be the end of the world if someone's rep said "204" instead of "205" until the page refreshes?
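The insert-trigger version might look something like this - a minimal MySQL sketch, assuming the User/Question/Upvote schema from the question plus a denormalized User.reputation column:
-- every new upvote bumps the reputation of the question's author
CREATE TRIGGER upvote_after_insert
AFTER INSERT ON Upvote
FOR EACH ROW
UPDATE User
SET reputation = reputation + 1
WHERE User.id = (SELECT Question.User_id FROM Question WHERE Question.id = NEW.Question_id);
A corresponding trigger on delete (or on un-vote) would decrement the same column.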
I just wanted to throw in another angle on the data integrity concern that DVK covered so well in the response above. Think about whether other systems may need to access or calculate the derived data -- even something as simple as a reporting system. If other systems need to use the derived value, or update the upvote value, then you may have additional considerations around how to reuse the calculation code, or how to ensure that the derived value is consistently updated no matter which system changes the upvote.
We currently have a table that contains 90 columns, and as the table is growing and the business needs change, we're having to alter the table a lot (add/remove columns & indexes).
Table: quotes

| Column | Type | Null | Default |
|---|---|---|---|
| id | int(11) | No | |
| ... | | | |
| completed_at | datetime | Yes | NULL |
| reviewed_at | datetime | Yes | NULL |
| marked_dud_at | datetime | Yes | NULL |
| closed_at | datetime | Yes | NULL |
| subscribed_at | datetime | Yes | NULL |
| admin_checked_at | datetime | Yes | NULL |
| priced_at | datetime | Yes | NULL |
| number_verified_at | datetime | Yes | NULL |
| created_at | datetime | Yes | NULL |
| deleted_at | datetime | Yes | NULL |
For the application, our staff are constantly querying all sorts of variations on the above data - for example, quotes that have been completed (completed_at), checked (admin_checked_at), reviewed (reviewed_at) and not deleted (deleted_at).
We're thinking it may be easier to offload some of these columns into rows in their own table, which we'll call quotes_actions, and then do some joining when querying.
Table: quotes_actions

| Column | Type | Null | Default |
|---|---|---|---|
| id | int(11) | No | |
| quote_id | int(11) | No | |
| action | varchar(100) | No | |
| user_id | int(11) | No | |
| time | datetime | Yes | NULL |
| created_at | datetime | Yes | NULL |
An example would be a row with action = 'completed', with an index covering quote_id and action.
We've split the data into this format with 150,000 rows, and it's neither faster nor slower than querying the original table with correct indexes.
Has anyone got any experience with this, and any recommendations or pitfalls for each approach? It's taking a lot of time to add covering indexes and columns to the original table as we need them, whereas the second approach has its indexes set up ready to go, but introduces a lot more joins and more complicated queries.
0.09s
select * from `quotes`
where `completed_at` is not null
and `approved_at` is not null
and deleted_at is null
=>
0.0005s
select * from `quotes_new`
inner join quotes_actions as q1 on q1.action = 'completed' and q1.quote_id = quotes_new.id
inner join quotes_actions as q2 on q2.action = 'approved' and q2.quote_id = quotes_new.id
where quotes_new.deleted_at is null
In addition, if the 2nd approach is better, how do you query for negative results, where a quote hasn't been approved?
Database design will vary from application to application, and things that are great for one implementation will be terrible for another. You've identified a few things that are important to you:
speed of data access (at least no reduction in current performance)
ability to respond to application needs/changes
limiting complexity of queries
Without being able to see the entirety of your database and how you are using it, these are the principles I would follow:
Use Stored Procedures and Views for as much as possible
This is just good design. You create an adapter layer between your application and the data tables, which allows you to make whatever changes you need to in the database (and the views/stored procs) without having to change the application itself. Decoupling your systems makes maintenance significantly easier. Also this is good for security, as if the only way outsiders can access the data is through your stored procs, you've eliminated a few avenues of attack. (There's also debate about whether or not the DBMS will cache execution plans for stored procedures, making them execute faster than similar queries, but I'm not a DBA or DBDev, so I'm not touching that).
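For instance, a view can hide the join-heavy shape of a quotes_actions design from the application - a sketch; the view name is hypothetical, and the 'completed'/'approved' actions follow the example queries in the question:
CREATE VIEW completed_approved_quotes AS
SELECT q.*
FROM quotes_new AS q
INNER JOIN quotes_actions AS a1 ON a1.quote_id = q.id AND a1.action = 'completed'
INNER JOIN quotes_actions AS a2 ON a2.quote_id = q.id AND a2.action = 'approved'
WHERE q.deleted_at IS NULL;
The application then just reads from the view, and the underlying tables can be reshaped without touching application code.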
Attempt to limit width of tables
One thing I've seen time and time again is that every time a need arises in a production system, a column gets added to a table and they call it a day. It's far easier than rewriting a bunch of queries or reviewing table structures, but it is terrible design. If you've already limited the changes needed at the application layer by following my first piece of advice, you've limited the work needed to resolve table changes in the right way. You should always evaluate whether data belongs to the row in question, or whether it should be offloaded into its own table. You shouldn't be afraid to radically alter your database, as sometimes it is necessary.
Looking at the data you've provided, I think your second option is okay. You've identified many columns that actually represent the same thing (the "status changes" or as you put it "quote actions" that occur) and offloaded that from the main table to a secondary table. This is perfectly fine, and likely will be effective. You can further "cheat" to make this table faster by offloading status onto its own table, and using an integer to represent it instead of a string (since the string doesn't matter to the database, and integers are far faster to index and search).
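That "cheat" might look something like this - a sketch; the action_types table and action_type_id column are hypothetical names:
-- lookup table for the distinct actions
CREATE TABLE action_types (
    id INT NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE
);

-- integer key replaces the varchar action column in quotes_actions
ALTER TABLE quotes_actions
    ADD COLUMN action_type_id INT NOT NULL,
    ADD INDEX idx_quote_action_type (quote_id, action_type_id);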
This is not to say a wide table is a bad thing, sometimes tables just need to be wide. You just need to evaluate whether the data really belongs to the entity the data row represents.
Approach queries in new ways
You will want to play with the execution plan tools of your DBMS and understand how each query really works. Changing the order of joins can drastically alter the query return speed, and you shouldn't be afraid to use table variables and temp tables in your queries. They are all tools at your disposal.
Querying for Negative Results
Since you asked this question specifically, I'll address it. This requires thinking about your query in a slightly different way (incidentally, if you haven't already, you should look into taking a course or working through a textbook on Relational Algebra; it makes understanding databases so much easier).
Your original query made finding something where the quote was not approved easy. It was all in the table: approved_at is null. Simple, easy peasy, no problems. Now, however, instead of being in a column on the main table, it is in its own table, that also represents all the other actions that could be taken. You need to break the problem down a little.
You want to find the set of all quotes for which there is no action signifying approval. In SQL that looks like:
select quote_id from quotes_actions where quote_id not in
(select quote_id from quotes_actions where action = 'approved');
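A variant that also catches quotes with no recorded actions at all, anchored on the quotes table itself (a sketch using the names from the question):
SELECT q.id
FROM quotes AS q
WHERE NOT EXISTS (
    SELECT 1
    FROM quotes_actions AS a
    WHERE a.quote_id = q.id
    AND a.action = 'approved'
);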
Final Thoughts
You need to sit down with your team and talk about how you want to move forward with this product. Spend a few days or a couple of weeks really thinking deeply about it. Brainstorm, run a hackathon, do something to find a solution you like that makes your product better and more maintainable. We've all been in the situation where we have an unmaintainable product that could have been fixed at some point, but is now beyond that point. Try not to get to that point, and fix it while you have the opportunity.
I'm trying to build something like a message board, just for learning PHP and MySQL. I've been watching some tutorials about conversation systems, and in one of them the guy was storing ALL of the conversations in the same table - all the messages stored together. Is that a good idea? Wouldn't it get slow as the site grows? I was thinking of storing the conversations in files and then having a table to keep track of those files.
(I see a couple of similar questions were already asked but I can't find exactly what I need)
The proper way of doing this is indeed to keep all your conversations in the same tables (a normalised set), but once they are older than a few days (or hours, or minutes, depending on your preferences or needs) they are moved into archive tables (also a normalised set), which makes sure performance never gets too slow. When retrieving messages, if some are needed from the archive tables, a view or a union of the two sets of tables is used.
Storing each message in a new table will not scale well: there is a limit to the number of tables you can have, and each table also requires 2 file handles, which can consume a significant amount of memory. If you have many messages per conversation and many conversations per topic, you might break the data up into a table per topic, or possibly consider a consistent hash ring over a certain number of tables. You can also partition the messages on a hash, giving you further capacity.
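As an illustration of the hashing idea, MySQL can also partition a single messages table by a hash of a column - a sketch; the table and column names here are hypothetical:
CREATE TABLE messages (
    id BIGINT NOT NULL AUTO_INCREMENT,
    conversation_id BIGINT NOT NULL,
    body TEXT,
    created_at DATETIME,
    PRIMARY KEY (id, conversation_id)   -- the partitioning column must be part of every unique key
)
PARTITION BY HASH (conversation_id)
PARTITIONS 16;
Queries that filter on conversation_id can then be pruned to a single partition.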
Your question is a little vague, but in general, I strongly recommend you worry about performance and scalability when you can prove you have a problem.
All other things being equal, modern databases on modern hardware can store hundreds of millions of records without noticeable performance problems.
And, again, in general terms, the thing that slows down a database is not the size of the record, but the access query. So if you're building "a table to keep track of files", you are likely to have the same access problems. So the expensive part is likely to be "find all conversations in descending date order, with the number of participants and the date of the last entry". Grabbing the actual conversation threads should be pretty quick.
My strong recommendation is to use the normalized, relational model until you can prove you have a performance or scalability problem; then buy a bigger server. Only then consider denormalizing. Once you've done all that, you're probably at the size of Facebook.
Background:
I am trying to optimize a fully Ajax-driven forum in RoR. As RoR is already not the optimal platform for a fully Ajax site, I am trying to optimize my SQL requests and storage.
Issue:
The reputation of a post is based on simple likes/dislikes, going from 0-100%, where primarily only the last 100 votes should count, PLUS 10% of the reputation of all other posts that refer to/answer that post. Now what is the most efficient way to store that value in my database for a fast read?
Tried solutions for Post.reputation:
a) reading all joins separately on each request. That would mean reading huge join tables and counting the joins. Does that create a big server load since it loads many entries, or isn't that a problem since it is basically only 1 table?
b) not using joins at all, but storing the reputation sums in two columns: actual (+1 on like, +0.1 on sub-like) and potential (+1 on like or dislike, +0.1 on sub-like or sub-dislike). Then Post.reputation would be actual/potential. At the same time I would still have to use the users_posts join to limit it to 1 vote per post. In my eyes this would be the solution with the best performance, but is there a way of implementing the 100-vote limit with additional variables? Because it seems I have pretty much dropped the information about the order of the votes, which would be needed for that.
c) basically storing all joins as in a), but additionally storing the reputation value in the database for the read, and calculating and writing it whenever a referring join is added. Is it bad practice to store the same information multiple times in the DB?
Question:
Which solution would be the smartest for storing that information in my database and accessing it quickly/often?
The best approach will be (c). Many times in an RDBMS we store redundant information as a cache to increase performance.
Additional notes:
Ensure that the join table has an index on [post_id, id]. This will speed up selecting the 100th record from the join table.
A good place to do the updates is a callback on the model of the join table. This will ensure that the updates happen within a transaction.
In Post's has_many definition, specify :order with the criteria (most likely id desc) that gives the newest user_post first. This will simplify other queries.
Let me know if you need some schematic code.
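For reference, one possible shape such schematic code could take - a sketch, not the answerer's actual code; posts.reputation, votes.post_id and votes.value are hypothetical names, with likes stored as 1 and dislikes as 0, and the 10% contribution from referring posts left out:
-- recompute the cached reputation of one post (id 42 is a placeholder) from its latest 100 votes
START TRANSACTION;

UPDATE posts
SET reputation = (
    SELECT COALESCE(AVG(v.value) * 100, 0)   -- fraction of likes among the last 100 votes, as a percentage
    FROM (
        SELECT value
        FROM votes
        WHERE post_id = 42
        ORDER BY id DESC
        LIMIT 100
    ) AS v
)
WHERE id = 42;

COMMIT;
The 10% contribution from referring posts would be added inside the same transaction, in whatever form the reputation formula finally takes.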
I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster to keep a record of how many posts each user has, updated each time he makes a post, and use a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will get decent query speeds as well. But it will hurt your database when your posts table gets big enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common and simple, but very efficient sort of optimization.
So just add a column with the number of posts the user has made and maintain it with triggers or from your application.
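The trigger variant could be as simple as this sketch, assuming posts/users tables and a users.post_count column (the added column being described):
-- keep the cached counter in step with new posts
CREATE TRIGGER posts_after_insert
AFTER INSERT ON posts
FOR EACH ROW
UPDATE users
SET post_count = post_count + 1
WHERE users.id = NEW.user_id;
A matching AFTER DELETE trigger would decrement the counter.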
The performance difference is:
With COUNT(*) you will always have an index lookup + counting of the results
With the additional field you'll have an index lookup + the return of a number (the already-computed answer).
And there will be no significant difference between MyISAM and InnoDB in this case.
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
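In other words, something like this (a sketch; the posts table and user_id column follow the wording of the question):
CREATE INDEX idx_posts_user_id ON posts (user_id);

-- with the index in place, this count can be answered from the index alone
SELECT COUNT(*)
FROM posts
WHERE user_id = 123;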
This is really one of those 'trade-off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, 'UserPosts' for e.g., yes, it would be quicker to query that table, but the real question is whether your 'Posts' table is really so large that you can't just query it for this count. The trade-off between the two approaches is obviously this:
1) with a separate audit table, there is overhead when adding or updating a post
2) without a separate audit table, there is overhead in querying the table directly
My gut instinct is always to design a system to record the data in a sensibly normalised fashion. I NEVER create tables based on the fact that it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential to incorporate them; then I would incorporate them.
At the end of the day, I think that unless your 'Posts' table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. an index placed on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out of the DB once, cache it, and then update it (the cache) when or if a change is detected.
Performance on post, or performance on count? From a data-purist perspective, a recorded count is not the same as an actual count. You can watch the front door of an auditorium and add the people that come in and subtract those that leave, but what if some sneak in the back door? What if you bulk-delete a problem topic? If you record the count, then every post is slowed down by calculating and recording the count.
For me data integrity is everything, and I will COUNT(*) every time. I just did a test on a table with 31 million rows, doing a COUNT(*) on an indexed column where the value matched 424,887 rows: 1.4 seconds (on my P4 2 GB development machine - I intentionally under-power my development server so I get punished for slow queries; on the production 8-core 16 GB server that count takes less than 0.1 second).
You can never guard your data from unexpected changes or errors in your program logic. COUNT(*) is the count, and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries.
there are a whole pile of trade-offs, so no-one can give you the right answer. but here's an approach no-one else has mentioned:
you could use the "select where" query, but cache the result in a higher layer (memcache for example). so your code would look like:
count = memcache.get('article-count-' + str(user_id))
if count is None:
    count = database.execute('select ..... where user_id = ' + str(user_id))
    memcache.put('article-count-' + str(user_id), count)
and you would also need, when a user makes a new post
memcache.delete('article-count-' + str(user_id))
this will work best when the article count is used often, but updated rarely. it combines the advantage of efficient caching with the advantage of a normalized database. but it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). another unsuitable case is when someone's article count is needed often, but it is almost always a different person.
a further advantage of an approach like this is that you don't need to add the caching now. you can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
more generally: you don't need to cache in your database. you could also put a cache "around" your database. something i have done with java is to use caching at the ibatis level, for example.
I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely, but possible), and I can't know which one was first. This is normalized, but becomes expensive after the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that the database will assign ids in insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post - would the database reuse its post_id later? It's MySQL I'm using.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
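A sketch of what that denormalization could look like at the schema level (the column names and literal values here are hypothetical placeholders):
ALTER TABLE topics
    ADD COLUMN first_user_id INT NULL,
    ADD COLUMN first_user_name VARCHAR(100) NULL,
    ADD COLUMN last_user_id INT NULL,
    ADD COLUMN last_user_name VARCHAR(100) NULL;

-- run from the single code path that adds a post to a topic
UPDATE topics
SET last_user_id = 42,
    last_user_name = 'jonny',
    first_user_id = COALESCE(first_user_id, 42),
    first_user_name = COALESCE(first_user_name, 'jonny')
WHERE id = 7;
The COALESCE calls leave the first_user_* columns untouched once set, so only the first post in a topic fills them.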
I must somewhat disagree with Charles on the point that the only way to save on performance is to de-normalize to avoid an extra query.
To be more specific, there's an optimization that would work without denormalization (and the attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users for the sake of argument - it depends on your scale; our apps use this approach with 10k+ mappings).
Namely, have your application layer (code running on the web server) retrieve the list of users into a proper cache (e.g. one with data-expiration facilities). Then, when you need to print the first/last user's name, look it up in the cache on the server side.
This avoids an extra query for every page view, as you only need to retrieve the full user list ONCE per N page views - when the cache expires, or when user data is updated (which should cause cache expiration).
It adds a wee bit of CPU time and memory usage on the web server, but in Yet Another Holy War (e.g. spend more resources on the DB side or the app server side) I'm firmly in the "don't waste DB resources" camp, seeing how scaling up a DB is vastly harder than scaling up a web or app server.
And yes, if that (or an equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (fewer headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm agnostic in that particular Holy War, I just go with whatever gives the better marginal benefit (e.g. how much performance is lost vs. how much cost/risk comes from de-normalization).