Database design for a performance-focused reputation system - MySQL

Background:
I am trying to optimize a fully Ajax-driven forum in RoR. Since RoR is already not the ideal platform for a full Ajax site, I am trying to optimize my SQL requests and storage.
Issue:
The reputation of a post is based on simple likes/dislikes and ranges from 0-100%. Primarily, only the last 100 votes should count, PLUS 10% of the reputation of all other posts that refer/answer to that post. What is the most efficient way to store that value in my database for fast reads?
Tried solutions for Post.reputation:
a) Reading all joins separately on each request. That would mean reading huge join tables and counting the joined rows. Does that create a big server load because it loads many entries, or is it not a problem since it is basically only one table?
b) Not using joins at all, but storing the reputation sums as actual (+1 on like, +0.1 on sub-like) and potential (+1 on like or dislike, +0.1 on sub-like or sub-dislike). Post.reputation would then be actual/potential. I would still have to use joins on users_posts to limit each user to one vote per post. In my eyes this is the solution with the best performance, but is there a way to implement the 100-vote limit with additional variables? It seems I have pretty much dropped the information about the order of the votes, which would be needed for that.
c) Basically storing all joins as in a), but additionally storing the reputation value itself in the database for fast reads, recalculating and writing it whenever a referring join is added. Is it bad practice to store the same information multiple times in the DB?
Question:
Which solution is the smartest way to store that information in my database and access it quickly/often?

The best approach would be (c). In an RDBMS we often do store redundant information as a cache to increase performance.
Additional notes:
Ensure that the join table has an index on [post_id, id]. This will speed up selecting the 100th record from the join table.
A good place to do the updates is a callback on the model of the join table. This ensures that the updates happen within a transaction.
In Post's has_many definition, specify :order with the criteria (most likely id desc) that returns the newest user_post first. This will simplify other queries.
Let me know if you need some schematic code.
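For illustration, a rough SQL sketch of what such a callback could execute; the users_posts join table, its value column (1 = like, 0 = dislike) and the cached posts.reputation column are assumptions based on the question:

-- index recommended above, so the newest votes per post are cheap to find
ALTER TABLE users_posts ADD INDEX idx_post_recent (post_id, id);

-- recalculate the cached reputation of the voted-on post from its last 100 votes
-- (the 10% contribution from referring posts would be added by the same callback)
UPDATE posts p
SET p.reputation = (
  SELECT AVG(v.value) * 100        -- percentage of likes among the newest 100 votes
  FROM (
    SELECT value
    FROM users_posts
    WHERE post_id = 42             -- placeholder for the post touched by the callback
    ORDER BY id DESC
    LIMIT 100
  ) AS v
)
WHERE p.id = 42;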

Related

Best way to store hashtags and the uses / combinations in a database

So there is a huge collection of hashtags (> 100,000) in a database.
Other functionality requires that the hashtags are stored in different tables split by first letter (tags_a, tags_b, ...), but I think this fact can be ignored and they can be treated as one table for the purposes of a solution.
I now want to implement a counter for the uses of the tags.
So I think I could just append a column to these tables and have the corresponding value counted up each time the tag is used.
That way I would have to combine the tables to, e.g., get the top 100 most used tags...
Is there a cleverer, more performant and space-saving way to store the counts?
Also, my other need is to store information about the combinations of tags that were used.
For example, a user uses the tags '#a, #e, #k, #w', and I want to know how often #a was used together with #e, #a with #k, #e with #w, and so on...
The first approach that comes to mind would be a table with columns tag1 (FK), tag2 (FK) and count, but this table could grow to tag-count * tag-count rows... isn't there a better way?
In the future I would like to have a kind of recommendation like:
You have used #e, you may also want to use #k.
( where #k is one of the most popular combination with #e )
Or something like a spam filter, where I check for possible / usual relations between tags.
How can I store this kind of information, again, in the most performant and space-saving way?
EDIT
I am expecting up to 1 million 'posts' per day, where each post can have up to 10 tags.
And actually these are not posts, but for simplicity, I will call them so.
The point is that there will be a kind of AI implemented that needs to read and learn from this and a lot of other stored data, and also do other work, in a relatively short interval of time, so we want to minimize the data it has to handle and process as much as we can.
I would start off with designing the database to match the real-world requirements. You have hashtags, which are related to posts of some kind that are written by users. That sounds to me like a table for Users, a table for Posts (with an FK to Users), a table for Hashtags, and a many-to-many linking table between Posts and Hashtags with the appropriate FKs.
If you want to find how many times a hashtag has been used, then simply query the tables. Unless you're getting into the tens of millions (possibly hundreds of millions) of rows, with proper indexing you should be fine.
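A minimal sketch of that schema and the count query, with hypothetical table and column names:

CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50));
CREATE TABLE posts (id INT AUTO_INCREMENT PRIMARY KEY,
                    user_id INT NOT NULL,
                    FOREIGN KEY (user_id) REFERENCES users(id));
CREATE TABLE hashtags (id INT AUTO_INCREMENT PRIMARY KEY, tag VARCHAR(100) UNIQUE);
CREATE TABLE post_hashtags (
  post_id INT NOT NULL,
  hashtag_id INT NOT NULL,
  PRIMARY KEY (post_id, hashtag_id),
  KEY idx_hashtag (hashtag_id),      -- lets per-tag counts be served from the index
  FOREIGN KEY (post_id) REFERENCES posts(id),
  FOREIGN KEY (hashtag_id) REFERENCES hashtags(id)
);

-- top 100 most-used tags, straight from the linking table
SELECT h.tag, COUNT(*) AS uses
FROM post_hashtags ph
JOIN hashtags h ON h.id = ph.hashtag_id
GROUP BY h.id, h.tag
ORDER BY uses DESC
LIMIT 100;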
Once you've implemented the basic functionality, if tests show (and not hunches) that you are going to run into performance issues then you can refine your requirements further to handle that problem.
Assuming the main question is "How do I bump counters hundred(s) of times per second?"
If you have SSD drive(s), simply do UPDATE ... SET x = x + 1 WHERE .... If not, you will bottleneck on disk activity. (Also, thousands may overwhelm SSDs.)
It would be nice if you could build that UPDATE with WHERE hashtag IN (the-10-or-so-hashtags). (Your choice to split the data into multiple tables essentially prevents this optimization.) But there is hope -- the counters really should be in a table of their own, not in the main table of users. This is to segregate the high activity of the counters from other uses.
So, you need to buffer up the hashtags and update them in batches. This might delay the counting a little, but that is better than swamping the system.
Will the info come in from one thread? Multiple threads, but one server? Multiple servers? The details of the solution depend on the answers to those questions (and probably other questions). Meanwhile, read my blog on high speed ingestion for some hints on the direction I will take you.
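As a sketch of the batching idea, assuming the counters live in a hypothetical tag_counts table of their own:

-- small, hot table holding nothing but the counters
CREATE TABLE tag_counts (
  hashtag_id INT PRIMARY KEY,
  uses INT NOT NULL DEFAULT 0
);

-- the ingestion layer accumulates (hashtag_id, delta) pairs in memory,
-- then flushes them in a single statement every second or so:
INSERT INTO tag_counts (hashtag_id, uses)
VALUES (101, 3), (205, 1), (317, 7)      -- placeholder ids and accumulated deltas
ON DUPLICATE KEY UPDATE uses = uses + VALUES(uses);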

Storing duplicate data in MySQL tables

I have a table with all registered members, with columns like uid, username, last_action_time.
I also have a table that keeps track of who has been online in the past 5 minutes. It is populated by a cron job that pulls rows from members whose last_action_time is less than 5 minutes ago.
Question: Should my online table include username or no? I'm asking this because I could JOIN both tables to obtain this data, but I could store the username in the online table and not have to join. My concern is that I will have duplicate data stored in two tables, and that seems wrong.
If you haven't run into performance issues, DO NOT denormalize. There is a good saying: "normalize until it hurts, denormalize until it works". In your case, it works with the normalized schema (joining the users table). And databases are designed to handle huge amounts of data.
This approach is called denormalization. Sometimes, for a quick SELECT query, we have to duplicate some data across tables. In this case I believe it is a good choice if you have a lot of data in both tables.
You just hit a very valid question: when does it make sense to duplicate data?
I could rewrite your question as: when does it make sense to use a cache? Caches need maintenance: you have to keep them up to date yourself, and they use up some extra space (although negligible in this case). But they have one pro: a performance increase.
In the example you mentioned, you need to see if that performance increase is actually worth it and if it outweighs the additional work of having and maintaining a cache.
My gut feeling is that your database isn't gigantic, so joining every time should take a minimal amount of effort from the server, so I'd go with that.
Hope it helps
You shouldn't store the username in the online table. There shouldn't be any performance issue. Just use a join every time to get the username.
Plus, you don't need the online table at all: why not query only the users with a last_action_time < 5 min from the members table?
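Something along these lines, using the columns named in the question (the index is a suggested addition):

-- who has been active in the last 5 minutes, no separate online table needed
SELECT uid, username
FROM members
WHERE last_action_time >= NOW() - INTERVAL 5 MINUTE;

-- an index on last_action_time keeps this cheap
ALTER TABLE members ADD INDEX idx_last_action (last_action_time);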
A user ID would be an integer (i.e. 4 bytes). A username, I would imagine, is up to 16 bytes. How many users are there? How often does a username change? These are the questions to consider.
I would just store the username. I would have thought that once the username is registered, it is fixed for the duration.
It is difficult to answer these questions without a little background - performance issues are hard to think about when the depth and breadth, usage etc. are not known.

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has, updated each time he makes a post, and used a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will also get decent query speeds. But it will start to hurt once your posts table is big enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common and simple, yet very effective, kind of optimization.
So just add a column with the number of posts each user has made and maintain it with triggers or in your application (a trigger sketch follows the comparison below).
The performance difference is:
With COUNT(*) you will always have an index lookup + counting of the results
With the additional field you'll have an index lookup + returning a single number (the answer is already there).
And there will be no significant difference between MyISAM and InnoDB in this case.
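A minimal sketch of the trigger variant, assuming a posts table with a user_id column and a hypothetical post_count column on the users table:

-- cached counter, maintained by the database itself
ALTER TABLE users ADD COLUMN post_count INT NOT NULL DEFAULT 0;

CREATE TRIGGER posts_after_insert
AFTER INSERT ON posts
FOR EACH ROW
  UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;
-- a matching AFTER DELETE trigger would decrement the counter

-- reading the count is then a primary-key lookup instead of a COUNT(*)
SELECT post_count FROM users WHERE id = 42;   -- 42 is a placeholder user id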
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
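For example (table and column names assumed):

-- with this index the count can be resolved from the index alone
ALTER TABLE posts ADD INDEX idx_user (user_id);

SELECT COUNT(*) FROM posts WHERE user_id = 42;   -- 42 is a placeholder user id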
This is really one of those 'trade-off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, 'UserPosts' for example, yes, it would be quicker to query that table, but the real question is: is your 'Posts' table really so large that you can't just query it for this count? The trade-off between the two approaches is obviously this:
1) having a separate audit table means overhead when adding or updating a post
2) not having a separate audit table means overhead when querying the table directly
My gut instinct is always to design a system to record the data in a sensibly normalised fashion. I NEVER make tables based on the fact that it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential; only then would I incorporate them.
At the end of the day, I think that unless your 'Posts' table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out of the DB once, cache it, and then update the cache when or if a change is detected.
Performance on posting, or performance on counting? From a data-purist perspective, a recorded count is not the same as an actual count. You can watch the front door of an auditorium, add the people that come in and subtract those that leave, but what if some sneak in the back door? What if you bulk-delete a problem topic? If you record the count, then every post is slowed down by calculating and recording it. For me data integrity is everything, and I will COUNT(*) every time. I just ran a test on a table with 31 million rows, doing a COUNT(*) on an indexed column where the value matched 424,887 rows: 1.4 seconds on my P4 2 GB development machine (I intentionally underpower my development server so I get punished for slow queries; on the production 8-core 16 GB server that count takes less than 0.1 second). You can never guard your data against unexpected changes or errors in your program logic. COUNT(*) is the count, and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries too.
There are a whole pile of trade-offs, so no one can give you the right answer. But here's an approach no one else has mentioned:
You could use the "select where" query, but cache the result in a higher layer (memcache, for example). So your code would look like:
count = memcache.get('article-count-' + user_id)
if count is None:
    # cache miss: fall back to the database and remember the result
    count = database.execute('select ..... where user_id = ' + user_id)
    memcache.put('article-count-' + user_id, count)
And when a user makes a new post, you would also need:
memcache.delete('article-count-' + user_id)
This will work best when the article count is read often but updated rarely. It combines the advantage of efficient caching with the advantage of a normalized database. But it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). Another unsuitable case is when someone's article count is needed often, but it is almost always a different person each time.
A further advantage of an approach like this is that you don't need to add the caching now. You can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
More generally: you don't need to cache in your database. You can also put a cache "around" your database. Something I have done with Java is to use caching at the iBATIS level, for example.

In a database, when should you store derived data?

My question is about denormalization. In a database, when should you store derived data in its own column, rather than calculating it every time you need it?
For example, say you have Users who get Upvotes for their Questions. You display a User's reputation on their profile. When a User is Upvoted, should you increment their reputation, or should you calculate it when you retrieve their profile:
SELECT User.id, COUNT(*) AS reputation FROM User
LEFT JOIN Question
ON Question.User_id = User.id
LEFT JOIN Upvote
ON Upvote.Question_id = Question.id
GROUP BY User.id
How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?
To continue our example, suppose an Upvote has a weight that depends on how many Upvotes (not how much reputation) the User who cast it has. The query to retrieve their reputation suddenly explodes:
SELECT
User.id AS User_id,
SUM(UpvoteWeight.weight) AS reputation
FROM User
LEFT JOIN Question
ON User.id = Question.User_id
LEFT JOIN (
SELECT
Upvote.Question_id,
COUNT(Upvote2.id)+1 AS weight
FROM Upvote
LEFT JOIN User
ON Upvote.User_id = User.id
LEFT JOIN Question
ON User.id = Question.User_id
LEFT JOIN Upvote AS Upvote2
ON
Question.id = Upvote2.Question_id
AND Upvote2.date < Upvote.date
GROUP BY Upvote.id
) AS UpvoteWeight ON Question.id = UpvoteWeight.Question_id
GROUP BY User.id
This is far out of proportion with the difficulty of an incremental solution. When would normalization be worth it, and when do the benefits of normalization lose to the benefits of denormalization (in this case query difficulty and/or performance)?
How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?
There really are two questions here in the guise of one: (1) will this change improve performance, and (2) will the performance improvement be worth the effort?
As far as whether there will be a performance improvement, this is basically a standard pros/cons analysis.
The benefits of normalization are basically two-fold:
Easier data integrity
No issues with re-calculation (e.g. if the underlying data changes, the derived column needs to be re-calculated).
If you cover the data integrity with a robustly implemented solution (e.g. a trigger, stored-proc-only data changes with direct table-change permissions revoked, etc.), then this becomes a straightforward calculation of whether the cost of detecting a source-data change and re-calculating the derived data then outweighs the cost of recalculating the derived data on every read. (NOTE: another approach to keeping data integrity is to force the recalculation of the derived data on a schedule, where that data can afford to be inaccurate within some time tolerance. StackExchange takes this approach with some of its numbers.)
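As an illustration of the trigger option, a minimal sketch using the tables from the example (the cached reputation column and the trigger name are hypothetical, and it covers only the simple +1 case, not the weighted one):

ALTER TABLE User ADD COLUMN reputation INT NOT NULL DEFAULT 0;

CREATE TRIGGER upvote_after_insert
AFTER INSERT ON Upvote
FOR EACH ROW
  UPDATE User u
  JOIN Question q ON q.id = NEW.Question_id
  SET u.reputation = u.reputation + 1
  WHERE u.id = q.User_id;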
In a typical scenario (many more reads of the data and far fewer changes to the underlying data), the math pretty obviously skews in favor of keeping denormalized derived data in the table.
In some rare cases where the underlying data changes VERY often yet the derived data is not retrieved that often, doing that might be detrimental.
Now, we are onto the far more important question: Will the performance improvement be worth the effort?
Please note that, as with ALL optimizations, the biggest question is "is the optimization even worth it at all?", and as such it is subject to two main considerations:
Measuring exact performance difference and generally profiling.
Context of this specific optimization in the big picture of your system.
E.g. if the difference in query performance - which, as always when optimizing, must first be measured - is 2% between cached derived data and computed data, the extra system complexity of implementing the reputation cache column may not be worth it in the first place. But where the threshold lies between caring and not caring about a marginal improvement depends on the big picture of your app. If you can take steps to improve query performance by 10% somewhere else, concentrate on that instead of the 2%. If you're Google and an extra 2% of query performance carries a cost of 2 billion dollars in extra hardware, it needs to be optimized anyway.
There is really no clear-cut answer because it depends on a lot of factors, like the volume of the site and how often you display the reputation (i.e. only on the profile page, or next to EVERY instance of the user name, everywhere). The only real answer is "when it gets too slow"; in other words, you would probably need to test both scenarios and get some real-world performance stats.
Personally, I'd denormalize in this particular situation and have either an insert trigger on the Upvote table or a periodic update query that refreshes the denormalized reputation column (sketched below). Would it really be the end of the world if someone's rep said "204" instead of "205" until the page refreshes?
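A sketch of the periodic-update variant, using the tables from the question (the cached reputation column is hypothetical):

-- recompute every cached reputation in one pass (run from cron or a MySQL event)
UPDATE User u
LEFT JOIN (
  SELECT q.User_id, COUNT(up.id) AS rep
  FROM Question q
  JOIN Upvote up ON up.Question_id = q.id
  GROUP BY q.User_id
) AS t ON t.User_id = u.id
SET u.reputation = COALESCE(t.rep, 0);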
I just wanted to throw in another angle on the data integrity concern that DVK covered so well in the response above. Think about whether other systems may need to access/calculate the derived data -- even something as simple as a reporting system. If other systems need to use the derived value or update the upvote value then you may have additional considerations around how to reuse the calculation code or how to ensure that the derived value is consistently updated no matter what system changes the upvote.

Forum Schema: should the "Topics" table contain topic_starter_Id? Or is it redundant information?

I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely, but possible), and I can't know which one was first. This is normalized, but becomes expensive as the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Am I guaranteed that the database will assign ids in insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post - would my database reuse the post_id later? This is MySQL I'm using.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. But that is not normalized. I believe this is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
I must somewhat disagree with Charles that the only way to save on performance is to denormalize to avoid an extra query.
To be more specific, there's an optimization that works without denormalization (and its attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users for the sake of argument; it depends on your scale - our apps use this approach with 10k+ mappings).
Namely, have your application layer (code running on the web server) retrieve the list of users into a proper cache (e.g. one with data-expiration facilities). Then, when you need to print the first/last user's name, look it up in the cache on the server side.
This avoids an extra query for every page view, since you only need to retrieve the full user list ONCE per N page views - when the cache expires or when user data is updated (which should also expire the cache).
It adds a wee bit of CPU time and memory usage on the web server, but in Yet Another Holy War (i.e. whether to spend more resources on the DB side or the app-server side) I'm firmly in the "don't waste DB resources" camp, seeing how scaling up a DB is vastly harder than scaling up a web or app server.
And yes, if that (or an equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (fewer headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm agnostic in that particular Holy War, I just go with whatever gives better marginal benefits (i.e. how much performance is lost vs. how much cost/risk comes from denormalization).