COUNT(*) WHERE vs. SELECT(*) WHERE performance - mysql

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has each time he made a post and used a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?

If you keep a record of how many post a user made, it will definitely be faster.
If you have an index on user field of posts table, you will get decent query speeds also. But it will hurt your database when your posts table is big enough. If you are planning to scale, then I would definitely recommend keeping record of users posts on a specific field.

Storing precalculated values is a common and simple, but very efficient sort of optimization.
So just add the column with amount of comments user has posted and maintain it with triggers or by your application.
The performance difference is:
With COUNT(*) you always will have index lookup + counting of results
With additional field you'll have index lookup + returning of a number (that already has an answer).
And there will be no significant difference between myisam and innodb in this case

Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.

If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.

this is really one of those 'trade off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you are truly only wanting to return the number of posts pers user then using a query based on this column should perform perfectly well.
If you had another table 'UserPosts' for e'g., yes it would be quicker to query that table, but the real question would be 'is your 'Posts' table really so large that you cant just query it for this count. The trade off on both approaches is obviously this:
1) having a separate audit table, then there is an overhead when adding, updating a post
2) not having a separate audit table, then overhead in querying the table directly
My gut instinct is always to design a system to record the data in a sensibly normalised fashion. I NEVER make tables based on the fact that it might be quicker to GET some data for reporting purposes. I would only create them, if the need arised and it was essential to incoroporate them then, i would incorporate it.
At the end of the day, i think unless your 'posts' table is ridiculously large (i.e. more than a few millions of records, then there should be no problem in querying it for a distinct user count, presuming it is indexed correctly, i.e. an index placed on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out from the DB once, cache it, and then update it (the cache), when or if a change detection occurs.

Performance on post or performance on performance on count? From a data purist perspective a recorded count is not the same as an actual count. You can watch the front door to an auditorium and add the people that come in and subtract those the leave but what if some sneak in the back door? What if you bulk delete a problem topic? If you record the count then the a post is slowed down to calculate and record the count. For me data integrity is everything and I will count(star) every time. I just did a test on a table with 31 million row for a count(star) on an indexed column where the value had 424,887 rows - 1.4 seconds (on my P4 2 GB development machine as I intentionally under power my development server so I get punished for slow queries - on the production 8 core 16 GB server that count is less than 0.1 second). You can never guard your data from unexpected changes or errors in your program logic. Count(star) is the count and it is fast. If count(star) is slow you are going to have performance issues in other queries. I did star as the symbol caused a format change.

there are a whole pile of trade-offs, so no-one can give you the right answer. but here's an approach no-one else has mentioned:
you could use the "select where" query, but cache the result in a higher layer (memcache for example). so you code would look like:
count = memcache.get('article-count-' + user_id)
if count is None:
count = database.execute('select ..... where user_id = ' + user_id)
memcache.put('article-count-' + user_id, count)
and you would also need, when a user makes a new post
memcache.delete('article-count-' + user_id)
this will work best when the article count is used often, but updated rarely. it combines the advantage of efficient caching with the advantage of a normalized database. but it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). another unsuitable case is when someone's article count is needed often, but it is almost always a different person.
a further advantage of an approach like this is that you don't need to add the caching now. you can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
more generally: you don't need to cache in your database. you could also put a cache "around" your database. something i have done with java is to use caching at the ibatis level, for example.

Related

500000 user DB is being rather slow

I have a database with the following structure:
username,email,ip,hash,salt
Currently we have around 600.000 users in this database.
Users are complaining that querying this database is rather slow.
In our tests, we found that it takes around 1.15 seconds to retrieve a user record.
This test is based on the following query:
SELECT * FROM users WHERE email = 'test#mail.com'
I'm no expert in database management. I know how to get by when using it like a dictionary, however I have no idea on database optimization.
I was hoping I could get some help. Ideally, we'd be able to query the DB like this in under a second on even 10 million users.
Does anyone have any suggestion on optimizing simple queries like this? I'm open to anything right now, even restructuring the database if there's a more logical way to do it. Because right now, they're just ordered in the order that they registered with.
MySQL has two important facilities for improving performance. For your type of query, 500,000 rows or 10,000,000 rows is just not a big deal. Although other technologies such as NOSQL can perform the same actions, applications such as yours typically rely on the ACID properties of databases. A relational database is probably the right solution.
The first facility -- as mentioned elsewhere -- are indexes. In your case:
create index idx_users_email on users(email);
An index will incur a very small amount of overhead for insert and delete operations. However, with the index, looking up a row should go down to well under 0.1 seconds -- even with concurrent queries.
Depending on other queries you are running other indexes may be appropriate.
The second important capability is partitioning the tables. This is not necessary for a users table. However, it can be quite useful for transactions and other types of data.
you could add an index as already mentioned in the comments, but one thought present itself - you are currently retrieving ALL info for that row - it would be more efficient to target the query to only retrieve that information which is necessary - such as
SELECT username FROM users WHERE email = 'test#mail.com';
also - you should investigate PDO and bound parameters for security.

Database design of performance focused reputation

Background:
I am trying to optimize a full Ajax driven Forum in RoR. As RoR is already not the optimal platform for a full ajax site, i am trying to optimize my sql requests and storage.
Issue:
The reputation for posts is based on simple likes/dislikes going from 0-100%, while primary only the last 100 votes should count PLUS 10% of the reputation of all other posts who refer/answer to that post. Now what is the most efficient way to store that value in my database for a fast read?
Tried solutions for Post.reputation:
a) reading all joins seperately on each request. That would be reading huge join tables and counting the joins. Does that create a big server load since it loads many entries or isn't that a problem since it is basically only 1 table?
b) not using joins at all, but storing the reputation sums in actual (+1 on like, +0.1 on sub-like) and potential (+1 on like or dislike, +0.1 on sub-like or sub-dislike). Then Post.reputation would be actual/potential. At the same time i would have to still use joins for users_posts to limit 1 vote per post. In my eyes this would be the solution with the best performance, but is there a way in implementing the 100 vote count limit with additional variables? Because it seems i pretty much dropped the information about the order of the votes, which would be important for that.
c) basically storing all joins as in a) but additionally storing the reputation value in the database for the DB read and calculating+writing it whenever a refering join is added. Is it a foul way of storing the same information multiple times in the DB?
Question:
Which solution would be the smartest storing that information in my database and accessing it quickly/often?
The best approach will be (c). Many times, in RDBMS, we do store redundant information as cache to increase performance.
Additional notes:
Ensure that the join table has an index on [post_id, id]. This will speedup selecting the 100th record from the join table.
Good place to do the updates is callback of the model of the join table. This will ensure that updates are within a transaction.
In Post's has_many definition specify :order to the criteria (most likely, id desc) that gives the newest user_post first. This will simplify other queries.
Let me know if you need some schematic code.

Should i recalculate big amounts of data from tables, or should I save it in my database?

My question is more general than specific, yet I am using an example to transfer the idea.
I have a forum, and in each replay I present the number of messages the users have.
Assuming that in some pages there are 15 different users, each has over 20,000 messages, should I recalculate the number of messages by counting how many entries in the messages table the user has, or would it be better to create a column in the users table that contains this data, and update the column every time a reply is made?
I know it defies the database normalizations rules, but it seems like a big waste to calculate it every time.
I'm using mySQL, if it matters.
Generally no, but in some specific cases, yes.
You should avoid having redundant data in a database. However, sometimes you have to make that tradeoff to get a decent performance.
I have actually done exactly the thing in your example. It works great for the performance, but it's really hard to keep the message count correct. You will get some inconsistent values sooner or later, so you need a plan for how to go through the values periodically and recalculate them.
You are talking about denormalization. Quoting wikipedia:
denormalization is the process of attempting to optimise the read
performance of a database by adding redundant data or by grouping
data.
Keep denormalized data in 'plain' code is not an easy issue. Remember than:
You can keep redundant data with triggers.
If your architecture includes ORM it is more easy to keep redundant data.
You could also go half way in your denormalisation: have a table with monthly data per user, filled by a monthly job, and calculate the number of messages on the fly, by counting the msg since 1st of month + sum of monthly data. Or if you don't need the monthly data, you can still calc on the fly over the month + a monthly process that updates the EOM figues. That will avoid triggers...
I'm surprised nobody has mentioned materialized views. These objects are very helpful when it comes to maintaining aggregates of data for performance reasons without violating the normalisation of our actual data. Find out more.
Have you tried to benchmark the results of counting the number of rows?
I'd recommend you just do you're calculation in a view. With the denormalization you're proposing, you're just exposing yourself to the risk of data corruption. The post count column will then end up with some arbitrary value that's go nothing to do with the reality of the number of posts.

Doing SUM() and GROUP BY over millions of rows on mysql

I have this query which only runs once per request.
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING total > 0
myTable has less than a dozen columns and can grow up to 5 millions of rows, but more likely about 2 millions in production. All columns used in the query are numbers, except for dateColumn, and there are indexes on dateColumn and categoryColumn.
Would it be reasonble to expect this query to run in under 5 seconds with 5 million rows on most modern servers if the database is properly optimized?
The reason I'm asking is that we don't have 5 millions of data and we won't even hit 2 millions within the next few years, if the query doesn't run in under 5 seconds then, it's hard to know where the problem lies. Would it be because the query is not suitable for a large table, or the database isn't optimized, or the server isn't powerful enough? Basically, I'd like to know whether using SUM() and GROUP BY over a large table is reasonable.
Thanks.
As people in comments under your question suggested, the easiest way to verify is to generate random data and test query execution time. Please note that using clustered index on dateColumn can significantly change execution times due to the fact, that with "<" condition only subset of continuous disk data is retrieved in order to calculate sums.
If you are at the beginning of the process of development, I'd suggest concentrating not on the structure of table and indexes that collects data - but rather what do you expect to need to retrieve from the table in the future. I can share my own experience with presenting website administrator with web usage statistics. I had several webpages being requested from server, each of them falling into one on more "categories". My first approach was to collect each request in log table with some indexes, but the table grew much larger than I had at first estimated. :-) Due to the fact that statistics where analyzed in constant groups (weekly, monthly, and yearly) I decided to create addidtional table that was aggregating requests in predefined week/month/year grops. Each request incremented relevant columns - columns were refering to my "categories" . This broke some normalization rules, but allowed me to calculate statistics in a blink of an eye.
An important question is the dateColumn < ? condition. I am guessing it is filtering records that are out of date. It doesn't really matter how many records there are in the table. What matters is how much records this condition cuts down.
Having aggressive filtering by date combined with partitioning the table by date can give you amazing performance on ridiculously large tables.
As a side note, if you are not expecting to hit this much data in many years to come, don't bother solving it. Your business requirements may change a dozen times by then, together with the architecture, db layout, design and implementation details. planning ahead is great but sometimes you want to give a good enough solution as soon as possible and handle the future painful issues in the next release..

Storing duplicate data in MySQL tables

I have a table with all registered members, with columns like uid, username, last_action_time.
I also have a table that keeps track of who has been online in the past 5 minutes. It is populated by a cronjob by pulling data from members with last_action_time being less than 5 minutes ago.
Question: Should my online table include username or no? I'm asking this because I could JOIN both tables to obtain this data, but I could store the username in the online table and not have to join. My concern is that I will have duplicate data stored in two tables, and that seems wrong.
If you haven't run into performance issues, DO NOT denormalize. There is a good saying "normalize until it hurts, denormalize until it works". In your case, it works with normalized schema (users table joined). And data bases are designed to handle huge amounts of data.
This approach is called denormalization. I mean that sometimes for quick select query we have to duplicate some data across tables. In this case I believe that this one is good choice if you have a lot of data in both tables.
You just hit a very valid question: when does it make sense to duplicate data ?
I could rewrite your question as: when does it make sense to use a cache. Caches need maintenance, you need to keep them up to date yourself and they use up some extra space (although negligible in this case). But they have a pro: performance increase.
In the example you mentioned, you need to see if that performance increase is actually worth it and if it outweighs the additional work of having and maintaining a cache.
My gut feeling is that your database isn't gigantic, so joining every time should take a minimal amount of effort from the server, so I'd go with that.
Hope it helps
You shouldn't store the username in the online table. There shouldn't be any performance issue . Just use a join every time to get the username.
Plus, you don't need the online table at all, why don't you query only the users with an last_action_time < 5 min from the members table?
A user ID would be an integer (AKA 4 bytes). A username (i would imagine is up to 16 bytes). How many users? How ofter a username changes? These are the questions to consider.
I wold just store the username. I wou;ld have though once the username is registered it is fixed for the duration.
If is difficult to answer these questions without a little background - performance issues are difficult to think about when the depth and breath, usabge etc. is not known.