Doing SUM() and GROUP BY over millions of rows in MySQL

I have this query which only runs once per request.
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING total > 0
myTable has fewer than a dozen columns and can grow to 5 million rows, though more likely around 2 million in production. All columns used in the query are numeric, except for dateColumn, and there are indexes on dateColumn and categoryColumn.
Would it be reasonable to expect this query to run in under 5 seconds with 5 million rows on most modern servers, if the database is properly optimized?
The reason I'm asking is that we don't have 5 million rows of data yet, and we won't even hit 2 million within the next few years. If the query doesn't run in under 5 seconds at that point, it will be hard to know where the problem lies: is the query unsuitable for a large table, is the database not optimized, or is the server not powerful enough? Basically, I'd like to know whether using SUM() and GROUP BY over a large table is reasonable.
Thanks.

As people in the comments under your question suggested, the easiest way to verify this is to generate random data and measure the query's execution time. Note that a clustered index on dateColumn can change execution times significantly, because with a "<" condition only a contiguous subset of the data on disk has to be read to calculate the sums.
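For instance, a quick way to produce throwaway test data is a small stored procedure. This is only a sketch: the table layout mirrors the column names from the question, while the procedure name, value ranges, and row count are assumptions.

-- Table roughly as described in the question
CREATE TABLE myTable (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  dateColumn DATETIME NOT NULL,
  categoryColumn INT NOT NULL,
  groupColumn INT NOT NULL,
  numberColumn INT NOT NULL,
  KEY idx_date (dateColumn),
  KEY idx_category (categoryColumn)
) ENGINE=InnoDB;

-- Fill it with random rows
DELIMITER //
CREATE PROCEDURE fill_test_data(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    INSERT INTO myTable (dateColumn, categoryColumn, groupColumn, numberColumn)
    VALUES (NOW() - INTERVAL FLOOR(RAND() * 365) DAY,
            FLOOR(RAND() * 10),
            FLOOR(RAND() * 1000),
            FLOOR(RAND() * 100) - 10);
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL fill_test_data(5000000);  -- slow row-by-row; lower n or batch it for a quick test,
                               -- then time the SUM()/GROUP BY query against the result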
If you are at the beginning of development, I'd suggest concentrating not on the structure of the table and indexes that collect the data, but rather on what you expect to need to retrieve from the table in the future. I can share my own experience presenting a website administrator with web usage statistics. I had several web pages being requested from the server, each falling into one or more "categories". My first approach was to record each request in a log table with some indexes, but the table grew much larger than I had estimated. :-) Because the statistics were analyzed over fixed periods (weekly, monthly, and yearly), I decided to create an additional table that aggregated requests into predefined week/month/year groups. Each request incremented the relevant columns, which corresponded to my "categories". This broke some normalization rules, but allowed me to calculate the statistics in the blink of an eye.
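A minimal sketch of that kind of pre-aggregated table, with entirely hypothetical names, could look like this: one row per category and period, bumped on every incoming request.

CREATE TABLE request_stats (
  category_id INT NOT NULL,
  period_type ENUM('week', 'month', 'year') NOT NULL,
  period_start DATE NOT NULL,
  request_count INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (category_id, period_type, period_start)
) ENGINE=InnoDB;

-- On every request, increment the relevant counter (category 42 and the date are illustrative):
INSERT INTO request_stats (category_id, period_type, period_start, request_count)
VALUES (42, 'month', '2012-03-01', 1)
ON DUPLICATE KEY UPDATE request_count = request_count + 1;

Reading the statistics then becomes a lookup on a tiny table instead of a scan over the raw log.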

An important factor is the dateColumn < ? condition. I am guessing it filters out records that are out of date. It doesn't really matter how many records there are in the table; what matters is how many records this condition cuts the result down to.
Aggressive filtering by date, combined with partitioning the table by date, can give you amazing performance on ridiculously large tables.
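As a rough illustration (the partition boundaries are made up, and keep in mind that MySQL requires the partitioning column to be part of every unique key on the table), range-partitioning the question's table by date could look like:

ALTER TABLE myTable
PARTITION BY RANGE (YEAR(dateColumn)) (
  PARTITION p2010 VALUES LESS THAN (2011),
  PARTITION p2011 VALUES LESS THAN (2012),
  PARTITION p2012 VALUES LESS THAN (2013),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

With that in place, a dateColumn < ? filter only has to touch the partitions that can contain matching rows.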
As a side note, if you are not expecting to hit this much data for many years to come, don't bother solving it now. Your business requirements may change a dozen times by then, together with the architecture, DB layout, design, and implementation details. Planning ahead is great, but sometimes you want to deliver a good-enough solution as soon as possible and handle the painful issues in the next release.

Related

Time Complexity of Sorting a database

I'm currently developing a mobile app using CodeIgniter and MySQL. I'm now faced with a situation where I have a table of books (this table will have 100k+ records). Within this table I have a column called NotSelling. Example of the db:
Book A 45
Book B 0
Book C 159
Book D 78
.
.
.
Book Z 450
The numbers above are the values in the NotSelling column in the db. I need to extract the top 20 books from this large table. My solution is to sort the table and then just use TOP to extract the top 20 records.
What I would like to know about is the performance of sorting the table, as I'm sure constantly sorting the table simply to get the top 20 results would take a hideously long time. I have been given these solutions to the problem:
Index the NotSelling column.
Cache the query (but I've read about coarse invalidation, which may cause problems since in my case the invalidation frequency would be high).
Sort the table, take the top 20 records, place them in another table, and then periodically update that table, say every hour or so.
All that said, does anyone know of a better solution to this problem, or have a way of optimizing the performance of what I'm trying to do? Note that I am a newbie, so if anyone can point me in the right direction for reading up on database performance, I would really appreciate it.
I think you are overthinking this; it's definitely a case of premature optimization. While all the above-mentioned solutions are perfectly valid, you should know that 100K+ records is trivial for MySQL. We used to routinely sort on tables with 30 million+ rows with excellent performance.
But you MUST have an index on the column being sorted on, so double-check your table schema. Regarding caching, don't worry either: MySQL does that for you for repeated queries as long as the table hasn't changed. But an index on the sorted column is a must; it is the primary and most important requirement.
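A minimal sketch, assuming a books table with the NotSelling column from the question (the book_name column and index name are hypothetical), and noting that MySQL uses LIMIT rather than TOP:

ALTER TABLE books ADD INDEX idx_not_selling (NotSelling);

-- MySQL can walk the index in descending order and stop after 20 rows,
-- so no full sort of the table is needed:
SELECT book_name, NotSelling
FROM books
ORDER BY NotSelling DESC
LIMIT 20;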
Don't worry about the performance of sorting. That can always be fixed in the database at a later time by adding an index if it actually proves to be a problem.
In the design phase, optimization is a distraction. Instead, focus on the functionality and on how directly the implementation represents the problem. As long as those are on target, everything else can be fixed comparatively easily.
Depending on what kind of meta-data is kept inside of the data structure of the index backing the column, a traversal can likely be done in O(n) time with n being the number of items returned.
This means that in theory, whether you have 1 million or 200 trillion records, pulling the first 20 will be just as fast as long as you have an index. In practice, there is going to be a performance difference since a small index will fit in memory whereas a large one will have to use the disk.
So in short, you're worrying too much. As Srikar Appal said, a properly indexed 100k-record table is nothing to MySQL.

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has each time he made a post and used a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will get decent query speeds too. But it will start to hurt your database when the posts table gets big enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common, simple, and very effective kind of optimization.
So just add a column with the number of posts the user has made and maintain it with triggers (see the sketch below) or in your application.
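One possible sketch of the trigger approach; the users/posts table names and their columns are assumptions:

ALTER TABLE users ADD COLUMN post_count INT UNSIGNED NOT NULL DEFAULT 0;

CREATE TRIGGER posts_after_insert AFTER INSERT ON posts
FOR EACH ROW
  UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;

CREATE TRIGGER posts_after_delete AFTER DELETE ON posts
FOR EACH ROW
  UPDATE users SET post_count = post_count - 1 WHERE id = OLD.user_id;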
The performance difference is:
With COUNT(*) you always pay for an index lookup plus counting the matching rows.
With the additional field you only pay for an index lookup plus returning a number that is already computed.
And there will be no significant difference between MyISAM and InnoDB in this case.
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
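A minimal sketch of that approach, using the posts/user_id names from the question (the index name and example id are made up):

ALTER TABLE posts ADD INDEX idx_user_id (user_id);

SELECT COUNT(*) FROM posts WHERE user_id = 123;  -- can typically be satisfied from the index alone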
This is really one of those 'trade-off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, 'UserPosts' for example, then yes, it would be quicker to query that table, but the real question is: is your 'Posts' table really so large that you can't just query it for this count? The trade-off between the two approaches is obviously this:
1) With a separate audit table, there is overhead when adding or updating a post.
2) Without a separate audit table, there is overhead in querying the table directly.
My gut instinct is always to design a system to record the data in a sensibly normalised fashion. I NEVER create tables just because it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential to incorporate them at that point.
At the end of the day, I think that unless your 'Posts' table is ridiculously large (i.e. more than a few million records), there should be no problem querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out of the DB once, cache it, and then update the cache when a change is detected.
Performance on posting or performance on counting? From a data-purist perspective, a recorded count is not the same as an actual count. You can watch the front door of an auditorium and add the people that come in and subtract those that leave, but what if some sneak in the back door? What if you bulk-delete a problem topic? If you record the count, then every post is slowed down to calculate and record it. For me data integrity is everything, and I will COUNT(*) every time. I just did a test on a table with 31 million rows: a COUNT(*) on an indexed column where the value matched 424,887 rows took 1.4 seconds (on my P4 2 GB development machine; I intentionally under-power my development server so I get punished for slow queries. On the production 8-core 16 GB server that count takes less than 0.1 second). You can never guard your data from unexpected changes or errors in your program logic. COUNT(*) is the count, and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries too.
there are a whole pile of trade-offs, so no-one can give you the right answer. but here's an approach no-one else has mentioned:
you could use the "select where" query, but cache the result in a higher layer (memcache for example). so your code would look like:
# assuming a memcache client object `memcache` and a DB handle `database`
count = memcache.get('article-count-' + str(user_id))
if count is None:
    # parameterised query instead of string concatenation (avoids SQL injection);
    # the posts table/column names here are assumptions
    count = database.execute('SELECT COUNT(*) FROM posts WHERE user_id = %s', (user_id,))
    memcache.put('article-count-' + str(user_id), count)
and you would also need, when a user makes a new post:
memcache.delete('article-count-' + str(user_id))
this will work best when the article count is used often, but updated rarely. it combines the advantage of efficient caching with the advantage of a normalized database. but it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). another unsuitable case is when someone's article count is needed often, but it is almost always a different person.
a further advantage of an approach like this is that you don't need to add the caching now. you can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
more generally: you don't need to cache in your database. you could also put a cache "around" your database. something i have done with java is to use caching at the ibatis level, for example.

Right design for MySQL database

I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, any advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all the values for each of the top 1500 players every hour with a PHP script and insert them into MySQL, one row per player. So the table will grow by 36,000 rows every day. I will have a second script that deletes every row older than 28 days; otherwise the database would get insanely huge. Both scripts will run as cron jobs.
The following queries will be performed on this data:
The most important one is simply a query for a certain player name. It should return all stats for that player, for every hour, as an array.
The second is a query that returns all players who didn't gain any points1 during a certain time period counted back from the latest entry. This should return a list of players that didn't gain points (over the last 24 hours, for example).
The third is a query that lists all players who lost a certain amount of points2, or more, in a certain time period counted back from the latest entry.
The queries shouldn't take forever, so I thought I should probably index playername, points1, and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes, especially since your DB will have over a million rows in that one table at any given time. Since your writes happen via cron, you should be okay as long as you insert your 1500 rows in batches rather than making one round trip to the DB for every row. I'd also look into query compilation (prepared statements) so that you save that overhead as well.
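A batched insert is just a single multi-row INSERT statement; for example (the table and column names below are purely illustrative):

INSERT INTO rankings (player_id, points1, points2, captured_at)
VALUES
  (17, 15320, 980, '2012-05-01 13:00:00'),
  (42, 15011, 955, '2012-05-01 13:00:00'),
  ( 8, 14990, 940, '2012-05-01 13:00:00');
-- ...one VALUES list can carry all 1500 rows, or a few chunks of them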
Ranhiru Cooray is correct: you should only store data like the player name once in the DB. Create a players table and use its primary key to reference the player in your ranking table. The same goes for location, alliance, and race. I'm guessing those are more or less enumerated values that you can store in other tables to normalize your design, and return in your results with the appropriate JOINs. Normalizing your data reduces the amount of redundant information in your database, which decreases its size and increases its performance.
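A rough sketch of that normalization, with hypothetical names (location, alliance, and race would get the same treatment as players):

CREATE TABLE players (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(64) NOT NULL,
  UNIQUE KEY uq_name (name)
) ENGINE=InnoDB;

CREATE TABLE rankings (
  player_id INT UNSIGNED NOT NULL,
  captured_at DATETIME NOT NULL,
  points1 INT NOT NULL,
  points2 INT NOT NULL,
  PRIMARY KEY (player_id, captured_at),
  FOREIGN KEY (player_id) REFERENCES players (id)
) ENGINE=InnoDB;

-- The "all stats for one player, per hour" query becomes a simple JOIN:
SELECT r.captured_at, r.points1, r.points2
FROM rankings r
JOIN players p ON p.id = r.player_id
WHERE p.name = 'SomePlayer'
ORDER BY r.captured_at;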
Your design may also be flawed regarding the ranking position. Can't that be calculated by the DB when you select your rows? If not, can it be done in PHP? It's the same as with invoice tables: you never store the invoice total because it is redundant; the items/prices/etc. can be used to calculate the order totals.
With all the adding and deleting, I'd be sure to run OPTIMIZE TABLE frequently and keep good backups. MySQL tables, if using MyISAM, can become corrupted easily in heavy write/delete scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.

MySQL speed optimization on a table with many rows : what is the best way to handle it?

I'm developing a chat application, and I want to keep everything logged in a table (i.e. "who said what and when").
I hope that in a near future I'll have thousands of rows.
I was wondering: what is the best way to optimize the table, knowing that I'll insert rows often and sometimes do grouped reads (i.e. showing an entire conversation for a user: look at when he/she logged in/started to chat, look at when he/she quit, then show the whole conversation in between)?
This table should be able to handle (I hope!) many, many rows: 15,000 per day, which is about 4.5 million per month and 54 million rows by the end of the year.
Conversations older than 15 days could be archived (but I don't know how to do that properly).
Any idea ?
I have two pieces of advice for you:
If you are expecting lots of writes with few, low-priority reads, then you are better off with as few indexes as possible. Indexes will make inserts slower, so only add what you really need.
If the log table is going to get bigger and bigger over time, you should consider log rotation. Otherwise you might end up with one gigantic, corrupted table.
54 million rows is not that many, especially over a year.
If you are going to be rotating out lots of data periodically, I would recommend using MyISAM and MERGE tables. Since you won't be deleting or editing records, you won't have any locking issues as long as concurrent_insert is set to 1: inserts will then always be appended to the end of the table, so SELECTs and INSERTs can happen simultaneously. And you don't have to use InnoDB-based tables (which cannot be used in MERGE tables).
You could have one table per month, named something like data200905, data200904, etc. Your MERGE table would then include all the underlying tables you need to search. Inserts go through the MERGE table, so you don't have to worry about changing names. When it's time to rotate out data and create a new table, just redeclare the MERGE table.
You could even create multiple MERGE tables, based on quarter, years, etc. One table can be used in multiple MERGE tables.
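A sketch of such a setup, reusing the data200904/data200905 naming above (the column layout is an assumption):

CREATE TABLE data200904 (
  user_id INT UNSIGNED NOT NULL,
  message TEXT NOT NULL,
  created_at DATETIME NOT NULL,
  KEY idx_user (user_id)
) ENGINE=MyISAM;

CREATE TABLE data200905 LIKE data200904;  -- one identical table per month

CREATE TABLE data_log (
  user_id INT UNSIGNED NOT NULL,
  message TEXT NOT NULL,
  created_at DATETIME NOT NULL,
  KEY idx_user (user_id)
) ENGINE=MERGE UNION=(data200904, data200905) INSERT_METHOD=LAST;

-- When rotating (after creating data200906), just redeclare which tables the MERGE covers:
ALTER TABLE data_log UNION=(data200905, data200906);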
I've done this setup on databases that added 30 million records per month.
MySQL does surprisingly well handling very large data sets with little more than standard database tuning and indexes. I ran a site that had millions of rows in the database and was able to run it just fine on MySQL.
MySQL does have an ARCHIVE storage engine option for handling many rows, but its lack of index support makes it a poor option for you, except perhaps for historical data.
Index creation will be required, but you have to balance the indexes and not just create them because you can. They allow for faster queries (and will be required for usable queries on a table that large), but the more indexes you have, the higher the cost of inserting.
If you are only querying on your user id column, an index there will not be a problem, but if you want to do full-text queries on the messages, you may want to index only the user column in MySQL and use something like Sphinx or Lucene for the full-text searches, as full-text searches in MySQL are not the fastest and significantly slow down insert time.
You could handle this with two tables: one for the current chat history and one archive table. At the end of a period (a week, month, or day, depending on your traffic) you can archive current chat messages, remove them from the small table, and add them to the archive, as sketched below.
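A sketch of the periodic archiving step, assuming hypothetical chat_messages and chat_messages_archive tables with identical structure and a created_at column, and using the 15-day window mentioned in the question (run it from a cron job):

SET @cutoff = NOW() - INTERVAL 15 DAY;

INSERT INTO chat_messages_archive
SELECT * FROM chat_messages WHERE created_at < @cutoff;

DELETE FROM chat_messages WHERE created_at < @cutoff;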
This way your application handles the most common case well: querying the current chat status, which is going to be really fast.
For queries like "what did X say last month" you will hit the archive table, and it will take a little longer, but that is OK since there won't be that many such queries, and someone doing such a search will be willing to wait a couple of seconds more.
Depending on your use cases you could extend this principle: if there will be a lot of queries for chat messages from the last 6 months, store them in a separate table too.
A similar principle (in a completely different area) is used by the .NET garbage collector, which has different storage for short-lived objects, long-lived objects, large objects, etc.

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant amount of records, the time to save a new record will be slow. Please inform how best to approach the indexing of columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id: what would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example, I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
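In MySQL, EXPLAIN and ANALYZE TABLE cover roughly that ground; for example (the query and the runs table name are illustrative):

EXPLAIN SELECT * FROM runs WHERE user_id = 42 AND event_id = 7;
-- "type: ALL" in the output means a full table scan; "ref"/"range" means an index is being used

ANALYZE TABLE runs;  -- refreshes the index statistics the optimizer relies on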
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small volume for a database; you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work, and it contained 2.1 million records (we get at least one transaction per second, across the entire day, from 25 separate machines). And that table has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and a real MySQL database.
My results: adding an average index (1-3 columns) to a table makes inserts slower by about 2.1%. So if you add 20 indexes, your inserts will be slower by 40-50%, but your selects will be 10-100 times faster.
So is it OK to add many indexes? It depends :) I gave you my results; you decide!
Nothing for select queries, though updates and especially inserts will be orders of magnitude slower, which you won't really notice until you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without more details about the expected usage of the data in your table, worrying about indexes slowing you down smells a lot like premature optimization, which should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving whether or not it is a problem will be much more useful than trying to guess and worry about what may happen. If there is a problem, you will be able to use your test setup to try different ways of fixing it.
The index is there to speed up retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) to find the data you want. With a significant number of records this is a slow and expensive operation. If it is for a report you run once a month, maybe that's okay; if it is for frequently accessed data, then you need the index to give your users a better experience.
If you find that insert operations are slow because of the index, this is a problem you can solve at the hardware level by throwing more CPUs, RAM, and better hard drive technology at it.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"