Slow MySQL query with huge IN clause on primary key

I've got a simple query with a big IN clause:
SELECT test_id FROM sample WHERE id IN (99 000 of ids);
The explain gives me this result:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE sample range PRIMARY PRIMARY 4 NULL 40 Using where
The id field is the primary key of the sample table (~320,000 rows) and test_id is a foreign key to the test table; both are InnoDB tables. The query takes over 2,000 seconds! I tried joining the tables instead, but that took a similar time. After some research I found this topic, but the accepted answer only described what the problem might be (which, to be honest, I don't understand) and offered no solution beyond:
If these are in cache, the query should run fast
How can I speed up this query?
Please be as precise as possible, because as I've found out, I'm an optimization novice.
EDIT 1:
SHOW CREATE TABLE sample
CREATE TABLE `sample` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`test_id` int(11) NOT NULL,
...
PRIMARY KEY (`id`),
KEY `sample_FI_1` (`test_id`),
... other keys ...,
CONSTRAINT `sample_FK_1` FOREIGN KEY (`test_id`) REFERENCES `test` (`id`),
... other foreign keys ...
) ENGINE=InnoDB AUTO_INCREMENT=315607 DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci
The join I tried was something like this:
SELECT t.* FROM test t JOIN sample s ON t.id = s.test_id JOIN sample_x x ON s.id = x.sample_id WHERE x.field_id = '321' AND x.value LIKE '%smth%';
innodb_buffer_pool_size:
SELECT @@innodb_buffer_pool_size /1024 /1024 /1024
@@innodb_buffer_pool_size/1024/1024/1024
24.000000000000
Statuses:
SHOW TABLE STATUS FROM zert1442
Name Engine Version Row_format Rows Avg_row_length Data_length Max_data_length Index_length Data_free Auto_increment Create_time Update_time Check_time Collation Checksum Create_options Comment
...
sample InnoDB 10 Compact 357323 592 211632128 0 54837248 7340032 315647 2017-02-15 10:22:03 NULL NULL utf8_general_ci NULL
test InnoDB 10 Compact 174915 519 90865664 0 33947648 4194304 147167 2017-02-15 10:22:03 NULL NULL utf8_general_ci NULL
...

Here is your query:
SELECT t.*
FROM test t
JOIN sample s ON t.id = s.test_id
JOIN sample_x x ON s.id = x.sample_id
WHERE x.field_id = '321'
AND x.value LIKE '%smth%'
Unfortunately you didn't provide the SHOW CREATE TABLE output for the test or sample_x tables.
Regardless, add this index if it doesn't already exist:
ALTER TABLE sample_x
ADD INDEX `wr1` (`sample_id`,`field_id`,`value`)
This should improve things a bit. However, a LIKE with a wildcard at the start of the string cannot use the index for that part of the filter (at least not without a fulltext index).
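If substring search on value is essential and your MySQL version supports InnoDB FULLTEXT indexes (5.6+), a hedged sketch of that alternative could look like this. Note that MATCH ... AGAINST does word-based matching, so it is not a drop-in replacement for LIKE '%smth%':
-- assumes MySQL 5.6+ (InnoDB FULLTEXT support); the index name ft_value is arbitrary
ALTER TABLE sample_x ADD FULLTEXT INDEX `ft_value` (`value`);

SELECT t.*
FROM test t
JOIN sample s ON t.id = s.test_id
JOIN sample_x x ON s.id = x.sample_id
WHERE x.field_id = '321'
AND MATCH(x.value) AGAINST('smth' IN BOOLEAN MODE);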
You can also try this index:
ALTER TABLE sample_x
ADD INDEX `wr2` (`field_id`,`value`,`sample_id`)
This will allow the optimizer to start at the sample_x table and then work backwards towards the test table. Which approach the optimizer prefers will depend on a lot of factors.
You can remove either of these indexes with the following:
ALTER TABLE sample_x
DROP INDEX `wr1`
Or
ALTER TABLE sample_x
DROP INDEX `wr2`
Experiment to see which helps your query the most, if either of them do. When measuring performance always run the query twice, and throw out the first result. That is because the buffer cache needs to be populated the first time, and so it can take longer and won't be an accurate measure of the real improvement.

First things first: what do you consider 'slow'? You're getting 99k records out of a table; that is bound to take some time!
In my experience, if your IN() list contains 99,000 values I'd REALLY consider putting these into a (temporary) table first, adding a unique index (or PK if you prefer) on that table, and then JOINing it against your sample table. That said, I'm not sure what the fastest way would be to get 99k ids into that (temporary) table; in .Net/MSSQL I'd use the SqlBulkCopy object, but I'm not sure what you are using there.
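A rough sketch of that approach, assuming the table and column names from the question (the loading step is the part that will vary most by client):
-- temporary table to hold the ids; visible only to this connection
CREATE TEMPORARY TABLE tmp_ids (
  id INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- load the ~99k ids, e.g. with batched multi-row INSERTs or LOAD DATA INFILE
INSERT INTO tmp_ids (id) VALUES (1), (2), (3) /* , ... */;

-- join against the id list instead of using the huge IN clause
SELECT s.test_id
FROM sample s
JOIN tmp_ids t ON t.id = s.id;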
PS: You can of course simply stick to rather verbose INSERT statements, but I fear that inserting 99k values like that will take even more time than what you're seeing now. Where are these values coming from in the first place? The more I think about it, the more I'm guessing your underlying approach might be 'off'.

Related

Order by fields from different tables create full table scan. Should I combine data to one table?

This is a theoretical question. Sorry, but I don't have a working tables data to show, I'll try to improvise with a theoretical example.
Using MySql/MariaDB. Have indexes for all relevant fields.
I have a system, which historical design had a ProductType table, something like:
ID=1, Description="Milk"
ID=2, Description="Bread"
ID=3, Description="Salt"
ID=4, Description="Sugar"
and so on.
There are some features in the system that rely on the ProductType ID and the Description is also used in different places, such as for defining different properties of the product type.
There is also a Product table, with fields such as:
ID, ProductTypeID, Name
The Product:Name don't have the product type description in it, so a "Milk bottle 1l" will have an entry such as:
ID=101, ProductTypeID=1, Name="bottle 1l"
and "Sugar pack 1kg" will be:
ID=102, ProductTypeID=4, Name="pack 1kg"
You get the idea...
The system combines the ProductType:Description and Product:Name to show full product names to the users. This creates a systematic naming for all the products, so there is no way to define a product with a name such as "1l bottle of milk". I know that in English that might be hard to swallow, but that way works great with my local language.
Years passed and the database grew to millions of products.
Since a full-text index needs all of the searched data in one table, I had to store the ProductType:Description inside the Product table, in a string field I added that holds different keywords related to the product, so the full-text search can find anything related to the product (type, name, barcode, SKU, etc.).
Now I'm trying to eliminate the full table scans, and it makes me think the current design might not be optimal and I'll have to redesign and store the full product name (type + name) in the same table...
In order to show the products in the proper order, there's an ORDER BY TypeDescription ASC, ProductName ASC after the ProductType table is joined to the Product SELECT queries.
From my research I see that the database can't use indexes when the ordering is done on fields from different tables, so it's doing a full table scan to get to the right entries.
During pagination there's an ORDER BY and LIMIT 50000,100 in the query, which takes lots of time.
There are sections with lots of products, so that ordering and limiting cause very long full table scans.
How would you handle that situation?
Change the design and store all query-related data in the Product table? That feels like duplication and not a natural solution.
Or maybe there's another way to solve it?
Will an index on a VARCHAR column (the product name) speed up the ORDER BY, or will the database still do a full table scan?
My first question here. Couldn't find answers on similar cases.
Thanks!
I've tried to play with the queries to see if ordering by a VARCHAR field that has an index will work, but EXPLAIN SELECT still shows that the query didn't use the index and fell back to a scan with Using where :(
UPDATE
Trying to add some more data...
The situation is a bit more complicated and after digging a bit more it looks like the initial question was not in the right direction.
I removed the product type from the queries and still have the slow query.
I feel like it's a chicken and egg situation...
I have a table that maps product IDs to section IDs:
CREATE TABLE `Product2Section` (
`SectionId` int(10) unsigned NOT NULL,
`ProductId` int(10) unsigned NOT NULL,
KEY `idx_ProductId` (`ProductId`),
KEY `idx_SectionId` (`SectionId`),
KEY `idx_ProductId_SectionId` (`ProductId`,`SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC
The query (after stripping all fields not relevant to the question):
SELECT DISTINCT
DRIVER.ProductId AS ID,
p.*
FROM
Product2Section AS DRIVER
LEFT JOIN Product p ON
(p.ID = DRIVER.ProductId)
WHERE
DRIVER.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900,100;
explain shows:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE DRIVER index idx_SectionId idx_ProductId_SectionId 8 NULL 589966 Using where; Using index; Using temporary; Using filesort
1 SIMPLE p eq_ref PRIMARY,idx_ID PRIMARY 4 4project.DRIVER.ProductId 1 Using where
I've tried to select from the Product table and join Product2Section to filter the results, but I get the same plan:
SELECT DISTINCT
p.ID,
p.ProductName
FROM
Product p
LEFT JOIN
Product2Section p2s ON (p.ID=p2s.ProductId)
WHERE
p2s.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900, 100;
explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE p2s index idx_ProductId,idx_SectionId,idx_ProductId_SectionId idx_ProductId_SectionId 8 NULL 589966 Using where; Using index; Using temporary; Using filesort
1 SIMPLE p eq_ref PRIMARY,idx_ID PRIMARY 4 4project.p2s.ProductId 1 Using where
I don't see a way out of this situation.
The two single column indices on Product2Section serve no purpose. You should change your junction table to:
CREATE TABLE `Product2Section` (
`SectionId` int unsigned NOT NULL,
`ProductId` int unsigned NOT NULL,
PRIMARY KEY (`SectionId`, `ProductId`),
KEY `idx_ProductId_SectionId` (`ProductId`, `SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
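A hedged migration sketch for getting from the current table to that definition (this rebuilds the table, and adding the primary key will fail if duplicate (SectionId, ProductId) pairs already exist, so check for those first):
-- assumes no duplicate (SectionId, ProductId) pairs in the existing data
ALTER TABLE Product2Section
  ADD PRIMARY KEY (SectionId, ProductId),
  DROP INDEX idx_SectionId,
  DROP INDEX idx_ProductId;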
There are other queries in the system that probably use the single field indexes
The single column indices cannot be used for anything that the two composite indices cannot be used for. They just waste space and cause unnecessary overhead on inserts and for the optimizer. Setting one of the composite indices as PRIMARY stops InnoDB from having to create its own internal rowid, which just wastes space. It also adds the uniqueness constraint which is currently missing from your table.
From the docs:
Accessing a row through the clustered index is fast because the index search leads directly to the page that contains the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.
This is not significant for a "simple" junction table as both columns should be stored in both indices, therefore no further read is required.
You said:
that didn't really bother me since there was no real performance hit
You may not see the difference when running an individual query with no contention but the difference in a highly contended production environment can be huge, due to the amount of effort required.
Do you really need to accommodate 4,294,967,295 (int unsigned) sections? Perhaps the 65,535 provided by smallint unsigned would be enough?
You said:
Might change it in the future. Don't think it will change the performance somehow
Changing SectionId to smallint will reduce each index entry from 8 to 6 bytes. That's a 25% reduction in size. Smaller is faster.
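A sketch of that change, assuming nothing else in the schema (foreign keys, application code) depends on the column being a full INT:
-- rebuilds the table; verify no SectionId exceeds 65,535 first
ALTER TABLE Product2Section
  MODIFY SectionId SMALLINT UNSIGNED NOT NULL;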
Why are you using LEFT JOIN? The fact that you are happy to reverse the order of the tables in the query suggests it should be an INNER JOIN.
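A sketch of the query rewritten as an INNER JOIN (the section id list from the question is elided here):
SELECT DISTINCT p.ID, p.ProductName
FROM Product2Section p2s
INNER JOIN Product p ON p.ID = p2s.ProductId
WHERE p2s.SectionId IN ( /* section ids */ )
ORDER BY p.ProductName ASC
LIMIT 500900, 100;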
Do you have your buffer pool configured appropriately, or is it set to defaults? Please run ANALYZE TABLE Product2Section; and then provide the output from:
SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH + INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_NAME = 'Product2Section';
And:
SELECT ROUND(SUM(DATA_LENGTH + INDEX_LENGTH)/POW(1024, 3), 2)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database_name';
And:
SHOW VARIABLES LIKE 'innodb_buffer%';

MySQL how to create a correct index

MySQL 5.5
I am trying to find the correct index for a query.
Table:
create table trans (
trans_id int(11) not null auto_increment,
acct_id int(11) not null,
status_id int(11) not null,
trans_transaction_type varchar(5) not null,
trans_amount float(9,3) default null,
trans_datetime datetime not null default '0000-00-00 00:00:00',
primary key (trans_id)
)
Query:
select trans_id
from trans
where acct_id = _acctid
and transaction_type in ('_1','_2','_3','_4')
and trans_datetime between _start and _end
and status_id = 6
Cardinality:
select *
from information_schema.statistics
where table_name='trans'
Result:
trans_id 424339375
acct_id 12123818
trans_transaction_type 70722272
trans_datetime 84866726
status_id 22
I am trying to find what is the correct index for the query?
alter table trans add index (acct_id, trans_transaction_type, trans_datetime, status_id);
alter table trans add index (acct_id, trans_datetime, trans_transaction_type, status_id);
etc...
Which columns go first in the index?
The goal is query speed/performance. Disk space usage is of no concern.
The point of indexing a table is to make queries lighter and improve performance. The first index to add should always be the primary key of the table (trans_id in this case), and after that the other id columns should be indexed too.
alter table trans add index (trans_id, acct_id, status_id);
The other fields don't need indexes unless you query on them often.
Plan A
Start with any WHERE clause that is col = constant. Then move on to one more thing.
Suggest you add both of the following, because it is not easy to predict which will be better:
INDEX(acct_id, status_id, transaction_type)
INDEX(acct_id, status_id, trans_datetime)
Plan B
Do you really have only trans_id in the SELECT list? If so, then it should not be bad to turn this into a "covering" index. That's an index where the entire operation can be performed in the BTree where the index lives, thereby avoiding having to reach over into the data.
To build such, first build the optimal non-covering index, then add the rest of the fields mentioned anywhere in the query. Either of these should work:
INDEX(acct_id, status_id, trans_datetime, transaction_type, trans_id)
INDEX(acct_id, status_id, transaction_type, trans_datetime, trans_id)
The first two fields can be in either order (both are '='). The last two fields can be in either order (both are useless for finding the rows; they exist only for 'covering').
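Spelled out as DDL, the first option might look like this (using the column names from the CREATE TABLE above, where the type column is trans_transaction_type; the index name is arbitrary):
ALTER TABLE trans
  ADD INDEX idx_covering (acct_id, status_id, trans_datetime, trans_transaction_type, trans_id);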
I recommend against having more than, say, 5 columns in an index.
More info in my Index Cookbook.
Notes
Perform EXPLAIN SELECT. You should see 'Using index' when it is a 'covering' index.
I think the EXPLAIN's Key_len will (in all cases here) show the combined lengths of only acct_id and status_id.
Are you in a stored procedure? If the version in the SP runs significantly slower than when you experiment by hand, you may need to re-code it to CONCAT, PREPARE, and EXECUTE the query.
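A minimal sketch of that CONCAT / PREPARE / EXECUTE pattern inside a procedure, using the placeholder parameter names from the question (_acctid, _start, _end):
-- build the statement text; a user variable is used because PREPARE
-- cannot read a procedure-local variable
SET @sql = CONCAT(
  'SELECT trans_id FROM trans',
  ' WHERE acct_id = ', _acctid,
  ' AND status_id = 6',
  ' AND trans_transaction_type IN (''_1'',''_2'',''_3'',''_4'')',
  ' AND trans_datetime BETWEEN ', QUOTE(_start), ' AND ', QUOTE(_end));
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;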

Why does MySQL decide on wrong index?

I have a partitioned table in MySQL that looks like this:
CREATE TABLE `table1` (
`id` bigint(19) NOT NULL AUTO_INCREMENT,
`field1` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`field2_id` int(11) NOT NULL,
`created_at` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`,`created_at`),
KEY `index1` (`field2_id`,`id`)
) ENGINE=InnoDB AUTO_INCREMENT=603221206 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(created_at))
(PARTITION p_0 VALUES LESS THAN (730485) ENGINE = InnoDB,
..... lots more partitions .....
PARTITION p_20130117 VALUES LESS THAN (735250) ENGINE = InnoDB) */;
And this is a typical SELECT query on the table:
SELECT field1 from TABLE1 where field2_id = 12345 and id > 13314313;
Doing an explain on it, MySQL sometimes decides to use PRIMARY instead of index1. This seems to be pretty consistent when you do a first explain. However, after a few repeated explains, MySQL finally decides to use the index. The problem is, this table has millions of rows, and inserts and selects are hitting it on the order of several times per second. Choosing the wrong index was causing these SELECT queries to take up to ~40 seconds, instead of sub second times. Can't really schedule downtime, so I can't run an optimize on the table (because of the size, it would probably take a long time), and not sure it would help in this case anyway.
I fixed this by forcing the index, so it looks like this:
SELECT field1 from TABLE1 FORCE INDEX (index1) WHERE field2_id = 12345 and id > 13314313;
We're running this on MySQL 5.1.63, which we can't move away from at the moment.
My question is, why is MySQL choosing the wrong index? And is there something that can be done to fix it, besides forcing the index on all queries? Is partitioning confusing the InnoDB engine? I've worked a lot with MySQL and have never seen this behavior before. The query is as simple as can be, and the index is also a perfect match. We have a lot of queries that assume the DB layer will do the right thing, and I don't want to go through all of them forcing the correct index.
Update 1:
This is the typical explain, without the FORCE INDEX clause. Once that's put in, the possible keys column only show the forced index.
id select_type table type possible_keys key key_len ref rows
1 SIMPLE table1 range PRIMARY,index1 index1 12 NULL 207
I'm not 100% sure, but I think this sounds logical:
You partition your table BY RANGE (to_days(created_at)). The created_at field is part of the primary key, and your SELECT queries use the other part of the primary key. This way the optimizer thinks the primary key would be the speediest index, using the partition plus the id part of the primary key.
I suggest (without knowing the real reasons that led to your design) changing your partition range to the id and changing the order of your index1 key.
For more information, have a look at the partitioning documentation.
I'm not sure why the engine would pick the incorrect index. I would think that an index with an EQUALITY test would supersede one with a >, < or range test. However, another option that might help force the correct index would be to use a "computed" value on the id column so the engine cannot correlate it directly to the index... Something like
WHERE field2_id = 12345 and id > 13314313
changed to
WHERE field2_id = 12345 and id + 0 > 13314313
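Applied to the query from the question, that would be:
SELECT field1
FROM table1
WHERE field2_id = 12345
  AND id + 0 > 13314313;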

Proper Indexing/Optimization of a MySQL GROUP BY and JOIN Query

I've done a lot of reading and Googling on this and I cannot find any satisfactory answer so I'd appreciate any help. Most answers I find come close to my situation but do not address it (and attempting to follow the solutions has not done me any good).
See Edit #2 below for the best example
[This was the original question but is not a great representation of what I'm asking.]
Say I have 2 tables, each with 4 columns:
key (int, auto increment)
c1 (a date)
c2 (a varchar of length 3)
c3 (also a varchar of length 3)
And I want to perform the following query:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.c1, t.c2
Both key fields are indexed as primary keys. I want to get the number of rows returned in each grouping of c1, c2.
When I explain this query I get "using temporary; using filesort". The actual table I'm performing this query on is over 500,000 rows, so that means it's a time consuming query.
So my question is (assuming I'm not doing anything wrong in the query): is there a way to index this table to eliminate the temporary/filesort usage?
Thanks in advance for any help.
Edit
Here is the table definition (in this example both tables are identical - in reality they're not but I'm not sure it makes a difference at this point):
CREATE TABLE `test1` (
`key` int(11) NOT NULL auto_increment,
`c1` date NOT NULL,
`c2` varchar(3) NOT NULL,
`c3` varchar(3) NOT NULL,
PRIMARY KEY (`key`),
UNIQUE KEY `c1` (`c1`,`c2`),
UNIQUE KEY `c2_2` (`c2`,`c1`),
KEY `c2` (`c2`,`c3`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8
Full EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t ALL NULL NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 tracking.t.key 1 Using index
This is just for my sample tables. In my real tables the rows for t says 500,000+ (every row in the table, though that could be related to something else).
Edit #2
Here is a more concrete example to better explain my situation.
Let's say I have data on Little League baseball games. I have two tables. One holds data on the games:
CREATE TABLE `ex_games` (
`game_id` int(11) NOT NULL auto_increment,
`home_team` int(11) NOT NULL,
`date` date NOT NULL,
PRIMARY KEY (`game_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The other holds data on the at bats in each game:
CREATE TABLE `ex_atbats` (
`ab_id` int(11) NOT NULL auto_increment,
`game` int(11) NOT NULL,
`team` int(11) NOT NULL,
`player` int(11) NOT NULL,
`result` tinyint(1) NOT NULL,
PRIMARY KEY (`ab_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
So I have two questions. Let's start with the simple version: I want to return a list of games with a count of how many at bats are in each game. So I think I would do something like this:
SELECT date, home_team, COUNT(h.ab_id) FROM `ex_atbats` h
LEFT JOIN ex_games g ON g.game_id = h.game
GROUP BY g.game_id
This query uses filesort/temporary. Is there a better way to structure this or to index the tables to get rid of that?
Then, the trickier part: say I now want to not only include a count of the number of at bats, but also include a count of the number of at bats that were preceded by an at bat with the same result by the same team. I assume that would be something like:
SELECT g.date, g.home_team, COUNT(ab.ab_id), COUNT(ab2.ab_id) FROM `ex_atbats` ab
LEFT JOIN ex_games g ON g.game_id = ab.game
LEFT JOIN ex_atbats ab2 ON ab2.ab_id = ab.ab_id - 1 AND ab2.result = ab.result
GROUP BY g.game_id
Is that the correct way to structure that query? This also uses filesort/temporary.
So what is the optimal way to go about accomplishing these tasks?
Thanks again.
The phrases Using temporary/Using filesort are usually not related to the indexes used in the JOIN operation. There are numerous examples where you have all the indexes in place (they show up in the key and key_len columns in EXPLAIN) but you still get Using temporary and Using filesort.
Check out what the manual says about Using temporary and Using filesort:
How MySQL Uses Internal Temporary Tables
ORDER BY Optimization
Having a combined index for all columns used in GROUP BY clause may help to get rid of Using filesort in certain circumstances. If you also issue ORDER BY you may need to add more complex indexes.
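For the baseball example above, one hedged way to apply that is to add an index on the driving table's grouping column and group on that column instead of the joined one; whether the filesort actually disappears depends on the optimizer and the data:
-- assumed index name; ex_atbats.game is the column the rows are grouped on
ALTER TABLE ex_atbats ADD INDEX idx_game (game);

SELECT g.date, g.home_team, COUNT(h.ab_id)
FROM ex_atbats h
LEFT JOIN ex_games g ON g.game_id = h.game
GROUP BY h.game;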
If you have a huge dataset, consider partitioning it on some criterion like date or timestamp, either by means of actual partitioning or a simple WHERE clause.
First of all, the tables' definitions do matter. It's one thing to join using two primary keys, another to join using a primary key from one side and a non-unique key in the other, etc. It also matters what type of engine the tables use as InnoDB treats Primary Keys differently than MyISAM engine.
What I notice though is that on table test1, the (c1,c2) combination is Unique and the fields are not nullable. This allows your query to be rewritten as:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
It will give the same results while using the same field for the JOIN and the GROUP BY. Note that MySQL allows you to use in the SELECT list fields that are not in the GROUP BY list, without having aggregate functions on them. This is not allowed in most other systems and is seen as a bug by some. In this situation though it is a very nice feature. Every row can be either identified by (key) or (c1,c2), so it shouldn't matter which of the two is used for the grouping.
Another thing to note is that when you use LEFT JOIN, it's common to use the joining column from the right side for the counting: COUNT(t2.key) and not COUNT(*). Your original query will give 1 in that column for records in test1 that do not match any record in test2, because it counts rows, while you probably want to count the related records in test2 and show 0 in those cases.
So, try this query and post the EXPLAIN:
SELECT t.c1, t.c2, COUNT(t2.key)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
The indexes help with the join, but you still need to do a full sort in order to do the group by. Essentially, it still has to process every record in the set.
Adding a where clause and limiting the set would run faster, of course. It just won't get you the results you want.
There may be other options than doing a group by on the entire table. I notice you're doing a SELECT * - What are you trying to get out of the query?
SELECT DISTINCT c1, c2
FROM test t
LEFT JOIN test2 t2 ON t2.key = t.key
may run faster, for instance. (I realize this was just a sample query, but understand that it's hard to optimize when you don't know what the end goal is!)
EDIT - In doing some reading (http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html), I learned that, under the correct circumstances, indexes can help significantly with the group by.
What I'm seeing is that it needs to be a sorted index (like BTREE), not a HASH. Perhaps:
CREATE INDEX c1c2 ON test1 (c1, c2) USING BTREE;
might help.
For InnoDB it will work, as the index carries your primary key by default. For MyISAM you have to make the last column of your index be `key`. That will give the optimizer all keys in the same order and it can skip the sort. You cannot do any range queries on the index prefix then; that puts you right back into filesort. I'm currently struggling with a similar problem.

MySQL: Optimizing COUNT(*) and GROUP BY

I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but all my Googling boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (or any explanation of why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster, if the grouped column also has an index on it... Of course, this depends on the number of NULL values that are actually in that column, which are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time a RANGE SCAN will be applied, which is not so fast... With separate indexes, I'd expect better performance than with multi-column ones.
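For the schema above, the only column in the query that doesn't already have its own index is source, so the single-column-index suggestion amounts to something like this (the index name is arbitrary):
ALTER TABLE history ADD INDEX idx_source (source);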
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I suggest you try this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then, if it doesn't help, try forcing the index on this query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want the count for specific sources, then you can try something like this:
select count(source = 'A' or NULL) as A, count(source = 'B' or NULL) as B from history;
You can do the ordering in your application code. Also try indexing event and source together.
This should be faster than the original query.
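Combined with the filters from the question, that pattern might look like this (note that source is an INT in the DDL above, so the values 1 and 2 below are hypothetical source ids standing in for 'A' and 'B'):
-- COUNT(expr OR NULL) counts only the rows where expr is true,
-- because FALSE OR NULL is NULL and COUNT() skips NULLs
SELECT COUNT(source = 1 OR NULL) AS source_1,
       COUNT(source = 2 OR NULL) AS source_2
FROM history
WHERE event = 2000 AND time >= 0 AND time < 1310563644;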