MySQL 5.5
I am trying to find the correct index for a query.
Table:
create table trans (
trans_id int(11) not null auto_increment,
acct_id int(11) not null,
status_id int(11) not null,
trans_transaction_type varchar(5) not null,
trans_amount float(9,3) default null,
trans_datetime datetime not null default '0000-00-00 00:00:00',
primary key (trans_id)
)
Query:
select trans_id
from trans
where acct_id = _acctid
and transaction_type in ('_1','_2','_3','_4')
and trans_datetime between _start and _end
and status_id = 6
Cardinality:
select *
from information_schema.statistics
where table_name='trans'
Result:
trans_id 424339375
acct_id 12123818
trans_transaction_type 70722272
trans_datetime 84866726
status_id 22
I am trying to find what is the correct index for the query?
alter table trans add index (acct_id, trans_transaction_type, trans_datetime, status_id);
alter table trans add index (acct_id, trans_datetime, trans_transaction_type, status_id);
etc...
Which columns go first in the index?
The goal is query speed/performance. Disk space usage is of no concern.
The base of indexing a table is to make the queries light to improve performance, the first index to be added should always be the primary key of the table (trans_id in this case), and after that, the other id columns should be indexed too.
alter table trans add index (trans_id, acct_id, status_id);
The other fields are not needed as indexes, unless you query too often based on them.
Plan A
Start with any WHERE clause that is col = constant. Then move on to one more thing.
Suggest you add both of the following, because it is not easy to predict which will be better:
INDEX(acct_id, status_id, transaction_type)
INDEX(acct_id, status_id, trans_datetime)
Plan B
Do you really have only trans_id in the SELECT list? If so, then it should not be bad to turn this into a "covering" index. That's an index where the entire operation can be performed in the BTree where the index lives thereby avoid having to reach over into the data.
To build such, first build the optimal non-covering index, then add the rest of the fields mentioned anywhere in the query. Either of these should work:
INDEX(acct_id, status_id, trans_datetime, transaction_type, trans_id)
INDEX(acct_id, status_id, transaction_type, trans_datetime, trans_id)
The first two fields can be in either order (both are '='). The last two fields can be in either order (both are useless for finding the rows; they exist only for 'covering').
I recommend against having more than, say, 5 columns in an index.
More info in my Index Cookbook.
Notes
Perform EXPLAIN SELECT. You should see 'Using index' when it is a 'covering' index.
I think the EXPLAIN's Key_len will (in all cases here) show the combined lengths of only acct_id and status_id.
You are in a Stored Procedure? If the version in the SP runs significantly slower than when you experiment, you may need to re-code to CONCAT, PREPARE, and EXECUTE the query.
Related
I have a couple of tables that looks like this:
CREATE TABLE Entities (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
client_id INT NOT NULL,
display_name VARCHAR(45),
PRIMARY KEY (id)
)
CREATE TABLE Statuses (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE EventTypes (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE Events (
id INT NOT NULL AUTO_INCREMENT,
entity_id INT NOT NULL,
date DATE NOT NULL,
event_type_id INT NOT NULL,
status_id INT NOT NULL
)
Events is large > 100,000,000 rows
Entities, Statuses and EventTypes are small < 300 rows a piece
I have several indexes on Events, but the ones that come into play are
idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_date_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)
I have a large complicated query, but I'm getting the same slow query results with a simpler one like the one below (note, in the real queries, I don't use evt.*)
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN #start_date AND #end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = #type_id
For some reason, mysql keeps choosing the index which doesn't cover Events.date and the query takes 15 seconds or more and returns a couple thousand rows. If I change the query to:
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt force index (idx_events_date_ent_status_type)
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN #start_date AND #end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = #type_id
The query takes .014 seconds.
Since this query is built by code, I would much rather not force the index, but mostly, I want to know why it chooses one index over the other. Is it because of the joins?
To give some stats, there are ~2500 distinct dates, and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low cardinality columns.
Do you think it would help to add date to the end of idx_events_date_ent_status_type? Since this is a large table, it takes a long time to add indexes.
I tried adding an additional index,
ix_events_ent_date_status_et(entity_id, date, status_id, event_type_id)
and it actually made the queries slower.
I will experiment a bit more, but I feel like I'm not sure how the optimizer makes it's decisions.
Additional Info:
I tried removing the join to the Statuses table, and mysql switches to ix_events_date_ent_type, and the query runs in 0.045 sec
I can't wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.
I would add this index:
ALTER TABLE Events ADD INDEX (event_type_id, entity_id, date);
The order of columns is important. Put all column(s) used in equality conditions first. This is event_type_id in this case.
The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.
Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than = or IS NULL. So range conditions include >, !=, BETWEEN, IN(), LIKE (with no leading wildcard), IS NOT NULL, and so on.
The condition on entity_id is also an equality condition if the IN() list has one element. MySQL's optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed of IN (19) is typical, then all three columns of the index will be used for filtering.
It's still worth putting date in the index, because it can at least tell the InnoDB storage engine to filter rows before returning them. See https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html It's not quite as good as a real index lookup, but it's worthwhile.
I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.
Rearrange your indexes to have columns in this order:
Any column(s) that will be tested with = or IS NULL.
Column(s) tested with IN -- If there is a single value, this will be further optimized to = for you.
One "range" column, such as your date.
Note that nothing after a "range" test will be used by WHERE.
(There are exceptions, but most are not relevant here.)
More discussion: Index Cookbook
Since the tables smell like Data Warehousing, I suggest looking into
Summary Tables In some cases, long queries on Events can be moved to the summary table(s), where they run much faster. Also, this may eliminate the need for some (or maybe even all) secondary indexes.
Since Events is rather large, I suggest using smaller numbers where practical. INT takes 4 bytes. Speed will improve slightly if you shrink those where appropriate.
When you have INDEX(a,b,c), that index will handle cases that need INDEX(a,b) and INDEX(a). Keep the longer one. (Sometimes the Optimizer picks the shorter index 'erroneously'.)
To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple where conditions. So assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id in (1,2,3), do:
select ...
from (select date('2022-12-01') date union all select date('2022-12-02') union all select date('2022-12-03')) dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:
select ...
from dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
where dates.date between #start_date and #end_date
I have a simple table:
CREATE TABLE `user_values` (
`id` bigint NOT NULL AUTO_INCREMENT,
`user_id` bigint NOT NULL,
`value` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`,`id`),
KEY `id` (`id`,`user_id`);
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
that I am trying to execute the following simple query:
select * from user_values where user_id in (20020, 20030) order by id desc;
I would fully expect this query to 100% use an index (either the (user_id, id) one or the (id, user_id) one) Yet, it turns out that's not the case:
explain select * from user_values where user_id in (20020, 20030); yields:
id
select_type
table
partitions
type
key
key_len
ref
rows
filtered
Extra
1
SIMPLE
user_values
NULL
range
user_id
8
NULL
9
100.00
Using index condition; Using filesort
Why is that the case? How can I avoid a filesort on this trivial query?
You can't avoid the filesort in the query you show.
When you use a range predicate (for example, IN ( ) is a range predicate), and an index is used, the rows are read in index order. But there's no way for the MySQL query optimizer to guess that reading the rows in index order by user_id will guarantee they are also in id order. The two user_id values you are searching for are potentially scattered all over the table, in any order. Therefore MySQL must assume that once the matching rows are read, an extra step of sorting the result by id is necessary.
Here's an example of hypothetical data in which reading the rows by an index on user_id will not be in id order.
id
user_id
1
20030
2
20020
3
20016
4
20030
5
20020
So when reading from an index on (user_id, id), the matching rows will be returned in the following order, sorted by user_id first, then by id:
id
user_id
2
20020
5
20020
1
20030
4
20030
Clearly, the result is not in id order, so it needs to be sorted to satisfy the ORDER BY you requested.
The same kind of effect happens for other type of predicates, for example BETWEEN, or < or != or IS NOT NULL, etc. Every predicate except for = is a range predicate.
The only ways to avoid the filesort are to change the query in one of the following ways:
Omit the ORDER BY clause and accepting the results in whatever order the optimizer chooses to return them, which could be in id order, but only by coincidence.
Change the user_id IN (20020, 20030) to user_id = 20020, so there is only one matching user_id, and therefore reading the matching rows from the index will already be returned in the id order, and therefore the ORDER BY is a no-op. The optimizer recognizes when this is possible, and skips the filesort.
MySQL will most likely use index for the query (unless the user_id's in the query covers most of the rows).
The "filesort" happens in memory (it's really not a filesort), and is used to sort the found rows based on the ORDER BY clause.
You cannot avoid a "sort" in this case.
There were about 9 rows to sort, so it could not have taken long.
How long did the query take? Probably only a few milliseconds, so who cares?
"Filesort" does not necessarily mean that a "file" was involved. In many queries the sort is done in RAM.
Do you use id for anything other than to have a PRIMARY KEY on the table? If not, then this will help a small amount. (The speed-up won't be indicated in EXPLAIN.)
PRIMARY KEY (`user_id`,`id`), -- to avoid secondary lookups
KEY `id` (`id`); -- to keep auto_increment happy
Query
SELECT id FROM `user_tmp`
WHERE `code` = '9s5xs1sy'
AND `go` NOT REGEXP 'http://www.xxxx.example.com/aflam/|http://xx.example.com|http://www.xxxxx..example.com/aflam/|http://www.xxxxxx.example.com/v/|http://www.xxxxxx.example.com/vb/'
AND check='done'
AND `dataip` <1319992460
ORDER BY id DESC
LIMIT 50
MySQL returns:
Showing rows 0 - 29 ( 50 total, Query took 21.3102 sec) [id: 2622270 - 2602288]
Query took 21.3102 sec
if i remove
AND dataip <1319992460
MySQL returns
Showing rows 0 - 29 ( 50 total, Query took 0.0859 sec) [id: 3637556 - 3627005]
Query took 0.0859 sec
and if no data, MySQL returns
MySQL returned an empty result set (i.e. zero rows). ( Query took 21.7332 sec )
Query took 21.7332 sec
Explain plan:
SQL query: Explain SELECT * FROM `user_tmp` WHERE `code` = '93mhco3s5y' AND `too` NOT REGEXP 'http://www.10neen.com/aflam/|http://3ltool.com|http://www.10neen.com/aflam/|http://www.10neen.com/v/|http://www.m1-w3d.com/vb/' and checkopen='2010' and `dataip` <1319992460 ORDER BY id DESC LIMIT 50;
Rows: 1
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE user_tmp index NULL PRIMARY 4 NULL 50 Using where
Example of the database used
CREATE TABLE IF NOT EXISTS user_tmp ( id int(9) NOT NULL
AUTO_INCREMENT, ip text NOT NULL, dataip bigint(20) NOT NULL,
ref text NOT NULL, click int(20) NOT NULL, code text NOT
NULL, too text NOT NULL, name text NOT NULL, checkopen
text NOT NULL, contry text NOT NULL, vOperation text NOT NULL,
vBrowser text NOT NULL, iconOperation text NOT NULL,
iconBrowser text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=4653425 ;
--
-- Dumping data for table user_tmp
INSERT INTO `user_tmp` (`id`, `ip`, `dataip`, `ref`, `click`, `code`, `too`, `name`, `checkopen`, `contry`, `vOperation`, `vBrowser`, `iconOperation`, `iconBrowser`) VALUES
(1, '54.125.78.84', 1319506641, 'http://xxxx.example.com/vb/showthread.php%D8%AA%D8%AD%D9%85%D9%8A%D9%84-%D8%A7%D8%BA%D9%86%D9%8A%D8%A9-%D8%A7%D9%84%D8%A8%D9%88%D9%85-giovanni-marradi-lovers-rendezvous-3cd-1999-a-155712.html', 0, '4mxxxxx5', 'http://www.xxx.example.com/aflam/', 'xxxxe', '2010', 'US', 'Linux', 'Chrome 12.0.742 ', 'linux.png', 'chrome.png');
I want the correct way to do the query and optimize database
You don't have any indexes besides the primary key. You need to make index on fields that you use in your WHERE statement. If you need to index only 1 field or a combination of several fields depends on the other SELECTs you will be running against that table.
Keep in mind that REGEXP cannot use indexes at all, LIKE can use index only when it does not begin with wildcard (so LIKE 'a%' can use index, but LIKE '%a' cannot), bigger than / smaller than (<>) usually don't use indexes also.
So you are left with the code and check fields. I suppose many rows will have the same value for check, so I would begin the index with code field. Multi-field indexes can be used only in the order in which they are defined...
Imagine index created for fields code, check. This index can be used in your query (where the WHERE clause contains both fields), also in the query with only code field, but not in query with only check field.
Is it important to ORDER BY id? If not, leave it out, it will prevent the sort pass and your query will finish faster.
I will assume you are using mysql <= 5.1
The answers above fall into two basic categories:
1. You are using the wrong column type
2. You need indexes
I will deal with each as both are relevant for performance which is ultimately what I take your questions to be about:
Column Types
The difference between bigint/int or int/char for the dataip question is basically not relevant to your issue. The fundamental issue has more to do with index strategy. However when considering performance holistically, the fact that you are using MyISAM as your engine for this table leads me to ask if you really need "text" column types. If you have short (less than 255 say) character columns, then making them fixed length columns will most likely increase performance. Keep in mind that if any one column is of variable length (varchar, text, etc) then this is not worth changing any of them.
Vertical Partitioning
The fact to keep in mind here is that even though you are only requesting the id column from the standpoint of disk IO and memory you are getting the entire row back. Since so many of the rows are text, this could mean a massive amount of data. Any of these rows that are not used for lookups of users or are not often accessed could be moved into another table where the foreign key has a unique key placed on it keeping the relationship 1:1.
Index Strategy
Most likely the problem is simply indexing as is noted above. The reason that your current situation is caused by adding the "AND dataip <1319992460" condition is that it forces a full table scan.
As stated above placing all the columns in the where clause in a single, composite index will help. The order of the columns in the index will no matter so long as all of them appear in the where clause.
However, the order could matter a great deal for other queries. A quick example would be an index made of (colA, colB). A query with "where colA = 'foo'" will use this index. But a query with "where colB = 'bar'" will not because colB is not the left most column in the index definition. So, if you have other queries that use these columns in some combination it is worth minimizing the number of indexes created on the table. This is b/c every index increases the cost of a write and uses disk space. Writes are expensive b/c of necessary disk activity. Don't make them more expensive.
You need to add index like this:
ALTER TABLE `user_tmp` ADD INDEX(`dataip`);
And if your column 'dataip' contains only unique values you can add unique key like this:
ALTER TABLE `user_tmp` ADD UNIQUE(`dataip`);
Keep in mind, that adding index can take long time on a big table, so don't do it on production server with out testing.
You need to create index on fields in the same order that that are using in where clause. Otherwise index is not be used. Index fields of your where clause.
does dataip really need to be a bigint? According to mysql The signed range is -9223372036854775808 to 9223372036854775807 ( it is a 64bit number ).
You need to choose the right column type for the job, and add the right type of index too. Else these queries will take forever.
I've done a lot of reading and Googling on this and I cannot find any satisfactory answer so I'd appreciate any help. Most answers I find come close to my situation but do not address it (and attempting to follow the solutions has not done me any good).
See Edit #2 below for the best example
[This was the original question but is not a great representation of what I'm asking.]
Say I have 2 tables, each with 4 columns:
key (int, auto increment)
c1 (a date)
c2 (a varchar of length 3)
c3 (also a varchar of length 3)
And I want to perform the following query:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.c1, t.c2
Both key fields are indexed as primary keys. I want to get the number of rows returned in each grouping of c1, c2.
When I explain this query I get "using temporary; using filesort". The actual table I'm performing this query on is over 500,000 rows, so that means it's a time consuming query.
So my question is (assuming I'm not doing anything wrong in the query): is there a way to index this table to eliminate the temporary/filesort usage?
Thanks in advance for any help.
Edit
Here is the table definition (in this example both tables are identical - in reality they're not but I'm not sure it makes a difference at this point):
CREATE TABLE `test1` (
`key` int(11) NOT NULL auto_increment,
`c1` date NOT NULL,
`c2` varchar(3) NOT NULL,
`c3` varchar(3) NOT NULL,
PRIMARY KEY (`key`),
UNIQUE KEY `c1` (`c1`,`c2`),
UNIQUE KEY `c2_2` (`c2`,`c1`),
KEY `c2` (`c2`,`c3`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8
Full EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t ALL NULL NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 tracking.t.key 1 Using index
This is just for my sample tables. In my real tables the rows for t says 500,000+ (every row in the table, though that could be related to something else).
Edit #2
Here is a more concrete example to better explain my situation.
Let's say I have data on Little League baseball games. I have two tables. One holds data on the games:
CREATE TABLE `ex_games` (
`game_id` int(11) NOT NULL auto_increment,
`home_team` int(11) NOT NULL,
`date` date NOT NULL,
PRIMARY KEY (`game_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The other holds data on the at bats in each game:
CREATE TABLE `ex_atbats` (
`ab_id` int(11) NOT NULL auto_increment,
`game` int(11) NOT NULL,
`team` int(11) NOT NULL,
`player` int(11) NOT NULL,
`result` tinyint(1) NOT NULL,
PRIMARY KEY (`hit_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
So I have two questions. Let's start with the simple version: I want to return a list of games with a count of how many at bats are in each game. So I think I would do something like this:
SELECT date, home_team, COUNT(h.ab_id) FROM `ex_atbats` h
LEFT JOIN ex_games g ON g.game_id = h.game
GROUP BY g.game_id
This query uses filesort/temporary. Is there a better way to structure this or to index the tables to get rid of that?
Then, the trickier part: say I now want to not only include a count of the number of at bats, but also include a count of the number of at bats that were preceded by an at bat with the same result by the same team. I assume that would be something like:
SELECT g.date, g.home_team, COUNT(ab.ab_id), COUNT(ab2.ab_id) FROM `ex_atbats` ab
LEFT JOIN ex_games g ON g.game_id = ab.game
LEFT JOIN ex_atbats ab2 ON ab2.ab_id = ab.ab_id - 1 AND ab2.result = ab.result
GROUP BY g.game_id
Is that the correct way to structure that query? This also uses filesort/temporary.
So what is the optimal way to go about accomplishing these tasks?
Thanks again.
Phrases Using temporary/filesort usually are not related to the indexes used in the JOIN operation. There is numerous examples where you can have all indexes set (they show up in key and key_len columns in EXPLAIN) but you still get Using temporary and Using filesort.
Check out what the manual says about Using temporary and Using filesort:
How MySQL Uses Internal Temporary Tables
ORDER BY Optimization
Having a combined index for all columns used in GROUP BY clause may help to get rid of Using filesort in certain circumstances. If you also issue ORDER BY you may need to add more complex indexes.
If you have a huge dataset consider partitioning it using some criteria like date or timestamp by means of actual partitioning or a simple WHERE clause.
First of all, the tables' definitions do matter. It's one thing to join using two primary keys, another to join using a primary key from one side and a non-unique key in the other, etc. It also matters what type of engine the tables use as InnoDB treats Primary Keys differently than MyISAM engine.
What I notice though is that on table test1, the (c1,c2) combination is Unique and the fields are not nullable. This allows your query to be rewritten as:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
It will give the same results while using the same field for the JOIN and the GROUP BY. Note that MySQL allows you to use in the SELECT list fields that are not in the GROUP BY list, without having aggregate functions on them. This is not allowed in most other systems and is seen as a bug by some. In this situation though it is a very nice feature. Every row can be either identified by (key) or (c1,c2), so it shouldn't matter which of the two is used for the grouping.
Another thing to note is that when you use LEFT JOIN, it's common to use the joining column from the right side for the counting: COUNT(t2.key) and not COUNT(*). Your original query will give 1 in that column for records in test1 that do not mmatch any record in test2 because it counts rows while you probably want to count the related records in test2 - and show 0 in those cases.
So, try this query and post the EXPLAIN:
SELECT t.c1, t.c2, COUNT(t2.key)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
The indexes help with the join, but you still need to do a full sort in order to do the group by. Essentially, it still has to process every record in the set.
Adding a where clause and limiting the set would run faster, of course. It just won't get you the results you want.
There may be other options than doing a group by on the entire table. I notice you're doing a SELECT * - What are you trying to get out of the query?
SELECT DISTINCT c1, c2
FROM test t
LEFT JOIN test2 t2 ON t2.key = t.key
may run faster, for instance. (I realize this was just a sample query, but understand that it's hard to optimize when you don't know what the end goal is!)
EDIT - In doing some reading (http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html), I learned that, under the correct circumstances, indexes can help significantly with the group by.
What I'm seeing is that it needs to be a sorted index (like BTREE), not a HASH. Perhaps:
CREATE INDEX c1c2 IN t (c1, c2) USING BTREE;
might help.
For innodb it will work, as the index caries your primary key by default. For myisam you have to have the key as the last column of your index be "key". That will give the optimizers all keys in the same order and he can skip the sort. You cannot do any range queryies on the index prefix theN, puts you right back into filesort. currently struggling with a similiar problem
I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`),
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but my Googling all boil down to versions of the query I already have, with no particular suggestions on how to avoid the temporary (and even then, why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster, if the grouped column also has an index on it... Of course, this depends on the number of NULL values that are actually in that column, which are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time, a RANGE SCAN will be applied, which is not so fast... If you separate indexes, I'd expect better performance than with multi-column ones.
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I offer you to try this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then if it doesn't help, try to force index on this query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the source are known or you want to find the count for specific source, then you can try like this.
select count(source= 'A' or NULL) as A,count(source= 'B' or NULL) as B from history;
and for ordering you can do it in your application code. Also try with indexing event and source together.
This will be definitely faster than the older one.