Very slow MySQL query performance

I have a query that takes about 18 seconds to finish:
THE QUERY:
SELECT YEAR(c.date), MONTH(c.date), p.district_id, COUNT(p.owner_id)
FROM commission c
INNER JOIN partner p ON c.customer_id = p.id
WHERE (c.date BETWEEN '2018-01-01' AND '2018-12-31')
AND (c.company_id = 90)
AND (c.source = 'ACTUAL')
AND (p.id IN (3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071,
3072, 3073, 3074, 3075, 3076, 3077, 3078, 3079, 3081, 3082, 3083, 3084,
3085, 3086, 3087, 3088, 3089, 3090, 3091, 3092, 3093, 3094, 3095, 3096,
3097, 3098, 3099, 3448, 3449, 3450, 3451, 3452, 3453, 3454, 3455, 3456,
3457, 3458, 3459, 3460, 3461, 3471, 3490, 3491, 6307, 6368, 6421))
GROUP BY YEAR(c.date), MONTH(c.date), p.district_id
The commission table has around 2.8 million records, of which 860,000+ belong to the current year (2018). The partner table currently has 8,600+ records.
RESULT
| `YEAR(c.date)` | `MONTH(c.date)` | district_id | `COUNT(c.id)` |
|----------------|-----------------|-------------|---------------|
| 2018 | 1 | 1 | 19154 |
| 2018 | 1 | 5 | 9184 |
| 2018 | 1 | 6 | 2706 |
| 2018 | 1 | 12 | 36296 |
| 2018 | 1 | 15 | 13085 |
| 2018 | 2 | 1 | 21231 |
| 2018 | 2 | 5 | 10242 |
| ... | ... | ... | ... |
55 rows retrieved starting from 1 in 18 s 374 ms
(execution: 18 s 368 ms, fetching: 6 ms)
EXPLAIN:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | extra |
|----|-------------|-------|------------|-------|------------------------------------------------------------------------------------------------------|----------------------|---------|-----------------|------|----------|----------------------------------------------|
| 1 | SIMPLE | p | null | range | PRIMARY | PRIMARY | 4 | | 57 | 100 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | c | null | ref | UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73,IDX_6F7146F09395C3F3,IDX_6F7146F0979B1AD6,IDX_6F7146F0AA9E377A | IDX_6F7146F09395C3F3 | 5 | p.id | 6716 | 8.33 | Using where |
DDL:
create table if not exists commission (
id int auto_increment
primary key,
date date not null,
source enum('ACTUAL', 'EXPECTED') not null,
customer_id int null,
transaction_id varchar(255) not null,
company_id int null,
constraint UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73 unique (company_id, transaction_id, source),
constraint FK_6F7146F09395C3F3 foreign key (customer_id) references partner (id),
constraint FK_6F7146F0979B1AD6 foreign key (company_id) references companies (id)
) collate=utf8_unicode_ci;
create index IDX_6F7146F09395C3F3 on commission (customer_id);
create index IDX_6F7146F0979B1AD6 on commission (company_id);
create index IDX_6F7146F0AA9E377A on commission (date);
I noticed that removing the partner IN condition makes the query take only 3s. I tried to replace it with something crazy like this:
AND (',3062,3063,3064,3065,3066,3067,3068,3069,3070,3071,3072,3073,3074,3075,3076,3077,3078,3079,3081,3082,3083,3084,3085,3086,3087,3088,3089,3090,3091,3092,3093,3094,3095,3096,3097,3098,3099,3448,3449,3450,3451,3452,3453,3454,3455,3456,3457,3458,3459,3460,3461,3471,3490,3491,6307,6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
and the result was about 5s... great! but it's a hack.
Why does this query take such a long time to execute when I use the IN clause? Workarounds, tips, links, etc. are welcome. Thanks!

MySQL can generally use only one index per table in a query. For this query you need a compound index covering the aspects of the search. Equality (constant) parts of the WHERE clause should come before range parts, like:
ALTER TABLE commission
DROP INDEX IDX_6F7146F0979B1AD6,
ADD INDEX IDX_6F7146F0979B1AD6 (company_id, source, date)

Here's what the Optimizer sees in your query.
Checking whether to use an index for the GROUP BY:
Functions (YEAR()) in the GROUP BY, so no.
Multiple tables (c and p) mentioned, so no.
For a JOIN, Optimizer will (almost always) start with one, then reach into the other. So, let's look at the two options:
If starting with p:
Assuming you have PRIMARY KEY(id), there is not much to think about. It will simply use that index.
For each row selected from p, it will then look into c, and any variation of this INDEX would be optimal.
c: INDEX(company_id, source, customer_id, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
If starting with c:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
-- slightly better:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date, -- last, since it is tested as a range
customer_id) -- really last -- added only to make it "covering".
The Optimizer will look at "statistics" to crudely decide which table to start with. So, add all the indexes I suggested.
A "covering" index is one that contains all the columns needed anywhere in the query. It is sometimes wise to extend a 'good' index with more columns to make it "covering".
But there is a monkey wrench in here. c.customer_id = p.id means that customer_id IN (...) effectively exists. But now there are two "range-like" constraints -- one is an IN, the other is a 'range'. In some newer versions, the Optimizer will happily jump around due to the IN and still be able to do "range" scans. So, I recommend this ordering:
Test(s) of column = constant
Test(s) with IN
One 'range' test (BETWEEN, >=, LIKE with trailing wildcard, etc)
Perhaps add more columns to make it "covering" -- but don't do this step if you end up with more than, say, 5 columns in the index.
Hence, for c, the following is optimal for the WHERE, and happens to be "covering".
INDEX(company_id, source, -- first, but in any order (all "=")
customer_id, -- "IN"
date) -- last, since it is tested as a range
p: (same as above)
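As concrete DDL, that recommendation for c might look like the sketch below (the index name is illustrative, not from the original answer):
-- Equalities first, then the IN column, then the range column; this covers the whole WHERE clause.
ALTER TABLE commission
  ADD INDEX idx_commission_co_src_cust_date (company_id, source, customer_id, date);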
Since there was an IN or "range", there is no use seeing if the index can also handle the GROUP BY.
A note on COUNT(x) -- it checks that x is NOT NULL. It is usually just as correct to say COUNT(*), which counts the number of rows without any extra checking.
This is a non-starter since it hides the indexed column (id) in a function:
AND (',3062,3063,3064,3065,3066,...6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))

With your LIKE hack you are tricking the optimizer into using a different plan (most probably using the IDX_6F7146F0AA9E377A index on date in the first place).
You should be able to see this in explain.
I think the real issue in your case is the second line of the EXPLAIN: the server executes multiple functions (MONTH, YEAR) for 6716 rows and then tries to group all those rows. During this time all 6716 rows have to be stored (in memory or on disk, depending on your server configuration).
SELECT COUNT(*) FROM commission WHERE (date BETWEEN '2018-01-01' AND '2018-12-31') AND company_id = 90 AND source = 'ACTUAL';
=> How many rows are we talking about?
If the number returned by the above query is much lower than 6716, I'd try adding a covering index on the columns customer_id, company_id, source and date. I'm not sure about the best order, as it depends on your data (check the cardinality of these columns); I'd start with the index (date, company_id, source, customer_id). Also, I'd add a unique index (id, district_id, owner_id) on partner.
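Written as DDL, those suggestions might look like this sketch (index names are illustrative):
-- Covering index for the commission side, plus a covering unique index on partner:
CREATE INDEX idx_commission_date_co_src_cust ON commission (date, company_id, source, customer_id);
CREATE UNIQUE INDEX idx_partner_id_district_owner ON partner (id, district_id, owner_id);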
It is also possible to add generated stored columns _year and _month (if your server is a bit old you can add normal columns and fill them with a trigger) to get rid of the repeated function executions.
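A minimal sketch of the generated-column approach, assuming MySQL 5.7+ syntax (column names are illustrative):
ALTER TABLE commission
  ADD COLUMN _year SMALLINT AS (YEAR(date)) STORED,
  ADD COLUMN _month TINYINT AS (MONTH(date)) STORED;
-- The query can then GROUP BY c._year, c._month, p.district_id instead of calling YEAR()/MONTH() per row.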

Related

Reduce number of joins in mysql

I have 12 fixed tables (group, local, element, sub_element, service, ...), each table with different numbers of rows.
The 'id_' column in each table is a primary key (int). The other columns are of datatype varchar(20). The maximum number of rows in these tables is 300.
Each table was created in this way:
CREATE TABLE `group`
(
id_G int NOT NULL,
name_group varchar(20) NOT NULL,
PRIMARY KEY (id_G)
);
|........GROUP......| |.......LOCAL.......| |.......SERVICE.......|
| id_G | name_group | | id_L | name_local | | id_S | name_service |
+------+------------+ +------+------------+ +------+--------------+
| 1 | group1 | | 1 | local1 | | 1 | service1 |
| 2 | group2 | | 2 | local2 | | 2 | service2 |
And I have one table that combines all these tables depending on the user's selections.
The 'id_' values from the fixed tables selected by the user are recorded in this table.
This table was created in this way:
CREATE TABLE event
(
id_E int NOT NULL,
event_name varchar(20) NOT NULL,
id_G int NOT NULL,
id_L int NOT NULL,
...
PRIMARY KEY (id_E)
);
The event table looks like this:
|....................EVENT.....................|
| id_E | event_name | id_G | id_L | ... |id_S |
+------+-------------+------+------+-----+-----+
| 1 | mater1 | 1 | 1 | ... | 3 |
| 2 | master2 | 2 | 2 | ... | 6 |
This table grows every day, and it now has thousands of rows.
Column id_E is the primary key (int), event_name is varchar(20).
In addition to the id_E and event_name columns, this table has 12 other columns that come from the fixed tables.
Every time I need to retrieve information from the event table in a readable form, I need to do about 12 joins.
My query looks like this when I need to retrieve all columns from the event table:
SELECT event_name, name_group, name_local ..., name_service
FROM event
INNER JOIN `group` on event.id_G = `group`.id_G
INNER JOIN local on event.id_L = local.id_L
...
INNER JOIN service on event.id_S = service.id_S
WHERE event.id_S = 7 (for example)
This slows down my system's performance. Is there a way to reduce the number of joins? I've heard about using natural keys, but I don't think that's a good idea for my case, considering future maintenance.
My queries are taking about 7 seconds and I need to reduce this time.
I changed the WHERE clause and it had no effect, so I am sure the problem is that the query has so many joins.
Could someone give me some help? Thanks a lot.
MySQL has a useful keyword, STRAIGHT_JOIN, which might be what you are looking for. First, I have to assume each of your lookup tables (id/description) already has an index on the ID column, since that is the primary key.
Your event table is the one you are querying as the primary basis of the details, joining to the lookups via their respective IDs. As long as the WHERE clause applicable to the EVENT table is optimized, such as the ID you are looking for, it SHOULD be virtually instantaneous.
If it is not, it might be that MySQL is trying to think for you and is taking one of the secondary lookup tables as the primary basis of the query for whatever reason, such as a much lower record count. In that case, add the keyword and try it:
SELECT STRAIGHT_JOIN ... rest of your query
This tells MySQL to do the query in the order you gave it, thus the event table first and its WHERE clause on the ID. It should find that one thing, then grab all the corresponding lookup descriptions from the other tables.
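Applied to the query from the question, it might look like this sketch (only a few of the 12 joins are shown):
SELECT STRAIGHT_JOIN event_name, name_group, name_local, name_service
FROM event
INNER JOIN `group` ON event.id_G = `group`.id_G
INNER JOIN local ON event.id_L = local.id_L
INNER JOIN service ON event.id_S = service.id_S
WHERE event.id_S = 7;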
Create indexes; concretely, use compound indexes. For instance, start by creating a compound index for event and group:
on the event table create one for (event id, group id);
then, on the group table create another one for the next relation (group id, local id);
on local do the same with service, and so on...

Optimize mysql query involving millions of rows

I have, in a project, a database with two big tables: "terminosnoticia" has 400 million rows and "noticia" 3 million. I have one query I want to make lighter (it takes from 10s to 400s):
SELECT noticia_id, termino_id
FROM noticia
LEFT JOIN terminosnoticia on terminosnoticia.noticia_id=noticia.id AND termino_id IN (7818,12345)
WHERE noticia.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
AND noticia_id is not null AND termino_id is not null;
The only viable solution I have left to explore is to denormalize the database to include the 'fecha' field in the big table, but this will multiply the index sizes.
Explain plan:
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| 1 | SIMPLE | terminosnoticia | ref | noticia_id,termino_id | termino_id | 4 | const | 58480 | Using where |
| 1 | SIMPLE | noticia | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.terminosnoticia.noticia_id | 1 | Using where |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
Changing the query and creating the index as suggested, the explain plan is now:
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| 1 | SIMPLE | T | ref | noticia_id,termino_id,terminosnoticia_cpx | terminosnoticia_cpx | 4 | const | 60600 | Using index |
| 1 | SIMPLE | N | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.T.noticia_id | 1 | Using where |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
But the execution time does not vary too much...
Any idea?
As Strawberry pointed out, having the NOT NULL conditions in your WHERE clause makes this the same as a regular INNER JOIN, so it can be reduced to:
SELECT
N.id as noticia_id,
T.termino_id
FROM
noticia N USE INDEX (fecha)
JOIN terminosnoticia T
on N.id = T.noticia_id
AND T.termino_id IN (7818,12345)
WHERE
N.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
Now, that said and aliases applied, I would suggest the following covering indexes:
table index
Noticia ( fecha, id )
terminosnoticia ( noticia_id, termino_id )
This way the query can get all the results directly from the indexes and not have to go to the raw data pages to qualify the other fields.
Assuming noticia_id is noticia's primary key, I would add the following indexes:
create index noticia_fecha_idx on noticia(fecha);
create index terminosnoticia_id_noticia_idx on terminosnoticia(noticia_id);
And try your queries again.
Do include the current execution plan of your query. It might help in figuring this one out.
Try this:
SELECT tbl1.noticia_id, tbl1.termino_id FROM
( SELECT noticia_id, termino_id FROM terminosnoticia
  WHERE terminosnoticia.termino_id IN (7818,12345)
  AND terminosnoticia.noticia_id is not null
) tbl1 INNER JOIN
( SELECT id FROM noticia
  WHERE noticia.fecha
  BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
) tbl2 ON tbl1.noticia_id = tbl2.id
We're assuming that the noticia_id and termino_id are columns in terminosnoticia table. (We wouldn't have to guess, if all of the column references were qualified with the table name or a short table alias.)
Why is this an outer join? The predicates in the WHERE clause are going to exclude rows with NULL values for columns from terminosnoticia. That's going to negate the "outerness" of the join.
And if we write this as an inner join, those predicates in the WHERE clause are redundant. We already know that noticia_id won't be NULL (if it satisfies the equality predicate in the ON clause). Same for termino_id, that won't be NULL if it's equal to a value in the IN list.
I believe this query will return an equivalent result:
SELECT t.noticia_id
, t.termino_id
FROM noticia n
JOIN terminosnoticia t
ON t.noticia_id = n.id
AND t.termino_id IN (7818,12345)
WHERE n.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
What's left now is figuring out if there's any implicit datatype conversions.
We don't see the datatype of termino_id. So we don't know if that's defined as numeric. It's bad news if it's not, since MySQL will have to perform a conversion to numeric, for every row in the table, so it can do the comparison to the numeric literals.
We don't see the datatypes of the noticia_id, and whether that matches the datatype of the column it's being compared to, the id column from noticia table.
We also don't see the datatype of fecha. Based on the string literals in the between predicate, it looks like it's probably a DATETIME or TIMESTAMP. But that's just a guess. We don't know, since we don't have a table definition available to us.
Once we have verified that there aren't any implicit datatype conversions that are going to bite us...
For the query with the inner join (as above), the best shot at reasonable performance will likely be with MySQL making effective use of covering indexes. (A covering index allows MySQL to satisfy the query directly from the index blocks, without needing to look up pages in the underlying table.)
As DRApp's answer already states, the best candidates for covering indexes, for this particular query, would be:
... ON noticia (fecha, id)
... ON terminosnoticia (noticia_id, termino_id)
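As a sketch, creating those covering indexes could look like this (index names are illustrative):
CREATE INDEX noticia_fecha_id_ix ON noticia (fecha, id);
CREATE INDEX terminosnoticia_nid_tid_ix ON terminosnoticia (noticia_id, termino_id);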
An existing index that has those same leading columns in that same order (possibly followed by additional columns) would also be suitable, and would make the new indexes redundant.
The addition of these indexes will render other indexes redundant.
The first index would be redundant with ... ON noticia (fecha). Assuming the index isn't enforcing a UNIQUE constraint, it could be dropped. Any query making effective use of that index could use the new index, since fecha is the leading column in the new index.
Similarly, an index ... ON terminosnoticia (noticia_id) would be redundant. Again, assuming it's not a unique index, enforcing a UNIQUE constraint, that index could be dropped as well.

SQL query optimization - really nothing more to improve?

I have the following query. I picked it from the MySQL slow query log:
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step ON item_step.item_id = item.id
WHERE
item_step.number = '2' AND
(IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2")) AND
item.time >= '2015-03-01 07:00:00' AND
item.time < '2015-05-01 07:00:00';
As usual, I tried to inspect it using EXPLAIN:
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| 1 | SIMPLE | item | ALL | PRIMARY,time | NULL | NULL | NULL | 790464 | 38.74 | Using where |
| 1 | SIMPLE | item_step | ref | number,item_id,result2_idx | item_id | 4 | debug_db.item.id | 1 | 100.00 | Using where |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
Adding an index on (id, time) to the item table did nothing.
Actually the time column has an index, and the tables are connected using foreign keys and have indexes.
I have no idea what to do here. Is it really impossible to optimize this query to avoid join type ALL?
Since you already seem to have a FK from item_step.item_id to item.id, the only option you have for improvement is focusing on the parts used to filter out records.
Slightly reformatting your query, we have:
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step
ON item_step.item_id = item.id
AND item_step.number = '2'
AND (IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2"))
WHERE item.time >= '2015-03-01 07:00:00'
AND item.time < '2015-05-01 07:00:00';
The first thing to notice is IS_OK(item_step.result). I have no clue what's behind this function, but I'm pretty sure it blocks the optimizer from using any index on this field efficiently. If the formula is something that can be written directly in the query, I would suggest doing so (e.g. IN (1, 4, 9), or IN (SELECT OK FROM result_values), etc.).
Going by the field names, I'm going to assume that we first want to reduce the list of item ids to a minimum and then use that reduced list to work on the item_step table. To do so you'll need an index on the time field. I'm assuming the id field (the PK) is automatically included in that index, but I'm no MySQL specialist and it might also depend on your storage engine. Anyway, in MSSQL that's how it would work, YMMV.
The second thing to do then is to go with this list of item_ids to the item_step table and reduce the number of records there. For this you'll want a compound index on item_id, number, result2, result. If you manage to write the IS_OK() function 'inline' into the query you might want to try swapping the last two fields around... something you'll need to test.
From what I read here and there, MySQL does not support something like INCLUDE on indexes in the same way as MSSQL does. A way around that would be to create a 'covering' index on time, duration on item. That way, everything can be done from the index directly, at the cost of more disk-space and CPU requirements when adding data to the item table.
In short (sketched below):
add index on item on time, duration
add index on item_step on item_id, number, result2, result
see if you can inline the IS_OK() function.
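A minimal sketch of those two indexes as DDL (index names are illustrative):
CREATE INDEX item_time_duration_ix ON item (time, duration);
CREATE INDEX item_step_cover_ix ON item_step (item_id, number, result2, result);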

MySQL indexes performance on huge tables

TL;DR:
I have a query on 2 huge tables. There are no indexes. It is slow. So I built indexes. Now it is slower. Why does this make sense? What is the correct way to optimize it?
The background:
I have 2 tables
person, a table containing information about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.
They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.
Those 2 tables are huge, and don't contain any indexes except a PRIMARY on their respective id.
I'm trying to get the age of the youngest person in each department, and here is the query I've come up with:
SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
FROM person
INNER JOIN works_in
ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department
The query works, but I'm dissatisfied with its performance, as it takes ~17s to run. This is expected, as the data is huge and needs to be written to disk, and there are no indexes on the tables.
EXPLAIN for this query gives
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------|
| 1 | SIMPLE | works_in| ALL | NULL | NULL | NULL | NULL | 22496409 | Using temporary; Using filesort |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | dbtest.works_in.person_id| 1 | Using where |
I built a bunch of indexes for the 2 tables,
/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);
EXPLAIN shows an improvement, at least that's how I understand it, seeing that it now uses an index and scans fewer rows.
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------|
| 1 | SIMPLE | person | range | PRIMARY,birthdate | birthdate | 4 | NULL | 267818 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | works_in| ref | person,department_id_person,person_department_id | person_department_id | 4 | dbtest.person.id | 3 | Using index |
However, the execution time of the query has doubled (from ~17s to ~35s).
Why does this make sense, and what is the correct way to optimize this?
EDIT
Using Gordon Linoff's answer (the first suggestion), the execution time is ~9s (half of the initial time). Choosing good indexes does seem to help, but the execution time is still pretty high. Any other ideas on how to improve on this?
More information concerning the dataset:
There are about 5'000'000 records in the person table.
Of which only 130'000 have a valid (not NULL) birthdate
I indeed have a department table, which contains about 3'000'000 records (they are actually projects, not departments)
For this query:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query and save the extra cost of reading data pages.
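As DDL, that first suggestion might look like this sketch (index names are illustrative; the department column name follows the query above, while the question's table description calls it department_id):
CREATE INDEX person_birthdate_id_ix ON person (birthdate, id);
CREATE INDEX works_in_person_department_ix ON works_in (person_id, department);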
By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
GROUP BY wi.department;
For this, the best indexes are person(id, birthdate) and works_in(person_id, department).
EDIT:
I cannot think of an easy way to solve the problem. One solution is more powerful hardware.
If you really need this information quickly, then additional work is needed.
One approach is to add a maximum birth date to the departments table and maintain it with triggers. For works_in, you need triggers for update, insert, and delete. For persons, only update (presumably inserts and deletes would be handled via works_in). This saves the final GROUP BY, which should be a big saving.
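A hedged sketch of just one of those triggers, assuming the department table has an id and a nullable max_birthdate column (all names here are assumptions, not from the question):
-- Keep department.max_birthdate current whenever a person is added to a department.
DELIMITER //
CREATE TRIGGER works_in_after_insert AFTER INSERT ON works_in
FOR EACH ROW
BEGIN
  UPDATE department d
  JOIN person p ON p.id = NEW.person_id
  SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate), p.birthdate)
  WHERE d.id = NEW.department_id
    AND p.birthdate IS NOT NULL;
END//
DELIMITER ;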
A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.
Indexing does not automatically improve performance; a poorly chosen index can make a query slower, and on InnoDB tables every secondary-index lookup also has to go through the clustered primary key.
Add indexes on the columns you expect to query the most. The more complex the data relationships grow, especially when those relationships refer back to the same tables (such as in inner joins), the worse each query's performance gets.
With an index, the engine has to use the index to get matching values, which is fast. Then it has to use the matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows, it can be faster to just look up all the rows in the table.
When to add an index on a SQL table field (MySQL)?
When to use MyISAM and InnoDB?
https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam

How can I speed up a GROUP BY query that already uses indexes?

We have a MyISAM table with approximately 75 million rows that has 5 columns:
id (int),
user_id(int),
page_id (int),
type (enum with 6 strings)
date_created(datetime).
We have a primary index on the ID column, a unique index (user_id, page_id, date_created) AND a composite index (page_id, date_created)
The problem is that the query below takes up to 90 seconds to complete
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table`
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id`
This is the explain of this query
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | table | range | page_id | page_id | 12 | NULL | 520024 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
EDIT:
At the suggestion of ypercube I tried adding a new index (page_id, user_id, date_created). However, MySQL does not use it by default, so I had to suggest it to the query optimizer. Here is the new query and the EXPLAIN:
SELECT SQL_NO_CACHE user_id, count(*) nr
FROM `table` USE INDEX (usridexp)
WHERE `page_id`=301
AND `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
GROUP BY `user_id`
ORDER BY NULL
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| 1 | SIMPLE | table | ref | usridexp | usridexp | 4 | const | 3943444 | Using where; Using index |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
Some changes that may improve the query:
Change COUNT(id) to COUNT(*). Since id is (I guess) the PRIMARY KEY and NOT NULL, the results will be identical.
Add an ORDER BY NULL after the GROUP BY clause. In MySQL, a GROUP BY operation also sorts the results, unless you specify otherwise.
The (page_id, date_created) is probably the best index that MySQL can use for this query but you could also try (page_id, user_id, date_created) (can you also post the EXPLAIN if you add this index?)
Another thing not related to the performance of this query:
If your (user_id, page_id, date_created) index is UNIQUE and the id is auto-generated (and not used for anything else but as a primary key), you can make it the PRIMARY KEY and drop the id column. One less index and 4 bytes less per row.
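A hedged sketch of that change (it assumes id really is not referenced anywhere else; the auto_increment attribute has to be removed before the primary key can be dropped):
ALTER TABLE `table` MODIFY id INT NOT NULL;
ALTER TABLE `table`
  DROP PRIMARY KEY,
  DROP COLUMN id,
  ADD PRIMARY KEY (user_id, page_id, date_created);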
1) It depends on your data - but you should have multiple indexes available to allow MySQL to choose the best one. e.g. if the table had an index on page_id it wouldn't be scanning so many rows.
2) There is a way of optimising date searches. I haven't actually implemented this myself yet, but have a similar problem that I have thought about quite a bit.
Basically you are looking up data by day, but date compares are really slow. What you could do is create another table that stores the earliest and latest ID from the table for each day. That table would need to be populated at the end of each day.
After that you could break your query into two parts:
i) Find the IDs to search by running two queries:
select earliestID from idCacheTable where date = '2012-01-03';
select latestID from idCacheTable where date = '2012-02-03';
ii) You can then search directly on the primary key of the table, without doing a date compare on each row, which would be waaaaaay faster.
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM table
WHERE page_id=301
and (id >= earliestID and id <= latestID)
AND page_id<>user_id
group by user_id;
The exact solution to your problem will depend on what your data looks like though, rather than one of those two things always being correct.
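For reference, a minimal sketch of the idCacheTable described in part (i) above (all names and types are assumptions):
CREATE TABLE idCacheTable (
  `date` DATE NOT NULL PRIMARY KEY,
  earliestID INT NOT NULL,
  latestID INT NOT NULL
);
-- Populate once per day, e.g. from a job that runs just after midnight and caches yesterday's range:
INSERT INTO idCacheTable (`date`, earliestID, latestID)
SELECT DATE(date_created), MIN(id), MAX(id)
FROM `table`
WHERE date_created >= CURDATE() - INTERVAL 1 DAY
  AND date_created < CURDATE()
GROUP BY DATE(date_created);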
Sounds odd, but try adding a JOIN statement:
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table` t
JOIN `table` t2 ON t.`user_id`= t2.`user_id`
WHERE t.`page_id`=301
and t.`date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND t.`page_id`<>t.`user_id`
group by t.`user_id`
For a similar problem, this made the query execute 20 times faster (3-4s instead of 60+). The JOIN does not do anything smart; the speedup seems to come entirely from MySQL's internal implementation (tested on MySQL 5.1; the table has only rare user_id duplicates).