MySQL range query is slow

I have read different links like http://goo.gl/1nr3s2, http://goo.gl/gv4Vlc and other Stack Overflow questions, but none of them helps with this problem.
The problem involves multiple tables, but EXPLAIN helped me identify that the range access is the main problem in the query.
First I need to explain that I have this table with this sample data (to keep things simple, I will not show ids in any table):
+-------+----------+----------------+--------------+---------------+----------------+
| marca | submarca | modelo_inicial | modelo_final | motor         | texto_articulo |
+-------+----------+----------------+--------------+---------------+----------------+
| Buick | Century  | 1993           | 1996         | 4 Cil 2.2 Lts | BE1254AG4      |
| Buick | Century  | 1993           | 1996         | 4 Cil 2.2 Lts | 854G4          |
+-------+----------+----------------+--------------+---------------+----------------+
This table has more than 1.5 million rows. I have created a composite index that combines the initial and final model years, and each of those columns also has its own independent index, as shown in this structure:
CREATE TABLE `general` (
`id_general` int(11) NOT NULL AUTO_INCREMENT,
`id_marca_submarca` int(11) NOT NULL,
`id_modelo_inicial` int(11) NOT NULL,
`id_modelo_final` int(11) NOT NULL,
`id_motor` int(11) NOT NULL,
`id_articulo` int(11) NOT NULL,
PRIMARY KEY (`id_general`),
KEY `fk_general_articulo` (`id_articulo`),
KEY `modelo_inicial_final` (`id_modelo_inicial`,`id_modelo_final`),
KEY `indice_motor` (`id_motor`),
KEY `indice_marca_submarca` (`id_marca_submarca`),
KEY `indice_modelo_inicial` (`id_modelo_inicial`),
KEY `indice_modelo_final` (`id_modelo_final`),
CONSTRAINT `fk_general_articulo` FOREIGN KEY (`id_articulo`) REFERENCES `articulo` (`id_articulo`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1191853 DEFAULT CHARSET=utf8
I have another table that contains the different years, with sample data like this:
+-----------+--------------+
| id_modelo | texto_modelo |
+-----------+--------------+
|        76 | 2014         |
|        75 | 2013         |
| ...       | ...          |
|         1 | 1939         |
+-----------+--------------+
I created a query that contains a subquery to obtain specific data, but it took a lot of time. Below are some of the queries I have tried, but none of them has worked properly for me:
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular.modelo M ON G.id_modelo_inicial <= M.id_modelo AND G.id_modelo_final >= M.id_modelo
WHERE EXISTS
(
SELECT DISTINCT A.id_articulo
...subquery...
WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
This query took many seconds, so I used EXPLAIN to check the plan.
This is another query I tried:
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular_rigs.modelo M ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
SELECT DISTINCT A.id_articulo
...subquery WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;

Some operations you could do to change the query plan:
OP1: Get rid of all the keys or indexes in table general.
OP2: Use SELECT 1 instead of SELECT DISTINCT A.id_articulo in the subquery inside EXISTS.
Do these operations separately and compare the differences; a sketch of the OP2 rewrite follows.
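For OP2, a minimal sketch of the rewritten query. The subquery body is elided in the question, so the table names articulo A and articulo_distribuidor AD below are placeholders inferred from the aliases, not the actual schema:
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular.modelo M
        ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
    SELECT 1
    FROM articulo A
    INNER JOIN articulo_distribuidor AD ON AD.id_articulo = A.id_articulo
    WHERE A.id_articulo = G.id_articulo
      AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
EXISTS only checks whether at least one row matches, so SELECT 1 lets the optimizer stop at the first match instead of materializing a DISTINCT list.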

Related

Optimize and speed up MySQL query selection

I'm trying to figure out the best way to optimize my current selection query on a MySQL database.
I have 2 MySQL tables with a one-to-many relationship. One is the user table, which contains the unique list of users and has around 22k rows. The other is the linedata table, which contains all the possible coordinates for each user and has around 490k rows.
In this case we can assume the foreign key between the 2 tables is the id value. In the user table the id is also the auto-increment primary key, while in the linedata table it is not a primary key because we can have multiple rows for the same user.
The CREATE statement structure
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isActive` tinyint(4) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`gender` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
`age` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21938 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`timestamp` datetime NOT NULL,
`x` float NOT NULL,
`y` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The selection query
SELECT
u.id,
u.isActive,
u.userId,
u.name,
u.gender,
u.age,
GROUP_CONCAT(CONCAT_WS(', ',timestamp,x, y)
ORDER BY timestamp ASC SEPARATOR '; '
) as linedata_0
FROM user u
JOIN linedata l
ON u.id=l.id
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
GROUP BY userId;
The EXPLAIN output
+----+-------------+-------+--------+---------------+---------+---------+------+--------+-----------------------------------------------+
| ID | SELECT_TYPE | TABLE | TYPE   | POSSIBLE_KEYS | KEY     | KEY_LEN | REF  | ROWS   | EXTRA                                         |
+----+-------------+-------+--------+---------------+---------+---------+------+--------+-----------------------------------------------+
|  1 | SIMPLE      | l     | ALL    | NULL          | NULL    | NULL    | NULL | 491157 | Using where; Using temporary; Using filesort  |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY       | PRIMARY | 4       | l.id |      1 | NULL                                          |
+----+-------------+-------+--------+---------------+---------+---------+------+--------+-----------------------------------------------+
The selection query works if, for example, I add another WHERE condition to filter individual users. Let's say I select just 200 users: then I get around 14 seconds of execution time, and around 7 seconds for just the first 100 users. But with only the datetime range condition, it seems to run without ever finishing. Any suggestions?
UPDATE
After following Rick's suggestions, the query benchmark is now around 14 seconds. Below is the EXPLAIN EXTENDED output:
+----+--------------------+-------+-------+--------------------+--------------------+---------+------+-------+----------+-----------------------+
| id | select_type        | table | type  | possible_keys      | key                | key_len | ref  | rows  | filtered | Extra                 |
+----+--------------------+-------+-------+--------------------+--------------------+---------+------+-------+----------+-----------------------+
|  1 | PRIMARY            | u     | index | PRIMARY            | PRIMARY            | 4       | NULL | 21959 |   100.00 | NULL                  |
|  1 | PRIMARY            | l     | ref   | id_timestamp_index | id_timestamp_index | 4       | u.id |    14 |   100.00 | Using index condition |
|  2 | DEPENDENT SUBQUERY | NULL  | NULL  | NULL               | NULL               | NULL    | NULL |  NULL |     NULL | No tables used        |
+----+--------------------+-------+-------+--------------------+--------------------+---------+------+-------+----------+-----------------------+
I have changed some values in the tables:
The id in the user table can now be joined with userId in the linedata table, and both are integers now. The only string type left is the userId value in the user table, because it is a sort of long string identifier like 0000309ab2912b2fd34350d7e6c079846bb6c5e1f97d3ccb053d15061433e77a_0.
So, just to make a quick example, we will have in the user and linedata tables:
+----+----------+--------+------+--------+-----+
| id | isActive | userId | name | gender | age |
+----+----------+--------+------+--------+-----+
|  1 |        1 | x4by4d | john | m      |  22 |
|  2 |        1 | 3ub3ub | bob  | m      |  50 |
+----+----------+--------+------+--------+-----+
+----+--------+-----------+----+----+
| id | userId | timestamp | x  | y  |
+----+--------+-----------+----+----+
|  1 |      1 | somedate  | 30 | 10 |
|  2 |      1 | somedate  | 45 | 15 |
|  3 |      1 | somedate  | 50 | 20 |
|  4 |      2 | somedate  | 20 |  5 |
|  5 |      2 | somedate  | 25 | 10 |
+----+--------+-----------+----+----+
I have added a compound index made of the userId and timestamp values in the linedata table.
Instead of having an auto-increment id as the primary key of the linedata table, what if I add a composite primary key made of userId+timestamp? Should that increase performance, or maybe not?
There are several bugs to fix before we can discuss performance.
First of all, '2018-02-28T20:00:00.000Z' won't work in MySQL. It needs to be '2018-02-28 20:00:00.000' and something needs to be done about the timezone.
Then, don't "hide a column in a function". That is, DATEDIFF(l.timestamp, ...) cannot use any index on timestamp.
So, instead of
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
do something like
WHERE l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218'
I'm confused about the two tables. Both have id and userId, yet you join on id. Perhaps instead of
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
...
you meant
CREATE TABLE `linedata` (
`id` int(11) NOT NULL AUTO_INCREMENT, -- (the id for `linedata`)
`userId` int NOT NULL, -- to link to the other table
...
PRIMARY KEY(id)
...
Then there could be several linedata rows for each user.
At that point, this
JOIN linedata l ON u.id=l.id
becomes
JOIN linedata l ON u.id=l.userid
Now, for performance: linedata needs INDEX(userid, timestamp) - in that order.
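In MySQL syntax that could be, for instance (the index name idx_user_ts is my choice):
ALTER TABLE linedata ADD INDEX idx_user_ts (userId, timestamp);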
Now, think about the output. You are asking for up to 22K rows, with possibly hundreds of "ts,x,y" strung together in one of the columns. What will receive this much data? Will it choke on it?
And GROUP_CONCAT has a default limit of 1024 bytes. That will allow for about 50 points. If a 'user' can be in more than 50 spots in 9 days, consider increasing group_concat_max_len before running the query.
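For example, to raise the limit for the current session before running the query (the 1 MB value is only an illustration):
SET SESSION group_concat_max_len = 1048576;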
To make it work even faster, reformulate it this way, moving the aggregation into a correlated subquery so that the outer query no longer needs GROUP BY:
SELECT u.id, u.isActive, u.userId, u.name, u.gender, u.age,
       ( SELECT GROUP_CONCAT(CONCAT_WS(', ', l.timestamp, l.x, l.y)
                             ORDER BY l.timestamp ASC
                             SEPARATOR '; ')
         FROM linedata l
         WHERE l.userId = u.id
           AND l.timestamp >= '2018-02-28 20:00:00.000'
           AND l.timestamp <  '2018-11-20 09:20:08.218'
       ) AS linedata_0
FROM user u;
(Users with no points in the date range will show NULL for linedata_0; add a WHERE EXISTS check on linedata if you want to drop them.)
Another thing: you probably want to be able to look up a user by name, so add INDEX(name).
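For instance (again, the index name is arbitrary):
ALTER TABLE user ADD INDEX idx_name (name);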
Oh, what the heck is the VARCHAR(255) for userID?? Ids are normally integers.

MySQL query optimization (to avoid using UNION ALL)

Is there any way to optimize the given query? I would like to always get all results from the user table, plus the results from the picture table (if related rows exist). Is that possible without using UNION ALL?
Let's consider the following example.
+----+--------+
| id | name   |
+----+--------+
|  1 | Drosos |
|  2 | Jack   |
+----+--------+
+----+---------+--------------+
| id | user_id | picture_name |
+----+---------+--------------+
|  1 |       1 | avatar.jpg   |
|  2 |       1 | avatar2.jpg  |
+----+---------+--------------+
Expected result
+--------+--------------+
| name   | picture_name |
+--------+--------------+
| Drosos | avatar.jpg   |
| Drosos | avatar2.jpg  |
| Drosos | NULL         |
| Jack   | NULL         |
+--------+--------------+
4 rows in set (0.00 sec)
User
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(45) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;
Picture table
CREATE TABLE `picture` (
`id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`picture_name` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query
SELECT u.name, p.picture_name FROM user u
INNER JOIN picture p ON p.user_id = u.id
UNION ALL
SELECT u.name, NULL FROM user u;
http://sqlfiddle.com/#!9/46d18a/1
Here's a method to get what you're after, but it's really only to illustrate that UNION ALL is probably your best solution anyway. This is SQL Server syntax, which should be pretty close to MySQL:
SELECT u.name, p.picture_name
FROM user u
CROSS JOIN
(SELECT 1 as C1 UNION ALL SELECT 2) As CJ
LEFT JOIN picture p ON p.user_id = u.id AND CJ.C1 = 1
This duplicates the user table with a cross join, then attaches pictures to just one of the copies.
If you didn't need that extra Drosos | NULL row, then a simple left join would be fine.
The best optimization I have achieved is by using a "materialized view" with the needed indexes applied. The query, which used to take ~0.4000 sec, now takes ~0.0025 sec.
Materialized views are not supported by MySQL, so I had to create the table and triggers manually (which is not great, but in my case was worth doing).
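As a rough sketch of that approach, assuming a cache table fed by an insert trigger (all names here are illustrative, not the ones actually used):
CREATE TABLE user_picture_mv (
  user_id int(11) NOT NULL,
  name varchar(45) DEFAULT NULL,
  picture_name varchar(45) DEFAULT NULL,
  KEY idx_user_id (user_id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TRIGGER picture_after_insert AFTER INSERT ON picture
FOR EACH ROW
  INSERT INTO user_picture_mv (user_id, name, picture_name)
  SELECT u.id, u.name, NEW.picture_name
  FROM user u
  WHERE u.id = NEW.user_id;
Corresponding triggers for UPDATE and DELETE (and on the user table) are needed to keep the copy in sync, which is the main maintenance cost of this technique.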

MySQL optimize query for counting scheduled items over time periods

In a scheduling application I am working on, I am dealing with a fairly complex database schema that describes kids assigned to groups on timeslots on certain dates. Given this schema, I want to query how many kids are scheduled on a certain group, for a certain timeslot, over a certain range of dates.
DB Schema
Timeslot: A timeslot has a certain start and end time (e.g. 13:00 - 18:00). Time can vary in 15-minute steps. In our application we want to schedule a kid on a group for the duration of this timeslot.
Time slice: A time slice record exists for every 15 minutes of a 24-hour period (96 in total); 15 minutes is the smallest possible planning unit. A timeslot is linked to each slice covered between its start and end time (for example, timeslot 13:00-18:00 will have a record pointing to each of the time slices [13:00, 13:15, 13:30...17:45]). This makes it possible to count how many kids are 'occupying' the same time slice at any given time and date.
Kid: A kid is simply the entity being scheduled
Group: A group is a representation of a physical location with a specific capacity
GroupAssignment: A group assignment is bound in time. Between date 1 and 2 it could be group A, between date 2 and 3 it could be group B.
Occupancy: The main scheduling record. This has a timeslot_id, kid_id, start and end date. note: a kid is scheduled on the start day and every subsequent 7 days up to the end date.
DB Schema SQL
The number of records can be roughly derived from the auto_increment values. Where these are not present, I have noted the counts manually.
CREATE TABLE `group_assignment_caches` (
`group_id` int(11) DEFAULT NULL,
`occupancy_id` int(11) DEFAULT NULL,
`start` date DEFAULT NULL,
`end` date DEFAULT NULL,
KEY `index_group_assignment_caches_on_occupancy_id` (`occupancy_id`),
KEY `index_group_assignment_caches_on_group_id` (`group_id`),
KEY `index_group_assignment_caches_on_start_and_end` (`start`,`end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/* (~1500 records) */
CREATE TABLE `kids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`archived` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=592 DEFAULT CHARSET=utf8;
CREATE TABLE `occupancies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kid_id` int(11) DEFAULT NULL,
`timeslot_id` int(11) DEFAULT NULL,
`start` date DEFAULT NULL,
`end` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_occupancies_on_kid_id` (`kid_id`),
KEY `index_occupancies_on_timeslot_id` (`timeslot_id`),
KEY `index_occupancies_on_start_and_end` (`start`,`end`)
) ENGINE=InnoDB AUTO_INCREMENT=2675 DEFAULT CHARSET=utf8;
CREATE TABLE `time_slices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` time DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_time_slices_on_start` (`start`)
) ENGINE=InnoDB AUTO_INCREMENT=97 DEFAULT CHARSET=latin1;
CREATE TABLE `timeslot_slices` (
`timeslot_id` int(11) DEFAULT NULL,
`time_slice_id` int(11) DEFAULT NULL,
KEY `index_timeslot_slices_on_timeslot_id` (`timeslot_id`),
KEY `index_timeslot_slices_on_time_slice_id` (`time_slice_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/* (~1500 records) */
CREATE TABLE `timeslots` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` time DEFAULT NULL,
`end` time DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=91 DEFAULT CHARSET=utf8;
Current solution
So far, I have designed the following query to tie it all together. While it does work, it scales very poorly. Running the query with 1 date, 1 timeslot and 1 group takes about 50ms. However, with 100 dates this becomes 1000ms, and when you start adding groups and timeslots it quickly rises into multiple seconds. I've noticed that the runtime depends heavily on the size of the timeslot: when a timeslot covers more time slices, the runtime escalates rapidly!
SELECT subq.date, subq.group_id, subq.timeslot_id, MAX(subq.spots) AS max_spots
FROM (
SELECT di.date,
ts.start,
gac.group_id AS group_id,
tss2.timeslot_id AS timeslot_id,
COUNT(*) AS spots
FROM date_intervals di,
timeslot_slices tss2,
occupancies o
JOIN timeslots t ON o.timeslot_id = t.id
JOIN group_assignment_caches gac ON o.id = gac.occupancy_id
JOIN timeslot_slices tss1 ON t.id = tss1.timeslot_id
JOIN time_slices ts ON tss1.time_slice_id = ts.id
JOIN kids k ON o.kid_id = k.id
WHERE di.date BETWEEN gac.start AND gac.end
AND di.date BETWEEN o.start AND o.end
AND MOD(DATEDIFF(di.date, o.start),7)=0
AND k.archived = 0
AND tss1.time_slice_id = tss2.time_slice_id
AND gac.group_id IN (3) AND tss2.timeslot_id IN (5)
GROUP BY ts.start, di.date, group_id, timeslot_id
) subq
GROUP BY subq.date, subq.group_id, subq.timeslot_id
Note that running the derived subquery separately takes the same amount of time. It yields one record with the number of occupancies per time slice (15 min) for the given group in the given timeslot, which is great for debugging. Obviously, I am only interested in the maximum number of occupancies for the entire timeslot.
date_intervals is not described in the schema. It is a temporary table that I fill using a REPEAT statement at the beginning of this procedure call. Its only column is date, and it generally holds 10-300 dates in most situations. The query should be able to handle this.
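For reference, a sketch of how such a procedure could build the table (the procedure name, parameters, and the weekly step are illustrative; the question only states that a REPEAT statement is used):
DELIMITER //
CREATE PROCEDURE fill_date_intervals(p_start DATE, p_end DATE)
BEGIN
  DECLARE d DATE DEFAULT p_start;
  DROP TEMPORARY TABLE IF EXISTS date_intervals;
  CREATE TEMPORARY TABLE date_intervals (
    `date` DATE NOT NULL,
    PRIMARY KEY (`date`)
  );
  REPEAT
    INSERT INTO date_intervals VALUES (d);
    SET d = d + INTERVAL 7 DAY;  -- weekly step is an assumption
  UNTIL d > p_end END REPEAT;
END //
DELIMITER ;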
If I EXPLAIN this query, I get the following results. I am not really sure how to go on from here. The first row, about the derived table, can be ignored, since executing the subquery on its own takes the same amount of time. The only other table not using an index is date_intervals di, which is a small temporary table with 122 records.
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5124 | Using temporary; Using filesort |
| 2 | DERIVED | tss2 | ref | index_timeslot_slices_on_timeslot_id,index_timeslot_slices_on_time_slice_id | index_timeslot_slices_on_timeslot_id | 5 | | 42 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | ts | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.tss2.time_slice_id | 1 | |
| 2 | DERIVED | tss1 | ref | index_timeslot_slices_on_timeslot_id,index_timeslot_slices_on_time_slice_id | index_timeslot_slices_on_time_slice_id | 5 | ookidoo.tss2.time_slice_id | 6 | Using where |
| 2 | DERIVED | o | ref | PRIMARY,index_occupancies_on_timeslot_id,index_occupancies_on_kid_id,index_occupancies_on_start_and_end | index_occupancies_on_timeslot_id | 5 | ookidoo.tss1.timeslot_id | 6 | Using where |
| 2 | DERIVED | k | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.o.kid_id | 1 | Using where |
| 2 | DERIVED | gac | ref | index_group_assignment_caches_on_occupancy_id,index_group_assignment_caches_on_start_and_end,index_group_assignment_caches_on_group_id | index_group_assignment_caches_on_occupancy_id | 5 | ookidoo.o.id | 1 | Using where |
| 2 | DERIVED | di | range | PRIMARY | PRIMARY | 3 | NULL | 1 | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | t | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.o.timeslot_id | 1 | Using where; Using index |
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
Current results
The above query yields the following results (122 records, abbreviated)
+------------+----------+-------------+-----------+
| date       | group_id | timeslot_id | max_spots |
+------------+----------+-------------+-----------+
| 2012-08-20 |        3 |           5 |        12 |
| 2012-08-27 |        3 |           5 |        12 |
| 2012-09-03 |        3 |           5 |        12 |
| 2012-09-10 |        3 |           5 |        12 |
| ...        | ...      | ...         | ...       |
| 2014-11-24 |        3 |           5 |        15 |
| 2014-12-01 |        3 |           5 |        15 |
| 2014-12-08 |        3 |           5 |        15 |
| 2014-12-15 |        3 |           5 |        15 |
+------------+----------+-------------+-----------+
Wrapping up
I would like to know a way to restructure my query or even my database schema to make querying this information less time consuming. I can't imagine this being impossible, considering how relatively few records are present in this database (10-1000s for most tables).
Any sufficiently complex problem can bring a computer to its knees. Actually, it's easy to create a complex problem, and difficult to make a complex problem easy.
Your single query is very complex. It goes over the entire database. Is that necessary? What happens if, for instance, you restrict it to one date? Does it scale better?
Using just a single query to do a complex task is often very efficient, but not always, as you've found out. I often find that the only way to break the exponential time needed to execute the task is to split it up into multiple steps. One date at a time, for instance. Perhaps you don't always need them all?
In some of those cases I use an intermediate SQLite database that resides in memory. Operations on a small (!) temporary database in memory are very fast. It works like this:
$SQLiteDB = new PDO("sqlite::memory:");
$SQLiteDB->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$SQL = "<any valid sqlite query>";
$SQLiteDB->query($SQL);
First check that you have the sqlite PHP module installed. Read the manual:
http://www.sqlite.org
When using this you first create tables in your new database and then you populate them with the needed data. You can use prepared statements if you have to copy multiple rows.
The tricky bit is taking apart your single complex query. How you would do that depends on the exact question you want to answer. The art is to limit the amount of data you have to work with. Don't copy the whole database, but make an informed selection.
A big advantage of taking multiple smaller steps is that your code may become much more readable and understandable. I wouldn't want to be the guy who has to change your SQL query ten years from now, after you've moved on to other things.
I have found a solution which is acceptable for my particular use case.
I have created an intermediate or 'cache' table with the following structure:
CREATE TABLE `occupancy_caches` (
`occupancy_id` int(11) DEFAULT NULL,
`kid_id` int(11) DEFAULT NULL,
`group_id` int(11) DEFAULT NULL,
`client_id` int(11) DEFAULT NULL,
`date` date DEFAULT NULL,
`timeslot_id` int(11) DEFAULT NULL,
`start` int(11) DEFAULT NULL,
`end` int(11) DEFAULT NULL,
KEY `index_occupancy_caches_on_date_and_client_id` (`date`,`client_id`),
KEY `index_occupancy_caches_on_date_and_group_id` (`date`,`group_id`),
KEY `index_occupancy_caches_on_occupancy_id` (`occupancy_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This allowed me to completely eliminate the group_assignment_caches table, and I no longer have to search for dates using calculated expressions (MOD(DATEDIFF...)). Also, I only need a single join on the time slices instead of two.
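A hedged sketch of the simplified count this enables; the final query is not shown here, and the join condition assumes the start and end columns hold time slice ids:
SELECT oc.date, oc.group_id, oc.timeslot_id, ts.start, COUNT(*) AS spots
FROM occupancy_caches oc
INNER JOIN time_slices ts ON ts.id BETWEEN oc.start AND oc.end  -- single slice join
INNER JOIN kids k ON k.id = oc.kid_id
WHERE oc.date IN (SELECT `date` FROM date_intervals)
  AND oc.group_id IN (3)
  AND oc.timeslot_id IN (5)
  AND k.archived = 0
GROUP BY oc.date, oc.group_id, oc.timeslot_id, ts.start;
The MAX(spots) aggregation per date/group/timeslot would then be applied on top, as in the original query.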
The downside, however, is that I now have to create an occupancy_caches record for every week covered by the original occupancies record. In most cases these occupancies describe a 4-year period. This means that for every occupancies record I now have to create 400 (!) records... Since the number of records will only grow linearly, correct usage of indexes should keep this from spinning out of control as the system grows.
Time will tell, though...

Why is a COUNT() query from a large table much faster than SUM()?

I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- this is the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- this is the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- this is the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and has the following fields
id
month
year
customer
This has about 23,000 records and the following fields
id
number //This field is unique
name //This is simply a description field
The following query runs very fast (less than 1 second) and returns a count of about 2,000:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query, which is the same as the previous one except that it sums instead of counts, is much slower (more than 45 seconds):
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I explain each query, the ONLY difference I see is that on the 'count()'
query the 'Extra' field says 'Using index', while for the 'sum()' query this field is NULL.
Explain count() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | Using index |
Explain sum() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | NULL |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables show that they are using the InnoDB engine.
Also, as a side note, a plain 'SELECT *' query runs very quickly (less than 2 seconds). I would expect 'SUM()' to take no longer than that, since SELECT * has to retrieve the rows anyway...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, it has to retrieve the records from the hard drive (which can be kind of slow).
I'm not too familiar with this, but it looks like I/O performance can be increased by switching to an SSD (solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
Thanks everybody!
The index is, in effect, a list of just the key values.
When you do the count() query the actual data from the database can be ignored and just the index used.
When you do the sum(sales) query, then each row has to be read from disk to get the sales figure, hence much slower.
Additionally, the indexes can be read in bulk and then processed in memory, while the row access will be randomly thrashing the drive, trying to read rows from across the disk.
Finally, the index itself may have summaries of the counts (to help with the plan generation)
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes on the columns id, pid, cid, iid. (As an aside, most databases are smart enough to combine indexes, so you could probably consolidate your indexes somewhat.)
If you added another key like KEY idx_sales(id,sales), that could improve this query's performance, but since sales is a frequently changing numeric value, you would be adding extra cost to every update, which is likely a bad thing.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.
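For instance, extending idx_pci into a covering index for the summed columns, which is what the asker ultimately did:
ALTER TABLE main DROP INDEX idx_pci,
                 ADD INDEX idx_pci (pid, cid, iid, sales, gm, qty);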

Estimate/speed up huge table self-join on MySQL

I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has already been running for more than a day, which is too long; a few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it is still too long, add a condition to split the request on an indexed field:
WHERE messagedetails_id < 100000
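That is, a sketch of the same query run chunk by chunk (the 100000 bound is just the first chunk; repeat with subsequent ranges):
SELECT messagedetails_id, COUNT(*) c
FROM messageline
WHERE messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;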
Does it have to be done solely in SQL? Because for such a number of records you would be better off breaking this down into 2 steps:
First run the following query
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) AS messagedetails_ids
FROM messageline
GROUP BY hash
HAVING cnt > 1
ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate ids for each hash, and since you have an index on the hash field, the grouping will be relatively fast. Now, by counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Use a script to check each record of the duplicate_hashes table.
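If the end goal is still the original foundline table, it can be rebuilt from the much smaller duplicate_hashes table with an indexed join; a sketch:
CREATE TABLE foundline AS
SELECT ml.messageDetails_id, ml.hash, ml.quoteLevel
FROM messageline ml
INNER JOIN duplicate_hashes dh ON dh.hash = ml.hash;  -- hash_idx keeps the probe side indexed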