In a scheduling application I am working on, I deal with a fairly complex database schema that describes kids assigned to groups in timeslots on certain dates. Given this schema, I want to query the number of kids scheduled on a certain group for a certain timeslot over a certain range of dates.
DB Schema
Timeslot: A timeslot has a start and end time (e.g. 13:00 - 18:00). Times vary in 15-minute steps. In our application we schedule a kid on a group for the duration of a timeslot.
Time slice: For every 15 minutes in a 24-hour period there is a time slice record (96 in total); 15 minutes is the smallest possible planning unit. A timeslot is linked to each slice covered between its start and end time (for example, timeslot 13:00-18:00 has a record pointing to each of the slices 13:00, 13:15, 13:30 ... 17:45). This makes it possible to count how many kids are 'occupying' the same time slice at any given time and date.
Kid: A kid is simply the entity being scheduled.
Group: A group represents a physical location with a specific capacity.
GroupAssignment: A group assignment is bounded in time: between date 1 and date 2 it could be group A, between date 2 and date 3 group B.
Occupancy: The main scheduling record. It has a timeslot_id, a kid_id, and start and end dates. Note: a kid is scheduled on the start date and then every 7 days thereafter, up to the end date (see the example below).
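For example, the weekly pattern means a date is a scheduled day for an occupancy exactly when it lies between the occupancy's start and end dates and a whole number of weeks has passed since the start. In MySQL terms (this mirrors the MOD(DATEDIFF(...)) predicate used in the query further down):

SELECT MOD(DATEDIFF('2012-08-27', '2012-08-20'), 7) = 0 AS scheduled; -- 1: exactly one week after the start
SELECT MOD(DATEDIFF('2012-08-25', '2012-08-20'), 7) = 0 AS scheduled; -- 0: five days in, not a scheduled day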
DB Schema SQL
The number of records can be roughly derived from the AUTO_INCREMENT value; where that is absent, I have noted the count manually.
CREATE TABLE `group_assignment_caches` (
`group_id` int(11) DEFAULT NULL,
`occupancy_id` int(11) DEFAULT NULL,
`start` date DEFAULT NULL,
`end` date DEFAULT NULL,
KEY `index_group_assignment_caches_on_occupancy_id` (`occupancy_id`),
KEY `index_group_assignment_caches_on_group_id` (`group_id`),
KEY `index_group_assignment_caches_on_start_and_end` (`start`,`end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/* (~1500 records) */
CREATE TABLE `kids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`archived` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=592 DEFAULT CHARSET=utf8;
CREATE TABLE `occupancies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kid_id` int(11) DEFAULT NULL,
`timeslot_id` int(11) DEFAULT NULL,
`start` date DEFAULT NULL,
`end` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_occupancies_on_kid_id` (`kid_id`),
KEY `index_occupancies_on_timeslot_id` (`timeslot_id`),
KEY `index_occupancies_on_start_and_end` (`start`,`end`)
) ENGINE=InnoDB AUTO_INCREMENT=2675 DEFAULT CHARSET=utf8;
CREATE TABLE `time_slices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` time DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_time_slices_on_start` (`start`)
) ENGINE=InnoDB AUTO_INCREMENT=97 DEFAULT CHARSET=latin1;
CREATE TABLE `timeslot_slices` (
`timeslot_id` int(11) DEFAULT NULL,
`time_slice_id` int(11) DEFAULT NULL,
KEY `index_timeslot_slices_on_timeslot_id` (`timeslot_id`),
KEY `index_timeslot_slices_on_time_slice_id` (`time_slice_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/* (~1500 records) */
CREATE TABLE `timeslots` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` time DEFAULT NULL,
`end` time DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=91 DEFAULT CHARSET=utf8;
Current solution
So far, I have designed the following query to tie it all together. While it does work, it scales very poorly. With 1 date, 1 timeslot and 1 group the query takes about 50 ms. With 100 dates this becomes 1000 ms, and once you start adding groups and timeslots the runtime quickly climbs into multiple seconds. I've noticed that the runtime depends heavily on the size of the timeslot: when a timeslot covers more time slices, the runtime escalates rapidly!
SELECT subq.date, subq.group_id, subq.timeslot_id, MAX(subq.spots) AS max_spots
FROM (
SELECT di.date,
ts.start,
gac.group_id AS group_id,
tss2.timeslot_id AS timeslot_id,
COUNT(*) AS spots
FROM date_intervals di,
timeslot_slices tss2,
occupancies o
JOIN timeslots t ON o.timeslot_id = t.id
JOIN group_assignment_caches gac ON o.id = gac.occupancy_id
JOIN timeslot_slices tss1 ON t.id = tss1.timeslot_id
JOIN time_slices ts ON tss1.time_slice_id = ts.id
JOIN kids k ON o.kid_id = k.id
WHERE di.date BETWEEN gac.start AND gac.end
AND di.date BETWEEN o.start AND o.end
AND MOD(DATEDIFF(di.date, o.start),7)=0
AND k.archived = 0
AND tss1.time_slice_id = tss2.time_slice_id
AND gac.group_id IN (3) AND tss2.timeslot_id IN (5)
GROUP BY ts.start, di.date, group_id, timeslot_id
) subq
GROUP BY subq.date, subq.group_id, subq.timeslot_id
Note that running the derived subquery on its own takes the same amount of time. It yields one record with the number of occupancies for each 15-minute time slice for the given group in the given timeslot, which is great for debugging. Obviously, I am only interested in the maximum number of occupancies over the entire timeslot.
date_intervals is not described in the schema above. It is a temporary table that I fill using a REPEAT statement at the beginning of this procedure call. Its only column is date, and it generally holds 10-300 dates. The query should be able to handle this.
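For reference, here is a minimal sketch of how such a procedure might fill the table one day at a time; the procedure name and parameters are assumptions, not the actual code:

DELIMITER //
CREATE PROCEDURE fill_date_intervals(IN p_start DATE, IN p_end DATE)
BEGIN
  DECLARE d DATE DEFAULT p_start;
  -- one row per date; the PRIMARY KEY matches the EXPLAIN output below
  CREATE TEMPORARY TABLE IF NOT EXISTS date_intervals (date DATE NOT NULL PRIMARY KEY);
  REPEAT
    INSERT INTO date_intervals (date) VALUES (d);
    SET d = d + INTERVAL 1 DAY;
  UNTIL d > p_end END REPEAT;
END //
DELIMITER ;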
If I EXPLAIN this query, I get the following results. I am not really sure how to go further from here. The first row, about the derived table, can be ignored, since executing the subquery on its own takes the same amount of time. The only other table not using an index is date_intervals (di), which is a small temporary table with 122 records.
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5124 | Using temporary; Using filesort |
| 2 | DERIVED | tss2 | ref | index_timeslot_slices_on_timeslot_id,index_timeslot_slices_on_time_slice_id | index_timeslot_slices_on_timeslot_id | 5 | | 42 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | ts | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.tss2.time_slice_id | 1 | |
| 2 | DERIVED | tss1 | ref | index_timeslot_slices_on_timeslot_id,index_timeslot_slices_on_time_slice_id | index_timeslot_slices_on_time_slice_id | 5 | ookidoo.tss2.time_slice_id | 6 | Using where |
| 2 | DERIVED | o | ref | PRIMARY,index_occupancies_on_timeslot_id,index_occupancies_on_kid_id,index_occupancies_on_start_and_end | index_occupancies_on_timeslot_id | 5 | ookidoo.tss1.timeslot_id | 6 | Using where |
| 2 | DERIVED | k | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.o.kid_id | 1 | Using where |
| 2 | DERIVED | gac | ref | index_group_assignment_caches_on_occupancy_id,index_group_assignment_caches_on_start_and_end,index_group_assignment_caches_on_group_id | index_group_assignment_caches_on_occupancy_id | 5 | ookidoo.o.id | 1 | Using where |
| 2 | DERIVED | di | range | PRIMARY | PRIMARY | 3 | NULL | 1 | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | t | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.o.timeslot_id | 1 | Using where; Using index |
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
Current results
The above query yields the following results (122 records, abbreviated):
+------------+----------+-------------+-----------+
| date | group_id | timeslot_id | max_spots |
+------------+----------+-------------+-----------+
| 2012-08-20 | 3 | 5 | 12 |
| 2012-08-27 | 3 | 5 | 12 |
| 2012-09-03 | 3 | 5 | 12 |
| 2012-09-10 | 3 | 5 | 12 |
+------------+----------+-------------+-----------+
| 2014-11-24 | 3 | 5 | 15 |
| 2014-12-01 | 3 | 5 | 15 |
| 2014-12-08 | 3 | 5 | 15 |
| 2014-12-15 | 3 | 5 | 15 |
+------------+----------+-------------+-----------+
Wrapping up
I would like to know a way to restructure my query, or even my database schema, in order to make querying this information less time consuming. I can't imagine this being impossible, considering how relatively few records this database holds (tens to thousands for most tables).
Any sufficiently complex problem can bring a computer to its knees. Actually, it's easy to create a complex problem, and difficult to make a complex problem easy.
Your single query is very complex. It goes over the entire database. Is that necessary? What happens if, for instance, you restrict it to one date? Does it scale better?
Using just a single query to do a complex task is often very efficient, but not always, as you've found out. I often find that the only way to break the exponential time needed to execute the task, is to split it up in multiple steps. One date at a time, for instance. Perhaps you don't always need them all?
In some of those cases I use an intermediate SQLite database that resides in memory. Operations on a small (!) temporary database in memory are very fast. It works like this:
$SQLiteDB = new PDO("sqlite::memory:");                              // in-memory database, discarded when the connection closes
$SQLiteDB->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);  // throw exceptions on SQL errors
$SQL = "<any valid sqlite query>";
$SQLiteDB->query($SQL);
First check that you have the sqlite PHP module installed. Read the manual:
http://www.sqlite.org
When using this you first create tables in your new database and then you populate them with the needed data. You can use prepared statements if you have to copy multiple rows.
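For instance, the in-memory database could hold a small staging table like this (table and column names are illustrative only); the INSERT is prepared once and then executed for each row copied over from MySQL:

CREATE TABLE occupancy_staging (
  date        TEXT    NOT NULL,  -- ISO-8601 date string
  group_id    INTEGER NOT NULL,
  timeslot_id INTEGER NOT NULL,
  spots       INTEGER NOT NULL
);
-- prepared once, executed once per copied row:
INSERT INTO occupancy_staging (date, group_id, timeslot_id, spots) VALUES (?, ?, ?, ?);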
The tricky bit is taking apart your single complex query. How you would do that depends on the exact question you want to answer. The art is to limit the amount of data you have to work with. Don't copy the whole database, but make an informed selection.
A big advantage of taking multiple smaller steps is that your code may become much more readable, and understandable. I wouldn't want to be the guy who has to change your SQL query ten years from now because you went on to other things.
I have found a solution which is acceptable for my particular use case.
I have created an intermediate or 'cache' table with the following structure:
CREATE TABLE `occupancy_caches` (
`occupancy_id` int(11) DEFAULT NULL,
`kid_id` int(11) DEFAULT NULL,
`group_id` int(11) DEFAULT NULL,
`client_id` int(11) DEFAULT NULL,
`date` date DEFAULT NULL,
`timeslot_id` int(11) DEFAULT NULL,
`start` int(11) DEFAULT NULL,
`end` int(11) DEFAULT NULL,
KEY `index_occupancy_caches_on_date_and_client_id` (`date`,`client_id`),
KEY `index_occupancy_caches_on_date_and_group_id` (`date`,`group_id`),
KEY `index_occupancy_caches_on_occupancy_id` (`occupancy_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This allowed me to eliminate the group_assignment_caches table completely: I no longer have to search for dates using computed expressions (MOD(DATEDIFF(...))), and I only need a single join on the time slices instead of two.
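Assuming the start and end columns store the first and last covered time slice ids, the per-date query reduces to something along these lines (a sketch, not the exact production query):

SELECT sub.date, sub.group_id, sub.timeslot_id, MAX(sub.spots) AS max_spots
FROM (
  SELECT oc.date, oc.group_id, oc.timeslot_id, ts.id, COUNT(*) AS spots
  FROM occupancy_caches oc
  JOIN time_slices ts ON ts.id BETWEEN oc.start AND oc.end  -- the single slice join
  WHERE oc.date BETWEEN '2012-08-20' AND '2012-12-31'
    AND oc.group_id IN (3)
    AND oc.timeslot_id IN (5)
  GROUP BY oc.date, oc.group_id, oc.timeslot_id, ts.id
) sub
GROUP BY sub.date, sub.group_id, sub.timeslot_id;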
The downside, however, is that I now have to create an occupancy_caches record for every week covered by the original occupancies record. In most cases these occupancies describe a 4-year period, which means that for every occupancies record I now have to create 400 (!) records... Since the number of records will only grow linearly, correct usage of indexes should keep this from spinning out of control as the system grows.
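The population step could then look roughly like this, reusing the date_intervals trick to expand each occupancy into one row per scheduled week (again a sketch; the slice-range derivation is an assumption):

INSERT INTO occupancy_caches
  (occupancy_id, kid_id, group_id, date, timeslot_id, start, end)
SELECT o.id, o.kid_id, gac.group_id, di.date, o.timeslot_id,
       MIN(tss.time_slice_id), MAX(tss.time_slice_id)
FROM occupancies o
JOIN group_assignment_caches gac ON gac.occupancy_id = o.id
JOIN timeslot_slices tss ON tss.timeslot_id = o.timeslot_id
JOIN date_intervals di
  ON di.date BETWEEN o.start AND o.end
 AND di.date BETWEEN gac.start AND gac.end
 AND MOD(DATEDIFF(di.date, o.start), 7) = 0  -- one row per scheduled (weekly) date
GROUP BY o.id, o.kid_id, gac.group_id, di.date, o.timeslot_id;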
Time will tell, though...
Related
I'm faced with a MySQL database containing an events table with ~70 million rows; it has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The following query is fine on an events table with a few thousand rows and demonstrates the desired outcome:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The table isn't usefully indexed for this query, and although I've tried indexing with things like alter table events add index idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort runs past the 10-minute mark, so I don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (INSERT ... ON DUPLICATE KEY UPDATE, i.e. IODKU). However, I'm then also up against creating rows per day in the summary table based on what the database considers a day (UTC), which may not match the application's "day" due to the timezone offset.
Short of converting the start_time column to a TIMESTAMP type (which would then be inconsistent with all the other date types in the database), is there any way around this, or any other optimization I could make to the original event table that would result in a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various orderings of columns to test indexes and found idx_event_venueid_classid_starttime (below) to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, which uses JOIN syntax and now uses CONVERT_TZ to convert from UTC to the timezone required for reporting, then groups by the date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query now takes ~1m 20s to run, so I figured I could prepend it with an INSERT INTO to populate a summary table, with the dates being timezone specific, and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
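The nightly job could then be an INSERT ... SELECT restricted to the previous day's rows, using IODKU so that late-arriving events update the existing aggregates. A sketch, with the one-day window being an assumption:

INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
    class.name,
    client.name,
    venue.name,
    COUNT(class.id),
    SUM(class.duration)
FROM event
JOIN class ON class.id = event.class_id
JOIN venue ON venue.id = event.venue_id
JOIN client ON client.id = class.client_id
WHERE event.start_time >= UTC_DATE() - INTERVAL 1 DAY  -- assumed window: yesterday's rows only
  AND event.start_time < UTC_DATE()
GROUP BY tz_date, class.name, client.name, venue.name
ON DUPLICATE KEY UPDATE
    occurrences = occurrences + VALUES(occurrences),
    duration = duration + VALUES(duration);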
To then generate the reports from the summary table it takes a fraction of a second (shown below with data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. It needs to scan the entire event table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a summary table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and the query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
I have a MySQL table where all status changes are recorded. I want to be able to query the status of all items on a specific date, or the last date for all items. The table I have now is:
CREATE TABLE `tra_rel_sta` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tra_id` int(11) DEFAULT NULL,
`sta_id` int(11) DEFAULT NULL,
`changed_on` datetime DEFAULT NULL,
`changed_by` int(11) DEFAULT NULL,
`comments` text,
PRIMARY KEY (`id`),
KEY `tra_id` (`tra_id`),
KEY `rel` (`tra_id`,`sta_id`,`changed_on`),
KEY `sta_id` (`sta_id`),
KEY `changed_on` (`changed_on`),
KEY `tra_changed` (`tra_id`,`changed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=51734 DEFAULT CHARSET=utf8;
(I know I'm probably overdoing the indexes, but I haven't exactly figured out how to optimize indexes yet).
The query I'm using now, which works is:
SELECT rel.changed_on, rel.changed_by, rel.tra_id, sta.id AS sta_id, sta.status, sta.description, sta.onHold, sta.awaitingApproval, sta.approved, sta.complete, sta.locked
FROM (
SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
GROUP BY tra_id
) AS rec
LEFT JOIN tra_rel_sta AS rel ON rel.changed_on = rec.lst AND rel.tra_id = rec.tra_id
LEFT JOIN tra_status AS sta ON sta.id = rel.sta_id
If I want to use a specific date, I insert a WHERE statement in the sub-query.
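For example, to get the status as of a specific date, the sub-query becomes (the cutoff date is just an illustration):

SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
WHERE changed_on <= '2015-06-30 23:59:59'
GROUP BY tra_id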
This works, but it takes about 0.65 seconds to run in PHP with about 51,733 records in the table. This query is used as a subquery in several others when I need to know the last status of an object and, as a result, is slowing down many parts of the application.
I've tried to use a sub query in the WHERE statement as described in MySQL: how to select record with latest date before a certain date but it takes almost twice as long. I've tried using a JOIN statement as described in MySQL select of record with latest date but I'm getting about the same or just slightly slower results.
How can I optimize this query or fix my indexes to make this more effective?
Thanks!!
As requested, EXPLAIN of query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---|-------------|-------------|--------|-----------------------------------|---------|---------|-------------------|-------|-------------
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 49931 | NULL
1 | PRIMARY | rel | ref | tra_id,rel,changed_on,tra_changed | tra_id | 5 | rec.tra_id | 1 | Using where
1 | PRIMARY | sta | eq_ref | PRIMARY | PRIMARY | 4 | csinfo.rel.sta_id | 1 | NULL
2 | DERIVED | tra_rel_sta | index | tra_id,rel,tra_changed | tra_id | 5 | NULL | 49931 | NULL
I have read different links like http://goo.gl/1nr3s2, http://goo.gl/gv4Vlc and other stackoverflow questions, but none of them help me with this problem.
This problem involves multiple tables, but the EXPLAIN output helped me identify that the range join is the main problem with the query.
First I need to explain that I have this table with the following sample data (I will leave ids out of every table to simplify things):
+-------+----------+----------------+--------------+---------------+----------------+
| marca | submarca | modelo_inicial | modelo_final | motor | texto_articulo |
+-------+----------+----------------+--------------+---------------+----------------+
| Buick | Century | 1993 | 1996 | 4 Cil 2.2 Lts | BE1254AG4 |
| Buick | Century | 1993 | 1996 | 4 Cil 2.2 Lts | 854G4 |
+-------+----------+----------------+--------------+---------------+----------------+
This table has more than 1.5 million rows. I have created an index that combines the initial and final model columns into one, and each of those columns also has its own independent index, as in this structure:
CREATE TABLE `general` (
`id_general` int(11) NOT NULL AUTO_INCREMENT,
`id_marca_submarca` int(11) NOT NULL,
`id_modelo_inicial` int(11) NOT NULL,
`id_modelo_final` int(11) NOT NULL,
`id_motor` int(11) NOT NULL,
`id_articulo` int(11) NOT NULL,
PRIMARY KEY (`id_general`),
KEY `fk_general_articulo` (`id_articulo`),
KEY `modelo_inicial_final` (`id_modelo_inicial`,`id_modelo_final`),
KEY `indice_motor` (`id_motor`),
KEY `indice_marca_submarca` (`id_marca_submarca`),
KEY `indice_modelo_inicial` (`id_modelo_inicial`),
KEY `indice_modelo_final` (`id_modelo_final`),
CONSTRAINT `fk_general_articulo` FOREIGN KEY (`id_articulo`) REFERENCES `articulo` (`id_articulo`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1191853 DEFAULT CHARSET=utf8
I have another table that contains different years like this sample data:
+-----------+--------------+
| id_modelo | texto_modelo |
+-----------+--------------+
| 76 | 2014 |
| 75 | 2013 |
............................
| 1 | 1939 |
+-----------+--------------+
I created a query containing a subquery to obtain the specific data, but it took a lot of time. I will show some of the queries I have tried; none of them have worked properly for me.
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular.modelo M ON G.id_modelo_inicial <= M.id_modelo AND G.id_modelo_final >= M.id_modelo
WHERE EXISTS
(
SELECT DISTINCT A.id_articulo
...subquery...
WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
This query took many seconds, so I used EXPLAIN on it; the report identified the range join as the main problem.
This is another query I tried.
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular_rigs.modelo M ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
SELECT DISTINCT A.id_articulo
...subquery WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
Some operations you could do to change the query plan:
OP1: Get rid of all the keys or indexes in table general.
OP2: Use SELECT 1 instead of SELECT DISTINCT A.id_articulo in the sub query in EXISTS.
Do these operations separately, compare the differences.
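For OP2, the EXISTS clause of the second query would become the following (the elided subquery body is kept as in the original):

SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular_rigs.modelo M ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
    SELECT 1
    ...subquery WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;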
I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and the following fields:
id
month
year
customer
This has about 23,000 records and the following fields:
id
number //This field is unique
name //This is simply a description field
The following query runs very fast (less than 1 second) and returns about 2,000 rows:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query is much slower (more than 45 seconds); it is the same as the previous one, but sums instead of counts:
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I explain each query, the ONLY difference I see is that for the count() query the 'Extra' field says 'Using index', while for the sum() query this field is NULL.
Explain count() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | Using index |
Explain sum() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | NULL |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables show that they are using the InnoDB engine.
Also, as a side note, if I just do a SELECT * query, it runs very quickly (less than 2 seconds). I would expect the SUM() to take no longer than that, since SELECT * has to retrieve the rows anyway...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, MySQL has to retrieve the records from the hard drive (which can be kind of slow).
I'm not too familiar with this, but it looks like I/O performance can be increased by switching to an SSD (solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
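For reference, that redefinition would look something like this (the index name is an assumption):

ALTER TABLE main
  DROP INDEX idx_pci,
  ADD INDEX idx_pci_covering (pid, cid, iid, sales, gm, qty);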
Thanks everybody!
The index is essentially a sorted list of the key column values.
When you do the count() query, the actual data in the database can be ignored and just the index used.
When you do the sum(sales) query, each row has to be read from disk to get the sales figure, hence much slower.
Additionally, the index can be read in bulk and then processed in memory, while the row lookups will randomly thrash the drive trying to read rows from across the disk.
Finally, the index itself may hold summaries of the counts (to help with plan generation).
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes on the columns id, pid, cid, iid. (As an aside, most databases are smart enough to combine indexes, so you could probably optimize your indexes somewhat)
If you added another key like KEY idx_sales(id,sales), that could improve performance; but given the likely numeric distribution of sales values, you would be adding extra cost for updates, which is likely a bad thing.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.
I am executing the query below to get daily user sign-in details from my table. I have posted the sample table and the query I am executing.
Problem
My query does not execute as a range query; it examines the whole table, which makes it slow. Indexing the TIMESTAMP column is not useful because there is no major difference between the millisecond timestamps. I also couldn't add an index to the TIMESTAMP column after the fact, because my product is in a production setup and the table will contain millions of rows. How can I execute this query as a range query, or is there any better solution?
MySQL Table
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SignInDetails | CREATE TABLE `SignInDetails` (
`USER_ID` bigint(20) DEFAULT NULL,
`UserName` char(200) DEFAULT NULL,
`TIMESTAMP` bigint(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Query
select USER_ID,TIMESTAMP from SignInDetails where TIMESTAMP
between UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY)*1000 and
UNIX_TIMESTAMP(CURRENT_DATE())*1000
Explain Output
+----+-------------+---------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | SignInDetails | ALL | NULL | NULL | NULL | NULL | 21 | Using where |
+----+-------------+---------------+------+---------------+------+---------+------+------+-------------+
Total rows in the table
21
Your desc output doesn't indicate that TIMESTAMP is an indexed column. Can you provide the DDL, so that we can see that you are actually creating an index for it?
Alternatively, you could use 2 columns to store the timestamp with millisecond resolution: one used for searching and the other for the extra resolution. I can imagine these combinations: datetime/msec, date/bigint-msec, datetime/bigint-msec. The actual combination will depend on which kinds of queries are used most frequently.
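As a sketch of the datetime/msec variant (the new column names are assumptions): the searchable column gets the index, the extra column keeps the millisecond part, and the daily query becomes a plain index range scan:

CREATE TABLE SignInDetails (
  `USER_ID` bigint(20) DEFAULT NULL,
  `UserName` char(200) DEFAULT NULL,
  `SIGNIN_AT` datetime NOT NULL,                     -- second resolution, used for searching
  `SIGNIN_MS` smallint unsigned NOT NULL DEFAULT 0,  -- extra millisecond resolution
  KEY `idx_signin_at` (`SIGNIN_AT`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

SELECT USER_ID, SIGNIN_AT, SIGNIN_MS
FROM SignInDetails
WHERE SIGNIN_AT >= CURRENT_DATE() - INTERVAL 1 DAY
  AND SIGNIN_AT < CURRENT_DATE();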