MySQL huge table query optimization with GROUP BY

I have a huge table with about 40 million rows (GPS tracker positions), recorded every 10 seconds from multiple devices inside the company. I want to select only the first row of every minute, so I used GROUP BY. The problem is that the table grows every 10 seconds. I've tried almost everything and googled for many hours, so I decided to ask a question.
I'm using MySQL 5.7.11 with a 50GB InnoDB buffer pool; the server is a Xeon X5650 with 64GB RAM.
Table structure:
CREATE TABLE `eventData` (
`id` bigint(20) NOT NULL,
`position` point NOT NULL,
`speed` decimal(6,2) DEFAULT NULL,
`time` datetime DEFAULT NULL,
`device_id` int(9) DEFAULT NULL,
`processed` tinyint(1) NOT NULL DEFAULT '0',
`time_m` datetime GENERATED ALWAYS AS ((`time` - interval second(`time`) second)) VIRTUAL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_czech_ci ROW_FORMAT=DYNAMIC;
ALTER TABLE `eventData`
ADD PRIMARY KEY (`id`),
ADD KEY `time` (`time`),
ADD KEY `device_id` (`device_id`,`processed`),
ADD KEY `time_m` (`time_m`);
SQL:
SELECT e.time, e.time_m, X(e.position) AS lat, Y(e.position) AS lng
FROM eventData AS e
WHERE
e.device_id = 86 AND
e.time BETWEEN '2016-02-29' AND '2016-03-06'
GROUP BY DAY(e.time),HOUR(e.time),MINUTE(e.time);
Explain:
EXPLAIN SELECT e.time, e.time_m, X(e.position) AS lat, Y(e.position) AS lng FROM eventData AS e WHERE e.device_id = 86 AND e.time BETWEEN '2016-02-29' AND '2016-03-06' GROUP BY DAY(e.time),HOUR(e.time),MINUTE(e.time);
+----+-------------+-------+------------+------+----------------+-----------+---------+-------+---------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+----------------+-----------+---------+-------+---------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | e | NULL | ref | time,device_id | device_id | 5 | const | 2122632 | 6.40 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+------------+------+----------------+-----------+---------+-------+---------+----------+---------------------------------------------------------------------+
Describe:
DESCRIBE eventData;
+------------------+------------------------+------+-----+---------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+------------------------+------+-----+---------+-------------------+
| id | bigint(20) | NO | PRI | NULL | auto_increment |
| position | point | NO | | NULL | |
| speed | decimal(6,2) | YES | | NULL | |
| time | datetime | YES | MUL | NULL | |
| device_id | int(9) | YES | MUL | NULL | |
| processed | tinyint(1) | NO | | 0 | |
| time_m | datetime | YES | MUL | NULL | VIRTUAL GENERATED |
+------------------+------------------------+------+-----+---------+-------------------+
I've tried:
without group by: ~0.06s
group by day,hour,minute: ~4.76s
group by virtual column (time_m): ~4.92s
group by e.time DIV 500: ~5.02s
I need to achieve better results than 5 seconds. Please help.

You could partition the table, for example by year. This would dramatically improve performance due to much smaller indexes per partition (a sketch follows below).
If this is not possible in your environment, try
GROUP BY date_format(e.time,'%d%H%i');
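If partitioning is viable, here is a hedged sketch (partition names are illustrative, and it is untested against the generated column; MySQL requires the partitioning column to be part of every unique key, so the primary key must be widened first):
-- Widen the PK to include the partitioning column; this effectively
-- makes `time` NOT NULL and fails if NULL values already exist:
ALTER TABLE eventData DROP PRIMARY KEY, ADD PRIMARY KEY (id, `time`);
-- PARTITION BY cannot be combined with other alterations in one ALTER:
ALTER TABLE eventData PARTITION BY RANGE (YEAR(`time`)) (
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION p2016 VALUES LESS THAN (2017),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);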

1) You can try a composite index on (device_id, time); see the sketch after the query below.
2) Try grouping by the virtual column:
SELECT MIN(e.time), e.time_m, X(e.position) AS lat, Y(e.position) AS lng
FROM eventData AS e
WHERE
e.device_id = 86 AND
e.time BETWEEN '2016-02-29' AND '2016-03-06'
GROUP BY e.time_m;
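For point 1, a minimal sketch (index names are illustrative; the second index is the analogous candidate if you group by the virtual column):
ALTER TABLE eventData ADD INDEX idx_device_time (device_id, `time`);
ALTER TABLE eventData ADD INDEX idx_device_time_m (device_id, time_m);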

Related

How to use correct indexes with a double inner join query?

I have a query with 2 INNER JOIN statements that only fetches a few columns, but it is very slow even though I have indexes on all the required columns.
My query
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
INNER JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
com.prestataireLAD REGEXP '.*'
AND pe_nom REGEXP 'bordeaux|chambéry-annecy|grenoble|lyon|marseille|metz|montpellier|nancy|nice|nimes|rouen|strasbourg|toulon|toulouse|vitry|vitry bis 1|vitry bis 2|vlg'
AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
AND '2022-07-08 00:00:00';
It takes around 20 seconds to compute and fetch 4123 rows.
The problem
In order to find what's wrong and why is it so slow, I've used the EXPLAIN statement, here is the output:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|----------------------------|-------------|---------|------------------------|--------|----------|-------------|
| 1 | SIMPLE | dys | | ALL | id_commande,id_commande_2 | | | | 878588 | 100.00 | Using where |
| 1 | SIMPLE | com | | eq_ref | id_commande,date_livraison | id_commande | 110 | db.dys.id_commande | 1 | 7.14 | Using where |
| 1 | SIMPLE | pe | | ref | pe_id | pe_id | 5 | db.com.code_pe | 1 | 100.00 | Using where |
I can see that the join on dysfonctionnements doesn't use a key even though it could...
Table definitions
commandes (included relevant columns only)
CREATE TABLE `commandes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) NOT NULL DEFAULT '',
`date_commande` datetime NOT NULL,
`date_livraison` datetime NOT NULL,
`code_pe` int(11) NOT NULL,
`traitement_dysfonctionnement` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`),
KEY `date_livraison` (`date_livraison`),
KEY `traitement_dysfonctionnement` (`traitement_dysfonctionnement`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
dysfonctionnements (again, relevant columns only)
CREATE TABLE `dysfonctionnements` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) DEFAULT NULL,
`dysfonctionnement` varchar(150) DEFAULT NULL,
`responsable` varchar(50) DEFAULT NULL,
`reimputation` varchar(50) DEFAULT NULL,
`montant` float DEFAULT NULL,
`listRembArticles` text,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`,`dysfonctionnement`),
KEY `id_commande_2` (`id_commande`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
pe (again, relevant columns only)
CREATE TABLE `pe` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`pe_id` int(11) DEFAULT NULL,
`pe_nom` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pe_nom` (`pe_nom`),
KEY `pe_id` (`pe_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Investigation
If I remove the db.pe table from the query and the WHERE clause on pe_nom, the query takes 1.7 seconds to fetch 7k rows, and with the EXPLAIN statement, I can see it is using keys as I expect it to do:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|----------------------------|----------------|---------|------------------------|--------|----------|-----------------------------------------------|
| 1 | SIMPLE | com | | range | id_commande,date_livraison | date_livraison | 5 | | 389558 | 100.00 | Using index condition; Using where; Using MRR |
| 1 | SIMPLE | dys | | ref | id_commande,id_commande_2 | id_commande_2 | 111 | ooshop.com.id_commande | 1 | 100.00 | |
I'm open to any suggestions. I see no reason not to use the key, since it does get used on a very similar query, and it definitely makes things faster...
I had a similar experience when the MySQL optimiser selected a join order that was far from optimal. At that time I used the MySQL-specific STRAIGHT_JOIN operator to override the default optimiser behaviour. In your case I would try this:
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
-- keep the same WHERE clause as in the original query
Also, in your WHERE clause, one of the REGEXP conditions could probably be changed to an IN operator, which I assume can use an index.
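For example, the alternation on pe_nom could become the following sketch; note this is only equivalent if the listed values are complete pe_nom values, since REGEXP also matches substrings ('vitry' matches 'vitry bis 1' too):
AND pe_nom IN ('bordeaux','chambéry-annecy','grenoble','lyon','marseille','metz',
    'montpellier','nancy','nice','nimes','rouen','strasbourg','toulon',
    'toulouse','vitry','vitry bis 1','vitry bis 2','vlg')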
Remove com.prestataireLAD REGEXP '.*'. The Optimizer probably won't realize that this has no impact on the result set. If you are dynamically building the WHERE clause, then eliminate anything else you can.
id_commande_2 is redundant. In queries where it might be useful, the UNIQUE can take care of it.
These indexes might help:
com: INDEX(date_livraison, id_commande, code_pe)
pe: INDEX(pe_nom, pe_id)
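In DDL form, a sketch of those suggestions (index names are illustrative):
ALTER TABLE commandes ADD INDEX idx_livraison_cmd_pe (date_livraison, id_commande, code_pe);
ALTER TABLE pe        ADD INDEX idx_nom_id (pe_nom, pe_id);
-- and the redundant index mentioned above:
ALTER TABLE dysfonctionnements DROP INDEX id_commande_2;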

MySQL report query optimization and timezone issues

I'm faced with a MySQL database containing an events table with ~70 million rows; it has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The following query, which is fine on an events table with a few thousand rows, demonstrates the desired outcome:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't indexed, and although I've tried indexing with things like alter table events add index idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort queries past the 10-minute mark, so I don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (IODKU). However, I'm then also up against creating rows per day in the summary table based on what the database considers a day (UTC), which may not match the application's "day" due to the timezone offset.
Short of converting the start_time column to a timestamp type (which would then be inconsistent with all the other date types in the database), is there any way around this, or any other optimization I could make to the original events table to get a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various orderings of columns to test indexes and found idx_event_venueid_classid_starttime (below) to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query; it uses JOIN syntax and CONVERT_TZ to convert from UTC to the timezone required for reporting, then groups by the date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query takes ~1m 20s to run now, so I figured I could prepend it with an INSERT INTO to populate a summary table with timezone-specific dates and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
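The nightly population step could then look like this minimal sketch (assumptions: a full rebuild with IODKU rather than an incremental run restricted to the previous day, and COALESCE guards added because the name and duration columns are nullable in the source tables while summary declares them NOT NULL):
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')),
    COALESCE(class.name, ''),
    COALESCE(client.name, ''),
    COALESCE(venue.name, ''),
    COUNT(class.id),
    COALESCE(SUM(class.duration), 0)
FROM event
JOIN class  ON class.id  = event.class_id
JOIN venue  ON venue.id  = event.venue_id
JOIN client ON client.id = class.client_id
GROUP BY 1, 2, 3, 4  -- the four key expressions above
ON DUPLICATE KEY UPDATE
    occurrences = VALUES(occurrences),
    duration    = VALUES(duration);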
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
To then generate the reports from the summary table it takes a fraction of a second (shown below with data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. It needs to scan the entire events table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and the query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
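For the integer-sizing suggestion, a hedged sketch based on the AUTO_INCREMENT values shown earlier (class ids fit SMALLINT UNSIGNED, venue ids fit TINYINT UNSIGNED); the columns are kept nullable here because the ON DELETE SET NULL actions require it, and both ends of each foreign key must change together:
-- drop the FKs so the referenced columns can be resized
ALTER TABLE event DROP FOREIGN KEY fk_class_id, DROP FOREIGN KEY fk_venue_id;
ALTER TABLE class MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE venue MODIFY id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT;
-- resize the referencing columns and restore the constraints
ALTER TABLE event
    MODIFY class_id SMALLINT UNSIGNED NULL,
    MODIFY venue_id TINYINT UNSIGNED NULL,
    ADD CONSTRAINT fk_class_id FOREIGN KEY (class_id) REFERENCES class (id)
        ON DELETE SET NULL ON UPDATE CASCADE,
    ADD CONSTRAINT fk_venue_id FOREIGN KEY (venue_id) REFERENCES venue (id)
        ON DELETE SET NULL ON UPDATE CASCADE;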

Abstruse difference in query speed

I do not understand the difference (line 2) between those two EXPLAINs. Maybe someone has a hint for me as to why MySQL acts so differently on them, which heavily affects query speed.
The slow query takes 12 seconds (the same as querying all rows with that query) and uses a join on integer columns, while the joined table has just 3 records:
SELECT `inv_assets`.`id` AS `id`, `site`.`description` AS `sitename`,
(SELECT COALESCE(DATE_FORMAT(CONVERT_TZ(MIN(inspdate),'UTC','Europe/Vienna'),'%Y-%m-%d'),'')
FROM `mobuto_inv_inspections` AS `nextinsp`
WHERE ((`nextinsp`.`objectlink` = `inv_assets`.`id`
AND `nextinsp`.`inspdate` >= NOW()))
) AS `nextinsp`
FROM `mobuto_inv_assets` AS `inv_assets`
LEFT JOIN `mobuto_inv_sites` AS `site`
ON (`site`.`siteid` = `inv_assets`.`site`
AND `site`.`_state` IN (2,0))
ORDER BY `inv_assets`.`type` ASC LIMIT 0, 20;
+----+--------------------+------------+--------+----------------+---------+---------+------------------------------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+----------------+---------+---------+------------------------------+-------+----------------------------------------------------+
| 1 | PRIMARY | inv_assets | ALL | NULL | NULL | NULL | NULL | 24857 | Using temporary; Using filesort |
| 1 | PRIMARY | site | ALL | PRIMARY,_state | NULL | NULL | NULL | 3 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DEPENDENT SUBQUERY | nextinsp | ALL | inspdate | NULL | NULL | NULL | 915 | Using where |
+----+--------------------+------------+--------+----------------+---------+---------+------------------------------+-------+----------------------------------------------------+
The fast query takes just a fraction of a second, uses a join on varchar(32) columns, and the joined table has 1352 records:
SELECT `inv_assets`.`id` AS `id`, `guarantor`.`lastname` AS `guarantoruname`,
(SELECT COALESCE(DATE_FORMAT(CONVERT_TZ(MIN(inspdate),'UTC','Europe/Vienna'),'%Y-%m-%d'),'')
FROM `mobuto_inv_inspections` AS `nextinsp`
LEFT JOIN `users` AS `saveuser`
ON (`saveuser`.`uid` = `nextinsp`.`saveuser`
AND `saveuser`.`_state` = '0')
WHERE ((`nextinsp`.`objectlink` = `inv_assets`.`id`
AND `nextinsp`.`inspdate` >= NOW()))
) AS `nextinsp`
FROM `mobuto_inv_assets` AS `inv_assets`
LEFT JOIN `users` AS `guarantor`
ON (`guarantor`.`uid` = `inv_assets`.`guarantor`
AND `guarantor`.`_state` = '0')
ORDER BY `inv_assets`.`type` ASC LIMIT 0, 20;
+----+--------------------+------------+--------+----------------+---------+---------+---------------------------------+-------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+----------------+---------+---------+---------------------------------+-------+----------------+
| 1 | PRIMARY | inv_assets | ALL | NULL | NULL | NULL | NULL | 24857 | Using filesort |
| 1 | PRIMARY | guarantor | eq_ref | PRIMARY,_state | PRIMARY | 98 | mobuto_dev.inv_assets.guarantor | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | nextinsp | ALL | inspdate | NULL | NULL | NULL | 915 | Using where |
| 2 | DEPENDENT SUBQUERY | saveuser | eq_ref | PRIMARY,_state | PRIMARY | 98 | mobuto_dev.nextinsp.saveuser | 1 | Using where |
+----+--------------------+------------+--------+----------------+---------+---------+---------------------------------+-------+----------------+
The strange thing to me is that when I remove the joined table's column (description) from the select list (while the join still persists, and IMHO MySQL does not optimize it away when it is unused), the speed is back, because MySQL no longer uses a temporary table, and the EXPLAIN looks the same as the fast one (type=eq_ref).
But why does this work for the first sample only when no column is selected, whereas I can select one in the second?!
CREATE TABLE `mobuto_inv_assets` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`invnum` varchar(10) NOT NULL,
`oebglcat` varchar(4) NOT NULL,
`mark` varchar(100) NOT NULL,
`type` varchar(100) NOT NULL,
`serialnum` varchar(100) NOT NULL,
`desc` varchar(100) NOT NULL,
`site` int(11) NOT NULL DEFAULT '0',
`licnum` varchar(20) NOT NULL DEFAULT '',
`inquirer` varchar(100) NOT NULL DEFAULT '',
`inqdate` date NOT NULL DEFAULT '0000-00-00',
`supplier` varchar(100) NOT NULL DEFAULT '',
`suppldate` date NOT NULL DEFAULT '0000-00-00',
`supplnumber` varchar(30) NOT NULL DEFAULT '',
`invoicedate` date NOT NULL DEFAULT '0000-00-00',
`invoicenumber` varchar(30) NOT NULL DEFAULT '',
`purchaseprice` decimal(11,2) NOT NULL DEFAULT '0.00',
`leased` varchar(1) NOT NULL DEFAULT 'N',
`leasingcompany` varchar(100) NOT NULL DEFAULT '',
`leasingnumber` varchar(30) NOT NULL DEFAULT '',
`notes` text NOT NULL,
`inspnotes` text NOT NULL,
`inactive` varchar(1) NOT NULL DEFAULT 'N',
`maintain` varchar(1) NOT NULL DEFAULT 'Y',
`asset` varchar(1) NOT NULL DEFAULT 'Y',
`inspection` varchar(1) NOT NULL DEFAULT '',
`inspperson` varchar(100) NOT NULL DEFAULT '',
`guarantor` varchar(32) NOT NULL DEFAULT '',
`saveuser` varchar(32) NOT NULL,
`savetime` int(11) NOT NULL,
`recordid` varchar(32) NOT NULL,
`_state` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `invnum` (`invnum`),
KEY `_state` (`_state`),
KEY `site` (`site`)
) ENGINE=InnoDB AUTO_INCREMENT=30707 DEFAULT CHARSET=utf8;
CREATE TABLE `mobuto_inv_sites` (
`siteid` int(11) NOT NULL AUTO_INCREMENT,
`description` varchar(100) NOT NULL,
`saveuser` varchar(32) NOT NULL,
`savetime` int(11) NOT NULL,
`recordid` varchar(32) NOT NULL,
`_state` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`siteid`),
KEY `_state` (`_state`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8;
mysql> SHOW INDEX FROM mobuto_inv_assets;
+-------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mobuto_inv_assets | 0 | PRIMARY | 1 | id | A | 24857 | NULL | NULL | | BTREE | | |
| mobuto_inv_assets | 0 | invnum | 1 | invnum | A | 24857 | NULL | NULL | | BTREE | | |
| mobuto_inv_assets | 1 | _state | 1 | _state | A | 4 | NULL | NULL | | BTREE | | |
+-------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Changes as requested by @Wilson Hauck:
Added an index on the site column in mobuto_inv_assets (reduced execution time by almost half a second)
It seems the nextinsp column was missing from the first query, probably lost while formatting it. Of course it should be the same as in the fast one
Removed the saveuser join as it is not used there (saved another 2 seconds) and updated its EXPLAIN (last line removed)
SHOW INDEX FROM mobuto_inv_sites added
+------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mobuto_inv_sites | 0 | PRIMARY | 1 | siteid | A | 3 | NULL | NULL | | BTREE | | |
| mobuto_inv_sites | 1 | _state | 1 | _state | A | 3 | NULL | NULL | | BTREE | | |
+------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Your first query makes less use of keys than the second. The possible_keys column in the explain plan shows which keys are available to be used; the key column shows which ones are actually being used.
Short of seeing the structure of your DB, I would advise making more use of these keys in your JOIN and WHERE clauses to speed it up.
I'd also make sure the query isn't being cached when you modify the select columns and see the speed vary.
The 12 seconds for the first query are likely explained by the ROWS column: simply 24857*3*915*1 = 68,232,465 total row combinations considered. The sub-second second query considers 24857*1*915*1 = 22,744,155. The first query's use of Block Nested Loop processing is a major contributor to the delayed response.
Please post the results of SHOW CREATE TABLE mobuto_inv_assets and mobuto_inv_sites. Please also post the results of SHOW INDEX FROM mobuto_inv_assets and mobuto_inv_sites. With this additional information someone may be able to suggest improvements to the SELECT queries that will avoid Block Nested Loop processing, which is very CPU intensive with RBAR (Row By Agonizing Row) processing. Additional indexing may be required.
Thanks for posting your two SHOW CREATE TABLE's, immensely helpful.
Please consider adding an index on the joining column, if space permits on your system:
ALTER TABLE mobuto_inv_assets ADD INDEX (site);
Also, the EXPLAIN shown for query1 is mismatched to the query: I can see nextinsp and saveuser in the EXPLAIN, but the query as posted does not refer to them.
Please replace the EXPLAIN for query1 after creating the index, when you have an opportunity to test again, and indicate any reduction in execution time.
It would also be nice if you could post results of
SHOW INDEX FROM mobuto_inv_sites so we can see the scope of your data and cardinality.
If the inv_assets rows are populated with ACCURATE _state data,
consider changing query1 to something like the following:
SELECT inv_assets.id AS id, site.description AS sitename,
(SELECT COALESCE(DATE_FORMAT(CONVERT_TZ(MIN(inspdate),'UTC','Europe/Vienna'),'%Y-%m-%d'),'')
FROM mobuto_inv_inspections AS nextinsp
WHERE ((nextinsp.objectlink = inv_assets.id
AND nextinsp.inspdate >= NOW()))
) AS nextinsp
FROM mobuto_inv_assets AS inv_assets
LEFT JOIN mobuto_inv_sites AS site
ON (site.siteid = inv_assets.site
AND site._state IN (2,0))
WHERE inv_assets._state = 2 OR inv_assets._state = 0
ORDER BY inv_assets.type ASC LIMIT 0, 20;
With this, EXPLAIN should avoid the table scan and the subsequent Block Nested Loop processing.
If the _state data in inv_assets is not ACCURATE on every row, this will not work.
2017-08-10 update 09:42 CT: please post the QUERY, the EXPLAIN result, SHOW CREATE TABLE tblname and SHOW INDEX FROM tblname for the tables involved.

How to improve MySQL "fill the gaps" query

I have a table with currency exchange rates that I fill with data published by the ECB. That data contains gaps in the date dimension, e.g. holidays.
CREATE TABLE `imp_exchangerate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`rate_date` date NOT NULL,
`currency` char(3) NOT NULL,
`rate` decimal(14,6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `rate_date` (`rate_date`,`currency`),
KEY `imp_exchangerate_by_currency` (`currency`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
I also have a date dimension, as you'd expect in a data warehouse:
CREATE TABLE `d_date` (
`date_id` int(11) NOT NULL,
`full_date` date DEFAULT NULL,
---- etc.
PRIMARY KEY (`date_id`),
KEY `full_date` (`full_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Now I try to fill the gaps in the exchangerates like this:
SELECT
d.full_date,
currency,
(SELECT rate FROM imp_exchangerate
WHERE rate_date <= d.full_date AND currency = c.currency
ORDER BY rate_date DESC LIMIT 1) AS rate
FROM
d_date d,
(SELECT DISTINCT currency FROM imp_exchangerate) c
WHERE
d.full_date >=
(SELECT min(rate_date) FROM imp_exchangerate
WHERE currency = c.currency) AND
d.full_date <= curdate()
Explain says:
+------+--------------------+------------------+-------+----------------------------------------+------------------------------+---------+------------+------+--------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+------------------+-------+----------------------------------------+------------------------------+---------+------------+------+--------------------------------------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 201 | |
| 1 | PRIMARY | d | range | full_date | full_date | 4 | NULL | 6047 | Using where; Using index; Using join buffer (flat, BNL join) |
| 4 | DEPENDENT SUBQUERY | imp_exchangerate | ref | imp_exchangerate_by_currency | imp_exchangerate_by_currency | 3 | c.currency | 664 | |
| 3 | DERIVED | imp_exchangerate | range | NULL | imp_exchangerate_by_currency | 3 | NULL | 201 | Using index for group-by |
| 2 | DEPENDENT SUBQUERY | imp_exchangerate | index | rate_date,imp_exchangerate_by_currency | rate_date | 6 | NULL | 1 | Using where |
+------+--------------------+------------------+-------+----------------------------------------+------------------------------+---------+------------+------+--------------------------------------------------------------+
MySQL needs multiple hours to execute that query. Are there any ideas how to improve it? I have tried an index on rate without any noticeable impact.
I have had a solution for a while now: get rid of the dependent subqueries. I had to think from different angles in multiple places, and here is the result:
SELECT
cd.date_id,
x.currency,
x.rate
FROM
imp_exchangerate x INNER JOIN
(SELECT
d.date_id,
max(rate_date) as rate_date,
currency
FROM
d_date d INNER JOIN
imp_exchangerate ON rate_date <= d.full_date
WHERE
d.full_date <= curdate()
GROUP BY
d.date_id,
currency) cd ON x.rate_date = cd.rate_date and x.currency = cd.currency
This query now finishes in less than 10 minutes, compared to multiple hours for the original query.
Lesson learned: avoid dependent subqueries in MySQL like the plague!

Same mysql queries in two similar databases in the same machine have different perfomance

I have two databases, one for dev and one for staging, and they're both on the same machine. I'm having a problem with a query on two tables. Here are the schemas for the tables.
Table 1 schema:
Table: import_schedule_t
Create Table: CREATE TABLE `import_schedule_t` (
`id` int(11) NOT NULL,
`theater_id` int(11) NOT NULL,
`movie_code` varchar(20) NOT NULL,
`start_time` datetime NOT NULL,
`end_time` datetime NOT NULL,
`pc_url` varchar(250) NOT NULL,
`mb_url` varchar(250) NOT NULL,
`url_type` int(11) DEFAULT '0',
`active` int(11) DEFAULT '1',
`intime` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`utime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`schedule_date` datetime NOT NULL,
`movie_name` text NOT NULL,
`screen_name` text NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
and Table 2 schema:
Table: wp_postmeta
Create Table: CREATE TABLE `wp_postmeta` (
`meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`meta_key` varchar(255) DEFAULT NULL,
`meta_value` longtext,
PRIMARY KEY (`meta_id`),
KEY `post_id` (`post_id`),
KEY `meta_key` (`meta_key`(191))
) ENGINE=MyISAM AUTO_INCREMENT=1399270 DEFAULT CHARSET=utf8
Both of the tables are present in both of the databases I've mentioned. When I try to run this query:
SELECT DISTINCT movie_code,post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
ON wp_postmeta.meta_value = import_schedule_t.movie_code
AND wp_postmeta.meta_key='update_movie_id'
WHERE DATE_FORMAT(start_time, '%Y-%m-%d')>= DATE_FORMAT(NOW(),'%Y-%m-%d')
the dev database finishes the query in 20 seconds, while the staging database runs it in just 1.4 seconds.
Here's some sample data:
wp_postmeta table
+---------+---------+-----------------+------------+
| meta_id | post_id | meta_key | meta_value |
+---------+---------+-----------------+------------+
| 45150 | 74572 | update_movie_id | 74572 |
+---------+---------+-----------------+------------+
import_schedule_t table (omitted some of the fields)
+--------+------------+---------------------+---------------------+
| id | movie_code | start_time | end_time |
+--------+------------+---------------------+---------------------+
| 120884 | 74572 | 2015-07-04 12:50:00 | 2015-07-04 15:05:00 |
+--------+------------+---------------------+---------------------+
I already tried looking at the indexes and optimizing the tables, but with no success; the query time on the dev database is still 20 seconds.
EXPLAIN EXTENDED on dev
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 23597 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461731 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
EXPLAIN EXTENDED on staging
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 9311 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461384 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
If both DBs are running on the same machine, with the same MySQL version, on the same hard drive, and with the very same structure and data, then it might be a fragmentation issue at the OS level. Take the servers down and defrag your disk.
On a side note: don't compare dates as strings, since dates are numbers internally in the DB and are compared much more efficiently (WHERE start_time >= CURDATE()).
Also you can save some storage space if you define smaller ints for some fields (like the 'active' field). An int is a 4 byte number while a tinyint is 1 byte.
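Applying that side note, a sketch of the rewritten query (unchanged apart from the WHERE clause):
SELECT DISTINCT movie_code, post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
    ON wp_postmeta.meta_value = import_schedule_t.movie_code
    AND wp_postmeta.meta_key = 'update_movie_id'
WHERE start_time >= CURDATE();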
Sorry, I can't comment as I don't have enough reputation, BUT I would expect the dev system to have a lot more data in its tables.
On another point, you should not use DATE_FORMAT here; I would guess it is turning your dates into strings, which are really inefficient to compare. Dates are just integers internally in MySQL, so they can be compared in one cycle, whereas the string comparison could easily take 1000 (or more) cycles. You should probably index the start_time field as well, to save scanning the table (see the sketch below).
Anytime you have a query taking 20 seconds you should be suspicious you are doing something wrong! MySQL can do A LOT in 20 seconds.
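A minimal sketch of the index suggested above (the name is illustrative; it can only be used for the range scan once the DATE_FORMAT wrapper is gone, as shown earlier):
ALTER TABLE import_schedule_t ADD INDEX idx_start_time (start_time);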