MySQL report query optimization and timezone issues - mysql

I'm faced with a MySQL database containing an event table with ~70 million rows; it has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The following query, which is fine on an event table with a few thousand rows, demonstrates the desired outcome:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The table wasn't originally indexed beyond its foreign keys, and although I've tried indexing with things like ALTER TABLE event ADD INDEX idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort queries once they're past the 10-minute mark, so I don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data, broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (IODKU, i.e. INSERT ... ON DUPLICATE KEY UPDATE; a sketch follows at the end of this question). However, I'm then also up against creating rows per day in a summary table based on what the database considers a day (UTC), which may not match the application's "day" due to the timezone offset.
Short of converting the start_time column to a timestamp type (which would then be inconsistent with all the other date types in the database), is there any way around this, or is there any other optimization I could make to the original event table to get a more responsive query? TIA
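For concreteness, here's a sketch of the IODKU maintenance I have in mind, run right after inserting an event (the summary table is the one defined in my update below; the parameter in the WHERE clause is illustrative):

-- Hypothetical IODKU sketch: fold one newly inserted event into the summary.
-- Requires the MySQL time zone tables to be loaded for CONVERT_TZ.
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(e.start_time, 'UTC', 'Europe/London')),
    cl.name,
    c.name,
    v.name,
    1,
    cl.duration
FROM event e
JOIN class cl ON cl.id = e.class_id
JOIN client c ON c.id = cl.client_id
JOIN venue v ON v.id = e.venue_id
WHERE e.id = ?  -- id of the event just inserted
ON DUPLICATE KEY UPDATE
    occurrences = occurrences + 1,
    duration = duration + VALUES(duration);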
Update 23/05
Here's the buffer pool size (134217728 bytes = 128 MB):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various orderings of columns to test indexes and found idx_event_venueid_classid_starttime (below) to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, which uses JOIN syntax and CONVERT_TZ to convert from UTC to the timezone required for reporting, then groups by the date (discarding the time portion). Note that CONVERT_TZ with named zones requires the MySQL time zone tables to be loaded (e.g. via mysql_tzinfo_to_sql):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query now takes ~1m 20s to run, so I figured I could prepend it with an INSERT INTO to populate a summary table with timezone-specific dates, and run that on a nightly basis (a sketch of the load statement follows the table definition). Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
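Here's a sketch of that nightly load; a real job would probably restrict the SELECT to the previous tz_date rather than rebuilding everything:

-- Populate/refresh the summary table from the report query.
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
    class.name,
    client.name,
    venue.name,
    COUNT(class.id),
    SUM(class.duration)
FROM event
JOIN class ON class.id = event.class_id
JOIN venue ON venue.id = event.venue_id
JOIN client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name
ON DUPLICATE KEY UPDATE
    occurrences = VALUES(occurrences),
    duration = VALUES(duration);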
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
Generating the reports from the summary table then takes a fraction of a second (shown below with the data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?

No further indexes are likely to help. The query needs to scan the entire event table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and the query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals (see the sketch at the end of this answer). (I picked half-hours because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
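To illustrate the half-hour bucketing mentioned above, here's a minimal sketch against the event table (it assumes start_time is stored in UTC):

-- Collapse events into half-hour buckets; 1800 = seconds per half hour.
-- Reports for any half-hour-offset timezone can then shift and re-aggregate.
-- Note: UNIX_TIMESTAMP() interprets DATETIME values in the session time zone.
SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(start_time) / 1800) * 1800) AS half_hour,
    venue_id,
    class_id,
    COUNT(*) AS occurrences
FROM event
GROUP BY half_hour, venue_id, class_id;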

Related

How to use correct indexes with a double inner join query?

I have a query with 2 INNER JOIN statements, fetching only a few columns, but it is very slow even though I have indexes on all the required columns.
My query
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
INNER JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
com.prestataireLAD REGEXP '.*'
AND pe_nom REGEXP 'bordeaux|chambéry-annecy|grenoble|lyon|marseille|metz|montpellier|nancy|nice|nimes|rouen|strasbourg|toulon|toulouse|vitry|vitry bis 1|vitry bis 2|vlg'
AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
AND '2022-07-08 00:00:00';
It takes around 20 seconds to compute and fetch 4123 rows.
The problem
In order to find out what's wrong and why it is so slow, I've used the EXPLAIN statement; here is the output:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|----------------------------|-------------|---------|------------------------|--------|----------|-------------|
| 1 | SIMPLE | dys | | ALL | id_commande,id_commande_2 | | | | 878588 | 100.00 | Using where |
| 1 | SIMPLE | com | | eq_ref | id_commande,date_livraison | id_commande | 110 | db.dys.id_commande | 1 | 7.14 | Using where |
| 1 | SIMPLE | pe | | ref | pe_id | pe_id | 5 | db.com.code_pe | 1 | 100.00 | Using where |
I can see that the dysfonctionnements join is the problem: it doesn't use a key even though it could...
Table definitions
commandes (included relevant columns only)
CREATE TABLE `commandes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) NOT NULL DEFAULT '',
`date_commande` datetime NOT NULL,
`date_livraison` datetime NOT NULL,
`code_pe` int(11) NOT NULL,
`traitement_dysfonctionnement` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`),
KEY `date_livraison` (`date_livraison`),
KEY `traitement_dysfonctionnement` (`traitement_dysfonctionnement`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
dysfonctionnements (again, relevant columns only)
CREATE TABLE `dysfonctionnements` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) DEFAULT NULL,
`dysfonctionnement` varchar(150) DEFAULT NULL,
`responsable` varchar(50) DEFAULT NULL,
`reimputation` varchar(50) DEFAULT NULL,
`montant` float DEFAULT NULL,
`listRembArticles` text,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`,`dysfonctionnement`),
KEY `id_commande_2` (`id_commande`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
pe (again, relevant columns only)
CREATE TABLE `pe` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`pe_id` int(11) DEFAULT NULL,
`pe_nom` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pe_nom` (`pe_nom`),
KEY `pe_id` (`pe_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Investigation
If I remove the db.pe table from the query and the WHERE clause on pe_nom, the query takes 1.7 seconds to fetch 7k rows, and with the EXPLAIN statement, I can see it is using keys as I expect it to do:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|----------------------------|----------------|---------|------------------------|--------|----------|-----------------------------------------------|
| 1 | SIMPLE | com | | range | id_commande,date_livraison | date_livraison | 5 | | 389558 | 100.00 | Using index condition; Using where; Using MRR |
| 1 | SIMPLE | dys | | ref | id_commande,id_commande_2 | id_commande_2 | 111 | ooshop.com.id_commande | 1 | 100.00 | |
I'm open to any suggestions; I see no reason for the optimizer not to use the key here when it does use it on a very similar query, and using it definitely makes things faster...
I had a similar experience when the MySQL optimiser selected a join order that was far from optimal. At that time I used the MySQL-specific STRAIGHT_JOIN operator to override the default optimiser behaviour. In your case I would try this:
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
Also, in your WHERE clause, one of the REGEXP conditions could probably be changed to an IN operator, which I assume can use an index; see the sketch below.
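For example, the pe_nom condition could become something like this (a sketch; note that REGEXP matches substrings, so this is only equivalent if the column values match the alternatives exactly):

AND pe_nom IN ('bordeaux', 'chambéry-annecy', 'grenoble', 'lyon', 'marseille',
               'metz', 'montpellier', 'nancy', 'nice', 'nimes', 'rouen',
               'strasbourg', 'toulon', 'toulouse', 'vitry', 'vitry bis 1',
               'vitry bis 2', 'vlg')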
Remove com.prestataireLAD REGEXP '.*'. The Optimizer probably won't realize that this has no impact on the resultset. If you are dynamically building the WHERE clause, then eliminate anything else you can.
id_commande_2 is redundant. In queries where it might be useful, the UNIQUE can take care of it.
These indexes might help:
com: INDEX(date_livraison, id_commande, code_pe)
pe: INDEX(pe_nom, pe_id)
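Expressed as DDL, those suggestions would look something like this (a sketch; the index names are arbitrary, and you should test on a copy first):

ALTER TABLE commandes ADD INDEX idx_livraison_commande_pe (date_livraison, id_commande, code_pe);
ALTER TABLE pe ADD INDEX idx_pe_nom_id (pe_nom, pe_id);
-- id_commande_2 is covered by the UNIQUE key on (id_commande, dysfonctionnement):
ALTER TABLE dysfonctionnements DROP INDEX id_commande_2;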

MySQL Record Matching Criteria With Latest Date

I have a MySQL table where all status changes are recorded. I want to be able to query the status of all items on a specific date, or the last date for all items. The table I have now is:
CREATE TABLE `tra_rel_sta` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tra_id` int(11) DEFAULT NULL,
`sta_id` int(11) DEFAULT NULL,
`changed_on` datetime DEFAULT NULL,
`changed_by` int(11) DEFAULT NULL,
`comments` text,
PRIMARY KEY (`id`),
KEY `tra_id` (`tra_id`),
KEY `rel` (`tra_id`,`sta_id`,`changed_on`),
KEY `sta_id` (`sta_id`),
KEY `changed_on` (`changed_on`),
KEY `tra_changed` (`tra_id`,`changed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=51734 DEFAULT CHARSET=utf8;
(I know I'm probably overdoing the indexes, but I haven't exactly figured out how to optimize indexes yet).
The query I'm using now, which works is:
SELECT rel.changed_on, rel.changed_by, rel.tra_id, sta.id AS sta_id, sta.status, sta.description, sta.onHold, sta.awaitingApproval, sta.approved, sta.complete, sta.locked
FROM (
SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
GROUP BY tra_id
) AS rec
LEFT JOIN tra_rel_sta AS rel ON rel.changed_on = rec.lst AND rel.tra_id = rec.tra_id
LEFT JOIN tra_status AS sta ON sta.id = rel.sta_id
If I want to use a specific date, I insert a WHERE statement in the sub-query, as sketched below.
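For example (the date literal is just for illustration):

SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
WHERE changed_on <= '2016-01-31 23:59:59'
GROUP BY tra_id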
This works, but it takes about 0.65 seconds to run from PHP with about 51,733 records in the table. This query is used as a sub-query in several others when I need to know the last status of an object, and as a result it is slowing down much of the application.
I've tried to use a sub query in the WHERE statement as described in MySQL: how to select record with latest date before a certain date but it takes almost twice as long. I've tried using a JOIN statement as described in MySQL select of record with latest date but I'm getting about the same or just slightly slower results.
How can I optimize this query or fix my indexes to make this more effective?
Thanks!!
As requested, EXPLAIN of query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---|-------------|-------------|--------|-----------------------------------|---------|---------|-------------------|-------|-------------
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 49931 | NULL
1 | PRIMARY | rel | ref | tra_id,rel,changed_on,tra_changed | tra_id | 5 | rec.tra_id | 1 | Using where
1 | PRIMARY | sta | eq_ref | PRIMARY | PRIMARY | 4 | csinfo.rel.sta_id | 1 | NULL
2 | DERIVED | tra_rel_sta | index | tra_id,rel,tra_changed | tra_id | 5 | NULL | 49931 | NULL

Same MySQL queries in two similar databases on the same machine have different performance

I have two databases, one for dev and one for staging, and they're both on the same machine. I'm having a problem with a query on two tables. Here are the schemas for the tables:
Table 1 schema:
Table: import_schedule_t
Create Table: CREATE TABLE `import_schedule_t` (
`id` int(11) NOT NULL,
`theater_id` int(11) NOT NULL,
`movie_code` varchar(20) NOT NULL,
`start_time` datetime NOT NULL,
`end_time` datetime NOT NULL,
`pc_url` varchar(250) NOT NULL,
`mb_url` varchar(250) NOT NULL,
`url_type` int(11) DEFAULT '0',
`active` int(11) DEFAULT '1',
`intime` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`utime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`schedule_date` datetime NOT NULL,
`movie_name` text NOT NULL,
`screen_name` text NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
and Table 2 schema:
Table: wp_postmeta
Create Table: CREATE TABLE `wp_postmeta` (
`meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`meta_key` varchar(255) DEFAULT NULL,
`meta_value` longtext,
PRIMARY KEY (`meta_id`),
KEY `post_id` (`post_id`),
KEY `meta_key` (`meta_key`(191))
) ENGINE=MyISAM AUTO_INCREMENT=1399270 DEFAULT CHARSET=utf8
Both of the tables are present in both of the databases I've mentioned. When I try to run this query:
SELECT DISTINCT movie_code,post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
ON wp_postmeta.meta_value = import_schedule_t.movie_code
AND wp_postmeta.meta_key='update_movie_id'
WHERE DATE_FORMAT(start_time, '%Y-%m-%d')>= DATE_FORMAT(NOW(),'%Y-%m-%d')
the dev database takes 20 seconds to finish the query, while the staging database finishes it in 1.4 seconds.
Here's some sample data:
wp_postmeta table
+---------+---------+-----------------+------------+
| meta_id | post_id | meta_key | meta_value |
+---------+---------+-----------------+------------+
| 45150 | 74572 | update_movie_id | 74572 |
+---------+---------+-----------------+------------+
import_schedule_t table (omitted some of the fields)
+--------+------------+---------------------+---------------------+
| id | movie_code | start_time | end_time |
+--------+------------+---------------------+---------------------+
| 120884 | 74572 | 2015-07-04 12:50:00 | 2015-07-04 15:05:00 |
+--------+------------+---------------------+---------------------+
I've already tried looking at the indexes and optimizing the tables, but with no success; the query time on the dev database is still 20 seconds.
EXPLAIN EXTENDED on dev
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 23597 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461731 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
EXPLAIN EXTENDED on staging
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 9311 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461384 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
If both DBs are running on the same machine, with the same MySQL version, on the same hard drive, and with the very same structure and data, then it might be a fragmentation issue at the OS level. Take the servers down and defrag your disk.
On a side note: don't compare dates as strings, since dates are numbers internally in the DB, and they are compared much more efficiently that way (WHERE start_time >= CURDATE(); see the rewrite below).
Also you can save some storage space if you define smaller ints for some fields (like the 'active' field). An int is a 4 byte number while a tinyint is 1 byte.
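Applied to the query in the question, that advice would give something like this (a sketch):

SELECT DISTINCT movie_code, post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
        ON wp_postmeta.meta_value = import_schedule_t.movie_code
       AND wp_postmeta.meta_key = 'update_movie_id'
-- sargable: leaves start_time bare so an index on it can be used
WHERE start_time >= CURDATE();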
Sorry, I can't comment because I don't have enough reputation, BUT I would expect the dev system has a lot more data in its tables.
On another point, you should not use DATE_FORMAT - I would guess it is turning your dates into strings, which are really inefficient to compare. Dates are just integers internally in MySQL, so they can be compared in one cycle; the string comparison could easily take 1000 (or more) cycles. You should probably index the start_time field as well, to save it having to scan the table (a sketch follows).
Anytime you have a query taking 20 seconds you should be suspicious you are doing something wrong! MySQL can do A LOT in 20 seconds.
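Concretely, that index suggestion would be something like this (a sketch; the index name is arbitrary):

ALTER TABLE import_schedule_t ADD INDEX idx_start_time (start_time);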

Inner join will not use index

Why would this query (and a number of similar variants) not use the index on ASIN in the 'tags' table? It insists on a full-table scan even when A contains just a few rows. As the 'tags' table on production contains nearly a million entries, this is killing the query rather badly.
SELECT C.tag, count(C.tag) AS total
FROM
(
SELECT B.*
FROM
(
SELECT ASIN FROM requests WHERE user_id=9
) A
INNER JOIN tags B USING(ASIN)
) C
GROUP BY C.tag ORDER BY total DESC
EXPLAIN shows no index being used (run on a test DB, so the row count in 'tags' is low, but it's still a full table scan):
| 1 | PRIMARY | <derived2> | system | NULL | NULL | NULL | NULL | 0 | const row not found |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 28 | |
| 2 | DERIVED | B | ALL | NULL | NULL | NULL | NULL | 2593 | Using where; Using join buffer |
| 3 | DERIVED | borrowing_requests | ref | idx_user_id | idx_user_id | 5 | | 27 | Using where
Indexes:
| book_tags | 1 | asin | 1 | ASIN | A | 432 | NULL | NULL | | BTREE | |
| book_tags | 1 | idx_tag | 1 | tag | A | 1296 | NULL | NULL | | BTREE | |
| book_tags | 1 | idx_updated_on | 1 | updated_on | A | 518 | NULL | NULL | | BTREE
The query was rewritten from an INNER JOIN which was having the same problem:
SELECT tag, count(tag) AS total
FROM tags
INNER JOIN requests ON requests.ASIN=tags.ASIN
WHERE user_id=9
GROUP BY tag
ORDER BY total DESC
EXPLAIN:
| 1 | SIMPLE | tags | ALL | NULL | NULL | NULL | NULL | 2593 | Using temporary; Using filesort |
| 1 | SIMPLE | requests | ref | idx_ASIN,idx_user_id | idx_ASIN | 33 | func | 3 | Using where
I get the idea this is a real basic point I'm missing, but about 4 hours work on it has got me nowhere. Any advice is welcome.
EDIT:
Thanks to some replies, I can see that the first query, using sub-queries, won't use indexes; but I was using it because it ran twice as quickly as the bottom query with just the INNER JOIN.
As an example, there are 70k rows in requests (all with an indexed ASIN), and 700k rows in tags, with 95k different ASINs in tags, each with less than 10 different tag records.
If a user has 10 requests, I only want the tags from those 10 ASINs to be listed and counted. In my mind, this should use tags.idx_ASIN and should look up at most 100 rows (10 ASINs, each with a max of 10 tags) from the tags table.
I'm missing something...I just can't see what.
EDIT:
requests CREATE TABLE:
CREATE TABLE IF NOT EXISTS `requests` (
`bid` int(40) NOT NULL AUTO_INCREMENT,
`user_id` int(20) DEFAULT NULL,
`ASIN` varchar(10) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` enum('active','inactive','pending','deleted','completed') COLLATE utf8_unicode_ci NOT NULL,
`added_on` datetime NOT NULL,
`status_changed_on` datetime NOT NULL,
`last_emailed` datetime DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`bid`),
KEY `idx_ASIN` (`ASIN`),
KEY `idx_status` (`status`),
KEY `idx_added_on` (`added_on`),
KEY `idx_user_id` (`user_id`),
KEY `idx_status_changed_on` (`status_changed_on`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=149380 ;
tags CREATE TABLE
CREATE TABLE IF NOT EXISTS `tags` (
`ASIN` varchar(10) NOT NULL,
`tag` varchar(50) NOT NULL,
`updated_on` datetime NOT NULL,
KEY `idx_tag` (`tag`),
KEY `idx_updated_on` (`updated_on`),
KEY `idx_asin` (`ASIN`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
There is no primary key on tags. I don't usually have tables without primary keys, but didn't see the need on this one. Could this be an issue?
AHA! Different charsets and collations. I shall correct that and try again!
Later:
That got it. Query went down from 10secs to 0.006secs. Thanks to everyone for getting me to look at this differently.
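For anyone hitting the same thing, the fix was along these lines (a sketch; take a backup first, and pick the charset/collation that matches the requests table):

-- Align tags with requests so the ASIN join columns share a
-- charset/collation and the index on ASIN becomes usable.
ALTER TABLE tags CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;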
MySQL doesn't index subqueries. If you want indexes to improve performance of your queries, rewrite them to not use subqueries.
Try reversing the order of the tables in your original query:
SELECT tag, count(tag) AS total
FROM requests
INNER JOIN tags ON requests.ASIN=tags.ASIN
WHERE user_id=9
GROUP BY tag
ORDER BY total DESC

MySQL Index is bigger than the data stored

I have a database with the following stats
Tables    Data        Index     Total
11        579.6 MB    0.9 GB    1.5 GB
So as you can see, the index is close to 2x bigger than the data. And there is one table with ~7 million rows that takes up at least 99% of this.
I also have two indexes that are very similar
a) UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
b) KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
Update: Here is the table definition (at least structurally) of the largest table
CREATE TABLE `invoices` (
`id` int(10) unsigned NOT NULL auto_increment,
`customer_id` int(10) unsigned NOT NULL,
`order_no` varchar(10) default NULL,
`invoice_no` varchar(20) default NULL,
`customer_no` varchar(20) default NULL,
`name` varchar(45) NOT NULL default '',
`archived` tinyint(4) default NULL,
`invoiced` tinyint(4) default NULL,
`time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`group` int(11) default NULL,
`customer_group` int(11) default NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
KEY `idx_time` (`time`),
KEY `idx_order` (`order_no`),
KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
) ENGINE=InnoDB AUTO_INCREMENT=9146048 DEFAULT CHARSET=latin1 |
Update 2:
mysql> show indexes from invoices;
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| invoices | 0 | PRIMARY | 1 | id | A | 7578066 | NULL | NULL | | BTREE | |
| invoices | 0 | idx_customer_invoice | 1 | customer_id | A | 17 | NULL | NULL | | BTREE | |
| invoices | 0 | idx_customer_invoice | 2 | invoice_no | A | 7578066 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_time | 1 | time | A | 541290 | NULL | NULL | | BTREE | |
| invoices | 1 | idx_order | 1 | order_no | A | 6091 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 1 | customer_id | A | 17 | NULL | NULL | | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 2 | invoice_no | A | 7578066 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 3 | order_no | A | 7578066 | NULL | NULL | YES | BTREE | |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
My questions are:
Is there a way to find unused indexes in MySQL?
Are there any common mistakes that impact the size of the index?
Can indexA safely be removed?
How can you measure the size of each index? All I get is the total of all indexes.
You can remove index A, because, as you have noted, it is a subset of another index. And it's possible to do this without disrupting normal processing.
The size of the index files is not alarming in itself and it can easily be true that the net benefit is positive. In other words, the usefulness and value of an index shouldn't be discounted because it results in a large file.
Index design is a complex and subtle art involving a deep understanding of the query optimizer explanations and extensive testing. But one common mistake is to include too few fields in an index in order to make it smaller. Another is to test indexes with insufficient, or insufficiently representative data.
I may be wrong, but the first index (idx_customer_invoice) is UNIQUE and the second (idx_customer_invoice_order) is not, so you'll probably lose the uniqueness constraint when you remove the first one. No?
Is there a way to find unused indexes in MySQL?
The database engine optimizer will select a proper index when attempting to optimize your query. Depending on when you last collected statistics on your indexes, the chosen index will vary. Unused indexes could suddenly become used because the data distribution has changed.
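On newer MySQL versions (5.7+ with the sys schema and performance_schema enabled), there is also a view that lists indexes with no recorded use since the server started (a sketch; this question predates that feature, so treat it as a version-dependent option):

SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'your_db_name';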
Can indexA safely be removed?
I would say yes, if indexA and indexB are B-Tree indexes. This is because an index that starts with the same columns in the same order will have the same structure.
Use
SHOW INDEXES FROM your_table;
to see which indexes you have on a particular table. The Cardinality column tells you how selective, and hence how useful, an index is.
You can remove your indexes safely (it will not break the table), but beware: some queries might execute more slowly. First you should analyze your queries to decide whether you need a certain index or not.
I don't think you can find out the data length of a particular index, though.
BUT, if you think that an index size greater than twice the data size is abnormal... well, you are wrong. All of your indexes might be useful ;) If you have a table that holds a lot of information and you have to search it on a large number of columns, it can easily happen that the table's indexes are 2 times bigger than the table's data.
indexA can be removed because indexB includes indexA as its leftmost prefix.
What impacts your index length is your column type and column length.
Use:
SELECT index_length FROM information_schema.tables
WHERE table_name = 'your_table_name'
AND table_schema = 'your_db_name';
to get your table's total index_length.
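As an aside, on MySQL 5.6+ with InnoDB persistent statistics you can also estimate per-index sizes (a sketch; this question predates that feature, and stat_value is in pages, so multiply by the page size):

-- Per-index size estimate for the invoices table.
SELECT index_name,
       stat_value * @@innodb_page_size AS size_bytes
FROM mysql.innodb_index_stats
WHERE database_name = 'your_db_name'
  AND table_name = 'invoices'
  AND stat_name = 'size';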