How to improve an indexed inner join query Mysql? - mysql

This is my first question on a forum, so do not hesitate to tell me if anything in it could be improved.
I have a big database with two tables:
"visit" (6M rows), which basically stores each visit on a website:
| visitdate           | city      |
-----------------------------------
| 2014-12-01 00:00:02 | Paris     |
| 2015-01-03 00:00:02 | Marseille |
"cityweather" (1M rows), which stores weather info 3 times a day for a lot of cities:
| weatherdate         | city      |
-----------------------------------
| 2014-12-01 09:00:02 | Paris     |
| 2014-12-01 09:00:02 | Marseille |
Note that there can be cities in the visit table that are not in cityweather, and vice versa; I only need the cities that are common to both tables.
I first had a big query that failed when I tried to run it, so I am going back to the simplest possible query joining those two tables, but the performance is terrible.
SELECT COUNT(DISTINCT(t.city))
FROM visit t
INNER JOIN cityweather d
ON t.city = d.city;
Note that both tables are indexed on the column city. I already ran COUNT(DISTINCT(city)) on each table independently, and it takes less than one second for each.
Below is the result of EXPLAIN on this query:
| id | select_type | table | type  | possible_keys | key      | key_len | ref          | rows    | Extra                    |
-----------------------------------------------------------------------------------------------------------------------------
| 1  | SIMPLE      | d     | index | idx_city      | idx_city | 303     | NULL         | 1190553 | Using where; Using index |
| 1  | SIMPLE      | t     | ref   | Idxcity       | Idxcity  | 303     | meteo.d.city | 465     | Using index              |
Below is the table information, in particular the engine, for both tables:
visit
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
--------------------------------------------------------------------------------------------------------------------
| visit | InnoDB | 10 | Compact | 6208060 | 85 | 531628032 | 0 | 0 | 0 |
The SHOW CREATE TABLE output :
CREATE TABLE
`visit` (
`productid` varchar(8) DEFAULT NULL,
`visitdate` datetime DEFAULT NULL,
`minute` int(2) DEFAULT NULL,
`hour` int(2) DEFAULT NULL,
`weekday` int(1) DEFAULT NULL,
`quotation` int(10) unsigned DEFAULT NULL,
`amount` int(10) unsigned DEFAULT NULL,
`city` varchar(100) DEFAULT NULL,
`weathertype` varchar(30) DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxvisitdate` (`visitdate`),
KEY `Idxcity` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
cityweather
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
------------------------------------------------------------------------------------------------------------------------------
| cityweather | InnoDB | 10 | Compact | 1190553 | 73 | 877670784 | 0 | 0 | 30408704 |
The SHOW CREATE TABLE output :
CREATE TABLE `cityweather` (
`city` varchar(100) DEFAULT NULL,
`lat` decimal(13,9) DEFAULT NULL,
`lon` decimal(13,9) DEFAULT NULL,
`weatherdate` datetime DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxweatherdate` (`weatherdate`),
KEY `idx_city` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have the feeling that the problem comes from the type = index and the ref = NULL but I have no idea how to fix it...
You can find here a close question that did not help me solve my problem
Thanks !

Your query is slow because the index cannot reduce the number of rows to process. Look at your EXPLAIN output: the index on city (idx_city) in table cityweather is scanned in full, which is an estimated 1,190,553 rows. For each of those rows, the join by city reads an estimated 465 entries from your visit table.
As a result, your database has to process roughly 1,190,553 x 465, i.e. about 554 million, index entries.
As written, the query cannot be made much faster. But you can modify it, e.g. by adding a condition on your visit data to narrow the results down. Try the various forms of EXISTS queries as well.
Update
Perhaps this helps:
CREATE TEMPORARY TABLE tmpTbl
SELECT distinct city as city from cityweather;
ALTER TABLE tmpTbl Add index adweerf (city);
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);

Since IN ( SELECT ... ) optimizes poorly, change
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
to
SELECT COUNT(*)
FROM ( SELECT DISTINCT city FROM cityweather ) x
WHERE EXISTS( SELECT * FROM visit
WHERE city = x.city );
Both tables need (and have) an index on city. I'm pretty sure it is better to put the smaller table (cityweather) in the SELECT DISTINCT.
Other points:
Every InnoDB table really should have a PRIMARY KEY.
You could save a lot of space by using TINYINT UNSIGNED (1 byte), etc, instead of using 4-byte INT always.
9 decimal places for lat/lng is excessive for cities, and takes 12 bytes. I vote for DECIMAL(4,2)/(5,2) (1.6km / 1mi resolution; 5 bytes) or DECIMAL(6,4)/(7,4) (16m/52ft, 7 bytes).
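The points above could be applied roughly as follows. This is only a sketch under assumptions: the surrogate column name id is hypothetical (neither table in the question has such a column), the narrower types assume that minute, hour and weekday never hold negative values or values above 255, and existing lat/lon values will be rounded to 4 decimal places. Each ALTER rebuilds the table, so it is worth trying on a copy first.

```sql
-- Give each InnoDB table an explicit PRIMARY KEY (hypothetical surrogate column `id`)
-- and shrink the small-range integer columns from 4-byte INT to 1-byte TINYINT UNSIGNED.
ALTER TABLE visit
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
    MODIFY `minute`  TINYINT UNSIGNED DEFAULT NULL,
    MODIFY `hour`    TINYINT UNSIGNED DEFAULT NULL,
    MODIFY `weekday` TINYINT UNSIGNED DEFAULT NULL;

-- Same idea for cityweather, plus the narrower coordinate columns
-- (DECIMAL(6,4) covers latitudes up to +/-99.9999, DECIMAL(7,4) longitudes up to +/-999.9999).
ALTER TABLE cityweather
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
    MODIFY lat DECIMAL(6,4) DEFAULT NULL,
    MODIFY lon DECIMAL(7,4) DEFAULT NULL;
```

The existing indexes on city are untouched by these statements, so the EXISTS query above keeps working as before.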

Related

MySQL report query optimization and timezone issues

I'm faced with a MySQL database which contains an events table with ~70 million rows which has foreign keys to other tables and is used to generate reports. Constructing a performant query to select (while counting/summing values) and grouping data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The following query runs fine on an events table with a few thousand rows and demonstrates the desired outcome:
SELECT
    CAST(event.start_time as date) as day,
    class.name,
    client.name,
    venue.name,
    COUNT(class.name) AS occurrences,
    SUM(class.duration) AS duration
FROM
    class,
    client,
    event,
    venue
WHERE
    event.venue_id = venue.id
    AND event.class_id = class.id
    AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't indexed and although I've tried indexing with things like alter table events add index idx_test (venue_id, class_id, start_time); to improve performance it's still incredibly slow (I tend to abort them when they're past the 10 minute mark so don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide) so that I could hold a separate set of summarized data broken down into day with occurrences and total duration calculated/incremented with each addition to the table (IODKU). However I'm then also up against creating rows per day in a summary table based on what is considered a day in the database (UTC) which may not match with the application's "day" due to timezone offset.
Short of converting the start_time column to a timestamp type (which is then inconsistent with all other date types in the database) is there any way round this or is there any other optimization I could be making to the original events table resulting in a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various ordering of columns to test indexes and found idx_event_venueid_classid_starttime (below), to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, using JOIN syntax and now uses CONVERT_TZ to convert from UTC to the timezone required for reporting and then group that by the date (discarding the time portion):
SELECT
    DATE(CONVERT_TZ(event.start_time,
                    'UTC',
                    'Europe/London')) AS tz_date,
    class.name,
    client.name,
    venue.name,
    COUNT(class.id) AS occurrences,
    SUM(class.duration) AS duration
FROM
    event
        JOIN
    class ON class.id = event.class_id
        JOIN
    venue ON venue.id = event.venue_id
        JOIN
    client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query takes ~1m 20s to run now so I figured I could prepend that with an insert into to populate a summary table with the dates being timezone specific and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
To then generate the reports from the summary table it takes a fraction of a second (shown below with data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. The query needs to scan the whole event table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
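As a rough sketch of the nightly refresh described in the question, the aggregation query can feed the summary table directly with INSERT ... SELECT ... ON DUPLICATE KEY UPDATE (IODKU). The table and column names are taken from the question; the assumptions are that the name columns never contain NULL (the summary columns are NOT NULL), that the time zone tables are loaded so that CONVERT_TZ accepts 'Europe/London', and that a full re-aggregation is acceptable.

```sql
-- Re-aggregate the event table into `summary`, overwriting rows that already exist.
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
    class.name,
    client.name,
    venue.name,
    COUNT(class.id),
    COALESCE(SUM(class.duration), 0)   -- duration is nullable, summary.duration is not
FROM event
JOIN class  ON class.id  = event.class_id
JOIN venue  ON venue.id  = event.venue_id
JOIN client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name
ON DUPLICATE KEY UPDATE
    occurrences = VALUES(occurrences),  -- VALUES() still works in 8.0 but is deprecated since 8.0.20
    duration    = VALUES(duration);
```

Restricting the refresh to recent days would make it cheaper, but the cut-off then has to be computed in the reporting time zone; otherwise the first day in the range is re-aggregated from incomplete data and its summary row is overwritten with too small a count.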

Same MySQL queries in two similar databases on the same machine have different performance

I have two databases, one for dev and one for staging, and they are both on the same machine. I'm having a problem with a query on two tables. Here are the schemas for the tables:
Table 1 schema:
Table: import_schedule_t
Create Table: CREATE TABLE `import_schedule_t` (
`id` int(11) NOT NULL,
`theater_id` int(11) NOT NULL,
`movie_code` varchar(20) NOT NULL,
`start_time` datetime NOT NULL,
`end_time` datetime NOT NULL,
`pc_url` varchar(250) NOT NULL,
`mb_url` varchar(250) NOT NULL,
`url_type` int(11) DEFAULT '0',
`active` int(11) DEFAULT '1',
`intime` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`utime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`schedule_date` datetime NOT NULL,
`movie_name` text NOT NULL,
`screen_name` text NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
and Table 2 schema:
Table: wp_postmeta
Create Table: CREATE TABLE `wp_postmeta` (
`meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`meta_key` varchar(255) DEFAULT NULL,
`meta_value` longtext,
PRIMARY KEY (`meta_id`),
KEY `post_id` (`post_id`),
KEY `meta_key` (`meta_key`(191))
) ENGINE=MyISAM AUTO_INCREMENT=1399270 DEFAULT CHARSET=utf8
Both tables are present in both of the databases I've mentioned. When I try to run this query:
SELECT DISTINCT movie_code,post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
ON wp_postmeta.meta_value = import_schedule_t.movie_code
AND wp_postmeta.meta_key='update_movie_id'
WHERE DATE_FORMAT(start_time, '%Y-%m-%d')>= DATE_FORMAT(NOW(),'%Y-%m-%d')
the dev database takes 20 seconds to finish the query, while the staging database takes only 1.4 seconds.
Here's some sample data:
wp_postmeta table
+---------+---------+-----------------+------------+
| meta_id | post_id | meta_key | meta_value |
+---------+---------+-----------------+------------+
| 45150 | 74572 | update_movie_id | 74572 |
+---------+---------+-----------------+------------+
import_schedule_t table (omitted some of the fields)
+--------+------------+---------------------+---------------------+
| id | movie_code | start_time | end_time |
+--------+------------+---------------------+---------------------+
| 120884 | 74572 | 2015-07-04 12:50:00 | 2015-07-04 15:05:00 |
+--------+------------+---------------------+---------------------+
I already tried looking at the indexes and optimizing the tables, but with no success: the query time on the dev database is still 20 seconds.
EXPLAIN EXTENDED on dev
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 23597 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461731 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
EXPLAIN EXTENDED on staging
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 9311 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461384 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
If both DBs are running on the same machine, with the same MySQL version, on the same hard drive, with the very same structure and data, then it might be a fragmentation issue at the OS level. Take the servers down and defrag your disk.
On a side note: don't compare dates as strings. Dates are stored as numbers internally in the DB, and they are compared much more efficiently that way (WHERE start_time >= CURDATE()).
Also you can save some storage space if you define smaller ints for some fields (like the 'active' field). An int is a 4 byte number while a tinyint is 1 byte.
Sorry, can't comment cos I don't have enough reputation, BUT I would expect the dev system has a lot more data in its tables (the EXPLAIN output estimates 23597 rows in import_schedule_t on dev versus 9311 on staging).
On another point, you should not use DATE_FORMAT: it turns your dates into strings, which are really inefficient to compare, and wrapping the column in a function prevents an index on start_time from being used. Dates are just integers (internal to MySQL), so they can be compared very cheaply, while a string comparison can easily cost far more. You should probably index the start_time field as well, to save having to scan the table.
Anytime you have a query taking 20 seconds you should be suspicious you are doing something wrong! MySQL can do A LOT in 20 seconds.
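Putting the two suggestions above together, the query could look roughly like this. The index name idx_start_time is made up for illustration; the rewritten condition selects the same rows as the DATE_FORMAT version, because "the date part of start_time is today or later" is the same as "start_time is at or after midnight today".

```sql
-- Hypothetical index so the date filter does not have to scan the whole table
ALTER TABLE import_schedule_t ADD INDEX idx_start_time (start_time);

SELECT DISTINCT movie_code, post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
    ON wp_postmeta.meta_value = import_schedule_t.movie_code
   AND wp_postmeta.meta_key = 'update_movie_id'
WHERE start_time >= CURDATE();   -- compares DATETIME values directly, no string conversion
```

Whether the optimizer actually uses the new index depends on how many rows match the date condition, so it is worth checking with EXPLAIN on both databases.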

Mysql optimizer chooses wrong table order in query

We have a simple database with 4 tables: files, file_versions, users, organizations.
I select all files owned by a given organization, with a condition on the trashing date, using this query:
select * FROM organizations o
LEFT JOIN users u ON o.id=u.organization_id
LEFT JOIN files f ON u.user_identity=f.owner_identity
LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity
AND f.local_path=fv.local_path
WHERE o.id=2001237 AND o.trashed_file_age_limit>=1
AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
EXPLAIN SELECT shows me that the optimizer chooses the wrong table order, which is different from the order in the query (organizations -> users -> files -> file_versions):
mysql> explain select * FROM organizations o LEFT JOIN users u ON o.id=u.organization_id LEFT JOIN files f ON u.user_identity=f.owner_identity LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity AND f.local_path=fv.local_path WHERE o.id=2001237 AND o.trashed_file_age_limit>=1 AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
| 1 | SIMPLE | o | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | f | ALL | PRIMARY | NULL | NULL | NULL | 109615125 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY,identity,organization_id | identity | 36 | filemirror.f.owner_identity | 1 | Using where |
| 1 | SIMPLE | fv | ref | PRIMARY | PRIMARY | 3035 | filemirror.u.user_identity,filemirror.f.local_path | 1 | |
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
4 rows in set (0.01 sec)
Of course this query is slow because of the full scan of the files table, and I have to use STRAIGHT_JOIN (which is not equivalent to LEFT JOIN) to fix the table order and make the query faster.
mysql> explain select * FROM organizations o STRAIGHT_JOIN users u ON o.id=u.organization_id STRAIGHT_JOIN files f ON u.user_identity=f.owner_identity STRAIGHT_JOIN file_versions fv ON f.owner_identity=fv.owner_identity AND f.local_path=fv.local_path WHERE o.id=2001237 AND o.trashed_file_age_limit>=1 AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
| 1 | SIMPLE | o | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | u | ref | PRIMARY,identity,organization_id | PRIMARY | 4 | const | 36 | |
| 1 | SIMPLE | f | ref | PRIMARY | PRIMARY | 36 | filemirror.u.user_identity | 6089324 | Using where |
| 1 | SIMPLE | fv | ref | PRIMARY | PRIMARY | 3035 | filemirror.u.user_identity,filemirror.f.local_path | 1 | |
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
4 rows in set (0.00 sec)
My question is: why can MySQL change the table order in a non-symmetric join operation?
Tables structure:
CREATE TABLE `file_versions` (
`owner_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
`local_path` varchar(999) character set utf8 NOT NULL,
`version_number` int(11) unsigned NOT NULL,
...
PRIMARY KEY (`owner_identity`,`local_path`,`version_number`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `files` (
`owner_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
`local_path` varchar(999) character set utf8 NOT NULL,
`version_number` int(11) unsigned NOT NULL,
..
`trashing_date` int(11) default NULL,
...
PRIMARY KEY (`owner_identity`,`local_path`),
KEY `trashing_date` (`trashing_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `organizations` (
`id` int(11) NOT NULL,
...
`trashed_file_age_limit` int(11) default NULL,
...
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `users` (
`organization_id` int(11) NOT NULL,
`id` int(11) NOT NULL,
`user_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
...
PRIMARY KEY (`organization_id`,`id`),
UNIQUE KEY `identity` (`user_identity`),
KEY `organization_id` (`organization_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
MySQL version 5.5
Look at the row estimates: MySQL thinks it will need to read 109M rows of the files table in the first plan, and about 6M for each of 36 users (roughly 219M rows) in the second plan. So it seems reasonable to read all 109M rows only once, in primary key order, instead of reading them in separate blocks. Those estimates do not seem very reasonable to me, so I would try running ANALYZE TABLE on files, but they are only estimates, so you may not get better numbers.
Using a LEFT JOIN and then adding a condition on that table to the WHERE clause turns it into an INNER JOIN, as Strawberry says in their comment: the joined row has to exist for the WHERE condition to ever be true. So MySQL feels free to reorder those joins, and the optimizer may even prefer to do the "really-inner" joins first, which may be a second reason for that plan.
You can try using STRAIGHT_JOIN in a different way: if you put it just once, right after SELECT, the optimizer uses your join order where possible (it usually is, barring some unusual right joins and other corner cases) without changing the join type on specific tables. It then acts as a sort of flag, in the way SQL_NO_CACHE is used to signal something, rather than as a special join type.
Then, to make it even better, you may try adding an index to files on (owner_identity, trashing_date), which should help locate the relevant files for each user rather than globally, as with the current key on (trashing_date) only.
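As a sketch of the last two suggestions, the original query keeps its LEFT JOINs and only gains the STRAIGHT_JOIN modifier after SELECT. The index name idx_owner_trashing is hypothetical, and whether the index pays off should be checked with EXPLAIN on the real data.

```sql
-- Hypothetical composite index: find a user's files by owner, then filter by trashing date
ALTER TABLE files ADD INDEX idx_owner_trashing (owner_identity, trashing_date);

-- STRAIGHT_JOIN as a SELECT modifier: tables are joined in the order they are listed
SELECT STRAIGHT_JOIN *
FROM organizations o
LEFT JOIN users u ON o.id = u.organization_id
LEFT JOIN files f ON u.user_identity = f.owner_identity
LEFT JOIN file_versions fv ON f.owner_identity = fv.owner_identity
                          AND f.local_path = fv.local_path
WHERE o.id = 2001237
  AND o.trashed_file_age_limit >= 1
  AND f.trashing_date < (1433943058 - o.trashed_file_age_limit*24*60*60);
```

Unlike replacing each LEFT JOIN with the STRAIGHT_JOIN operator, this form leaves the join types as written, so file_versions rows remain optional in the result.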

MySQL Query optimization on big table

I can't find a way to speed up simple queries on a huge table.
I don't think I'm asking MySQL for anything crazy, even with this amount of data… and I can't understand why the following queries have such different execution times!
I did my best to read articles about big data in MySQL and field optimization, and already managed to reduce query time by changing field types… but really, I'm getting lost now with these simple queries!
Here is an example on MySQL 5.1.69 :
SELECT rv.`id_prd`,SUM(`quantite`)
FROM `report_ventes` AS rv
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 3.76 sec
Let's add a LEFT JOIN and another selected field :
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 12.10 sec
Explain :
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
And let's add another WHERE clause:
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE rp.`id_clas_prd` LIKE '1%'
AND `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 21.00 sec
Explain :
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY,id_clas_prd | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using where |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
And here are the tables parameters :
report_produits : 80 000 rows
CREATE TABLE `report_produits` (
`id_prd` int(11) unsigned NOT NULL,
`acl_cip_7` int(7) NOT NULL,
`acl_cip_ean_13` varchar(255) DEFAULT NULL,
`lib_prd` varchar(255) DEFAULT NULL,
`id_clas_prd` char(7) NOT NULL DEFAULT '',
`id_lab_prd` int(11) unsigned NOT NULL,
`id_rbt_prd` int(11) unsigned NOT NULL,
`id_tva_prd` int(11) unsigned NOT NULL,
`t_gen` varchar(255) NOT NULL,
`id_grp_gen` varchar(16) NOT NULL DEFAULT '',
`id_liste_delivrance` int(11) unsigned NOT NULL,
PRIMARY KEY (`id_prd`),
KEY `index_lab` (`id_lab_prd`),
KEY `index_grp` (`id_grp_gen`),
KEY `id_clas_prd` (`id_clas_prd`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
report_ventes : 16 556 188 rows
CREATE TABLE `report_ventes` (
`id` int(13) NOT NULL AUTO_INCREMENT,
`periode` mediumint(6) DEFAULT NULL,
`id_phie` smallint(4) unsigned NOT NULL,
`id_prd` mediumint(8) unsigned NOT NULL,
`quantite` smallint(11) DEFAULT NULL,
`ca_ht` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `periode` (`periode`)
) ENGINE=MyISAM AUTO_INCREMENT=18491315 DEFAULT CHARSET=utf8;
There is no covering index, and MySQL decides that scanning the whole table is more efficient than using an index and looking up the requested values.
You are joining on report_ventes.id_prd, but that column is not part of any index on that table (only id and periode are indexed), so the server has to read the table rows to get its values. The server probably bypasses the periode index because it is not selective enough.
An index that includes the id_prd, periode and quantite columns could help. There is a chance that the MySQL server will use it, since it is a covering index for this query.
Give it a try, but it's hard to tell for sure without testing it in the actual environment.
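A minimal sketch of that suggestion follows. The index name idx_prd_periode_qte is made up, and the column order is simply the one listed above; another order (for example with periode first) may work better, so compare the EXPLAIN output before and after.

```sql
-- Hypothetical covering index: all three report_ventes columns the query reads
ALTER TABLE report_ventes
    ADD INDEX idx_prd_periode_qte (id_prd, periode, quantite);

-- Check whether the optimizer picks it up (look for "Using index" in the Extra column)
EXPLAIN
SELECT rv.`id_prd`, SUM(`quantite`)
FROM `report_ventes` AS rv
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`;
```

On a MyISAM table of 16.5 million rows, building the index takes a while and blocks writes to the table for the duration.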
Basically, your indexes are not being used. I can't spot the precise reason without trying it on a server, but a common cause is that the data has different types.
AND periode BETWEEN 201301 AND 201312
"periode" has datatype mediumint(6), while the literal 201301 possibly has datatype int(10).
LEFT JOIN `report_produits` AS rp ON (rv.`id_prd` = rp.`id_prd`)
Here the two datatypes also differ: rv.id_prd is mediumint(8) unsigned and rp.id_prd is int(11) unsigned.

GeoIP table join with table of IP's in MySQL

I am having an issue finding a fast way of joining tables that look like this:
mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start | varchar(32) | NO | | "" | |
| ip_end | varchar(32) | NO | | "" | |
| ip_num_start | int(64) unsigned | NO | PRI | 0 | |
| ip_num_end | int(64) unsigned | NO | | 0 | |
| country_code | varchar(3) | NO | | "" | |
| country_name | varchar(64) | NO | | "" | |
| ip_poly | geometry | NO | MUL | NULL | |
+--------------+------------------+------+-----+---------+-------+
mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id | int(64) unsigned | NO | PRI | NULL | |
| ip_1 | tinyint(3) unsigned | NO | | NULL | |
| ip_2 | tinyint(3) unsigned | NO | | NULL | |
| ip_3 | tinyint(3) unsigned | NO | | NULL | |
| ip_4 | tinyint(3) unsigned | NO | | NULL | |
| ip_num | int(64) unsigned | NO | | 0 | |
| ip_poly | geometry | NO | MUL | NULL | |
+------------+---------------------+------+-----+---------+-------+
Please note that I am not interested in finding the matching rows in geo_ip for only ONE IP address at a time; I need an entity_ip LEFT JOIN geo_ip (or something analogous).
This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):
mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | ip_poly_index | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit 0, 20;
20 rows in set (2.22 sec)
No polygons
mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | PRIMARY,geo_ip,geo_ip_end | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit 0, 20;
20 rows in set (2.00 sec)
(With a higher number of rows in the search, there is no difference.)
Currently I cannot get any better performance out of these queries, and 0.1 seconds per IP is way too slow for me.
Is there any way to make it faster?
This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.
The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than a "greater than" comparison, so we'll need to build an index like that from the data you have available. This way MySQL will have far fewer rows to evaluate while looking for a match.
To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (the 1 in 1.2.3.4) of the IP addresses. The idea is that for each look-up you have to do, you can ignore all geolocation ranges which do not begin with the same octet as the IP you are looking for.
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
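As a small illustration of the key this table is built on, here is a sketch in Python showing how the first octet falls out of the numeric form of an address. The helper name first_octet is made up for this example; the mask-and-shift is the same expression the SQL in this answer uses.

```python
import ipaddress


def first_octet(ip_numeric: int) -> int:
    """Return the most significant byte of a 32-bit IPv4 address."""
    return (ip_numeric & 0xFF000000) >> 24


# Convert the dotted form to the unsigned 32-bit integer form.
ip_numeric = int(ipaddress.IPv4Address("1.2.3.4"))

print(ip_numeric, first_octet(ip_numeric))
```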
Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...
ip_geolocation
|ip_start |ip_end |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255 |1224701944 |1241743359 |
Should convert to:
ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72 |1224701944 |1241743359 |
|73 |1224701944 |1241743359 |
|74 |1224701944 |1241743359 |
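The conversion above can be sketched in a few lines of Python, assuming the ranges are available as numeric start/end pairs. The function name lookup_rows is hypothetical; it just reproduces the expansion of one range into one row per first octet.

```python
def lookup_rows(ip_numeric_start: int, ip_numeric_end: int):
    """Return one (first_octet, start, end) tuple per first octet the range touches."""
    first = ip_numeric_start >> 24
    last = ip_numeric_end >> 24
    return [
        (octet, ip_numeric_start, ip_numeric_end)
        for octet in range(first, last + 1)
    ]


# The example range from above: 72.255.119.248 .. 74.3.127.255
rows = lookup_rows(1224701944, 1241743359)
for row in rows:
    print(row)
```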
Since someone here asked for a native MySQL solution, here's a stored procedure that will generate that data for you:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
-- DELIMITER is needed when running this from the mysql command-line client
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
-- Start from an empty look-up table
DELETE FROM ip_geolocation_lookup;
-- For every possible first octet, copy each range that covers it
WHILE i < 256 DO
INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end)
SELECT i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE
( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND
( ip_numeric_end & 0xFF000000 ) >> 24 >= i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
And then you will need to populate the table by calling that stored procedure:
CALL recalculate_ip_geolocation_lookup();
At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.
After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query against the look-up table will do two things:
1. Find all rows which match the first octet of your IP address.
2. Of that subset, find the row whose range contains your IP address.
Because step two is carried out on a subset of the data, it is considerably faster than doing the range tests on the entire data set. This is the key to this optimization strategy.
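The two steps can be sketched outside the database as well. This is only an illustration in Python, with a dict standing in for the indexed look-up table; the sample ranges and the placeholder country codes "XX" and "YY" are made up.

```python
from collections import defaultdict

# Hypothetical sample data: (ip_numeric_start, ip_numeric_end, country_code).
RANGES = [
    (16777216, 16777471, "XX"),      # 1.0.0.0 .. 1.0.0.255
    (1224701944, 1241743359, "YY"),  # 72.255.119.248 .. 74.3.127.255
]


def build_buckets(ranges):
    """Group every range under each first octet it covers."""
    buckets = defaultdict(list)
    for start, end, code in ranges:
        for octet in range(start >> 24, (end >> 24) + 1):
            buckets[octet].append((start, end, code))
    return buckets


def find_country(buckets, ip_numeric):
    # Step 1: exact match on the first octet.
    candidates = buckets.get(ip_numeric >> 24, [])
    # Step 2: range test, but only on that small subset.
    for start, end, code in candidates:
        if start <= ip_numeric <= end:
            return code
    return None


buckets = build_buckets(RANGES)
print(find_country(buckets, 73 << 24))   # 73.0.0.0
print(find_country(buckets, 200 << 24))  # 200.0.0.0, not covered
```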
There are various ways for figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:
SELECT
r.*,
g.country_code
FROM
ip_geolocation g,
ip_geolocation_lookup l,
ip_random r
WHERE
l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND
l.ip_numeric_start <= r.ip_numeric AND
l.ip_numeric_end >= r.ip_numeric AND
g.ip_numeric_start = l.ip_numeric_start;
Now, admittedly I did get a little lazy in the end: You could easily get rid of ip_geolocation table altogether if you made the ip_geolocation_lookup table also contain the country data. I'm guessing dropping one table from this query would make it a bit faster.
And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.
# This table contains the original geolocation data
CREATE TABLE `ip_geolocation` (
`ip_start` varchar(16) NOT NULL DEFAULT '',
`ip_end` varchar(16) NOT NULL DEFAULT '',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
`country_code` varchar(3) NOT NULL DEFAULT '',
`country_name` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`ip_numeric_start`),
KEY `country_code` (`country_code`),
KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# This table simply holds random IP data that can be used for testing
CREATE TABLE `ip_random` (
`ip` varchar(16) NOT NULL DEFAULT '',
`ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Just wanted to give back to the community:
Here's an even better and optimized way building on Aleksi's solution:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DROP TABLE IF EXISTS `ip_geolocation_lookup`;
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
`startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`locId` int(11) NOT NULL,
PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
-- Prefixes run from 0 to 4095 (2^32 / 1048576 = 4096), so 4096 iterations cover them all
WHILE i < 4096 DO
INSERT IGNORE INTO ip_geolocation_lookup
SELECT i, startIpNum, endIpNum, locId
FROM ip_geolocation_lookup
WHERE first_octet = i-1
AND endIpNum DIV 1048576 > i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
CALL recalculate_ip_geolocation_lookup();
It builds much faster than his solution and narrows the search further, because the key is not just the first 8 bits but the first 12 (the address divided by 2^20 = 1048576). Join performance: 100000 rows in 158ms. You might have to rename the table and field names to match your version.
Query it with:
SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
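To see what the DIV 1048576 key in this answer amounts to, here is a small Python check. Dividing a 32-bit address by 2^20 drops its low 20 bits and keeps the top 12, so there are 4096 possible key values. The names BUCKET_SIZE and prefix are made up for this sketch.

```python
BUCKET_SIZE = 1048576  # 2**20, the divisor used in the procedure


def prefix(ip_numeric: int) -> int:
    """Key used by the look-up table: the address with its low 20 bits dropped."""
    return ip_numeric // BUCKET_SIZE


largest_ip = 0xFFFFFFFF   # 255.255.255.255
example_ip = 1224701944   # 72.255.119.248

# Integer division by 2**20 and a right shift by 20 give the same key.
print(prefix(example_ip), example_ip >> 20)
# Number of distinct key values over the whole IPv4 space.
print(prefix(largest_ip) + 1)
```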
I can't comment yet, but user1281376's answer is wrong and doesn't work. The reason you only use the first octet is that otherwise you aren't going to match all IP ranges. There are plenty of ranges that span multiple second octets, which user1281376's changed query isn't going to match. And yes, this actually happens if you use the MaxMind GeoIP data.
With Aleksi's suggestion you can do a simple comparison on the first octet, thus reducing the matching set.
I found an easy way. I noticed that the first IP of every group satisfies ip % 256 = 0,
so we can add an IP index table:
CREATE TABLE `t_map_geo_range` (
`_ip` int(10) unsigned NOT NULL,
`_ipStart` int(10) unsigned NOT NULL,
PRIMARY KEY (`_ip`)
) ENGINE=MyISAM;
How to fill the index table:
FOR_EACH(Every row of ip_geo)
{
FOR(Every ip FROM ipGroupStart/256 to ipGroupEnd/256)
{
INSERT INTO ip_geo_index(ip, ipGroupStart);
}
}
How to use:
SELECT * FROM YOUR_TABLE AS A
LEFT JOIN ip_geo_index AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B.ipStart;
This is more than 1000 times faster.
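For anyone who wants to try the idea before writing the fill loop in SQL, here is a rough Python sketch of the same scheme. It assumes, as noted above, that every range starts on a multiple of 256, and additionally that every range ends on the last address of a /24 block; the sample ranges and function names are made up.

```python
def build_index(ranges):
    """Map every /24 block (ip // 256) covered by a range to that range's start."""
    index = {}
    for ip_start, ip_end in ranges:
        for block in range(ip_start // 256, ip_end // 256 + 1):
            index[block] = ip_start
    return index


def find_range_start(index, ip):
    """Exact-match look-up: one dict access instead of a range scan."""
    return index.get(ip // 256)


# Hypothetical ranges, each starting on a multiple of 256.
ranges = [
    (16777216, 16777471),  # 1.0.0.0 .. 1.0.0.255
    (16777472, 16778239),  # 1.0.1.0 .. 1.0.3.255
]

index = build_index(ranges)
print(find_range_start(index, 16777735))  # 1.0.2.7
print(find_range_start(index, 33554432))  # 2.0.0.0, not covered
```

If a range can end in the middle of a /24 block, the look-up needs one extra comparison against the range end, otherwise addresses just past the end would match too.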