I am having an issue finding a fast way of joining tables that look like this:
mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start | varchar(32) | NO | | "" | |
| ip_end | varchar(32) | NO | | "" | |
| ip_num_start | int(64) unsigned | NO | PRI | 0 | |
| ip_num_end | int(64) unsigned | NO | | 0 | |
| country_code | varchar(3) | NO | | "" | |
| country_name | varchar(64) | NO | | "" | |
| ip_poly | geometry | NO | MUL | NULL | |
+--------------+------------------+------+-----+---------+-------+
mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id | int(64) unsigned | NO | PRI | NULL | |
| ip_1 | tinyint(3) unsigned | NO | | NULL | |
| ip_2 | tinyint(3) unsigned | NO | | NULL | |
| ip_3 | tinyint(3) unsigned | NO | | NULL | |
| ip_4 | tinyint(3) unsigned | NO | | NULL | |
| ip_num | int(64) unsigned | NO | | 0 | |
| ip_poly | geometry | NO | MUL | NULL | |
+------------+---------------------+------+-----+---------+-------+
Please note that I am not interested in looking up rows in geo_ip for only ONE IP address at a time; I need an entity_ip LEFT JOIN geo_ip (or a similar/analogous approach).
This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):
mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | ip_poly_index | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit 0, 20;
20 rows in set (2.22 sec)
No polygons
mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | PRIMARY,geo_ip,geo_ip_end | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit 0, 20;
20 rows in set (2.00 sec)
(With a higher number of rows in the search there is no difference.)
Currently I cannot get any better performance out of these queries, and 0.1 seconds per IP is way too slow for me.
Is there any way to make it faster?
This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.
The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than a "greater than" comparison, so we'll need to build an index like that from the data you have available. That way MySQL will have far fewer rows to evaluate while looking for a match.
To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (= 1 for 1.2.3.4) of the IP addresses. The idea is that for each look-up, you can ignore all geolocation IPs that do not begin with the same octet as the IP you are looking for.
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...
ip_geolocation
|ip_start |ip_end |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255 |1224701944 |1241743359 |
Should convert to:
ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72 |1224701944 |1241743359 |
|73 |1224701944 |1241743359 |
|74 |1224701944 |1241743359 |
Since someone here requested a native MySQL solution, here's a stored procedure that will generate that data for you:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
    DECLARE i INT DEFAULT 0;
    DELETE FROM ip_geolocation_lookup;
    WHILE i < 256 DO
        INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end)
        SELECT i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE
            ( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND
            ( ip_numeric_end   & 0xFF000000 ) >> 24 >= i;
        SET i = i + 1;
    END WHILE;
END;;
DELIMITER ;
And then you will need to populate the table by calling that stored procedure:
CALL recalculate_ip_geolocation_lookup();
At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.
After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query to the look-up table needs to satisfy two conditions:
Find all rows which match the first octet of your IP address
Of that subset: find the row whose range contains your IP address
Because step two is carried out on a subset of data, it is considerably faster than doing the range tests on the entire data set. This is the key to this optimization strategy.
There are various ways of figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:
SELECT
r.*,
g.country_code
FROM
ip_geolocation g,
ip_geolocation_lookup l,
ip_random r
WHERE
l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND
l.ip_numeric_start <= r.ip_numeric AND
l.ip_numeric_end >= r.ip_numeric AND
g.ip_numeric_start = l.ip_numeric_start;
Now, admittedly I got a little lazy in the end: you could easily get rid of the ip_geolocation table altogether if you made the ip_geolocation_lookup table also contain the country data. I'm guessing dropping one table from this query would make it a bit faster.
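For instance, a merged look-up table might look like this (a sketch only; the name ip_geolocation_lookup2 and the decision to denormalize just country_code are my assumptions, and the populate procedure above would need to select country_code as well):
CREATE TABLE `ip_geolocation_lookup2` (
  `first_octet` int(10) unsigned NOT NULL DEFAULT '0',
  `ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
  `ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
  `country_code` varchar(3) NOT NULL DEFAULT '',
  KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Same probe as before, but no join back to ip_geolocation is needed:
SELECT r.*, l.country_code
FROM ip_geolocation_lookup2 l, ip_random r
WHERE l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24
  AND l.ip_numeric_start <= r.ip_numeric
  AND l.ip_numeric_end >= r.ip_numeric;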
And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.
# This table contains the original geolocation data
CREATE TABLE `ip_geolocation` (
`ip_start` varchar(16) NOT NULL DEFAULT '',
`ip_end` varchar(16) NOT NULL DEFAULT '',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
`country_code` varchar(3) NOT NULL DEFAULT '',
`country_name` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`ip_numeric_start`),
KEY `country_code` (`country_code`),
KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# This table simply holds random IP data that can be used for testing
CREATE TABLE `ip_random` (
`ip` varchar(16) NOT NULL DEFAULT '',
`ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
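For reference, a sketch of how the same look-up could be wired up to the tables from the question (untested; it assumes ip_geolocation_lookup was generated from geo_ip's ip_num_start/ip_num_end values):
SELECT li.*, gi.country_code
FROM entity_ip AS li
JOIN ip_geolocation_lookup AS l
    ON l.first_octet = ( li.ip_num & 0xFF000000 ) >> 24
   AND li.ip_num BETWEEN l.ip_numeric_start AND l.ip_numeric_end
JOIN geo_ip AS gi
    ON gi.ip_num_start = l.ip_numeric_start;
Since entity_ip already stores the first octet in ip_1, li.ip_1 could replace the bit arithmetic there.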
Just wanted to give back to the community:
Here's an even better and optimized way building on Aleksi's solution:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DROP TABLE `ip_geolocation_lookup`;
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
`startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`locId` int(11) NOT NULL,
PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
WHILE i < 1048576 DO
INSERT IGNORE INTO ip_geolocation_lookup
SELECT i, startIpNum, endIpNum, locId
FROM ip_geolocation_lookup
WHERE first_octet = i-1
AND endIpNum DIV 1048576 > i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
CALL recalculate_ip_geolocation_lookup();
It builds much faster than his solution and drills down more precisely because we're not just taking the first 8 bits, but the first 20 bits. Join performance: 100,000 rows in 158 ms. You may have to adapt the table and field names to your schema.
Query it by using:
SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
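If your source IPs are stored as dotted strings instead of integers, INET_ATON can compute the numeric value on the fly. A sketch, assuming a hypothetical random_ips_text table whose ip column holds dotted-quad strings (this will be slower than storing the integer, since the function result cannot use an index on that column):
SELECT r.ip, kl.*
FROM random_ips_text r
JOIN `ip_geolocation_lookup` kb
    ON (INET_ATON(r.ip) DIV 1048576 = kb.`first_octet`
        AND INET_ATON(r.ip) BETWEEN kb.`startIpNum` AND kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;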
Can't comment yet, but user1281376's answer is wrong and doesn't work. The reason you only use the first octet is that you aren't going to match all IP ranges otherwise: there are plenty of ranges that span multiple second octets, which user1281376's changed query isn't going to match. And yes, this actually happens if you use the MaxMind GeoIP data.
With Aleksi's suggestion you can do a simple comparison on the first octet, thus reducing the matching set.
I found an easy way. I noticed that the first IP of every group satisfies ip % 256 = 0,
so we can add an ip_index table:
CREATE TABLE `t_map_geo_range` (
`_ip` int(10) unsigned NOT NULL,
`_ipStart` int(10) unsigned NOT NULL,
PRIMARY KEY (`_ip`)
) ENGINE=MyISAM
How to fill the index table (pseudocode):
FOR EACH row of ip_geo
{
    FOR every block FROM ipStart DIV 256 TO ipEnd DIV 256
    {
        INSERT INTO t_map_geo_range (_ip, _ipStart) VALUES (block, ipStart);
    }
}
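If you prefer plain SQL over pseudocode, a stored-procedure sketch of that fill loop might look like this (my assumption is that ip_geo has numeric ipStart and ipEnd columns, as the usage query below suggests; untested):
DROP PROCEDURE IF EXISTS fill_t_map_geo_range;
DELIMITER ;;
CREATE PROCEDURE fill_t_map_geo_range()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE v_start INT UNSIGNED;
    DECLARE v_end INT UNSIGNED;
    DECLARE v_block INT UNSIGNED;
    DECLARE cur CURSOR FOR SELECT ipStart, ipEnd FROM ip_geo;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    DELETE FROM t_map_geo_range;
    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_start, v_end;
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- one row per /256 block covered by this range
        SET v_block = v_start DIV 256;
        WHILE v_block <= v_end DIV 256 DO
            INSERT IGNORE INTO t_map_geo_range (_ip, _ipStart) VALUES (v_block, v_start);
            SET v_block = v_block + 1;
        END WHILE;
    END LOOP;
    CLOSE cur;
END;;
DELIMITER ;

CALL fill_t_map_geo_range();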
How to use:
SELECT * FROM YOUR_TABLE AS A
LEFT JOIN t_map_geo_range AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B._ipStart;
More than 1000 times faster.
Related
I have a query with 2 INNER JOIN statements that only fetches a few columns, but it is very slow even though I have indexes on all the required columns.
My query
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
INNER JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
com.prestataireLAD REGEXP '.*'
AND pe_nom REGEXP 'bordeaux|chambéry-annecy|grenoble|lyon|marseille|metz|montpellier|nancy|nice|nimes|rouen|strasbourg|toulon|toulouse|vitry|vitry bis 1|vitry bis 2|vlg'
AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
AND '2022-07-08 00:00:00';
It takes around 20 seconds to compute and fetch 4123 rows.
The problem
In order to find out what's wrong and why it is so slow, I used the EXPLAIN statement; here is the output:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|----------------------------|-------------|---------|------------------------|--------|----------|-------------|
| 1 | SIMPLE | dys | | ALL | id_commande,id_commande_2 | | | | 878588 | 100.00 | Using where |
| 1 | SIMPLE | com | | eq_ref | id_commande,date_livraison | id_commande | 110 | db.dys.id_commande | 1 | 7.14 | Using where |
| 1 | SIMPLE | pe | | ref | pe_id | pe_id | 5 | db.com.code_pe | 1 | 100.00 | Using where |
I can see that the join on dysfonctionnements is the problem: it doesn't use a key even though it could...
Table definitions
commandes (included relevant columns only)
CREATE TABLE `commandes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) NOT NULL DEFAULT '',
`date_commande` datetime NOT NULL,
`date_livraison` datetime NOT NULL,
`code_pe` int(11) NOT NULL,
`traitement_dysfonctionnement` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`),
KEY `date_livraison` (`date_livraison`),
KEY `traitement_dysfonctionnement` (`traitement_dysfonctionnement`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
dysfonctionnements (again, relevant columns only)
CREATE TABLE `dysfonctionnements` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) DEFAULT NULL,
`dysfonctionnement` varchar(150) DEFAULT NULL,
`responsable` varchar(50) DEFAULT NULL,
`reimputation` varchar(50) DEFAULT NULL,
`montant` float DEFAULT NULL,
`listRembArticles` text,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`,`dysfonctionnement`),
KEY `id_commande_2` (`id_commande`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
pe (again, relevant columns only)
CREATE TABLE `pe` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`pe_id` int(11) DEFAULT NULL,
`pe_nom` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pe_nom` (`pe_nom`),
KEY `pe_id` (`pe_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Investigation
If I remove the db.pe table from the query along with the WHERE clause on pe_nom, the query takes 1.7 seconds to fetch 7k rows, and with the EXPLAIN statement I can see it is using keys as I expect it to:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|----------------------------|----------------|---------|------------------------|--------|----------|-----------------------------------------------|
| 1 | SIMPLE | com | | range | id_commande,date_livraison | date_livraison | 5 | | 389558 | 100.00 | Using index condition; Using where; Using MRR |
| 1 | SIMPLE | dys | | ref | id_commande,id_commande_2 | id_commande_2 | 111 | ooshop.com.id_commande | 1 | 100.00 | |
I'm open to any suggestions; I see no reason for it not to use the key when it does so on a very similar query, and using it definitely makes the query faster...
I had a similar experience when the MySQL optimiser selected a join order that was far from optimal. At the time I used the MySQL-specific STRAIGHT_JOIN operator to override the default optimiser behaviour. In your case I would try this:
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
Also, the REGEXP on pe_nom in your WHERE clause could probably be changed to an IN operator, which I assume can use an index.
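For example, the filter might become something like this (a sketch; it assumes pe_nom holds exactly these values, whereas REGEXP matches substrings, so verify that first):
AND pe_nom IN ('bordeaux', 'chambéry-annecy', 'grenoble', 'lyon', 'marseille',
               'metz', 'montpellier', 'nancy', 'nice', 'nimes', 'rouen',
               'strasbourg', 'toulon', 'toulouse', 'vitry', 'vitry bis 1',
               'vitry bis 2', 'vlg')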
Remove com.prestataireLAD REGEXP '.*'. The optimizer probably won't realize that this has no impact on the result set. If you are dynamically building the WHERE clause, eliminate anything else you can.
id_commande_2 is redundant. In queries where it might be useful, the UNIQUE can take care of it.
These indexes might help:
com: INDEX(date_livraison, id_commande, code_pe)
pe: INDEX(pe_nom, pe_id)
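A sketch of the corresponding statements (the index names here are made up; double-check the redundant index before dropping anything):
ALTER TABLE db.commandes ADD INDEX idx_livraison_commande_pe (date_livraison, id_commande, code_pe);
ALTER TABLE db.pe ADD INDEX idx_pe_nom_id (pe_nom, pe_id);
-- id_commande_2 duplicates the leading column of the UNIQUE key:
ALTER TABLE db.dysfonctionnements DROP INDEX id_commande_2;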
I'm faced with a MySQL database that contains an events table with ~70 million rows; it has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The query, which is fine on an events table with a few thousand rows and demonstrates the desired outcome, is as follows:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't indexed, and although I've tried adding indexes such as alter table events add index idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort the queries once they're past the 10-minute mark, so I don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (IODKU). However, I'm then also up against creating rows per day in the summary table based on what is considered a day in the database (UTC), which may not match the application's "day" due to the timezone offset.
Short of converting the start_time column to a TIMESTAMP type (which would then be inconsistent with all the other date types in the database), is there any way around this, or is there any other optimization I could make to the original events table that would result in a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various orderings of columns to test indexes and found idx_event_venueid_classid_starttime (below) to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, which uses JOIN syntax and now uses CONVERT_TZ to convert from UTC to the timezone required for reporting, then groups by the date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query takes ~1m 20s to run now, so I figured I could prepend it with an INSERT INTO to populate a summary table with timezone-specific dates, and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
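The nightly population could then be the grouped query above prepended with an INSERT, along these lines (a sketch only; the ON DUPLICATE KEY UPDATE part is my assumption for the case where the same day is recalculated):
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
    class.name, client.name, venue.name,
    COUNT(class.id), SUM(class.duration)
FROM event
JOIN class ON class.id = event.class_id
JOIN venue ON venue.id = event.venue_id
JOIN client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name
ON DUPLICATE KEY UPDATE
    occurrences = VALUES(occurrences),
    duration = VALUES(duration);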
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
Generating the reports from the summary table then takes a fraction of a second (shown below with the data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. The query needs to scan the entire events table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and the query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other appropriately sized integer (see the sketch after this list).
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
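Following up on the integer-sizing point, a sketch of what the change could look like for venue_id (hypothetical sizes; the foreign key has to be dropped and re-added so both sides end up with the same type, venue_id stays nullable because of ON DELETE SET NULL, and these ALTERs copy a ~60M-row table):
ALTER TABLE event DROP FOREIGN KEY fk_venue_id;
ALTER TABLE venue MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE event MODIFY venue_id SMALLINT UNSIGNED NULL;
ALTER TABLE event ADD CONSTRAINT fk_venue_id FOREIGN KEY (venue_id)
    REFERENCES venue (id) ON DELETE SET NULL ON UPDATE CASCADE;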
There is one MySQL table on which a simple SQL query with WHERE date = 'some date' is not working.
I have checked the logs.
I have reloaded the table several times and tried again.
This is proof that the record exists:
select * from TRN_RP_CONSUMPTION_DAILY limit 1;
+------------+------------------------+----------------------+------------------------+----------------+-------------------+-----------------+-------------------+----------------------+--------------------+----------------------+-------------------+-----------------+---------------+--------------+
| TRCD_DATE | TRCD_SPREAD_START_DATE | TRCD_SPREAD_END_DATE | TRCD_SOURCE_CHENNEL_ID | TRCD_CIRCLE_ID | TRCD_CATALOUGE_ID | TRCD_CONTENT_ID | TRCD_SONG_CODE_ID | TRCD_YT_CP_POLICY_ID | TRCD_YT_CHANNEL_ID | TRCD_PRODUCT_NAME_ID | TRCD_TRXN_TYPE_ID | TRCD_TRXN_COUNT | TRCD_TOP_LINE | TRCD_REVENUE |
+------------+------------------------+----------------------+------------------------+----------------+-------------------+-----------------+-------------------+----------------------+--------------------+----------------------+-------------------+-----------------+---------------+--------------+
| 2018-01-01 | 2018-01-01 | 2018-01-04 | 5 | 1 | 1 | 945723 | 1 | 1 | 1 | 211 | 180 | 1.75 | 0 | 0 |
+------------+------------------------+----------------------+------------------------+----------------+-------------------+-----------------+-------------------+----------------------+--------------------+----------------------+-------------------+-----------------+---------------+--------------+
This is proof that the index on the date column exists:
SHOW CREATE TABLE TRN_RP_CONSUMPTION_DAILY;
CREATE TABLE `TRN_RP_CONSUMPTION_DAILY` (
`TRCD_DATE` date NOT NULL DEFAULT '0000-00-00',
`TRCD_SPREAD_START_DATE` date NOT NULL DEFAULT '0000-00-00',
`TRCD_SPREAD_END_DATE` date NOT NULL DEFAULT '0000-00-00',
`TRCD_SOURCE_CHENNEL_ID` smallint(5) unsigned NOT NULL DEFAULT '1',
`TRCD_CIRCLE_ID` smallint(5) unsigned NOT NULL DEFAULT '1',
`TRCD_CATALOUGE_ID` int(10) unsigned NOT NULL DEFAULT '1',
`TRCD_CONTENT_ID` int(10) unsigned NOT NULL DEFAULT '1',
`TRCD_SONG_CODE_ID` int(10) unsigned NOT NULL DEFAULT '1',
`TRCD_YT_CP_POLICY_ID` tinyint(3) unsigned NOT NULL DEFAULT '1',
`TRCD_YT_CHANNEL_ID` tinyint(3) unsigned NOT NULL DEFAULT '1',
`TRCD_PRODUCT_NAME_ID` smallint(5) unsigned NOT NULL DEFAULT '1',
`TRCD_TRXN_TYPE_ID` tinyint(3) unsigned NOT NULL DEFAULT '1',
`TRCD_TRXN_COUNT` double NOT NULL DEFAULT '0',
`TRCD_TOP_LINE` double NOT NULL DEFAULT '0',
`TRCD_REVENUE` double NOT NULL DEFAULT '0',
KEY `IDX_TRCD_DATE` (`TRCD_DATE`),
KEY `IDX_TRCD_SOURCE_CHENNEL_ID` (`TRCD_SOURCE_CHENNEL_ID`),
KEY `IDX_TRCD_CATALOUGE_ID` (`TRCD_CATALOUGE_ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is the issue: it should return a count, but it doesn't (see the first query result above, the data is present):
select count(*) from TRN_RP_CONSUMPTION_DAILY where TRCD_DATE='2018-01-01';
+----------+
| count(*) |
+----------+
| 0 |
+----------+
Proof that it is using the index:
explain select count(*) from TRN_RP_CONSUMPTION_DAILY where TRCD_DATE='2018-01-01';
+----+-------------+--------------------------+------+---------------+---------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------------+------+---------------+---------------+---------+-------+------+-------------+
| 1 | SIMPLE | TRN_RP_CONSUMPTION_DAILY | ref | IDX_TRCD_DATE | IDX_TRCD_DATE | 3 | const | 1 | Using index |
+----+-------------+--------------------------+------+---------------+---------------+---------+-------+------+-------------+
FORCE INDEX is also not working:
select count(*) from TRN_RP_CONSUMPTION_DAILY FORCE INDEX(IDX_TRCD_DATE) where TRCD_DATE='2018-01-01';
+----------+
| count(*) |
+----------+
| 0 |
+----------+
Yes, it's a huge table:
select count(*) from TRN_RP_CONSUMPTION_DAILY;
+------------+
| count(*) |
+------------+
| 2006275044 |
+------------+
Table & index size:
103G = TRN_RP_CONSUMPTION_DAILY.MYD
52G = TRN_RP_CONSUMPTION_DAILY.MYI
Surprisingly this works, but I cannot always use it like this:
select count(*) from TRN_RP_CONSUMPTION_DAILY where date(TRCD_DATE)='2018-01-01';
+----------+
| count(*) |
+----------+
| 1235523 |
+----------+
I know this works because the index is not taken into account when a function is applied to that indexed column.
Which Percona Server:
Server version: 5.5.60-38.12-log Percona Server (GPL), Release 38.12,
Revision 26ef816
Nothing appears in the MySQL error or warning log when the query returns a zero count.
A WHERE clause on another indexed column works properly.
Can someone help explain why it is not working on that date column?
I want to add some more indexes to this table, but since this is not working I am stopping here.
Move from MyISAM to InnoDB.
Meanwhile, one of these should work. (Go down the list until you get a usable table. Most options are slow because they involve copying the table.)
If CHECK TABLE TRN_RP_CONSUMPTION_DAILY; reports an error, do REPAIR TABLE TRN_RP_CONSUMPTION_DAILY;.
OPTIMIZE TABLE TRN_RP_CONSUMPTION_DAILY;
REPAIR TABLE TRN_RP_CONSUMPTION_DAILY USE_FRM;
DROP INDEX ... (for each index), then ADD INDEX ...
copy table over:
CREATE TABLE new LIKE TRN_RP_CONSUMPTION_DAILY;
INSERT INTO new SELECT * FROM TRN_RP_CONSUMPTION_DAILY;
RENAME TABLE TRN_RP_CONSUMPTION_DAILY TO old,
new TO TRN_RP_CONSUMPTION_DAILY;
DROP TABLE old;
Restore from backup?
Those are things that are sometimes needed for MyISAM tables; InnoDB is more robust.
Have you tried the STR_TO_DATE MySQL function in your WHERE clause?
In your case: TRCD_DATE = STR_TO_DATE('2018-01-01','%Y-%m-%d') (or swap %m and %d if your date strings use the other order)
For more information: https://www.w3schools.com/sql/func_mysql_str_to_date.asp
I hope this answers your question.
You have the column defined as a date, but is the data really a date or a date/time? The exact comparison may be what is failing.
How about changing your where clause to
where TRCD_DATE >='2018-01-01' AND TRCD_DATE < '2018-01-02'
So you get anything from 12:00 am (morning) all the way up to 11:59:59 pm before the next day. Don't try to convert the date column; that will prevent use of an index.
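A sketch of the full count with that range predicate (this form should be able to use IDX_TRCD_DATE):
SELECT COUNT(*)
FROM TRN_RP_CONSUMPTION_DAILY
WHERE TRCD_DATE >= '2018-01-01'
  AND TRCD_DATE < '2018-01-02';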
I do not understand the difference (line 2) between these two EXPLAINs. Maybe someone has a hint for me as to why MySQL acts so differently on them, which heavily affects query speed.
The slow query takes 12 seconds (which equals querying all rows with that query) and uses a join on integer columns, while the joined table has just 3 records:
SELECT `inv_assets`.`id` AS `id`, `site`.`description` AS `sitename`,
(SELECT COALESCE(DATE_FORMAT(CONVERT_TZ(MIN(inspdate),'UTC','Europe/Vienna'),'%Y-%m-%d'),'')
FROM `mobuto_inv_inspections` AS `nextinsp`
WHERE ((`nextinsp`.`objectlink` = `inv_assets`.`id`
AND `nextinsp`.`inspdate` >= NOW()))
) AS `nextinsp`
FROM `mobuto_inv_assets` AS `inv_assets`
LEFT JOIN `mobuto_inv_sites` AS `site`
ON (`site`.`siteid` = `inv_assets`.`site`
AND `site`.`_state` IN (2,0))
ORDER BY `inv_assets`.`type` ASC LIMIT 0, 20;
+----+--------------------+------------+--------+----------------+---------+---------+------------------------------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+----------------+---------+---------+------------------------------+-------+----------------------------------------------------+
| 1 | PRIMARY | inv_assets | ALL | NULL | NULL | NULL | NULL | 24857 | Using temporary; Using filesort |
| 1 | PRIMARY | site | ALL | PRIMARY,_state | NULL | NULL | NULL | 3 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DEPENDENT SUBQUERY | nextinsp | ALL | inspdate | NULL | NULL | NULL | 915 | Using where |
+----+--------------------+------------+--------+----------------+---------+---------+------------------------------+-------+----------------------------------------------------+
The fast query takes just a fraction of a second, uses a join on varchar(32) columns, and the joined table has 1352 records:
SELECT `inv_assets`.`id` AS `id`, `guarantor`.`lastname` AS `guarantoruname`,
(SELECT COALESCE(DATE_FORMAT(CONVERT_TZ(MIN(inspdate),'UTC','Europe/Vienna'),'%Y-%m-%d'),'')
FROM `mobuto_inv_inspections` AS `nextinsp`
LEFT JOIN `users` AS `saveuser`
ON (`saveuser`.`uid` = `nextinsp`.`saveuser`
AND `saveuser`.`_state` = '0')
WHERE ((`nextinsp`.`objectlink` = `inv_assets`.`id`
AND `nextinsp`.`inspdate` >= NOW()))
) AS `nextinsp`
FROM `mobuto_inv_assets` AS `inv_assets`
LEFT JOIN `users` AS `guarantor`
ON (`guarantor`.`uid` = `inv_assets`.`guarantor`
AND `guarantor`.`_state` = '0')
ORDER BY `inv_assets`.`type` ASC LIMIT 0, 20;
+----+--------------------+------------+--------+----------------+---------+---------+---------------------------------+-------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+----------------+---------+---------+---------------------------------+-------+----------------+
| 1 | PRIMARY | inv_assets | ALL | NULL | NULL | NULL | NULL | 24857 | Using filesort |
| 1 | PRIMARY | guarantor | eq_ref | PRIMARY,_state | PRIMARY | 98 | mobuto_dev.inv_assets.guarantor | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | nextinsp | ALL | inspdate | NULL | NULL | NULL | 915 | Using where |
| 2 | DEPENDENT SUBQUERY | saveuser | eq_ref | PRIMARY,_state | PRIMARY | 98 | mobuto_dev.nextinsp.saveuser | 1 | Using where |
+----+--------------------+------------+--------+----------------+---------+---------+---------------------------------+-------+----------------+
The strange thing to me is that when I remove the joined table's column (description) from the select list (while the join is still there, and IMHO MySQL does not optimize it away when it is unused), the speed is back (because MySQL no longer uses a temporary table, and the EXPLAIN looks the same as the fast one, with type=eq_ref).
But why does the first sample only work when no column from the joined table is selected, whereas I can select one in the second one!?
CREATE TABLE `mobuto_inv_assets` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`invnum` varchar(10) NOT NULL,
`oebglcat` varchar(4) NOT NULL,
`mark` varchar(100) NOT NULL,
`type` varchar(100) NOT NULL,
`serialnum` varchar(100) NOT NULL,
`desc` varchar(100) NOT NULL,
`site` int(11) NOT NULL DEFAULT '0',
`licnum` varchar(20) NOT NULL DEFAULT '',
`inquirer` varchar(100) NOT NULL DEFAULT '',
`inqdate` date NOT NULL DEFAULT '0000-00-00',
`supplier` varchar(100) NOT NULL DEFAULT '',
`suppldate` date NOT NULL DEFAULT '0000-00-00',
`supplnumber` varchar(30) NOT NULL DEFAULT '',
`invoicedate` date NOT NULL DEFAULT '0000-00-00',
`invoicenumber` varchar(30) NOT NULL DEFAULT '',
`purchaseprice` decimal(11,2) NOT NULL DEFAULT '0.00',
`leased` varchar(1) NOT NULL DEFAULT 'N',
`leasingcompany` varchar(100) NOT NULL DEFAULT '',
`leasingnumber` varchar(30) NOT NULL DEFAULT '',
`notes` text NOT NULL,
`inspnotes` text NOT NULL,
`inactive` varchar(1) NOT NULL DEFAULT 'N',
`maintain` varchar(1) NOT NULL DEFAULT 'Y',
`asset` varchar(1) NOT NULL DEFAULT 'Y',
`inspection` varchar(1) NOT NULL DEFAULT '',
`inspperson` varchar(100) NOT NULL DEFAULT '',
`guarantor` varchar(32) NOT NULL DEFAULT '',
`saveuser` varchar(32) NOT NULL,
`savetime` int(11) NOT NULL,
`recordid` varchar(32) NOT NULL,
`_state` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `invnum` (`invnum`),
KEY `_state` (`_state`),
KEY `site` (`site`)
) ENGINE=InnoDB AUTO_INCREMENT=30707 DEFAULT CHARSET=utf8;
CREATE TABLE `mobuto_inv_sites` (
`siteid` int(11) NOT NULL AUTO_INCREMENT,
`description` varchar(100) NOT NULL,
`saveuser` varchar(32) NOT NULL,
`savetime` int(11) NOT NULL,
`recordid` varchar(32) NOT NULL,
`_state` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`siteid`),
KEY `_state` (`_state`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8;
mysql> SHOW INDEX FROM mobuto_inv_assets;
+-------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mobuto_inv_assets | 0 | PRIMARY | 1 | id | A | 24857 | NULL | NULL | | BTREE | | |
| mobuto_inv_assets | 0 | invnum | 1 | invnum | A | 24857 | NULL | NULL | | BTREE | | |
| mobuto_inv_assets | 1 | _state | 1 | _state | A | 4 | NULL | NULL | | BTREE | | |
+-------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Changes as requested by #Wilson Hauck:
Added an index on the site column in mobuto_inv_assets (reduced execution time by almost half a second)
It seems the nextinsp column was missing from the first query; maybe it was lost while formatting the query. Of course it should be the same as in the fast one.
Removed the saveuser join as it is not used there (saved another 2 seconds) and updated its EXPLAIN (last line removed)
SHOW INDEX FROM mobuto_inv_sites added
+------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mobuto_inv_sites | 0 | PRIMARY | 1 | siteid | A | 3 | NULL | NULL | | BTREE | | |
| mobuto_inv_sites | 1 | _state | 1 | _state | A | 3 | NULL | NULL | | BTREE | | |
+------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Your first query is making less use of keys than the second. The possible_keys column in the explain plan shows which keys are available to be used; the key column shows which are actually being used.
I would advise, short of seeing the structure of your DB, to make more use of these keys in your JOIN and WHERE clauses to speed it up.
I'd make sure that the query isn't being cached when you say you're modifying the select columns and the speed is varying.
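If the query cache is a concern while timing (MySQL 5.x; the query cache was removed in 8.0), SQL_NO_CACHE can rule it out. A sketch against the first query, assuming the schema from the question:
SELECT SQL_NO_CACHE inv_assets.id, site.description
FROM mobuto_inv_assets AS inv_assets
LEFT JOIN mobuto_inv_sites AS site
    ON (site.siteid = inv_assets.site AND site._state IN (2,0))
ORDER BY inv_assets.type ASC LIMIT 0, 20;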
The 12 seconds for the first query is likely explained by the rows column of its EXPLAIN: roughly 24857 * 3 * 915 * 1 = 68,232,465 total rows considered. The second query's sub-second time corresponds to roughly 24857 * 1 * 915 * 1 = 22,744,155 total rows considered. The first query's use of Block Nested Loop processing is a major contributor to the slow response.
Please post the results of SHOW CREATE TABLE mobuto_inv_assets and mobuto_inv_sites, as well as the results of SHOW INDEX FROM mobuto_inv_assets and mobuto_inv_sites. With that additional information someone may be able to suggest improvements to the SELECT ... queries that avoid Block Nested Loop processing, which is very CPU-intensive RBAR (Row By Agonizing Row) processing. Additional indexing may be required.
Thanks for posting your two SHOW CREATE TABLE's, immensely helpful.
Please consider adding an index with
ALTER TABLE mobuto_inv_assets ADD INDEX site (site);
if space permits on your system.
Also, the EXPLAIN shown for query1 is mismatched with the query:
the query does not refer to nextinsp or saveuser that I can see in the EXPLAIN.
Please replace the EXPLAIN for query1 after creating the index, when you have an opportunity to test again, and indicate any reduction in execution time.
It would also be nice if you could post the results of
SHOW INDEX FROM mobuto_inv_sites so we can see the scope of your data and its cardinality.
If the inv_assets rows are populated with ACCURATE _state data,
consider changing query1 to something like the following:
SELECT inv_assets.id AS id, site.description AS sitename,
(SELECT COALESCE(DATE_FORMAT(CONVERT_TZ(MIN(inspdate),'UTC','Europe/Vienna'),'%Y-%m-%d'),'')
FROM mobuto_inv_inspections AS nextinsp
WHERE ((nextinsp.objectlink = inv_assets.id
AND nextinsp.inspdate >= NOW()))
) AS nextinsp
FROM mobuto_inv_assets AS inv_assets
LEFT JOIN mobuto_inv_sites AS site
    ON (site.siteid = inv_assets.site
    AND site._state IN (2,0))
WHERE inv_assets._state IN (2,0)
ORDER BY inv_assets.type ASC LIMIT 0, 20;
The EXPLAIN should then avoid the table scan and the subsequent Block Nested Loop processing.
If _state data in inv_assets is not ACCURATE on every row, this will not work.
2017-08-10 09:42 CT update: please post the QUERY, the EXPLAIN result, SHOW CREATE TABLE tblname for the tables involved, and SHOW INDEX FROM tblname for the tables involved.
This is my first question ever on the forum, so do not hesitate to tell me if there is anything to improve in my question.
I have a big database with two tables
"visit" (6M rows) which basically stores each visit on a website
| visitdate | city |
----------------------------------
| 2014-12-01 00:00:02 | Paris |
| 2015-01-03 00:00:02 | Marseille|
"cityweather" (1M rows) that stores weather infos 3 times a day for a lot of cities
| weatherdate | city |
------------------------------------
| 2014-12-01 09:00:02 | Paris |
| 2014-12-01 09:00:02 | Marseille|
Note that there can be cities in the visit table that are not in cityweather and vice versa, and I need to take only the cities that are common to both tables.
I first had a big query that I tried to run and it failed, so I am going back to the simplest possible query joining those two tables, but the performance is terrible.
SELECT COUNT(DISTINCT(t.city))
FROM visit t
INNER JOIN cityweather d
ON t.city = d.city;
Note that both tables are indexed on the city column, and I already ran the COUNT(DISTINCT(city)) on both tables independently; it takes less than one second for each.
You can find below the result of the EXPLAIN on this query:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
----------------------------------
| 1 | SIMPLE | d | index | idx_city | idx_city | 303 | NULL | 1190553 | Using where; Using index |
| 1 | SIMPLE | t | ref | Idxcity | Idxcity | 303 | meteo.d.city | 465 | Using index |
You will find below the table information, and especially the engine, for both tables:
visit
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
--------------------------------------------------------------------------------------------------------------------
| visit | InnoDB | 10 | Compact | 6208060 | 85 | 531628032 | 0 | 0 | 0 |
The SHOW CREATE TABLE output :
CREATE TABLE
`visit` (
`productid` varchar(8) DEFAULT NULL,
`visitdate` datetime DEFAULT NULL,
`minute` int(2) DEFAULT NULL,
`hour` int(2) DEFAULT NULL,
`weekday` int(1) DEFAULT NULL,
`quotation` int(10) unsigned DEFAULT NULL,
`amount` int(10) unsigned DEFAULT NULL,
`city` varchar(100) DEFAULT NULL,
`weathertype` varchar(30) DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxvisitdate` (`visitdate`),
KEY `Idxcity` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
cityweather
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
------------------------------------------------------------------------------------------------------------------------------
| cityweather | InnoDB | 10 | Compact | 1190553 | 73 | 877670784 | 0 | 0 | 30408704 |
The SHOW CREATE TABLE output :
CREATE TABLE `cityweather` (
`city` varchar(100) DEFAULT NULL,
`lat` decimal(13,9) DEFAULT NULL,
`lon` decimal(13,9) DEFAULT NULL,
`weatherdate` datetime DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxweatherdate` (`weatherdate`),
KEY `idx_city` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have the feeling that the problem comes from type = index and ref = NULL, but I have no idea how to fix it...
You can find here a similar question that did not help me solve my problem.
Thanks!
Your query is so slow because the index you use can't get the number of rows down to a manageable amount. See your EXPLAIN output: it tells you that using the index on city (idx_city) in the cityweather table will require processing 1,190,553 rows, and joining by city to your visit table will then require another 465 rows from that table for each of them.
As a result, your database has to process on the order of 1,190,553 x 465 row combinations.
As the query stands, you can't improve its performance. But you can modify the query, e.g. by adding a condition on your visit data to narrow the results down. Try all kinds of EXISTS queries as well.
Update
Perhaps this helps:
CREATE TEMPORARY TABLE tmpTbl
SELECT distinct city as city from cityweather;
ALTER TABLE tmpTbl Add index adweerf (city);
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
Since IN ( SELECT ... ) optimizes poorly, change
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
to
SELECT COUNT(*)
FROM ( SELECT DISTINCT city FROM cityweather ) x
WHERE EXISTS( SELECT * FROM visit
WHERE city = x.city );
Both tables need (and have) an index on city. I'm pretty sure it is better to put the smaller table (cityweather) in the SELECT DISTINCT.
Other points:
Every InnoDB table really should have a PRIMARY KEY.
You could save a lot of space by using TINYINT UNSIGNED (1 byte), etc, instead of using 4-byte INT always.
9 decimal places for lat/lng is excessive for cities, and takes 12 bytes. I vote for DECIMAL(4,2)/(5,2) (1.6km / 1mi resolution; 5 bytes) or DECIMAL(6,4)/(7,4) (16m/52ft, 7 bytes).
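A sketch of those changes for cityweather (the column sizes here are my guesses; adjust them to your actual value ranges, and note that altering a 1M-row table copies it):
ALTER TABLE cityweather
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
    MODIFY lat DECIMAL(6,4),
    MODIFY lon DECIMAL(7,4),
    MODIFY temp SMALLINT,
    MODIFY pressure SMALLINT UNSIGNED,
    MODIFY humidity TINYINT UNSIGNED;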