I am fighting with some performance problems on a very simple table, which seems to be very slow when fetching data by its primary key (bigint).
I have this table with 124 million entries:
CREATE TABLE `nodes` (
`id` bigint(20) NOT NULL,
`lat` float(13,7) NOT NULL,
`lon` float(13,7) NOT NULL,
PRIMARY KEY (`id`),
KEY `lat_index` (`lat`),
KEY `lon_index` (`lon`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and a simple query that takes some IDs from another table via an IN clause to fetch data from the nodes table. But it takes about an hour just to fetch a few rows from this table.
EXPLAIN shows me it's not using the PRIMARY key as an index; it's simply scanning the whole table. Why is that? id and the id column of the other table are both of type bigint(20).
mysql> EXPLAIN SELECT lat, lon FROM nodes WHERE id IN (SELECT node_id FROM ways_elements WHERE way_id = '4962890');
+----+--------------------+---------------+------+---------------+--------+---------+-------+-----------+-------------+
| id | select_type        | table         | type | possible_keys | key    | key_len | ref   | rows      | Extra       |
+----+--------------------+---------------+------+---------------+--------+---------+-------+-----------+-------------+
|  1 | PRIMARY            | nodes         | ALL  | NULL          | NULL   | NULL    | NULL  | 124035228 | Using where |
|  2 | DEPENDENT SUBQUERY | ways_elements | ref  | way_id        | way_id | 8       | const |         2 | Using where |
+----+--------------------+---------------+------+---------------+--------+---------+-------+-----------+-------------+
The query SELECT node_id FROM ways_elements WHERE way_id = '4962890' simply returns two node IDs, so the whole query should return only two rows, but it takes more or less an hour.
Using FORCE INDEX (PRIMARY) didn't help. And even if it had helped, why doesn't MySQL pick that index on its own, given that it's the primary key? EXPLAIN doesn't even mention anything in the possible_keys column, yet select_type shows PRIMARY.
Am I doing something wrong?
How does this perform?
SELECT lat, lon FROM nodes t1 JOIN ways_elements t2 ON (t1.id = t2.node_id) WHERE t2.way_id = '4962890'
I suspect that your query is checking each row in nodes against each item in the "IN" clause.
This is what is called a correlated subquery: the inner query gets re-evaluated for every row the outer query examines. A better query to use is:
SELECT lat, lon
FROM nodes n
JOIN ways_elements w ON n.id = w.node_id
WHERE w.way_id = '4962890'
I have a MySQL (v8.0.30) table that stores trades; the schema is the following:
CREATE TABLE `log_fill` (
`id` bigint NOT NULL AUTO_INCREMENT,
`orderId` varchar(63) NOT NULL,
`clientOrderId` varchar(36) NOT NULL,
`symbol` varchar(31) NOT NULL,
`executionId` varchar(255) DEFAULT NULL,
`executionSide` tinyint NOT NULL COMMENT '0 = long, 1 = short',
`executionSize` decimal(15,2) NOT NULL,
`executionPrice` decimal(21,8) unsigned NOT NULL,
`executionTime` bigint unsigned NOT NULL,
`executionValue` decimal(21,8) NOT NULL,
`executionFee` decimal(13,8) NOT NULL,
`feeAsset` varchar(63) DEFAULT NULL,
`positionSizeBeforeFill` decimal(21,8) DEFAULT NULL,
`apiKey` int NOT NULL,
`side` varchar(20) DEFAULT NULL,
`reconciled` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `executionId` (`executionId`,`executionSide`),
KEY `apiKey` (`apiKey`)
) ENGINE=InnoDB AUTO_INCREMENT=6522695 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
As you can see, there's a BTREE index on the column apiKey, which identifies a user; this way I can quickly retrieve all trades for a specific user.
My goal is a query that returns positionSizeBeforeFill + executionSize for the last record, given an apiKey and a symbol. So I wrote the following:
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
However, the execution is extremely slow (around 100 ms). I've noticed that running either WHERE or ORDER BY alone (not both together) drastically reduces execution time. For example,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
only takes 220 microseconds to execute. The number of records after filtering by apiKey and symbol is 388.
Similarly,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC
takes 26 microseconds (on a table of 3 million records).
All in all, running WHERE and ORDER BY separately takes microseconds; when I combine them, execution scales up to milliseconds (around 1000x more).
Running EXPLAIN on the slow query, it turns out it has to examine 116032 rows.
I tried to create a temporary table, hoping MySQL would perform the sort only on the filtered records, but the outcome is the same. I was wondering whether the problem might be the index (whose cardinality is 203), but how can that be the case when WHERE alone takes very little time? I could not find similar cases in other questions or forums. I think I just fail to understand how InnoDB selects data; I thought it would first filter by WHERE and then perform ORDER BY on the filtered rows. How can I improve this? Thanks!
Edit: The EXPLAIN statement on the slow query returns
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
| id | select_type | table        | partitions | type | possible_keys | key    | key_len | ref   | rows   | filtered | Extra                            |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
|  1 | SIMPLE      | log_fill_tmp | NULL       | ref  | apiKey        | apiKey | 4       | const | 116032 |    10.00 | Using where; Backward index scan |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
The query with WHERE only
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys | key    | key_len | ref   | rows   | filtered | Extra       |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
|  1 | SIMPLE      | log_fill_tmp | NULL       | ref  | apiKey        | apiKey | 4       | const | 116032 |    10.00 | Using where |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
The query with ORDER BY only on the full table
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
| id | select_type | table        | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra               |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
|  1 | SIMPLE      | log_fill_tmp | NULL       | index | NULL          | PRIMARY | 8       | NULL | 2503238 |   100.00 | Backward index scan |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
26 microseconds for any query against a 3-million-row table implies that you have the Query cache enabled. Please rerun your timings with SELECT SQL_NO_CACHE .... (Even milliseconds would be suspicious.) Were all 3M rows returned? Shoveling that many rows probably takes more than a second under any circumstances.
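For example, the re-test might look like this (a sketch; on servers without the query cache, SQL_NO_CACHE is simply ignored with a warning):

SELECT SQL_NO_CACHE positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC;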
Meanwhile, to speed up the first two queries, add
INDEX(symbol, apiKey, id)
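In ALTER TABLE form, that would be something like this (the index name is an arbitrary choice):

ALTER TABLE log_fill ADD INDEX idx_symbol_apikey_id (symbol, apiKey, id);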
EXPLAIN gives only approximate (e.g., 116032) counts. A "cardinality" of 203 is also just an estimate, but it is used by the Optimizer in some situations. Please get the exact count, just to check that there really are any rows:
SELECT COUNT(*)
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
With ORDER BY id DESC, it will scan the entire B+Tree which holds the data. As it says, it will do a 'Backward index scan'. However, since the "index" is the PRIMARY KEY and the PK is clustered with the data, it is really referring to the data's BTree.
The EXPLAIN for the first query decided that the indexes were not useful enough for the WHERE; instead it avoided the sort (ORDER BY) by doing the backward full-table scan, the same as the third query. (And it ignored any rows that did not match the WHERE.)
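Since only the last matching record is wanted, the suggested index combined with LIMIT 1 should let the backward scan stop at the first qualifying row. A sketch, reusing the values from the question:

SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
LIMIT 1;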
I added a composite index on (apiKey, symbol) and now the query runs in as little as 0.2 ms. To reduce the creation time for the index I cut the number of records from 3M to 500K, and the gain in time is about 97%; I believe it will be even more on the full table. I thought that with just the apiKey index it would first filter by user, then by symbol, but I was probably wrong.
I have three tables, bug, bugrule, and bugtrace, whose relationships are:
bug 1--------N bugrule
id = bugid
bugrule 0---------N bugtrace
id = ruleid
Because I'm almost always interested in the relation bug <---> bugtrace, I have created an appropriate VIEW, which is used as part of several queries. Interestingly, queries using this VIEW perform significantly worse than equivalent queries using the underlying JOIN explicitly.
VIEW definition:
CREATE VIEW bugtracev AS
SELECT t.*, r.bugid
FROM bugtrace AS t
LEFT JOIN bugrule AS r ON t.ruleid=r.id
WHERE r.version IS NULL
Execution plan for a query using the VIEW (bad performance):
mysql> explain
SELECT c.id,state,
(SELECT COUNT(DISTINCT(t.id)) FROM bugtracev AS t
WHERE t.bugid=c.id)
FROM bug AS c
WHERE c.version IS NULL
AND c.id<10;
+----+--------------------+-------+-------+---------------+--------+---------+-----------------+---------+-----------------------+
| id | select_type        | table | type  | possible_keys | key    | key_len | ref             | rows    | Extra                 |
+----+--------------------+-------+-------+---------------+--------+---------+-----------------+---------+-----------------------+
|  1 | PRIMARY            | c     | range | id_2,id       | id_2   | 8       | NULL            |       3 | Using index condition |
|  2 | DEPENDENT SUBQUERY | t     | index | NULL          | ruleid | 9       | NULL            | 1426004 | Using index           |
|  2 | DEPENDENT SUBQUERY | r     | ref   | id_2,id       | id_2   | 8       | bugapp.t.ruleid |       1 | Using where           |
+----+--------------------+-------+-------+---------------+--------+---------+-----------------+---------+-----------------------+
3 rows in set (0.00 sec)
Execution plan for a query using the underlying JOIN directly (good performance):
mysql> explain
SELECT c.id,state,
(SELECT COUNT(DISTINCT(t.id))
FROM bugtrace AS t
LEFT JOIN bugrule AS r ON t.ruleid=r.id
WHERE r.version IS NULL
AND r.bugid=c.id)
FROM bug AS c
WHERE c.version IS NULL
AND c.id<10;
+----+--------------------+-------+-------+---------------+--------+---------+-------------+--------+-----------------------+
| id | select_type        | table | type  | possible_keys | key    | key_len | ref         | rows   | Extra                 |
+----+--------------------+-------+-------+---------------+--------+---------+-------------+--------+-----------------------+
|  1 | PRIMARY            | c     | range | id_2,id       | id_2   | 8       | NULL        |      3 | Using index condition |
|  2 | DEPENDENT SUBQUERY | r     | ref   | id_2,id,bugid | bugid  | 8       | bugapp.c.id |      1 | Using where           |
|  2 | DEPENDENT SUBQUERY | t     | ref   | ruleid        | ruleid | 9       | bugapp.r.id | 713002 | Using index           |
+----+--------------------+-------+-------+---------------+--------+---------+-------------+--------+-----------------------+
3 rows in set (0.00 sec)
CREATE TABLE statements (with irrelevant columns removed) are:
mysql> show create table bug;
CREATE TABLE `bug` (
`id` bigint(20) NOT NULL,
`version` int(11) DEFAULT NULL,
`state` varchar(16) DEFAULT NULL,
UNIQUE KEY `id_2` (`id`,`version`),
KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> show create table bugrule;
CREATE TABLE `bugrule` (
`id` bigint(20) NOT NULL,
`version` int(11) DEFAULT NULL,
`bugid` bigint(20) NOT NULL,
UNIQUE KEY `id_2` (`id`,`version`),
KEY `id` (`id`),
KEY `bugid` (`bugid`),
CONSTRAINT `bugrule_ibfk_1` FOREIGN KEY (`bugid`) REFERENCES `bug` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> show create table bugtrace;
CREATE TABLE `bugtrace` (
`id` bigint(20) NOT NULL,
`ruleid` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `ruleid` (`ruleid`),
CONSTRAINT `bugtrace_ibfk_1` FOREIGN KEY (`ruleid`) REFERENCES `bugrule` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
You're asking why the optimizer behaves so differently for a couple of complex queries with COUNT(DISTINCT val) and dependent subqueries. It's hard to know for sure.
You will probably fix most of your performance problem by getting rid of your dependent subquery, though. Try something like this:
SELECT c.id,state, cnt.cnt
FROM bug AS c
LEFT JOIN (
SELECT bugid, COUNT(DISTINCT id) cnt
FROM bugtracev
GROUP BY bugid
) cnt ON c.id = cnt.bugid
WHERE c.version IS NULL
AND c.id<10;
Why does this help? To satisfy the query the optimizer can choose to run the GROUP BY subquery just once, rather than many times. And, you can use EXPLAIN on the GROUP BY subquery to understand its performance.
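For example, to look at the subquery's plan in isolation:

EXPLAIN
SELECT bugid, COUNT(DISTINCT id) AS cnt
FROM bugtracev
GROUP BY bugid;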
You may also get a performance boost by creating a compound index on bugrule that matches the query in your view. Try this one.
CREATE INDEX bugrule_v ON bugrule (version, ruleid, bugid)
and try switching the last two columns, like so:
CREATE INDEX bugrule_v2 ON bugrule (version, bugid, ruleid)
These indexes are called covering indexes because they contain all the columns needed to satisfy your query. version appears first because that helps optimize WHERE version IS NULL in your view definition. That makes it faster.
Pro tip: Avoid using SELECT * in views and queries, especially when you have performance problems. Instead, list the columns you actually need. The * may force the query optimizer to avoid a covering index, even when the index would help.
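Applied to the view above, that might look like this (a sketch that lists only the columns from the reduced schema shown earlier):

CREATE OR REPLACE VIEW bugtracev AS
SELECT t.id, t.ruleid, r.bugid
FROM bugtrace AS t
LEFT JOIN bugrule AS r ON t.ruleid = r.id
WHERE r.version IS NULL;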
If you are using MySQL 5.6 (or older), try with at least MySQL 5.7. According to What's New in MySQL 5.7?:
We have to a large extent unified the handling of derived tables and views. Until now, subqueries in the FROM clause (derived tables) were unconditionally materialized, while views created from the same query expressions were sometimes materialized and sometimes merged into the outer query. This behavior, beside being inconsistent, can lead to a serious performance penalty.
I have the following database schema
keywords(id, keyword, lang): about 8M records
topics(id, topic, lang): about 2.6M records
topic_keywords(topic_id, keyword_id, weight): about 200M records
In a script, I have about 50-100 keywords with an additional field keyword_score, and I want to retrieve the top 20 topics corresponding to those keywords, based on the following formula: SUM(keyword_score * topic_weight).
The solution I currently have in my script is: I create a temporary table temporary_keywords(keyword_id, keyword_score) and insert all 50-100 keywords into it with their keyword_score.
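For illustration, populating it might look like this (the IDs and scores here are hypothetical):

INSERT INTO temporary_keywords (keyword_id, keyword_score)
VALUES (101, 0.8), (102, 0.5), (103, 0.3);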
Then I execute the following query to retrieve the topics:
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING keyword_id
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20
This solution works, but in some cases it takes up to 3 seconds to execute, which is too much for me.
Is there a way to optimize this query, or should I redesign the data structure and move to a NoSQL database?
Any other solutions or ideas beyond what is listed above are most appreciated.
UPDATE (SHOW CREATE TABLE)
CREATE TABLE `topic_keywords` (
`topic_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
`weight` float DEFAULT '0',
PRIMARY KEY (`topic_id`,`keyword_id`),
KEY `keyword_id_idx` (`keyword_id`,`topic_id`,`weight`)
)
CREATE TEMPORARY TABLE temporary_keywords
( keyword_id INT PRIMARY KEY NOT NULL,
keyword_score DOUBLE
)
EXPLAIN QUERY
+----+-------------+--------------------+------+----------------+----------------+---------+--------------------------------------+----------+---------------------------------+
| id | select_type | table              | type | possible_keys  | key            | key_len | ref                                  | rows     | Extra                           |
+----+-------------+--------------------+------+----------------+----------------+---------+--------------------------------------+----------+---------------------------------+
|  1 | SIMPLE      | temporary_keywords | ALL  | PRIMARY        | NULL           | NULL    | NULL                                 |      100 | Using temporary; Using filesort |
|  1 | SIMPLE      | topic_keywords     | ref  | keyword_id_idx | keyword_id_idx | 4       | topics.temporary_keywords.keyword_id | 10778853 | Using index                     |
+----+-------------+--------------------+------+----------------+----------------+---------+--------------------------------------+----------+---------------------------------+
Incorrect, but uncaught, syntax.
JOIN topic_keywords USING keyword_id
-->
JOIN topic_keywords USING(keyword_id)
If that does not fix it, please provide EXPLAIN FORMAT=JSON SELECT ...
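For reference, the full query with the corrected join syntax would read:

SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING (keyword_id)
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20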
I've run into a weird problem with a MySQL index. I have a table views_video:
CREATE TABLE `views_video` (
`video_id` smallint(5) unsigned NOT NULL,
`record_date` date NOT NULL,
`region` char(2) NOT NULL DEFAULT '',
`views` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`video_id`,`record_date`,`region`),
KEY `video_id` (`video_id`)
)
The table contains 3.4 million records.
I ran EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 156
I got:
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| id | select_type | table       | type  | possible_keys    | key      | key_len | ref  | rows   | Extra       |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
|  1 | SIMPLE      | views_video | range | PRIMARY,video_id | video_id | 2       | NULL | 587984 | Using where |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
But when I ran EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 157
I got:
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| id | select_type | table       | type | possible_keys    | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | views_video | ALL  | PRIMARY,video_id | NULL | NULL    | NULL | 3412892 | Using where |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
video_id is from 1 to 1034. There is nothing special between 156 and 157.
What happens here?
Update: I have added more data to the database. Now video_id goes from 1 to 1064 and the table has 3.8M records, and the threshold has moved to between 114 and 115.
I'm guessing that with 3.4 million records and only 1064 possible values for your key, your selectivity is very low. (In other words, there are many duplicates, which makes it far less useful as a key.) The optimizer is taking its best guess as to whether it is more efficient to use the key or not, and you've found the threshold for that decision.
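A quick way to measure that selectivity yourself (a sketch):

SELECT COUNT(DISTINCT video_id) / COUNT(*) AS selectivity
FROM views_video;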
It might be the key population. Run these:

SELECT (COUNT(1)/20) INTO @FivePctOfData FROM views_video;

SELECT COUNT(1) AS videoidcount, video_id
FROM views_video
WHERE video_id <= 157
GROUP BY video_id;

The query optimizer probably took a vacation when one of the key values hit the 5% threshold.
You said there are 3.4 million rows. 5% would be 170,000. Perhaps this number was exceeded at some point in the query optimizer's life cycle on your query.
If you've added/deleted substantial data since creating the table, it's worthwhile to try ANALYZE TABLE on it. It frequently solves a lot of phantom indexing issues, and it's very fast even on large tables.
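In this case that would be:

ANALYZE TABLE views_video;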
Update: Also, the number of unique index values is very low compared to the number of rows in the table. MySQL won't use an index when a single index value points to too many rows. Try constraining the query further with another column that's part of the primary key.
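For example (a sketch; the date cutoff is a hypothetical value):

SELECT video_id, views
FROM views_video
WHERE video_id <= 157
  AND record_date >= '2013-01-01';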
I've been messing around all day trying to find out why my query performance is terrible. The query is extremely simple, yet it can take over 15 minutes to execute (I abort it at that point). I am joining a table with over 2 million records.
This is the select:
SELECT
audit.MessageID, alerts.AlertCount
FROM
audit
LEFT JOIN (
SELECT MessageID, COUNT(ID) AS 'AlertCount'
FROM alerts
GROUP BY MessageID
) AS alerts ON alerts.MessageID = audit.MessageID
This is the EXPLAIN:
| id | select_type | table      | type  | possible_keys | key                  | key_len | ref  | rows    | filtered | Extra       |
|  1 | PRIMARY     | AL         | index | NULL          | IDX_audit_MessageID  | 4       | NULL | 2330944 |   100.00 | Using index |
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL                 | NULL    | NULL |  124140 |   100.00 |             |
|  2 | DERIVED     | alerts     | index | NULL          | IDX_alerts_MessageID | 5       | NULL |  124675 |   100.00 | Using index |
This is the schema:
# Not joining, just showing types
CREATE TABLE messages (
ID int NOT NULL AUTO_INCREMENT,
MessageID varchar(255) NOT NULL,
PRIMARY KEY (ID),
INDEX IDX_messages_MessageID (MessageID)
);
# 2,324,931 records
CREATE TABLE audit (
ID int NOT NULL AUTO_INCREMENT,
MessageID int NOT NULL,
LogTimestamp timestamp NOT NULL,
PRIMARY KEY (ID),
INDEX IDX_audit_MessageID (MessageID),
CONSTRAINT FK_audit_MessageID FOREIGN KEY(MessageID) REFERENCES messages(ID)
);
# 124,140
CREATE TABLE alerts (
ID int NOT NULL AUTO_INCREMENT,
AlertLevel int NOT NULL,
Text nvarchar(4096) DEFAULT NULL,
MessageID int DEFAULT 0,
PRIMARY KEY (ID),
INDEX IDX_alert_MessageID (MessageID),
CONSTRAINT FK_alert_MessageID FOREIGN KEY(MessageID) REFERENCES messages(ID)
);
A few very important things to note: MessageID is not 1:1 in either 'audit' or 'alerts'; a MessageID can exist in one table but not the other, or in both (which is the purpose of my join); and in my test DB, none of the MessageIDs exist in both. In other words, my query will return 2.3 million records with 0 as the count.
Another thing to note is that the 'audit' and 'alerts' tables used to store MessageID as varchar(255). I created the 'messages' table expecting that it would fix the join. It actually made things worse: previously the query took 78 seconds; now it never returns.
What am I missing about MySQL?
Subqueries are very hard for the MySQL engine to optimize. Try:
SELECT
audit.MessageID, COUNT(alerts.ID) AS AlertCount
FROM
audit
LEFT JOIN alerts ON alerts.MessageID = audit.MessageID
GROUP BY audit.MessageID
You're joining to a subquery.
The subquery results are effectively a temporary table - note the <derived2> in the query execution plan. As you can see there, they're not indexed, since they're ephemeral.
You should execute the query as a single unit with a join, rather than joining to the results of a second query.
EDIT: Andrew has posted an answer with one example of how to do your work in a normal join query, instead of in two steps.