MySQL index doesn't work - mysql

I got a weird problem of MySQL index. I have a table views_video:
CREATE TABLE `views_video` (
`video_id` smallint(5) unsigned NOT NULL,
`record_date` date NOT NULL,
`region` char(2) NOT NULL DEFAULT '',
`views` mediumint(8) unsigned NOT NULL
PRIMARY KEY (`video_id`,`record_date`,`region`),
KEY `video_id` (`video_id`)
)
The table contains 3.4 million records.
I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 156
I got:
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| 1 | SIMPLE | views_video | range | PRIMARY,video_id | video_id | 2 | NULL | 587984 | Using where |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
But when I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 157
I got:
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | views_video | ALL | PRIMARY,video_id | NULL | NULL | NULL | 3412892 | Using where |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
video_id is from 1 to 1034. There is nothing special between 156 and 157.
What happens here?
* update *
I have added more data into the database. Now video_id is from 1 to 1064. And the table now has 3.8M records. And the difference become 114 and 115.

I'm guessing that with 3.4 million records, and only 1064 possible entries for your key, your selectivity is very low. (In other words, there are many duplicates, which makes it far less useful as a key.) The optimizer is taking its best guess if it is more efficient to use the key or not. You've found a threshold for that decision.

It might be the key population
Run these
SELECT (COUNT(1)/20) INTO #FivePctOfData FROM views_video;
SELECT COUNT(1) videpidcount,video_id FROM FROM views_video
WHERE id <= 157 GROUP BY video_id;
The query optimizer proabably took a vacation when one one of the key hit the 5% threshold.
You said there are 3.4 million rows. 5% would be 170,000. Perhaps this number was exceeded at some point in the query optimizer's life cycle on your query.

If you've added/deleted substantial data since creating the table, it's worthwhile to try ANALYZE TABLE on it. It frequently solves a lot of phantom indexing issues, and it's very fast even on large tables.
Update: Also, the unique index values are very low compared to the number of rows in the table. MySQL won't use indexes when a single index value points to too many rows. Try constraining the query further with another column that's part of the primary key.

Related

MySQL - Query becomes slow when combining WHERE and ORDER BY

I have a MySQL (v8.0.30) table that stores trades, the schema is the following:
CREATE TABLE `log_fill` (
`id` bigint NOT NULL AUTO_INCREMENT,
`orderId` varchar(63) NOT NULL,
`clientOrderId` varchar(36) NOT NULL,
`symbol` varchar(31) NOT NULL,
`executionId` varchar(255) DEFAULT NULL,
`executionSide` tinyint NOT NULL COMMENT '0 = long, 1 = short',
`executionSize` decimal(15,2) NOT NULL,
`executionPrice` decimal(21,8) unsigned NOT NULL,
`executionTime` bigint unsigned NOT NULL,
`executionValue` decimal(21,8) NOT NULL,
`executionFee` decimal(13,8) NOT NULL,
`feeAsset` varchar(63) DEFAULT NULL,
`positionSizeBeforeFill` decimal(21,8) DEFAULT NULL,
`apiKey` int NOT NULL,
`side` varchar(20) DEFAULT NULL,
`reconciled` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `executionId` (`executionId`,`executionSide`),
KEY `apiKey` (`apiKey`)
) ENGINE=InnoDB AUTO_INCREMENT=6522695 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
As you can see, there's a BTREE index on the column apiKey which stands for a user, this way I can quickly retrieve all trades for a specific user.
My goal is a query that returns positionSizeBeforeFill + executionSize for the last record, given an apiKey and a symbol. So I wrote the following:
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
However the execution is extremely slow (around 100ms). I've noticed that running either WHERE or ORDER BY (and not both together) drastically reduces execution time. For example
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
only takes 220 microseconds to execute. The number of records after filtering by apiKey and symbol is 388.
Similarly,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC
takes 26 microseconds (on a 3 million records table).
All in all, separately running WHERE and ORDER BY takes microseconds of execution, when I combine them we scale up to milliseconds (around 1000x more).
Running EXPLAIN on the slow query it turns out it has to examine 116032 rows.
I tried to create a temporary table hoping for MySQL to perform sorting only on the filtered records, but the outcome is the same. Was wondering whether the problem might be the index (whose cardinality is 203), but how can it be the case when WHERE alone takes very little time? I could not find similar cases on other questions or forums. I think I just fail at understanding how InnoDB selects data, I thought it would first filter by WHERE and then perform ORDER BY on the filtered rows. How can I improve this? Thanks!
Edit: The EXPLAIN statement on the slow query returns
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where; Backward index scan |
The query with WHERE only
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where |
The query with ORDER BY only on the full table
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | index | NULL | PRIMARY | 8 | NULL | 2503238 | 100.00 | Backward index scan |
26 microseconds for any query against a 3-million-row table implies that you have the Query cache enabled. Please rerunning your timings with SELECT SQL_NO_CACHE .... (Even milliseconds would be suspicious.) Were all 3M rows returned? Shoveling that many rows probably takes more than a second under any circumstances.
Meanwhile, to speed up the first two queries, add
INDEX(symbol, apikey, id)
EXPLAIN gives only approximate (eg, 116032) counts. A "cardinality" of 203 is also just an estimate, but it is used by the Optimizer in some situations. Please get the exact count just to check that there really are any rows:
SELECT COUNT(*)
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
With the ORDER BY id DESC, it will scan the entire B+Tree what holds the data. As it says, it will do a 'Backward index scan'. However, since the "index" is the PRIMARY KEY and the PK is clustered with the data, it is really referring to the data's BTree.
The EXPLAIN for the first query decided that the indexes were not useful enough for WHERE; instead it avoided the sort (ORDER BY) by doing the Backward full table scan, same as the 3rd query. (And ignored any rows that did not match the WHERE.
I added a composite index on (apiKey, symbol) and now the query runs in as little as 0.2ms. In order to reduce the creation time for the index I reduce the number of records from 3M to 500K and the gain in time is about 97%, I believe it's going to be more on the full table. I thought using just the apiKey index it would first filter out by user, then by symbol, but I'm probably wrong.

MySQL - Created index isn't showing up as possible key

I have the following table (it has more data columns, removed them because it would be a long post):
CREATE TABLE `members` (
`memberid` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`memberid`),
KEY `members_lname_ix` (`lastname`)
) ENGINE=InnoDB AUTO_INCREMENT=1019 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci;
By default, a user only ever accesses 10-20 rows from this table at a time and it is usually sorted by the lastname column, it's all paginated server side. so I decided to add an index to lastname to help with sorting, however the index does not seem to be working like I would expect it to. when I run EXPLAIN SELECT * FROM members ORDER BY lastname ASC I get:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
1 | simple | members | ALL | null | null | null | null | 711 | using filesort
I can at least confirm the index exists because if I run SHOW INDEX FROM members I get:
Table | Non_Unique | Key_name | Seq_in_ix | Col_name | Collation | Cardinality | Sub part | Packed | Null | Ix type
members | 0 | PRIMARY | 1 | memberid | A | 711 | null | null | (blank) | BTREE
members | 1 | members_lname_ix | 1 | lastname | A | 711 | null | null | YES | BTREE
if I add USE INDEX (members_lname_ix) both possible_keys and key will remain null. However if I add FORCE INDEX (members_lname_ix) possible_keys remains null and key shows members_lname_ix. This is my first time trying to apply indexing but to me this doesn't seem very intuitive - it feels like mysql should know that I created an index for lastname, no? I can't quite figure out what I'm doing wrong here unless I am misunderstanding something. Is the solution here to just keep using FORCE INDEX?
There are two ways to perform that query:
Plan A (as you were expecting):
Scan through the index sequentially, reading the entire (estimated) 711 rows.
Randomly look up each row in the data BTree. This involves reading the entire dataset.
Deliver the data in order.
Plan B (what it does):
Scan through the data, reading all 711 rows.
Sort the data
Deliver the sorted data.
Plan B does not touch the index at all; this was deemed to be a bigger savings than not having to sort the data.
In a table as tiny as yours, it would be hard to see a difference in speed. (In my test case, it took under 10 milliseconds either way.) In huge tables, the difference could be significant.
For optimal pagination, see http://mysql.rjweb.org/doc.php/pagination

mariadb optimisation of primary key not working

If you use a count on a non-null-column, on one table, without any where-parts, the optimaizer just return the number of rows in that table.
If you ask for a DISTINCT count on a UNIQE non-null-column, like the PRIMARY KEY, the answers should be the same, but this time mariadb do the calculations insted.
And if you have left join on other tables, and still no where-parts, the results should still be the number of rows in that table.
Is there a reason for mariadb not using thous optimizations? Is there case when the DISTINCT count of an unfiltered primary key, could give any other result then the number of rows in that tabel?
case:
CREATE TABLE products (
our_article_id varchar(50) CHARACTER SET utf8 NOT NULL,
...,
PRIMARY KEY(our_article_id)
);
CREATE TABLE product_article_id (
article_id varchar(255) COLLATE utf8_bin NOT NULL,
our_article_id varchar(50) CHARACTER SET utf8 NOT NULL,
...
PRIMARY KEY(article_id),
INDEX(our_article_id)
);
Count queries, 1st, basic count
DESCRIBE SELECT COUNT(our_article_id) FROM products;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
2nd DISTINCT on primary key
DESCRIBE SELECT COUNT(DISTINCT our_article_id) FROM products;
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
| 1 | SIMPLE | products | index | NULL | PRIMARY | 152 | NULL | 225089 | Using index |
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
3th, DISTINCT on PRIMARY KEY, and a LEFT JOIN without WHERE-parts
DESCRIBE SELECT COUNT(DISTINCT our_article_id) FROM products LEFT JOIN product_article_id USING (our_article_id);
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
| 1 | SIMPLE | products | index | NULL | PRIMARY | 152 | NULL | 225089 | Using index |
| 1 | SIMPLE | product_article_id | ref | PRIMARY | PRIMARY | 152 | testseek.products.our_article_id | 12579 | Using index |
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
"Is there a reason for mariadb not using thous optimizations?" -- There are a zillion missing optimizations in MySQL/MariaDB; that's missing. Let's look at the history.
MySQL started about 2 decades ago as a lean and mean database engine. It focused on features that most people needed, while minimizing the overhead. This meant that a lot of rare optimizations were not in the early releases, and only get added over time if they seem important enough.
Take the PRIMARY KEY, for example. It is defined as UNIQUE. It is BTree organized. And, with InnoDB, it is also defined as Clustered. Other vendors allow various combinations clustering, non-BTree indexing, etc. MySQL decided that the limitations were "good enough" for "most" people.
Over the years, the 'worst' omissions have been gradually fixed. Transactions is probably the biggest and most important. It arrived in 2001(?), and MyISAM is being removed this year (2016) with the advent of 8.0.
4.1 (2002?) saw subqueries. Before that, creating a tmp table was "good enough". Now (8.0) subqueries are being one-upped by CTEs, which covers a few things that neither tmp tables nor subqueries can do efficiently.
There have been a huge number of optimizations put into MySQL 5.6 and 5.7 and MariaDB 10.x; you probably have not used more than a couple of them. The product is into "diminishing returns". It would damage its "lean and mean" heritage if it slowed down the optimizer to check for the next thousand extremely rare optimizations.
Meanwhile, guys like me spend a lot of time saying "MySQL/MariaDB doesn't have that; here's the workaround". It's the shorter COUNT(*) in your case. Since there is a clean workaround, it may be another decade before your suggestions are implemented. It is OK to file a bug report with bugs.mysql.com or mariadb.com to suggest the optimizations.
Another, almost never needed case, is INDEX(a ASC, b DESC) as a way of optimizing ORDER BY a ASC, b DESC. That is coming with 8.0. But I doubt if more than one query in 5,000 really needs it. (I have seen a lot of queries.) I suggest that its rarity is why it took two decades to implement it. The lack of a clean workaround is why it did not take another decade.

Is there a way to hint mysql to use Using index for group-by

I was busying myself with exploring GROUP BY optimizations. On a classical "max salary per departament" query. And suddenly weird results. The dump below goes straight from my console. NO COMMAND were issued between these two EXPLAINS. Only some time had passed.
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 1 | |
| 2 | DERIVED | emploee | index | NULL | dep_id | 8 | NULL | 84 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 3 | |
| 2 | DERIVED | emploee | range | NULL | dep_id | 4 | NULL | 9 | Using index for group-by |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
As you may notice, it examined ten times less rows in second run. I assume it's because some inner counters got changed. But I don't want to depend on these counters. So - is there a way to hint mysql to use "Using index for group by" behavior only?
Or - if my speculations are wrong - is there any other explanation on the behavior and how to fix it?
CREATE TABLE `emploee` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`dep_id` int(11) NOT NULL,
`salary` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `dep_id` (`dep_id`,`salary`)
) ENGINE=InnoDB AUTO_INCREMENT=85 DEFAULT CHARSET=latin1 |
+-----------+
| version() |
+-----------+
| 5.5.19 |
+-----------+
Hm, showing the cardinality of indexes may help, but keep in mind: range's are usually slower then indexes there.
Because it think it can match the full index in the first one, it uses the full one. In the second one, it drops the index and goes for a range, but guesses the total number of rows satisfying that larger range wildly lower then the smaller full index, because all cardinality has changed. Compare it to this: why would "AA" match 84 rows, but "A[any character]" match only 9 (note that it uses 8 bytes of the key in the first, 4 bytes in the second)? The second one will in reality not read less rows, EXPLAIN just guesses the number of rows differently after an update on it's metadata of indexes. Not also that EXPLAIN does not tell you what a query will do, but what it probably will do.
Updating the cardinality can or will occur when:
The cardinality (the number of different key values) in every index of a table is calculated when a table is opened, at SHOW TABLE STATUS and ANALYZE TABLE and on other circumstances (like when the table has changed too much). Note that all tables are opened, and the statistics are re-estimated, when the mysql client starts if the auto-rehash setting is set on (the default).
So, assume 'at any point' due to 'changed too much', and yes, connecting with the mysql client can alter the behavior in choosing indexes of a server. Also: reconnecting of the mysql client after it lost its connection after a timeout counts as connecting with auto-rehash AFAIK. If you want to give mysql help to find the proper method, run ANALYZE TABLE once in a while, especially after heavy updating. If you think the cardinality it guesses is often wrong, you can alter the number of pages it reads to guess some statistics, but keep in mind a higher number means a longer running update of that cardinality, and something you don't want to happen that often when 'data has changed to much' on a table with a lot of operations.
TL;DR: it guesses rows differently, but you'd actually prefer the first behavior if the data makes that possible.
Adding:
On this previously linked page, we can probably also find why especially dep_id might have this problem:
small values like 1 or 2 can result in very inaccurate estimates of cardinality
I could imagine the number of different dep_id's is typically quite small, and I've indeed observed a 'bouncing' cardinality on non-unique indexes with quite a small range compared to the number of rows in my own databases. It easily guesses a range of 1-10 in the hundreds and then down again the next time, just based on the specific sample pages it picks & some algorithm that tries to extrapolate that.

mysql does not use primary bigint index

iam fighting with some performance problems on a very simple table which seems to be very slow when fetching data by using its primary key (bigint)
I have this table with 124 million entries:
CREATE TABLE `nodes` (
`id` bigint(20) NOT NULL,
`lat` float(13,7) NOT NULL,
`lon` float(13,7) NOT NULL,
PRIMARY KEY (`id`),
KEY `lat_index` (`lat`),
KEY `lon_index` (`lon`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and a simple query which takes some id from another table using the IN clause to fetch data from the nodes tables, but it takes like 1 hour only to fetch a few rows from this table.
EXPLAIN shows me its not using the PRIMARY key as index, its simply scanning the whole table. Why that? id and the id column from the other table are both from type bigint(20).
mysql> EXPLAIN SELECT lat, lon FROM nodes WHERE id IN (SELECT node_id FROM ways_elements WHERE way_id = '4962890');
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
| 1 | PRIMARY | nodes | ALL | NULL | NULL | NULL | NULL | 124035228 | Using where |
| 2 | DEPENDENT SUBQUERY | ways_elements | ref | way_id | way_id | 8 | const | 2 | Using where |
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
The query SELECT node_id FROM ways_elements WHERE way_id = '4962890' simply returns two node ids, so the whole query should only return two rows, but it takes more or less 1 hour.
Using "force index (PRIMARY)" didnt help, even if it would help, why does MySQL not take that index since its a primary key? EXPLAIN doesnt even mention anything in the possible_keys columns but select_type shows PRIMARY.
Am i doing something wrong?
How does this perform?
SELECT lat, lon FROM nodes t1 join ways_elements t2 on (t1.id=t2.node_id) WHERE t2.way_id = '4962890'
I suspect that your query is checking each row in nodes against each item in the "IN" clause.
This is what is called a correlated subquery. You can see this as reference or this popular question posted on Stackoverflow. A better query to use is:
SELECT lat,
lon
FROM nodes n
JOIN ways_elements w ON n.id = w.node_id
WHERE way_id = '4962890'