Indexing the prefix of a varchar - mysql

I have a 20 Million records table that has this column:
category_value varchar(4000).
I also have an index on this column, but still get bad results:
mysql> select count(*) from daniel_table where category_value like 'giraffe%';
+----------+
| count(*) |
+----------+
| 107130 |
+----------+
1 row in set (2 min 4.33 sec)
So what i did, is created a new field -
Short_value varchar(32)
Which contains- Substr(category_value, 1, 32)
And of course has an index.
Now, when i search this field its better:
mysql> select count(*) from daniel_table where short_value like 'giraffe%';
+----------+
| count(*) |
+----------+
| 107130 |
+----------+
1 row in set (1.36 sec)
However, I don't want to create a new field, it's duplicate.
I tried creating this index on the original varchar(4000):
key cat32(category_value(32))
And got this result:
mysql> select count(*) from daniel_table use index (cat32) where Category_Value like 'giraffe%' ;
+----------+
| count(*) |
+----------+
| 107130 |
+----------+
1 row in set (24.60 sec)
Which is still bad performance comparing to the varchar(32) field.
When looking into this a little bit I've figured out the problem.
As you can see in the explain, there is no "using index" in the extra field.
Meaning, Mysql fetch the table for each row its found.
Why does it do it?
How can it be avoided?
The explain:
+----+-------------+--------------------------------+-------+---------------+-------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------------------+-------+---------------+-------+---------+------+--------+-------------+
| 1 | SIMPLE | daniel_table | range | cat32 | cat32 | 34 | NULL | 195824 | Using where |
+----+-------------+--------------------------------+-------+---------------+-------+---------+------+--------+-------------+
Thanks alot!!

Related

How to build index and query with time range and id sort?

Here is the table data
id
time
amount
1
20221104
15
2
20221104
10
3
20221105
7
4
20221105
19
5
20221106
10
The id and time field is asc, but time can be same.
The rows are very large, so we don't want to use page limit offset method, but with cursor id.
first query:
select * from t where time > xxx and time < yyy order by id asc limit 10;
get the biggest id zzz, then
next query:
select * from t where time > xxx and time < yyy and id > zzz order by id asc limit 10;
How should I build the index?
If I use id as index, the time range will cause huge scan if time is far away.
And If I use time as index, seek id will not be effective.
The following index should be enough for both queries:
alter table t add index `time_id` (`time`,`id`);
Note, use proper date/datetime data types , will save a lot of pain in the future
The key is composite index by leftmost prefixing principle. But both queries here start with
range expression. So I suppose that simply creating index on (a, b) is unable to optimize effectively
because indexing process stops after range condition. It is enough to create index like this:
CREATE INDEX index_time ON t (`time`)
More can be referenced here:
https://www.ibm.com/docs/en/informix-servers/12.10?topic=indexes-use-composite
https://orangematter.solarwinds.com/2019/02/05/the-left-prefix-index-rule/
First I agree with #ErgestBasha Suggestion:
If you follow the general performance rules:
CREATE TABLE t (id INT, time DATE, amount DEC(3,1));
Query OK, 0 rows affected (0.02 sec)
mysql> INSERT INTO t VALUES
-> (1, '2022-11-04', 15),
-> (2, '2022-11-04', 10),
-> (3, '2022-11-05', 7),
-> (4, '2022-11-05', 19),
-> (5, '2022-11-06', 10);
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM t;
+------+------------+--------+
| id | time | amount |
+------+------------+--------+
| 1 | 2022-11-04 | 15.0 |
| 2 | 2022-11-04 | 10.0 |
| 3 | 2022-11-05 | 7.0 |
| 4 | 2022-11-05 | 19.0 |
| 5 | 2022-11-06 | 10.0 |
+------+------------+--------+
5 rows in set (0.00 sec)
mysql> ALTER TABLE t ADD INDEX idx_time_id (time,id);
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW INDEXES FROM t;
+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| t | 1 | idx_time_id | 1 | time | A | 3 | NULL | NULL | YES | BTREE | | | YES | NULL |
| t | 1 | idx_time_id | 2 | id | A | 5 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.01 sec)
mysql> SELECT * FROM t WHERE time > '2022-11-04' AND time < '2022-11-06' AND id > 3 ORDER BY id ASC LIMIT 10;
+------+------------+--------+
| id | time | amount |
+------+------------+--------+
| 4 | 2022-11-05 | 19.0 |
+------+------------+--------+
1 row in set (0.00 sec)
mysql> EXPLAIN SELECT * FROM t WHERE time > '2022-11-04' AND time < '2022-11-06' AND id > 3 ORDER BY id ASC LIMIT 10;
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+---------------------------------------+
| 1 | SIMPLE | t | NULL | range | idx_time_id | idx_time_id | 4 | NULL | 2 | 33.33 | Using index condition; Using filesort |
+----+-------------+-------+------------+-------+---------------+-------------+---------+------+------+----------+---------------------------------------+
1 row in set, 1 warning (0.00 sec)
As you can see, It uses the indexes defined (time,id) and uses the range scan access method. Also Extra column you can see that index is used during operation!
See this for iterating through a compound key:
http://mysql.rjweb.org/doc.php/deletebig#iterating_through_a_compound_key
It cannot be done with two ANDs; tt needs one AND and one OR.
See this for why OFFSET should be avoided when Paginating

MySQL JOINing a TABLE to itself without a primary key

I created two tables in mysql. Each contains an integer idx and a string name. In one, the idx was the primary key.
CREATE TABLE table_indexed (
idx INTEGER,
name VARCHAR(24),
PRIMARY KEY(idx)
);
CREATE TABLE table_not_indexed (
idx INTEGER,
name VARCHAR(24)
);
I then added the same data to both tables. 3 million lines of distinct values to idx (1-3_000_00, randomly arranged) and 3 million random arrangements of 8 lowercase characters to name.
Then I ran a query where I joined each table to itself. The table without the primary key runs almost 3 times as fast.
mysql> SELECT COUNT(*)
-> FROM table_indexed t1 JOIN table_indexed t2
-> ON t1.idx = t2.idx;
+----------+
| COUNT(*) |
+----------+
| 3000000 |
+----------+
1 row in set (11.80 sec)
mysql> SELECT COUNT(*)
-> FROM table_not_indexed t1 JOIN table_not_indexed t2
-> ON t1.idx = t2.idx;
+----------+
| COUNT(*) |
+----------+
| 3000000 |
+----------+
1 row in set (4.12 sec)
EDIT: Asked mySQL to Explain the query.
mysql> EXPLAIN SELECT COUNT(*)
-> FROM table_indexed t1 JOIN table_indexed t2
-> ON t1.idx = t2.idx;
+----+-------------+-------+------------+--------+---------------+---------+---------+--------------------------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+--------------------------+---------+----------+-------------+
| 1 | SIMPLE | t1 | NULL | index | PRIMARY | PRIMARY | 4 | NULL | 3171970 | 100.00 | Using index |
| 1 | SIMPLE | t2 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | index_test3000000.t1.idx | 1 | 100.00 | Using index |
+----+-------------+-------+------------+--------+---------------+---------+---------+--------------------------+---------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT COUNT(*)
-> FROM table_not_indexed t1 JOIN table_not_indexed t2
-> ON t1.idx = t2.idx;
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 2993208 | 100.00 | NULL |
| 1 | SIMPLE | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 2993208 | 10.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql>
In both cases it does a table scan of t1, then looks for the matching row in t2.
In this case USING INDEX is equivalent to using the PK when the PK is involved. (EXPLAIN is a bit sloppy and inconsistent in this area.)
Sometimes you can get more details with EXPLAIN FORMAT=JSON SELECT .... (Might not be anything useful in this case.)
"rows" is just an estimate.
The non-indexed case reads t2 entirely into memory and builds a Hash index on it. With too small a value for join_buffer_size, you can experience the alternative -- repeated full table scans of t2.
Your experiment is a good example of when the "join buffer" is good, but not as good as an appropriate index.
Your experiment would probably come out the same with two separate tables instead of a "self-join".
"3 times as fast" -- I would expect a lot of variation in the "3" for different test cases.
For more on join_buffer_size, BNL, and BKA (Block Nested-Loop or Batched Key Access), see https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_join_buffer_size
It is potentially unsafe to set join_buffer_size bigger than 1% of RAM.

MySQL Similar Queries taking longer

So I have a table about 2GB in size, I run 2 queries and one takes about 200ms and the other takes over 8s, both return '0' which is correct. I added device_id and time_server as indexes and assumed it would make them quicker, it did.. well for one of the queries. So why is there huge difference in time taken to query the same table?
Have I just been unlucky in that one query is running in memory and the other is running from disk as I've hit the limit of innodb_buffer_pool_size?
Why the difference in the row counts from EXPLAIN, if both return a count of 0, I'd have thought it would do a full table scan and the row count would be identical?
Its worth noting that CPU, RAM, Disk I/O etc.. are all fine with nothing obvious that could slow it down. Repeatedly running the queries gives the same results, so its consistent.
Query 1
mysql> SELECT count(*) AS count
FROM mydb.gps
WHERE device_id = 780 AND time_server > '2021-08-03 16:32:48';
+-------+
| count |
+-------+
| 0 |
+-------+
1 row in set (8.20 sec)
Query 2:
mysql> SELECT count(*) AS count
FROM mydb.gps
WHERE device_id = 430 AND time_server > '2021-08-03 16:32:48';
+-------+
| count |
+-------+
| 0 |
+-------+
1 row in set (0.02 sec)
If I run an explain on them:
Query 1
mysql> EXPLAIN
-> SELECT count(*) AS count FROM mydb.gps WHERE device_id = 780 AND time_server > '2021-08-03 16:32:48';
+----+-------------+-------+------------+------+-----------------------+-----------+---------+-------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+-----------------------+-----------+---------+-------+--------+----------+-------------+
| 1 | SIMPLE | gps | NULL | ref | device_id,time_server | device_id | 5 | const | 282416 | 2.12 | Using where |
+----+-------------+-------+------------+------+-----------------------+-----------+---------+-------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Query 2
mysql> EXPLAIN
-> SELECT count(*) AS count FROM mydb.gps WHERE device_id = 430 AND time_server > '2021-08-03 16:32:48';
+----+-------------+-------+------------+------+-----------------------+-----------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+-----------------------+-----------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | gps | NULL | ref | device_id,time_server | device_id | 5 | const | 2001 | 2.12 | Using where |
+----+-------------+-------+------------+------+-----------------------+-----------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
You have more than 100x more rows with the first device_id, so the query has more rows to scan to check the time_server value. You may be able to improve it by creating a multi-column index:
ALTER TABLE gps ADD INDEX (device_id, time_server);

Find value within a range in database table

I need the SQL equivalent of this.
I have a table like this
ID MN MX
-- -- --
A 0 3
B 4 6
C 7 9
Given a number, say 5, I want to find the ID of the row where MN and MX contain that number, in this case that would be B.
Obviously,
SELECT ID FROM T WHERE ? BETWEEN MN AND MX
would do, but I have 9 million rows and I want this to run as fast as possible. In particular, I know that there can be only one matching row, I now that the MN-MX ranges cover the space completely, and so on. With all these constraints on the possible answers, there should be some optimizations I can make. Shouldn't there be?
All I have so far is indexing MN and using the following
SELECT ID FROM T WHERE ? BETWEEN MN AND MX ORDER BY MN LIMIT 1
but that is weak.
If you have an index spanning MN and MX it should be pretty fast, even with 9M rows.
alter table T add index mn_mx (mn, mx);
Edit
I just tried a test w/ a 1M row table
mysql> select count(*) from T;
+----------+
| count(*) |
+----------+
| 1000001 |
+----------+
1 row in set (0.17 sec)
mysql> show create table T\G
*************************** 1. row ***************************
Table: T
Create Table: CREATE TABLE `T` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`mn` int(10) DEFAULT NULL,
`mx` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `mn_mx` (`mn`,`mx`)
) ENGINE=InnoDB AUTO_INCREMENT=1048561 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> select * from T order by rand() limit 1;
+--------+-----------+-----------+
| id | mn | mx |
+--------+-----------+-----------+
| 112940 | 948004986 | 948004989 |
+--------+-----------+-----------+
1 row in set (0.65 sec)
mysql> explain select id from T where 948004987 between mn and mx;
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 239000 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)
mysql> select id from T where 948004987 between mn and mx;
+--------+
| id |
+--------+
| 112938 |
| 112939 |
| 112940 |
| 112941 |
+--------+
4 rows in set (0.03 sec)
In my example I just had an incrementing range of mn values and then set mx to +3 that so that's why I got more than 1, but should apply the same to you.
Edit 2
Reworking your query will definitely be better
mysql> explain select id from T where mn<=947892055 and mx>=947892055;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 9 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
It's worth noting even though the first explain reported many more rows to be scanned I had enough innodb buffer pool set to keep the entire thing in RAM after creating it; so it was still pretty fast.
If there are no gaps in your set, a simple gte comparison will work:
SELECT ID FROM T WHERE ? >= MN ORDER BY MN ASC LIMIT 1

Why is this MySQL JOIN statement returning more results?

I have two (characteristic_list and measure_list) tables that are related to each other by a column called 'm_id'. I want to retrieve records using filters (columns from characteristic_list) within a date range (columns from measure_list). When I gave the following SQL using INNER JOIN, it takes a while to retrieve the record. What am I doing wrong?
mysql> explain select c.power_set_point, m.value, m.uut_id, m.m_id, m.measurement_status, m.step_name from measure_list as m INNER JOIN characteristic_lis
t as c ON (m.m_id=c.m_id) WHERE (m.sequence_end_time BETWEEN '2010-06-18' AND '2010-06-20');
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | c | ALL | NULL | NULL | NULL | NULL | 82952 | |
| 1 | SIMPLE | m | ALL | NULL | NULL | NULL | NULL | 85321 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
2 rows in set (0.00 sec)
mysql> select count(*) from measure_list;
+----------+
| count(*) |
+----------+
| 83635 |
+----------+
1 row in set (0.18 sec)
mysql> select count(*) from characteristic_list;
+----------+
| count(*) |
+----------+
| 83635 |
+----------+
1 row in set (0.10 sec)
The reason this query takes a while to execute is because it has to scan the entire table. You never want to see "ALL" as the type of the query. To speed things up, you need to make smart decisions about what to index.
See the following documents at the MySQL site:
http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html
http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
As an add-on to the previous answer by Dan, you should consider indexing the join columns and the where columns. In this case, that means the m_id cols in both tables and the sequence_end_time in the measure_list table. They are small enough that you could add an index, run explain plan and time it, then change the index and compare. Should be relatively quick to solve.