Does ranging the SQL query speed up the query time? - mysql

There is a table words with word and id columns and 50,000 records. I know that words matching the pattern %XCX%A occur only between id=30000 and id=35000.
Now consider the following queries:
SELECT * FROM words WHERE word LIKE '%XCX%A'
and
SELECT * FROM words WHERE id>30000 and id < 35000 and word LIKE '%XCX%A'
From a performance perspective, is there any difference between them?

Well, let's find out...
Here's a data set of approximately 50000 words. Some of the words (but only in the range 30000 to 35000) follow the pattern described:
EXPLAIN
SELECT * FROM words WHERE word LIKE '%XCX%A';
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | words | index | NULL          | word | 14      | NULL | 50976 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
EXPLAIN
SELECT * FROM words WHERE id>30000 and id < 35000 and word LIKE '%XCX%A';
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | words | range | PRIMARY       | PRIMARY | 4       | NULL | 1768 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
We can see that the first query scans the entire dataset (50976 rows), while the second query only scans rows between the given ids (in my example there are approximately 1768 rows between ids 30000 and 35000; there are lots of unused ids, but that's just a side effect of the way in which the data was created).
So, we can see that by adding the range, MySQL has to scan (at worst) only a tenth of the data set (5000 rows instead of 50000 rows). This isn't going to make much of a difference on such a small dataset, but it will on a dataset 100 or 1000 times this size.
One thing to note is that the two queries will return the same data set (because we know that valid values are only to be found within that id range), but they won't necessarily return the dataset in the same order. For consistency, you would need an ORDER BY clause.
Another thing to note is, of course, that there's no point indexing word (for this query, anyway), because a pattern with a leading wildcard ('%...') cannot use an index.
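The same effect can be sketched programmatically. The snippet below uses SQLite's EXPLAIN QUERY PLAN purely as a portable stand-in for MySQL's EXPLAIN; the data is synthetic and no word actually matches the pattern, which doesn't matter for the plan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)")
con.execute("CREATE INDEX word_idx ON words(word)")
con.executemany("INSERT INTO words VALUES (?, ?)",
                [(i, "w%05d" % i) for i in range(1, 50001)])

def plan(sql):
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

full = plan("SELECT * FROM words WHERE word LIKE '%XCX%A'")
ranged = plan("SELECT * FROM words WHERE id > 30000 AND id < 35000 "
              "AND word LIKE '%XCX%A'")
# The leading '%' defeats word_idx, so the first plan is a scan of all rows;
# the second narrows the read to the primary-key range before applying LIKE.
```

Whatever the engine, the shape is the same as in the EXPLAIN tables above: a scan for the bare LIKE, a primary-key range search once the id bounds are added.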

Related

Mysql: explain returns more rows than the actual number

I have a table which contains 40M rows by actual count:
select count(*) from xxxs;
returns 38000389
but the explain:
mysql> explain select * from xxxs where s_uuid = "21eaef";
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | xxxs  | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 56511776 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.06 sec)
Why does EXPLAIN report 56M rows, which is much larger than 40M?
Thanks
UPDATE
1. The above query may take several minutes. Is that normal? How can I tune the performance?
2. I plan to create an index on s_uuid. I expect it will improve performance. Am I right?
The "rows" in EXPLAIN is an estimate based on statistics that were gathered in the recent past. The value is rarely exact; sometimes it is even off by more than a factor of two.
Still, the estimate is usually "good enough" for the Optimizer to decide how to perform the query.
Another place to see this estimate of row count is via
SHOW TABLE STATUS LIKE 'xxxs';
(As mentioned in a Comment) Adding the following index is likely to speed up select * from xxxs where s_uuid = "21eaef":
INDEX(s_uuid)
I say "likely to" because, if a lot of rows have s_uuid = "21eaef", the Optimizer will shun the index and simply scan the entire table rather than bouncing back and forth between the index's BTree and the data's BTree. You can see the "shun" in EXPLAIN when possible_keys = idx_uuid but key = NULL.
There will be cases where the Optimizer makes the 'wrong' choice. But we can discuss that in another Q&A.
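As a sketch of the suggestion (SQLite's EXPLAIN QUERY PLAN used here as a portable stand-in for MySQL's EXPLAIN; the data is invented), adding INDEX(s_uuid) turns the full scan into a point lookup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE xxxs (id INTEGER PRIMARY KEY, s_uuid TEXT)")
con.executemany("INSERT INTO xxxs (s_uuid) VALUES (?)",
                [("uuid-%06d" % i,) for i in range(100000)])

def detail(sql):
    # Join the human-readable plan details into one string for inspection.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM xxxs WHERE s_uuid = 'uuid-000021'"
before = detail(q)                                   # full table scan
con.execute("CREATE INDEX idx_uuid ON xxxs(s_uuid)")
after = detail(q)                                    # lookup via idx_uuid
```

With 40M real rows the difference is between minutes and milliseconds, assuming the value is reasonably selective.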

Bad query performance. Where do I need to create the indexes?

I have a relatively simple query, but its performance is really bad.
I'm currently at something like 0.800 sec. per query.
Is there anything that I could change to make it faster?
I've already tried indexing the columns used in the WHERE clause and the join, but nothing worked.
Here's the query that I'm using:
SELECT c.bloqueada, c.nomeF, c.confirmadoData
FROM encomendas_c_linhas el1
LEFT JOIN encomendas_c c ON c.lojaEncomenda=el1.lojaEncomenda
WHERE (el1.artigoE IN (342197) OR el1.artigoD IN (342197))
AND el1.desmembrado = 1
Here's the EXPLAIN:
As Bill Karwin asked, here are the queries used to create the tables:
Table "encomendas_c_linhas"
https://pastebin.com/73WcsDnE
Table "encomendas_c"
https://pastebin.com/yCUx3wh0
In your EXPLAIN, we see that it's accessing the el1 table with type: ALL which means it's scanning the whole table. The rows field shows an estimate of 644,236 rows scanned (this is approximate).
In your table definition for the el1 table, you have this index:
KEY `order_item_id_desmembrado` (`order_item_id`,`desmembrado`),
Even though desmembrado appears in an index, that index will not help this query. Your query searches for desmembrado = 1, with no condition for order_item_id.
Think of a telephone book: I can search for people with last name 'Smith' and the order of the phone book helps me find them quickly. But if I search for people with the first name 'Sarah' the book doesn't help. Only if I search on the leftmost column(s) of the index does it help.
So you need an index with desmembrado as the leftmost column. Then the search for desmembrado = 1 might use the index to select those matching rows.
ALTER TABLE encomendas_c_linhas ADD INDEX (desmembrado);
Note that if a large enough portion of the table matches, MySQL skips the index anyway: there's no advantage in using an index that matches a large portion of the rows. In my experience, the optimizer avoids an index if the condition matches more than about 20% of the table's rows.
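The telephone-book rule can be demonstrated directly. A sketch (SQLite stands in for MySQL; only the columns relevant to the query are kept, and the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE encomendas_c_linhas (
    order_item_id INTEGER, desmembrado INTEGER, artigoE TEXT, artigoD TEXT)""")
# Composite index with desmembrado in the SECOND position.
con.execute("CREATE INDEX order_item_id_desmembrado "
            "ON encomendas_c_linhas(order_item_id, desmembrado)")
con.executemany("INSERT INTO encomendas_c_linhas VALUES (?, ?, ?, ?)",
                [(i, i % 2, "E%d" % i, "D%d" % i) for i in range(1000)])

def detail(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM encomendas_c_linhas WHERE desmembrado = 1"
before = detail(q)   # desmembrado is not the leftmost column -> table scan
con.execute("CREATE INDEX idx_desmembrado ON encomendas_c_linhas(desmembrado)")
after = detail(q)    # leftmost column matches the condition -> index search
```

The (order_item_id, desmembrado) index is ignored for this query in both engines; only an index whose leftmost column is desmembrado helps.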
The other conditions are in a disjunction (terms of an OR expression). There's no way to optimize these with a single index. Again, the telephone book example: Search for people with last name 'Smith' OR first name 'Sarah'. The last name lookup is optimized, but the first name lookup is not. No problem, we can make another index with first name listed first, so the first name lookup will be optimized. But in general, MySQL will use only one index per table reference per query. So optimizing the first name lookup will spoil the last name lookup.
Here's a workaround: Rewrite the OR condition into a UNION of two queries, with one term in each query:
SELECT c.bloqueada, c.nomeF, c.confirmadoData
FROM encomendas_c_linhas el1
LEFT JOIN encomendas_c c ON c.lojaEncomenda=el1.lojaEncomenda
WHERE el1.artigoD IN ('342197')
AND el1.desmembrado = 1
UNION
SELECT c.bloqueada, c.nomeF, c.confirmadoData
FROM encomendas_c_linhas el1
LEFT JOIN encomendas_c c ON c.lojaEncomenda=el1.lojaEncomenda
WHERE el1.artigoE IN ('342197')
AND el1.desmembrado = 1;
Make sure there's an index for each case.
ALTER TABLE encomendas_c_linhas
ADD INDEX des_artigoD (desmembrado, artigoD),
ADD INDEX des_artigoE (desmembrado, artigoE);
Each of these compound indexes may be used in the respective subquery, so it will optimize the lookup of two columns in each case.
Also notice I put the values in quotes, like IN ('342197'), because the columns are varchar and you need to compare them to a varchar to make use of the index. Comparing a varchar column to an integer value will still match successfully, but will not use the index.
Here's an EXPLAIN I tested for the previous query, that shows the two new indexes are used, and it shows ref: const,const which means both columns of the index are used for the lookup.
+------+--------------+------------+------+-------------------------+---------------+---------+------------------------+------+-----------------------+
| id   | select_type  | table      | type | possible_keys           | key           | key_len | ref                    | rows | Extra                 |
+------+--------------+------------+------+-------------------------+---------------+---------+------------------------+------+-----------------------+
|    1 | PRIMARY      | el1        | ref  | des_artigoD,des_artigoE | des_artigoD   | 41      | const,const            |    1 | Using index condition |
|    1 | PRIMARY      | c          | ref  | lojaEncomenda           | lojaEncomenda | 61      | test.el1.lojaEncomenda |    2 | NULL                  |
|    2 | UNION        | el1        | ref  | des_artigoD,des_artigoE | des_artigoE   | 41      | const,const            |    1 | Using index condition |
|    2 | UNION        | c          | ref  | lojaEncomenda           | lojaEncomenda | 61      | test.el1.lojaEncomenda |    2 | NULL                  |
| NULL | UNION RESULT | <union1,2> | ALL  | NULL                    | NULL          | NULL    | NULL                   | NULL | Using temporary       |
+------+--------------+------------+------+-------------------------+---------------+---------+------------------------+------+-----------------------+
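Here is a minimal, hypothetical reconstruction of the UNION rewrite (using SQLite's EXPLAIN QUERY PLAN as a portable stand-in for MySQL's EXPLAIN; the data is invented). Each branch of the UNION gets its own composite index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE encomendas_c_linhas (
    lojaEncomenda TEXT, desmembrado INTEGER, artigoE TEXT, artigoD TEXT)""")
con.execute("CREATE INDEX des_artigoD ON encomendas_c_linhas(desmembrado, artigoD)")
con.execute("CREATE INDEX des_artigoE ON encomendas_c_linhas(desmembrado, artigoE)")
con.executemany("INSERT INTO encomendas_c_linhas VALUES (?, ?, ?, ?)",
                [("L%d" % i, 1, "E%d" % i, "D%d" % i) for i in range(1000)])

union_sql = """
SELECT lojaEncomenda FROM encomendas_c_linhas
WHERE artigoD = 'D42' AND desmembrado = 1
UNION
SELECT lojaEncomenda FROM encomendas_c_linhas
WHERE artigoE = 'E42' AND desmembrado = 1
"""
# Each branch of the compound query should show a search on its own index.
details = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + union_sql)]
```

The plan mirrors the EXPLAIN above: one indexed lookup per UNION branch instead of one scan evaluating the whole OR.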
But we can do one step better. Sometimes adding other columns to the index is helpful, because if all columns needed by the query are included in the index, then it doesn't have to read the table rows at all. This is called a covering index and it's indicated if you see "Using index" in the EXPLAIN Extra field.
So here's the new index definition:
ALTER TABLE encomendas_c_linhas
ADD INDEX des_artigoD (desmembrado, artigoD, lojaEncomenda),
ADD INDEX des_artigoE (desmembrado, artigoE, lojaEncomenda);
The third column is not used for lookup, but it's used when joining to the other table c.
You can also get the same covering-index effect by creating the second index suggested in the answer by @TheImpaler:
create index ix2 on encomendas_c (lojaEncomenda, bloqueada, nomeF, confirmadoData);
We see the EXPLAIN now shows the "Using index" note for all table references:
+------+--------------+------------+------+-------------------------+-------------+---------+------------------------+------+--------------------------+
| id   | select_type  | table      | type | possible_keys           | key         | key_len | ref                    | rows | Extra                    |
+------+--------------+------------+------+-------------------------+-------------+---------+------------------------+------+--------------------------+
|    1 | PRIMARY      | el1        | ref  | des_artigoD,des_artigoE | des_artigoD | 41      | const,const            |    1 | Using where; Using index |
|    1 | PRIMARY      | c          | ref  | lojaEncomenda,ix2       | ix2         | 61      | test.el1.lojaEncomenda |    1 | Using index              |
|    2 | UNION        | el1        | ref  | des_artigoD,des_artigoE | des_artigoE | 41      | const,const            |    1 | Using where; Using index |
|    2 | UNION        | c          | ref  | lojaEncomenda,ix2       | ix2         | 61      | test.el1.lojaEncomenda |    1 | Using index              |
| NULL | UNION RESULT | <union1,2> | ALL  | NULL                    | NULL        | NULL    | NULL                   | NULL | Using temporary          |
+------+--------------+------------+------+-------------------------+-------------+---------+------------------------+------+--------------------------+
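The covering-index effect is easy to reproduce. A minimal sketch (SQLite used as a stand-in for MySQL; names follow the answer, data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE encomendas_c_linhas (
    lojaEncomenda TEXT, desmembrado INTEGER, artigoD TEXT)""")
con.executemany("INSERT INTO encomendas_c_linhas VALUES (?, ?, ?)",
                [("L%d" % i, 1, "D%d" % i) for i in range(1000)])
# Third column added only so the index covers everything the query reads.
con.execute("CREATE INDEX des_artigoD "
            "ON encomendas_c_linhas(desmembrado, artigoD, lojaEncomenda)")

q = ("SELECT lojaEncomenda FROM encomendas_c_linhas "
     "WHERE desmembrado = 1 AND artigoD = 'D42'")
detail = " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + q))
# Every referenced column lives in the index, so the table is never touched:
# SQLite reports a COVERING INDEX, MySQL reports "Using index".
```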
I would create the following indexes:
create index ix1 on encomendas_c_linhas (artigoE, artigoD, desmembrado);
create index ix2 on encomendas_c (lojaEncomenda, bloqueada, nomeF, confirmadoData);
The first one is the critical one. The second one will improve performance further.

MySQL composite index column order & performance

I have a table with approx 500,000 rows and I'm testing two composite indexes for it. The first index puts the ORDER BY column last, and the second one is in reverse order.
What I don't understand is why the second index appears to offer better performance, estimating 30 rows to be scanned compared to 889 for the first query; I was under the impression that the second index could not be properly used, since the ORDER BY column is not last. Would anyone be able to explain why this is the case? (MySQL prefers the first index if both exist.)
Note that the second EXPLAIN lists possible_keys as NULL but still lists a chosen key.
1) First index
ALTER TABLE user ADD INDEX test1_idx (city_id, quality);
(cardinality 12942)
EXPLAIN SELECT * FROM user u WHERE u.city_id = 3205 ORDER BY u.quality DESC LIMIT 30;
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
| id | select_type | table | type   | possible_keys | key       | key_len | ref            | rows | Extra       |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
|  1 | SIMPLE      | u     | ref    | test1_idx     | test1_idx | 3       | const          |  889 | Using where |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
2) Second index (same fields in reverse order)
ALTER TABLE user ADD INDEX test2_idx (quality, city_id);
(cardinality 7549)
EXPLAIN SELECT * FROM user u WHERE u.city_id = 3205 ORDER BY u.quality DESC LIMIT 30;
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
| id | select_type | table | type   | possible_keys | key       | key_len | ref            | rows | Extra       |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
|  1 | SIMPLE      | u     | index  | NULL          | test2_idx | 5       | NULL           |   30 | Using where |
+----+-------------+-------+--------+---------------+-----------+---------+----------------+------+-------------+
UPDATE:
The second query does not perform well in a real-life scenario whereas the first one does, as expected. I would still be curious as to why MySQL EXPLAIN provides such opposite information.
The rows in EXPLAIN is just an estimate of the number of rows that MySQL believes it must examine to produce the result.
I remember reading an article by Peter Zaitsev of Percona saying that this number can be very inaccurate, so you cannot simply compare query efficiency based on it.
I agree with you that the first index will produce better result in normal scenarios.
You should have noticed that the type column in the first EXPLAIN is ref, while it is index for the second. ref is usually better than an index scan. As you mentioned, if both keys exist, MySQL prefers the first one.
I guess your data types are:
city_id: MEDIUMINT 3 Bytes
quality: SMALLINT 2 Bytes
As far as I know, for
SELECT * FROM user u WHERE u.city_id = 3205 ORDER BY u.quality DESC LIMIT 30;
the second index (quality, city_id) cannot be fully used, because the ORDER BY amounts to a range scan, and a range scan can only use the last part of an index. The first index fits perfectly.
I guess that sometimes MySQL is not so smart; perhaps the number of rows matching city_id affects which index MySQL decides to use.
You may try the keyword
FORCE INDEX(test1_idx)
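For intuition, here is a small, hypothetical re-run of the scenario (SQLite stands in for MySQL; table and index names follow the question). With (city_id, quality), the equality lookup and the ORDER BY ... DESC are both satisfied by the index, so no separate sort step appears in the plan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, "
            "city_id INTEGER, quality INTEGER)")
con.executemany("INSERT INTO user (city_id, quality) VALUES (?, ?)",
                [(i % 100, i % 7) for i in range(10000)])

def plans(sql):
    return [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql)]

q = "SELECT * FROM user WHERE city_id = 42 ORDER BY quality DESC LIMIT 30"
before = plans(q)   # no index: full scan plus an explicit sort step
con.execute("CREATE INDEX test1_idx ON user(city_id, quality)")
after = plans(q)    # equality lookup; rows come out already in quality order
```

The index is simply read backwards within the city_id = 42 range, which is why putting the ORDER BY column last is the right design here.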

How to speed up mysql select in database with highly redundant key values

I have a very simple MYSQL database with only 3 columns but several millions of rows.
Two of the columns (hid1, hid2) describe study objects (about 50,000 of them) and the third column (score) is the result of a comparison of hid1 with hid2. Thus, the number of rows is max(hid1)*max(hid2), which is quite a big number. Because the table has to be written only once and read many millions of times, I selected a MyISAM table (I hope this was a good idea). Initially, it was planned that I would retrieve score for a given pair of hid1,hid2, but it turned out to be more convenient to retrieve all scores (and hid2 values) for a given hid1.
My table ("result") looks like this:
+-------+-----------------------+------+-----+---------+-------+
| Field | Type                  | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+-------+
| hid1  | mediumint(8) unsigned | YES  | MUL | NULL    |       |
| hid2  | mediumint(8) unsigned | YES  |     | NULL    |       |
| score | float                 | YES  |     | NULL    |       |
+-------+-----------------------+------+-----+---------+-------+
and a typical query would be
select hid1,hid2,score from result where hid1=13531 into outfile "/tmp/ttt"
Here is the problem: the query just takes too long, at least sometimes. For some hid1 values, I get the result back in under a second; for other hid1 (particularly big numbers), I have to wait for up to 40 sec. As I said, I have to run thousands of these queries, so I am interested in speeding things up.
Let me reiterate: there are about 50,000 hits to the query, and I don't need them in any particular order. Am I doing something wrong here, or is a relational database like MySQL not up to this task?
What I already tried is to increase the key_buffer in /etc/mysql/my.conf.
This appeared to help, but not much. The index on hid1 is a few GB; does the key_buffer have to be bigger than the index size to be effective?
Any hint would be appreciated.
Edit: here is an example run with the corresponding 'explain' output:
select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt"
Query OK, 16465 rows affected (31.88 sec)
As you can see below, the index hid1_index is actually being used:
mysql> explain select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt";
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| id | select_type | table  | type | possible_keys | key        | key_len | ref   | rows  | Extra       |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
|  1 | SIMPLE      | result | ref  | hid1_index    | hid1_index | 4       | const | 15456 | Using where |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
What I do find puzzling is the fact that queries with low numbers for hid1 are always much faster than those with high numbers. This is not what I would expect from using an index.
Two random suggestions, based on a query pattern that always involves an equality filter on hid1:
Use an InnoDB table instead and take advantage of a clustered index on (hid1, hid2). That way all rows belonging to the same hid1 will be physically located together, and this will speed up retrieval.
Hash-partition the table on hid1, with a suitable number of partitions.
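The clustered-index suggestion can be sketched outside MySQL too. The snippet below is a hypothetical illustration using SQLite's WITHOUT ROWID tables, which, like an InnoDB clustered PRIMARY KEY (hid1, hid2), store the rows themselves in key order, so all rows for one hid1 sit physically together:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# WITHOUT ROWID keeps the row data in the primary-key BTree itself,
# roughly what InnoDB's clustered primary key does.
con.execute("""CREATE TABLE result (
    hid1 INTEGER, hid2 INTEGER, score REAL,
    PRIMARY KEY (hid1, hid2)) WITHOUT ROWID""")
con.executemany("INSERT INTO result VALUES (?, ?, ?)",
                [(h1, h2, h1 * 0.1 + h2)
                 for h1 in range(100) for h2 in range(100)])

detail = " ".join(r[3] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT hid1, hid2, score FROM result WHERE hid1 = 42"))
rows = con.execute("SELECT COUNT(*) FROM result WHERE hid1 = 42").fetchone()[0]
# One contiguous primary-key range read returns all scores for hid1 = 42.
```

In MyISAM, by contrast, the index lookup yields row positions scattered over a multi-GB data file, which is consistent with the 40-second worst cases described above.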
The simplest way to optimize a query like that would be to use an index. A simple thing like
alter table results add index(hid1)
would improve the query you sent. Even more, if you want to search by both fields at once, you can use both fields in the index.
alter table results add index(hid1, hid2)
That way, MySQL can access results in a very organized way, and find the information you want.
If you run an explain on the first query, you might see something like
| select_type | table | type|possible_keys| rows |Extra
| SIMPLE | results| ALL | | 7765605| Using where
After adding the index, you should see
| select_type | table | type|possible_keys| rows |Extra
| SIMPLE | results| ref |hid1 | 2816304|
Which is telling you, in the first case, that it needs to check ALL the rows, and in the second case, that it can find the information using a ref
If you know the combination of hid1 and hid2 is unique, you should consider making that your primary key. That will automatically also add an index to hid1. See: http://dev.mysql.com/doc/refman/5.5/en/multiple-column-indexes.html
Also, check the output of EXPLAIN. See: http://dev.mysql.com/doc/refman/5.5/en/select-optimization.html and related links.

What does the filtered column of a EXPLAIN'ed query mean in MySQL?

mysql> EXPLAIN EXTENDED SELECT * FROM table WHERE column = 1 LIMIT 10;
+----+-------------+----------+------+---------------+--------------+---------+-------+--------+----------+-------+
| id | select_type | table    | type | possible_keys | key          | key_len | ref   | rows   | filtered | Extra |
+----+-------------+----------+------+---------------+--------------+---------+-------+--------+----------+-------+
|  1 | SIMPLE      | table    | ref  | column        | column       | 1       | const | 341878 |   100.00 |       |
+----+-------------+----------+------+---------------+--------------+---------+-------+--------+----------+-------+
1 row in set, 1 warning (0.00 sec)
What does the filtered column mean, and should the number be high or low? Yes, I've read the docs, but I don't quite understand what the number indicates or what values are considered desirable.
The filtered column is an estimated percentage: of the rows examined, it gives the share expected to remain after the table condition is applied, i.e. the rows that will be joined with the previous table. In your case it's 100%, i.e. all rows.
The rows column, as you presumably know, is an estimate of the number of rows examined by the query, so rows x filtered / 100 is the number of rows left after applying the table condition, i.e. the number of joins that have to be made.
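In code, the relationship is just arithmetic (figures taken from the EXPLAIN outputs in this and the earlier question):

```python
# rows examined times the filtered percentage gives the rows expected
# to survive the table condition and take part in the join.
rows, filtered = 341878, 100.00          # this question's EXPLAIN
surviving = rows * filtered / 100        # 100% -> all examined rows survive

rows2, filtered2 = 56511776, 10.00       # the earlier 40M-row example
surviving2 = rows2 * filtered2 / 100     # only a tenth expected to survive
```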
For additional info, see What's New in MySQL Optimizer - What's new in MySQL Query Optimizer (slide 10).
EDIT:
There is an Appendix in the High performance MySQL (2nd edition) book which suggests that the optimizer currently uses this estimate (filtered) only for the ALL, index, range and index_merge access methods.