Can someone explain this... before I have myself committed? The first result set should have two results, same as the second, no?
mysql> SELECT * FROM kuru_footwear_2.customer_address_entity_varchar
-> WHERE attribute_id=31 AND entity_id=324134;
+----------+----------------+--------------+-----------+-------+
| value_id | entity_type_id | attribute_id | entity_id | value |
+----------+----------------+--------------+-----------+-------+
| 885263 | 2 | 31 | 324134 | NULL |
+----------+----------------+--------------+-----------+-------+
1 row in set (0.00 sec)
mysql> SELECT * FROM kuru_footwear_2.customer_address_entity_varchar
-> WHERE value_id=885263 OR value_id=950181;
+----------+----------------+--------------+-----------+-------+
| value_id | entity_type_id | attribute_id | entity_id | value |
+----------+----------------+--------------+-----------+-------+
| 885263 | 2 | 31 | 324134 | NULL |
| 950181 | 2 | 31 | 324134 | NULL |
+----------+----------------+--------------+-----------+-------+
2 rows in set (0.00 sec)
attribute_id is a SMALLINT(5)
entity_id is a INT(10)
The problem is that you have a unique index on (entity_id,attribute_id). The query optimizer notices this when you write a query whose WHERE clause is covered by the index, and only returns 1 row since the uniqueness of the index implies that there's at most one matching row.
I'm not sure how you can have those duplicates in the first place, it seems like there's something corrupted in the table. Adding a unique index to a table will normally remove any duplicates. In fact, this is often suggested as a way to get rid of duplicates in a table, see How do I delete all the duplicate records in a MySQL table without temp tables.
In the first statement's selection (the 'WHERE' clause), you are using AND; in the second statement, you are using OR. This boils down to the definition of these logical operators. MySQL's official documentation doesn't say much about these other than that AND and OR are their own natural logical operators. If this is confusing, you may want to read up on basic Boolean Algebra.
Related
My paintings table looks like this
| id | artist_id | name
| 1 | 7 | landscape painting
| 2 | 15 | flowers painting
| 3 | 15 | scuffed painting
The artist_id is indexed and the name has a fulltext index on it. The table contains about 10M record.
Queries that match the name agains some keywords are ok:
select * from `paintings` where match (`name`) against ('+scuffed*' in boolean mode) limit 10;
10 rows in set (0.04 sec)
But when I sometimes want to only check for a certain painting done by a certain artist:
select * from `paintings` where `artist_id` = 15 and match (`name`) against ('+scuffed*' in boolean mode) limit 10;
7 rows in set (0.40 sec)
As you can see it takes 10x longer to run the query when I include the artist_id. I also tried running a nested query in order to get only paintings that have specific ids:
select * from `paintings` where id in (SELECT id from paintings where artist_id = 15) and match (`name`) against ('+scuffed*' in boolean mode) limit 10;
7 rows in set (0.44 sec)
This ended up being even slower.
How can this query be optimized to work well with and without a where clause on the artist_id?
Thank you!
You need to create a COMPOSITE INDEX KEY consisting of columns (id and artist_id) to speed up your query:
mysql> ALTER TABLE paintings ADD INDEX cmp_id_artist_id (id, artist_id);
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW INDEXES FROM paintings;
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| paintings | 1 | cmp_id_artist_id | 1 | id | A | 3 | NULL | NULL | | BTREE | | | YES | NULL |
| paintings | 1 | cmp_id_artist_id | 2 | artist_id | A | 3 | NULL | NULL | YES | BTREE | | | YES | NULL |
| paintings | 1 | ftindex_name | 1 | name | NULL | 3 | NULL | NULL | YES | FULLTEXT | | | YES | NULL |
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
3 rows in set (0.00 sec)
And now you can test again your 2nd query:
mysql> select * from `paintings` where `artist_id` = 15 and match (`name`) against ('+scuffed*' in boolean mode) limit 10;
+----+-----------+------------------+
| id | artist_id | name |
+----+-----------+------------------+
| 3 | 15 | scuffed painting |
+----+-----------+------------------+
1 row in set (0.00 sec)
Run EXPLAIN SELECT ... to see how the query is performed.
I think your second query is as optimal is can be. MySQL will perform the MATCH first, then check any other conditions.
You could add INDEX(artist_id), but I don't think that will help.
More
Let me provide another approach, then talk through the pros an cons.
PRIMARY KEY(artist_id, id),
INDEX(id),
FULLTEXT(name),
-- When searching _only_ by `name`:
WHERE MATCH(name) AGAINST('+string' IN BOOLEAN MODE)
-- When searching by artist _and_ name:
WHERE artist_id = 123
AND name LIKE '%string%';
Comments:
Without the artist, you get the full speed of FULLTEXT.
With artist, you abandon FULLTEXT (because it seems not to work for your dataset) and switch to the composite index, plus need to look only at those rows with the given artist_id.
id needs to be indexed to keep AUTO_INCREMENT happy; a simple INDEX(id) suffices.
The PRIMARY KEY must be unique; including id suffices.
Starting the PK with `artist_id clusters the rows for a given artist together, thereby speeding the query up (some for small tables, a lot for big tables).
This requires that you create two different queries, but that is probably not a big problem.
For an artist that has a lot of works, it may be 'too' slow -- because it is failing to use FULLTEXT.
FULLTEXT and LIKE have different rules for what will/won't match. So you may get different answers.
Another recent Question discussed why AGAINST('... +the ...' IN BOOLEAN MODE) never finds any rows.
AGAINST("+color +purple") will find purple color, but LIKE '%color purple% won't.
Be careful about benchmarking any approaches -- the performance will depend on how many works an artist has, and many other un-obvious differences.
SELECT AVG(table1.column1) as a,
table2.column2
FROM table1
LEFT OUTER JOIN table2
ON table2.column2 = table1.column2
GROUP BY table2.column2 ORDER BY a DESC LIMIT 10
This is MySQL code. I have 1.5 Million rows in table1, 200.000 rows in table2.
I am still waiting for the query to finish.
Does anybody know a way to work in a shorter time?
Lot of comments in the same vein but I thought I'd give a thorough answer. I'm gonna use one of my own tables/databases here for explanation. Let's take this query:
SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%"
This query returns about 851 and takes 0.5 seconds. If we add the word EXPLAIN to the query, MySQL tells us what this query is doing.
mysql> EXPLAIN SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%";
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1183 | Using where |
| 1 | SIMPLE | B | ALL | NULL | NULL | NULL | NULL | 6594 | |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
2 rows in set (0.00 sec)
Important column to look at here is the rows as this is the number of records MySQL is having to look at and in this case for tables A and B it is having to look up all the rows even though there's only 851 that fit the condition. This is how tables can get out of hand quickly, this only has 6594 record to search through but left alone this could easily reach your 1.5 million rows.
So we can cut this down by adding an index to the table, allowing MySQL to store a reference for each record.
ALTER TABLE AmazonWishlistItemPrices ADD INDEX idx_asin (asin)
This simply says create an index called idx_asin and use the column asin to do the indexing. If we re run our EXPLAIN...
mysql> EXPLAIN SELECT A.id, B.asin FROM AmazonWishlistItems A LEFT JOIN AmazonWishlistItemPrices B ON (B.asin = A.asin) WHERE A.asin LIKE "%C%";
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
| 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 1183 | Using where |
| 1 | SIMPLE | B | ref | idx_asin | idx_asin | 12 | mah_database.A.asin | 6 | Using index |
+----+-------------+-------+------+---------------+----------+---------+---------------------+------+-------------+
2 rows in set (0.00 sec)
We're down to six rows and you can see in the possible_keys it's found our index. You may find that with certain joins and where clauses your indexes are being ignored that's simply MySQL saying "I'm going to have to get all the data anyway" because of the conditions you've provided in the WHERE condition.
It's best to use numeric keys for indexing, you can get away with some varchars but they do take up disk space. You should have a PRIMARY KEY on each table where possible. So look at your database structure and consider adding some indexes.
Final thing to check if your table has indexes you can use SHOW CREATE TABLE followed by the table name.
I am trying to figure out the most efficient way to extract values from database that has the structure similar to this:
table test:
int id (primary, auto increment)
varchar(50) stuff,
varchar(50) important_stuff;
where I need to do a query like
select * from test where important_stuff like 'prefix%';
The size of the entire table is approximately 10 million rows, however there are only about 500-1000 distinct values for important_stuff. My current solution is indexing important_stuff however the performance is not satisfactory. Will it be better to create a separate table that will match distinct important_stuff to a certain id, which will be stored in the 'test' table and then do
(select id from stuff_lookup where important_stuff like 'prefix%') a join select * from test b where b.stuff_id=a.id
or this:
select * from test where stuff_id exists in(select id from stuff_lookup where important_stuff like 'prefix%')
What is the best way to optimize things like that?
How big is innodb_buffer_pool_size? How much RAM is available? The former should be about 70% of the latter. You'll see in a minute why I bring up this setting.
Based on your 3 suggested SELECTs, the original one will work as good as the two complex ones. In some other case, the complex formulation might work better.
INDEX(important_stuff) is the 'best' index for
select * from test where important_stuff like 'prefix%';
Now, let's study how that query works with that index:
Reach into the BTree index, starting at 'prefix'. (Effort: Virtually instantaneous)
Scan forward for, say, 1000 entries. That will be about 10 InnoDB blocks (16KB each). Each entry will have the PRIMARY KEY (id). (Effort: <= 10 disk hits)
For each entry, look up the row (so you can get "*"). That's 1000 PK lookups in the BTree that contains both the PK and the data. At best, they might all be in 10 blocks. At worst, they could be in 1000 separate blocks. (Effort: 10-1000 blocks)
Total Effort: ~1010 blocks (worst case).
A standard spinning disk can handle ~100 reads/second. So. we are looking at 10 seconds.
Now, run the query again. Guess what; all those blocks are now in RAM (cached in the "buffer_pool", which is hopefully big enough for all of them). And it runs in less than 1 second.
OPTIMIZE TABLE was not necessary! It was not a statistics refresh, but rather caching that sped up the query.
I'm not MySQL user but I made some tests on my local database. I've added 10 millions rows as you wrote and distinct datas from third column are loaded quite fast. These are my results.
mysql> describe bigtable;
+-----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| stuff | varchar(50) | NO | | NULL | |
| important_stuff | varchar(50) | NO | MUL | NULL | |
+-----------------+-------------+------+-----+---------+----------------+
3 rows in set (0.03 sec)
mysql> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
| 10000089 |
+----------+
1 row in set (2.87 sec)
mysql> select count(distinct important_stuff) from bigtable;
+---------------------------------+
| count(distinct important_stuff) |
+---------------------------------+
| 1000 |
+---------------------------------+
1 row in set (0.01 sec)
mysql> select distinct important_stuff from bigtable;
....
| is_987 |
| is_988 |
| is_989 |
| is_99 |
| is_990 |
| is_991 |
| is_992 |
| is_993 |
| is_994 |
| is_995 |
| is_996 |
| is_997 |
| is_998 |
| is_999 |
+-----------------+
1000 rows in set (0.15 sec)
Important information is that I refreshed statistics on this table (before this operation I needed ~10 seconds to load these data).
mysql> optimize table bigtable;
I was busying myself with exploring GROUP BY optimizations. On a classical "max salary per departament" query. And suddenly weird results. The dump below goes straight from my console. NO COMMAND were issued between these two EXPLAINS. Only some time had passed.
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 1 | |
| 2 | DERIVED | emploee | index | NULL | dep_id | 8 | NULL | 84 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 3 | |
| 2 | DERIVED | emploee | range | NULL | dep_id | 4 | NULL | 9 | Using index for group-by |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
As you may notice, it examined ten times less rows in second run. I assume it's because some inner counters got changed. But I don't want to depend on these counters. So - is there a way to hint mysql to use "Using index for group by" behavior only?
Or - if my speculations are wrong - is there any other explanation on the behavior and how to fix it?
CREATE TABLE `emploee` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`dep_id` int(11) NOT NULL,
`salary` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `dep_id` (`dep_id`,`salary`)
) ENGINE=InnoDB AUTO_INCREMENT=85 DEFAULT CHARSET=latin1 |
+-----------+
| version() |
+-----------+
| 5.5.19 |
+-----------+
Hm, showing the cardinality of indexes may help, but keep in mind: range's are usually slower then indexes there.
Because it think it can match the full index in the first one, it uses the full one. In the second one, it drops the index and goes for a range, but guesses the total number of rows satisfying that larger range wildly lower then the smaller full index, because all cardinality has changed. Compare it to this: why would "AA" match 84 rows, but "A[any character]" match only 9 (note that it uses 8 bytes of the key in the first, 4 bytes in the second)? The second one will in reality not read less rows, EXPLAIN just guesses the number of rows differently after an update on it's metadata of indexes. Not also that EXPLAIN does not tell you what a query will do, but what it probably will do.
Updating the cardinality can or will occur when:
The cardinality (the number of different key values) in every index of a table is calculated when a table is opened, at SHOW TABLE STATUS and ANALYZE TABLE and on other circumstances (like when the table has changed too much). Note that all tables are opened, and the statistics are re-estimated, when the mysql client starts if the auto-rehash setting is set on (the default).
So, assume 'at any point' due to 'changed too much', and yes, connecting with the mysql client can alter the behavior in choosing indexes of a server. Also: reconnecting of the mysql client after it lost its connection after a timeout counts as connecting with auto-rehash AFAIK. If you want to give mysql help to find the proper method, run ANALYZE TABLE once in a while, especially after heavy updating. If you think the cardinality it guesses is often wrong, you can alter the number of pages it reads to guess some statistics, but keep in mind a higher number means a longer running update of that cardinality, and something you don't want to happen that often when 'data has changed to much' on a table with a lot of operations.
TL;DR: it guesses rows differently, but you'd actually prefer the first behavior if the data makes that possible.
Adding:
On this previously linked page, we can probably also find why especially dep_id might have this problem:
small values like 1 or 2 can result in very inaccurate estimates of cardinality
I could imagine the number of different dep_id's is typically quite small, and I've indeed observed a 'bouncing' cardinality on non-unique indexes with quite a small range compared to the number of rows in my own databases. It easily guesses a range of 1-10 in the hundreds and then down again the next time, just based on the specific sample pages it picks & some algorithm that tries to extrapolate that.
mysql> desc users;
+-------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| email | varchar(128) | NO | UNI | | |
| password | varchar(32) | NO | | | |
| screen_name | varchar(64) | YES | UNI | NULL | |
| reputation | int(10) unsigned | NO | | 0 | |
| imtype | varchar(1) | YES | MUL | 0 | |
| last_check | datetime | YES | MUL | NULL | |
| robotno | int(10) unsigned | YES | | NULL | |
+-------------+------------------+------+-----+---------+----------------+
8 rows in set (0.00 sec)
mysql> create index i_users_imtype_robotno on users(imtype,robotno);
Query OK, 24 rows affected (0.25 sec)
Records: 24 Duplicates: 0 Warnings: 0
mysql> explain select * from users where imtype!='0' and robotno is null;
+----+-------------+-------+------+------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | users | ALL | i_users_imtype_robotno | NULL | NULL | NULL | 24 | Using where |
+----+-------------+-------+------+------------------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
But this way,it's used:
mysql> explain select * from users where imtype in ('1','2') and robotno is null;
+----+-------------+-------+-------+------------------------+------------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------+------------------------+---------+------+------+-------------+
| 1 | SIMPLE | users | range | i_users_imtype_robotno | i_users_imtype_robotno | 11 | NULL | 3 | Using where |
+----+-------------+-------+-------+------------------------+------------------------+---------+------+------+-------------+
1 row in set (0.01 sec)
Besides,this one also did not use index:
mysql> explain select id,email,imtype from users where robotno=1;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | users | ALL | NULL | NULL | NULL | NULL | 24 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
SELECT *
FROM users
WHERE imtype != '0' and robotno is null
This condition is not satisified by a single contiguous range of (imtype, robotno).
If you have records like this:
imtype robotno
$ NULL
$ 1
0 NULL
0 1
1 NULL
1 1
2 NULL
2 1
, ordered by (imtype, robotno), then the records 1, 5 and 7 would be returned, while other records wouldn't.
You'll need create this index to satisfy the condition:
CREATE INDEX ix_users_ri ON users (robotno, imptype)
and rewrite your query a little:
SELECT *
FROM users
WHERE (
robotno IS NULL
AND imtype < '0'
)
OR
(
robotno IS NULL
AND imtype > '0'
)
, which will result in two contiguous blocks:
robotno imtype
--- first block start
NULL $
--- first block end
NULL 0
--- second block start
NULL 1
NULL 2
--- second block end
1 $
1 0
1 1
1 2
This index will also serve this query:
SELECT id, email, imtype
FROM users
WHERE robotno = 1
, which is not served now by any index for the same reason.
Actually, the index for this query:
SELECT *
FROM users
WHERE imtype in ('1', '2')
AND robotno is null
is used only for coarse filtering on imtype (note using where in the extra field), it doesn't range robotno's
You need an index that has robotno as the first column. Your existing index is (imtype,robotno). Since imtype is not in the where clause, it can't use that index.
An index on (robotno,imtype) could be used for queries with just robotno in the where clause, and also for queries with both imtype and robotno in the where clause (but not imtype by itself).
Check out the docs on how MySQL uses indexes, and look for the parts that talk about multi-column indexes and "leftmost prefix".
BTW, if you think you know better than the optimizer, which is often the case, you can force MySQL to use a specific index by appending
FORCE INDEX (index_name) after FROM users.
It's because 'robotno' is potentially a primary key, and it uses that instead of the index.
A database systems query planner determines whether to do an index scan or not by analyzing the selectivity of the query's where clause relative to the index. (Indexes are also used to join tables together, but you only have users here.)
The first query has where imtype != '0'. This would select nearly all of the rows in users, assuming you have a large number of distinct values of imtype. The inequality operator is inherently unselective. So the MySQL query planner is betting here that reading through the index won't help and that it may as well just do a sequential scan through the whole table, since it probably would have to do that anyway.
On the other hand, had you said where imtype ='0', equality is a highly selective operator, and MySQL would bet that by reading just a few index blocks it could avoid reading nearly all of the blocks of the users table itself. So it would pick the index.
In your second example, where imtype in ('1','2'), MySQL knows that the index will be highly selective (though only half as selective as where imtype = '0'), and it will again bet that using the index will lead to a big payoff, as you discovered.
In your third example, where robotno=1, MySQL probably can't effectively use the index on users(imtype,robotno) since it would need to read in all the index blocks to find the robotno=1 record numbers: the index is sorted by imtype first, then robotno. If you had another index on users(robotno), MySQL would eagerly use it though.
As a footnote, if you had two indexes, one on users(imtype), and the other on users(imtype,robotno), and your query was on where imtype = '0', either index would make your query fast, but MySQL would probably select users(imtype) simply because it's more compact and fewer blocks would need to be read from it.
I'm being very simplistic here. Early database systems would just look at imtype's datatype and make a very rough guess at the selectivity of your query, but people very quickly realized that giving the query planner interesting facts like the total size of the table, the number of ditinct values in each column, etc. would enable it to make much smarter decisions. For instance if you had a users table where imtype was only every '0' or '1', the query planner might choose the index, since in that case the where imtype != '0' is more selective.
Take a look at the MySQL UPDATE STATISTICS statement and you'll see that its query planner must be sophisticated. For that reason I'd hesitate a great deal before using the FORCE statement to dictate a query plan to it. Instead, use UPDATE STATISTICS to give the query planner improved information to base its decisions on.
Your index is over users(imtype,robotno). In order to use this index, either imtype or imtype and robotno must be used to qualify the rows. You are just using robotno in your query, thus it can't use this index.