Why does it contain "Using where"? - mysql

Here is my table schema.
CREATE TABLE `usr_block_phone` (
  `usr_block_phone_uid` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `time` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `usr_uid` INT(10) UNSIGNED NOT NULL,
  `block_phone` VARCHAR(20) NOT NULL,
  `status` INT(4) NOT NULL,
  PRIMARY KEY (`usr_block_phone_uid`),
  KEY `block_phone` (`block_phone`),
  KEY `usr_uid_block_phone` (`usr_uid`, `block_phone`) USING BTREE,
  KEY `usr_uid` (`usr_uid`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
And this is my SQL:
SELECT
ubp.usr_block_phone_uid
FROM
usr_block_phone ubp
WHERE
ubp.usr_uid = 19
AND ubp.block_phone = '80000000001'
By the way, when I ran EXPLAIN, I got the following result.
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| 1 | SIMPLE | ubp | ref | block_phone,usr_uid_block_phone,usr_uid | usr_uid_block_phone | 66 | const,const | 1 | Using where; Using index |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
Why is the index usr_uid_block_phone not working the way I expect?
I want the query to be resolved using the index only.
This table has 20,000 rows at the moment.

Your index is actually used; see the key column. At the moment the query looks good, and the execution plan is good as well.
Fill it with at least a hundred thousand rows for the difference to show (and ensure you still use a predicate that filters just one row).
And some general advice: it's nearly impossible to predict how the optimizer will behave in a particular situation unless you're a MySQL developer yourself. So it's always better to experiment on a dataset that is as close (in terms of size and distribution of data) to your production data as possible.
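For instance, to grow this particular table for such an experiment, one hedged approach (a sketch of mine, not from the original answer) is to double the row count on each run while generating distinct phone numbers, so a predicate on a single block_phone still matches one row:
-- Each run doubles the table; seven runs take 20,000 rows into the millions.
INSERT INTO usr_block_phone (usr_uid, block_phone, status)
SELECT usr_uid,
       CONCAT('9', LPAD(usr_block_phone_uid, 10, '0')),  -- synthetic, unique per row
       status
FROM usr_block_phone;
Then re-check EXPLAIN against the bigger table.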

Both columns used in the WHERE clause (usr_uid and block_phone) are present in the usr_uid_block_phone index, which makes it a possible key for processing the query. Even more, it is the index that was selected; but with a small number of rows in the table, MySQL may decide it is faster not to use an index at all.
The reason is in the expressions present in the SELECT clause:
SELECT
ubp.usr_block_phone_uid
Because the column usr_block_phone_uid is not present in the selected index, in order to process the query MySQL would need to read both the index (to determine which rows match the WHERE conditions) and the table data (to get the value of usr_block_phone_uid for those rows).
On a small table it is faster to read only the table data, use the WHERE conditions to find the matching rows, and pick their usr_block_phone_uid values: everything is read from one place. With an index, the server would read the same table data plus the index data.
The situation (and the report of EXPLAIN) changes when the table grows. At some point, reading information from the index (and using it to filter out rows) is compensated by the large number of rows that are filtered out (i.e. their data is not read from the storage).
The exact point when this happens is not fixed. It depends a lot on the structure of your table and on how the values in it are spread out. Even when the table is large, MySQL can decide to ignore the index in order to read less information from the storage medium. For example, if a large percentage (let's say 90%) of the table rows match the WHERE condition, it is more efficient to read all the table data (and ignore the index) than to read 90% of the table data plus 90% of the index.
The 90% in the previous paragraph is a figure I made up for explanation purposes; I don't know exactly how MySQL decides that it's better to ignore the index.
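As a side note (my sketch, not part of the original answer): if you want to guarantee index-only access regardless of table size, you can extend the index so that it also covers the selected column. With InnoDB this particular ALTER is redundant, because every secondary index already stores the primary key values in its leaves, which is exactly why EXPLAIN reports "Using index" for this query:
-- Hypothetical covering index; mostly relevant for engines whose secondary
-- indexes do not implicitly include the primary key.
ALTER TABLE usr_block_phone
  ADD KEY usr_uid_block_phone_cover (usr_uid, block_phone, usr_block_phone_uid);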

Related

Optimizing the performance of MySQL regarding aggregation

I'm trying to optimize a report query; like most report queries, this one involves aggregation. Since the table is of considerable size and growing, I need to attend to its performance.
For example, I have a table with three columns: id, name, action. And I would like to count the number of actions each name has done:
SELECT name, COUNT(id) AS count
FROM tbl
GROUP BY name;
As simple as it gets, I can't run it in an acceptable time. It might take 30 seconds, and there is no index whatsoever I can add that gets taken into account, let alone one that improves the query.
When I run EXPLAIN on the above query, it never uses any of the table's indexes, i.e. the index on name.
Is there any way to improve the performance of the aggregation? Why is the index not used?
[UPDATE]
Here's the EXPLAIN's output:
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| 1 | SIMPLE | tbl | ALL | NULL | NULL | NULL | NULL | 4025567 | 100.00 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
And here is the table's schema:
CREATE TABLE `tbl` (
  `id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `name` varchar(1000) NOT NULL,
  `action` int unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `inx` (`name`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The problem with your query and use of indexes is that you refer to two different columns in your SELECT statement yet only have one column in your indexes, plus the use of a prefix on the index.
Try this (refer to just the name column):
SELECT name, COUNT(*) AS count
FROM tbl
GROUP BY name;
With the following index (no prefix):
tbl (name)
Don't use a prefix on the index for this query because if you do, MySQL won't be able to use it as a covering index (will still have to hit the table).
If you use the above, MySQL will scan through the index on the name column but won't have to touch the actual table data. You should see "Using index" in the Extra column of the EXPLAIN result.
This is as fast as MySQL will be able to accomplish such a task. The alternative is to store the aggregate result separately and keep it updated as your data changes.
Also, consider reducing the size of the name column, especially if you're hitting index size limits, which you most likely are (hence the prefix). Save some room by not using UTF8 if you don't need it (UTF8 reserves 3 bytes per character in the index).
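Putting both suggestions together, a minimal sketch (the 191-character limit is an assumption; verify it against your data first):
-- Check that no existing name would be truncated:
SELECT MAX(CHAR_LENGTH(name)) FROM tbl;
-- Shrink the column so a full (non-prefix) index fits under the key size limit,
-- then replace the prefix index with a full one that can serve as a covering index:
ALTER TABLE tbl MODIFY `name` VARCHAR(191) NOT NULL;
ALTER TABLE tbl DROP INDEX inx, ADD INDEX inx (`name`);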
It's a very common question, and the key to the solution lies in the fact that your table is growing.
So, the first way would be to create an index on the name column if one doesn't exist yet. But this will only solve your issue for a while.
A more proper approach would be to create a separate statistics table like
tbl_counts
+------+-------+
| name | count |
+------+-------+
And store your counts separately. When changing (inserting, updating or deleting) data in the tbl table, you adjust the corresponding row in the tbl_counts table. This lets you avoid the COUNT query entirely, at the cost of some extra logic around writes to tbl.
To maintain the integrity of your statistics table you can either use triggers or do it inside the application. This method is good if the performance of the COUNT query is much more important to you than that of your data-changing queries (and the overhead of maintaining the tbl_counts table won't be too high).
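For illustration, a minimal trigger-based sketch (the statistics table layout and trigger names are hypothetical; the original answer gives no code):
CREATE TABLE tbl_counts (
  `name`  VARCHAR(255) NOT NULL PRIMARY KEY,  -- assumes names are unique within 255 chars
  `count` BIGINT UNSIGNED NOT NULL DEFAULT 0
) ENGINE=InnoDB;

DELIMITER //
CREATE TRIGGER tbl_counts_ins AFTER INSERT ON tbl FOR EACH ROW
BEGIN
  -- create the counter row the first time a name appears, otherwise bump it
  INSERT INTO tbl_counts (`name`, `count`) VALUES (NEW.name, 1)
    ON DUPLICATE KEY UPDATE `count` = `count` + 1;
END//
CREATE TRIGGER tbl_counts_del AFTER DELETE ON tbl FOR EACH ROW
BEGIN
  UPDATE tbl_counts SET `count` = `count` - 1 WHERE `name` = OLD.name;
END//
DELIMITER ;
With this in place, the report becomes a plain SELECT name, count FROM tbl_counts.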

MySQL indexing char(1) columns

I have a table with a complex query that I'm looking to optimize. I have read most of the documentation on MySQL indexing, but in this case I'm not sure what to do:
Data structure:
-- please, don't comment on the field types and names; it is an outsourced project
CREATE TABLE items (
  record_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  solid CHAR(1) NOT NULL,    -- only 'Y','N' values
  optional CHAR(1) NULL,     -- only 'Y','N',NULL values
  data TEXT
);
Query:
SELECT * FROM items
WHERE record_id != 88
AND solid = 'Y'
AND optional !='N' -- 'Y' OR NULL
Of course there are extra joins and related data, but these are the biggest filters.
In the scenario of:
- 200,000+ records,
- 10% (of all) with solid = 'Y',
- 10% (of all) with optional != 'N',
What would be a good index for this query?
or more precisely:
does the first check, record_id != 88, slow the query in any way?
(it only eliminates one result...?)
which is faster: (optional != 'N') or (optional = 'Y' OR optional IS NULL)?
as mentioned above, rows with optional = 'N' are 10% of the total count.
is there anything special about indexing a CHAR(1) column with only 2 possible values?
can I use an index like (record_id, solid, optional)?
can I create an index for specific values (solid = 'Y', optional != 'N')?
As @Jack requested, here is the current EXPLAIN result (out of 30,000 total rows, with 20 results):
+-------------+-------+--------------+---------+---------+------+-------+-------------+
| select_type | type | possible_key | key | key_len | ref | rows | Extra |
+-------------+-------+--------------+---------+---------+------+-------+-------------+
| PRIMARY | range | PRIMARY | PRIMARY | 4 | NULL | 16228 | Using where |
+-------------+-------+--------------+---------+---------+------+-------+-------------+
This is an interesting question. Overall, your query has an estimated selectivity of about 1%. So, if 100 records fit on a page, you would assume that each page would still have to be read, even with the index. Because each record is so small (depending on the data column, that is), this is quite likely. From that perspective, an index is not worth it.
An index would be worth it under the following circumstances. The first is when the index is a covering index, meaning that you can satisfy the query with all the columns in the index. For example:
select count(*)
FROM items
WHERE record_id != 88 AND solid = 'Y' AND optional !='N' -- 'Y' OR NULL
Where the index is on (solid, optional, record_id). The query doesn't need to go back to the original data pages.
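A sketch of that covering index (the name is made up; equality columns first, the inequality column last):
ALTER TABLE items
  ADD INDEX idx_solid_optional_record (solid, optional, record_id);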
Another case would be when the index is a primary (or clustered) index. The data is stored in that order, so fetching a limited number of results would reduce the read overhead of the query. The downside to this is that updates and inserts are more expensive, because data actually has to move.
My best guess in your case is that an index would not be useful, unless data is quite large (in the kilobyte range).
You should try to put indexes on the columns that will do the most discrimination. Indexing a binary column is usually not very helpful if the table is split about evenly between the values. But if the value you often search for appears only 10% of the time, the index can be useful.
If any of the columns are indexed, they will usually be checked before doing any other WHERE processing. The order that you put the conditions in the WHERE clause is not generally relevant. You can use EXPLAIN to find out which indexes a query uses.

Is it better to force index usage for an ORDER BY?

I'm currently trying to optimize a query generated by Doctrine 2 on this table:
CREATE TABLE `publication` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `global_order` int(11) NOT NULL,
  `title` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
  `slug` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
  `type` varchar(7) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UNIQ_AF3C6779B12CE9DB` (`global_order`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The query is
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
type is a discriminator column added by Doctrine. Although the WHERE clause is useless, since type is always one of the IN values, I cannot remove it.
EXPLAIN shows me
+------+---------------+------+------+-----------------------------+
| type | possible_keys | key | rows | Extra |
+------+---------------+------+------+-----------------------------+
| ALL | NULL | NULL | 562 | Using where; Using filesort |
+------+---------------+------+------+-----------------------------+
(rows is different each time I execute the query)
After some reading I found I can force index usage like this:
ALTER TABLE `publication` DROP INDEX `UNIQ_AF3C6779B12CE9DB` ,
ADD UNIQUE `UNIQ_AF3C6779B12CE9DB` ( `global_order` , `type` )
and
SELECT *
FROM publication
FORCE INDEX(UNIQ_AF3C6779B12CE9DB)
WHERE global_order > 0
AND type IN ('article', 'event', 'work')
ORDER BY global_order DESC
The WHERE clause is still useless (every row matches global_order > 0), but this time EXPLAIN shows me
+-------+-----------------------+-----------------------+------+-------------+
| type | possible_keys | key | rows | Extra |
+-------+-----------------------+-----------------------+------+-------------+
| range | UNIQ_AF3C6779B12CE9DB | UNIQ_AF3C6779B12CE9DB | 499 | Using where |
+-------+-----------------------+-----------------------+------+-------------+
It seems better to me, but forcing an index doesn't seem to be a common necessity, so I wonder whether it's really the right approach for such a simple query.
Does anyone know what is the better way to perform this query?
Thanks!
If your query really is:
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
... and all entries (or nearly all) will match the IN clause, you're actually better off with no index at all. If you toss in a LIMIT clause, then the index you'll want is actually on global_order alone, without the type field. The reason for this is that it actually costs something to read an index.
If you're going for the entire table, sequentially reading the table and sorting its rows in memory will be your cheapest plan. If you only need a few rows and most will match the where clause, going for the smallest index will do the trick.
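For concreteness, a hedged example: with the original UNIQUE index on global_order alone, a top-N variant of the query can walk the index in descending order and stop after a few rows instead of sorting the whole table (worth verifying with EXPLAIN on your own data):
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
LIMIT 5;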
To understand why, picture the disk IO involved.
Suppose you want the whole table without an index. To do this, you read data_page1, data_page2, data_page3, etc., visiting the various disk pages involved in order, until you reach the end of the table. You then sort and return.
If you want the top 5 rows without an index, you'd sequentially read the entire table as before, while heap-sorting the top 5 rows. Admittedly, that's a lot of reading and sorting for a handful of rows.
Suppose, now, that you want the whole table with an index. To do this, you read index_page1, index_page2, etc., sequentially. This then leads you to visit, say, data_page3, then data_page1, then data_page3 again, then data_page2, etc., in a completely random order (the order in which the sorted rows happen to appear in the data). The IO involved makes it cheaper to just read the whole mess sequentially and sort the grab bag in memory.
If you merely want the top 5 rows of an indexed table, in contrast, using the index becomes the correct strategy. In the worst-case scenario you load 5 data pages into memory and move on.
A good SQL query planner, btw, will make its decision on whether to use an index or not based on how fragmented your data is. If fetching rows in order means zooming back and forth across the table, a good planner may decide that it's not worth using the index. In contrast, if the table is clustered using that same index, the rows are guaranteed to be in order, increasing the likelihood that it'll get used.
But then, if you join the same query with another table and that other table has an extremely selective where clause that can use a small index, the planner might decide it's actually better to, e.g. fetch all IDs of rows that are tagged as foo, hash join them with publications, and heap sort them in memory.
MySQL tries to determine the best way to run a given query, and decides whether or not to use indexes based on what it thinks is the best.
It isn't always correct. Sometimes manually forcing a query to use an index is faster; sometimes it's not.
If you run some testing with sample data in your specific situation, you should be able to see which method performs faster, and stick with that one.
Make sure you take into account query caching to get an accurate performance benchmark.
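On MySQL versions that still have the query cache, one way to keep it out of your measurements is the SQL_NO_CACHE hint (parsed but ignored on servers without the cache):
SELECT SQL_NO_CACHE *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC;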
Forcing the use of an index is rarely the best answer. In general it is better to create and/or optimize the indices (indexes) so that MySQL chooses to use them. (It is even better to optimize the queries, but I understand you cannot do that here.)
When you are using something like Doctrine where you cannot optimize the queries and the indices don't help, your best bet is to focus on query caching. :-)

Why is InnoDB table size much larger than expected?

I'm trying to figure out storage requirements for different storage engines. I have this table:
CREATE TABLE `mytest` (
`num1` int(10) unsigned NOT NULL,
KEY `key1` (`num1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
When I insert some values and then run SHOW TABLE STATUS, I get the following:
+----------------+--------+---------+------------+---------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+----------------+--------+---------+------------+---------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+-------------------+----------+----------------+---------+
| mytest | InnoDB | 10 | Compact | 1932473 | 35 | 67715072 | 0 | 48840704 | 4194304 | NULL | 2010-05-26 11:30:40 | NULL | NULL | latin1_swedish_ci | NULL | | |
+----------------+--------+---------+------------+---------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+-------------------+----------+----------------+---------+
Notice avg_row_length is 35. I am baffled that InnoDB would not make better use of space when I'm just storing a non-nullable integer.
I have run this same test with MyISAM, and by default MyISAM uses 7 bytes per row on this table. Running
ALTER TABLE mytest MAX_ROWS=50000000, AVG_ROW_LENGTH = 4;
causes MyISAM to finally, correctly, use 5-byte rows.
When I run the same ALTER TABLE statement for InnoDB, the avg_row_length does not change.
Why would such a large avg_row_length be necessary when only storing a 4-byte unsigned int?
InnoDB tables are clustered: all data is contained in a B-Tree, with the PRIMARY KEY as the key and all other columns as the payload.
Since you don't define an explicit PRIMARY KEY, InnoDB uses a hidden 6-byte column to sort the records on.
This, plus the overhead of the B-Tree organization (with its extra non-leaf-level blocks), requires more space than sizeof(int) * num_rows.
Here is some more info you might find useful.
InnoDB allocates data in terms of 16KB pages, so 'SHOW TABLE STATUS' will give inflated numbers for row size if you only have a few rows and the table is < 16K total. (For example, with 4 rows the average row size comes back as 4096.)
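If you prefer a query over SHOW TABLE STATUS, the same (estimated) figures are available from information_schema; the avg_row_length there is simply data_length divided by the row estimate, so it inflates the same way on tiny tables:
SELECT table_name, table_rows, avg_row_length, data_length, index_length
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'mytest';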
The extra 6 bytes per row for the "invisible" primary key is a crucial point when space is a big consideration. If your table is only one column, that's the ideal column to make the primary key, assuming the values in it are unique:
CREATE TABLE `mytest2` (
  `num1` int(10) unsigned NOT NULL PRIMARY KEY
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
By using a PRIMARY KEY like this:
No INDEX or KEY clause is needed, because you don't have a secondary index. The index-organized format of InnoDB tables gives you fast lookup based on the primary key value for free.
You don't wind up with another copy of the NUM1 column data, which is what happens when that column is indexed explicitly.
You don't wind up with another copy of the 6-byte invisible primary key values. The primary key values are duplicated in each secondary index. (That's also the reason why you probably don't want 10 indexes on a table with 10 columns, and you probably don't want a primary key that combines several different columns or is a long string column.)
So overall, sticking with just a primary key means less data associated with the table + indexes. To get a sense of overall data size, I like to run with
set innodb_file_per_table = 1;
and examine the size of the data/database/*table*.ibd files. Each .ibd file contains the data for an InnoDB table and all its associated indexes.
To quickly build up a big table for testing, I usually run a statement like so:
insert into mytest
select * from mytest;
This doubles the amount of data each time. In the case of the single-column table with a primary key, since the values had to be unique, I used a variation to keep the values from colliding with each other:
insert into mytest2
select num1 + (select count(*) from mytest2) from mytest2;
This way, I was able to get average row size down to 25. The space overhead is based on the underlying assumption that you want to have fast lookup for individual rows using a pointer-style mechanism, and most tables will have a column whose values serve as pointers (i.e. the primary key) in addition to the columns with real data that gets summed, averaged, and displayed.
In addition to Quassnoi's very fine answer, you should probably try it out using a significant data set.
What I'd do is load 1M rows of simulated production data in, then measure the table size and use that as a guide.
That's what I've done in the past, anyway.
MyISAM
MyISAM, except in really old versions, uses a 7-byte "pointer" for locating a row, and a 6-byte pointer inside indexes. These defaults lead to a huge max table size. More details: http://mysql.rjweb.org/doc.php/limits#myisam_specific_limits . The kludgy way to change those involves the ALTER .. MAX_ROWS=50000000, AVG_ROW_LENGTH = 4 that you discovered. The server multiplies those values together to compute how many bytes the data pointer needs to be. Hence, you stumbled on how to shrink the avg_row_length.
But you actually needed to declare a table with rows smaller than 7 bytes to hit it! The pointer size shows up in multiple places:
Free-space links in the .MYD default to 7 bytes. So, when you delete a row, a link is provided to the next free spot. That link needs to be 7 bytes (by default), hence the row size was artificially extended from the 4-byte INT to make room for it! (There are more details having to do with whether the column is NULLable, etc.)
FIXED vs DYNAMIC rows -- when the table uses FIXED-size rows, the "pointer" is a row number. For DYNAMIC, it is a byte offset into the .MYD.
Index entries must also point to data rows with a pointer. So your ALTER should have shrunk the .MYI file as well!
There are more details, but MyISAM is likely to go away, so this ancient history is not likely to be of concern to anyone.
InnoDB
https://stackoverflow.com/a/64417275/1766831

What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| 1 | SIMPLE | accesslog | range | date,dateurl,dateaff | date | 3 | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| 1 | SIMPLE | accesslog | index | NULL | accessid | 257 | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing the accessid column width from VARCHAR(255) to CHAR(32) improved query time by ~75%.
Adding a date+accessid index had no effect on query time.
An index on (date, accessid) could help. However, before tweaking indexes I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(256) for accessid? If so, can't you use a more compact type? If it's a number, it should be INT (SMALLINT, BIGINT, whatever fits your needs), and if it's an alphanumeric ID, can it really be 256 chars long? If its length is fixed, can't you use CHAR (CHAR(32), for example) instead?
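A sketch of that change (CHAR(32) matches what the question's resolution eventually used; adjust the length and NULL-ability to whatever your IDs really are):
-- Assumes every accessid fits in 32 characters (e.g. an MD5-style hex string).
ALTER TABLE accesslog MODIFY accessid CHAR(32) NOT NULL;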
Your problem is that your condition is a range clause (on the date column).
A multi-column index of date -> accessid likely won't help the situation, as MySQL can't use index columns after a range condition. In theory it should be able to use the index to cover the computation in this case, but it appears to be a shortcoming in MySQL; I've never gotten it to use a multi-column index in this situation successfully.
You can try creating an index on (date,accessid) hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold much hope. There's not a great deal you can do.
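For reference, the index being discussed would be created like this (the name is made up; note that the question's resolution reports it made no difference in their case):
ALTER TABLE accesslog ADD INDEX idx_date_accessid (`date`, accessid);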
Edit:
My answer is courtesy of High Performance MySQL, Second Edition; worth its weight in gold if you have to do serious MySQL development.
Why doesn't the query with the date clause use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z; I don't know much about MySQL) would benefit from an index on date+accessid, since the access IDs would be sorted within dates in that index. That DBMS will use the date+accessid key to apply the WHERE clause efficiently, whittling down the search space and returning distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).
The query uses the 'date' index because that's what you use in the WHERE clause.
This is the only sensible option: if it used the accessid index, it would need to read all the accessid entries, then check the date of each row, and only then decide whether the value is distinct.
If this is a really big table a compound index on date and accessid might help.
Why doesn't the query with the date clause use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table; in fact, it would do better in that context to ignore the index altogether, but I started off with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid - the information is in the index. So, this might convert a query that access an index and a table into one that only accesses the index - which will reduce the amount of I/O needed and therefore improve the performance of the query.
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.
I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.
Index optimization is often like alchemy. Different DBMSs behave differently, and sometimes you simply need to try (and fail with) various combinations. I'm not saying it's impossible to reason about it; it is possible in many cases, but only up to a certain point. Often it's simply faster and easier to follow your instinct.