MariaDB - Indexing not improving performance on char(255) field - mysql

I'm trying to execute this SQL query on a table which has a little shy of 1 million records:
SELECT * FROM enty_score limit 100;
It gives me a result in about 600 ms.
As soon as I add a where clause on a field `dim_agg_strategy` char(255) DEFAULT NULL, it takes 40 seconds to execute:
SELECT * FROM enty_score WHERE dim_agg_strategy='COMPOSITE_AVERAGE_LAKE' limit 100;
I've tried creating an index, but there is no improvement; the same query still takes 40 seconds to execute:
ALTER TABLE `enty_score` ADD INDEX `dim_agg_strategy_index` (`dim_agg_strategy`);
SELECT INDEX_NAME, COLUMN_NAME, CARDINALITY, NULLABLE, INDEX_TYPE
FROM information_schema.statistics where INDEX_NAME = 'dim_agg_strategy_index';
INDEX_NAME            |COLUMN_NAME     |CARDINALITY|NULLABLE|INDEX_TYPE|
----------------------+----------------+-----------+--------+----------+
dim_agg_strategy_index|dim_agg_strategy|        586|YES     |BTREE     |
A little more info, this column which I have placed in where clause just contains 6 distinct values:
select distinct dim_agg_strategy from enty_score;
dim_agg_strategy |
-------------------------+
COMPOSITE_AVERAGE |
COMPOSITE_AVERAGE_ALL |
COMPOSITE_AVERAGE_LAKE |
COMPOSITE_AVERAGE_NONLAKE|
NORMALISED_AVERAGE |
SIMPLE_AVERAGE |

The optimizer noticed that the indexed column has very few distinct values, so it estimated that a large fraction of the rows would be needed and decided to simply plow through the table rather than bother with the index. (Using the index would involve bouncing back and forth between the index's BTree and the data's BTree a lot.)
You might counter by pointing out the LIMIT 100, and that is a valid objection. Alas, it exposes a deficiency in the Optimizer.
It is torn between
Ignore the index, which is likely to be optimal if it needed to scan the entire table. Note: That would happen if the 100 rows you needed happened to be at the end of the table.
Use the index, but pay the extra overhead. Here it fails to realize that 100 is a lot less than 1M, which makes the index the better approach most of the time.
Let's try to fool it... DROP that index and add another index. This time put 2 columns:
(dim_agg_strategy, xx)
where xx is some other column.
(Let me know if this trick works for you.)
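A minimal sketch of that experiment, keeping the names from the question; enty_id is just a hypothetical stand-in for the second column xx:
-- Sketch only: replace enty_id with whichever second column suits your queries.
ALTER TABLE `enty_score` DROP INDEX `dim_agg_strategy_index`;
ALTER TABLE `enty_score` ADD INDEX `dim_agg_strategy_xx_index` (`dim_agg_strategy`, `enty_id`);
SELECT * FROM enty_score WHERE dim_agg_strategy = 'COMPOSITE_AVERAGE_LAKE' LIMIT 100;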

Related

Mysql - speed up select query from 2 million rows

I have a table like below,
Field     Type        Null  Key  Default              Extra
id        bigint(11)  NO    PRI  NULL                 auto_increment
deviceId  bigint(11)  NO    MUL  NULL
value     double      NO         NULL
time      timestamp   YES   MUL  0000-00-00 00:00:00
It has more than 2 million rows. When I run select * from tableName; it takes more than 15 minutes.
When I run select value,time from sensor_value where time > '2017-05-21 04:47:48' and deviceId>=812; it takes more than 45 seconds to load.
Note: deviceId 812 alone has more than 92514 rows.
I have even added an index on these columns, as below:
ALTER TABLE `sensor_value`
ADD INDEX `IDX_FIELDS1_2` (`time`, `deviceId`) ;
How do I make the select query fast (load in 1 sec)? Am I doing the indexing wrong?
Only 4 columns? Sounds like you have very little RAM, or innodb_buffer_pool_size is set too low. Hence, you were seriously I/O-bound and/or swapping.
WHERE time > '2017-05-21 04:47:48'
AND deviceId >= 812
is two range conditions. There is no thorough way to optimize that. Either of these would help (sketched below); if you have both, the Optimizer might pick the better one:
INDEX(time)
INDEX(deviceId)
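A sketch of adding those, with illustrative index names:
-- Add one or both; the optimizer can then choose the more selective range.
ALTER TABLE `sensor_value` ADD INDEX `idx_time` (`time`);
ALTER TABLE `sensor_value` ADD INDEX `idx_deviceId` (`deviceId`);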
When using a 'secondary' index in InnoDB, the query first looks in the index BTree; when there is a match there, it has to look up in the 'data' BTree (using the PRIMARY KEY for lookup).
Some of the anomalous times you saw when trying INDEX(time, deviceId) came about because the extra filtering in the index reduced how often the query had to reach over into the data.
Do you use id for anything other than uniqueness? Is the pair deviceId & time unique? If the answers are 'no' and 'yes', then get rid of id and change to PRIMARY KEY(deviceId, time). Or you could swap those two columns. What other queries do you have?
Getting rid of id shrinks the table some, thereby cutting down on I/O.
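If the answers really are 'no' and 'yes', a sketch of the change could look like this; verify first that (deviceId, time) never repeats and that nothing else references id:
-- Sketch only: assumes (deviceId, time) is unique and no other table refers to id.
ALTER TABLE `sensor_value`
DROP COLUMN `id`,
ADD PRIMARY KEY (`deviceId`, `time`);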
When using a composite index, you usually need an equality comparison on the first column; a range condition can then be applied to the second column. So I recommend you change the order of the columns in your index like this:
ALTER TABLE `sensor_value`
ADD INDEX `IDX_FIELDS1_2` (`deviceId`, `time`) ;
then change the deviceId condition to an equality test (use deviceId = 812, not deviceId >= 812):
select value,time from sensor_value where time > '2017-05-21 04:47:48' and deviceId=812;
I hope this helps.
Two million records is not much for MySQL; if you do the right things, it is normal to get a result in less than 1 sec even from a billion-row table.

Stop MySQL after first match

I noticed that adding LIMIT 1 at the end of a query does not decrease the execution time. I have a few thousand records and a simple query. How do I make MySQL stop after the first match?
For example, these two queries both take approximately half a second:
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes LIMIT 1
Edit: and here are the EXPLAIN results:
id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra
 1 | SIMPLE      | links | ALL  | NULL          | NULL | NULL    | NULL | 38556 | Using where; Using filesort
The difference between the two queries' run times depends on the actual data.
There are several possible scenarios:
There are many records with LENGTH(content)<500
In this case, MySQL will start scanning all table rows (according to primary key order since you didn't provide any ORDER BY).
There is no index use since your WHERE condition can't be indexed.
Since there are relatively many rows with LENGTH(content)<500, the LIMIT query will return faster than the other one.
There are no records with LENGTH(content)<500
Again, MySQL will start scanning all table rows, but will have to go through all records to figure out none of them satisfies the condition.
Again no index can be used for the same reason.
In this case - the two queries will have exactly the same run time.
Anything between those two scenarios will have different run times, which will be farther apart as you have more valid records in the table.
Edit
Now that you added the ORDER BY, the answer is a bit different:
If there is an index on the likes column, ORDER BY would use it, and the time would be the time it takes to reach the first record that satisfies the WHERE condition (if 66% of the records do, then this should be faster than without the LIMIT).
If there is no index on the likes column, the ORDER BY will take most of the time: MySQL must scan the whole table to get all records that satisfy the WHERE, then order them by likes, and then take the first one.
In this case both queries will have similar run times (scanning and sorting the results takes much longer than returning 1 record instead of many)!
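If there is no index on likes yet and you want to test the first scenario, adding one is a one-liner; the index name is illustrative:
-- With LIMIT 1, MySQL can walk this index in order and stop at the first row that passes the WHERE.
ALTER TABLE links ADD INDEX idx_likes (likes);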
Calling a function on a column in the WHERE clause forces a full table scan; such expressions can't be indexed. What you can do, if performance is a concern here, is create a derived column where you've saved this value in advance:
ALTER TABLE links ADD COLUMN content_length INT;
UPDATE links SET content_length = LENGTH(content);
ALTER TABLE links ADD INDEX idx_content_length (content_length);
Once the data is denormalized and indexed like this, you'll be able to run the query much faster. Keep in mind you'll have to populate content_length each time you add a record or change its content.
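With the derived column and its index in place, the query can filter on the stored value instead of calling LENGTH() per row; a sketch:
-- Uses idx_content_length instead of evaluating LENGTH(content) for every row.
SELECT id, content
FROM links
WHERE content_length < 500
ORDER BY likes
LIMIT 1;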

MySQL performance difference between JOIN and IN

I wanted to find all hourly records that have a successor in a ~5m row table.
I tried :
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds; the second hangs for hours.
I can understand that the former is better, but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN for both queries
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
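For example, a sketch of prepending EXPLAIN to the first query from the question:
-- Prepend EXPLAIN to see the access plan without running the full query.
EXPLAIN
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD(date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset;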
I would prefix both queries by explain, and then compare the difference in the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied earlier than the IN predicate in the WHERE clause. With the WHERE-IN version you are taking every record from my_table, applying an arithmetic function to it, and then sorting, because SELECT DISTINCT usually requires a sort and sometimes creates a temporary table in memory or on disk. The number of rows examined is probably the product of the sizes of the two row sets.
But with the JOIN, a lot of the rows that would be examined and sorted in the WHERE-IN version are probably eliminated beforehand. You probably end up looking at far fewer rows, and the database can use cheaper operations to get there.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
The IN clause is usually slow for huge tables. As far as I remember, the second statement you posted will simply loop through all rows of my_table (unless you have an index there), checking each row against the WHERE clause. In general, IN is treated as a set of OR conditions, one per element of the set.
That's why, I think, using the temporary table that is created behind the scenes for the JOIN query is faster.
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another thing to consider is that with the IN style, very little further optimization is possible compared to the JOIN. With the join you can add an index which, depending on the data set, might speed things up by 2, 5, or 10 times. With the IN, it's simply going to run that subquery.

Is it better to force index usage for an ORDER BY?

I'm currently trying to optimize a query generated by Doctrine 2 on this table:
CREATE TABLE `publication` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`global_order` int(11) NOT NULL,
`title` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`slug` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`type` varchar(7) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_AF3C6779B12CE9DB` (`global_order`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The query is
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
type is a discriminator column added by Doctrine. Although the WHERE clause is useless as type is always one of the IN values, I cannot remove it.
EXPLAIN shows me
+------+---------------+------+------+-----------------------------+
| type | possible_keys | key | rows | Extra |
+------+---------------+------+------+-----------------------------+
| ALL | NULL | NULL | 562 | Using where; Using filesort |
+------+---------------+------+------+-----------------------------+
(rows is different each time I execute the query)
After some reading I found I can force an index usage like this:
ALTER TABLE `publication` DROP INDEX `UNIQ_AF3C6779B12CE9DB` ,
ADD UNIQUE `UNIQ_AF3C6779B12CE9DB` ( `global_order` , `type` )
and
SELECT *
FROM publication
FORCE INDEX(UNIQ_AF3C6779B12CE9DB)
WHERE global_order > 0
AND type IN ('article', 'event', 'work')
ORDER BY global_order DESC
The WHERE clause is always useless, but this time EXPLAIN shows me
+-------+-----------------------+-----------------------+------+-------------+
| type | possible_keys | key | rows | Extra |
+-------+-----------------------+-----------------------+------+-------------+
| range | UNIQ_AF3C6779B12CE9DB | UNIQ_AF3C6779B12CE9DB | 499 | Using where |
+-------+-----------------------+-----------------------+------+-------------+
It seems better to me, but forcing an index doesn't appear to be common practice, so I wonder whether it's really the right approach for such a simple query.
Does anyone know what is the better way to perform this query?
Thanks!
If your query really is:
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
... and all entries (or nearly all) will match the IN clause, you're actually better off with no index at all. If you toss in a LIMIT clause, then the index you'll want is actually on global_order alone, without the type field. The reason is that reading an index itself costs something.
If you're going for the entire table, sequentially reading the table and sorting its rows in memory will be your cheapest plan. If you only need a few rows and most will match the where clause, going for the smallest index will do the trick.
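For instance, if a LIMIT were acceptable here, the original single-column unique key on global_order is exactly that small index; a sketch, with an illustrative LIMIT value:
-- With UNIQ_AF3C6779B12CE9DB on (global_order) alone, MySQL can read the index
-- in descending order and stop after 20 rows that pass the WHERE.
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
LIMIT 20;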
To understand why, picture the disk IO involved.
Suppose you want the whole table without an index. To do this, you read data_page1, data_page2, data_page3, etc., visiting the various disk pages involved in order, until you reach the end of the table. You then sort and return.
If you want the top 5 rows without an index, you'd sequentially read the entire table as before, while heap-sorting the top 5 rows. Admittedly, that's a lot of reading and sorting for a handful of rows.
Suppose, now, that you want the whole table with an index. To do this, you read index_page1, index_page2, etc., sequentially. This then leads you to visit, say, data_page3, then data_page1, then data_page3 again, then data_page2, etc., in a completely random order (that by which the sorted rows appear in the data). The IO involved makes it cheaper to just read the whole mess sequentially and sort the grab bag in memory.
If you merely want the top 5 rows of an indexed table, in contrast, using the index becomes the correct strategy. In the worst case scenario you load 5 data pages in memory and move on.
A good SQL query planner, btw, will make its decision on whether to use an index or not based on how fragmented your data is. If fetching rows in order means zooming back and forth across the table, a good planner may decide that it's not worth using the index. In contrast, if the table is clustered using that same index, the rows are guaranteed to be in order, increasing the likelihood that it'll get used.
But then, if you join the same query with another table and that other table has an extremely selective where clause that can use a small index, the planner might decide it's actually better to, e.g. fetch all IDs of rows that are tagged as foo, hash join them with publications, and heap sort them in memory.
MySQL tries to determine the best way to run a given query, and decides whether or not to use indexes based on what it thinks is the best.
It isn't always correct. Sometimes manually forcing a query to use an index is faster; sometimes it's not.
If you run some testing with sample data in your specific situation, you should be able to see which method performs faster, and stick with that one.
Make sure you take into account query caching to get an accurate performance benchmark.
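On MySQL/MariaDB versions that still have the query cache, one way to keep it from skewing the benchmark is SQL_NO_CACHE; a sketch based on the forced-index query above:
-- Bypasses the query cache so repeated runs measure real work, not cache hits.
SELECT SQL_NO_CACHE *
FROM publication
FORCE INDEX (UNIQ_AF3C6779B12CE9DB)
WHERE global_order > 0
AND type IN ('article', 'event', 'work')
ORDER BY global_order DESC;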
Forcing the use of an index is rarely the best answer. In general it is better to create and/or optimize the indices (indexes) so that MySQL chooses to use them. (It is even better to optimize the queries, but I understand you cannot do that here.)
When you are using something like Doctrine where you cannot optimize the queries and the indices don't help, your best bet is to focus on query caching. :-)

MySQL performance boost after create & drop index

I have a large MySQL, MyISAM table of around 4 million rows running in a core 2 duo, 8G RAM laptop.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN to the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
The first thing is that I'm not sure why another_index is chosen. In fact it chooses an index which is a composite of indexed_varchar_column and another 2 columns (which form part of the selected ones). Perhaps this makes sense, since it saves reading 2 of the queried columns from the row data. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. The second time I query against the same 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it again takes 5 seconds. So it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and then dropping it again results in all further queries against new values of 'something_other' taking only 0.15 secs. Please note that 1) I create an index and 2) drop it again, so everything is back in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this, I get exactly the same plan as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B that suggested that when MySQL creates an index it internally does a SELECT... I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting the queries are slow again. I guess this is because MySQL is relying on some sort of OS-level caching.
Any ideas? How can I explicitly cache the table?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It's 4 million rows, but I remove lots of old rows regularly and add new ones. Since the deleted rows left large gaps in the IDs, every day I drop the primary index (ID) and create it again with consecutive numbers. The table may therefore be very fragmented, so I/O must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ... ORDER BY command in a cron job that runs every day. It takes 2 minutes, but it's worth it.
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried:
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';?
What is the order of the columns in your composite index?
Your query must use (at least) a leftmost prefix of the index's columns.
If you have an index on (foo, bar, baz), it will not be usable as an index against bar or baz by themselves. Only (foo), (foo, bar), and (foo, bar, baz) can use it.
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
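A sketch of the leftmost-prefix rule, using entirely hypothetical table and column names:
-- Hypothetical index, purely to illustrate which WHERE clauses can use it.
CREATE INDEX idx_foo_bar_baz ON some_table (foo, bar, baz);
SELECT * FROM some_table WHERE foo = 1;                         -- can use the index
SELECT * FROM some_table WHERE foo = 1 AND bar = 2;             -- can use the index
SELECT * FROM some_table WHERE foo = 1 AND bar = 2 AND baz = 3; -- can use the index
SELECT * FROM some_table WHERE bar = 2;                         -- cannot use it
SELECT * FROM some_table WHERE baz = 3;                         -- cannot use it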
EDIT: Here's a Postgres EXPLAIN of a simple left join query, for comparison.
Nested Loop Left Join (cost=0.00..16.97 rows=13 width=103)
Join Filter: (pagesets.id = pages.pageset_id)
-> Index Scan using ix_pages_pageset_id on pages (cost=0.00..8.51 rows=13 width=80)
Index Cond: (pageset_id = 515)
-> Materialize (cost=0.00..8.27 rows=1 width=23)
-> Index Scan using pagesets_pkey on pagesets (cost=0.00..8.27 rows=1 width=23)
Index Cond: (id = 515)