MySQL performance boost after create & drop index

I have a large MySQL MyISAM table of around 4 million rows, running on a Core 2 Duo laptop with 8 GB of RAM.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16) column. Let's call this column "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN of the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
The first thing I'm not sure about is why another_index is chosen. In fact it chooses an index which is a composite index of indexed_varchar_column and 2 other columns (which form part of the selected ones). Perhaps this makes sense, since it may make things a bit faster by not having to read 2 of the columns in the query. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. The 2nd time I query the same 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it takes 5 seconds again. So it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again causes all further queries against new 'something_other' values to take only 0.15 secs. Please note that 1) I create an index, 2) I drop it again. So everything is in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this, I get exactly the same output as before.
How can I work out what is cached by the create-drop index procedure, so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B suggesting that when MySQL creates an index it internally does a SELECT, I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. All further queries are then very fast again (until I reboot the system). Note that after rebooting the queries are slow again; I guess this is because MySQL is relying on some sort of OS caching.
Any ideas? How can I explicitly cache the table?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It has 4 million rows, but I remove lots of old rows regularly and also add new ones. Since I had large gaps in the IDs (from the deleted rows), every day I drop the primary index (ID) and create it again with consecutive numbers. The table may therefore be very fragmented, so I/O must be an issue... Not sure what to do.

Thanks everybody for your help.
Finally I discovered (thanks to Marc B's hint) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command in a cron job that runs every day. It takes 2 minutes, but it's worth it.

How many indexes do you have that contain indexed_varchar_column? Do you have an index on just indexed_varchar_column by itself?
Have you tried:
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';

What is the order of the columns in your composite index?
Your query must use (at least) a leftmost prefix of the index's columns.
If you have an index on (foo, bar, baz), it will not be usable as an index against bar or baz by themselves; only (foo), (foo, bar), and (foo, bar, baz) can use it.
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
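For instance, here's a minimal sketch of that leftmost-prefix rule (table and column names are hypothetical):
CREATE TABLE t (foo INT, bar INT, baz INT, INDEX ix_foo_bar_baz (foo, bar, baz));
-- these can use ix_foo_bar_baz (leftmost prefixes):
SELECT * FROM t WHERE foo = 1;
SELECT * FROM t WHERE foo = 1 AND bar = 2;
SELECT * FROM t WHERE foo = 1 AND bar = 2 AND baz = 3;
-- these cannot (no leftmost prefix):
SELECT * FROM t WHERE bar = 2;
SELECT * FROM t WHERE baz = 3;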
EDIT: Here's a Postgres EXPLAIN of a simple left join query, for comparison.
Nested Loop Left Join  (cost=0.00..16.97 rows=13 width=103)
  Join Filter: (pagesets.id = pages.pageset_id)
  ->  Index Scan using ix_pages_pageset_id on pages  (cost=0.00..8.51 rows=13 width=80)
        Index Cond: (pageset_id = 515)
  ->  Materialize  (cost=0.00..8.27 rows=1 width=23)
        ->  Index Scan using pagesets_pkey on pagesets  (cost=0.00..8.27 rows=1 width=23)
              Index Cond: (id = 515)

Related

MariaDB - Indexing not improving performance on char(255) field

I'm trying to execute this SQL query on a table which has a little shy of 1 million records:
SELECT * FROM enty_score limit 100;
It gives me a result in about 600 ms.
As soon as I add a WHERE clause on a field `dim_agg_strategy` char(255) DEFAULT NULL, it takes 40 seconds to execute:
SELECT * FROM enty_score WHERE dim_agg_strategy='COMPOSITE_AVERAGE_LAKE' limit 100;
I've tried to create an index, but there is no improvement; it still takes 40 seconds to execute the same query:
ALTER TABLE `enty_score` ADD INDEX `dim_agg_strategy_index` (`dim_agg_strategy`);
SELECT INDEX_NAME, COLUMN_NAME, CARDINALITY, NULLABLE, INDEX_TYPE
FROM information_schema.statistics where INDEX_NAME = 'dim_agg_strategy_index';
INDEX_NAME |COLUMN_NAME |CARDINALITY|NULLABLE|INDEX_TYPE|
----------------------+----------------+-----------+--------+----------+
dim_agg_strategy_index|dim_agg_strategy| 586|YES |BTREE |
A little more info: the column in my WHERE clause contains just 6 distinct values:
select distinct dim_agg_strategy from enty_score;
dim_agg_strategy |
-------------------------+
COMPOSITE_AVERAGE |
COMPOSITE_AVERAGE_ALL |
COMPOSITE_AVERAGE_LAKE |
COMPOSITE_AVERAGE_NONLAKE|
NORMALISED_AVERAGE |
SIMPLE_AVERAGE |
The optimizer noticed that there were very few distinct values for that indexed column, so it realized that a lot of the rows would be needed, and decided to simply plow through the table and not bother with the index. (Using the index would involve bouncing back and forth between the index's BTree and the data's BTree a lot.)
So, you counter by pointing out the LIMIT 100. That is a valid objection. Alas, it points at a deficiency in the Optimizer.
It is torn between:
Ignoring the index, which is likely to be optimal if the entire table needed to be scanned. Note: that would happen if the 100 rows you needed were at the end of the table.
Using the index, but paying the extra overhead. Here it fails to realize that 100 is a lot less than 1M, which greatly improves the odds that the index would be the better approach.
Let's try to fool it... DROP that index and add another index. This time put 2 columns:
(dim_agg_strategy, xx)
where xx is some other column.
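A hedged sketch of that replacement; the second column here, enty_id, is a made-up stand-in for "some other column":
ALTER TABLE `enty_score` DROP INDEX `dim_agg_strategy_index`;
ALTER TABLE `enty_score` ADD INDEX `dim_agg_strategy_2col` (`dim_agg_strategy`, `enty_id`);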
(Let me know if this trick works for you.)

How to decide which fields must be indexed in a database table

Explanation
I have a table which does not have a primary key (not even a composite key).
The table is for storing the time slots (opening hours and food delivery available hours) of the food shops. Let's call the table "business_hours" and the main fields are as below.
shop_id
day (0 - 6, meaning Sunday - Saturday)
type (open, delivery)
start_time
end_time
As an example, if shop A is open on Monday from 9.00am - 01.00pm and 05.00pm - 10.00pm, there will be two records in the business_hours table for this scenario.
-----------------------------------------------
| shop_id | day | type | start_time | end_time
-----------------------------------------------
| 1000 | 1 | open | 09:00:00 | 13:00:00
-----------------------------------------------
| 1000 | 1 | open | 17:00:00 | 22:00:00
-----------------------------------------------
When I query this table, I will use shop_id always as the first condition in where clause.
Ex:
SELECT COUNT(*) FROM business_hours WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
Question
Is an index on "shop_id" enough, or should the "day" & "type" fields also be indexed?
It would also be great if you could explain how the indexing really works.
It depends on several factors that you should specify:
How fast will the data grow
What is the estimated table size in rows
What queries will be run against that table
How fast do you expect the queries to run
It is more about thinking like this: if some service will make thousands of inserts of new records per hour, the old records will be archived nightly, and reports are created nightly from that table, then you may prefer not to create many indexes, since they slow down inserts.
On the other hand if your table will grow and change slowly and many users will run queries against it, you need to have proper indexes to speed up queries.
If you can, try to create a clustered unique primary key that most queries can benefit from. If your data forms a timeline and most queries will get ranges of data using datetime criteria (like from - to), it is better to include the datetime in the clustered index - you will get the fastest query performance.
So something like this will give you the best performance for the mentioned SELECT. (But you cannot store duplicate business hours for one shop and type.)
CREATE TABLE business_hours
( shop_id    INT NOT NULL
, day        TINYINT NOT NULL
, type       VARCHAR(10) NOT NULL  -- assumed type, holds 'open' / 'delivery'
, start_time TIME NOT NULL         -- assumed type
, end_time   TIME NOT NULL         -- assumed type
-- other columns
, CONSTRAINT business_hours_PK
    PRIMARY KEY (shop_id, day, type, start_time, end_time) -- your clustered index
);
Just creating an index on the fields used in the SELECT (all of them, or just the most used) will speed up your query too:
CREATE INDEX BusinessHours_IX ON business_hours (shop_id,day,type, start_time, end_time);
The difference between clustered and non-clustered is that a clustered index affects the order in which the records are stored on disk.
You can use EXPLAIN to find missing indexes in your database; see this answer.
For more detail, see this blog.
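For example, a quick check of the question's query against the suggested primary key (times written as hh:mm:ss):
EXPLAIN SELECT COUNT(*) FROM business_hours
WHERE shop_id = 1000 AND day = 1 AND type = 'open'
  AND start_time <= '13:29:00' AND end_time > '13:29:00';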
Yes, you can create a clustered index on these columns (shop_id, day, type). I created an index like this:
CREATE CLUSTERED INDEX Ix ON business_hours (shop_id, day, type);
Then use this index in your SELECT query, like this:
SELECT COUNT(*) FROM business_hours with (index (Ix)) WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
You will get results fast. But for a table which has a primary key, do not create a clustered index; create a non-clustered index instead.
It also depends on your usage: if you are not updating the records, then use a clustered index, e.g.
CREATE CLUSTERED INDEX Saleperday ON business_hours (shop_id,day,type);
because a clustered index traverses the B-tree and stores the entire row at the leaf node itself, so searching is fast. But updating records is costly, as the entire row may have to be moved, creating a new entry for the same record.
Otherwise, if you are updating the records, then use a non-clustered index.
If you are building a data warehouse, then use columnstore indexes.
For a better understanding, you can go to these links:
http://www.programmerinterview.com/index.php/database-sql/clustered-vs-non-clustered-index/
http://www.patrickkeisler.com/2014/04/what-is-non-clustered-columnstore-index.html
http://searchsqlserver.techtarget.com/feature/SQL-Server-2014-columnstore-index-the-good-the-bad-and-the-clustered
Having decided against a primary key means the following would be allowed:
| shop_id | day | type | start_time | end_time
+---------+-----+--------+------------+---------
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 17:00:00 | 22:00:00
| 1000 | 1 | closed | 17:00:00 | 22:00:00
So you can have duplicate entries that may lead to strange query results and even have a shop open and closed in the very same time range. (But well, we all know that even with a primary key you'd still need a before-insert trigger to detect a range overlapping, e.g. 12:00-15:00 vs. 13:00-16:00, and throw an error in case. - How I wish there were some built-in range detection, so we could, say, have a unique index on (shop_id, day, range(start_time, end_time)).)
As to your question: Provided your database is built well, you already have a foreign key on shop_id. You don't need any further index as long as you consider your queries fast enough.
Once you think you need to speed them up, you can add composite indexes as needed. That would usually be an index on all columns in the slow query's WHERE clause. If that still doesn't suffice add the columns that are in the GROUP BY clause, if any. Next step would be to add the columns of the HAVING clause, if any. Next would be the columns of the ORDER BY clause. And the last step would be to even add all columns in your SELECT clause, which would give you a covering index, i.e. all data needed for the query would be in the index and the table itself would hence not have to be accessed any longer.
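As a hedged sketch of that progression for the slow query in the question (the index name is made up): indexing all the columns of the WHERE clause already yields a covering index here, since COUNT(*) needs no further columns.
CREATE INDEX bh_where ON business_hours (shop_id, day, type, start_time, end_time);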
But as mentioned: As long as you don't have performance issues, you don't have to add any composite indexes.
To decide which fields must be indexed in a database table, you need to observe the behavior of each query sent to the table. Indexes are the means of providing an efficient access path between the application and the data. The index provides the access path, so when a query asks the database for data, the database knows where to go to retrieve it.
Here is some official Microsoft documentation
Clustered Indexes: A clustered index stores the actual table data pages at the leaf level, and the table data is ordered physically around the key. A table can have only one clustered index, and when this index is created, the following events also occur:
• Table data is rearranged.
• New index pages are created.
• All nonclustered indexes within the database are rebuilt.
As a result, there are many disk I/O operations and extensive use of system and memory resources. If you plan to create a clustered index, be sure you have free space equal to at least 1.5 times the amount of data in the table. The extra free space ensures that you have enough space to complete the operation efficiently.
Nonclustered Indexes: In a nonclustered index, pages at the leaf level contain a bookmark that tells SQL Server where to find the data row corresponding to the key in the index. If the table has a clustered index, the bookmark indicates the clustered index key. If the table does not have a clustered index, the bookmark is an actual row locator. When you create a nonclustered index, SQL Server creates the required index pages but does not rearrange table data.
The indexing method recommended by professionals comprises three phases: monitor, analyze, and then implement the index. That means you need to observe the behavior of your database when you run a query, and then work on getting the best performance.
SQL Server uses these operations to fetch data:
Table scan: Reads the entire heap and, most likely, passes all the data to a secondary filter operation.
Index scan: Reads the entire leaf level (every row) of the clustered index or non-clustered index. The index scan operation might filter the rows and return only those that meet the criteria, or it might pass all the rows to another filter operation, depending on the complexity of the criteria. The data may or may not be ordered.
Index seek: Locates specific rows' data using the index and returns only the selected rows, in an ordered list.
So, once you know that, you can run the query with the Display Estimated Execution Plan option and analyze the performance.
I recommend reading the posts SQL SERVER – Index Seek Vs. Index Scan and Optimizing Your Query Plans with the SQL

Speeding up a MySql DELETE that relies on a BIT column

I'm using MySQL 5.5.46 and have an InnoDB table with a BIT column (named "ENABLED"). There is no index on this column. The table has 26 million rows, so understandably, the statement
DELETE FROM my_table WHERE ENABLED = 0;
takes a really long time. My question is: is there anything I can do (without upgrading MySQL, which is not an option at this time) to speed this query up? My innodb_buffer_pool_size variable is set to the following:
show variables like 'innodb_buffer_pool_size';
+-------------------------+-------------+
| Variable_name | Value |
+-------------------------+-------------+
| innodb_buffer_pool_size | 11674845184 |
+-------------------------+-------------+
Do the DELETE in "chunks" of 1000, based on the PRIMARY KEY. See Delete Big. That article goes into detail about efficient ways to chunk and what to do about gaps in the PK.
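A minimal sketch of the idea, assuming an AUTO_INCREMENT primary key named id; the window advances until the maximum id is passed:
DELETE FROM my_table
WHERE ENABLED = 0
  AND id >= 1 AND id < 1001;
-- then repeat with id >= 1001 AND id < 2001, and so on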
(With that 11GB buffer_pool, I assume you have 16GB of RAM?)
In general, MySQL will do a table scan instead of using an index if the number of rows to be selected is more than about 20% of the total number of rows. Hence, "flag" fields are almost never worth indexing by themselves.

Stop MySQL after first match

I noticed that adding LIMIT 1 at the end of a query does not decrease the execution time. I have a few thousand records and a simple query. How do I make MySQL stop after the first match?
For example, these two queries both take approximately half a second:
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes LIMIT 1
Edit: And here are the EXPLAIN results:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | links | ALL | NULL | NULL | NULL | NULL | 38556 | Using where; Using filesort
The difference between the two queries' run times depends on the actual data.
There are several possible scenarios:
There are many records with LENGTH(content)<500
In this case, MySQL will start scanning all table rows (according to primary key order since you didn't provide any ORDER BY).
There is no index used, since your WHERE condition can't be indexed.
Since there are relatively many rows with LENGTH(content)<500, the LIMIT query will return faster than the other one.
There are no records with LENGTH(content)<500
Again, MySQL will start scanning all table rows, but will have to go through all records to figure out that none of them satisfies the condition.
Again no index can be used for the same reason.
In this case - the two queries will have exactly the same run time.
Anything between those two scenarios will have a different run time, and the run times will be farther apart the more matching records you have in the table.
Edit
Now that you added the ORDER BY, the answer is a bit different:
If there is an index on the likes column, the ORDER BY would use it, and the time would be the time it takes to get to the first record that satisfies the WHERE condition (if 66% of the records do, then this should be faster than without the LIMIT).
If there is no index on the likes column, the ORDER BY will take most of the time: MySQL must scan the whole table to get all records that satisfy the WHERE, then order them by likes, and then take the first one.
In this case both queries will have similar run times (scanning and sorting the results takes much longer than returning 1 record or many records...)!
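If the likes index from the first case is missing, adding it is a one-liner (the index name is made up):
CREATE INDEX ix_likes ON links (likes);
-- MySQL can then read rows in likes order and stop at the
-- first one whose content is shorter than 500 characters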
Calling functions on column data forces a table scan; such expressions can't be indexed. What you might do instead, if performance is a concern here, is create a derived column where you've saved this value in advance:
ALTER TABLE links ADD COLUMN content_length INT;
UPDATE links SET content_length = LENGTH(content);
ALTER TABLE links ADD INDEX idx_content_length (content_length);
Once denormalized and indexed like this, you'll be able to run the query much faster. Keep in mind that you'll have to populate content_length each time you add a record.
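One way to keep the column populated automatically is a pair of triggers; a minimal sketch (the trigger names are made up):
CREATE TRIGGER links_len_ins BEFORE INSERT ON links
FOR EACH ROW SET NEW.content_length = LENGTH(NEW.content);
CREATE TRIGGER links_len_upd BEFORE UPDATE ON links
FOR EACH ROW SET NEW.content_length = LENGTH(NEW.content);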

Is it possible to optimize a query that uses the '<>' operator?

This a follow-up to a previous question.
How can I optimize this query so that it does not perform a full table scan?
SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
explain SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Employee | ALL | PRIMARY | NULL | NULL | NULL | 5000 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
(Employee.id is the primary key, in case that isn't clear.)
Have a covering index for name and id, and the engine should be able to fulfill the query using the index alone. This might be faster, because there's a good chance the entire index will already be in memory, while a table scan is more likely to need to go to disk.
Because of the low (non-existent) selectivity of your where clause you may need to provide a hint to get the database to use your index. I'm a sql server guy, and so I'm not sure of the syntax needed in mysql to hint an index, or even if mysql is able to take advantage of a covering index in this manner.
That said, I doubt you can get much improvement: you're returning every row but one. You should expect that to need to scan the table.
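For reference, a hedged sketch of that covering index in MySQL terms, assuming InnoDB (where a secondary index implicitly carries the primary key, so an index on name alone already contains both name and id):
CREATE INDEX ix_employee_name ON Employee (name);
-- the query can now be answered entirely from the index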
There are a lot of things to try, it depends on how the database engine chooses to parse it, really. Some options:
select employee.name from employee where employee.id not in (1000);
You could also try a UNION of a less-than query and a greater-than query, as sketched below.
But in the specific example you are giving (which may just be too simple for your real case) a table scan isn't necessarily a bad thing. If all the records have to be returned except one, using an index may in fact be slower.
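A hedged version of that UNION rewrite; each half is a primary-key range scan:
SELECT Employee.name FROM Employee WHERE Employee.id < 1000
UNION ALL
SELECT Employee.name FROM Employee WHERE Employee.id > 1000;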
In traditional databases, you can't!
Of course, you could just omit all employees with the given id (when it is a key or has an index) -- but normally you will still have the vast majority of the table under your feet, so using an index might complicate things, and thus a full table scan (FTS) is normally the faster option.
When you have specialized databases, you could store the names of all employees adjacent to each other.
Edit: I now saw Joel's answer. Yes, this could be a way, since in fact your special index is then a specialized form of storing part of the content. Good databases can use the index content alone when it covers the columns needed -- this is rather nice. Of course, you will end up with a so-called "full index scan" (which is normally much faster than a full table scan).
Nothing you can do will increase performance. In this case the database must do a complete table scan, as you are asking for every record save one. Reading every page in an index on top of that would only reduce performance. Fortunately, even if you added an index, the database would be smart enough to ignore it...
EDIT, to address @Juergen's comment:
Juergen, you are right about a covering index, but there are conflicting effects here. Any use of an index in a scenario like this has a bad effect in one sense: the query engine could have to perform one I/O operation for each level in the index, for each row it needs to examine. If there are, say, 5 levels in the index and 1M rows, that would be 5 million I/O operations, compared to only 1M I/Os to do a complete table scan. This is why, in this scenario, most query optimizers would ignore any available index and do the table scan anyway (unless you force the index with a hint). The only mitigating factor is if EVERY attribute required by the query is in the index (a covering index) and the number of index rows per page on disk is sufficiently larger than the number of table rows per page to counteract the negative effect of having to traverse each level of the index for each row returned by the query.