I have a table with ~6M rows, and queries against it typically extract 20,000-30,000 rows each, even with index optimization. Because many people run these extractions consecutively (every 30 seconds or so), the site often times out.
I recently migrated the database to a 3-server MySQL cluster with a huge amount of RAM (512GB per server), but performance hasn't improved much.
I was wondering whether partitioning would be the best way to improve performance. As I have absolutely no experience with partitioning, I thought I would ask here.
Every row in this table has a column whose value is 0, 1, 2, or 3.
Would it be possible to place all rows with value 1 in that column on one partition, and all rows with value 2 on another? Would rows move automatically between partitions when the value is updated? And most importantly, could it help performance, since a lookup would only have to search through 20,000-30,000 rows instead of 6,000,000?
Yes, MySQL supports partitioning, and you have quite a bit of control over how partitions are defined. For example:
CREATE TABLE MyTable (
    id INT AUTO_INCREMENT PRIMARY KEY,
    somestuff INT,
    otherstuff VARCHAR(100),
    KEY (somestuff)
) PARTITION BY HASH(id) PARTITIONS 4;
INSERT INTO MyTable () VALUES (), (), (), ();
You can verify how many rows are in each partition after this:
SELECT PARTITION_NAME, TABLE_ROWS FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME='MyTable';
+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p0             |          1 |
| p1             |          1 |
| p2             |          1 |
| p3             |          1 |
+----------------+------------+
However, there are two things that trip people up when they try to use partitioning in MySQL:
First, as https://dev.mysql.com/doc/refman/5.7/en/partitioning-limitations-partitioning-keys-unique-keys.html says:
every unique key on the table must use every column in the table's partitioning expression.
This means that in the example above you can't partition by somestuff: doing so would violate the requirement that the primary key include every column named in the partitioning expression.
ALTER TABLE MyTable PARTITION BY HASH(somestuff) PARTITIONS 4;
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
You can get around this by removing any primary key or unique key constraints from your table, but that leaves you with kind of a malformed table.
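If you can live with a wider key instead, another workaround is to add the partition column to the primary key and then partition on that column directly. Here is a sketch against the example table, using LIST partitioning to match the 0/1/2/3 column from the question (note the trade-off: id alone is no longer enforced as unique):

ALTER TABLE MyTable
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, somestuff);  -- partition column must appear in every unique key

ALTER TABLE MyTable
    PARTITION BY LIST (somestuff) (
        PARTITION p0 VALUES IN (0),
        PARTITION p1 VALUES IN (1),
        PARTITION p2 VALUES IN (2),
        PARTITION p3 VALUES IN (3)
    );

With this layout MySQL does move a row to the right partition automatically when you update somestuff (internally a delete from one partition plus an insert into another), which answers that part of the question.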
Second, partitioning speeds up queries only if you can take advantage of partition pruning, and this happens only if your query conditions include the column used in the partition expression.
mysql> EXPLAIN PARTITIONS SELECT * FROM MyTable WHERE somestuff = 3;
+----+-------------+---------+-------------+------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------------+------+---------------+-----------+---------+-------+------+-------+
| 1 | SIMPLE | MyTable | p0,p1,p2,p3 | ref | somestuff | somestuff | 5 | const | 4 | NULL |
+----+-------------+---------+-------------+------+---------------+-----------+---------+-------+------+-------+
Note this says it will need to scan partitions p0,p1,p2,p3 — i.e. the whole table. There is no partition pruning, therefore no performance improvement because it is not reducing the number of rows examined.
If you do search for a specific value in the column used in the partitioning expression, you can see that MySQL is able to reduce the number of partitions it scans:
mysql> EXPLAIN PARTITIONS SELECT * FROM MyTable WHERE id = 3;
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | MyTable | p3 | const | PRIMARY | PRIMARY | 4 | const | 1 | NULL |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+-------+
Partitioning can help a lot in very specific circumstances, but partitioning isn't as versatile as most people think.
In most cases, it's better to define more specific indexes in your table to support the queries you need to run.
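For example, a sketch (not tied to any real schema beyond the toy table above): if the hot query filters on somestuff plus another column, a composite index gives you the row pruning that partitioning was being asked for:

ALTER TABLE MyTable ADD INDEX idx_some_other (somestuff, otherstuff);
-- A query like
--   SELECT * FROM MyTable WHERE somestuff = 1 AND otherstuff = 'x';
-- can now be resolved from this one index, touching only the matching rows.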
I was expecting this query to use a key.
mysql> DESCRIBE Foo;
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | bigint(20) | NO | PRI | NULL | auto_increment |
| name | varchar(50) | NO | UNI | NULL | |
+-------+-------------+------+-----+---------+----------------+
mysql> EXPLAIN SELECT id FROM Foo WHERE name='foo';
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------------------------------+
Foo has a unique index on name, so why isn't the index being used in the SELECT?
From the MySQL manual page entitled EXPLAIN Output Format:
Impossible WHERE noticed after reading const tables (JSON property: message)
MySQL has read all const (and system) tables and noticed that the WHERE clause is always false.
and the definition of const tables, from the page entitled Constants and Constant Tables:
A MySQL constant is something more than a mere literal in the query. It can also be the contents of a constant table, which is defined as follows:
1. A table with zero rows, or with only one row
2. A table expression that is restricted with a WHERE condition, containing expressions of the form column = constant, for all the columns of the table's primary key, or for all the columns of any of the table's unique keys (provided that the unique columns are also defined as NOT NULL).
The second reference is a page and a half long. Please refer to it.
const
The table has at most one matching row, which is read at the start of
the query. Because there is only one row, values from the column in
this row can be regarded as constants by the rest of the optimizer.
const tables are very fast because they are read only once.
const is used when you compare all parts of a PRIMARY KEY or UNIQUE
index to constant values. In the following queries, tbl_name can be
used as a const table:
SELECT * FROM tbl_name WHERE primary_key=1;
SELECT * FROM tbl_name WHERE primary_key_part1=1 AND
primary_key_part2=2;
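You can see the const lookup in action by giving the query a row to find. A quick sketch with hypothetical data (assuming the unique index carries the default name, name):

INSERT INTO Foo (name) VALUES ('foo');
EXPLAIN SELECT id FROM Foo WHERE name = 'foo';
-- With a matching row present, the plan shows type = const and key = name:
-- the unique-index lookup ran once, during the constant-table phase,
-- which is exactly the index use you were expecting.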
It could be that the table Foo contains very little data. In that case, the optimizer will choose a table scan rather than an index lookup.
As the MySQL documentation says:
Indexes are less important for queries on small tables, or big tables
where report queries process most or all of the rows. When a query
needs to access most of the rows, reading sequentially is faster than
working through an index. Sequential reads minimize disk seeks, even
if not all the rows are needed for the query.
I'm trying to troubleshoot a performance issue on MySQL, so I wanted to create a smaller version of a table to work with. When I add a LIMIT clause to the query, it goes from about 2 seconds (for the full insert) to astronomical (42 minutes).
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id;
+------------+-------------+
| player_id  | insert_date |
+------------+-------------+
| 1002395119 | 2012-05-14  |
...
| 1002395157 | 2012-05-14  |
| 1002395187 | 2012-05-14  |
| 1002395475 | 2012-05-14  |
+------------+-------------+
105776 rows in set (2.19 sec)
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id limit 1;
+------------+-------------+
| player_id  | insert_date |
+------------+-------------+
| 1000000080 | 2012-05-14  |
+------------+-------------+
1 row in set (42 min 23.26 sec)
mysql> describe player_record;
+------------------------+------------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+------------------------+------+-----+---------+-------+
| player_id | int(10) unsigned | NO | PRI | NULL | |
| insert_date | date | NO | PRI | NULL | |
| xp | int(10) unsigned | YES | | NULL | |
+------------------------+------------------------+------+-----+---------+-------+
17 rows in set (0.01 sec) (most columns removed)
There are 20 million rows in the player_record table, so I am creating two tables in memory for the specific dates I am looking to compare.
CREATE TEMPORARY TABLE date_curr
(
    player_id INT UNSIGNED NOT NULL,
    insert_date DATE,
    PRIMARY KEY (player_id, insert_date)
) ENGINE=MEMORY;
INSERT INTO date_curr
SELECT player_id,
       MAX(insert_date) AS insert_date
FROM player_record
WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-15' + INTERVAL 6 DAY
GROUP BY player_id;
CREATE TEMPORARY TABLE date_prev LIKE date_curr;
INSERT INTO date_prev
SELECT pr.player_id,
       MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc
        ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0,20000;
date_curr has 216k entries, and date_prev has 105k entries if I don't use a limit.
These tables are just part of the process, used to trim down another table (500 million rows) to something manageable. date_curr includes the player_id and insert_date from the current week, and date_prev has the player_id and most recent insert_date from BEFORE the current week for any player_id present in date_curr.
Here is the explain output:
mysql> explain SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0,20000;
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | pr | range | PRIMARY,insert_date | insert_date | 3 | NULL | 396828 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 216825 | Using where; Using join buffer |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
2 rows in set (0.03 sec)
This is on a system with 24G of RAM dedicated to the database, and it is currently pretty much idle. This specific database is the test copy, so it is completely static. I did a MySQL restart and it still shows the same behavior.
Here is the 'show profile all' output; most of the time is spent on copying to a tmp table.
| Status | Duration | CPU_user | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Messages_sent | Messages_received | Page_faults_major | Page_faults_minor | Swaps | Source_function | Source_file | Source_line |
| Copying to tmp table | 999.999999 | 999.999999 | 0.383941 | 110240 | 18983 | 16160 | 448 | 0 | 0 | 0 | 43 | 0 | exec | sql_select.cc | 1976 |
A bit of a long answer but I hope you can learn something from this.
So, based on the evidence in the EXPLAIN output, you can see that there were two possible indexes that the MySQL query optimizer could have used. They are as follows:
possible_keys
PRIMARY,insert_date
However, the MySQL query optimizer decided to use the following index:
key
insert_date
This is a rare occasion where the MySQL query optimizer used the wrong index. There is a probable cause for this: you are working on a static development database, which you probably restored from production to develop against.
When the MySQL optimizer needs to decide which index to use in a query, it looks at the statistics for all the candidate indexes. You can read more about statistics at http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-other-changes-statistics-estimation.html as a starting point.
When you update, insert into, and delete from a table, you change the index statistics. It might be that, because of the static data, the MySQL server had stale statistics and chose the wrong index. However, this is just a guess at a possible root cause at this point.
Now let's dive into the indexes. There were two possible indexes to use: the primary key index and the index on insert_date, and MySQL used the insert_date one. Remember that during query execution MySQL can normally use only one index per table. Let's look at the difference between the primary key index and the insert_date index.
A simple fact about a primary key index (aka clustered):
A primary key index is normally a B-tree structure that contains the data rows themselves, i.e. it is the table, since it contains the data.
A simple fact about a secondary index (aka non-clustered):
A secondary index is normally a B-tree structure that contains the data being indexed (the columns in the index) and a pointer to the location of the data row in the primary key index.
This is a subtle but big difference.
Let me explain: when you read a primary key index, you are reading the table, and the table is stored in the order of the primary key. To find a value, you search the index and read the data, which is one operation.
When you read a secondary index, you search the index, find the pointer, then read the primary key index to locate the data the pointer refers to. That is essentially two operations, making a secondary index read roughly twice as costly as a primary key index read.
In your case, since it chose insert_date as the index, it was doing double the work just to do the join. That is problem one.
Now, when you LIMIT a result set, that is the last step of query execution. MySQL takes the entire result set, sorts it (if it is not sorted already) according to the ORDER BY and GROUP BY clauses, and then returns the number of records you asked for. MySQL has to do a fair amount of work to keep track of which records to send and where it is in the result set. LIMIT does have a performance cost, but I suspect there is a contributing factor; read on.
Look at your GROUP BY: it is on player_id, but the index used is on insert_date. GROUP BY essentially orders your result set, and MySQL had no index it could use for that ordering (remember, an index is sorted in the order of the column(s) it contains). In effect you asked it to sort/order on player_id while the index it read was sorted on insert_date.
This mismatch caused the filesort problem: the data returned from reading the secondary index and then the primary key index (remember, two operations) has to be sorted afterwards. Sorting is normally done on disk, as it is a very expensive operation to do in memory, so the entire query result was written to disk and sorted, painfully slowly, to get you your results.
By removing the insert_date index, MySQL will use the primary key index instead, which means the data is already ordered by player_id and insert_date. That eliminates the need to read the secondary index and then follow the pointer into the primary key index (i.e. the table), and since the data is already sorted, MySQL has very little work to do when applying the GROUP BY part of the query.
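Concretely, the suggested fix is a one-liner (the index name insert_date is taken from the key column of your EXPLAIN; verify it with SHOW INDEX FROM player_record first):

ALTER TABLE player_record DROP INDEX insert_date;
-- Re-run the EXPLAIN afterwards; the plan should switch to PRIMARY and the
-- 'Using temporary; Using filesort' extras should disappear.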
The following is again a bit of an educated guess; if you could post the EXPLAIN output from after the index was dropped, I could probably confirm my thinking. Because the wrong index was used, the results were sorted on disk so the LIMIT could be applied properly. Removing the LIMIT lets MySQL sort in memory, since it no longer has to apply the LIMIT and track what is being returned. The LIMIT probably also caused the temporary table to be created. Once again, it is difficult to say without seeing the difference between the two EXPLAIN outputs.
Hopefully this gives you a better understanding of indexes and why they are a double edged sword.
Had the same problem. When I added FORCE INDEX (id), the query went back to the few milliseconds it took without the LIMIT, while producing the same results.
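For the player_record query above, the equivalent hint would presumably target the primary key, since that is the index the optimizer should have chosen, along these lines:

SELECT pr.player_id, MAX(pr.insert_date) AS insert_date
FROM player_record pr FORCE INDEX (PRIMARY)
INNER JOIN date_curr dc ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 1;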
I have a table containing 60 million rows. The structure is entryid, date, sourceid, detail, views. (entryid, date, sourceid, detail) is the PK, and I also have an index on each field except views.
The problem is that the cardinality of all four indexes shows as zero, but I am sure it should not be.
I wonder why that is? And does it mean the indexes don't work?
It's possible that the table statistics have not been updated.
See this page on optimizing MyISAM tables:
To help MySQL better optimize queries, use ANALYZE TABLE or run
myisamchk --analyze on a table after it has been loaded with data.
This updates a value for each index part that indicates the average
number of rows that have the same value. (For unique indexes, this is
always 1.) MySQL uses this to decide which index to choose when you
join two tables based on a nonconstant expression. You can check the
result from the table analysis by using SHOW INDEX FROM tbl_name and
examining the Cardinality value. myisamchk --description --verbose
shows index distribution information.
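So a first step would be something like the following (a sketch; substitute your real table name):

ANALYZE TABLE mytable;
SHOW INDEX FROM mytable;
-- Cardinality should now be non-zero; for the unique
-- (entryid, date, sourceid, detail) primary key, the value on the last
-- key part should approach the table's row count.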
The best way to determine whether an index is helping is to EXPLAIN a query:
mysql> explain select 1;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
1 row in set (0.00 sec)
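A more representative sketch for your table would be to EXPLAIN a query you actually run and check the key and rows columns (the predicate values here are made up):

EXPLAIN SELECT views FROM mytable WHERE entryid = 42 AND date = '2012-05-15';
-- If key shows PRIMARY (or one of your secondary indexes) and rows is small,
-- the index is working regardless of what the cardinality column says.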
I have the following MySQL table (simplified):
CREATE TABLE `track` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(256) NOT NULL,
`is_active` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `is_active` (`is_active`, `id`)
) ENGINE=MyISAM AUTO_INCREMENT=7495088 DEFAULT CHARSET=utf8
The 'is_active' column marks rows that I want to ignore in most, but not all, of my queries. I have some queries that read chunks out of this table periodically. One of them looks like this:
SELECT id,title from track where (track.is_active=1 and track.id > 5580702) ORDER BY id ASC LIMIT 10;
This query takes over a minute to execute. Here's the execution plan:
> EXPLAIN SELECT id,title from track where (track.is_active=1 and track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+------+----------------+--------+---------+-------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+--------+---------+-------+---------+-------------+
| 1 | SIMPLE | t | ref | PRIMARY,is_active | is_active | 1 | const | 3747543 | Using where |
+----+-------------+-------+------+----------------+--------+---------+-------+---------+-------------+
Now, if I tell MySQL to ignore the 'is_active' index, the query happens instantaneously.
> EXPLAIN SELECT id,title from track IGNORE INDEX(is_active) WHERE (track.is_active=1 AND track.id > 5580702) ORDER BY id ASC LIMIT 10;
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| 1 | SIMPLE | t | range | PRIMARY | PRIMARY | 4 | NULL | 1597518 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
Now, what's really strange is that if I FORCE MySQL to use the 'is_active' index, the query once again happens instantaneously!
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
| 1 | SIMPLE | t | range | is_active | is_active | 5 | NULL | 1866730 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-------------+
I just don't understand this behavior. In the 'is_active' index, rows should be sorted by is_active, followed by id. I use both the 'is_active' and 'id' columns in my query, so it seems like it should only need to do a few hops around the tree to find the IDs, then use those IDs to retrieve the titles from the table.
What's going on?
EDIT: More info on what I'm doing:
Query cache is disabled
Running OPTIMIZE TABLE and ANALYZE TABLE had no effect
6,620,372 rows have 'is_active' set to True. 874,714 rows have 'is_active' set to False.
Using FORCE INDEX(is_active) once again speeds up the query.
MySQL version 5.1.54
It looks like MySQL is making a poor decision about how to use the index.
The query plan shows that it could have used either the PRIMARY or the is_active index, and it chose is_active in order to narrow by track.is_active first. However, it is only using the first column of that index (track.is_active). That matches 3,747,543 rows, which then have to be filtered and sorted.
If it had chosen the PRIMARY index, it would be able to narrow down to 1597518 rows using the index, and they would be retrieved in order of track.id already, which should require no further sorting. That would be faster.
New information:
In the third case, where you use FORCE INDEX, MySQL uses the is_active index, but now, instead of only the first column, it uses both columns (see key_len). It can therefore narrow by is_active and sort and filter by id within the same index, and since is_active is a single constant, the ORDER BY is satisfied by the second column (i.e. the rows within a single branch of the index are already in sorted order). This seems to be an even better outcome than using PRIMARY - and probably what you intended in the first place, right?
I don't know why it wasn't using both columns of this index without FORCE INDEX, unless the query changed in a subtle way in between. If not, I'd put it down to MySQL making bad decisions.
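For reference, the hinted query that produced that last plan was presumably along these lines:

SELECT id, title
FROM track FORCE INDEX (is_active)
WHERE is_active = 1 AND id > 5580702
ORDER BY id ASC
LIMIT 10;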
I think the speedup is due to your WHERE clause. I am assuming it retrieves only a small subset of the rows in the entire large table. It is faster to do a table scan over that small retrieved subset for is_active than to filter through a large index file, and traversing a single-column index is much faster than traversing a combined index.
A few things you could try:
Run OPTIMIZE TABLE and ANALYZE TABLE on the table, so MySQL re-calculates the index statistics.
Have a look at http://dev.mysql.com/doc/refman/5.1/en/index-hints.html - you can tell MySQL which index to choose in different cases.
Table structure:
+-------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| total | int(11) | YES | | NULL | |
| thedatetime | datetime | YES | MUL | NULL | |
+-------------+----------+------+-----+---------+----------------+
Total rows: 137967
mysql> explain select * from out where thedatetime <= NOW();
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | out | ALL | thedatetime | NULL | NULL | NULL | 137967 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
The real query is much longer, with more table joins. The point is, I can't get the query to use the datetime index. This will be a problem if I want to select all data up to a certain date. However, I noticed that I can get MySQL to use the index if I select a smaller subset of data.
mysql> explain select * from out where thedatetime <= '2008-01-01';
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| 1 | SIMPLE | out | range | thedatetime | thedatetime | 9 | NULL | 15826 | Using where |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
mysql> select count(*) from out where thedatetime <= '2008-01-01';
+----------+
| count(*) |
+----------+
| 15990 |
+----------+
So, what can I do to make sure MySQL will use the index no matter what date that I put?
There are two things in play here -
Index is not selective enough - if the index covers more than approx. 30% of the rows, MySQL will decide a full table scan is more efficient. When you contract the range the index kicks in.
One index per table in a join
The real query is much longer, with more table joins, the point is ...
The point is exactly that: because the query has joins, it probably can't use that index. MySQL can use only one index per table in a join (unless the query qualifies for an index merge optimization). If the primary key is already used for the join, thedatetime won't be used. To use it, you need to create a multi-column index on the join key plus thedatetime, in the correct order.
Check the EXPLAIN of the actual query to see which key MySQL uses for the join. Modify that index to include the thedatetime column as well, or create a new multi-column index from both (depending on what you use the join key for).
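As a sketch, assuming the join arrives on a hypothetical source_id column (the real join columns aren't shown in the question):

ALTER TABLE `out` ADD INDEX idx_join_datetime (source_id, thedatetime);
-- The join narrows on source_id first; the thedatetime range condition can
-- then prune within that same branch of the index instead of being ignored.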
Everything works as it is supposed to. :)
Indexes are there to speed up retrieval. They do it using index lookups.
In your first query the index is not used because you are retrieving ALL rows, and in that case using the index is slower (lookup index, get row, lookup index, get row... repeated for every row is slower than just reading all rows, i.e. a table scan).
In the second query you are retrieving only a portion of the data and in this case table scan is much slower.
The job of the optimizer is to use the statistics that the RDBMS keeps on the index to determine the best plan. In the first case the index was considered, but the planner (correctly) threw it away.
EDIT
You might want to read something like this to get some concepts and keywords regarding the MySQL query planner.