MySQL JOIN performance vs WHERE + IN

I have encountered a problem when joining two tables: one large table with 140M rows and one small table with 100 rows, joined on the primary key.
The two tables are:
CREATE TABLE DataTable (
  Date TIMESTAMP,
  Hash VARCHAR(20),
  Type VARCHAR(20),
  Purchases INT,
  Store VARCHAR(20),
  PRIMARY KEY (Date, Hash)
);
DataTable is a very big table with 140M rows
CREATE TABLE ProductTable (
  Hash VARCHAR(20),
  Name VARCHAR(20),
  PRIMARY KEY (Hash)
);
ProductTable is a small table with only 100 rows.
I ran two separate queries:
SELECT SUM(DataTable.Purchases), DataTable.Store
FROM DataTable
JOIN ProductTable ON ProductTable.Hash = DataTable.Hash
WHERE Type = 2
  AND Date <= '2015-12-31'
  AND Date >= '2015-1-1'
GROUP BY DataTable.Store;
This took a very long time (in fact it never finished). Running EXPLAIN showed that it processed almost half of the table:
select_type | table        | type   | possible_keys | key     | key_len | ref               | rows   | Extra
------------|--------------|--------|---------------|---------|---------|-------------------|--------|-------------
Simple      | DataTable    | ALL    | Date, Hash    | NULL    | NULL    | NULL              | 7*10^7 | Using where
Simple      | ProductTable | eq_ref | PRIMARY       | PRIMARY | 386     | ProductTable.Hash | 1      | Using where
Just for kicks, I took all the relevant hashes from ProductTable and put them in a WHERE ... IN clause, like the following:
SELECT SUM(DataTable.Purchases), DataTable.Store
FROM DataTable
WHERE DataTable.Hash IN ("1ha84u", "1ha850", "1ha851", ..., "1hl931")
  AND Type = 2
  AND DataTable.Date <= '2015-12-31'
  AND DataTable.Date >= '2015-12-1'
GROUP BY DataTable.Store;
This resulted in much better performance: it took less than 2 seconds and scanned far fewer rows.
select_type | table     | type  | possible_keys | key        | key_len | ref  | rows  | Extra
------------|-----------|-------|---------------|------------|---------|------|-------|-------------
Simple      | DataTable | range | Date, Hash    | Date, Hash | 62      | NULL | 11097 | Using where
I do not understand why the primary key wasn't used for the first query, or why the second query performed so much better.
I have made sure that the result was not cached by MySQL between runs.

Try adding an index on (Hash, Date) to DataTable.
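A minimal sketch of that suggestion (the index name is illustrative). With Hash as the leading column, the optimizer can seek directly to the ~100 hashes coming from ProductTable and then range-scan the Date portion, which is essentially what the WHERE ... IN version achieved by spelling the hashes out:

-- Illustrative index name; puts Hash first so the join condition
-- ProductTable.Hash = DataTable.Hash can use the index.
ALTER TABLE DataTable ADD INDEX idx_hash_date (Hash, Date);

The existing primary key (Date, Hash) only helps queries that constrain Date first; a join on Hash alone cannot use it, which is why the first EXPLAIN fell back to a full scan.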

Related

Improving select statement performance

I have a table with around 500,000 rows and a composite primary key. I'm trying a simple SELECT statement as follows:
select * from transactions where account_id='1' and transaction_id='003a4955acdbcf72b5909f122f84d';
EXPLAIN gives me this:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
-------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | transactions | NULL | const | PRIMARY | PRIMARY | 94 | const,const | 1 | 100.00 | NULL
My primary index is on account_id and transaction_id.
My engine is InnoDB.
When I run my query it takes around 156 milliseconds.
Given that explain shows that only one row needs to be examined, I'm not sure how to optimize this further. What changes can I do to significantly improve this?
I'm going to speculate a bit, given the information provided: your primary key is composed of an integer field account_id and a varchar one called transaction_id.
Since they're both components of the PRIMARY index created when you defined PRIMARY KEY (account_id, transaction_id), they're already the best you can have as they are.
I think the bottleneck here is transaction_id: as a string, it requires more effort to index and to search. Changing its type to one that is easier to search (i.e. a numeric type) would probably help.
The only other improvement I see is to simplify the PRIMARY KEY itself, either by removing the account_id field (it seems redundant to me, since transaction_id looks like it should be unique, but that depends on your design) or by substituting the whole key with an integer AUTO_INCREMENT value (not recommended).
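A minimal sketch of the (not recommended) surrogate-key variant, assuming the layout above; the new column and key names are illustrative, and the old pair is kept as a UNIQUE key so existing lookups still work:

-- Illustrative names (id, uk_account_transaction); assumes the
-- transactions table described in the question.
ALTER TABLE transactions
  DROP PRIMARY KEY,
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (id),
  ADD UNIQUE KEY uk_account_transaction (account_id, transaction_id);

With InnoDB this also shrinks every secondary index, since secondary indexes carry a copy of the primary key.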

MySQL Partitioning and Automatic Movement of Rows

I have a table with ~6M rows from which queries extract around 20,000-30,000 rows each, even with index optimization. However, since a lot of people run these queries repeatedly (every 30 seconds or so), the site will often time out for them.
I recently migrated the database to a 3-server MySQL cluster with a huge amount of RAM (512GB per server), but performance hasn't improved much.
I was wondering if partitioning would be the best way to improve performance. As I have absolutely no experience with partitioning, I thought I would ask here.
My question is this: all of these rows have a column whose value is 0, 1, 2, or 3.
Would it be possible to place all rows with value 1 on one partition and all rows with value 2 on another? Would rows move automatically when the value is updated? And most importantly, could it help performance, since a query would only have to search through 20,000-30,000 rows instead of 6,000,000?
Yes, MySQL supports partitioning. You can define the partitions quite flexibly, for example:
CREATE TABLE MyTable (
  id INT AUTO_INCREMENT PRIMARY KEY,
  somestuff INT,
  otherstuff VARCHAR(100),
  KEY (somestuff)
) PARTITION BY HASH(id) PARTITIONS 4;

INSERT INTO MyTable () VALUES (), (), (), ();
You can verify how many rows are in each partition after this:
SELECT PARTITION_NAME, TABLE_ROWS FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME='MyTable';
+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p0 | 1 |
| p1 | 1 |
| p2 | 1 |
| p3 | 1 |
+----------------+------------+
However, there are two things that trip people up when they try to use partitioning in MySQL:
First, as https://dev.mysql.com/doc/refman/5.7/en/partitioning-limitations-partitioning-keys-unique-keys.html says:
every unique key on the table must use every column in the table's partitioning expression.
This means that if you want to partition by somestuff in the example above, you can't: it would fail the requirement that the primary key include the column named in the partitioning expression.
ALTER TABLE MyTable PARTITION BY HASH(somestuff) PARTITIONS 4;
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
You can get around this by removing any primary key or unique key constraints from your table, but that leaves you with kind of a malformed table.
Second, partitioning speeds up queries only if you can take advantage of partition pruning, and this happens only if your query conditions include the column used in the partition expression.
mysql> EXPLAIN PARTITIONS SELECT * FROM MyTable WHERE somestuff = 3;
+----+-------------+---------+-------------+------+---------------+-----------+---------+-------+------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------------+------+---------------+-----------+---------+-------+------+-------+
| 1 | SIMPLE | MyTable | p0,p1,p2,p3 | ref | somestuff | somestuff | 5 | const | 4 | NULL |
+----+-------------+---------+-------------+------+---------------+-----------+---------+-------+------+-------+
Note this says it needs to scan partitions p0,p1,p2,p3, i.e. the whole table. There is no partition pruning, and therefore no performance improvement, because the number of rows examined is not reduced.
If you do search for a specific value in the column used in the partitioning expression, you can see that MySQL is able to reduce the number of partitions it scans:
mysql> EXPLAIN PARTITIONS SELECT * FROM MyTable WHERE id = 3;
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | MyTable | p3 | const | PRIMARY | PRIMARY | 4 | const | 1 | NULL |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+-------+
Partitioning can help a lot in very specific circumstances, but it isn't as versatile as most people think.
In most cases, it's better to define more specific indexes in your table to support the queries you need to run.
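For the use case in the question, a more specific index is probably the better first step. A minimal sketch with hypothetical names: status stands for the 0/1/2/3 column, created_at for whatever else the query filters or sorts on, and the table name is made up:

-- Hypothetical table and column names; adjust to the real schema.
ALTER TABLE my_big_table
  ADD INDEX idx_status_created (status, created_at);

A query with WHERE status = 1 AND created_at >= ... can then seek straight to the 20,000-30,000 matching rows without any partitioning.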

Why is my query checking 1000's of rows even though the table is indexed?

The table has around 20K rows and the following create code:
CREATE TABLE `inventory` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `TID` int(11) DEFAULT NULL,
  `RID` int(11) DEFAULT NULL,
  `CID` int(11) DEFAULT NULL,
  `value` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`ID`),
  KEY `index_TID_CID_value` (`TID`,`CID`,`value`(25))
);
and this is the result of the EXPLAIN query:
mysql> explain select rowID from inventory where TID=4 and CID=28 and value=3290843588097;
+----+-------------+------------+------+------------------------+-----------------------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+------------------------+-----------------------+---------+-------------+------+-------------+
| 1 | SIMPLE | inventory | ref | index_TID_CID_value | index_TID_CID_value | 10 | const,const | 9181 | Using where |
+----+-------------+------------+------+------------------------+-----------------------+---------+-------------+------+-------------+
1 row in set (0.00 sec)
The combination of TID=4 and CID=28 has around 13K rows in the table.
My questions are:
Why is the EXPLAIN result telling me that around 9k rows will be examined to get the final result?
Why does the ref column show only const,const? Since 3 columns are included in the multi-column index, shouldn't ref be const,const,const?
Update 7 Oct 2016
Query:
select rowID from inventory where TID=4 and CID=28 and value=3290843588097;
I ran it about 10 times and took the times of the last five runs (they were all the same):
No index - 0.02 seconds
Index (TID, CID) - 0.03 seconds
Index (TID, CID, value) - 0.00 seconds
Also, the same EXPLAIN query looks different today. How? Note that key_len has changed to 88, ref has changed to const,const,const, and the rows to examine have dropped to 2.
mysql> explain select rowID from inventory where TID=4 and CID=28 and value='3290843588097';
+----+-------------+-----------+------+----------------------+---------------------+---------+-------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+----------------------+---------------------+---------+-------------------+------+-------------+
| 1 | SIMPLE | inventory | ref | index_TID_CID_value | index_TID_CID_value | 88 | const,const,const | 2 | Using where |
+----+-------------+-----------+------+----------------------+---------------------+---------+-------------------+------+-------------+
1 row in set (0.04 sec)
To explicitly answer your questions:
The EXPLAIN plan gives you ~9k rows because the engine needs to search through the index tree to find the row IDs that match your WHERE criteria. The index maps each combination of the indexed column values to the list of row IDs associated with that combination; in effect, the engine scans those combinations to find the right one, hence the ~9k estimate.
Since your WHERE criteria involve all three of the index columns, the engine optimizes the search by leveraging the index for the first two columns, then short-circuiting on the third column and retrieving all row ID results for that combination.
In your specific use case, I'm assuming you want to optimize search performance. I would recommend creating an index on just TID and CID (not value). The reason is that you currently have only 2 combinations of these values across ~20k records, so with a 2-column index the engine can almost immediately cut out half of the records when searching on all three values. (This all assumes the index will be applied to a table with a much larger dataset.) Since your metrics are based on a smaller dataset, you may not be seeing the order-of-magnitude performance differences between using the index and not.
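One likely reason the update's EXPLAIN looks different: the second query quotes the value literal. MySQL cannot use an index on a string column that is compared to a bare number (the column would have to be cast first), which is consistent with key_len growing from 10 (TID + CID only) to 88 (all three columns). A minimal sketch of the recommendation above, with an illustrative index name; keep string literals quoted either way:

-- Illustrative index name; indexes only the two low-cardinality columns.
ALTER TABLE inventory ADD INDEX idx_tid_cid (TID, CID);

-- Quoting the literal lets the optimizer compare strings to strings:
SELECT ID FROM inventory
WHERE TID = 4 AND CID = 28 AND value = '3290843588097';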

Query execution takes too long

I'm looking for suggestions or optimizations.
Table definition:
CREATE TABLE IF NOT EXISTS MilestonesAndFlags (
  id SERIAL,
  site_id BIGINT,
  milestone BIGINT,
  value BIGINT,
  TIMESTAMP BIGINT,
  timestamp_confirmation BIGINT,
  COMMENT TEXT,
  INDEX (site_id),
  INDEX (milestone),
  INDEX (milestone, site_id)
);
In this table I store different milestones with a timestamp (to be able to keep a historical view of any changes) for different sites. The table has about a million rows at the moment.
The problem occurs when I try to get the latest actual milestone value per site using queries like:
SELECT site_id, value
FROM MilestonesAndFlags
WHERE id IN
  (SELECT MAX(id)
   FROM MilestonesAndFlags
   WHERE milestone = 1
   GROUP BY milestone, site_id);
This query takes more than 5 minutes on my PC.
EXPLAIN seems to be OK:
+----+--------------------+--------------------+------+-----------------------+-------------+---------+-------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+------+-----------------------+-------------+---------+-------+--------+--------------------------+
| 1 | PRIMARY | MilestonesAndFlags | ALL | NULL | NULL | NULL | NULL | 1111320| Using where |
| 2 | DEPENDENT SUBQUERY | MilestonesAndFlags | ref | milestone,milestone_2 | milestone_2 | 9 | const | 180660| Using where; Using index |
+----+--------------------+--------------------+------+-----------------------+-------------+---------+-------+--------+--------------------------+
Any suggestion about more correct query or table structure?
MySQL >= 5.5
I'll take a shot and propose that you use a derived (aliased) table instead of the WHERE clause with a dependent subquery. I'm not sure whether MySQL optimizes this or runs the subquery for every row of the outer query.
It would be very interesting if you ran the queries on large data sizes and came back with your results.
Example:
SELECT *
FROM MilestonesAndFlags AS MF,
     (SELECT MAX(id) AS id
      FROM MilestonesAndFlags
      WHERE milestone = 1
      GROUP BY milestone, site_id) AS MaxMF
WHERE MaxMF.id = MF.id;
SQLFiddle: http://sqlfiddle.com/#!2/a0d628/10
Pros and cons:
Pro:
Avoids the dependent subquery.
Con:
The join performs projection and selection: all rows of the derived table are "multiplied" with the rows of the original table, and only then does the WHERE condition filter.
Update
I also suspect that the MySQL version plays a major role in the optimizations performed.
Below are the EXPLAIN results for two different MySQL versions, where one treats the subquery as dependent and the other does not.
MySQL 5.5.32
ID | SELECT_TYPE        | TABLE              | TYPE | POSSIBLE_KEYS         | KEY         | KEY_LEN | REF    | ROWS | EXTRA
---|--------------------|--------------------|------|-----------------------|-------------|---------|--------|------|-----------------------------
1  | PRIMARY            | MilestonesAndFlags | ALL  | (null)                | (null)      | (null)  | (null) | 29   | Using where; Using filesort
2  | DEPENDENT SUBQUERY | MilestonesAndFlags | ref  | milestone,milestone_2 | milestone_2 | 9       | const  | 15   | Using where; Using index
http://sqlfiddle.com/#!2/a0d628/11
MySQL 5.6.6-m9
ID | SELECT_TYPE | TABLE              | TYPE | POSSIBLE_KEYS         | KEY         | KEY_LEN | REF    | ROWS | EXTRA
---|-------------|--------------------|------|-----------------------|-------------|---------|--------|------|-----------------------------
1  | PRIMARY     | MilestonesAndFlags | ALL  | (null)                | (null)      | (null)  | (null) | 29   | Using where; Using filesort
2  | SUBQUERY    | MilestonesAndFlags | ref  | milestone,milestone_2 | milestone_2 | 9       | const  | 15   | Using where; Using index
http://sqlfiddle.com/#!9/a0d62/2
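The same derived-table rewrite can also be expressed as an explicit JOIN, which some find easier to read; a minimal sketch against the same table (the alias names are illustrative). Since milestone is fixed to 1 by the WHERE clause, grouping by site_id alone is equivalent to grouping by milestone, site_id:

-- Latest row id per site for milestone 1, then joined back
-- to fetch the full row.
SELECT mf.site_id, mf.value
FROM MilestonesAndFlags AS mf
JOIN (SELECT site_id, MAX(id) AS max_id
      FROM MilestonesAndFlags
      WHERE milestone = 1
      GROUP BY site_id) AS latest
  ON latest.max_id = mf.id;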

MySQL index doesn't work

I've got a weird problem with a MySQL index. I have a table views_video:
CREATE TABLE `views_video` (
  `video_id` smallint(5) unsigned NOT NULL,
  `record_date` date NOT NULL,
  `region` char(2) NOT NULL DEFAULT '',
  `views` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`video_id`,`record_date`,`region`),
  KEY `video_id` (`video_id`)
)
The table contains 3.4 million records.
I ran EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 156
I got:
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| 1 | SIMPLE | views_video | range | PRIMARY,video_id | video_id | 2 | NULL | 587984 | Using where |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
But when I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 157
I got:
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | views_video | ALL | PRIMARY,video_id | NULL | NULL | NULL | 3412892 | Using where |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
video_id is from 1 to 1034. There is nothing special between 156 and 157.
What happens here?
Update
I have added more data to the database. Now video_id runs from 1 to 1064, and the table has 3.8M records. The threshold has moved to between 114 and 115.
I'm guessing that with 3.4 million records, and only 1064 possible entries for your key, your selectivity is very low. (In other words, there are many duplicates, which makes it far less useful as a key.) The optimizer is taking its best guess if it is more efficient to use the key or not. You've found a threshold for that decision.
It might be the key population
Run these
SELECT (COUNT(1)/20) INTO @FivePctOfData FROM views_video;

SELECT COUNT(1) AS videoidcount, video_id
FROM views_video
WHERE video_id <= 157
GROUP BY video_id;
The query optimizer probably took a vacation when one of the key values hit the 5% threshold.
You said there are 3.4 million rows. 5% would be 170,000. Perhaps this number was exceeded at some point in the query optimizer's life cycle on your query.
If you've added/deleted substantial data since creating the table, it's worthwhile to try ANALYZE TABLE on it. It frequently solves a lot of phantom indexing issues, and it's very fast even on large tables.
Update: Also, the number of unique index values is very low compared to the number of rows in the table. MySQL won't use an index when a single index value points to too many rows. Try constraining the query further with another column that's part of the primary key.
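A minimal sketch of both suggestions above, using only standard MySQL statements; FORCE INDEX references the key named video_id from the table definition:

-- Refresh index statistics after the bulk data load:
ANALYZE TABLE views_video;

-- If the optimizer still prefers a full scan past the threshold,
-- force the index where you know it helps:
SELECT video_id, views
FROM views_video FORCE INDEX (video_id)
WHERE video_id <= 157;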