I have some tables I want to join, but it cannot take dozens of seconds.
I want to go from this query that takes ~1s
SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
| 1229380 |
+----------+
1 row in set
Time: 1.173s
to this joined query that is taking ~50s
SELECT COUNT(*) FROM business b
INNER JOIN business_group bg ON b.id=bg.business_id
WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
| 1229380 |
+----------+
1 row in set
Time: 51.346s
Why does it take that long if the only thing it does differently is to join on the primary key of the business table (business.id)?
Besides this primary key index, I also have this one (group_id, business_id) on business_group (with (business_id, group_id) it took even longer).
Following is the execution plan:
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| 1 | SIMPLE | bg | <null> | ref | FKo2q0jurx07ein31bgmfvuk8gf,idx_bg_group_id_business_id | idx_bg_group_id_business_id | 9 | const | 2654528 | 100.0 | Using index |
| 1 | SIMPLE | b | <null> | eq_ref | PRIMARY | PRIMARY | 4 | database.bg.group_id | 1 | 100.0 | Using where; Using index |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
Is it possible to optimize the second query so it takes less time?
business table is ~45M rows while business_group is ~60M rows.
I'm writing this as someone who does a lot of indexing setups on SQL Server rather than MySQL. It is too long as a comment, and is based on what I believe are fundamentals, so hopefully it will help.
Why?
Firstly - why does it take so long for the second query to run? The answer is that it needs to do a lot more work in the second one
To demonstrate, imagine the only non-clustered index you have is on business_group for group_id.
You run the first query SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040.
All the engine needs to do is to seek to the appropriate spot in the index (where group_id = 1040), then read/count rows from the index (which is just a series of ints) until it changes - because the non-clustered index is sorted by that column.
Note if you had a second field in the non-clustered index (e.g., group_id, business_id), it would be almost as fast because it's still sorted on group_id first. However, it will be a slightly larger read as each row is 2x the size of the single-column version (but would still be very small).
Imagine you then run a slightly different query, counting business_id instead of * e.g., SELECT COUNT(business_id) FROM business_group bg WHERE bg.group_id=1040.
Assuming business_id is not the PK (and is not in the non-clustered index), then for every row it finds in the index, it then needs to go back and read the business_id from the table check it's not null (either in some sort of loop/reference, or read the whole table - I'm not 100% on how MySQL deals with these). However, it is a lot more work than above.
If business_id was in the index (as above, for group_id, business_id), then it could read that data straight from the index and not need to refer back to the original table - which is good.
Now add the join (your second query) SELECT COUNT(*) FROM business b INNER JOIN business_group bg ON b.id=bg.business_id WHERE bg.group_id=1040. The engine needs to
Get each business_id as above
Potentially sort the business IDs to help with the join
Join it to the business table (to ensure it has a valid row in the business table)
... and to do so, it may need to read all the row's data in the business table
Suggestions #1 - Avoid going to the business table
If you set up foreign keys to ensure that business_id in business_group is valid - then do you need to run the version with the join? Just run the first version.
Suggestion #2 - Indexes
If this was SQL Server and you needed that second query to run as fast as possible, I would set up two non-clustered indexes
NONCLUSTERED INDEX ... ON business_group (group_id, business_id)
NONCLUSTERED INDEX ... ON business (id)
The first means the engine can seek directly to the specific group_id, and then get a sorted list of business_id.
The second provides a sorted list of id (business_id) from the business table. As it has the same sort as the the results from the first index, it means the join is a lot less work.
However, the second one is controversial - many people would say 'no' to this as it overlaps your PK (or, more specifically, clustered index). It would also be sorted the same way. However, at least in SQL Server, this would include all the other data about the businesses e.g., the business name, address, etc. So to read the list of IDs from business, you'd also need to read the rest of the data - taking a lot more time.
However, if you put a non-clustered index just on ID, it will be a very narrow index (just the IDs) and therefore the amount of data to be read would be much less - and therefore often a lot faster.
Note though, that this is not as fast as if you could avoid doing the join altogether (e.g., Suggestion #1 above).
Related
TL;DR:
I have a query on 2 huge tables. They are no indexes. It is slow. Therefore, I build indexes. It is slower. Why does this makes sense? What is the correct way to optimize it?
The background:
I have 2 tables
person, a table containing informations about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.
They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.
Those 2 tables are huge, and don't contain any indexes except a PRIMARY on their respective id.
I'm trying to get the age of the youngest person in each department, and here is the query I've came up with
SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
FROM person
INNER JOIN works_in
ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department
The query works, but I'm dissatisfied with performances, as it takes ~17s to run. This is expected, as the data is huge and needs to be written to disk, and they are no indexes on the tables.
EXPLAIN for this query gives
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------|
| 1 | SIMPLE | works_in| ALL | NULL | NULL | NULL | NULL | 22496409 | Using temporary; Using filesort |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | dbtest.works_in.person_id| 1 | Using where |
I built a bunch of indexes for the 2 tables,
/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);
EXPLAIN shows an improvement, at least that's how I understand it, seeing that it now uses an index and scans less lines.
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------|
| 1 | SIMPLE | person | range | PRIMARY,birthdate | birthdate | 4 | NULL | 267818 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | works_in| ref | person,department_id_person,person_department_id | person_department_id | 4 | dbtest.person.id | 3 | Using index |
However, the execution time of the query has doubled (from ~17s to ~35s).
Why does this makes sense, and what is the correct way to optimize this?
EDIT
Using Gordon Linoff's answer (first one), the execution time is ~9s (half of the initial). Choosing good indexes seems to indeed help, but the execution time is still pretty high. Any other idea on how to improve on this?
More information concerning the dataset:
There are about 5'000'000 records in the person table.
Of which only 130'000 have a valid (not NULL) birthdate
I indeed have a department table, which contains about 3'000'000 records (they are actually projects and not department)
For this query:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query and save the extra cost of reading data pages.
By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
GROUP BY wi.department;
For this, the best indexes are person(id, birthdate) and works_in(person_id, department).
EDIT:
I cannot think of an easy way to solve the problem. One solution is more powerful hardware.
If you really need this information quickly, then additional work is needed.
One approach is to add a maximum birth date to the departments table, and add triggers. For works_in, you need triggers for update, insert, and delete. For persons, only update (presumably the insert and delete would be handled by works_in). This saves the final group by, which should be a big savings.
A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.
Indexing improves performance for MyISAM tables. It degrades performance on InnoDB tables.
Add indexes on columns that you expect to query the most. The more complex the data relationships grow, especially when those relationships are with / to itself (such as inner joins), the worse each query's performance gets.
With an index, the engine has to use the index to get matching values, which is fast. Then it has to use the matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows, it can be faster to just look up all the rows in the table.
When to add an index on a SQL table field (MySQL)?
When to use MyISAM and InnoDB?
https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam
I just stumbled upon a few lines of code in a system I just started working with that I don't really get. The system has a large table that saves lots of entities with unique IDs and removes them once they're not longer needed but it never reuses them. So the table looks like this
------------------------
| id |info1|info2|info3|
------------------------
| 1 | foo1| foo2| foo3|
------------------------
| 17 | bar1| bar2| bar3|
------------------------
| 26 | bam1| bam2| bam3|
------------------------
| 328| baz1| baz2| baz3|
------------------------
etc.
In one place in the codebase there is a while loop whose purpose it is to loop through all entities in the DB and do things to them and right now this is solved like this
int lastId = fetchMaxId()
int id = 0
while (id = fetchNextId()){
doStuffWith(id)
}
where fetchMaxId is straight forward
int fetchMaxId(){
return sqlQuery("SELECT MAX(id) FROM Table")
}
but fetchNextId confuses me. It is implemented as
int fetchNextId(currentId, maxId){
return sqlQuery("
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
")
}
This system has been in production for several years so it obviously works but when I tried searching for a solution to why this works I only found people saying the same thing that I already thought i knew. The order in which a MySQL DB returns the result is not easily determined and should not be relied upon so if you wan't a particular order use a ORDER BY clause. But are there some times you can safely ignore the ORDER BY? This code has worked for 12 years and continued to work through several DB updates. Are we just lucky or am I missing something here? Before I saw this code I would have said that if you called
fetchNextId(1, 328)
you could end up with either 17 or 26 as the answer.
Some clues to why this works may be that the id column is the primary key of the Table in question and it's set to auto increment but I can't find any documentation that would explain why
fetchNextId(1, 328)
should always returns 17 when called on the table-snippet given above.
The short answer is yes, the primary key has an order, all indexes have an order, and a primary key is simply a unique index.
As you have rightly said, you should not rely on data being returned in the order the data is stored in, the optimiser is free to return it in any order it likes, and this will be dependent on the query plan. I will however attempt to explain why your query has worked for 12 years.
Your clustered index is just your table data, and your clustering key defines the order that it is stored in. The data is stored on the leaf, and the clustering key helps the root (and intermediate notes) act as pointers to quickly get to the right leaf to retrieve the data. A nonclustered index is a very similar structure, but the lowest level simply contains a pointer to the correct position on the leaf of the clustered index.
In MySQL the primary key and the clustered index are synonymous, so the primary key is ordered, however they are fundamentally two different things. In other DBMS you can define both a primary key and a clustered index, when you do this your primary key becomes a unique nonclustered index with a pointer back to the clustered index.
In it's simplest terms you can imagine a table with an ID column that is the primary key, and another column (A), your B-Tree structure for your clustered index would be something like:
Root Node
+---+
| 1 |
+---+
Intermediate Nodes
+---+ +---+ +---+
| 1 | | 4 | | 7 |
+---+ +---+ +---+
Leaf
+-----------+ +-----------+ +-----------+
ID -> | 1 | 2 | 3 | | 4 | 5 | 6 | | 7 | 8 | 9 |
A -> | A | B | C | | D | E | F | | G | H | I |
+-----------+ +-----------+ +-----------+
In reality the leaf pages will be much bigger, but this is just a demo. Each page also has a pointer to the next page and the previous page for ease of traversing the tree. So when you do a query like:
SELECT ID, A
FROM T
WHERE ID > 5
LIMIT 1;
you are scanning a unique index so it is very likely this will be a sequential scan. Very likely is not guaranteed though.
MySQL will scan the Root node, if there is a potential match it will move on to the intermediate nodes, if the clause had been something like WHERE ID < 0 then MySQL would know that there were no results without going any further than the root node.
Once it moves on to the intermediate node it can identify that it needs to start on the second page (between 4 and 7) to start searching for an ID > 5. So it will sequentially scan the leaf starting on the second leaf page, having already identified the LIMIT 1 it will stop once it finds a match (in this case 6) and return this data from the leaf. In such a simple example this behaviour appears to be reliable and logical. I have tried to force exceptions by choosing an ID value I know is at the end of a leaf page to see if the leaf will be scanned in the reverse order, but as yet have been unable to produce this behaviour, this does not however mean it won't happen, or that future releases of MySQL won't do this in the scenarios I have tested.
In short, just add an order by, or use MIN(ID) and be done with it. I wouldn't lose too much sleep trying to delve into the inner workings of the query optimiser to see what kind of fragmentation, or data ranges would be required to observe different ordering of the clustered index within the query plan.
The answer to your question is yes. If you look at MySQL documentation you will see that whenever a table has a primary key it has an associated index.
When looking at the documentation for indexes you will see that they will mention primary keys as a type of index.
So in case of your particular scenario:
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
The query will stop executing as soon as it has found a value because of the LIMIT 1.
Without the LIMIT 1 it would have returned 17, 24 and 328.
However will all that said I don't think you will run into any order problems when the primary key is auto incrementing but whenever there is a scenario were the primary key is a unique employee no. instead of an auto incrementing field I would not trust the order of the result because the documentation also notes that MySQL reads sequentially, so the possibility is there that a primary key could fall out of the WHERE clause conditions and be skipped.
I'm sure I must be doing something stupid, but as is often the case I can't figure out what it is.
I'm trying to run this query:
SELECT `f`.`FrenchWord`, `f`.`Pronunciation`, `e`.`EnglishWord`
FROM (`FrenchWords` f)
INNER JOIN `FrenchEnglishMappings` m ON `m`.`FrenchForeignKey`=`f`.`id`
INNER JOIN `EnglishWords` e ON `e`.`id`=`m`.`EnglishForeignKey`
WHERE `f`.`Pronunciation` = '[whatever]';
When I run it, what happens seems quite weird to me. I get the results of the query fine, 2 rows in about 0.002 seconds.
However, I also get a huge spike in CPU and SHOW PROCESSLIST shows two identical processes for that query with state 'Copying to tmp table on disk'. These seem to keep running endlessly until I kill them or the system freezes.
None of the tables involved is big - between 100k and 600k rows each. tmp_table_size and max_heap_table_size are both 16777216.
Edit: EXPLAIN on the statement gives:
+edit reduced keylen of Pronunciation to 112
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| 1 | SIMPLE | f | ref | PRIMARY,Pronunciation | Pronunciation | 112 | const | 2 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m | ref | tmpindex,CombinedIndex,FrenchForeignKey,EnglishForeignKey | tmpindex | 4 | dict.f.id | 1 | Using index |
| 1 | SIMPLE | e | eq_ref | PRIMARY,id | PRIMARY | 4 | dict.m.EnglishForeignKey | 1 | |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
I'd be grateful if someone could point out what might be causing this. What I really don't understand is what MySQL is doing - surely if the query is complete then it doesn't need to do anything else?
UPDATE
Thanks for all the responses. I learnt something from all of them. This query was made massively faster after following the advice of nrathaus. I added a PronunciationHash binary(16) column to FrenchWords that contains unhex( md5 ( Pronunciation ) ). That is indexed with a keylen of 16 (vs 600+ for the varchar index on Pronunciation), and queries are much faster now.
As said by the EXPLAIN, you key size is HUGE : 602, this requires MySQL to write down the data.
You need to reduce (greatly) the keylen, I believe recommended is below 128.
I suggest you create a column called MD5_FrenchWord which will contain the MD5 value of FrenchWord. Then use this column for the GROUP BY. This assumes that you are looking for similarities, when you group by rather than the actual value
You are misusing GROUP BY. This clause is entirely pointless unless you also have a summary function such as MAX(something) or COUNT(*) in your SELECT clause.
Try removing GROUP BY and see if it helps.
It's not clear what you're trying to do with GROUP BY. But you might try SELECT DISTINCT if you're trying to dedup your result set.
Looking further at this question, it seems like you might benefit from a couple of compound indexes.
First, can you make sure your table declarations have NOT NULL in as many columns as possible?
Second, you're retrieving Pronunciation, FrenchWord, and id from your Frenchwords table, so try this compound index on that table. Your query will then be able to get what it needs directly from the index, saving a bunch of disk io. Notice that Pronunciation is mentioned first in the compound index declaration because that's the value you're searching for. This allows MySQL to do a lookup on the index, and get the other information it needs directly from the index, without thrashing back to the table itself.
(Pronunciation, FrenchWord, id)
You're retrieving Englishword from Englishwords looking it up by id. So, the same reasoning can apply to this compound index.
(id, Englishword)
Finally, I can't tell what your ORDER BY is for, once you use SELECT DISTINCT. You might try getting rid of it. But it probably makes no difference.
Give this a try. If your MySQL server is still thrashing after you make these changes, you have some kind of configuration problem.
First, here is the query I have:
SELECT
COUNT(*) as velocity_count,
SUM(`disbursements`.`amount`) as summation_amount
FROM `disbursements`
WHERE
`disbursements`.`accumulation_hash` = '40ad7f250cf23919bd8cc4619850a40444c5e90c978f88635a09ccf66a82ffb38e39ea51cdfd651b0ebdac5f5ca37cd7a17e0f60fea6cbce1397ccff5fa37346'
AND `disbursements`.`caller_id` = 1
AND `disbursements`.`active` = 1
AND (version_hash != '86b4111677294b27a1805643d193b8d437b6ddb170b4ed5dec39aa89bf070d160cbbcd697dfc1988efea8429b1f1557625bf956180c65d3dcd3a318280e0d2da')
AND (`disbursements`.`created_at` BETWEEN '2012-12-15 23:33:22'
AND '2013-01-14 23:33:22') LIMIT 1
Explain extended returns the following:
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| 1 | SIMPLE | disbursements | range | unique_request_index,index_disbursements_on_caller_id,disbursement_summation_index,disbursement_velocity_index,disbursement_version_out_index | disbursement_summation_index | 1543 | NULL | 191422 | 100.00 | Using where; Using index |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
The actual query counts about 95,000 rows. If I explain another query that hits ~50 rows the explain is identical, just with fewer rows estimated.
The index being chosen covers accumulation_hash, caller_id, active, version_hash, created_at, amount in that order.
I've tried playing around with doing COUNT(id) or COUNT(caller_id) since these are non-null fields and return the same thing as count(*), but it doesn't have any impact on the plan or the run time of the actual query.
This is also a heavy insert table, essentially every single query will have had a row inserted or updated since the last time it was run, so the mysql query cache isn't entirely useful.
Before I go and make some sort of bucketed time sequence cache with something like memcache or redis, is there an obvious solution to getting this to work much faster? A normal ~50 row query returns in 5MS, the ones across 90k+ rows are taking 500-900MS and I really can't afford anything much past 100MS.
I should point out the dates are a rolling 30 day window that needs to be essentially real time. Expiration could probably happen with ~1 minute granularity, but new items need to be seen immediately upon commit. I'm also on RDS, Read IOPS are essentially 0, and cpu is about 60-80%. When I'm not querying the giant 90,000+ record items, CPU typically stays below 10%.
You could try an index that has created_at before version_hash (might get a better shot at having an index range scan... not clear how that non-equality predicate on the version_hash affects the plan, but I suspect it disables a range scan on the created_at column.
Other than that, the query and the index look about as good as you are going to get, the EXPLAIN output shows the query being satisfied from the index.
And the performance of the statement doesn't sound too unreasonable, given that it's aggregating 95,000+ rows, especially given the key length of 1543 bytes. That's a much larger size than I normally deal with.
What are the datatypes of the columns in the index, and what is the cluster key or primary key?
accumulation_hash - 128-character representation of 512-bit value
caller_id - integer or numeric (?)
active - integer or numeric (?)
version_hash - another 128-characters
created_at - datetime (8bytes) or timestamp (4bytes)
amount - numeric or integer
95,000 rows at 1543 bytes each is on the order of 140MB of data.
I have a table from a legacy system which does not have a primary key. It records transactional data for issuing materials in a factory.
For simplicities sake, lets say each row contains job_number, part_number, quantity & date_issued.
I added an index to the date issued column. When I run an EXPLAIN SELECT * FROM issued_parts WHERE date_issued > '20100101', it shows this:
+----+-------------+----------------+------+-------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+-------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | issued_parts | ALL | date_issued_alloc | NULL | NULL | NULL | 9724620 | Using where |
+----+-------------+----------------+------+-------------------+------+---------+------+---------+-------------+
So it sees the key, but it doesn't use it?
Can someone explain why?
Something tells me the MySQL Query Optimizer decided correctly.
Here is how you can tell. Run these:
Count of Rows
SELECT COUNT(1) FROM issued_parts;
Count of Rows Matching Your Query
SELECT COUNT(1) FROM issued_parts WHERE date_issued > '20100101';
If the number of rows you are actually retrieving exceeds 5% of the table's total number, the MySQL Query Optimizer decides it would be less effort to do a full table scan.
Now, if your query was more exact, for example, with this:
SELECT * FROM issued_parts WHERE date_issued = '20100101';
then, you will get a different EXPLAIN plan altogether.
possible_keys names keys with the relevant columns in, but that doesn't mean that each key in it is going to be useful for the query. In this case, none are.
There are multiple types of indexes (indices?). A hash index is a fast way to do a lookup on an item given a specific value. If you have a bunch of discreet values that you are querying against, (for example, a list of 10 dates) then you can calculate a hash for each of those values, and look them up in the index. Since you aren't doing a lookup on a specific value, but rather doing a comparison, a hash index won't help you.
On the other hand, a B-Tree index can help you because it gives an ordering to the elements it is indexing. For instance, see here: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html for mysql (search for B-Tree Index Characteristics) . You may want to check that your table is using a b-tree index for it's index column.