I have very big database of customers. This query was ok before I added ORDER BY. How can I optimize my query speed?
$sql = "SELECT * FROM customers
LEFT JOIN ids ON customer_ids.customer_id = customers.customer_id AND ids.type = '10'
ORDER BY customers.name LIMIT 10";
ids.type and customers.name are my indexes
Explain query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customers ALL NULL NULL NULL NULL 955 Using temporary; Using filesort
1 SIMPLE ids ALL type NULL NULL NULL 3551 Using where; Using join buffer (Block Nested Loop)
(I assume you meant to type ids.customer_id = customer.customer_id and not customer_ids.customer_id)
Without the ORDER BY mysql grabbed the first 10 ids of type 10 (indexed), looked up the customer for them, and was done. (Note that the LEFT JOIN here is really an INNER JOIN because the join conditions will only hold for rows that have a match in both tables)
With the ORDER BY mysql is probably retrieving all type=10 customers then sorting them by first name to find the first 10.
You could speed this up by either denormalizing the customers table (copy the type into the customer record) or creating a mapping table to hold the customer_id, name, type tuples. In either case, add an index on (type, name). If using the mapping table, use it to do a 3-way join with customers and ids.
If type=10 is reasonably common, you could also force the query to walk the customers table by name and check the type for each with STRAIGHT JOIN. It won't be as fast as a compound index, but it will be faster than pulling up all matches.
And as suggested above, run an EXPLAIN on your query to see the query plan that mysql is using.
LEFT is the problem. By saying LEFT JOIN, you are implying that some customers may not have a corresponding row(s) in ids. And you are willing to accept NULLs for the fields in place of such an ids row.
If that is not the case, then remove LEFT. Then make sure you have an index on ids that starts with type. Also, customers must have an index (probably the PRIMARY KEY) starting with customer_id. With those, the optimizer can start with ids, filter on type earlier, thereby have less work to do.
But, still, it must collect lots of rows before doing the sort (ORDER BY); only then can it deliver the 10 (LIMIT).
While you are at it, add INDEX(customer_id) to ids -- that is what is killing performance for the LEFT version.
Related
I have an example query such as:
SELECT
rest.name, rest.shortname
FROM
restaurant AS rest
INNER JOIN specials ON rest.id=specials.restaurantid
WHERE
specials.dateend >= CURDATE()
AND
rest.state='VIC'
AND
rest.status = 1
AND
specials.status = 1
ORDER BY
rest.name ASC;
Just wondering of the below two indexes, which would be best on the restaurant table?
id,state,status,name
state,status,name
Just not sure if column used in the join should be included?
Funny enough though, I have created both types for testing and both times MySQL chooses the primary index, which is just id. Why is that?
Explain Output:
1,'SIMPLE','specials','index','NewIndex1\,NewIndex2\,NewIndex3\,NewIndex4','NewIndex4','11',\N,82,'Using where; Using index; Using temporary; Using filesort',
1,'SIMPLE','rest','eq_ref','PRIMARY\,search\,status\,state\,NewIndex1\,NewIndex2\,id-suburb\,NewIndex3\,id-status-name','PRIMARY','4','db_name.specials.restaurantid',1,'Using where'
Not many rows at the moment so perhaps that's why it's choosing PRIMARY!?
For optimum performance, you need at least 2 indexes:
The most important index is the one on the foreign key:
CREATE INDEX specials_rest_fk ON specials(restaurantid);
Without this, your queries will perform poorly, because every row in rest that matches the WHERE conditions will require a full tablescan of specials.
The next index to define would be the one that helps look up the fewest rows of rest given your conditions. Only one index is ever used, so you want to make that index find as few rows from rest as possible.
My guess, state and status:
CREATE INDEX rest_index_1 on rest(state, status);
Your index suggestion of (id, ...) is pointless, because id is unique - adding more column won't help, and in fact would worsen performance if it were used, because the index entries would be larger and you'd get less entries per I/O page read.
But you can gain performance by writing the query better too; if you move the conditions on specials into the join ON condition, you'll gain significant performance, because join conditions are evaluated as the join is made, but where conditions are evaluated on all joined rows, meaning the temporary result set that is filtered by the WHERE clause is much larger and therefore slower.
Change your query to this:
SELECT rest.name, rest.shortname
FROM restaurant AS rest
INNER JOIN specials
ON rest.id=specials.restaurantid
AND specials.dateend >= CURDATE()
AND specials.status = 1
WHERE rest.state='VIC'
AND rest.status = 1
ORDER BY rest.name;
Note how the conditions on specials are now in the ON clause.
I'm connecting to a database which I cannot administer and I wrote a query which did a left join between two tables - one small and one several orders of magnitude larger. At some point, the database returned this error:
Incorrect key file for table '/tmp/#sql_some_table.MYI'; try to repair it
I contacted the administrators and I was told that I'm getting this error because I'm doing the left join incorrectly, that I should NEVER left join a small table to a large table and that I should reverse the join order. The reason they gave was that when done my way, MySQL will try to create a temp table which is too large and query will fail. Their solution fails elsewhere, but that's not important here.
I found their explanation odd, so I ran explain on my query:
id = '1'
select_type = 'SIMPLE'
table = 'small_table'
type = 'ALL'
possible_keys = NULL
key = NULL
key_len = NULL
ref = NULL
rows = '23'
Extra = 'Using temporary; Using filesort'
id = '1'
select_type = 'SIMPLE'
table = 'large_table'
type = 'ref'
possible_keys = 'ID,More'
key = 'ID'
key_len = '4'
ref = 'their_db.small_table.ID'
rows = '41983'
Extra = NULL
(The 41983 rows in the second table are not very interesting to me, I just needed the latest record, which is why my query has order by large_table.ValueDateTime desc limit 1 at the end.)
I was careful enough to do a select by columns which the admins themselves told me, should hold unique values (and thus I assumed indexed), but it seems they haven't indexed those columns.
My question is - is doing the join the way I did ('small_table LEFT JOIN large_table') bad practice in general, or can such queries be made to execute successfully with proper indexing?
Edit:
Here's what the query looks like (this is not the actual query, but similar):
select large_table.ValueDateTime as LastDate,
small_table.DeviceIMEI as IMEI,
small_table.Other_Columns as My_Names,
large_table.Pwr as Voltage,
large_table.Temp as Temperature
from small_table left join large_table on small_table.ID = large_table.ID
where DeviceIMEI = 500
order by ValueDateTime desc
limit 1;
Basically what I'm doing is trying to get the most current data for a device, given that Voltage and Temperature change over time. The DeviceIMEI, ID and ValueDateTime should be unique, but aren't indexed (like I said earlier, I don't administer the database, I only have read permissions).
Edit 2:
Please focus on answering my actual question, not on attempting to rewrite my original query.
The left join thing is a red herring.
It is however an actual problem of running out of space for temp tables. But the order of your join makes no difference. The only thing that matters is how many rows MySQL must work with.
Which brings me to the LIMIT command:
Here is the problem:
It order to get the single row you asked for, MySQL has to sort the ENTIRE record set, then grab the top one. In order to sort it, it must store it - either in memory or on disk. And that's where you are running out of space. Every single column you requested get stored on disk, for the entire table, then sorted.
This is slow, very slow, and uses a lot of disk space.
Solutions:
You want MySQL to be able to use an index for the sorting. But in your query it can't. It's using an index for the join reference, and MySQL can only use one index per query.
Do you even have an index on the sort column? Try that first.
Another option is to do a separate query, where you select just the ID of the large table, LIMIT 1. Then the temporary table is MUCH smaller since all it has are IDs without all the other columns.
Once you know the ID then retrieve all the columns you need directly from the tables. You can do this in one shot with a subquery. If you post your query I could rewrite it to show you, but it's basically ID = (SELECT ID FROM ..... LIMIT 1)
I am trying to optimize a mysql select query but I just can't work out how to do it.
I have been reading around here and any other relevant results on google but can't find anybody with quite my problem, hopefully this isn't because I have constructed really bad DB Tables
I have three tables, Table A, Table B, Table C
Table A has a one to one relationship with table B
and
Table A has a one to many relationship with table C
because I need all the information from table c which relates to each row in table A I am using a GROUP BY to condense the results down to the number of relevant results from Table A.
Unfortunately When adding the GROUP BY it causes mysql to create a tempory table which slows everything right down.
I have read about in which instances mysql will use a tempory table and it seems I should be able to make it work but just can't work out how to do it, I am hoping that I can restucture my query with the same results but without the tmp table which is making it go slow.
I also have an order in my query. If I remove the GROUP BY I get all the rows in table A as well as the accumalative results of table C BUT it does all this in a fraction of a second.
With the GROUP BY and with or without the ORDER BY, because mysql then uses a tmp table it takes anything from 2 to 15 seconds depending on any other criteria needed.
this is basically my query (although details changed for security reasons)
SELECT
table_a.*, table_b.*, table_c.*
FROM `table_a`
LEFT JOIN table_b ON table_b.ID = table_a.table_b_id
LEFT JOIN table_c ON table_a.ID = table_c.table_a_id AND table_c.deleted !='yes'
WHERE table_a.deleted !='yes' AND table_b.deleted !='yes'
GROUP BY table_a.ID
ORDER BY table_a.date_added DESC
I have around 15k results in table a and 100k results in table c
I really need all the information from table c which relates to the rows in table a so my thoughts are if
a. I can remove the group by and accomplish the same end result then it would have removed the need for a tmp table and speeded it all up
b. Alter how I am joining the tables together so that the sorting of the results for the group by is more efficient and then hopefully remove the tmp table as well
Can I make this query more efficient or am I just stuck because of how my tables are setup?
Thanks for the reply.
Sorry I should have mentioned I have tried group by and order being the same value of table_a.ID but from mysql EXPLAIN this still seems to be using a tmp table.
From testing it seems removing order by and just having group by still doesn't work although order by on its own is fine, it just brings back ALL the results.
In doing the EXPLAIN though, I have noticed that possible_keys and key are both null for table_a even though table_b and table_c are both reporting using a column for possible_keys and key.
Should i be getting an indication from EXPLAIN that table_a has a key?
Matt
Sorry its from phpmyadmin, so not the best layout.
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table_a ALL NULL NULL NULL NULL 13297 Using where; Using temporary; Using filesort
1 SIMPLE table_b eq_ref PRIMARY,ID, PRIMARY 3 db.table_a.table_b_id 1 Using where
1 SIMPLE table_c ref table_a_id table_a_id 3 db.table_a.ID 9
When you have GROUP BY and ORDER BY with different fields, you will pretty much always end up with a temporary table.
As a side note, You select may be producing unexpected results due to the inclusion of all fields but only grouping by table_a.ID. Records in Table C may be getting mangled.
I am running the be query
SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages ,provider, packagestosubcat,
packagestocity, subcat, usertosubcat,
usertocity, usertoprovider
WHERE packages.endDate >'2011-03-11 06:00:00' AND
usertosubcat.userid = 1 AND
usertocity.userid = 1 AND
packages.providerid = provider.id AND
packages.id = packagestosubcat.packageid AND
packages.id = packagestocity.packageid AND
packagestosubcat.subcatid = subcat.id AND
usertosubcat.subcatid = packagestosubcat.subcatid AND
usertocity.cityid = packagestocity.cityid AND
(
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
)
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
When i run explain, everything seems to look ok except for the scan on the usertoprovider table, which doesn't seem to be using table's keys:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE usertocity ref user,city user 4 const 4 Using temporary; Using filesort
1 SIMPLE packagestocity ref city,packageid city 4 usertocity.cityid 419
1 SIMPLE packages eq_ref PRIMARY,enddate PRIMARY 4 packagestocity.packageid 1 Using where
1 SIMPLE provider eq_ref PRIMARY,providertype PRIMARY 4 packages.providerid 1 Using where
1 SIMPLE packagestosubcat ref subcatid,packageid packageid 4 packages.id 1 Using where
1 SIMPLE subcat eq_ref PRIMARY PRIMARY 4 packagestosubcat.subcatid 1
1 SIMPLE usertosubcat ref userid,subcatid subcatid 4 const 12 Using where
1 SIMPLE usertoprovider ALL userid,providerid NULL NULL NULL 3735 Using where
As you can see in the above query, the condition itself is:
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
Both tables, provider and usertoprovider, are indexed. provider has indexes on providerid and providertype while usertoprovider has indexes on userid and providerid
The cardinality of the keys is:
provider.id=47, provider.type=1, usertoprovider.userid=1245, usertoprovider.providerid=6
So its quite obvious that the indexes are not used.
Further more, to test it out, i went ahead and:
Duplicated the usertoprovider table
Inserted all the provider values that have providertype='reg' into the cloned table
Simplified the condition to (usertoprovider.userid = 1 AND usertoprovider.providerid = provider.ID)
The query execution time changed from 8.1317 sec. to 0.0387 sec.
Still, provider values that have providertype='reg' are valid for all the users and i would like to avoid inserting these values into the usertoprovider table for all the users since this data is redundant.
Can someone please explain why MySQL still runs a full scan and doesn't use the keys? What can be done to avoid it?
It seems that provider.providertype != 'reg' is redundant (always true) unless provider.providertype is nullable and you want the query to fail on NULL.
And shouldn't != be <> instead to be standard SQL, although MySQL may allow !=?
On cost of table scans
It is not necessarily that a full table scan is more expensive than walking an index, because walking an index still requires multiple page accesses. In many database engines, if your table is small enough to fit inside a few pages, and the number of rows are small enough, it will be cheaper to do a table scan. Database engines make this type of decision based on the data and index statistics of the table.
This case
However, in your case, it might also be because of the other leg in your OR clause: provider.providertype = 'reg'. If providertype is "reg", then this query joins in ALL the rows of usertoprovider (most likely not what you want) since it is a multi-table cross join.
The database engine is correct in determining that you'll likely need all the table rows in usertoprovider anyway (unless none of the providertype's is "reg", but the engine also may know!).
The query hides this fact because you are grouping on the (MASSIVE!) result set later on and just returning the package ID, so you won't see how many usertoprovider rows have been returned. But it will run very slowly. Get rid of the GROUP BY clause to find out how many rows you are actually forcing the database engine to work on!!!
The reason you see a massive speed improvement if you fill out the usertoprovider table is because then every row participates in a join, and there is no full cross join happening in the case of "reg". Before, if you have 1,000 rows in usertoprovider, every row with type="reg" expands the result set 1,000 times. Now, that row joins with only one row in usertoprovider, and the result set is not expanded.
If you really want to pass anything with providertype='reg', but not in your many-to-many mapping table, then the easiest way may be to use a sub-query:
Remove usertoprovider from your FROM clause
Do the following:
provider.providertype='reg' OR EXISTS (SELECT * FROM usertoprovider WHERE userid=1 AND providerid = provider.ID)
Another method is to use an OUTER JOIN on the usertoprovider -- any row with "reg" which is not in the table will come back with one row of NULL instead of expanding the result set.
Hmm, I know that MySQL does funny things with grouping. In any other RDBMS, your query won't even be executed. What does that even mean,
SELECT packages.id
[...]
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
You want to group by title. Then in standard SQL syntax, this means you can only select title and aggregate functions of the other columns. MySQL magically tries to execute (and probably guess) what you may have meant to execute. So what would you expect to be selected as packages.id ? The First matching package ID for every title? Or the last? And what would the ORDER BY clause mean with respect to the grouping? How can you order by columns that are not part of the result set (because only packages.title really is)?
There are two solutions, as far as I can see:
You're on the right track with your query, then remove the ORDER BY clause, because I don't think it will affect your result, but it may severely slow down your query.
You have a SQL problem, not a performance problem
I was using a query that looked similar to this one:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`, `views_sum`
WHERE `views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`
GROUP BY `episodes`.`id`
... which takes ~0.1s to execute. But it's problematic, because some episodes don't have a corresponding views_sum row, so those episodes aren't included in the result.
What I want is NULL values when a corresponding views_sum row doesn't exist, so I tried using a LEFT JOIN instead:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`
LEFT JOIN `views_sum` ON (`views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`)
GROUP BY `episodes`.`id`
This query produces the same columns, and it also includes the few rows missing from the 1st query.
BUT, the 2nd query takes 10 times as long! A full second.
Why is there such a huge discrepancy between the execution times when the result is so similar? There's nowhere near 10 times as many rows — it's like 60 from the 1st query, and 70 from the 2nd. That's not to mention that the 10 additional rows have no views to sum!
Any light shed would be greatly appreciated!
(There are indexes on episodes.id, views_sum.index, and views_sum.key.)
EDIT:
I copy-pasted the SQL from above, and here are the EXPLAINs, in order:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE views_sum ref index,key index 27 const 6532 Using where; Using temporary; Using filesort
1 SIMPLE episodes eq_ref PRIMARY PRIMARY 4 db102914_itw.views_sum.key 1 Using where
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE episodes ALL NULL NULL NULL NULL 70 Using temporary; Using filesort
1 SIMPLE views_sum ref index,key index 27 const 6532
Here's the query I ultimately came up with, after many, many iterations. (The SQL_NO_CACHE flag is there so I can test execution times.)
SELECT SQL_NO_CACHE e.*, IFNULL(SUM(vs.`clicks`), 0) as `clicks`
FROM `episodes` e
LEFT JOIN
(SELECT * FROM `views_sum` WHERE `index` = "episode") vs
ON vs.`key` = e.`id`
GROUP BY e.`id`
Because the ON condtion views_sum.index = "episode" is static, i.e., isn't dependent on the row it's joined to, I was able to get a massive performance boost by first using a subquery to limit the views_sum table before joining.
My query now takes ~0.2s. And what's even better, the time doesn't grow as you increase the offset of the query (unlike my first LEFT JOIN attempt). It stays the same, even if you do a sort on the clicks column.
You should have a combined index on views_sum.index and views_sum.key. I suspect you will always use both fields together if i look at the names. Also, I would rewrite the first query to use a proper INNER JOIN clause instead of a filtered cartesian product.
I suspect the performance of both queries will be much closer together if you do this. And, more importantly, much faster than they are now.
edit: Thinking about it, I would probably add a third column to that index: views_sum.clicks, which probably can be used for the SUM. But remember that multi-column indexes can only be used left to right.
It's all about the indexes. You'll have to play around with it a bit or post your database schema on here. Just as a rough guess i'd say you should make sure you have an index on views_sum.key.
Normally, a LEFT JOIN will be slower than an INNER JOIN or a CROSS JOIN because it has to view the first table differently. Put another way, the difference in time isn't related to the size of the result, but the full size of the left table.
I also wonder if you're asking MySQL to figure things out for you that you should be doing yourself. Specifically, that SUM() function would normally require a GROUP BY clause.