I am running the be query
SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages ,provider, packagestosubcat,
packagestocity, subcat, usertosubcat,
usertocity, usertoprovider
WHERE packages.endDate >'2011-03-11 06:00:00' AND
usertosubcat.userid = 1 AND
usertocity.userid = 1 AND
packages.providerid = provider.id AND
packages.id = packagestosubcat.packageid AND
packages.id = packagestocity.packageid AND
packagestosubcat.subcatid = subcat.id AND
usertosubcat.subcatid = packagestosubcat.subcatid AND
usertocity.cityid = packagestocity.cityid AND
(
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
)
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
When i run explain, everything seems to look ok except for the scan on the usertoprovider table, which doesn't seem to be using table's keys:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE usertocity ref user,city user 4 const 4 Using temporary; Using filesort
1 SIMPLE packagestocity ref city,packageid city 4 usertocity.cityid 419
1 SIMPLE packages eq_ref PRIMARY,enddate PRIMARY 4 packagestocity.packageid 1 Using where
1 SIMPLE provider eq_ref PRIMARY,providertype PRIMARY 4 packages.providerid 1 Using where
1 SIMPLE packagestosubcat ref subcatid,packageid packageid 4 packages.id 1 Using where
1 SIMPLE subcat eq_ref PRIMARY PRIMARY 4 packagestosubcat.subcatid 1
1 SIMPLE usertosubcat ref userid,subcatid subcatid 4 const 12 Using where
1 SIMPLE usertoprovider ALL userid,providerid NULL NULL NULL 3735 Using where
As you can see in the above query, the condition itself is:
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
Both tables, provider and usertoprovider, are indexed. provider has indexes on providerid and providertype while usertoprovider has indexes on userid and providerid
The cardinality of the keys is:
provider.id=47, provider.type=1, usertoprovider.userid=1245, usertoprovider.providerid=6
So its quite obvious that the indexes are not used.
Further more, to test it out, i went ahead and:
Duplicated the usertoprovider table
Inserted all the provider values that have providertype='reg' into the cloned table
Simplified the condition to (usertoprovider.userid = 1 AND usertoprovider.providerid = provider.ID)
The query execution time changed from 8.1317 sec. to 0.0387 sec.
Still, provider values that have providertype='reg' are valid for all the users and i would like to avoid inserting these values into the usertoprovider table for all the users since this data is redundant.
Can someone please explain why MySQL still runs a full scan and doesn't use the keys? What can be done to avoid it?
It seems that provider.providertype != 'reg' is redundant (always true) unless provider.providertype is nullable and you want the query to fail on NULL.
And shouldn't != be <> instead to be standard SQL, although MySQL may allow !=?
On cost of table scans
It is not necessarily that a full table scan is more expensive than walking an index, because walking an index still requires multiple page accesses. In many database engines, if your table is small enough to fit inside a few pages, and the number of rows are small enough, it will be cheaper to do a table scan. Database engines make this type of decision based on the data and index statistics of the table.
This case
However, in your case, it might also be because of the other leg in your OR clause: provider.providertype = 'reg'. If providertype is "reg", then this query joins in ALL the rows of usertoprovider (most likely not what you want) since it is a multi-table cross join.
The database engine is correct in determining that you'll likely need all the table rows in usertoprovider anyway (unless none of the providertype's is "reg", but the engine also may know!).
The query hides this fact because you are grouping on the (MASSIVE!) result set later on and just returning the package ID, so you won't see how many usertoprovider rows have been returned. But it will run very slowly. Get rid of the GROUP BY clause to find out how many rows you are actually forcing the database engine to work on!!!
The reason you see a massive speed improvement if you fill out the usertoprovider table is because then every row participates in a join, and there is no full cross join happening in the case of "reg". Before, if you have 1,000 rows in usertoprovider, every row with type="reg" expands the result set 1,000 times. Now, that row joins with only one row in usertoprovider, and the result set is not expanded.
If you really want to pass anything with providertype='reg', but not in your many-to-many mapping table, then the easiest way may be to use a sub-query:
Remove usertoprovider from your FROM clause
Do the following:
provider.providertype='reg' OR EXISTS (SELECT * FROM usertoprovider WHERE userid=1 AND providerid = provider.ID)
Another method is to use an OUTER JOIN on the usertoprovider -- any row with "reg" which is not in the table will come back with one row of NULL instead of expanding the result set.
Hmm, I know that MySQL does funny things with grouping. In any other RDBMS, your query won't even be executed. What does that even mean,
SELECT packages.id
[...]
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
You want to group by title. Then in standard SQL syntax, this means you can only select title and aggregate functions of the other columns. MySQL magically tries to execute (and probably guess) what you may have meant to execute. So what would you expect to be selected as packages.id ? The First matching package ID for every title? Or the last? And what would the ORDER BY clause mean with respect to the grouping? How can you order by columns that are not part of the result set (because only packages.title really is)?
There are two solutions, as far as I can see:
You're on the right track with your query, then remove the ORDER BY clause, because I don't think it will affect your result, but it may severely slow down your query.
You have a SQL problem, not a performance problem
Related
So this might be a bit silly, but the alternative I was using is worse. I am trying to write an excel sheet using data from my database and a PHP tool called Box/Spout. The thing is that Box/Spout reads rows one at a time, and they are not retrieved via index ( e.g. rows[10], rows[42], rows[156] )
I need to retrieve data from the database in the order the rows come out. I have a database with a list of customers, that came in via Import and I have to write them into the excel spreadsheet. They have phone numbers, emails, and an address. Sorry for the confusion... :/ So I compiled this fairly complex query:
SELECT
`Import`.`UniqueID`,
`Import`.`RowNum`,
`People`.`PeopleID`,
`People`.`First`,
`People`.`Last`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `PhonesTable`.`Phone`, `PhonesTable`.`Type`)
ORDER BY `PhonesTable`.`PhoneID` DESC
SEPARATOR ';'
) AS `Phones`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
`Properties`.`Address1`,
`Properties`.`city`,
`Properties`.`state`,
`Properties`.`PostalCode5`,
...(17 more `People` Columns)...,
FROM `T_Import` AS `Import`
LEFT JOIN `T_CustomerStorageJoin` AS `CustomerJoin`
ON `Import`.`UniqueID` = `CustomerJoin`.`ImportID`
LEFT JOIN `T_People` AS `People`
ON `CustomerJoin`.`PersID`=`People`.`PeopleID`
LEFT JOIN `T_JoinPeopleIDPhoneID` AS `PeIDPhID`
ON `People`.`PeopleID` = `PeIDPhID`.`PeopleID`
LEFT JOIN `T_Phone` AS `PhonesTable`
ON `PeIDPhID`.`PhoneID`=`PhonesTable`.`PhoneID`
LEFT JOIN `T_JoinPeopleIDEmailID` AS `PeIDEmID`
ON `People`.`PeopleID` = `PeIDEmID`.`PeopleID`
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
LEFT JOIN `T_JoinPeopleIDPropertyID` AS `PeIDPrID`
ON `People`.`PeopleID` = `PeIDPrID`.`PeopleID`
AND `PeIDPrID`.`PropertyCP`='CurrentImported'
LEFT JOIN `T_Property` AS `Properties`
ON `PeIDPrID`.`PropertyID`=`Properties`.`PropertyID`
WHERE `Import`.`CustomerCollectionID`=$ccID
AND `RowNum` >= $rnOffset
AND `RowNum` < $rnLimit
GROUP BY `RowNum`;
So I have indexes on every ON segment, and the WHERE segment. When RowNumber is like around 0->2500 in value, the query runs great and executes within a couple seconds. But it seems like the query execution time exponentially multiplies the larger RowNumber gets.
I have an EXPLAIN here: and at pastebin( https://pastebin.com/PksYB4n2 )
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE Import NULL ref CustomerCollectionID,RowNumIndex CustomerCollectionID 4 const 48108 8.74 Using index condition; Using where; Using filesort;
1 SIMPLE CustomerJoin NULL ref ImportID ImportID 4 MyDatabase.Import.UniqueID 1 100 NULL
1 SIMPLE People NULL eq_ref PRIMARY,PeopleID PRIMARY 4 MyDatabase.CustomerJoin.PersID 1 100 NULL
1 SIMPLE PeIDPhID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 8 100 NULL
1 SIMPLE PhonesTable NULL eq_ref PRIMARY,PhoneID,PhoneID_2 PRIMARY 4 MyDatabase.PeIDPhID.PhoneID 1 100 NULL
1 SIMPLE PeIDEmID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 5 100 NULL
1 SIMPLE EmailsTable NULL eq_ref PRIMARY,EmailID,DupeDeleteSelect PRIMARY 4 MyDatabase.PeIDEmID.EmailID 1 100 NULL
1 SIMPLE PeIDPrID NULL ref PeopleMSCP,PeopleID,PropertyCP PeopleMSCP 5 MyDatabase.People.PeopleID 4 100 Using where
1 SIMPLE Properties NULL eq_ref PRIMARY,PropertyID PRIMARY 4 MyDatabase.PeIDPrID.PropertyID 1 100 NULL
I apologize if the formatting is absolutely terrible. I'm not sure what good formatting looks like so I may have jumbled it a bit on accident, plus the tabs got screwed up.
What I want to know is how to speed up the query time. The databases are very large, like in the 10s of millions of rows. And they aren't always like this as our tables are constantly changing, however I would like to be able to handle it when they are.
I tried using LIMIT 2000, 1000 for example, but I know that it's less efficient than using an indexed column. So I switched over to RowNumber. I feel like this was a good decision, but it seems like MySQL is still looping every single row before the offset variable which kind of defeats the purpose of my index... I think? I'm not sure. I also basically split this particular query into about 10 singular queries, and ran them one by one, for each row of the excel file. It takes a LONG time... TOO LONG. This is fast, but, obviously I'm having a problem.
Any help would be greatly appreciated, and thank you ahead of time. I'm sorry again for my lack of post organization.
The order of the columns in an index matters. The order of the clauses in WHERE does not matter (usually).
INDEX(a), INDEX(b) is not the same as the "composite" INDEX(a,b). I deliberately made composite indexes where they seemed useful.
INDEX(a,b) and INDEX(b,a) are not interchangeable unless both a and b are tested with =. (Plus a few exceptions.)
A "covering" index is one where all the columns for the one table are found in the one index. This sometimes provides an extra performance boost. Some of my recommended indexes are "covering". It implies that only the index BTree need be accessed, not also the data BTree; this is where it picks up some speed.
In EXPLAIN SELECT ... a "covering" index is indicated by "Using index" (which is not the same as "Using index condition"). (Your Explain shows no covering indexes currently.)
An index 'should not' have more than 5 columns. (This is not a hard and fast rule.) T5's index had f5 columns to be covering; it was not practical to make a covering index for T2.
When JOINing, the order of the tables does not matter; the Optimizer is free to shuffle them around. However, these "rules" apply:
A LEFT JOIN may force ordering of the tables. (I think it does in this case.) (I ordered the columns based on what I think the Optimizer wants; there may be some flexibility.)
The WHERE clause usually determines which table to "start with". (You test on T1 only, so obviously it will start with T1.
The "next table" to be referenced (via NLJ - Nested Loop Join) is determined by a variety of things. (In your case it is pretty obvious -- namely the ON column(s).)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Revised Query
1. Import: (CustomerCollectionID, -- '=' comes first
RowNum, -- 'range'
UniqueID) -- 'covering'
Import shows up in WHERE, so is first in Explain; Also due to LEFTs
Properties: (PropertyID) -- is that the PK?
PeIDPrID: (PropertyCP, PeopleID, PropertyID)
3. People: (PeopleID)
I assume that is the `PRIMARY KEY`? (Too many for "covering")
(Since `People` leads to 3 other table; I won't number the rest.)
EmailsTable: (EmailID, Email)
PeIDEmID: (PeopleID, -- JOIN from People
EmailID) -- covering
PhonesTable: (PhoneID, Type, Phone)
PeIDPhID: (PeopleID, PhoneID)
2. CustomerJoin: (ImportID, -- coming from `Import` (see ON...)
PersID) -- covering
After adding those, I expect most lines of EXPLAIN to say Using index.
The lack of at least a composite index on Import is the main problem leading to your performance complaint.
Bad GROUP BY
When there is a GROUP BY that does not include all the non-aggregated columns that are not directly dependent on the group by column(s), you get random values for the extras. I see from the EXPLAIN ("Rows") that several tables probably have multiple rows. You really ought to think about the garbage being generated by this query.
Curiously, Phones and Emails are feed into GROUP_CONCAT(), thereby avoiding the above issue, but the "Rows" is only 1.
(Read about ONLY_FULL_GROUP_BY; it might explain the issue better.)
(I'm listing this as a separate Answer since it is orthogonal to my other Answer.)
I call this the "explode-implode" syndrome. The query does a JOIN, getting a bunch of rows, thereby generating several rows, and puts multiple rows into an intermediate table. Then the GROUP BY implodes back to down to the original set of rows.
Let me focus on a portion of the query that could be reformulated to provide a performance improvement:
SELECT ...
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
...
FROM ...
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
...
GROUP BY `RowNum`;
Instead, move the table and aggregation function into a subquery
SELECT ...
( SELECT GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `Email`)
ORDER BY `EmailID` DESC
SEPARATOR ';' )
FROM T_Email
WHERE `PeIDEmID`.`EmailID` = `EmailID`
) AS `Emails`,
...
FROM ...
-- and Remove: LEFT JOIN `T_Email` ON ...
...
-- and possibly Remove: GROUP BY ...;
Ditto for PhonesTable.
(It is unclear whether the GROUP BY can be removed; other things may need it.)
I have this following SQL query, which, when I originally coded it, was exceptionally fast, it now takes over 1 second to complete:
SELECT counted/scount as ratio, [etc]
FROM
playlists
LEFT JOIN (
select AID, PLID FROM (SELECT AID, PLID FROM p_s ORDER BY `order` asc, PLSID desc)as g GROUP BY PLID
) as t USING(PLID)
INNER JOIN (
SELECT PLID, count(PLID) as scount from p_s LEFT JOIN audio USING(AID) WHERE removed='0' and verified='1' GROUP BY PLID
) as g USING(PLID)
LEFT JOIN (
select AID, count(AID) as counted FROM a_p_all WHERE ".time()." - playtime < 2678400 GROUP BY AID
) as r USING(AID)
LEFT JOIN audio USING (AID)
LEFT JOIN members USING (UID)
WHERE scount > 4 ORDER BY ratio desc
LIMIT 0, 20
I have identified the problem, the a_p_all table has over 500k rows. This is slowing down the query. I have come up with a solution:
Create a smaller temporary table, that only stores the data necessary, and deletes anything older than is needed.
However, is there a better method to use? Optimally I wouldn't need a temporary table; what do sites such as YouTube/Facebook do for large tables to keep query times fast?
edit
This is the EXPLAIN table for the query in the answer from #spencer7593
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived3> ALL NULL NULL NULL NULL 20
1 PRIMARY u eq_ref PRIMARY PRIMARY 8 q.AID 1 Using index
1 PRIMARY m eq_ref PRIMARY PRIMARY 8 q.UID 1 Using index
3 DERIVED <derived6> ALL NULL NULL NULL NULL 20
6 DERIVED t ALL NULL NULL NULL NULL 21
5 DEPENDENT SUBQUERY s ALL NULL NULL NULL NULL 49 Using where; Using filesort
4 DEPENDENT SUBQUERY c ALL NULL NULL NULL NULL 49 Using where
4 DEPENDENT SUBQUERY o eq_ref PRIMARY PRIMARY 8 database.c.AID 1 Using where
2 DEPENDENT SUBQUERY a ALL NULL NULL NULL NULL 510594 Using where
Two "big rock" issues stand out to me.
Firstly, this predicate
WHERE ".time()." - playtime < 2678400
(I'm assuming that this isn't the actual SQL being submitted to the database, but that what's being sent to the database is something like this...
WHERE 1409192073 - playtime < 2678400
such that we want only rows where playtime is within the past 31 days (i.e. within 31*24*60*60 seconds of the integer value returned by time().
This predicate can't make use of a range scan operation on a suitable index on playtime. MySQL evaluates the expression on the left side for every row in the table (every row that isn't excluded by some other predicate), and the result of that expression is compared to the literal on the right.
To improve performance, rewrite the predicate that so that the comparison is made on the bare column. Compare the value stored in the playtime column to an expression that needs to be evaluated one time, for example:
WHERE playtime > 1409192073 - 2678400
With a suitable index available, MySQL can perform a "range" scan operation, and efficiently eliminate a boatload of rows that don't need to be evaluated.
The second "big rock" is the inline views, or "derived tables" in MySQL parlance. MySQL is much different than other databases in how inline views are processed. MySQL actually runs that innermost query, and stores the result set as a temporary MyISAM table, and then the outer query runs against the MyISAM table. (The name that MySQL uses, "derived table", makes sense when we understand how MySQL processes the inline view.) Also, MySQL does not "push" predicates down, from an outer query down into the view queries. And on the derived table, there are no indexes created. (I believe MySQL 5.7 is changing that, and does sometimes create indexes, to improve performance.) But large "derived tables" can have a significant performance impact.
Also, the LIMIT clause gets applied last in the statement processing; that's after all the rows in the resultset are prepared and sorted. Even if you are returning only 20 rows, MySQL still prepares the entire resultset; it just doesn't transfer them to the client.
Lots of the column references are not qualified with the table name or alias, so we don't know, for example, which table (p_s or audio) contains the removed and verified columns.
(We know it can't be both, if MySQL isn't throwing a "ambiguous column" error. But MySQL has access to the table definitions, where we don't. MySQL also knows something about the cardinality of the columns, in particular, which columns (or combination of columns) are UNIQUE, and which columns can contain NULL values, etc.
Best practice is to qualify ALL column references with the table name or (preferably) a table alias. (This makes it much easier on the human reading the SQL, and it also avoids a query from breaking when a new column is added to a table.)
Also, the query as a LIMIT clause, but there's no ORDER BY clause (or implied ORDER BY), which makes the resultset indeterminate. We don't have any guaranteed which will be the "first" rows returned.
EDIT
To return only 20 rows from playlists (out of thousands or more), I might try using correlated subqueries in the SELECT list; using a LIMIT clause in an inline view to winnow down the number of rows that I'd need to run the subqueries for. Correlated subqueries can eat your lunch (and your lunchbox too) in terms of performance with large sets, due to the number of times those need to be run.
From what I can gather, you are attempting to return 20 rows from playlists, picking up the related row from member (by the foreign key in playlists), finding the "first" song in the playlist; getting a count of times that "song" has been played in the past 31 days (from any playlist); getting the number of times a song appears on that playlist (as long as it's been verified and hasn't been removed... the outerness of that LEFT JOIN is negated by the predicates on the removed and verified columns, if either of those columns is from the audio table...).
I'd take a shot with something like this, to compare performance:
SELECT q.*
, ( SELECT COUNT(1)
FROM a_p_all a
WHERE a.playtime < 1409192073 - 2678400
AND a.AID = q.AID
) AS counted
FROM ( SELECT p.PLID
, p.UID
, p.[etc]
, ( SELECT COUNT(1)
FROM p_s c
JOIN audio o
ON o.AID = c.AID
AND o.removed='0'
AND o.verified='1'
WHERE c.PLID = p.PLID
) AS scount
, ( SELECT s.AID
FROM p_s s
WHERE s.PLID = p.PLID
ORDER BY s.order ASC, s.PLSID DESC
LIMIT 1
) AS AID
FROM ( SELECT t.PLID
, t.[etc]
FROM playlists t
ORDER BY NULL
LIMIT 20
) p
) q
LEFT JOIN audio u ON u.AID = q.AID
LEFT JOIN members m ON m.UID = q.UID
LIMIT 0, 20
UPDATE
Dude, the EXPLAIN output is showing that you don't have suitable indexes available. To get any decent chance at performance with the correlated subqueries, you're going to want to add some indexes, e.g.
... ON a_p_all (AID, playtime)
... ON p_s (PLID, order, PLSID, AID)
I'm not that well versed in indexing in MySQL, and am having a hard time trying to understand how the EXPLAIN output works, and how it can be read to know if my query is optimised or not.
I have a fairly large table (1.1M records) and I am executing the below query:
SELECT * FROM `Member` this_ WHERE (this_._Temporary_Flag = 0 or this_._Temporary_Flag
is null) and (this_._Deleted = 0 or this_._Deleted is null) and
(this_.Username = 'XXXXXXXX' or this_.Email = 'XXXXXXXX')
ORDER BY this_.Priority asc;
It takes a very long time to execute, between 30 - 60 seconds most of the times. The output of the EXPLAIN query is as below:
id select_type table type possible_keys key key_len ref rows Extra
----------------------------------------------------------------------------------------------------------------------------------
1 SIMPLE this_ ref_or_null _Temporary_Flag,_Deleted,username,email _Temporary_Flag 2 const 33735 Using where; Using filesort
What does this statement exactly mean? Does it mean that this query can be optimised? The table has mostly single-column indexes. What are the important output from the EXPLAIN query which I should use?
It is saying that the index it has chosen to use is the one called _Temporary_Flag (which I assume is on the _Temporary_Flag column). This is not a great index to use (it still leaves it looking at 33k records), but the best it can use in the situation. It might be worth adding an index covering both the _Temporary_Flag and the _Deleted columns.
However I doubt that narrows things down much.
One issue is that MySQL can only use a single index on a table within a query. It is likely that the best indexes to use would be on Username and another on Email, but as your query has an OR there it would have to chose one or the other.
A way round this restriction on indexes is to use 2 queries unioned together, something like this:-
SELECT *
FROM `Member` this_
WHERE (this_._Temporary_Flag = 0
or this_._Temporary_Flag is null)
and (this_._Deleted = 0
or this_._Deleted is null)
and this_.Email = 'XXXXXXXX'
UNION
SELECT *
FROM `Member` this_
WHERE (this_._Temporary_Flag = 0
or this_._Temporary_Flag is null)
and (this_._Deleted = 0
or this_._Deleted is null)
and this_.Username = 'XXXXXXXX'
ORDER BY this_.Priority asc;
http://dev.mysql.com/doc/refman/5.5/en/explain-output.html
Explain tells you what MySQL is doing, it doesn't necessarily tell you or even imply what can be done to make things better.
That said, there are a few warning signs that generally imply that you can optimize a query; the biggest one in this case is the occurrence of Using filesort in the Extra column.
The documentation explains what happens in that case:
MySQL must do an extra pass to find out how to retrieve the rows in
sorted order. The sort is done by going through all rows according to
the join type and storing the sort key and pointer to the row for all
rows that match the WHERE clause. The keys then are sorted and the
rows are retrieved in sorted order.
Another warning sign in your case is the key that is used. While not necessarily true in your case, a well normalized structure will generally require unique values for Username and Email.
So, why does it take so long when you are specifying those two things? Shouldn't the optimizer be able to just go straight to those rows? Probably not, because you are specifying them with an OR, which makes it difficult for the optimizer to use indexes to find those rows.
Instead, the optimizer decided to _Temporary_Flag to look through all the results, which probably didn't narrow the result set much, especially given that the Explain says that approximately 33735 rows were looked at.
So, working on the assumption that email and username will be much more selective than this key, you could try rewriting your query as a UNION.
SELECT * FROM `Member` this_
WHERE
(this_._Temporary_Flag = 0 or this_._Temporary_Flag
is null)
and
(this_._Deleted = 0 or this_._Deleted is null)
and this_.Email = 'XXXXXXXX'
UNION
SELECT * FROM `Member` this_
WHERE (this_._Temporary_Flag = 0 or this_._Temporary_Flag
is null)
and (this_._Deleted = 0 or this_._Deleted is null)
and
this_.Username = 'XXXXXXXX'
ORDER BY this_.Priority asc;
So, those are a couple of warning signs from EXPLAIN: Look for Using filesort and strange key choices as indicators that you can probably improve things.
I'm connecting to a database which I cannot administer and I wrote a query which did a left join between two tables - one small and one several orders of magnitude larger. At some point, the database returned this error:
Incorrect key file for table '/tmp/#sql_some_table.MYI'; try to repair it
I contacted the administrators and I was told that I'm getting this error because I'm doing the left join incorrectly, that I should NEVER left join a small table to a large table and that I should reverse the join order. The reason they gave was that when done my way, MySQL will try to create a temp table which is too large and query will fail. Their solution fails elsewhere, but that's not important here.
I found their explanation odd, so I ran explain on my query:
id = '1'
select_type = 'SIMPLE'
table = 'small_table'
type = 'ALL'
possible_keys = NULL
key = NULL
key_len = NULL
ref = NULL
rows = '23'
Extra = 'Using temporary; Using filesort'
id = '1'
select_type = 'SIMPLE'
table = 'large_table'
type = 'ref'
possible_keys = 'ID,More'
key = 'ID'
key_len = '4'
ref = 'their_db.small_table.ID'
rows = '41983'
Extra = NULL
(The 41983 rows in the second table are not very interesting to me, I just needed the latest record, which is why my query has order by large_table.ValueDateTime desc limit 1 at the end.)
I was careful enough to do a select by columns which the admins themselves told me, should hold unique values (and thus I assumed indexed), but it seems they haven't indexed those columns.
My question is - is doing the join the way I did ('small_table LEFT JOIN large_table') bad practice in general, or can such queries be made to execute successfully with proper indexing?
Edit:
Here's what the query looks like (this is not the actual query, but similar):
select large_table.ValueDateTime as LastDate,
small_table.DeviceIMEI as IMEI,
small_table.Other_Columns as My_Names,
large_table.Pwr as Voltage,
large_table.Temp as Temperature
from small_table left join large_table on small_table.ID = large_table.ID
where DeviceIMEI = 500
order by ValueDateTime desc
limit 1;
Basically what I'm doing is trying to get the most current data for a device, given that Voltage and Temperature change over time. The DeviceIMEI, ID and ValueDateTime should be unique, but aren't indexed (like I said earlier, I don't administer the database, I only have read permissions).
Edit 2:
Please focus on answering my actual question, not on attempting to rewrite my original query.
The left join thing is a red herring.
It is however an actual problem of running out of space for temp tables. But the order of your join makes no difference. The only thing that matters is how many rows MySQL must work with.
Which brings me to the LIMIT command:
Here is the problem:
It order to get the single row you asked for, MySQL has to sort the ENTIRE record set, then grab the top one. In order to sort it, it must store it - either in memory or on disk. And that's where you are running out of space. Every single column you requested get stored on disk, for the entire table, then sorted.
This is slow, very slow, and uses a lot of disk space.
Solutions:
You want MySQL to be able to use an index for the sorting. But in your query it can't. It's using an index for the join reference, and MySQL can only use one index per query.
Do you even have an index on the sort column? Try that first.
Another option is to do a separate query, where you select just the ID of the large table, LIMIT 1. Then the temporary table is MUCH smaller since all it has are IDs without all the other columns.
Once you know the ID then retrieve all the columns you need directly from the tables. You can do this in one shot with a subquery. If you post your query I could rewrite it to show you, but it's basically ID = (SELECT ID FROM ..... LIMIT 1)
I was using a query that looked similar to this one:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`, `views_sum`
WHERE `views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`
GROUP BY `episodes`.`id`
... which takes ~0.1s to execute. But it's problematic, because some episodes don't have a corresponding views_sum row, so those episodes aren't included in the result.
What I want is NULL values when a corresponding views_sum row doesn't exist, so I tried using a LEFT JOIN instead:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`
LEFT JOIN `views_sum` ON (`views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`)
GROUP BY `episodes`.`id`
This query produces the same columns, and it also includes the few rows missing from the 1st query.
BUT, the 2nd query takes 10 times as long! A full second.
Why is there such a huge discrepancy between the execution times when the result is so similar? There's nowhere near 10 times as many rows — it's like 60 from the 1st query, and 70 from the 2nd. That's not to mention that the 10 additional rows have no views to sum!
Any light shed would be greatly appreciated!
(There are indexes on episodes.id, views_sum.index, and views_sum.key.)
EDIT:
I copy-pasted the SQL from above, and here are the EXPLAINs, in order:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE views_sum ref index,key index 27 const 6532 Using where; Using temporary; Using filesort
1 SIMPLE episodes eq_ref PRIMARY PRIMARY 4 db102914_itw.views_sum.key 1 Using where
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE episodes ALL NULL NULL NULL NULL 70 Using temporary; Using filesort
1 SIMPLE views_sum ref index,key index 27 const 6532
Here's the query I ultimately came up with, after many, many iterations. (The SQL_NO_CACHE flag is there so I can test execution times.)
SELECT SQL_NO_CACHE e.*, IFNULL(SUM(vs.`clicks`), 0) as `clicks`
FROM `episodes` e
LEFT JOIN
(SELECT * FROM `views_sum` WHERE `index` = "episode") vs
ON vs.`key` = e.`id`
GROUP BY e.`id`
Because the ON condtion views_sum.index = "episode" is static, i.e., isn't dependent on the row it's joined to, I was able to get a massive performance boost by first using a subquery to limit the views_sum table before joining.
My query now takes ~0.2s. And what's even better, the time doesn't grow as you increase the offset of the query (unlike my first LEFT JOIN attempt). It stays the same, even if you do a sort on the clicks column.
You should have a combined index on views_sum.index and views_sum.key. I suspect you will always use both fields together if i look at the names. Also, I would rewrite the first query to use a proper INNER JOIN clause instead of a filtered cartesian product.
I suspect the performance of both queries will be much closer together if you do this. And, more importantly, much faster than they are now.
edit: Thinking about it, I would probably add a third column to that index: views_sum.clicks, which probably can be used for the SUM. But remember that multi-column indexes can only be used left to right.
It's all about the indexes. You'll have to play around with it a bit or post your database schema on here. Just as a rough guess i'd say you should make sure you have an index on views_sum.key.
Normally, a LEFT JOIN will be slower than an INNER JOIN or a CROSS JOIN because it has to view the first table differently. Put another way, the difference in time isn't related to the size of the result, but the full size of the left table.
I also wonder if you're asking MySQL to figure things out for you that you should be doing yourself. Specifically, that SUM() function would normally require a GROUP BY clause.