I'm connecting to a database which I cannot administer, and I wrote a query which did a left join between two tables: one small, and one several orders of magnitude larger. At some point, the database returned this error:
Incorrect key file for table '/tmp/#sql_some_table.MYI'; try to repair it
I contacted the administrators and was told that I'm getting this error because I'm doing the left join incorrectly: I should NEVER left join a small table to a large table, and I should reverse the join order. The reason they gave was that when done my way, MySQL will try to create a temp table which is too large, and the query will fail. Their solution fails elsewhere, but that's not important here.
I found their explanation odd, so I ran explain on my query:
id  select_type  table        type  possible_keys  key   key_len  ref                      rows   Extra
1   SIMPLE       small_table  ALL   NULL           NULL  NULL     NULL                     23     Using temporary; Using filesort
1   SIMPLE       large_table  ref   ID,More        ID    4        their_db.small_table.ID  41983  NULL
(The 41983 rows in the second table are not very interesting to me; I just needed the latest record, which is why my query has order by large_table.ValueDateTime desc limit 1 at the end.)
I was careful to select by columns which the admins themselves told me should hold unique values (and which I therefore assumed were indexed), but it seems they haven't indexed those columns.
My question is - is doing the join the way I did ('small_table LEFT JOIN large_table') bad practice in general, or can such queries be made to execute successfully with proper indexing?
Edit:
Here's what the query looks like (this is not the actual query, but similar):
select large_table.ValueDateTime as LastDate,
small_table.DeviceIMEI as IMEI,
small_table.Other_Columns as My_Names,
large_table.Pwr as Voltage,
large_table.Temp as Temperature
from small_table left join large_table on small_table.ID = large_table.ID
where DeviceIMEI = 500
order by ValueDateTime desc
limit 1;
Basically, what I'm doing is trying to get the most current data for a device, given that Voltage and Temperature change over time. DeviceIMEI, ID and ValueDateTime should be unique, but aren't indexed (like I said earlier, I don't administer the database; I only have read permissions).
Edit 2:
Please focus on answering my actual question, not on attempting to rewrite my original query.
The left join thing is a red herring.
There is, however, a real problem of running out of space for temp tables. But the order of your join makes no difference; the only thing that matters is how many rows MySQL must work with.
Which brings me to the LIMIT clause. Here is the problem: in order to get the single row you asked for, MySQL has to sort the ENTIRE result set, then grab the top one. In order to sort it, it must store it, either in memory or on disk, and that's where you are running out of space. Every single column you requested gets stored, for the entire result set, and then sorted.
This is slow, very slow, and uses a lot of disk space.
Solutions:
You want MySQL to be able to use an index for the sorting, but in your query it can't: it's already using an index for the join reference, and MySQL can generally use only one index per table in a query.
Do you even have an index on the sort column? Try that first.
Another option is to do a separate query, where you select just the ID of the large table, LIMIT 1. Then the temporary table is MUCH smaller since all it has are IDs without all the other columns.
Once you know the ID, retrieve all the columns you need directly from the tables. You can do this in one shot with a subquery: it's basically ID = (SELECT ID FROM ..... LIMIT 1)
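For example, here's a rough sketch against the tables from the question. (RowID is a hypothetical primary key on large_table; the question doesn't name one, so adjust to whatever the real key is.)

-- Sketch only: RowID is a hypothetical primary key on large_table.
-- The inner query sorts a narrow row set (just RowID plus the sort key),
-- so the temp table stays small; the outer query then fetches the
-- wide columns for that single row.
select lt.ValueDateTime as LastDate,
       st.DeviceIMEI    as IMEI,
       lt.Pwr           as Voltage,
       lt.Temp          as Temperature
from small_table st
join large_table lt on st.ID = lt.ID
where lt.RowID = (select lt2.RowID
                  from large_table lt2
                  join small_table st2 on st2.ID = lt2.ID
                  where st2.DeviceIMEI = 500
                  order by lt2.ValueDateTime desc
                  limit 1);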
Related
I have a very big database of customers. This query was OK before I added ORDER BY. How can I optimize my query speed?
$sql = "SELECT * FROM customers
LEFT JOIN ids ON customer_ids.customer_id = customers.customer_id AND ids.type = '10'
ORDER BY customers.name LIMIT 10";
ids.type and customers.name are my indexes
Explain query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customers ALL NULL NULL NULL NULL 955 Using temporary; Using filesort
1 SIMPLE ids ALL type NULL NULL NULL 3551 Using where; Using join buffer (Block Nested Loop)
(I assume you meant to type ids.customer_id = customers.customer_id and not customer_ids.customer_id)
Without the ORDER BY, MySQL grabbed the first 10 ids of type 10 (indexed), looked up the customer for each, and was done. (Note that the LEFT JOIN here is really an INNER JOIN, because the join conditions will only hold for rows that have a match in both tables.)
With the ORDER BY, MySQL is probably retrieving all type=10 customers and then sorting them by name to find the first 10.
You could speed this up by either denormalizing the customers table (copy the type into the customer record) or creating a mapping table to hold the customer_id, name, type tuples. In either case, add an index on (type, name). If using the mapping table, use it to do a 3-way join with customers and ids.
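For the denormalized variant, a rough sketch (column and index names are illustrative, and it assumes one ids row per customer):

-- Copy the type onto customers, then index it together with the sort
-- column so MySQL can walk (type, name) in order and stop after 10 rows.
ALTER TABLE customers ADD COLUMN type VARCHAR(10) NULL;

UPDATE customers c
JOIN ids i ON i.customer_id = c.customer_id
SET c.type = i.type;

ALTER TABLE customers ADD INDEX idx_type_name (type, name);

SELECT * FROM customers
WHERE type = '10'
ORDER BY name
LIMIT 10;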
If type=10 is reasonably common, you could also force the query to walk the customers table in name order and check the type for each row with STRAIGHT_JOIN. It won't be as fast as a compound index, but it will be faster than pulling up all matches.
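For example (a sketch; it relies on the existing index on customers.name, and note that STRAIGHT_JOIN behaves as an inner join):

-- STRAIGHT_JOIN forces MySQL to read customers first; with the index on
-- customers.name it can walk the rows in name order, probe ids for each,
-- and stop as soon as 10 type-10 matches are found.
SELECT *
FROM customers
STRAIGHT_JOIN ids ON ids.customer_id = customers.customer_id
                 AND ids.type = '10'
ORDER BY customers.name
LIMIT 10;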
And as suggested above, run an EXPLAIN on your query to see the query plan that MySQL is using.
LEFT is the problem. By saying LEFT JOIN, you are implying that some customers may not have a corresponding row(s) in ids. And you are willing to accept NULLs for the fields in place of such an ids row.
If that is not the case, then remove LEFT. Then make sure you have an index on ids that starts with type. Also, customers must have an index (probably the PRIMARY KEY) starting with customer_id. With those, the optimizer can start with ids, filter on type earlier, and thereby have less work to do.
But, still, it must collect lots of rows before doing the sort (ORDER BY); only then can it deliver the 10 (LIMIT).
While you are at it, add INDEX(customer_id) to ids -- that is what is killing performance for the LEFT version.
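A sketch of those suggested indexes (names are illustrative):

ALTER TABLE ids
  ADD INDEX idx_type_customer (type, customer_id),  -- starts with type, as suggested
  ADD INDEX idx_customer (customer_id);             -- helps the LEFT JOIN version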
I'm not that well versed in indexing in MySQL, and am having a hard time trying to understand how the EXPLAIN output works, and how it can be read to know if my query is optimised or not.
I have a fairly large table (1.1M records) and I am executing the below query:
SELECT * FROM `Member` this_ WHERE (this_._Temporary_Flag = 0 or this_._Temporary_Flag
is null) and (this_._Deleted = 0 or this_._Deleted is null) and
(this_.Username = 'XXXXXXXX' or this_.Email = 'XXXXXXXX')
ORDER BY this_.Priority asc;
It takes a very long time to execute, between 30 - 60 seconds most of the times. The output of the EXPLAIN query is as below:
id select_type table type possible_keys key key_len ref rows Extra
----------------------------------------------------------------------------------------------------------------------------------
1 SIMPLE this_ ref_or_null _Temporary_Flag,_Deleted,username,email _Temporary_Flag 2 const 33735 Using where; Using filesort
What does this statement exactly mean? Does it mean that this query can be optimised? The table has mostly single-column indexes. What are the important outputs from EXPLAIN which I should look at?
It is saying that the index it has chosen to use is the one called _Temporary_Flag (which I assume is on the _Temporary_Flag column). This is not a great index to use (it still leaves it looking at 33k records), but the best it can use in the situation. It might be worth adding an index covering both the _Temporary_Flag and the _Deleted columns.
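Something like this, for instance (the index name is illustrative):

-- A compound index covering both flag columns checked in the WHERE clause.
ALTER TABLE `Member` ADD INDEX idx_temp_deleted (_Temporary_Flag, _Deleted);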
However I doubt that narrows things down much.
One issue is that MySQL can only use a single index on a table within a query. It is likely that the best indexes to use would be on Username and another on Email, but as your query has an OR there it would have to chose one or the other.
A way round this restriction on indexes is to use two queries unioned together, something like this:
SELECT *
FROM `Member` this_
WHERE (this_._Temporary_Flag = 0
or this_._Temporary_Flag is null)
and (this_._Deleted = 0
or this_._Deleted is null)
and this_.Email = 'XXXXXXXX'
UNION
SELECT *
FROM `Member` this_
WHERE (this_._Temporary_Flag = 0
or this_._Temporary_Flag is null)
and (this_._Deleted = 0
or this_._Deleted is null)
and this_.Username = 'XXXXXXXX'
ORDER BY Priority asc;
http://dev.mysql.com/doc/refman/5.5/en/explain-output.html
EXPLAIN tells you what MySQL is doing; it doesn't necessarily tell you, or even imply, what can be done to make things better.
That said, there are a few warning signs that generally imply that you can optimize a query; the biggest one in this case is the occurrence of Using filesort in the Extra column.
The documentation explains what happens in that case:
MySQL must do an extra pass to find out how to retrieve the rows in sorted order. The sort is done by going through all rows according to the join type and storing the sort key and pointer to the row for all rows that match the WHERE clause. The keys then are sorted and the rows are retrieved in sorted order.
Another warning sign is the key that is used. While not necessarily true for your schema, a well-normalized structure will generally require unique values for Username and Email.
So, why does it take so long when you are specifying those two things? Shouldn't the optimizer be able to just go straight to those rows? Probably not, because you are specifying them with an OR, which makes it difficult for the optimizer to use indexes to find those rows.
Instead, the optimizer decided to use _Temporary_Flag to look through all the results, which probably didn't narrow the result set much, especially given that the EXPLAIN says approximately 33735 rows were looked at.
So, working on the assumption that email and username will be much more selective than this key, you could try rewriting your query as a UNION.
SELECT * FROM `Member` this_
WHERE (this_._Temporary_Flag = 0 or this_._Temporary_Flag is null)
  and (this_._Deleted = 0 or this_._Deleted is null)
  and this_.Email = 'XXXXXXXX'
UNION
SELECT * FROM `Member` this_
WHERE (this_._Temporary_Flag = 0 or this_._Temporary_Flag is null)
  and (this_._Deleted = 0 or this_._Deleted is null)
  and this_.Username = 'XXXXXXXX'
ORDER BY Priority asc;
So, those are a couple of warning signs from EXPLAIN: Look for Using filesort and strange key choices as indicators that you can probably improve things.
I have 3 tables that I need to join; these join together fine using indexes. However, we are transitioning from using one legacy field as the identifier to a new field in another table.
LEGACYID is the legacy field, while NEWID is the new field. Both fields are varchars.
Both fields are indexed exclusively with a BTREE index, and both tables are MyISAM.
SELECT Username
FROM CUST C use index(primary,NEWID)
JOIN TBLSHP S ON S.CUSID = C.CUSID
JOIN TBLQ Q ON Q.SHPID = S.SHPID
WHERE C.LEGACYID = '692041'
OR Q.NEWID = '692041'
This query takes 5.147 seconds, that's 5 seconds longer than I expect.
When doing an EXPLAIN EXTENDED on this query, the join type for the table holding NEWID is ALL (i.e., a full table scan); possible keys are (primary, NEWID) and key is (null). If I remove LEGACYID from the OR statement, EXPLAIN says the NEWID key will now be used. If I remove NEWID from the OR statement, the following changes occur:
the type of the table joins for (S,C) change from type ref to eq_ref
key_len changes from 4 to 5 (on both)
extra changes from empty to "Using where".
With either one of the conditions removed from the OR statement, the query runs at expected speeds.
Table Q has 183k records; C:115k; S:169k.
One last point: if I restructure the query, moving the NEWID condition into the join:
SELECT Username
FROM CUST C use index(primary,NEWID)
JOIN TBLSHP S ON S.CUSID = C.CUSID
LEFT JOIN TBLQ Q ON Q.SHPID = S.SHPID
AND Q.NEWID = '692041'
WHERE C.LEGACYID = '692041'
Although it's not the same query, for the way the data works it will provide the results I need, and the speed is back down to under 0.1 seconds.
I did want to clarify that I don't really need a "query that works" solution; Ponies below has already provided one. What I need to know is whether anyone else has run into this problem, why it is happening, and what I can do to get this simple OR statement to use both indexes.
If you know there won't be duplicates, use UNION ALL instead of UNION (UNION ALL is faster because it doesn't remove duplicates). Otherwise, use:
SELECT Username
FROM CUST C use index(primary,NEWID)
JOIN TBLSHP S ON S.CUSID = C.CUSID
JOIN TBLQ Q ON Q.SHPID = S.SHPID
WHERE C.LEGACYID = '692041'
UNION
SELECT Username
FROM CUST C use index(primary,NEWID)
JOIN TBLSHP S ON S.CUSID = C.CUSID
JOIN TBLQ Q ON Q.SHPID = S.SHPID
WHERE Q.NEWID = '692041'
ORs are notoriously bad performers because they splinter the execution path. The UNION alleviates that splintering and then combines the two result sets. That said, IN is preferable to OR: though the two are logically the same, the execution of IN is generally more optimized.
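(Note that IN only applies when the OR compares a single column against several values, which is not the case in your query; as a general illustration:)

-- Logically identical, but IN generally optimizes better than an OR chain.
SELECT Username FROM CUST WHERE LEGACYID = '692041' OR LEGACYID = '692042';
SELECT Username FROM CUST WHERE LEGACYID IN ('692041', '692042');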
UNION isn't always the answer
Investigate many options, comparing the EXPLAIN output, before settling on a solution. I've come across a couple of cases recently that perform better using a cursor than a single query using esoteric functionality.
Also, make sure foreign key columns (what you're using in the ON clause when JOINing) are indexed. MySQL has started (v5.5+?) to automatically do this when a foreign key constraint is made, but that's only for InnoDB tables.
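With MyISAM that means creating them yourself. A sketch against the tables above (the question says these joins are already indexed, so this is just to illustrate the advice):

-- Explicit indexes on the join columns used in the ON clauses; MyISAM
-- ignores FOREIGN KEY constraints, so no index is created automatically.
ALTER TABLE TBLSHP ADD INDEX idx_cusid (CUSID);
ALTER TABLE TBLQ   ADD INDEX idx_shpid (SHPID);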
I am running the below query:
SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages ,provider, packagestosubcat,
packagestocity, subcat, usertosubcat,
usertocity, usertoprovider
WHERE packages.endDate >'2011-03-11 06:00:00' AND
usertosubcat.userid = 1 AND
usertocity.userid = 1 AND
packages.providerid = provider.id AND
packages.id = packagestosubcat.packageid AND
packages.id = packagestocity.packageid AND
packagestosubcat.subcatid = subcat.id AND
usertosubcat.subcatid = packagestosubcat.subcatid AND
usertocity.cityid = packagestocity.cityid AND
(
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
)
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
When I run EXPLAIN, everything looks OK except for the scan on the usertoprovider table, which doesn't seem to be using the table's keys:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE usertocity ref user,city user 4 const 4 Using temporary; Using filesort
1 SIMPLE packagestocity ref city,packageid city 4 usertocity.cityid 419
1 SIMPLE packages eq_ref PRIMARY,enddate PRIMARY 4 packagestocity.packageid 1 Using where
1 SIMPLE provider eq_ref PRIMARY,providertype PRIMARY 4 packages.providerid 1 Using where
1 SIMPLE packagestosubcat ref subcatid,packageid packageid 4 packages.id 1 Using where
1 SIMPLE subcat eq_ref PRIMARY PRIMARY 4 packagestosubcat.subcatid 1
1 SIMPLE usertosubcat ref userid,subcatid subcatid 4 const 12 Using where
1 SIMPLE usertoprovider ALL userid,providerid NULL NULL NULL 3735 Using where
As you can see in the above query, the condition itself is:
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
Both tables, provider and usertoprovider, are indexed. provider has indexes on providerid and providertype while usertoprovider has indexes on userid and providerid
The cardinality of the keys is:
provider.id=47, provider.type=1, usertoprovider.userid=1245, usertoprovider.providerid=6
So it's quite obvious that the indexes are not used.
Furthermore, to test it out, I went ahead and:
Duplicated the usertoprovider table
Inserted all the provider values that have providertype='reg' into the cloned table
Simplified the condition to (usertoprovider.userid = 1 AND usertoprovider.providerid = provider.ID)
The query execution time changed from 8.1317 sec. to 0.0387 sec.
Still, provider values that have providertype='reg' are valid for all users, and I would like to avoid inserting these values into the usertoprovider table for every user, since this data is redundant.
Can someone please explain why MySQL still runs a full scan and doesn't use the keys? What can be done to avoid it?
It seems that provider.providertype != 'reg' is redundant (always true) unless provider.providertype is nullable and you want the query to fail on NULL.
And shouldn't != be <> to be standard SQL? (MySQL does allow !=.)
On the cost of table scans
A full table scan is not necessarily more expensive than walking an index, because walking an index still requires multiple page accesses. In many database engines, if your table is small enough to fit inside a few pages, and the number of rows is small enough, it will be cheaper to do a table scan. Database engines make this type of decision based on the data and index statistics of the table.
This case
However, in your case, it might also be because of the other leg of your OR clause: provider.providertype = 'reg'. If providertype is "reg", then this query joins in ALL the rows of usertoprovider (most likely not what you want), since it is a multi-table cross join.
The database engine is correct in determining that you'll likely need all the table rows in usertoprovider anyway (unless none of the providertypes is "reg", which the engine may well know!).
The query hides this fact because you are grouping on the (MASSIVE!) result set later on and just returning the package ID, so you won't see how many usertoprovider rows have been returned. But it will run very slowly. Get rid of the GROUP BY clause to find out how many rows you are actually forcing the database engine to work on!
The reason you see a massive speed improvement if you fill out the usertoprovider table is because then every row participates in a join, and there is no full cross join happening in the case of "reg". Before, if you have 1,000 rows in usertoprovider, every row with type="reg" expands the result set 1,000 times. Now, that row joins with only one row in usertoprovider, and the result set is not expanded.
If you really want to pass anything with providertype='reg', but not in your many-to-many mapping table, then the easiest way may be to use a sub-query:
Remove usertoprovider from your FROM clause
Do the following:
provider.providertype='reg' OR EXISTS (SELECT * FROM usertoprovider WHERE userid=1 AND providerid = provider.ID)
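Applied to the original query, that would look roughly like this (a sketch; the unchanged clauses are copied from the question):

SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages, provider, packagestosubcat,
     packagestocity, subcat, usertosubcat, usertocity
WHERE packages.endDate > '2011-03-11 06:00:00' AND
      usertosubcat.userid = 1 AND
      usertocity.userid = 1 AND
      packages.providerid = provider.id AND
      packages.id = packagestosubcat.packageid AND
      packages.id = packagestocity.packageid AND
      packagestosubcat.subcatid = subcat.id AND
      usertosubcat.subcatid = packagestosubcat.subcatid AND
      usertocity.cityid = packagestocity.cityid AND
      (
        provider.providertype = 'reg' OR
        EXISTS (SELECT 1 FROM usertoprovider utp   -- correlated: no cross join
                WHERE utp.userid = 1
                  AND utp.providerid = provider.ID)
      )
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC;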
Another method is to use an OUTER JOIN on the usertoprovider -- any row with "reg" which is not in the table will come back with one row of NULL instead of expanding the result set.
Hmm, I know that MySQL does funny things with grouping. In any other RDBMS, your query wouldn't even execute. What does this even mean:
SELECT packages.id
[...]
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
You want to group by title. In standard SQL syntax, this means you can only select title and aggregate functions of the other columns. MySQL magically tries to execute (and probably guess) what you may have meant. So what would you expect to be selected as packages.id? The first matching package ID for every title? Or the last? And what would the ORDER BY clause mean with respect to the grouping? How can you order by columns that are not really part of the result set (because only packages.title is)?
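To make that concrete, here's a sketch of what a well-defined grouping has to look like; the MIN/MAX choices are assumptions, since only you know which row per title you actually want:

-- Every selected column is either grouped or aggregated, so the result
-- is well-defined; which aggregates are right depends on the intent.
SELECT packages.title,
       MIN(packages.id)     AS id,
       MAX(packages.weight) AS weight
FROM packages
GROUP BY packages.title
ORDER BY packages.title;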
There are two solutions, as far as I can see:
You're on the right track with your query: remove the ORDER BY clause, because I don't think it will affect your result, but it may severely slow down your query.
You have a SQL problem, not a performance problem
I was using a query that looked similar to this one:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`, `views_sum`
WHERE `views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`
GROUP BY `episodes`.`id`
... which takes ~0.1s to execute. But it's problematic, because some episodes don't have a corresponding views_sum row, so those episodes aren't included in the result.
What I want is NULL values when a corresponding views_sum row doesn't exist, so I tried using a LEFT JOIN instead:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`
LEFT JOIN `views_sum` ON (`views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`)
GROUP BY `episodes`.`id`
This query produces the same columns, and it also includes the few rows missing from the 1st query.
BUT, the 2nd query takes 10 times as long! A full second.
Why is there such a huge discrepancy between the execution times when the result is so similar? There's nowhere near 10 times as many rows — it's like 60 from the 1st query, and 70 from the 2nd. That's not to mention that the 10 additional rows have no views to sum!
Any light shed would be greatly appreciated!
(There are indexes on episodes.id, views_sum.index, and views_sum.key.)
EDIT:
I copy-pasted the SQL from above, and here are the EXPLAINs, in order:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE views_sum ref index,key index 27 const 6532 Using where; Using temporary; Using filesort
1 SIMPLE episodes eq_ref PRIMARY PRIMARY 4 db102914_itw.views_sum.key 1 Using where
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE episodes ALL NULL NULL NULL NULL 70 Using temporary; Using filesort
1 SIMPLE views_sum ref index,key index 27 const 6532
Here's the query I ultimately came up with, after many, many iterations. (The SQL_NO_CACHE flag is there so I can test execution times.)
SELECT SQL_NO_CACHE e.*, IFNULL(SUM(vs.`clicks`), 0) as `clicks`
FROM `episodes` e
LEFT JOIN
(SELECT * FROM `views_sum` WHERE `index` = "episode") vs
ON vs.`key` = e.`id`
GROUP BY e.`id`
Because the ON condition views_sum.index = "episode" is static, i.e., it isn't dependent on the row it's joined to, I was able to get a massive performance boost by first using a subquery to limit the views_sum table before joining.
My query now takes ~0.2s. And what's even better, the time doesn't grow as you increase the offset of the query (unlike my first LEFT JOIN attempt). It stays the same, even if you do a sort on the clicks column.
You should have a combined index on views_sum.index and views_sum.key; judging by the names, I suspect you will always use both fields together. Also, I would rewrite the first query to use a proper INNER JOIN clause instead of a filtered cartesian product.
I suspect the performance of both queries will be much closer together if you do this. And, more importantly, much faster than they are now.
Edit: Thinking about it, I would probably add a third column to that index, views_sum.clicks, which can probably be used for the SUM. But remember that multi-column indexes can only be used left to right.
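A sketch of that three-column index (the name is illustrative; backticks are needed because index and key are reserved words):

-- Left to right: `index` serves the static filter, `key` the join, and
-- `clicks` makes the index covering for the SUM.
ALTER TABLE `views_sum`
  ADD INDEX idx_index_key_clicks (`index`, `key`, `clicks`);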
It's all about the indexes. You'll have to play around with it a bit or post your database schema here. Just as a rough guess, I'd say you should make sure you have an index on views_sum.key.
Normally, a LEFT JOIN will be slower than an INNER JOIN or a CROSS JOIN because it has to view the first table differently. Put another way, the difference in time isn't related to the size of the result, but the full size of the left table.
I also wonder if you're asking MySQL to figure things out for you that you should be doing yourself. Specifically, that SUM() function would normally require a GROUP BY clause.