Large SQL database - solving efficiency - mysql

I have the following SQL query. When I originally wrote it, it was exceptionally fast, but it now takes over a second to complete:
SELECT counted/scount AS ratio, [etc]
FROM playlists
LEFT JOIN (
    SELECT AID, PLID
    FROM (SELECT AID, PLID FROM p_s ORDER BY `order` ASC, PLSID DESC) AS g
    GROUP BY PLID
) AS t USING (PLID)
INNER JOIN (
    SELECT PLID, COUNT(PLID) AS scount
    FROM p_s
    LEFT JOIN audio USING (AID)
    WHERE removed='0' AND verified='1'
    GROUP BY PLID
) AS g USING (PLID)
LEFT JOIN (
    SELECT AID, COUNT(AID) AS counted
    FROM a_p_all
    WHERE ".time()." - playtime < 2678400
    GROUP BY AID
) AS r USING (AID)
LEFT JOIN audio USING (AID)
LEFT JOIN members USING (UID)
WHERE scount > 4
ORDER BY ratio DESC
LIMIT 0, 20
I have identified the problem: the a_p_all table has over 500k rows, and that is what is slowing down the query. I have come up with a solution:
Create a smaller temporary table, that only stores the data necessary, and deletes anything older than is needed.
However, is there a better method to use? Ideally I wouldn't need a temporary table at all; what do sites such as YouTube/Facebook do with large tables to keep query times fast?
Edit
This is the EXPLAIN output for the query in the answer from #spencer7593:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived3> ALL NULL NULL NULL NULL 20
1 PRIMARY u eq_ref PRIMARY PRIMARY 8 q.AID 1 Using index
1 PRIMARY m eq_ref PRIMARY PRIMARY 8 q.UID 1 Using index
3 DERIVED <derived6> ALL NULL NULL NULL NULL 20
6 DERIVED t ALL NULL NULL NULL NULL 21
5 DEPENDENT SUBQUERY s ALL NULL NULL NULL NULL 49 Using where; Using filesort
4 DEPENDENT SUBQUERY c ALL NULL NULL NULL NULL 49 Using where
4 DEPENDENT SUBQUERY o eq_ref PRIMARY PRIMARY 8 database.c.AID 1 Using where
2 DEPENDENT SUBQUERY a ALL NULL NULL NULL NULL 510594 Using where

Two "big rock" issues stand out to me.
Firstly, this predicate
WHERE ".time()." - playtime < 2678400
(I'm assuming that this isn't the actual SQL being submitted to the database, but that what's being sent is something like this:
WHERE 1409192073 - playtime < 2678400
such that we want only rows where playtime is within the past 31 days, i.e. within 31*24*60*60 seconds of the integer value returned by time().)
This predicate can't make use of a range scan operation on a suitable index on playtime. MySQL evaluates the expression on the left side for every row in the table (every row that isn't excluded by some other predicate), and the result of that expression is compared to the literal on the right.
To improve performance, rewrite the predicate so that the comparison is made against the bare column. Compare the value stored in the playtime column to an expression that only needs to be evaluated once, for example:
WHERE playtime > 1409192073 - 2678400
With a suitable index available, MySQL can perform a "range" scan operation, and efficiently eliminate a boatload of rows that don't need to be evaluated.
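For that range scan to be possible, an index with playtime in a leading position has to exist; a minimal sketch (the index name here is made up):
CREATE INDEX idx_apall_playtime ON a_p_all (playtime);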
The second "big rock" is the inline views, or "derived tables" in MySQL parlance. MySQL is much different than other databases in how inline views are processed. MySQL actually runs that innermost query, and stores the result set as a temporary MyISAM table, and then the outer query runs against the MyISAM table. (The name that MySQL uses, "derived table", makes sense when we understand how MySQL processes the inline view.) Also, MySQL does not "push" predicates down, from an outer query down into the view queries. And on the derived table, there are no indexes created. (I believe MySQL 5.7 is changing that, and does sometimes create indexes, to improve performance.) But large "derived tables" can have a significant performance impact.
Also, the LIMIT clause gets applied last in the statement processing; that's after all the rows in the resultset are prepared and sorted. Even if you are returning only 20 rows, MySQL still prepares the entire resultset; it just doesn't transfer them to the client.
Lots of the column references are not qualified with the table name or alias, so we don't know, for example, which table (p_s or audio) contains the removed and verified columns.
(We know it can't be both, since MySQL isn't throwing an "ambiguous column" error. But MySQL has access to the table definitions, where we don't. MySQL also knows something about the cardinality of the columns; in particular, which columns (or combinations of columns) are UNIQUE, which columns can contain NULL values, etc.)
Best practice is to qualify ALL column references with the table name or (preferably) a table alias. (This makes it much easier on the human reading the SQL, and it also prevents a query from breaking when a new column is added to a table.)
Also, the query has a LIMIT clause but no ORDER BY clause (or implied ORDER BY), which makes the resultset indeterminate. We have no guarantee as to which rows will be the "first" rows returned.
EDIT
To return only 20 rows from playlists (out of thousands or more), I might try using correlated subqueries in the SELECT list; using a LIMIT clause in an inline view to winnow down the number of rows that I'd need to run the subqueries for. Correlated subqueries can eat your lunch (and your lunchbox too) in terms of performance with large sets, due to the number of times those need to be run.
From what I can gather, you are attempting to return 20 rows from playlists, picking up the related row from members (by the foreign key in playlists), finding the "first" song in the playlist, getting a count of times that "song" has been played in the past 31 days (from any playlist), and getting the number of times a song appears on that playlist (as long as it's been verified and hasn't been removed... the outerness of that LEFT JOIN is negated by the predicates on the removed and verified columns, if either of those columns is from the audio table...).
I'd take a shot with something like this, to compare performance:
SELECT q.*
     , ( SELECT COUNT(1)
           FROM a_p_all a
          WHERE a.playtime > 1409192073 - 2678400
            AND a.AID = q.AID
       ) AS counted
  FROM ( SELECT p.PLID
              , p.UID
              , p.[etc]
              , ( SELECT COUNT(1)
                    FROM p_s c
                    JOIN audio o
                      ON o.AID = c.AID
                     AND o.removed='0'
                     AND o.verified='1'
                   WHERE c.PLID = p.PLID
                ) AS scount
              , ( SELECT s.AID
                    FROM p_s s
                   WHERE s.PLID = p.PLID
                   ORDER BY s.`order` ASC, s.PLSID DESC
                   LIMIT 1
                ) AS AID
           FROM ( SELECT t.PLID
                       , t.[etc]
                    FROM playlists t
                   ORDER BY NULL
                   LIMIT 20
                ) p
       ) q
  LEFT JOIN audio u ON u.AID = q.AID
  LEFT JOIN members m ON m.UID = q.UID
 LIMIT 0, 20
UPDATE
Dude, the EXPLAIN output is showing that you don't have suitable indexes available. To get any decent chance at performance with the correlated subqueries, you're going to want to add some indexes, e.g.
... ON a_p_all (AID, playtime)
... ON p_s (PLID, order, PLSID, AID)
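Spelled out as DDL, those would be something like this (index names are made up, and `order` needs backticks because it's a reserved word):
CREATE INDEX idx_apall_aid_playtime ON a_p_all (AID, playtime);
CREATE INDEX idx_ps_plid_order ON p_s (PLID, `order`, PLSID, AID);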

Related

PHP; MySQL JOIN query on large datasets gets slower as WHERE conditions update

So this might be a bit silly, but the alternative I was using is worse. I am trying to write an Excel sheet using data from my database and a PHP tool called Box/Spout. The thing is that Box/Spout reads rows one at a time; they are not retrieved via index (e.g. rows[10], rows[42], rows[156]).
I need to retrieve data from the database in the order the rows come out. I have a database with a list of customers that came in via an import, and I have to write them into the Excel spreadsheet. They have phone numbers, emails, and an address. Sorry for the confusion... :/ So I compiled this fairly complex query:
SELECT
`Import`.`UniqueID`,
`Import`.`RowNum`,
`People`.`PeopleID`,
`People`.`First`,
`People`.`Last`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `PhonesTable`.`Phone`, `PhonesTable`.`Type`)
ORDER BY `PhonesTable`.`PhoneID` DESC
SEPARATOR ';'
) AS `Phones`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
`Properties`.`Address1`,
`Properties`.`city`,
`Properties`.`state`,
`Properties`.`PostalCode5`,
...(17 more `People` Columns)...,
FROM `T_Import` AS `Import`
LEFT JOIN `T_CustomerStorageJoin` AS `CustomerJoin`
ON `Import`.`UniqueID` = `CustomerJoin`.`ImportID`
LEFT JOIN `T_People` AS `People`
ON `CustomerJoin`.`PersID`=`People`.`PeopleID`
LEFT JOIN `T_JoinPeopleIDPhoneID` AS `PeIDPhID`
ON `People`.`PeopleID` = `PeIDPhID`.`PeopleID`
LEFT JOIN `T_Phone` AS `PhonesTable`
ON `PeIDPhID`.`PhoneID`=`PhonesTable`.`PhoneID`
LEFT JOIN `T_JoinPeopleIDEmailID` AS `PeIDEmID`
ON `People`.`PeopleID` = `PeIDEmID`.`PeopleID`
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
LEFT JOIN `T_JoinPeopleIDPropertyID` AS `PeIDPrID`
ON `People`.`PeopleID` = `PeIDPrID`.`PeopleID`
AND `PeIDPrID`.`PropertyCP`='CurrentImported'
LEFT JOIN `T_Property` AS `Properties`
ON `PeIDPrID`.`PropertyID`=`Properties`.`PropertyID`
WHERE `Import`.`CustomerCollectionID`=$ccID
AND `RowNum` >= $rnOffset
AND `RowNum` < $rnLimit
GROUP BY `RowNum`;
So I have indexes on every ON segment, and on the WHERE segment. When RowNum is around 0 to 2500 in value, the query runs great and executes within a couple of seconds. But the query execution time seems to multiply exponentially the larger RowNum gets.
I have the EXPLAIN here, and at pastebin (https://pastebin.com/PksYB4n2):
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE Import NULL ref CustomerCollectionID,RowNumIndex CustomerCollectionID 4 const 48108 8.74 Using index condition; Using where; Using filesort;
1 SIMPLE CustomerJoin NULL ref ImportID ImportID 4 MyDatabase.Import.UniqueID 1 100 NULL
1 SIMPLE People NULL eq_ref PRIMARY,PeopleID PRIMARY 4 MyDatabase.CustomerJoin.PersID 1 100 NULL
1 SIMPLE PeIDPhID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 8 100 NULL
1 SIMPLE PhonesTable NULL eq_ref PRIMARY,PhoneID,PhoneID_2 PRIMARY 4 MyDatabase.PeIDPhID.PhoneID 1 100 NULL
1 SIMPLE PeIDEmID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 5 100 NULL
1 SIMPLE EmailsTable NULL eq_ref PRIMARY,EmailID,DupeDeleteSelect PRIMARY 4 MyDatabase.PeIDEmID.EmailID 1 100 NULL
1 SIMPLE PeIDPrID NULL ref PeopleMSCP,PeopleID,PropertyCP PeopleMSCP 5 MyDatabase.People.PeopleID 4 100 Using where
1 SIMPLE Properties NULL eq_ref PRIMARY,PropertyID PRIMARY 4 MyDatabase.PeIDPrID.PropertyID 1 100 NULL
I apologize if the formatting is absolutely terrible. I'm not sure what good formatting looks like, so I may have jumbled it a bit by accident; plus, the tabs got screwed up.
What I want to know is how to speed up the query. The databases are very large, in the tens of millions of rows. They aren't always like this, as our tables are constantly changing, but I would like to be able to handle it when they are.
I tried using LIMIT 2000, 1000 for example, but I know that it's less efficient than using an indexed column, so I switched over to RowNum. I feel like this was a good decision, but it seems like MySQL is still looping over every single row before the offset, which kind of defeats the purpose of my index... I think? I'm not sure. I also tried basically splitting this particular query into about 10 single-purpose queries and running them one by one, for each row of the Excel file. That takes a LONG time... TOO LONG. The combined query is fast by comparison, but obviously I'm still having a problem.
Any help would be greatly appreciated, and thank you ahead of time. I'm sorry again for my lack of post organization.
The order of the columns in an index matters. The order of the clauses in WHERE does not matter (usually).
INDEX(a), INDEX(b) is not the same as the "composite" INDEX(a,b). I deliberately made composite indexes where they seemed useful.
INDEX(a,b) and INDEX(b,a) are not interchangeable unless both a and b are tested with =. (Plus a few exceptions.)
A "covering" index is one where all the columns for the one table are found in the one index. This sometimes provides an extra performance boost. Some of my recommended indexes are "covering". It implies that only the index BTree need be accessed, not also the data BTree; this is where it picks up some speed.
In EXPLAIN SELECT ... a "covering" index is indicated by "Using index" (which is not the same as "Using index condition"). (Your Explain shows no covering indexes currently.)
An index 'should not' have more than 5 columns. (This is not a hard and fast rule.) T5's index had 5 columns to be covering; it was not practical to make a covering index for T2.
When JOINing, the order of the tables does not matter; the Optimizer is free to shuffle them around. However, these "rules" apply:
A LEFT JOIN may force ordering of the tables. (I think it does in this case.) (I ordered the columns based on what I think the Optimizer wants; there may be some flexibility.)
The WHERE clause usually determines which table to "start with". (You test on the Import table only, so obviously it will start with Import.)
The "next table" to be referenced (via NLJ - Nested Loop Join) is determined by a variety of things. (In your case it is pretty obvious -- namely the ON column(s).)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Revised Query
1. Import:       (CustomerCollectionID,  -- '=' comes first
                  RowNum,                -- 'range'
                  UniqueID)              -- 'covering'
   Import shows up in WHERE, so is first in EXPLAIN; also due to the LEFTs.
Properties:      (PropertyID)  -- is that the PK?
PeIDPrID:        (PropertyCP, PeopleID, PropertyID)
3. People:       (PeopleID)
   I assume that is the `PRIMARY KEY`? (Too many columns for "covering".)
   (Since `People` leads to 3 other tables, I won't number the rest.)
EmailsTable:     (EmailID, Email)
PeIDEmID:        (PeopleID,  -- JOIN from People
                  EmailID)   -- covering
PhonesTable:     (PhoneID, Type, Phone)
PeIDPhID:        (PeopleID, PhoneID)
2. CustomerJoin: (ImportID,  -- coming from `Import` (see ON...)
                  PersID)    -- covering
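As concrete DDL, the two most important of those would be something like this (index names are made up):
ALTER TABLE T_Import
    ADD INDEX ccid_rownum_uid (CustomerCollectionID, RowNum, UniqueID);
ALTER TABLE T_CustomerStorageJoin
    ADD INDEX importid_persid (ImportID, PersID);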
After adding those, I expect most lines of EXPLAIN to say Using index.
The lack of at least a composite index on Import is the main problem leading to your performance complaint.
Bad GROUP BY
When there is a GROUP BY that does not include all the non-aggregated columns that are not directly dependent on the GROUP BY column(s), you get indeterminate values for the extras. I see from the EXPLAIN ("rows") that several tables probably deliver multiple rows per group. You really ought to think about the garbage being generated by this query.
Curiously, Phones and Emails are fed into GROUP_CONCAT(), thereby avoiding the above issue, but the "rows" there is only 1.
(Read about ONLY_FULL_GROUP_BY; it might explain the issue better.)
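If you want MySQL to flag such queries as errors instead of silently picking values, you can turn that mode on for a session; a sketch:
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');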
(I'm listing this as a separate Answer since it is orthogonal to my other Answer.)
I call this the "explode-implode" syndrome. The query does a JOIN, getting a bunch of rows, thereby generating several rows, and puts multiple rows into an intermediate table. Then the GROUP BY implodes back to down to the original set of rows.
Let me focus on a portion of the query that could be reformulated to provide a performance improvement:
SELECT ...
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
...
FROM ...
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
...
GROUP BY `RowNum`;
Instead, move the table and aggregation function into a subquery:
SELECT ...
( SELECT GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `Email`)
ORDER BY `EmailID` DESC
SEPARATOR ';' )
FROM T_Email
WHERE `PeIDEmID`.`EmailID` = `EmailID`
) AS `Emails`,
...
FROM ...
-- and Remove: LEFT JOIN `T_Email` ON ...
...
-- and possibly Remove: GROUP BY ...;
Ditto for PhonesTable.
(It is unclear whether the GROUP BY can be removed; other things may need it.)
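For concreteness, the PhonesTable version of that subquery would have the same shape; a sketch mirroring the Emails version, using the phone columns from the original query:
    ( SELECT  GROUP_CONCAT(
                  DISTINCT CONCAT_WS(',', `Phone`, `Type`)
                  ORDER BY `PhoneID` DESC
                  SEPARATOR ';' )
        FROM  T_Phone
        WHERE `PeIDPhID`.`PhoneID` = `PhoneID`
    ) AS `Phones`,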

MySQL JOIN not filtering on WHERE clause with < > operators, since moving from MySQL 5.6 -> 5.7

We're upgrading our DB systems from MySQL 5.6 to MySQL 5.7, and since the upgrade a few queries have been running really slowly.
After some investigating, we narrowed it down to a few JOIN queries which suddenly don't listen to the WHERE clause anymore when using a 'larger than' (>) or 'smaller than' (<) operator. When using an '=' operator, they work as expected. When querying a large table, this caused constant 100% CPU usage.
The queries have been simplified to explain the issue at hand; when using EXPLAIN we get the following output:
explain
select * from TableA as A
left join
(
select
DATE_FORMAT(created_at,'%H:%i:00') as `time`
FROM
TableB
WHERE
created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR)
)
as V ON V.time = A.time
Output
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 SIMPLE TableB NULL index created_at created_at 4 NULL 488389 100.00 Using where; Using index; Using join buffer (Block Nested Loop)
As you can see, it's querying/matching 488389 rows and not using the WHERE clause, since that is the total number of records in that table.
And now running the same query but with a LIMIT 99999999 command or using the '=' operator:
explain
select * from TableA as A
left join
(
select
DATE_FORMAT(created_at,'%H:%i:00') as `time`
FROM
TableB
WHERE
created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR) LIMIT 999999999
)
as V ON V.time = A.time
Output
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 244194 100.00 Using where; Using join buffer (Block Nested Loop)
2 DERIVED TableB NULL range created_at created_at 4 NULL 244194 100.00 Using where; Using index
You can see it's suddenly matching only 244194 rows, which is just part of the table. Or, with the '=' operator:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 SIMPLE TableB NULL ref created_at created_at 4 const 1 100.00 Using where; Using index
Just 1 row, as expected.
So the question now is: have we been querying in a wrong way and are only now finding out while upgrading, or have things changed since MySQL 5.6? It seems odd that the = operator works, but < and > are ignored for some reason, unless a LIMIT is used?..
We've searched around and couldn't find the cause of this issue, and we'd rather not use the limit 9999999 solution in our code for obvious reasons.
Note: when running just the query inside the join on its own, it works as expected as well.
Note: we've also run the same test on MariaDB 10.1; same issue.
The EXPLAIN row output is merely a guess at how many rows it will hit. It is based upon statistical data that was reset by your upgrade. And if I had to guess how many of all your existing rows are older than yesterday 9pm, I too would guess it's closer to "all rows" than to "just some rows". The reason 'LIMIT 99999999' displays another row count is the same: it just guesses the limit will have an effect; in this case, MySQL guesses it will be exactly half of the rows (which would be, if true, a strange coincidence), and of course it doesn't actually look at the limit value, since 999999999 will not limit anything when you only have 500k rows. Even the "1" in the case of "=" is just a guess (it might more often be 0 than 1, and maybe sometimes more).
This estimate helps choose the correct execution plan, and being wrong in this guess is only a problem if it leads to choosing the wrong plan; your execution plan looks fine though, and there are not many options to do it otherwise. It does exactly what is expected: scan the index for all dates using the index on created_at. Since you do a LEFT JOIN, you cannot skip values from TableA even if you were to start with the inner query, so there is really no alternative execution plan available. (The optimizer actually changed in 5.7, but here it doesn't have an effect.)
If that is your actual query, there is no real reason why it should be slower than before (regarding this query only; there are of course a lot of general performance options that might have an indirect effect, like caching strategies, buffer sizes, ..., but with standard options it should not have an effect here).
If not, and you e.g. actually use additional columns from TableB in the subquery (it is often hard to guess which possibly important things have gotten "simplified away" in questions), and thus need access to the actual table, it might depend on how your data is structured (or better: in what order you added it). You might try OPTIMIZE TABLE TableB to make your table and indexes fresh and new; it can't hurt (but will lock your table for a little while).
With MySQL 5.7 you can now add generated columns, so it might be worth a try to generate a cleaned-up column time as DATE_FORMAT(created_at,'%H:%i:00'), so you don't have to calculate it anymore. And maybe add it to your index, so you don't have to sort anymore, to improve the block nested loop join; but that may depend on your actual query and how often you use it (spamming indexes increases overhead and uses space).
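A sketch of that generated column (MySQL 5.7 syntax; the column and index names are made up):
ALTER TABLE TableB
    ADD COLUMN time_hhmm CHAR(8)
        GENERATED ALWAYS AS (DATE_FORMAT(created_at, '%H:%i:00')) STORED,
    ADD INDEX idx_time_hhmm (time_hhmm);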
In MySQL 5.7, derived tables (sub-queries in the FROM clause) will be merged into the outer query if possible. This is usually an advantage, since it avoids storing the result of the sub-query in a temporary table. However, for your query, MySQL 5.6 would create an index on this temporary table that could be used for the join execution.
The problem with the merged query is that the index on TableB.created_at cannot be used when the column is a parameter to a function. If you can change the query so that the transformation is applied to the column on the left side of the join, an index can be used to access the table on the right side. Something like:
select * from TableA as A
left join
(
select created_at as time
FROM TableB
WHERE created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR)
)
as V ON V.time = func(A.time)
Alternatively, if you can use inner join instead of left join, MySQL can reverse the join order, so that the index on tableA.time can be used for the join.
If the subquery uses LIMIT, it cannot be merged. Hence, by using LIMIT you will get the same query plan as was used in MySQL 5.6.
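If you'd rather not keep the LIMIT trick in your code, MySQL 5.7 also exposes this merge behaviour as an optimizer switch, so another session-level workaround would be:
SET SESSION optimizer_switch = 'derived_merge=off';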
Use JOIN instead of LEFT JOIN unless you need the 'right' table to be optional.
Avoid JOIN ( SELECT ... ). Although 5.6 and 5.7 added some features to handle it, it is usually better to turn the subquery into a simpler JOIN.
Your time expression leads to 9pm yesterday; did you mean "3 hours ago" instead?
See if this gives the desired results and runs faster:
select A.*, DATE_FORMAT(B.created_at,'%H:%i:00') as `time`
from TableA as A
JOIN TableB as B ON B.time = A.time
WHERE B.created_at < NOW() - INTERVAL 3 HOUR -- (assuming "3 hours ago")
As for 5.6 vs 5.7: 5.7 has a new, 'better' optimizer, based on a "cost model". However, your particular query makes it virtually impossible for the optimizer to come up with good costs. I guess that 5.6 happened upon the better EXPLAIN, and 5.7 happened upon a worse one. By simplifying the query, I think both optimizers will have a better chance of performing the query faster.
You do need these indexes:
B: INDEX(time, created_at) -- in that order
A: INDEX(time)
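As DDL, those would be something like this (index names made up; this assumes the tables really do have a time column, as the rewritten query above implies):
ALTER TABLE TableB ADD INDEX idx_b_time_created (`time`, created_at);
ALTER TABLE TableA ADD INDEX idx_a_time (`time`);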

Why is this INNER JOIN/ORDER BY mysql query so slow?

I have a very big database of customers. This query was OK before I added the ORDER BY. How can I optimize my query speed?
$sql = "SELECT * FROM customers
LEFT JOIN ids ON customer_ids.customer_id = customers.customer_id AND ids.type = '10'
ORDER BY customers.name LIMIT 10";
ids.type and customers.name are my indexes
Explain query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customers ALL NULL NULL NULL NULL 955 Using temporary; Using filesort
1 SIMPLE ids ALL type NULL NULL NULL 3551 Using where; Using join buffer (Block Nested Loop)
(I assume you meant to type ids.customer_id = customer.customer_id and not customer_ids.customer_id)
Without the ORDER BY, MySQL grabbed the first 10 ids of type 10 (indexed), looked up the customer for each, and was done. (Note that the LEFT JOIN here is really an INNER JOIN, because the join conditions will only hold for rows that have a match in both tables.)
With the ORDER BY, MySQL is probably retrieving all type=10 customers and then sorting them by name to find the first 10.
You could speed this up either by denormalizing the customers table (copy the type into the customer record) or by creating a mapping table to hold customer_id, name, type tuples. In either case, add an index on (type, name); see the sketch below. If using the mapping table, use it to do a 3-way join with customers and ids.
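A sketch of the denormalized variant (column and index names are made up; the copied type column would have to be kept in sync by your application):
ALTER TABLE customers ADD COLUMN type VARCHAR(10);
ALTER TABLE customers ADD INDEX idx_type_name (type, name);

SELECT * FROM customers
WHERE type = '10'
ORDER BY name
LIMIT 10;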
If type=10 is reasonably common, you could also force the query to walk the customers table in name order, checking the type of each, with STRAIGHT_JOIN. It won't be as fast as a compound index, but it will be faster than pulling up all matches.
And as suggested above, run an EXPLAIN on your query to see the query plan that mysql is using.
LEFT is the problem. By saying LEFT JOIN, you are implying that some customers may not have corresponding row(s) in ids, and that you are willing to accept NULLs for the fields in place of such an ids row.
If that is not the case, then remove LEFT. Then make sure you have an index on ids that starts with type. Also, customers must have an index (probably the PRIMARY KEY) starting with customer_id. With those, the optimizer can start with ids, filter on type earlier, and thereby have less work to do.
But, still, it must collect lots of rows before doing the sort (ORDER BY); only then can it deliver the 10 (LIMIT).
While you are at it, add INDEX(customer_id) to ids -- that is what is killing performance for the LEFT version.
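As DDL, the two indexes suggested here would be something like this (index names are made up):
ALTER TABLE ids ADD INDEX idx_ids_type_customer (type, customer_id);
ALTER TABLE ids ADD INDEX idx_ids_customer (customer_id);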

Complex MySQL Select Left Join Optimization Indexing

I have a very complex query that finds the locations of members, joining the subscription details and sorting by distance.
Can someone provide instruction on the correct indexes I should add to make this load faster?
Right now on 1 million records it takes 75 seconds and I know it can be improved.
Thank you.
SELECT SQL_CALC_FOUND_ROWS
    (((acos(sin((33.987541*pi()/180)) * sin((users_data.lat*pi()/180))
        + cos((33.987541*pi()/180)) * cos((users_data.lat*pi()/180))
        * cos(((-118.472153 - users_data.lon)*pi()/180))))*180/pi())*60*1.1515) as distance,
    subscription_types.location_limit as location_limit,
    users_data.user_id, users_data.last_name, users_data.filename,
    users_data.user_id, users_data.phone_number, users_data.city,
    users_data.state_code, users_data.zip_code, users_data.country_code,
    users_data.quote, users_data.subscription_id, users_data.company,
    users_data.position, users_data.profession_id, users_data.experience,
    users_data.account_type, users_data.verified, users_data.nationwide,
    IF(listing_type = 'Company', company, last_name) as name
FROM `users_data`
LEFT JOIN `users_reviews` ON users_data.user_id=users_reviews.user_id AND users_reviews.review_status='2'
LEFT JOIN users_locations ON users_locations.user_id=users_data.user_id
LEFT JOIN subscription_types ON users_data.subscription_id=subscription_types.subscription_id
WHERE users_data.active='2'
AND subscription_types.searchable='1'
AND users_data.state_code='CA'
AND users_data.country_code='US'
GROUP BY users_data.user_id
HAVING distance <= '50'
OR location_limit='all'
OR users_data.nationwide='1'
ORDER BY subscription_types.search_priority ASC, distance ASC
LIMIT 0,10
EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE users_reviews system user_id,review_status NULL NULL NULL 0 const row not found
1 SIMPLE users_locations system user_id NULL NULL NULL 0 const row not found
1 SIMPLE users_data ref subscription_id,active,state_code,country_code state_code 47 const 88241 Using where; Using temporary; Using filesort
1 SIMPLE subscription_types ALL PRIMARY,searchable NULL NULL NULL 4 Using where; Using join buffer
Your query is not that complex. You have only one real join, on subscription_types, which is certainly a little table with no more than a few hundred rows.
Where are your indexes? The best way to improve your query is to create indexes on the fields you are filtering on, like active, country_code, state_code and searchable.
Have you created the foreign key on users_data.subscription_id? You need an index on that too.
FORCE INDEX is useless; let the RDBMS determine the best indexes to choose.
LEFT JOIN is useless too, because the predicate subscription_types.searchable='1' will remove the unmatched rows anyway.
The ORDER BY on search_priority implies that you need an index on that column too.
The filtering in the HAVING clause can prevent the indexes from being used. You don't need to put these filters in the HAVING; if I understand your table schema, it is not really an aggregate that is being filtered.
Your table contains 1 million rows, but how many rows are returned without the LIMIT? With the right indexes, the query should execute in under a second.
SELECT ...
FROM `users_data`
INNER JOIN subscription_types
        ON users_data.subscription_id = subscription_types.subscription_id
WHERE users_data.active='2'
  AND users_data.country_code='US'
  AND users_data.state_code='NY'
  AND subscription_types.searchable='1'
  -- note: "distance" is a SELECT alias, so it cannot be referenced in WHERE;
  -- inline the full distance expression here, or keep just this condition in HAVING
  AND (distance <= '50' OR location_limit='all' OR users_data.nationwide='1')
GROUP BY users_data.user_id
ORDER BY subscription_types.search_priority ASC, distance ASC
LIMIT 0,10
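A sketch of the indexes to pair with that query (index names are made up; equality-tested columns come first):
ALTER TABLE users_data
    ADD INDEX idx_ud_filter (active, country_code, state_code);
ALTER TABLE subscription_types
    ADD INDEX idx_st_search (searchable, search_priority);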

MySQL join query performance issue

I am running the below query:
SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages ,provider, packagestosubcat,
packagestocity, subcat, usertosubcat,
usertocity, usertoprovider
WHERE packages.endDate >'2011-03-11 06:00:00' AND
usertosubcat.userid = 1 AND
usertocity.userid = 1 AND
packages.providerid = provider.id AND
packages.id = packagestosubcat.packageid AND
packages.id = packagestocity.packageid AND
packagestosubcat.subcatid = subcat.id AND
usertosubcat.subcatid = packagestosubcat.subcatid AND
usertocity.cityid = packagestocity.cityid AND
(
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
)
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
When I run EXPLAIN, everything seems to look OK except for the scan on the usertoprovider table, which doesn't seem to be using the table's keys:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE usertocity ref user,city user 4 const 4 Using temporary; Using filesort
1 SIMPLE packagestocity ref city,packageid city 4 usertocity.cityid 419
1 SIMPLE packages eq_ref PRIMARY,enddate PRIMARY 4 packagestocity.packageid 1 Using where
1 SIMPLE provider eq_ref PRIMARY,providertype PRIMARY 4 packages.providerid 1 Using where
1 SIMPLE packagestosubcat ref subcatid,packageid packageid 4 packages.id 1 Using where
1 SIMPLE subcat eq_ref PRIMARY PRIMARY 4 packagestosubcat.subcatid 1
1 SIMPLE usertosubcat ref userid,subcatid subcatid 4 const 12 Using where
1 SIMPLE usertoprovider ALL userid,providerid NULL NULL NULL 3735 Using where
As you can see in the above query, the condition itself is:
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
Both tables, provider and usertoprovider, are indexed. provider has indexes on providerid and providertype while usertoprovider has indexes on userid and providerid
The cardinality of the keys is:
provider.id=47, provider.type=1, usertoprovider.userid=1245, usertoprovider.providerid=6
So it's quite obvious that the indexes are not used.
Furthermore, to test it out, I went ahead and:
Duplicated the usertoprovider table
Inserted all the provider values that have providertype='reg' into the cloned table
Simplified the condition to (usertoprovider.userid = 1 AND usertoprovider.providerid = provider.ID)
The query execution time changed from 8.1317 sec. to 0.0387 sec.
Still, provider values that have providertype='reg' are valid for all users, and I would like to avoid inserting these values into the usertoprovider table for every user, since this data is redundant.
Can someone please explain why MySQL still runs a full scan and doesn't use the keys? What can be done to avoid it?
It seems that provider.providertype != 'reg' is redundant (whenever it is false, the first leg of the OR is already true), unless provider.providertype is nullable and you want the query to fail on NULL.
And shouldn't != be <> to be standard SQL, although MySQL may allow !=?
On cost of table scans
A full table scan is not necessarily more expensive than walking an index, because walking an index still requires multiple page accesses. In many database engines, if your table is small enough to fit inside a few pages, and the number of rows is small enough, it will be cheaper to do a table scan. Database engines make this type of decision based on the data and index statistics of the table.
This case
However, in your case, it might also be because of the other leg in your OR clause: provider.providertype = 'reg'. If providertype is "reg", then this query joins in ALL the rows of usertoprovider (most likely not what you want), since it becomes a multi-table cross join.
The database engine is correct in determining that you'll likely need all the table rows in usertoprovider anyway (unless none of the providertypes is "reg", but the engine may well know that too!).
The query hides this fact because you are grouping on the (MASSIVE!) result set later on and just returning the package ID, so you won't see how many usertoprovider rows have been joined in. But it will run very slowly. Get rid of the GROUP BY clause to find out how many rows you are actually forcing the database engine to work on!!!
The reason you see a massive speed improvement when you fill out the usertoprovider table is that then every row participates in a join, and no full cross join happens in the "reg" case. Before, if you had 1,000 rows in usertoprovider, every row with type="reg" expanded the result set 1,000 times. Now, such a row joins with only one row in usertoprovider, and the result set is not expanded.
If you really want to accept anything with providertype='reg' without it being in your many-to-many mapping table, then the easiest way may be to use a sub-query:
Remove usertoprovider from your FROM clause
Replace the provider condition with the following:
provider.providertype='reg' OR EXISTS (SELECT * FROM usertoprovider WHERE userid=1 AND providerid = provider.ID)
Another method is to use an OUTER JOIN on the usertoprovider -- any row with "reg" which is not in the table will come back with one row of NULL instead of expanding the result set.
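A sketch of that OUTER JOIN variant (fragment only; it replaces the usertoprovider entry in the FROM list, and the alias is made up):
LEFT JOIN usertoprovider utp
       ON utp.userid = 1
      AND utp.providerid = provider.ID
...
WHERE provider.providertype = 'reg'
   OR utp.providerid IS NOT NULL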
Hmm, I know that MySQL does funny things with grouping. In any other RDBMS, your query wouldn't even execute. What does this even mean:
SELECT packages.id
[...]
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
You want to group by title. In standard SQL syntax, this means you can only select title and aggregate functions of the other columns. MySQL magically tries to execute (and probably guess) what you may have meant to execute. So what would you expect to be selected as packages.id? The first matching package ID for every title? Or the last? And what would the ORDER BY clause mean with respect to the grouping? How can you order by columns that are not part of the result set (because only packages.title really is)?
There are two solutions, as far as I can see:
1. You're on the right track with your query. In that case, remove the ORDER BY clause; I don't think it will affect your result, but it may severely slow down your query.
2. You have a SQL problem, not a performance problem.