Optimize SQL query on large-ish table - mysql

First of all, this question regards MySQL 3.23.58, so be advised.
I have 2 tables with the following definition:
Table A: id INT (primary), customer_id INT, offlineid INT
Table B: id INT (primary), name VARCHAR(255)
Now, table A contains in the range of 65k+ records, while table B contains ~40 records. In addition to the 2 primary key indexes, there is also an index on the offlineid field in table A. There are more fields in each table, but they are not relevant (as I see it, ask if necessary) for this query.
I was first presented with the following query (query time: ~22 seconds):
SELECT b.name, COUNT(*) AS orders, COUNT(DISTINCT(a.kundeid)) AS leads
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
Now, each id in medie is associated with a different name, meaning you could group by id as well as name. A bit of testing back and forth settled me on this (query time: ~6 seconds):
SELECT a.name, COUNT(*) AS orders, COUNT(DISTINCT(b.kundeid)) AS leads
FROM medie a
INNER JOIN katalogbestilling_katalog b ON a.id = b.offline
GROUP BY b.offline;
Is there any way to crank it down to "instant" time (max 1 second at worst)? I added the index on offlineid, but besides that and the re-arrangement of the query, I am at a loss for what to do. EXPLAIN shows the query is using filesort (the original query also used temporary tables). All suggestions are welcome!

I'm going to guess that your main problem is that you are using such an old version of MySQL. Maybe MySQL 3 doesn't like the COUNT(DISTINCT()).
Alternately, it might just be system performance. How much memory do you have?
Still, MySQL 3 is really old. I would at least put together a test system to see if a newer version ran that query faster.

Unfortunately, MySQL 3 doesn't support sub-queries. I suspect that the old version in general is what's causing the slow performance.

You could try making sure there are covering indexes defined on each table. A covering index is just an index where each column requested in the select or used in a join is included in the index. This way, the engine only has to read the index entry and doesn't have to also do the corresponding row lookup to get any requested columns not included in the index. I've used this technique with great success in Oracle and MS SqlServer.
Looking at your query, you could try:
one index for medie.id, medie.name
one index for katalogbestilling_katalog.offlineid, katalogbestilling_katalog.kundeid
The columns should be defined in this order in each index; the order makes a difference as to whether the index can be used or not.
More info here:
Covering Index Info
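For example, a minimal sketch of how those two covering indexes might be created (the index names are made up; the column names follow the question):
ALTER TABLE medie ADD INDEX cover_id_name (id, name);
ALTER TABLE katalogbestilling_katalog ADD INDEX cover_offline_kunde (offlineid, kundeid);
With the second index, the join, the GROUP BY and the COUNT(DISTINCT kundeid) can all be answered from the index alone, without touching the table rows.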

You may get a small increase in performance if you remove the inner join and replace it with a nested select statement; also remove the COUNT(*) and count the PK instead.
SELECT a.name, COUNT(*) AS orders, COUNT(DISTINCT(b.kundeid)) AS leads
FROM medie a
INNER JOIN katalogbestilling_katalog b ON a.id = b.offline
GROUP BY b.offline;
would be
SELECT a.name,
(SELECT COUNT(id) FROM katalogbestilling_katalog b WHERE b.offline = a.id) AS orders,
(SELECT COUNT(DISTINCT kundeid) FROM katalogbestilling_katalog b WHERE b.offline = a.id) AS leads
FROM medie a;

Well, if the query is run often enough to warrant the overhead, create an index on table A containing the fields used in the query. Then all the results can be read from the index and it won't have to scan the table.
That said, all my experience is based on MSSQL, so might not work.

Your second query is fine, and 65k + 40 rows is not very large :)
Put a new index on the katalogbestilling_katalog.offline column and it will run faster for you.

How is kundeid defined? It would be helpful to see the full schema for both tables (as generated by MySQL, i.e. with indexes) as well as the output of EXPLAIN for the queries above.
The easiest way to debug this and find out what your bottleneck is would be to start removing fields, one by one, from the query and measuring how long it takes to run (remember to run RESET QUERY CACHE before each run). At some point you'll see a significant drop in execution time, and then you've identified your bottleneck. For example:
SELECT b.name, COUNT(*) AS orders, COUNT(DISTINCT(a.kundeid)) AS leads
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
may become
SELECT b.name, COUNT(DISTINCT(a.kundeid)) AS leads
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
to eliminate the possibility of "orders" being the bottleneck, or
SELECT b.name, COUNT(*) AS orders
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
to eliminate "leads" from the equasion. This will lead you in the right direction.
update: I'm not suggesting removing any of the data from the final query. Just remove them to reduce the number of variables while looking for the bottleneck. Given your comment, I understand
SELECT b.name
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
is still performing badly? This clearly means it's either the join that is not optimized or the GROUP BY (which you can test by removing the GROUP BY - either the JOIN will still be slow, in which case that's the problem you need to fix, or it won't - in which case it's obviously the GROUP BY). Can you post the output of
EXPLAIN SELECT b.name
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
as well as the table schemas (to make it easier to debug)?
update #2
There's also a possibility that all of your indexes are created correctly but your MySQL installation is misconfigured when it comes to maximum memory usage or something along those lines, which forces it to sort on disk.

Try adding an index to (offlineid, kundeid)
I added 180,000 BS rows to katalog and 30,000 BS rows to medie (with katalog offlineid's corresponding to medie id's and with a few overlapping kundeid's to make sure the distinct counts work). Mind you, this is on MySQL 5, so if you don't have similar results, MySQL 3 may be your culprit, but from what I recall MySQL 3 should be able to handle this just fine.
My tables:
CREATE TABLE `katalogbestilling_katalog` (
`id` int(11) NOT NULL auto_increment,
`offlineid` int(11) NOT NULL,
`kundeid` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `offline_id` (`offlineid`,`kundeid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=60001 ;
CREATE TABLE `medie` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=30001 ;
My query:
SELECT b.name, COUNT(*) AS orders, COUNT(DISTINCT(a.kundeid)) AS leads
FROM medie b
INNER JOIN katalogbestilling_katalog a ON b.id = a.offlineid
GROUP BY a.offlineid
LIMIT 0 , 30
"Showing rows 0 - 29 (30,000 total, Query took 0.0018 sec)"
And the explain:
id: 1
select_type: SIMPLE
table: a
type: index
possible_keys: NULL
key: offline_id
key_len: 8
ref: NULL
rows: 180000
Extra: Using index
id: 1
select_type: SIMPLE
table: b
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: test.a.offlineid
rows: 1
Extra:

Try optimizing the server itself. See this post by Peter Zaitsev for the most important variables. Some are InnoDB specific, while others are for MyISAM. You didn't mention which engine you are using, which might be relevant in this case (COUNT(*) is much faster in MyISAM than in InnoDB, for example).
Here is another post from the same blog, and an article from MySQL Forge.
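For reference, a hedged sketch of the kind of variables such posts discuss; the values below are purely illustrative, not recommendations, and should be sized to your hardware:
SET GLOBAL key_buffer_size = 268435456;     -- MyISAM index cache, ~256 MB
SET GLOBAL tmp_table_size = 67108864;       -- allow larger in-memory temporary tables, ~64 MB
SET GLOBAL max_heap_table_size = 67108864;  -- must be raised together with tmp_table_size
SET GLOBAL sort_buffer_size = 4194304;      -- per-connection sort buffer, ~4 MB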

Related

PHP; MySQL JOIN query on large datasets gets slower as WHERE conditions update

So this might be a bit silly, but the alternative I was using is worse. I am trying to write an Excel sheet using data from my database and a PHP tool called Box/Spout. The thing is that Box/Spout reads rows one at a time; they are not retrieved via index (e.g. rows[10], rows[42], rows[156]).
I need to retrieve data from the database in the order the rows come out. I have a database with a list of customers that came in via an import, and I have to write them into the Excel spreadsheet. They have phone numbers, emails, and an address. Sorry for the confusion... :/ So I compiled this fairly complex query:
SELECT
`Import`.`UniqueID`,
`Import`.`RowNum`,
`People`.`PeopleID`,
`People`.`First`,
`People`.`Last`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `PhonesTable`.`Phone`, `PhonesTable`.`Type`)
ORDER BY `PhonesTable`.`PhoneID` DESC
SEPARATOR ';'
) AS `Phones`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
`Properties`.`Address1`,
`Properties`.`city`,
`Properties`.`state`,
`Properties`.`PostalCode5`,
...(17 more `People` Columns)...,
FROM `T_Import` AS `Import`
LEFT JOIN `T_CustomerStorageJoin` AS `CustomerJoin`
ON `Import`.`UniqueID` = `CustomerJoin`.`ImportID`
LEFT JOIN `T_People` AS `People`
ON `CustomerJoin`.`PersID`=`People`.`PeopleID`
LEFT JOIN `T_JoinPeopleIDPhoneID` AS `PeIDPhID`
ON `People`.`PeopleID` = `PeIDPhID`.`PeopleID`
LEFT JOIN `T_Phone` AS `PhonesTable`
ON `PeIDPhID`.`PhoneID`=`PhonesTable`.`PhoneID`
LEFT JOIN `T_JoinPeopleIDEmailID` AS `PeIDEmID`
ON `People`.`PeopleID` = `PeIDEmID`.`PeopleID`
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
LEFT JOIN `T_JoinPeopleIDPropertyID` AS `PeIDPrID`
ON `People`.`PeopleID` = `PeIDPrID`.`PeopleID`
AND `PeIDPrID`.`PropertyCP`='CurrentImported'
LEFT JOIN `T_Property` AS `Properties`
ON `PeIDPrID`.`PropertyID`=`Properties`.`PropertyID`
WHERE `Import`.`CustomerCollectionID`=$ccID
AND `RowNum` >= $rnOffset
AND `RowNum` < $rnLimit
GROUP BY `RowNum`;
So I have indexes on every ON segment, and the WHERE segment. When RowNumber is like around 0->2500 in value, the query runs great and executes within a couple seconds. But it seems like the query execution time exponentially multiplies the larger RowNumber gets.
I have an EXPLAIN below, and at pastebin (https://pastebin.com/PksYB4n2):
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE Import NULL ref CustomerCollectionID,RowNumIndex CustomerCollectionID 4 const 48108 8.74 Using index condition; Using where; Using filesort;
1 SIMPLE CustomerJoin NULL ref ImportID ImportID 4 MyDatabase.Import.UniqueID 1 100 NULL
1 SIMPLE People NULL eq_ref PRIMARY,PeopleID PRIMARY 4 MyDatabase.CustomerJoin.PersID 1 100 NULL
1 SIMPLE PeIDPhID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 8 100 NULL
1 SIMPLE PhonesTable NULL eq_ref PRIMARY,PhoneID,PhoneID_2 PRIMARY 4 MyDatabase.PeIDPhID.PhoneID 1 100 NULL
1 SIMPLE PeIDEmID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 5 100 NULL
1 SIMPLE EmailsTable NULL eq_ref PRIMARY,EmailID,DupeDeleteSelect PRIMARY 4 MyDatabase.PeIDEmID.EmailID 1 100 NULL
1 SIMPLE PeIDPrID NULL ref PeopleMSCP,PeopleID,PropertyCP PeopleMSCP 5 MyDatabase.People.PeopleID 4 100 Using where
1 SIMPLE Properties NULL eq_ref PRIMARY,PropertyID PRIMARY 4 MyDatabase.PeIDPrID.PropertyID 1 100 NULL
I apologize if the formatting is absolutely terrible. I'm not sure what good formatting looks like so I may have jumbled it a bit on accident, plus the tabs got screwed up.
What I want to know is how to speed up the query time. The databases are very large, in the tens of millions of rows. They aren't always like this, as our tables are constantly changing, but I would like to be able to handle it when they are.
I tried using LIMIT 2000, 1000 for example, but I know that it's less efficient than using an indexed column, so I switched over to RowNum. I feel like this was a good decision, but it seems like MySQL is still looping over every single row before the offset, which kind of defeats the purpose of my index... I think? I'm not sure. I also tried splitting this particular query into about 10 individual queries and running them one by one, for each row of the Excel file; that takes a LONG time... TOO LONG. The combined query is fast by comparison, but obviously I'm still having a problem.
Any help would be greatly appreciated, and thank you ahead of time. I'm sorry again for my lack of post organization.
The order of the columns in an index matters. The order of the clauses in WHERE does not matter (usually).
INDEX(a), INDEX(b) is not the same as the "composite" INDEX(a,b). I deliberately made composite indexes where they seemed useful.
INDEX(a,b) and INDEX(b,a) are not interchangeable unless both a and b are tested with =. (Plus a few exceptions.)
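A tiny illustration of the column-order point, using a hypothetical table t(a, b, c):
CREATE INDEX ab ON t (a, b);
-- Can use index ab: '=' on the leading column a, then a range on b
SELECT c FROM t WHERE a = 1 AND b > 5;
-- Cannot use index ab effectively: the leading column a is not constrained
SELECT c FROM t WHERE b > 5;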
A "covering" index is one where all the columns for the one table are found in the one index. This sometimes provides an extra performance boost. Some of my recommended indexes are "covering". It implies that only the index BTree need be accessed, not also the data BTree; this is where it picks up some speed.
In EXPLAIN SELECT ... a "covering" index is indicated by "Using index" (which is not the same as "Using index condition"). (Your Explain shows no covering indexes currently.)
An index 'should not' have more than 5 columns. (This is not a hard and fast rule.) T5's index had 5 columns to be covering; it was not practical to make a covering index for T2.
When JOINing, the order of the tables does not matter; the Optimizer is free to shuffle them around. However, these "rules" apply:
A LEFT JOIN may force ordering of the tables. (I think it does in this case.) (I ordered the columns based on what I think the Optimizer wants; there may be some flexibility.)
The WHERE clause usually determines which table to "start with". (You test on T1 only, so obviously it will start with T1.)
The "next table" to be referenced (via NLJ - Nested Loop Join) is determined by a variety of things. (In your case it is pretty obvious -- namely the ON column(s).)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Revised Query
1. Import: (CustomerCollectionID,  -- '=' comes first
            RowNum,                -- 'range'
            UniqueID)              -- 'covering'
   Import shows up in WHERE, so it is first in the EXPLAIN; also due to the LEFTs.
Properties: (PropertyID)  -- is that the PK?
PeIDPrID: (PropertyCP, PeopleID, PropertyID)
3. People: (PeopleID)
   I assume that is the `PRIMARY KEY`? (Too many columns for "covering".)
   (Since `People` leads to 3 other tables, I won't number the rest.)
EmailsTable: (EmailID, Email)
PeIDEmID: (PeopleID,  -- JOIN from People
           EmailID)   -- covering
PhonesTable: (PhoneID, Type, Phone)
PeIDPhID: (PeopleID, PhoneID)
2. CustomerJoin: (ImportID,  -- coming from `Import` (see ON...)
                  PersID)    -- covering
After adding those, I expect most lines of EXPLAIN to say Using index.
The lack of at least a composite index on Import is the main problem leading to your performance complaint.
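As a concrete sketch, the composite index on Import described above could be created like this (the index name is made up; the columns come from the query's WHERE, range, and SELECT list):
ALTER TABLE T_Import
    ADD INDEX ix_ccid_rownum_uid (CustomerCollectionID, RowNum, UniqueID);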
Bad GROUP BY
When there is a GROUP BY that does not include all the non-aggregated columns that are not directly dependent on the group by column(s), you get random values for the extras. I see from the EXPLAIN ("Rows") that several tables probably have multiple rows. You really ought to think about the garbage being generated by this query.
Curiously, Phones and Emails are fed into GROUP_CONCAT(), thereby avoiding the above issue, but the "Rows" is only 1.
(Read about ONLY_FULL_GROUP_BY; it might explain the issue better.)
(I'm listing this as a separate Answer since it is orthogonal to my other Answer.)
I call this the "explode-implode" syndrome. The query does a JOIN, getting a bunch of rows, thereby generating several rows, and puts multiple rows into an intermediate table. Then the GROUP BY implodes back to down to the original set of rows.
Let me focus on a portion of the query that could be reformulated to provide a performance improvement:
SELECT ...
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
...
FROM ...
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
...
GROUP BY `RowNum`;
Instead, move the table and aggregation function into a subquery
SELECT ...
( SELECT GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `Email`)
ORDER BY `EmailID` DESC
SEPARATOR ';' )
FROM T_Email
WHERE `PeIDEmID`.`EmailID` = `EmailID`
) AS `Emails`,
...
FROM ...
-- and Remove: LEFT JOIN `T_Email` ON ...
...
-- and possibly Remove: GROUP BY ...;
Ditto for PhonesTable.
(It is unclear whether the GROUP BY can be removed; other things may need it.)
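For completeness, a sketch of the analogous rewrite for the phones, following the same pattern (column names are taken from the original query):
( SELECT GROUP_CONCAT(
           DISTINCT CONCAT_WS(',', `Phone`, `Type`)
           ORDER BY `PhoneID` DESC
           SEPARATOR ';' )
  FROM `T_Phone`
  WHERE `PeIDPhID`.`PhoneID` = `PhoneID`
) AS `Phones`,
-- and Remove: LEFT JOIN `T_Phone` ON ...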

Optimize MySQL Index Multiple Table JOIN

I have 5 tables in MySQL, and when I execute my query it takes too long.
Here are my tables and their row counts (the full structures were shown as screenshots in the original post):
Reciept: 23,799,640 rows
reciept_goods: 39,398,989 rows
good: 17,514 rows
good_categories: 121 rows
retail_category: 10 rows
My indexes:
Date: reciept.date
reciept_goods_index: reciept_goods.recieptId, reciept_goods.shopId, reciept_goods.goodId
category_id: good.category_id
Here is my SQL query:
SELECT
R.shopId,
sales,
sum(Amount) as sum_amount,
count(distinct R.id) as count_reciept,
RC.id,
RC.name
FROM
reciept R
JOIN reciept_goods RG
ON R.id = RG.RecieptId
AND R.ShopID = RG.ShopId
JOIN good G
ON RG.GoodId = G.id
JOIN good_categories GC
ON G.category_id = GC.id
JOIN retail_category RC
ON GC.retail_category_id = RC.id
WHERE
R.date >= '2018-01-01 10:00:00'
GROUP BY
R.shopId,
R.sales,
RC.id
EXPLAIN on this query gives the result shown in the screenshot in the original post, with an execution time of 236 sec.
If I use straight_join good ON (good.id = reciept_goods.GoodId), i.e. SELECT STRAIGHT_JOIN ... rest of query, the plan changes and the execution time drops to 31 sec.
I think the problem is in the indexes of my tables, but I don't understand how to fix them. Can someone help me?
With only about 2% of the rows in reciept matching the date condition, the 2nd execution plan chosen (with straight_join) seems to be the right join order. You should be able to optimize it by adding the following covering indexes:
reciept(date, sales)
reciept_goods(recieptId, shopId, goodId, amount)
I assume that the column order in your primary key for reciept_goods currently is (goodId, recieptId, shopId) (or (goodId, shopId, recieptId)). You could change that to (recieptId, shopId, goodId) (and if you look at e.g. the table name, you may have wanted to do this anyway); in that case, you do not need the 2nd index (at least for this query). I would assume that this primary key made MySQL take the slower execution plan (of course assuming that it would otherwise be faster) - although sometimes it's just bad statistics, especially on a test server.
With those covering indexes, MySQL should take the faster explain plan even without straight_join, if it doesn't, just add it again (although I would like a look at both executions plans then). Also check that those two new indexes are used in the explain plan, otherwise I may have missed a column.
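Spelled out as DDL, a sketch of those two covering indexes (the index names are made up; `date` is backquoted just to be safe):
ALTER TABLE reciept ADD INDEX ix_date_sales (`date`, sales);
ALTER TABLE reciept_goods ADD INDEX ix_rec_shop_good_amount (recieptId, shopId, goodId, amount);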
It looks like you are depending on walking through a couple of many:many tables? Many people design them inefficiently.
Here I have compiled a list of 7 tips on making mapping tables more efficient. The most important is use of composite indexes.
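The gist of that advice, as a hedged sketch (table and column names are made up; this is not the linked list verbatim): a mapping table usually needs no surrogate AUTO_INCREMENT id, just a composite primary key plus the reverse composite index:
CREATE TABLE good_category_map (
  good_id     INT UNSIGNED NOT NULL,
  category_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (good_id, category_id),  -- covers lookups from good to category
  INDEX (category_id, good_id)         -- covers lookups from category to good
) ENGINE=InnoDB;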

Optimizing Update Query with compound index

I tested an update between two large tables (~5 mil records each) which was taking 10 seconds or so per update. So, using EXPLAIN for my very first time, I tested the select:
SELECT
T1.Z, T2.Z
FROM
TableB T1
INNER JOIN TableL T2
on T1.Name=T2.Name
and T1.C=T2.C
and T1.S=T2.S
and T1.Number>=T2.MinNumber
and T1.Number<=T2.MaxNumber
Explain returned the following as possible keys:
Name
C
S
Number
and chose C as the key.
I was told that my best bet was to make a compound index, with the columns in the order used in the select, so I did:
Alter Table TableB Add Index Compound (Name,C,S,Number)
And I did an EXPLAIN again, hoping it would choose my compound index, but even though it now shows the compound index as a possible key, it still chooses index C.
I read that I can force the index I want with:
SELECT
T1.Z, T2.Z
FROM TableB T1 Force Index(Compound)
INNER JOIN TableL T2
on T1.Name=T2.Name
and T1.C=T2.C
and T1.S=T2.S
and T1.Number>=T2.MinNumber
and T1.Number<=T2.MaxNumber
yet I am not sure it makes any sense to override MySQL's selection, and given that the update is going to take almost two years if this doesn't help, it doesn't seem like a smart thing to test blindly.
Is there some step I am missing? Do I need to remove the other keys so that it chooses my compound one, and if so, how will I know whether it will even make a difference (given that MySQL saw it and rejected it)?
Explain output on T1: (note: I did not yet add the Compound Index as the table is huge and it might be wasted time until I figure this out. I previously added it on a highly truncated version of the table but that won't help with this explain)
Table1
select_type: simple
type: ref
possible_keys:
Number,C,S,Name
key: Name
key_len: 303
ref: func
rows: 4
Extra: using where
Explain for Table2
select_type: SIMPLE
type: ALL
possible_Keys: MinNumber, MaxNumber
key:
key_length:
ref:
rows: 5,447,100
Extra:
Cardinality (only showing indexes relevant here as there are a few others):
Primary: 5139680
Name: 1284920
Number: 57749
C: 7002
S: 21
So based on some great comments/input I came up with a solution. One flashbulb input from Paul Spiegel was that trying to join two 5+mil tables using several VarChar fields was not recommended.
So what I did was create a `Unique` table with ID and UniqueRecord fields.
I then made the UniqueRecord a Unique Index.
I inserted into that table from both Table1 and Table2 as:
Insert IGNORE into `Unique` (UniqueRecord)
Select Concat(Name,C,S) from Table1 Group by Name,C,S;
Insert IGNORE into `Unique` (UniqueRecord)
Select Concat(Name,C,S) from Table2 Group by Name,C,S
This gave me unique records from both within and between the two tables.
I then added a UniqueRecord_ID field to both Table1 and Table2.
I then did a join between each table and the UniqueRecord to write the UniqueRecord ID to each table:
Update Table1 as T1
Inner Join Unique as T2
On Concat(T1.Name,T1.S,T1.C) = T2.UniqueRecord
Set T1.UniqueRecord_ID=T2.ID
Finally, I added a key to each table on UniqueRecord_ID.
My EXPLAIN showed that it only used that key on T2; however, whereas the select was previously taking about 10 seconds per record (I tested on 1, 10, and 100 records and stopped there, as I did not have the requisite 578 days to test the whole table :| ), the entire select, returning close to 5 million records, took 72 seconds.
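For reference, a sketch of what the final select presumably looked like after the rewrite (my reconstruction from the description above, not the poster's exact query):
SELECT T1.Z, T2.Z
FROM TableB T1
INNER JOIN TableL T2
    ON T1.UniqueRecord_ID = T2.UniqueRecord_ID
   AND T1.Number >= T2.MinNumber
   AND T1.Number <= T2.MaxNumber;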
Note that the first table (whichever one it is) must be fully scanned. So, the best we can do is to have a good index on the second table.
The optimal index (as already noted) for T1 is (Name,C,S,Number). For T2 it is (Name,C,S,MinNumber,MaxNumber), which is bulkier.
The optimizer seems to want to start with T1; perhaps it is slightly smaller. Let's force it to start with T2 by changing INNER JOIN to STRAIGHT_JOIN and swapping the order:
SELECT
T1.Z, T2.Z
FROM TableL T2 -- note
STRAIGHT_JOIN TableB T1 -- note
on T1.Name=T2.Name
and T1.C=T2.C
and T1.S=T2.S
and T1.Number>=T2.MinNumber
and T1.Number<=T2.MaxNumber
Then, let's do one more optimization: If Z is not 'too big', let's include it at the end of the index so that it becomes a "Covering index":
INDEX(Name,C,S,Number,Z)
(Name, C, S can be in any order, but Number, Z must be in that order and at the end.) If you currently have INDEX(Name), DROP it as being redundant.
Then the EXPLAIN will say that you are doing a full table scan of T2, plus a "Using index" on T1.
Please provide SHOW CREATE TABLE; there may be more optimizations.
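Concretely, the index change described above might look like this (the index name is made up; only drop the old single-column Name index if it actually exists, as the EXPLAIN suggests):
ALTER TABLE TableB
    ADD INDEX ix_name_c_s_number_z (Name, C, S, Number, Z),
    DROP INDEX Name;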

Large SQL database - solving efficiency

I have the following SQL query which, when I originally coded it, was exceptionally fast; it now takes over 1 second to complete:
SELECT counted/scount as ratio, [etc]
FROM
playlists
LEFT JOIN (
select AID, PLID FROM (SELECT AID, PLID FROM p_s ORDER BY `order` asc, PLSID desc)as g GROUP BY PLID
) as t USING(PLID)
INNER JOIN (
SELECT PLID, count(PLID) as scount from p_s LEFT JOIN audio USING(AID) WHERE removed='0' and verified='1' GROUP BY PLID
) as g USING(PLID)
LEFT JOIN (
select AID, count(AID) as counted FROM a_p_all WHERE ".time()." - playtime < 2678400 GROUP BY AID
) as r USING(AID)
LEFT JOIN audio USING (AID)
LEFT JOIN members USING (UID)
WHERE scount > 4 ORDER BY ratio desc
LIMIT 0, 20
I have identified the problem, the a_p_all table has over 500k rows. This is slowing down the query. I have come up with a solution:
Create a smaller temporary table, that only stores the data necessary, and deletes anything older than is needed.
However, is there a better method to use? Optimally I wouldn't need a temporary table; what do sites such as YouTube/Facebook do for large tables to keep query times fast?
edit
This is the EXPLAIN table for the query in the answer from #spencer7593
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived3> ALL NULL NULL NULL NULL 20
1 PRIMARY u eq_ref PRIMARY PRIMARY 8 q.AID 1 Using index
1 PRIMARY m eq_ref PRIMARY PRIMARY 8 q.UID 1 Using index
3 DERIVED <derived6> ALL NULL NULL NULL NULL 20
6 DERIVED t ALL NULL NULL NULL NULL 21
5 DEPENDENT SUBQUERY s ALL NULL NULL NULL NULL 49 Using where; Using filesort
4 DEPENDENT SUBQUERY c ALL NULL NULL NULL NULL 49 Using where
4 DEPENDENT SUBQUERY o eq_ref PRIMARY PRIMARY 8 database.c.AID 1 Using where
2 DEPENDENT SUBQUERY a ALL NULL NULL NULL NULL 510594 Using where
Two "big rock" issues stand out to me.
Firstly, this predicate
WHERE ".time()." - playtime < 2678400
(I'm assuming that this isn't the actual SQL being submitted to the database, but that what's being sent to the database is something like this...
WHERE 1409192073 - playtime < 2678400
such that we want only rows where playtime is within the past 31 days, i.e. within 31*24*60*60 seconds of the integer value returned by time().)
This predicate can't make use of a range scan operation on a suitable index on playtime. MySQL evaluates the expression on the left side for every row in the table (every row that isn't excluded by some other predicate), and the result of that expression is compared to the literal on the right.
To improve performance, rewrite the predicate so that the comparison is made on the bare column. Compare the value stored in the playtime column to an expression that needs to be evaluated only one time, for example:
WHERE playtime > 1409192073 - 2678400
With a suitable index available, MySQL can perform a "range" scan operation, and efficiently eliminate a boatload of rows that don't need to be evaluated.
The second "big rock" is the inline views, or "derived tables" in MySQL parlance. MySQL is much different than other databases in how inline views are processed. MySQL actually runs that innermost query, and stores the result set as a temporary MyISAM table, and then the outer query runs against the MyISAM table. (The name that MySQL uses, "derived table", makes sense when we understand how MySQL processes the inline view.) Also, MySQL does not "push" predicates down, from an outer query down into the view queries. And on the derived table, there are no indexes created. (I believe MySQL 5.7 is changing that, and does sometimes create indexes, to improve performance.) But large "derived tables" can have a significant performance impact.
Also, the LIMIT clause gets applied last in the statement processing; that's after all the rows in the resultset are prepared and sorted. Even if you are returning only 20 rows, MySQL still prepares the entire resultset; it just doesn't transfer them to the client.
Lots of the column references are not qualified with the table name or alias, so we don't know, for example, which table (p_s or audio) contains the removed and verified columns.
(We know it can't be both, since MySQL isn't throwing an "ambiguous column" error. But MySQL has access to the table definitions, whereas we don't. MySQL also knows something about the cardinality of the columns; in particular, which columns (or combinations of columns) are UNIQUE, and which columns can contain NULL values, etc.)
Best practice is to qualify ALL column references with the table name or (preferably) a table alias. (This makes it much easier on the human reading the SQL, and it also avoids a query from breaking when a new column is added to a table.)
Also, the query has a LIMIT clause, but there's no ORDER BY clause (or implied ORDER BY), which makes the resultset indeterminate. We don't have any guarantee of which rows will be the "first" rows returned.
EDIT
To return only 20 rows from playlists (out of thousands or more), I might try using correlated subqueries in the SELECT list; using a LIMIT clause in an inline view to winnow down the number of rows that I'd need to run the subqueries for. Correlated subqueries can eat your lunch (and your lunchbox too) in terms of performance with large sets, due to the number of times those need to be run.
From what I can gather, you are attempting to return 20 rows from playlists, picking up the related row from member (by the foreign key in playlists), finding the "first" song in the playlist; getting a count of times that "song" has been played in the past 31 days (from any playlist); getting the number of times a song appears on that playlist (as long as it's been verified and hasn't been removed... the outerness of that LEFT JOIN is negated by the predicates on the removed and verified columns, if either of those columns is from the audio table...).
I'd take a shot with something like this, to compare performance:
SELECT q.*
, ( SELECT COUNT(1)
FROM a_p_all a
WHERE a.playtime > 1409192073 - 2678400
AND a.AID = q.AID
) AS counted
FROM ( SELECT p.PLID
, p.UID
, p.[etc]
, ( SELECT COUNT(1)
FROM p_s c
JOIN audio o
ON o.AID = c.AID
AND o.removed='0'
AND o.verified='1'
WHERE c.PLID = p.PLID
) AS scount
, ( SELECT s.AID
FROM p_s s
WHERE s.PLID = p.PLID
ORDER BY s.`order` ASC, s.PLSID DESC
LIMIT 1
) AS AID
FROM ( SELECT t.PLID
, t.[etc]
FROM playlists t
ORDER BY NULL
LIMIT 20
) p
) q
LEFT JOIN audio u ON u.AID = q.AID
LEFT JOIN members m ON m.UID = q.UID
LIMIT 0, 20
UPDATE
Dude, the EXPLAIN output is showing that you don't have suitable indexes available. To get any decent chance at performance with the correlated subqueries, you're going to want to add some indexes, e.g.
... ON a_p_all (AID, playtime)
... ON p_s (PLID, order, PLSID, AID)
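Written out, a hedged sketch of those two indexes (the index names are made up; note that order is a reserved word and needs backquotes):
ALTER TABLE a_p_all ADD INDEX ix_aid_playtime (AID, playtime);
ALTER TABLE p_s ADD INDEX ix_plid_order_plsid_aid (PLID, `order`, PLSID, AID);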

MySQL performance, inner join, how to avoid Using temporary and filesort

I have table 1 and table 2.
Table 1
PARTNUM - ID_BRAND
partnum is the primary key
id_brand is "indexed"
Table 2
ID_BRAND - BRAND_NAME
id_brand is the primary key
brand_name is "indexed"
Table 1 contains 1 million records and table 2 contains 1,000 records.
I'm trying to optimize the query below using EXPLAIN, and after a lot of trying I have reached a dead end.
EXPLAIN
SELECT pm.partnum, pb.brand_name
FROM products_main AS pm
LEFT JOIN products_brands AS pb ON pm.id_brand=pb.id_brand
ORDER BY pb.brand ASC
LIMIT 0, 10
The query returns this execution plan:
ID, SELECT_TYPE, TABLE, TYPE, POSSIBLE_KEYS, KEY, KEY_LEN , REF, ROWS, EXTRA
1, SIMPLE, pm, range, PRIMARY, PRIMARY, 1, , 1000000, Using where; Using temporary; Using filesort
1, SIMPLE, pb, ref, PRIMARY, PRIMARY, 4, demo.pm.id_pbrand, 1,
The MySQL query optimizer shows a temporary + filesort in the execution plan.
How can I avoid this?
The "EVIL" is in the ORDER BY pb.brand ASC. Ordering by that external field seems to be the bottleneck..
First of all, I question the use of an outer join, seeing as the ORDER BY operates on the right-hand side and the NULLs injected by the left join are likely to play havoc with it.
Regardless, the simplest approach to speeding up this query would be a covering index on pb.id_brand and pb.brand. This will allow the order by to be evaluated 'using index' with the join condition. The alternative is to find some way to reduce the size of the intermediate result passed to the order-by.
Still, the combination of outer-join, order-by, and limit, leaves me wondering what exactly you are querying for, and if there might not be a better way of expressing the query itself.
Try replacing the join with a subquery. MySQL's optimizer kind of sucks; subqueries often give better performance than joins.
First, try changing your index on the products_brands table. Delete the existing one on brand_name, and create a new one:
ALTER TABLE products_brands ADD INDEX newIdx (brand_name, id_brand)
Then, the table will already have a "orderedByBrandName" index with the ids you need for the join, and you can try:
EXPLAIN
SELECT pb.brand_name, pm.partnum
FROM products_brands AS pb
LEFT JOIN products_main AS pm ON pb.id_brand = pm.id_brand
LIMIT 0, 10
Note that I also changed the order of the tables in the query, so you start with the small one.
This question is somewhat outdated, but I did find it, and so will other people.
MySQL uses a temporary table if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
So you just need to have the join order reversed by using STRAIGHT_JOIN, to bypass the order chosen by the optimizer:
SELECT STRAIGHT_JOIN pm.partnum, pb.brand_name
FROM products_brands AS pb
RIGHT JOIN products_main AS pm ON pm.id_brand=pb.id_brand
ORDER BY pb.brand ASC
LIMIT 0, 10
Also make sure that the max_heap_table_size and tmp_table_size variables are set to a number big enough to store the results:
SET global tmp_table_size=100000000;
SET global max_heap_table_size=100000000;
-- 100 megabytes in this example. These can be set in my.cnf config file, too.