Can you index subqueries? - mysql

I have a table and a query that look like the ones below. For a working example, see this SQL Fiddle.
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN
(
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B
) w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
With my real data, this query takes 27 seconds. However, if I first create w as a temporary table and index property_B, the whole thing takes about 1 second.
CREATE TEMPORARY TABLE w AS
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B;
ALTER TABLE w ADD INDEX `property_B_idx` (property_B);
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
DROP TABLE IF EXISTS w;
Is there a way to combine the best of these two queries? I.e. a single query with the speed advantages of the indexing in the subquery?
EDIT
After Mehran's answer below, I read this piece of explanation in the MySQL documentation:
As of MySQL 5.6.3, the optimizer more efficiently handles subqueries in the FROM clause (that is, derived tables):
...
For cases when materialization is required for a subquery in the FROM clause, the optimizer may speed up access to the result by adding an index to the materialized table. If such an index would permit ref access to the table, it can greatly reduce the amount of data that must be read during query execution. Consider the following query:
SELECT * FROM t1
JOIN (SELECT * FROM t2) AS derived_t2 ON t1.f1=derived_t2.f1;
The optimizer constructs an index over column f1 from derived_t2 if doing so would permit the use of ref access for the lowest cost execution plan. After adding the index, the optimizer can treat the materialized derived table the same as a usual table with an index, and it benefits similarly from the generated index. The overhead of index creation is negligible compared to the cost of query execution without the index. If ref access would result in higher cost than some other access method, no index is created and the optimizer loses nothing.

First of all, you need to know that creating a temporary table is absolutely a feasible solution, but only for cases in which no other choice is applicable, which is not true here!
In your case, you can easily boost your query, as FrankPl pointed out, because your sub-query and main query both group by the same field, so you don't need any sub-queries at all. I'm going to copy and paste FrankPl's solution for the sake of completeness:
SELECT o.property_B, SUM(o.score1), SUM(o.score2)
FROM o
GROUP BY property_B;
Yet that doesn't mean it's impossible to come across a scenario in which you wish you could index a sub-query. In such cases you've got two choices. The first is using a temporary table, as you pointed out yourself, to hold the results of the sub-query. This solution is advantageous since it has been supported by MySQL for a long time; it's just not feasible when a huge amount of data is involved.
The second solution is using MySQL version 5.6 or above. Recent versions incorporate new algorithms, so the optimizer can add an index to a materialized sub-query (derived table) and use that index for the join outside the sub-query.
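On 5.6.3 or later you can check whether the optimizer actually did this by running EXPLAIN on the original query; if an index was added to the materialized derived table, its row in the plan typically shows a key named <auto_key0>. A minimal sketch (the exact plan depends on your data and version):
EXPLAIN
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN
(
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B
) w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
-- A row for <derived2> with key = <auto_key0> indicates the auto-generated index was used.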
[UPDATE]
For the edited version of the question I would recommend the following solution:
SELECT o.property_B, SUM(IF(o.property_A = 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B
HAVING SUM(IF(o.property_A = 'specific_A', o.score1, 0)) > 0;
But you need to work on the HAVING part. You might need to change it according to your actual problem.
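One possible variant, assuming the goal is to keep only the property_B groups that contain at least one 'specific_A' row (which mirrors what the original INNER JOIN plus WHERE filter returns), is to count matching rows in the HAVING clause instead of summing their scores:
SELECT o.property_B, SUM(IF(o.property_A = 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B
HAVING SUM(o.property_A = 'specific_A') > 0; -- counts rows where property_A matches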

I am not really that familiar with MySQL; I have mostly worked with Oracle.
If you want a condition inside the SUM, you can use DECODE or CASE.
It would look something like this:
SELECT o.property_B, SUM(decode(property_A, 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B;
or with case
SELECT o.property_B, SUM(CASE
WHEN property_A = 'specific_A' THEN o.score1
ELSE 0
END ),
SUM(o.score2)
FROM o
GROUP BY property_B;

I do not see why you would need the join at all. I would assume that
SELECT o.property_B, SUM(o.score1), SUM(o.score2)
FROM o
GROUP BY property_B;
should give what you want, but with a much simpler statement that is hence easier to optimize.

It should be the duty of MySQL to optimise your query, and I don't think there is a way to create an index on the fly. However, you can try to force the use of an index on property_B (if you have one). See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
Also, you can merge the create and alter statements, if you prefer.
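A sketch of what that merged statement could look like, using MySQL's CREATE TABLE ... SELECT syntax with an inline index definition (the index name is the one from the question):
CREATE TEMPORARY TABLE w (INDEX property_B_idx (property_B))
AS SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B;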

Related

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL query. I'm developing a PHP website, and to avoid making too many queries, I prefer to make one big one that looks like this:
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here are the approximate row counts for each table:
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in a dev environment) by deleting old data, but the query still needs 10-15 seconds to execute (which is way too long for a website).
Thanks for your help.
Let's take a look.
The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL generally can't exploit more than one index per table in a single query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_joueur, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third columns in ON conditions.
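A sketch of that suggested index (the index name idx_membre_cartes is just an example; the columns are taken from the query in the question):
ALTER TABLE cartes_joueur
ADD INDEX idx_membre_cartes (ID_membre, ID_carte_joueur, ID_carte_joueur_base);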
I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_joueur and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than an ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to rows in the mercato_encheres table. The mismatching rows get NULL values for the columns of the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.
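Applied to the query from the question (keeping SELECT * for brevity, although enumerating the columns you need is better, as noted in point 1), the rewrite might look roughly like this:
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
LEFT JOIN mercato_encheres mench
ON mench.ID_mercato = m.ID_mercato
where mench.ID_mercato IS NULL
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc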

Query Speed Issue with NOT EXISTS condition

I have a query that works, but it is slow. Is there a way to speed this up? Basically I have a table with timecard entries, and a second table with time breakdowns of each entry, related by the TimeCardID. What I am looking for is time blocks that have no breakdowns. I thought that if I cut the criteria down to 2 months it would speed things up. Thanks for your help.
SELECT * FROM Timecards
WHERE NOT EXISTS (SELECT TimeCardID FROM TimecardBreakdown WHERE Timecards.ID = TimecardBreakdown.TimeCardID)
AND Status <> 0
AND DateIn >= CURRENT_DATE() - INTERVAL 2 MONTH
It seems you want to know the TimecardIDs which do not exist in the TimecardBreakdown table, in which case you can use the left outer join.
SELECT a.*
FROM Timecards a
LEFT OUTER JOIN TimecardBreakdown b ON a.ID = b.TimeCardID
WHERE b.TimeCardID IS NULL
This would get rid of the subquery (which is expensive) and use join (which is more efficient).
MySQL is not good at running correlated subqueries quickly. Try to make your subqueries independent and join them. You can use the LEFT JOIN ... IS NULL pattern to replace WHERE NOT EXISTS.
SELECT tc.*
FROM Timecards tc
LEFT JOIN TimecardBreakdown tcb ON tc.ID = tcb.TimeCardId
WHERE tc.DateIn >= CURRENT_DATE() - INTERVAL 2 MONTH
AND tc.Status <> 0
AND tcb.TimeCardId IS NULL
Some optimization points.
First, if you can change tc.Status <> 0 to tc.Status > 0 it makes an index range scan possible on that column.
Second, when you're optimizing stuff, SELECT * is considered harmful. Instead, if you can give the names of just the columns you need, things will be quicker. The database server has to sling around all the data you ask for; it can't tell if you're going to ignore some of it.
Third, this query will be helped by a compound index on Timecards (DateIn, Status, ID). That compound index can be used to do the heavy lifting of satisfying your query conditions.
That's called a covering index; it contains the data needed to satisfy much of your query. If you were to index just the DateIn column, then the query handler would have to bounce back to the main table to find the values of Status and ID. When those columns appear in the index, it saves that extra operation.
If you SELECT a certain set of columns rather than doing SELECT *, including those columns in the covering index can dramatically improve query performance. That's one of several reasons SELECT * is considered harmful.
(Some makes and model of DBMS have ways to specify lists of columns to ride along on indexes without actually indexing them. MySQL requires you to index them. But covering indexes still help.)
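A sketch of that covering index on Timecards (the index name is just an example):
ALTER TABLE Timecards
ADD INDEX idx_datein_status_id (DateIn, Status, ID);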
Read this: http://use-the-index-luke.com/

Optimize MySQL query for group_concat function

SELECT SQL_NO_CACHE link.stop, stop.common_name, locality.name, stop.bearing, stop.latitude, stop.longitude
FROM service
JOIN pattern ON pattern.service = service.code
JOIN link ON link.section = pattern.section
JOIN naptan.stop ON stop.atco_code = link.stop
JOIN naptan.locality ON locality.code = stop.nptg_locality_ref
GROUP BY link.stop
The above query takes roughly 800ms - 1000ms to run.
If I append a group_concat statement the query then takes 8 - 10 seconds:
SELECT SQL_NO_CACHE link.stop, stop.common_name, locality.name, stop.bearing, stop.latitude, stop.longitude, group_concat(service.line) lines
How can I change this query so that it runs in less than 2 seconds with the group_concat statement?
SQL Fiddle: http://sqlfiddle.com/#!9/414fe
EXPLAIN statements for both queries: http://i.imgur.com/qrURgzV.png
How long does this query take?
SELECT p.section, GROUP_CONCAT(s.line)
FROM pattern p join
service s
ON p.service = s.code
GROUP BY p.section
I am thinking that you can do the group_concat() in a subquery, so the outer query does not need an aggregation. This can speed up queries when there is one table in the subquery. In your case, there are two.
The final results would be something like the following, correlating the subquery on link.section = pattern.section:
SELECT SQL_NO_CACHE . . .,
(SELECT GROUP_CONCAT(s.line)
FROM pattern p join
service s
ON p.service = s.code
WHERE p.section = link.section
) as lines
FROM link JOIN
naptan.stop
ON stop.atco_code = link.stop JOIN
naptan.locality
ON locality.code = stop.nptg_locality_ref;
For this query, you want the following additional indexes: pattern(section, service) and service(code, line).
I don't know if this will work, but it is worth a try.
Note: this is assuming that you really don't need the group by for the rest of the columns.
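If you do try it, here is a sketch of the additional indexes suggested above (the index names are just examples):
ALTER TABLE pattern ADD INDEX idx_section_service (section, service);
ALTER TABLE service ADD INDEX idx_code_line (code, line);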
A remark: You're using the nonstandard MySQL extension to GROUP BY. It happens to work for you because link.stop is joined to stop.atco_code, which itself is a primary key. But you need to be very careful with this.
I suggest you add some compound indexes. You join in to pattern on service and join out based on section. So add this index.
ALTER TABLE pattern ADD INDEX service_section (service, section, line);
This will let the query use just the index, and not have to hit the table itself to retrieve the information needed for the JOIN or your GROUP_CONCAT() operation. (You might also delete the index on just service; this new index makes it redundant.)
Similarly, you want to create an index (section, stop) on the link table, and get rid of the index on just section.
On stop, you're using most of the columns, and you already have an index (PK) on atco_code, so let this one be.
Finally, on locality put an index on (code,name).
All this indexing monkey business should cut down the amount of work MySQL must do to satisfy your query.
Now look, as soon as you add WHERE anything = anything to the query, you may need to add a column to one or more of these indexes. You definitely should read up on multi-column indexing and grouping; good indexing is a critical success factor for your kind of data.
You should also run ANALYZE TABLE xxxx on each of your tables after inserting lots of rows, to make sure the query optimizer can see appropriate information about the content of the table and indexes.
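For example, using the table names from the query above:
ANALYZE TABLE pattern, link, service, naptan.stop, naptan.locality;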

How do I tell the MySQL Optimizer to use the index on a derived table?

Suppose you have a query like this...
SELECT T.TaskID, T.TaskName, TAU.AssignedUsers
FROM `tasks` T
LEFT OUTER JOIN (
SELECT TaskID, GROUP_CONCAT(U.FirstName, ' ',
U.LastName SEPARATOR ', ') AS AssignedUsers
FROM `tasks_assigned_users` TAU
INNER JOIN `users` U ON (TAU.UserID=U.UserID)
GROUP BY TaskID
) TAU ON (T.TaskID=TAU.TaskID)
Multiple people can be assigned to a given task. The purpose of this query is to show one row per task, but with the people assigned to the task in a single column.
Now... suppose you have the proper indexes set up on tasks, users, and tasks_assigned_users. The MySQL Optimizer will still not use the TaskID index when joining tasks to the derived table. WTF?!?!?
So, my question is... how can you make this query use the index on tasks_assigned_users.TaskID? Temporary tables are lame, so if that's the only solution... the MySQL Optimizer is stupid.
Indexes used:
tasks
PRIMARY - TaskID
users
PRIMARY - UserID
tasks_assigned_users
PRIMARY - (TaskID,UserID)
Additional index UNIQUE - (UserID,TaskID)
EDIT: Also, this page says that derived tables are executed/materialized before joins occur. Why not re-use the keys to perform the join?
EDIT 2: MySQL Optimizer won't let you put index hints on derived tables (presumably because there are no indexes on derived tables)
EDIT 3: Here is a really nice blog post about this: http://venublog.com/2010/03/06/how-to-improve-subqueries-derived-tables-performance/ Notice that Case #2 is the solution I'm looking for, but it appears that MySQL does not support this at this time. :(
EDIT 4: Just found this: "As of MySQL 5.6.3, the optimizer more efficiently handles subqueries in the FROM clause (that is, derived tables):... During query execution, the optimizer may add an index to a derived table to speed up row retrieval from it." Seems promising...
There is a solution to this in MySQL Server 5.6 - the preview release (at the time of this writing).
http://dev.mysql.com/doc/refman/5.6/en/from-clause-subquery-optimization.html
Although, I'm not sure if the MySQL Optimizer will re-use indexes that already exist when it "adds indexes to the derived table"
Consider the following query:
SELECT * FROM t1
JOIN (SELECT * FROM t2) AS derived_t2 ON t1.f1=derived_t2.f1;
The documentation says: "The optimizer constructs an index over column f1 from derived_t2 if doing so would permit the use of ref access for the lowest cost execution plan."
OK, that's great, but does the optimizer re-use indexes from t2? In other words, what if an index existed for t2.f1? Does this index get re-used, or does the optimizer recreate this index for the derived table? Who knows?
EDIT: The best solution until MySQL 5.6 is to create a temporary table, create an index on that table, and then run the SELECT query on the temp table.
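A sketch of that pre-5.6 workaround, applied to the query above (the temporary table name task_users is just an example):
CREATE TEMPORARY TABLE task_users (INDEX (TaskID))
AS
SELECT TAU.TaskID, GROUP_CONCAT(U.FirstName, ' ', U.LastName SEPARATOR ', ') AS AssignedUsers
FROM `tasks_assigned_users` TAU
INNER JOIN `users` U ON (TAU.UserID=U.UserID)
GROUP BY TAU.TaskID;

SELECT T.TaskID, T.TaskName, TU.AssignedUsers
FROM `tasks` T
LEFT OUTER JOIN task_users TU ON (T.TaskID=TU.TaskID);

DROP TEMPORARY TABLE IF EXISTS task_users;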
The problem I see is that by doing a subquery there is no underlying indexed table.
If you are having a performance problem, I'd do the grouping at the end, something like this:
SELECT T.TaskID, T.TaskName, GROUP_CONCAT(U.FirstName, ' ', U.LastName SEPARATOR ', ') AS AssignedUsers
FROM `tasks` T
LEFT OUTER JOIN `tasks_assigned_users` TAU ON (T.TaskID=TAU.TaskID)
INNER JOIN `users` U ON (TAU.UserID=U.UserID)
GROUP BY T.TaskID, T.TaskName
I'm afraid, it's not possible. You have to create a temporary table or a view to use an index.

Optimize slow SQL query using indexes

I have a problem optimizing a really slow SQL query. I think it is an index problem, but I can't find which index I have to apply.
This is the query:
SELECT
cl.ID, cl.title, cl.text, cl.price, cl.URL, cl.ID AS ad_id, cl.cat_id,
pix.file_name, area.area_name, qn.quarter_name
FROM classifieds cl
/*FORCE INDEX (date_created) */
INNER JOIN classifieds_pix pix ON cl.ID = pix.classified_id AND pix.picture_no = 0
INNER JOIN zip_codes zip ON cl.zip_id = zip.zip_id AND zip.area_id = 132
INNER JOIN area_names area ON zip.area_id = area.id
LEFT JOIN quarter_names qn ON zip.quarter_id = qn.id
WHERE
cl.confirmed = 1
AND cl.country = 'DE'
AND cl.date_created <= NOW() - INTERVAL 1 DAY
ORDER BY
cl.date_created
desc LIMIT 7
MySQL takes about 2 seconds to get the result, starting its work on pix.picture_no, but if I force the date_created index the query goes much faster and takes only 0.030 s. The problem is that the "INNER JOIN zip_codes..." part is not always in the query, and when it is not, the forced index makes the query slow again.
I've been thinking of building the query conditionally in PHP as a workaround, but I would like to understand what the problem with the indexes is.
Here are several suggestions on how to optimize your query.
NOW Function - You're using the NOW() function in your WHERE clause. Instead, I recommend using a constant date / timestamp, to allow the value to be cached and optimized. Otherwise, the value of NOW() will be evaluated for each row in the WHERE clause. If you need a dynamic value, an alternative is to supply it from the application (for example, calculate the current timestamp and inject it into the query as a constant before executing the query).
To test this recommendation before implementing this change, just replace NOW() with a constant timestamp and check for performance improvements.
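For example, with a placeholder literal in place of NOW() - INTERVAL 1 DAY (the actual value would be computed and injected by the application):
SELECT cl.ID, cl.title
FROM classifieds cl
WHERE cl.confirmed = 1
AND cl.country = 'DE'
AND cl.date_created <= '2017-01-01 00:00:00'
ORDER BY cl.date_created desc
LIMIT 7;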
Indexes - in general, I would suggest adding an index that contains all columns of your WHERE clause, in this case: confirmed, country, date_created. Start with the column that will cut down the amount of data the most and move on from there. Make sure you adjust the WHERE clause to the same order as the index, otherwise the index won't be used.
I used EverSQL SQL Query Optimizer to get these recommendations (disclaimer: I'm a co-founder of EverSQL and humbly provide these suggestions).
I would actually have a compound index on all elements of your WHERE clause, such as
(country, confirmed, date_created)
Having country first would keep your optimized index subset to one country first, then within that, those that are confirmed, and finally the date range itself. Don't query on just a date index alone. Since you are ordering by date, the index should be able to optimize that too.
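A sketch of that compound index (the index name is just an example):
ALTER TABLE classifieds
ADD INDEX idx_country_confirmed_created (country, confirmed, date_created);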
Add EXPLAIN in front of the query and run it again. This will show you the indexes that are being used.
See: 13.8.2 EXPLAIN Statement
And for an explanation of explain see MySQL Explain Explained. Or: Optimizing MySQL: Queries and Indexes
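For example, prefixing the query from the question with EXPLAIN (column list trimmed for brevity):
EXPLAIN
SELECT cl.ID, cl.title, pix.file_name, area.area_name, qn.quarter_name
FROM classifieds cl
INNER JOIN classifieds_pix pix ON cl.ID = pix.classified_id AND pix.picture_no = 0
INNER JOIN zip_codes zip ON cl.zip_id = zip.zip_id AND zip.area_id = 132
INNER JOIN area_names area ON zip.area_id = area.id
LEFT JOIN quarter_names qn ON zip.quarter_id = qn.id
WHERE cl.confirmed = 1
AND cl.country = 'DE'
AND cl.date_created <= NOW() - INTERVAL 1 DAY
ORDER BY cl.date_created desc
LIMIT 7;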