I need to optimize a query for a ranking that is taking forever (the query itself works, but I know it's awful and I've just tried it with a good number of records and it gives a timeout).
I'll briefly explain the model. I have 3 tables: player, team and player_team. I have players that can belong to a team. Obvious as it sounds, players are stored in the player table and teams in team. In my app, each player can switch teams at any time, and a log has to be maintained. However, a player is considered to belong to only one team at a given time. The current team of a player is the last one he's joined.
The structure of player and team is not relevant, I think. I have an id column PK in each. In player_team I have:
id (PK)
player_id (FK -> player.id)
team_id (FK -> team.id)
Now, each team is assigned a point for each player that has joined. So, now, I want to get a ranking of the first N teams with the biggest number of players.
My first idea was to first get the current players from player_team (that is, at most one record for each player; this record must be the player's current team). I failed to find a simple way to do it (I tried GROUP BY player_team.player_id HAVING player_team.id = MAX(player_team.id), but that didn't cut it).
I tried a number of queries that didn't work, but managed to get this working.
SELECT
COUNT(*) AS total,
pt.team_id,
p.facebook_uid AS owner_uid,
t.color
FROM
player_team pt
JOIN player p ON (p.id = pt.player_id)
JOIN team t ON (t.id = pt.team_id)
WHERE
pt.id IN (
SELECT max(J.id)
FROM player_team J
GROUP BY J.player_id
)
GROUP BY
pt.team_id
ORDER BY
total DESC
LIMIT 50
As I said, it works but looks very bad and performs worse, so I'm sure there must be a better way to go. Does anyone have any ideas for optimizing this?
I'm using mysql, by the way.
Thanks in advance
Adding the explain. (Sorry, not sure how to format it properly)
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | t | ALL | PRIMARY | NULL | NULL | NULL | 5000 | Using temporary; Using filesort
1 | PRIMARY | pt | ref | FKplayer_pt77082,FKplayer_pt265938,new_index | FKplayer_pt77082 | 4 | t.id | 30 | Using where
1 | PRIMARY | p | eq_ref | PRIMARY | PRIMARY | 4 | pt.player_id | 1 |
2 | DEPENDENT SUBQUERY | J | index | NULL | new_index | 8 | NULL | 150000 | Using index
Try this:
SELECT t.*, cnt
FROM (
SELECT team_id, COUNT(*) AS cnt
FROM (
SELECT player_id, MAX(id) AS mid
FROM player_team
GROUP BY
player_id
) q
JOIN player_team pt
ON pt.id = q.mid
GROUP BY
team_id
) q2
JOIN team t
ON t.id = q2.team_id
ORDER BY
cnt DESC
LIMIT 50
Create an index on player_team (player_id, id) (in this order) for this to work fast.
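A minimal sketch of this query and index, run against an in-memory SQLite database from Python as a stand-in for MySQL. The sample rows, the team colors, and the index name ix_player_team_player_id are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (id INTEGER PRIMARY KEY, color TEXT);
    CREATE TABLE player_team (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        player_id INTEGER NOT NULL,
        team_id INTEGER NOT NULL REFERENCES team(id)
    );
    -- Composite index (player_id, id), in this order, as suggested above.
    CREATE INDEX ix_player_team_player_id ON player_team (player_id, id);
""")
conn.executemany("INSERT INTO team (id, color) VALUES (?, ?)",
                 [(1, "red"), (2, "blue"), (3, "green")])

# Hypothetical join history as (player_id, team_id), in chronological order.
history = [(1, 1), (2, 1), (3, 2), (1, 2), (4, 2), (2, 3), (5, 1), (6, 1)]
conn.executemany(
    "INSERT INTO player_team (player_id, team_id) VALUES (?, ?)", history)

ranking = conn.execute("""
    SELECT t.id, t.color, q2.cnt
    FROM (
        SELECT pt.team_id, COUNT(*) AS cnt
        FROM (
            -- One row per player: the id of that player's latest membership.
            SELECT player_id, MAX(id) AS mid
            FROM player_team
            GROUP BY player_id
        ) q
        JOIN player_team pt ON pt.id = q.mid
        GROUP BY pt.team_id
    ) q2
    JOIN team t ON t.id = q2.team_id
    ORDER BY q2.cnt DESC
    LIMIT 50
""").fetchall()

print(ranking)  # [(2, 'blue', 3), (1, 'red', 2), (3, 'green', 1)]
```

Players 1 and 2 each switched teams in this sample, so only their latest rows are counted. Timings on SQLite say nothing about MySQL, so the sketch only checks that the query returns the intended ranking.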
It's the subquery that is killing it. If you add a current field to the player_team table, giving it the value 1 if the row is current and 0 if it is old, you could simplify this a lot by just doing:
SELECT
COUNT(*) AS total,
pt.team_id,
p.facebook_uid AS owner_uid,
t.color
FROM
player_team pt
JOIN player p ON (p.id = pt.player_id)
JOIN team t ON (t.id = pt.team_id)
WHERE
pt.current = 1
GROUP BY
pt.team_id
ORDER BY
total DESC
LIMIT 50
Having multiple entries in the player_team table for the same relationship where the only way to distinguish which one is the 'current' record is by comparing two (or more) rows I think is bad practice. I have been in this situation before and the workarounds you have to do to make it work really kill performance. It is far better to be able to see which row is current by doing a simple lookup (in this case, where current=1) - or by moving historical data into a completely different table (depending on your situation this might be overkill).
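A rough sketch of how the current flag could be kept up to date, using Python's sqlite3 module in place of MySQL. The helper join_team and the sample data are hypothetical; the idea is only that the old row is retired and the new row inserted within one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE player_team (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        player_id INTEGER NOT NULL,
        team_id INTEGER NOT NULL,
        "current" INTEGER NOT NULL DEFAULT 1
    )
""")

def join_team(conn, player_id, team_id):
    """Record a team switch: retire the old row, then insert the new current one."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            'UPDATE player_team SET "current" = 0 '
            'WHERE player_id = ? AND "current" = 1',
            (player_id,))
        conn.execute(
            'INSERT INTO player_team (player_id, team_id, "current") '
            'VALUES (?, ?, 1)',
            (player_id, team_id))

# Hypothetical history: player 1 starts in team 1 and later moves to team 2.
for player_id, team_id in [(1, 1), (2, 1), (1, 2), (3, 2), (4, 1), (5, 2)]:
    join_team(conn, player_id, team_id)

totals = conn.execute("""
    SELECT pt.team_id, COUNT(*) AS total
    FROM player_team pt
    WHERE pt."current" = 1
    GROUP BY pt.team_id
    ORDER BY total DESC
""").fetchall()

print(totals)  # [(2, 3), (1, 2)]
```

The full history stays in the table, and the ranking query no longer needs to compare rows to find the current one.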
I sometimes find that more complex queries in MySQL need to be broken into two pieces.
The first piece would pull the data required into a temporary table and the second piece would be the query that manipulates the dataset created. In my experience this can result in a significant performance gain.
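As an illustration of the two-piece approach applied to the ranking question, here is a sketch using Python's sqlite3 as a stand-in for MySQL. The temporary table name latest_membership and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player_team (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        player_id INTEGER NOT NULL,
        team_id INTEGER NOT NULL
    );
    INSERT INTO player_team (player_id, team_id) VALUES
        (1, 1), (2, 1), (1, 2), (3, 2), (4, 1), (5, 2);

    -- Piece 1: materialise the id of each player's latest membership row.
    CREATE TEMPORARY TABLE latest_membership AS
        SELECT player_id, MAX(id) AS mid
        FROM player_team
        GROUP BY player_id;
""")

# Piece 2: the ranking query works on the small temporary table.
ranking = conn.execute("""
    SELECT pt.team_id, COUNT(*) AS total
    FROM latest_membership lm
    JOIN player_team pt ON pt.id = lm.mid
    GROUP BY pt.team_id
    ORDER BY total DESC
    LIMIT 50
""").fetchall()

print(ranking)  # [(2, 3), (1, 2)]
```

Whether this beats a single query depends on the server version and the data, so it is worth comparing both with EXPLAIN.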
This will get the current teams with colours ordered by size:
SELECT pt.team_id, COUNT(pt.player_id) AS c, t.color
FROM player_team pt JOIN team t ON t.id = pt.team_id
WHERE pt.current = 1
GROUP BY pt.team_id, t.color
ORDER BY c DESC
LIMIT 50;
But you've not given a condition for which player should be considered owner of the team. Your current query is arbitrarily showing one player as owner_uid because of the grouping, not because that player is the actual owner. If your player_team table contained an 'owner' column, you could join the above query to a query of owners. Something like:
SELECT o.facebook_uid, a.team_id, a.color, a.c
FROM player_team pt1
JOIN player o ON (pt1.player_id = o.id AND pt1.owner = 1)
JOIN (...above query...) a
ON a.team_id = pt1.team_id;
You could add a column "last_playteam_id" to player table, and update it each time a player changes his team with the pk from player_team table.
Then you can do this:
SELECT
COUNT(*) AS total,
pt.team_id,
p.facebook_uid AS owner_uid,
t.color
FROM
player_team pt
JOIN player p ON (p.id = pt.player_id) and p.last_playteam_id = pt.id
JOIN team t ON (t.id = pt.team_id)
GROUP BY
pt.team_id
ORDER BY
total DESC
LIMIT 50
This could be fastest because you don't have to update the old player_team rows to current=0.
Alternatively, you could add a column "last_team_id" and keep the player's current team there. That gives the fastest result for the above query, but it could be less helpful for other queries.
The query below is grabbing some information about a category of toys and showing the most recent sale price for three levels of condition (e.g., Brand New, Used, Refurbished). The price for each sale is almost always different. One other thing: the sales table row ids are not necessarily in chronological order (e.g., a toy with a sale id of 5 could have been sold later than a toy with a sale id of 10).
This query works but is not performant. It runs in a manageable amount of time, usually about 1s. However, I need to add yet another left join to include some more data, which causes the query time to balloon up to about 9s, no bueno.
Here is the working but nonperformant query:
SELECT b.brand_name, t.toy_id, t.toy_name, t.toy_number, tt.toy_type_name, cp.catalog_product_id, s.date_sold, s.condition_id, s.sold_price FROM brands AS b
LEFT JOIN toys AS t ON t.brand_id = b.brand_id
JOIN toy_types AS tt ON t.toy_type_id = tt.toy_type_id
LEFT JOIN catalog_products AS cp ON cp.toy_id = t.toy_id
LEFT JOIN toy_category AS tc ON tc.toy_category_id = t.toy_category_id
LEFT JOIN (
SELECT date_sold, sold_price, catalog_product_id, condition_id
FROM sales
WHERE invalid = 0 AND condition_id <= 3
ORDER BY date_sold DESC
) AS s ON s.catalog_product_id = cp.catalog_product_id
WHERE tc.toy_category_id = 1
GROUP BY t.toy_id, s.condition_id
ORDER BY t.toy_id ASC, s.condition_id ASC
But like I said it's slow. The sales table has about 200k rows.
What I tried to do was create the subquery as a view, e.g.,
CREATE VIEW sales_view AS
SELECT date_sold, sold_price, catalog_product_id, condition_id
FROM sales
WHERE invalid = 0 AND condition_id <= 3
ORDER BY date_sold DESC
Then replace the subquery with the view, like
SELECT b.brand_name, t.toy_id, t.toy_name, t.toy_number, tt.toy_type_name, cp.catalog_product_id, s.date_sold, s.condition_id, s.sold_price FROM brands AS b
LEFT JOIN toys AS t ON t.brand_id = b.brand_id
JOIN toy_types AS tt ON t.toy_type_id = tt.toy_type_id
LEFT JOIN catalog_products AS cp ON cp.toy_id = t.toy_id
LEFT JOIN toy_category AS tc ON tc.toy_category_id = t.toy_category_id
LEFT JOIN sales_view AS s ON s.catalog_product_id = cp.catalog_product_id
WHERE tc.toy_category_id = 1
GROUP BY t.toy_id, s.condition_id
ORDER BY t.toy_id ASC, s.condition_id ASC
Unfortunately, this change causes the query to no longer grab the most recent sale, and the sales price it returns is no longer the most recent.
Why is it that the table view doesn't return the same result as the same select as a subquery?
After reading just about every top-n-per-group stackoverflow question and blog article I could find, getting a query that actually worked was fantastic. But now that I need to extend the query one more step I'm running into performance issues. If anybody wants to sidestep the above question and offer some ways to optimize the original query, I'm all ears!
Thanks for any and all help.
The solution to the subquery performance issue was to use the answer provided here: Groupwise maximum
I thought that this approach could only be used when querying a single table, but indeed it works even when you've joined many other tables. You just have to left join the same table twice using the s.date_sold < s2.date_sold join condition and make sure the where clause looks for the null value in the second table's id column.
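A small sketch of that self-join, run on SQLite from Python rather than MySQL. The sales columns follow the question, while the key column name sale_id and the sample rows are assumptions. The grouping here is per catalog product and condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        catalog_product_id INTEGER NOT NULL,
        condition_id INTEGER NOT NULL,
        date_sold TEXT NOT NULL,
        sold_price REAL NOT NULL,
        invalid INTEGER NOT NULL DEFAULT 0
    );
    -- Sale 5 happened after sale 10: ids are not chronological.
    INSERT INTO sales VALUES
        (5,  1, 1, '2012-03-01', 20.0, 0),
        (10, 1, 1, '2012-01-15', 15.0, 0),
        (11, 1, 2, '2012-02-01',  9.0, 0),
        (12, 1, 2, '2012-02-20',  8.0, 0),
        (13, 1, 2, '2012-04-01',  1.0, 1);
""")

latest = conn.execute("""
    SELECT s.catalog_product_id, s.condition_id, s.date_sold, s.sold_price
    FROM sales s
    LEFT JOIN sales s2
           ON s2.catalog_product_id = s.catalog_product_id
          AND s2.condition_id = s.condition_id
          AND s2.invalid = 0
          AND s.date_sold < s2.date_sold   -- s2 is a later valid sale
    WHERE s.invalid = 0
      AND s.condition_id <= 3
      AND s2.sale_id IS NULL               -- keep s only if no later sale exists
    ORDER BY s.catalog_product_id, s.condition_id
""").fetchall()

print(latest)  # [(1, 1, '2012-03-01', 20.0), (1, 2, '2012-02-20', 8.0)]
```

The filter on invalid has to be repeated inside the join condition, otherwise an invalid but later sale would hide the most recent valid one.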
So I have the following table structure for a Sports Event system
TEAMS TABLE
team_id
game_id
team_name
team_logo
PLAYERS TABLE
player_id
team_id
player_name
player_mobile
player_email
So whenever a player submits a team registration, the details get saved in both tables. Events could be something like Cricket, Basketball, Netball, etc. Sometimes they don't fill in the players' details, and sometimes they resubmit their team, which means the same team name is submitted again.
So whenever I need to check the accurate details of the team list I have been using this:
SELECT team_id FROM `teams` WHERE `game_id`= 35 GROUP BY `team_name`
To get a list of the people in these teams that are the same name I was using this:
SELECT team_id, player_name FROM `player` WHERE team_id IN (SELECT team_id FROM `teams` WHERE `game_id`= 35 GROUP BY `team_name`) AND player_name IS NOT NULL AND player_name <> ''
The problem is that the query on top gives me different results from the one on the bottom. What I need is to get a list of current teams whenever I need it, with no duplicate teams. Then I need a list of the players of these teams.
Currently stumped :( Help me pls.
TL;DR
You can get the desired results with a JOIN and DISTINCT
SELECT DISTINCT t.team_name, p.player_name
FROM teams AS t
INNER JOIN Players AS p
ON p.team_id = t.team_id;
FULL EXPLANATION
The following query is not deterministic, that is to say, you could run the same query on the same data multiple times and get different results:
SELECT team_id
FROM `teams`
WHERE `game_id`= 35
GROUP BY `team_name`;
Many DBMS would not even allow this query to run. You have stated that some teams are duplicated, so consider the following dummy data:
team_id team_name game_id
------------------------------------
1 The A-Team 35
2 The A-Team 35
3 The A-Team 35
When you group by team_name you end up with one group, so if we start with a valid query:
SELECT team_name
FROM `teams`
WHERE `game_id`= 35
GROUP BY `team_name`;
We would expect one result:
team_name
--------------
The A-Team
When you add team_id to the select, with no aggregate function, the query engine needs to pick one value for team_id, but it has 3 different values to choose from, and none of them is more correct than any other. This is why anything in the select list must be contained within the group by (or functionally dependent on something that is), or be part of an aggregate function.
The MySQL Docs state:
In standard SQL, a query that includes a GROUP BY clause cannot refer to nonaggregated columns in the select list that are not named in the GROUP BY clause. For example, this query is illegal in standard SQL because the name column in the select list does not appear in the GROUP BY:
SELECT o.custid, c.name, MAX(o.payment)
FROM orders AS o, customers AS c
WHERE o.custid = c.custid
GROUP BY o.custid;
For the query to be legal, the name column must be omitted from the select list or named in the GROUP BY clause.
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group.
The reason this extension exists is valid, and it can save some time. Consider the following query:
SELECT t.team_id, t.team_name, COUNT(*) AS Players
FROM teams AS t
LEFT JOIN Players AS p
ON p.team_id = t.team_id
GROUP BY t.team_id;
Here, we can include team_name in the select list even though it is not in the group by, but we can do this safely since team_id is the primary key, therefore it would be impossible to have two different values of team_name for a single team_id.
Anyway, I digress, the problem you are most likely having is that the value returned for team_id in each of your queries will likely be different depending on the context of the query and the execution plan chosen.
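One way to make the choice of team_id deterministic is to pick it with an aggregate and reuse the same rule in both queries. The sketch below uses Python's sqlite3 in place of MySQL. Keeping the lowest team_id per team name is an assumed policy, and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, game_id INTEGER, team_name TEXT);
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, team_id INTEGER, player_name TEXT);
    INSERT INTO teams VALUES
        (1, 35, 'The A-Team'), (2, 35, 'The A-Team'), (3, 35, 'The A-Team'),
        (4, 35, 'The B-Team'), (5, 36, 'The A-Team');
    INSERT INTO players VALUES
        (1, 1, 'Hannibal'), (2, 1, 'Face'), (3, 2, 'Hannibal'),
        (4, 4, ''), (5, 4, 'Murdock');
""")

# MIN(team_id) names exactly one row per team name, so the result is stable.
teams_list = conn.execute("""
    SELECT MIN(team_id) AS team_id, team_name
    FROM teams
    WHERE game_id = 35
    GROUP BY team_name
    ORDER BY team_name
""").fetchall()

# The player list joins against the same deterministic set of team ids.
players_list = conn.execute("""
    SELECT p.team_id, p.player_name
    FROM players p
    JOIN (
        SELECT MIN(team_id) AS team_id
        FROM teams
        WHERE game_id = 35
        GROUP BY team_name
    ) t ON t.team_id = p.team_id
    WHERE p.player_name IS NOT NULL AND p.player_name <> ''
    ORDER BY p.team_id, p.player_name
""").fetchall()

print(teams_list)    # [(1, 'The A-Team'), (4, 'The B-Team')]
print(players_list)  # [(1, 'Face'), (1, 'Hannibal'), (4, 'Murdock')]
```

If the latest resubmission should win instead, MAX(team_id) would be the equivalent choice, assuming ids grow with submission time.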
You can get a distinct list of players and teams using DISTINCT:
SELECT DISTINCT t.team_name, p.player_name
FROM teams AS t
INNER JOIN Players AS p
ON p.team_id = t.team_id;
This is essentially a hack, and while it does remove duplicate records it does not resolve the underlying issues: the duplicate records themselves and a potentially sub-optimal data structure.
If it is not too late, I would reconsider your design and make a few changes. If team names are supposed to be unique, then make them unique with a unique constraint, so instead of working around duplicate entries, you prevent them completely.
You should probably be using junction tables for players and games, i.e. have your main tables
Team (team_id, team_name, team_logo etc)
Game (game_id, game_name, etc)
Player (player_id, player_name, player_email, player_mobile etc)
Then tables to link them
Team_Game (team_id, game_id)
Team_Player (team_id, player_id)
This then allows one player to play for multiple teams, or one team to enter multiple events.
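A possible shape for this design, sketched with Python's sqlite3 instead of MySQL. The column types, the sample rows and the game names are assumptions; the table and column names follow the outline above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (
        team_id INTEGER PRIMARY KEY,
        team_name TEXT NOT NULL UNIQUE,   -- duplicate names are rejected
        team_logo TEXT
    );
    CREATE TABLE game (game_id INTEGER PRIMARY KEY, game_name TEXT NOT NULL);
    CREATE TABLE player (
        player_id INTEGER PRIMARY KEY,
        player_name TEXT NOT NULL,
        player_email TEXT,
        player_mobile TEXT
    );
    -- Junction tables: one row per (team, game) and per (team, player) pair.
    CREATE TABLE team_game (
        team_id INTEGER NOT NULL REFERENCES team(team_id),
        game_id INTEGER NOT NULL REFERENCES game(game_id),
        PRIMARY KEY (team_id, game_id)
    );
    CREATE TABLE team_player (
        team_id INTEGER NOT NULL REFERENCES team(team_id),
        player_id INTEGER NOT NULL REFERENCES player(player_id),
        PRIMARY KEY (team_id, player_id)
    );
    INSERT INTO team (team_id, team_name) VALUES (1, 'The A-Team');
    INSERT INTO game (game_id, game_name) VALUES (35, 'Cricket'), (36, 'Netball');
    INSERT INTO player (player_id, player_name) VALUES (1, 'Hannibal'), (2, 'Face');
    INSERT INTO team_game VALUES (1, 35), (1, 36);
    INSERT INTO team_player VALUES (1, 1), (1, 2);
""")

# A resubmission with the same team name now fails instead of creating a duplicate.
try:
    conn.execute("INSERT INTO team (team_name) VALUES ('The A-Team')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

roster = conn.execute("""
    SELECT t.team_name, p.player_name
    FROM team_game tg
    JOIN team t ON t.team_id = tg.team_id
    JOIN team_player tp ON tp.team_id = t.team_id
    JOIN player p ON p.player_id = tp.player_id
    WHERE tg.game_id = 35
    ORDER BY p.player_name
""").fetchall()

print(duplicate_rejected)  # True
print(roster)              # [('The A-Team', 'Face'), ('The A-Team', 'Hannibal')]
```

With the unique constraint in place, the application has to handle a resubmission explicitly, for example by updating the existing team.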
Select t.team_id , p.player_name from player p
JOIN teams t
ON t.team_id = p.team_id
Where t.game_id = 35 AND p.player_name IS NOT NULL AND p.player_name <> ''
GROUP BY(t.team_name)
You should do a unique constraint on the team_name column, this way you are not allowing duplicate teams
Ps. I did not test the query but it should work
I have this query:
SELECT timestmp
FROM devicelog
WHERE device_id = 5
ORDER BY id DESC LIMIT 1
which takes less than 0.001 seconds to fetch, but once I put it in a subquery, it slows down to about 3.05 seconds. Any reason to why it does this, or how I can remedy it?
Here is the second query (which is the one I want to optimize):
SELECT device.id,
(SELECT timestmp
FROM devicelog
WHERE device_id = device.id
ORDER BY id DESC LIMIT 1) as timestmp
FROM device
Table "device" only has about 10-15 records in it (devicelog has several million), so I would assume it goes through each record one by one and executes the subquery, but obviously it's doing something else. The PK of devicelog is the id, and the PK in device is its id as well. There is an index on devicelog for timestmp (which is a datetime) and device_id, which is also a FK back to device. There are other indices as well, but they are irrelevant (things like names, descriptions, etc).
I just need it to loop through devices, then display the last timestamp record.
If I list each device in PHP and then perform the first query separately for each one, it performs extraordinarily well, but I want to do this in a single query. That is, I could do something like (pseudocode):
foreach($row in <device>)
query('<first query> where device_id = $row[id]')
Doing an entire join would be too expensive on devicelog because of the high row count.
Your question and query do not match what you are looking for per the comment to the first answer offered.
What you can do is a pre-aggregate on a per-device basis to get the max ID log, then join that to your master list of devices...
SELECT
d.name,
d.id,
DeviceMax.lastTime
from
device d
LEFT JOIN ( select dl.device_id,
max( dl.timestmp ) lastTime
from
devicelog dl
group by
dl.device_id ) as DeviceMax
ON d.id = DeviceMax.device_id
Now, if you needed other stuff from the device log for that specific entry, we could just add on to that...
LEFT JOIN devicelog dl2
on DeviceMax.Device_id = dl2.Device_id
AND DeviceMax.lastTime = dl2.timestmp
then you can get any other columns from the "dl2" alias added to your query.
Also, for your device log table, I would have a covering index on (device_id, timestmp)
COMMENT FROM FEEDBACK
Then I would offer this as a suggestion for you, which is something also common in the world of web development when someone needs "highest count", or "most" of something, or "most recent" etc.
Denormalize your Device table with only one respect... add a column for the LastDeviceLogID. Then, whenever your DeviceLog has an entry added to it, you just use an after insert trigger that does...
update Device
set LastDeviceLogID = newRecord.DeviceLogID
where ID = newRecord.Device_ID
Columns may not be exact, but the principle is there. This way, you never need to do a LIMIT 1, MAX(), etc. and go through your millions of records; you can get the result as simply as doing
SELECT
d.name,
d.id,
dl.timestamp,
dl.othercolumns
from
device d
LEFT JOIN devicelog dl
on d.LastDeviceLogID = dl.DeviceLogID
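To show the trigger idea end to end, here is a sketch on SQLite via Python. The column name last_devicelog_id, the trigger name and the sample rows are assumptions, and MySQL's CREATE TRIGGER syntax differs slightly (it requires FOR EACH ROW, for instance).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (
        id INTEGER PRIMARY KEY,
        name TEXT,
        last_devicelog_id INTEGER          -- denormalised pointer to the newest log row
    );
    CREATE TABLE devicelog (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        device_id INTEGER NOT NULL REFERENCES device(id),
        timestmp TEXT NOT NULL
    );
    -- After every log insert, point the device at the new row.
    CREATE TRIGGER devicelog_after_insert
    AFTER INSERT ON devicelog
    BEGIN
        UPDATE device SET last_devicelog_id = NEW.id WHERE id = NEW.device_id;
    END;
    INSERT INTO device (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

conn.executemany(
    "INSERT INTO devicelog (device_id, timestmp) VALUES (?, ?)",
    [(1, "2015-01-01 10:00:00"),
     (2, "2015-01-01 11:00:00"),
     (1, "2015-01-02 09:00:00")])

# One primary-key lookup per device; no MAX() or LIMIT 1 over the log table.
last_seen = conn.execute("""
    SELECT d.id, d.name, dl.timestmp
    FROM device d
    LEFT JOIN devicelog dl ON dl.id = d.last_devicelog_id
    ORDER BY d.id
""").fetchall()

print(last_seen)
# [(1, 'a', '2015-01-02 09:00:00'), (2, 'b', '2015-01-01 11:00:00'), (3, 'c', None)]
```

Device 3 has no log rows, so the LEFT JOIN still lists it with a NULL timestamp.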
try this with inner join
SELECT d.id, dl.timestamp
FROM device d
INNER JOIN devicelog dl
ON dl.device_id = d.id
ORDER BY d.id DESC LIMIT 1
to list all devices you may consider the GROUP BY
SELECT d.id, dl.timestamp
FROM device d
INNER JOIN devicelog dl
ON dl.device_id = d.id
GROUP BY d.id
ORDER BY d.id DESC
Edit:
if you have many indices, then it is better to index your id column
try this
ALTER TABLE `device` ADD INDEX `id` (`id`)
edit: here is a simplified version of the original query (runs in 3.6 secs on a products table of 475K rows)
SELECT p.*, shop FROM products p JOIN
users u ON p.date >= u.prior_login and u.user_id = 22 JOIN
shops s ON p.shop_id = s.shop_id
ORDER BY shop, date, product_id;
this is the explain plan
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | u | const | PRIMARY,prior_login,user_id | PRIMARY | 4 | const | 1 | Using temporary; Using filesort
1 | SIMPLE | s | ALL | PRIMARY | NULL | NULL | NULL | 90 |
1 | SIMPLE | p | ref | shop_id,date,shop_id_2,shop_id_3 | shop_id | 4 | bitt3n_minxa.s.shop_id | 5338 | Using where
the bottleneck seems to be ORDER BY date,product_id. Removing these two orderings, the query runs in 0.06 seconds. (Removing either one of the two (but not both) has virtually no effect, query still takes over 3 seconds.) I have indexes on both product_id and date in the products table. I have also added an index on (product,date) with no improvement.
newtover suggests the problem is the fact that the INNER JOIN users u1 ON products.date >= u1.prior_login requirement is preventing use of the index on products.date
Two variations of the query that execute in ~0.006 secs (as opposed to 3.6 secs for the original) have been suggested to me (not from this thread).
this one uses a subquery, which appears to force the order of the joins
SELECT p.*, shop
FROM
(
SELECT p.*
FROM products p
WHERE p.date >= (select prior_login FROM users where user_id = 22)
) as p
JOIN shops s
ON p.shop_id = s.shop_id
ORDER BY shop, date, product_id;
this one uses the WHERE clause to do the same thing (although the presence of SQL_SMALL_RESULT doesn't change the execution time, 0.006 secs without it as well)
SELECT SQL_SMALL_RESULT p . * , shop
FROM products p
INNER JOIN shops s ON p.shop_id = s.shop_id
WHERE p.date >= (
SELECT prior_login
FROM users
WHERE user_id =22 )
ORDER BY shop, DATE, product_id;
My understanding is that these queries work much faster on account of reducing the relevant number of rows of the product table before joining it to the shops table. I am wondering if this is correct.
Use the EXPLAIN statement to see the execution plan. Also you can try adding an index to products.date and u1.prior_login.
Also please just make sure you have defined your foreign keys and they are indexed.
Good luck.
We do need an explain plan... but
Be very careful with select * from table where id in (select id from another_table). This pattern is notorious for poor performance in MySQL. Generally these can be replaced by a join. The following query might run, although I haven't tested it.
SELECT shop,
shops.shop_id AS shop_id,
products.product_id AS product_id,
brand,
title,
price,
image AS image,
image_width,
image_height,
0 AS sex,
products.date AS date,
fav1.favorited AS circle_favorited,
fav2.favorited AS session_user_favorited,
u2.username AS circle_username
FROM products
LEFT JOIN favorites fav2
ON fav2.product_id = products.product_id
AND fav2.user_id = 22
AND fav2.current = 1
INNER JOIN shops
ON shops.shop_id = products.shop_id
INNER JOIN users u1
ON products.date >= u1.prior_login AND u1.user_id = 22
LEFT JOIN favorites fav1
ON products.product_id = fav1.product_id
LEFT JOIN friends f1
ON f1.star_id = fav1.user_id
LEFT JOIN users u2
ON fav1.user_id = u2.user_id
WHERE f1.fan_id = 22 OR fav1.user_id = 22
ORDER BY shop,
DATE,
product_id,
circle_favorited
The fact that the query is slow because of the ordering is rather obvious, since it is hard to find an index that could be used for the ORDER BY in this case. The main problem is the products.date >= comparison, which prevents using any index for ORDER BY. And since you have a lot of data to output, MySQL starts using temporary tables for sorting.
What I would do is try to force MySQL to output the data in the order of an index which already has the required order, and remove the ORDER BY clause.
I am not at a computer to test, but here is how I would do it:
I would do all inner joins
then I would LEFT JOIN to a subquery which makes all computations on favorites ordered by product_id, circle_favourited (which would provide the last ordering criterion).
So, the question is how to make the data be sorted on shop, date, product_id
I am going to write about it a bit later =)
UPD1:
You should probably read something on how btree indexes work in MySQL. There is a good article on mysqlperformanceblog.com about it (I am currently writing from a mobile and don't have the link at hand). In short, you seem to be talking about one-column indexes, which arrange pointers to rows based on values sorted in a single column. Compound indexes store an order based on several columns. Indexes are mostly used to operate on clearly defined ranges, to obtain most of the information before retrieving data from the rows they point at. Indexes usually do not know about other indexes on the same table; as a result they are rarely merged. When there is no more info to take from the index, MySQL starts to operate directly on the data.
That is, an index on date cannot make use of the index on product_id, but an index on (date, product_id) can get some more info on product_id after a condition on date (sort on product id for a specific date match).
Nevertheless, a range condition on date (>=) breaks this. That is what I was talking about.
UPD2:
As I understand it, the problem can be reduced to the following query (most of the time is spent on it):
SELECT p.*, shop
FROM products p
JOIN users u ON p.`date` >= u.prior_login and u.user_id = 22
JOIN shops s ON p.shop_id = s.shop_id
ORDER BY shop, `date`, product_id;
Now add an index (user_id, prior_login) on users and (date) on products, and try the following query:
SELECT STRAIGHT_JOIN p.*, shop
FROM (
SELECT product_id, shop
FROM users u
JOIN products p
ON u.user_id = 22 AND p.`date` >= u.prior_login
JOIN shops s
ON p.shop_id = s.shop_id
ORDER BY shop, p.`date`, product_id
) as s
JOIN products p USING (product_id);
If I am correct, the query should return the same result but quicker. It would be nice if you would post the result of EXPLAIN for the query.
I have two tables players and scores.
I want to generate a report that looks something like this:
player   first score   points
foo      2010-05-20    19
bar      2010-04-15    29
baz      2010-02-04    13
Right now, my query looks something like this:
select p.name player,
min(s.date) first_score,
s.points points
from players p
join scores s on s.player_id = p.id
group by p.name, s.points
I need the s.points that is associated with the row that min(s.date) returns. Is that happening with this query? That is, how can I be certain I'm getting the correct s.points value for the joined row?
Side note: I imagine this is somehow related to MySQL's lack of dense ranking. What's the best workaround here?
This is the greatest-n-per-group problem that comes up frequently on Stack Overflow.
Here's my usual answer:
select
p.name player,
s.date first_score,
s.points points
from players p
join scores s
on s.player_id = p.id
left outer join scores s2
on s2.player_id = p.id
and s2.date < s.date
where
s2.player_id is null
;
In other words, given score s, try to find a score s2 for the same player, but with an earlier date. If no earlier score is found, then s is the earliest one.
Re your comment about ties: You have to have a policy for which one to use in case of a tie. One possibility is if you use auto-incrementing primary keys, the one with the least value is the earlier one. See the additional term in the outer join below:
select
p.name player,
s.date first_score,
s.points points
from players p
join scores s
on s.player_id = p.id
left outer join scores s2
on s2.player_id = p.id
and (s2.date < s.date or s2.date = s.date and s2.id < s.id)
where
s2.player_id is null
;
Basically you need to add tiebreaker terms until you get down to a column that's guaranteed to be unique, at least for the given player. The primary key of the table is often the best solution, but I've seen cases where another column was suitable.
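The effect of the tiebreaker can be checked with a small sketch on SQLite via Python. The table layout follows the question; the sample scores, including two scores for foo on the same earliest date, are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE scores (
        id INTEGER PRIMARY KEY,
        player_id INTEGER NOT NULL REFERENCES players(id),
        date TEXT NOT NULL,
        points INTEGER NOT NULL
    );
    INSERT INTO players VALUES (1, 'foo'), (2, 'bar');
    -- foo has two scores on the same earliest date (a tie).
    INSERT INTO scores VALUES
        (1, 1, '2010-05-20', 19),
        (2, 1, '2010-05-20', 7),
        (3, 1, '2010-06-01', 30),
        (4, 2, '2010-04-15', 29),
        (5, 2, '2010-05-01', 3);
""")

QUERY = """
    SELECT p.name, s.date, s.points
    FROM players p
    JOIN scores s ON s.player_id = p.id
    LEFT OUTER JOIN scores s2
           ON s2.player_id = p.id
          AND ({earlier})
    WHERE s2.player_id IS NULL
    ORDER BY p.name, s.id
"""

# Date only: both of foo's tied rows survive.
without_tiebreak = conn.execute(
    QUERY.format(earlier="s2.date < s.date")).fetchall()

# Date plus primary key: exactly one row per player.
with_tiebreak = conn.execute(
    QUERY.format(
        earlier="s2.date < s.date OR (s2.date = s.date AND s2.id < s.id)"
    )).fetchall()

print(without_tiebreak)
# [('bar', '2010-04-15', 29), ('foo', '2010-05-20', 19), ('foo', '2010-05-20', 7)]
print(with_tiebreak)
# [('bar', '2010-04-15', 29), ('foo', '2010-05-20', 19)]
```

The points value in each row comes from the same scores row as the date, which is the guarantee the GROUP BY version cannot give.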
Regarding the comments I shared with @OMG Ponies, remember that this type of query benefits hugely from the right index.
Most RDBMSs won't even let you include non-aggregate columns in your SELECT clause when using GROUP BY. In MySQL, you'll end up with values from arbitrary rows for your non-aggregate columns. This is useful if you actually have the same value in a particular column for all the rows. Therefore, it's nice that MySQL doesn't restrict us, though it's an important thing to understand.
A whole chapter is devoted to this in SQL Antipatterns.