I have two tables players and scores.
I want to generate a report that looks something like this:
player  first score  points
foo     2010-05-20   19
bar     2010-04-15   29
baz     2010-02-04   13
Right now, my query looks something like this:
select p.name player,
min(s.date) first_score,
s.points points
from players p
join scores s on s.player_id = p.id
group by p.name, s.points
I need the s.points that is associated with the row that min(s.date) returns. Is that happening with this query? That is, how can I be certain I'm getting the correct s.points value for the joined row?
Side note: I imagine this is somehow related to MySQL's lack of dense ranking. What's the best workaround here?
This is the greatest-n-per-group problem that comes up frequently on Stack Overflow.
Here's my usual answer:
select
p.name player,
s.date first_score,
s.points points
from players p
join scores s
on s.player_id = p.id
left outer join scores s2
on s2.player_id = p.id
and s2.date < s.date
where
s2.player_id is null
;
In other words, given score s, try to find a score s2 for the same player, but with an earlier date. If no earlier score is found, then s is the earliest one.
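The self-join above can be tried on a few rows. This is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL and made-up sample rows; only the table and column names come from the question.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE scores (
        id INTEGER PRIMARY KEY,
        player_id INTEGER,
        date TEXT,
        points INTEGER
    );
    INSERT INTO players VALUES (1, 'foo'), (2, 'bar');
    INSERT INTO scores (player_id, date, points) VALUES
        (1, '2010-05-20', 19),
        (1, '2010-06-01', 42),
        (2, '2010-04-15', 29),
        (2, '2010-04-30', 7);
""")

# Keep a score s only when no score s2 of the same player has an earlier date.
first_scores = conn.execute("""
    SELECT p.name, s.date, s.points
    FROM players p
    JOIN scores s ON s.player_id = p.id
    LEFT OUTER JOIN scores s2
      ON s2.player_id = p.id AND s2.date < s.date
    WHERE s2.player_id IS NULL
    ORDER BY p.name
""").fetchall()

for row in first_scores:
    print(row)
```

Each player appears once, and the points value comes from the same row as the earliest date, which is what the GROUP BY version could not guarantee.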
Re your comment about ties: You have to have a policy for which one to use in case of a tie. One possibility is if you use auto-incrementing primary keys, the one with the least value is the earlier one. See the additional term in the outer join below:
select
p.name player,
s.date first_score,
s.points points
from players p
join scores s
on s.player_id = p.id
left outer join scores s2
on s2.player_id = p.id
and (s2.date < s.date or s2.date = s.date and s2.id < s.id)
where
s2.player_id is null
;
Basically you need to add tiebreaker terms until you get down to a column that's guaranteed to be unique, at least for the given player. The primary key of the table is often the best solution, but I've seen cases where another column was suitable.
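To see what the tiebreaker changes, here is a small sketch (again assuming SQLite in place of MySQL and invented rows) where one player has two scores on the same earliest date. The QUERY template and the variable names are only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (
        id INTEGER PRIMARY KEY,
        player_id INTEGER,
        date TEXT,
        points INTEGER
    );
    -- Scores 1 and 2 tie on the earliest date.
    INSERT INTO scores (id, player_id, date, points) VALUES
        (1, 1, '2010-05-20', 19),
        (2, 1, '2010-05-20', 23),
        (3, 1, '2010-06-01', 42);
""")

# Same anti-join as in the answer; only the "s2 is earlier" condition varies.
QUERY = """
    SELECT s.id, s.points
    FROM scores s
    LEFT OUTER JOIN scores s2
      ON s2.player_id = s.player_id AND ({earlier})
    WHERE s2.player_id IS NULL
    ORDER BY s.id
"""

no_tiebreak = conn.execute(
    QUERY.format(earlier="s2.date < s.date")
).fetchall()
with_tiebreak = conn.execute(
    QUERY.format(earlier="s2.date < s.date OR (s2.date = s.date AND s2.id < s.id)")
).fetchall()

print(no_tiebreak)    # both tied rows survive
print(with_tiebreak)  # only the row with the lower id survives
```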
Regarding the comments I shared with @OMG Ponies, remember that this type of query benefits hugely from the right index.
Most RDBMSs won't even let you include non-aggregate columns in your SELECT clause when using GROUP BY. In MySQL (unless the ONLY_FULL_GROUP_BY SQL mode is enabled), you'll end up with values from arbitrary rows for your non-aggregate columns. This is useful if you actually have the same value in a particular column for all the rows of a group. Therefore, it's nice that MySQL doesn't restrict us, though it's an important thing to understand.
A whole chapter is devoted to this in SQL Antipatterns.
Related
I have this query I need to optimize further since it requires too much cpu time and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";
This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Subquery results are considered unordered sets, so an ORDER BY is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM [ reviews ] GROUP BY fruit_id is invalid. If you group by fruit_id, which count_vote and which r.text_c do you expect to get for the ID? You don't tell the DBMS (that would be something like MAX(count_vote) and MIN(r.text_c), for instance). MySQL should throw an error, but silently replaces count_vote, r.text_c by ANY_VALUE(count_vote), ANY_VALUE(r.text_c) instead. This means you get arbitrarily picked values for a fruit.
The answer hence to your question is: Don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)
Your Categories table does not seem to be joined/related to the others; this produces a Cartesian product between all the rows.
If you want distinct results, don't use GROUP BY but DISTINCT, so you can avoid an unnecessary subquery.
And you don't need an ORDER BY in a subquery.
SELECT category
, b.fruit_name
, u.name
, r.count_vote
, r.text_c
FROM Fruits b
INNER JOIN (
SELECT distinct fruit_id, count_vote, text_c, customer_id
FROM Reviews
) r ON b.fruit_id = r.fruit_id
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /* Your Categories table seems not joined/related to the others */
WHERE category = "Fruits";
For better readability, you should use explicit join syntax and avoid the old join syntax based on comma-separated table names and WHERE conditions.
The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate Cartesian join. It's also unclear whether any attributes from this table are actually required for the result, whether it is there to produce multiples of the same row in the output, or whether it is just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r
ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
As all too frequently, the biggest benefit may come from simply asking the question of why you need to retrieve every single row in the tables in a single query.
I'm trying to create a database for a guessing game where a player should guess the value of a given product. I have three tables:
One table for the products with the following columns:
ProductID
ProductName (VarChar)
One table for the players with the following columns
PlayerID
PlayerName (VarChar)
One table where the players, the products and their guesses are stored, i.e.
GuessID
PlayerID
ProductID
Guess (Int)
I'm trying to find a way to show which player (PlayerName) made the highest guess (Guess) for EVERY product (ProductName), i.e. a way to summarize every single product, the highest guess for each product and the name of the player who made the guess.
So far, I have only been able to get the correct ProductName and the correct Guess value for each product. Somehow this doesn't work for PlayerName and I keep getting the wrong name each time.
SELECT pl.PlayerName, MAX(g.Guess), p.ProductName
FROM
guess g
INNER JOIN player pl on g.PlayerID = pl.PlayerID
INNER JOIN product p on g.ProductID = p.ProductID
GROUP BY g.ProductID;
I guess my problem is that each row related to a specific product holds two dynamic values, Guess and PlayerID, which gives me problems whenever I try to sort by MAX(Guess). I can't make any sense of the PlayerID that is chosen for the query stated above.
I would appreciate if any of you guys could point me in the right direction.
Cheers.
Use a correlated subquery to select the highest guess for each product, then select the related player and product records in the outer query, like this:
SELECT
p.productname, u.playername, g.guess as "max guess"
FROM
guess_table g
INNER JOIN product_table p ON p.productid = g.productid
INNER JOIN players_table u on u.playerid = g.playerid
WHERE g.guess = (SELECT MAX(guess) FROM guess_table WHERE productid = p.productid)
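A quick way to check the correlated-subquery approach is to run it on a handful of rows. The sketch below assumes SQLite in place of MySQL, uses the table names from the question (guess, player, product) rather than the ones in the query above, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE player (PlayerID INTEGER PRIMARY KEY, PlayerName TEXT);
    CREATE TABLE guess (
        GuessID INTEGER PRIMARY KEY,
        PlayerID INTEGER,
        ProductID INTEGER,
        Guess INTEGER
    );
    INSERT INTO product VALUES (1, 'kettle'), (2, 'lamp');
    INSERT INTO player VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO guess (PlayerID, ProductID, Guess) VALUES
        (1, 1, 30), (2, 1, 45),
        (1, 2, 80), (2, 2, 60);
""")

# For each guess row, keep it only if it equals the maximum guess
# for the same product; the player then comes from that same row.
highest = conn.execute("""
    SELECT p.ProductName, pl.PlayerName, g.Guess
    FROM guess g
    JOIN product p ON p.ProductID = g.ProductID
    JOIN player pl ON pl.PlayerID = g.PlayerID
    WHERE g.Guess = (SELECT MAX(g2.Guess)
                     FROM guess g2
                     WHERE g2.ProductID = g.ProductID)
    ORDER BY p.ProductName
""").fetchall()

for row in highest:
    print(row)
```

If two players share the highest guess for a product, this form returns both rows for that product.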
You can make use of last_value in this query. You can also use first_value by ordering desc (then you don't have to specify unbounded preceding and following). Also, this is based on the assumption that guess is unique within a product name, so that the ordering will be consistent.
Sample data and output would have helped to see if it is expected or not but I think it will work here based on what you have mentioned.
SELECT distinct p.ProductName,
last_value(Guess) over (partition by p.ProductName order by Guess rows between
unbounded preceding and unbounded following ) GuessNumber,
last_value(pl.PlayerName) over (partition by p.ProductName order by Guess rows
between unbounded preceding and unbounded following ) PlayerName
FROM guess g
inner join player pl on g.PlayerID = pl.PlayerID
inner join product p on g.ProductID = p.ProductID
One method is to use something of a hack in MySQL:
SELECT SUBSTRING_INDEX(GROUP_CONCAT(pl.PlayerName ORDER BY g.guess DESC), ',', 1) as PlayerName,
MAX(g.Guess), p.ProductName
FROM guess g JOIN
player pl
ON g.PlayerID = pl.PlayerID JOIN
product p
ON g.ProductID = p.ProductID
GROUP BY p.ProductName;
This method definitely has limits. The result of GROUP_CONCAT() is truncated to the length given by the group_concat_max_len system variable, 1024 bytes by default (although that can be raised). The above also assumes that the names do not contain commas (although other separators can be used).
The better solution is to use window functions, but these are available only in MySQL 8+.
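As an illustration of the window-function route, here is a sketch using ROW_NUMBER(). It assumes SQLite 3.25 or later as a stand-in for MySQL 8+, invented sample rows, and GuessID as the tiebreaker; the alias rn and the derived-table name ranked are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE player (PlayerID INTEGER PRIMARY KEY, PlayerName TEXT);
    CREATE TABLE guess (
        GuessID INTEGER PRIMARY KEY,
        PlayerID INTEGER,
        ProductID INTEGER,
        Guess INTEGER
    );
    INSERT INTO product VALUES (1, 'kettle'), (2, 'lamp');
    INSERT INTO player VALUES (1, 'ann'), (2, 'bob');
    -- Both players guess 80 for the lamp, so the tiebreaker matters.
    INSERT INTO guess (GuessID, PlayerID, ProductID, Guess) VALUES
        (1, 1, 1, 30), (2, 2, 1, 45),
        (3, 1, 2, 80), (4, 2, 2, 80);
""")

# Number the guesses of each product from highest to lowest,
# breaking ties by GuessID, then keep only row number 1.
top_guess = conn.execute("""
    SELECT ProductName, PlayerName, Guess
    FROM (
        SELECT p.ProductName, pl.PlayerName, g.Guess,
               ROW_NUMBER() OVER (PARTITION BY g.ProductID
                                  ORDER BY g.Guess DESC, g.GuessID) AS rn
        FROM guess g
        JOIN player pl ON pl.PlayerID = g.PlayerID
        JOIN product p ON p.ProductID = g.ProductID
    ) AS ranked
    WHERE rn = 1
    ORDER BY ProductName
""").fetchall()

for row in top_guess:
    print(row)
```

Unlike the GROUP_CONCAT() trick, this has no length limit and does not care whether names contain commas.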
How can I get the top 10 scorers by season?
So that it shows last season's top 10 scorers...
I've tried a left join onto the table, but it breaks: it shows 2 players and counts all goals for the first player.
My sqlfiddle:
http://sqlfiddle.com/#!9/b5d0a78/1
You got it almost right.
You want to group match_goals by player ID (match_player_id), but then you should not select goal_minute or any other per goal data.
After grouping by player, you can create a column for COUNT(match_player_id); this will give you the number of goals. You can also use this column to order the results.
Your joins and conditions are correct I think.
EDIT
I think your schema needs a few tweaks: check this http://sqlfiddle.com/#!9/f5a75b/2
Basically create direct relations in the match_players and match_goals to the other tables.
I think the query you want looks like this:
SELECT p.*, count(*) as num_goals
FROM match_goals g INNER JOIN
match_players p
ON g.match_player_id = p.id INNER JOIN
matches m
ON m.id = p.match_id
WHERE p.is_deleted = 0 AND
g.is_own_goal = 0 AND
m.seasion_id = <last season id>
GROUP BY p.id
ORDER BY num_goals DESC
LIMIT 10;
Note that the teams table is not needed. The SELECT p.* is allowed because p.id (the GROUP BY key) is unique.
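The group-then-count idea can be sketched on a stripped-down schema. Everything here is hypothetical apart from the table names and the is_own_goal and match_player_id columns: the name column, the sample rows and the LIMIT of 2 are made up, the season filter is left out because the real schema lives in the fiddle, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the tables in the fiddle.
    CREATE TABLE match_players (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE match_goals (
        id INTEGER PRIMARY KEY,
        match_player_id INTEGER,
        is_own_goal INTEGER
    );
    INSERT INTO match_players VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO match_goals (match_player_id, is_own_goal) VALUES
        (1, 0), (1, 0), (1, 0),
        (2, 0), (2, 1),
        (3, 0), (3, 0);
""")

# One output row per player: count that player's goals (own goals excluded),
# sort by the count and keep the top rows.
top_scorers = conn.execute("""
    SELECT p.id, p.name, COUNT(*) AS num_goals
    FROM match_goals g
    JOIN match_players p ON g.match_player_id = p.id
    WHERE g.is_own_goal = 0
    GROUP BY p.id, p.name
    ORDER BY num_goals DESC, p.id
    LIMIT 2
""").fetchall()

for row in top_scorers:
    print(row)
```

No per-goal column such as goal_minute appears in the SELECT list, so every selected value is well defined for the group.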
I have these tables and queries as defined in sqlfiddle.
At first my problem was to group people, showing the LEFT JOINed visits row with the newest year. I solved that using a subquery.
Now my problem is that the subquery is not using the INDEX defined on the visits table. That causes my query to run nearly indefinitely on tables with approx. 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?
Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, it is using non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not depend on the grouping items). This can give indeterminate (semi-random) results.
Second, (to avoid the indeterminate results) you have added an ORDER BY inside a subquery. Non-standard or not, nothing in the MySQL documentation says this should work as expected. So it may be working now, but it may stop working in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that an ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;
Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0
But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.
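To make the difference between "highest id" and "highest id within the latest year" concrete, here is a small sketch. It assumes SQLite in place of MySQL and invented rows in which visit ids are not in time order; the inner alias my is mine (the answer reuses p there).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (
        id INTEGER PRIMARY KEY,
        id_people INTEGER,
        year INTEGER,
        note TEXT
    );
    INSERT INTO people VALUES (1, 'alice'), (2, 'bob');
    -- alice has two visits in her latest year, plus a back-filled 2009 row
    -- that carries the highest id.
    INSERT INTO visits (id, id_people, year, note) VALUES
        (1, 1, 2012, 'first 2012 visit'),
        (2, 1, 2011, 'older'),
        (3, 1, 2012, 'second 2012 visit'),
        (4, 2, 2010, 'only visit'),
        (5, 1, 2009, 'back-filled');
""")

# Plain MAX(id): only right when ids are assigned in time order.
by_max_id = conn.execute("""
    SELECT v.id_people, v.year
    FROM (SELECT id_people, MAX(id) AS id FROM visits GROUP BY id_people) AS m
    JOIN visits v ON v.id = m.id
    ORDER BY v.id_people
""").fetchall()

# MAX(year) first, then MAX(id) among the visits of that year.
latest = conn.execute("""
    SELECT p.id, p.name, v.year, v.note
    FROM (
        SELECT v.id_people, MAX(v.id) AS id
        FROM (
            SELECT id_people, MAX(year) AS year
            FROM visits
            GROUP BY id_people
        ) AS my
        JOIN visits v ON my.id_people = v.id_people AND my.year = v.year
        GROUP BY v.id_people
    ) AS m
    JOIN people p ON m.id_people = p.id
    JOIN visits v ON m.id = v.id
    ORDER BY p.id
""").fetchall()

print(by_max_id)
print(latest)
```

With these rows the plain MAX(id) query reports 2009 for alice, while the nested version returns exactly one row per person from the latest year.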
I have a MySQL query that:
gets data from three tables linked by unique id's.
counts the number of games played in each category, from each user
and counts the number of games each user has played that fall under the "fps" category
It seems to me that this code could be a lot smaller. How would I go about making this query smaller? http://sqlfiddle.com/#!2/6d211/1
Any help is appreciated even if you just give me links to check out.
Generally it's a good idea to have your join logic as part of the [Inner|Left] Join clause, rather than as part of the Where clause. In your case of simplifying the query, this cleans up your Where clause so that the query processor doesn't apply filter conditions too early, which restricts what you want to do in more complex parts of the query (and impacts the overall performance of the query).
By refactoring the join conditions, we can reduce the query to its core join across the three tables, and then add the join to the specialised subquery where the aggregation occurs. This results in only one nested query, which joins across the fewest tables needed.
Here's what I came up with:
SELECT
u.user_id
,pg.game_id
,u.user
,g.game
,g.game_cat
,ga.cat_count
,ga.fps_count
FROM users u
inner join played_games pg
on u.user_id = pg.user_id
inner join games g
on pg.game_id = g.id
inner join
(
select
ipg.user_id
,ig.game_cat
,count(ig.game) cat_count
,sum(case when ig.game_cat = 'fps' then 1 else 0 end) fps_count
from played_games ipg
inner join games ig
on ipg.game_id = ig.id
group by
ipg.user_id
,ig.game_cat
) ga
on g.game_cat = ga.game_cat
and pg.user_id = ga.user_id
order by
ga.fps_count desc
,u.user
,ga.cat_count desc;
One difference from the original query (apart from the slight rename) is that the fps_count field has a value of 0 instead of NULL for players who haven't played a single FPS game. Hopefully this isn't so critical, but rather helps to add meaning to the query.
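The 0-instead-of-NULL behaviour comes from the ELSE 0 branch of the conditional sum. A minimal sketch of just the aggregating subquery, assuming SQLite in place of MySQL and invented rows (the table and column names follow the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE games (id INTEGER PRIMARY KEY, game TEXT, game_cat TEXT);
    CREATE TABLE played_games (user_id INTEGER, game_id INTEGER);
    INSERT INTO games VALUES
        (1, 'doom', 'fps'), (2, 'quake', 'fps'), (3, 'tetris', 'puzzle');
    -- User 1 played two fps games and one puzzle game; user 2 only a puzzle game.
    INSERT INTO played_games VALUES (1, 1), (1, 2), (1, 3), (2, 3);
""")

# One row per (user, category): the CASE expression contributes 1 for fps
# rows and 0 otherwise, so the sum is never NULL.
per_category = conn.execute("""
    SELECT ipg.user_id,
           ig.game_cat,
           COUNT(ig.game) AS cat_count,
           SUM(CASE WHEN ig.game_cat = 'fps' THEN 1 ELSE 0 END) AS fps_count
    FROM played_games ipg
    JOIN games ig ON ipg.game_id = ig.id
    GROUP BY ipg.user_id, ig.game_cat
    ORDER BY ipg.user_id, ig.game_cat
""").fetchall()

for row in per_category:
    print(row)
```

Because the grouping is by user and category, fps_count is only non-zero on a user's fps row.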
Lastly, I'm not sure about the context of how this is going to be used. In my opinion it's probably trying to do too much in both listing every game played by every user (one objective) and summarising the categories of games played by each user (a separate objective). This means that the summary details are being repeated multiple times, e.g. for users playing multiple games of a particular category, which may not be ideal. My recommendation would be to separate these out into two separate queries, though I don't know whether that would meet your specific needs.
Hope this helps.
I was thinking about whether to provide d_mcg's solution or this one. I decided to go for this one. I was wondering which one would be faster. That's something you can try and tell us :)
select u.user_id, pg.game_id, u.user, g.game, g.game_cat,
(select count(*) from played_games pg2
join games g2 on pg2.game_id = g2.id
where pg2.user_id = pg.user_id and g2.game_cat = g.game_cat) cat_count,
(select count(*) from played_games pg3
join games g3 on pg3.game_id = g3.id
where pg3.user_id = pg.user_id and g3.game_cat = g.game_cat and
g.game_cat = 'fps') order_count
from users u
left join played_games pg on u.user_id = pg.user_id
join games g on pg.game_id = g.id
order by order_count desc, u.user, cat_count desc