I have 2 tables:
F_Test
and d_partners
I need to provide to every “site_name” the top 5 “partner_name” (by join with dim_partners) with the highest number of clicks (every record with value under the “partner_id” it’s one click).
This is my query:
select t.partner_name, t.partner_id
from F_Test as t, d_partners as t2
where t. partner_id =t2.partner_id
GROUP BY t.site_name
Order by desc limit 5
Do you think it's fine? What should I change?
Here are the parts of the tables:
F_Test table
d_partners table
I would expect a query like this:
select sp.*
from (select t.site_name, p.partner_name, count(*) as num_clicks,
row_number() over (partition by t.site_name order by count(*) desc) as seqnum
from F_Test t join
d_partners p
on p.partner_id = t.partner_id
group by t.site_name, p.partner_name
) sp
where seqnum <= 5;
Notes:
The query uses proper, explicit, standard, readable JOIN syntax.
As a corollary: Never use commas in the FROM clause.
Use meaningful table aliases -- abbreviations for the table names -- rather than meaningless aliases.
Window functions have been available in MySQL for several years, starting with version 8.0.
Related
I have two tables. The first table (users) is a simple "id, username" with 100,00 rows and the second (stats) is "id, date, stat" with 20M rows.
I'm trying to figure out which username went up by the most in stat and here's the query I have. On a powerful machine, this query takes minutes to complete. Is there a better way to write it to speed it up?
SELECT a.id, a.username, b.stat, c.stat, (b.stat - c.stat) AS stat_diff
FROM users AS a
INNER JOIN stats AS b ON (b.id=a.id)
INNER JOIN stats AS c ON (c.id=a.id)
WHERE b.date = '2016-01-10'
AND c.date = '2016-01-13'
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100
the other way i tried but it doesn't seem optimal is
SELECT a.id, a.username,
(SELECT b.stat FROM stats AS b ON (b.id=a.id) AND b.date = '2016-01-10') AS start,
(SELECT c.stat FROM stats AS c ON (c.id=a.id) AND c.date = '2016-01-14') AS end,
((SELECT b.stat FROM stats AS b ON (b.id=a.id) AND b.date = '2016-01-10') -
(SELECT c.stat FROM stats AS c ON (c.id=a.id) AND c.date = '2016-01-14')) AS stat_diff
FROM users AS a
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100
Introduction
Let's suppose we rewrite sentence like this:
SELECT a.id, a.username, b.stat, c.stat, (b.stat - c.stat) AS stat_diff
FROM users AS a
INNER JOIN stats AS b ON
b.date = STR_TO_DATE('2016-01-10', '%Y-%m-%d' ) and b.id=a.id
INNER JOIN stats AS c ON
c.date = STR_TO_DATE('2016-01-13', '%Y-%m-%d' ) and c.id=a.id
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100
And we ensure than:
users table has index on field id:
stats has index on composite field date, id: create index stats_idx_d_i on stats ( date, id );
Then
Database optimizer may use indexes to selected a Restricted Set of Date ('RSD'), that means, rows that match filtered dates. This is fast.
But
You are sorting by a calculated field:
(b.stat - c.stat) AS stat_diff #<-- calculated
ORDER BY stat_diff DESC #<-- this forces to calculate it
They are no possible optimization on this sort because you should to calculate one by one all results on your 'RSD' (restricted set of data).
Conclusion
The question is, how may rows they are on your 'RSD'? If only they are few hundreds rows you query may run fast, else, your query will be slow.
Any case, you should to be sure the first step of query ( without sorting ) is made by index and no fullscanning. Use Explain command to be sure.
All you need to do is to help optimizer.At a bare minimum.have a check list which looks like below
1.Are my join columns indexed ?
2.Are the where clauses Sargable
3.are there any implicit,explicit conversions
4.Am i seeing any statistics issues
one more interesting aspect to look at is how is your data distributed,once you understand the data,you will be able to intrepret the execution plan and alter it as per your need
EX:
Think like i have any customers table with 100rows,Each one has a minimum of 10 orders(total upto 10000 orders).Now if you need to find out only top 3 orders by date,you dont want a scan happening of orders table
Now in your case ,i may not go with second option,even though the optimizer may choose a good plan for this one as well,I will go first approach and try to see if the execution time is acceptable.if not then i will go through my check list and try to tune it further
The Query Seems OK, Verify your Indexes ..
Or
Try this Query
SELECT a.id, a.username, b.stat, c.stat, (b.stat - c.stat) AS stat_diff
FROM users AS a
INNER JOIN (select id,stat from stats where date = '2016-01-10') AS b ON (b.id=a.id)
INNER JOIN (select id,stat from stats where date = '2016-01-13') AS c ON (c.id=a.id)
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100
If I run the following query:
SELECT *
FROM `smp_data_log`
WHERE Post_id = 1234 AND Account_id = 1306
ORDER BY Created_time DESC
I get 7 rows back including entries with the following Created_times:
1) 1424134801
2) 1424134801
3) 1421802001
4) 3601
If I run the following query:
SELECT mytable.*
FROM (SELECT * FROM `smp_data_log` ORDER BY Created_time DESC) AS mytable
WHERE Post_id = 1234 AND Account_id = 1306
GROUP BY Post_id
I am would expect to see 1424134801 come back as a single row - but instead I am seeing 3601?? I would have thought this would have returned the latest time (as its descending). What am I doing wrong?
Your expectation is wrong. And this is well documented in MySQL. You are using an extension, where you have columns in the select that are not in the group by -- a very bad habit and one that doesn't work in other databases (except in some very special circumstances allowed by the ANSI standard).
Just use join to get what you really want:
SELECT l.*
FROM smp_data_log l JOIN
(select post_id, max(created_time) as maxct
from smp_data_log
group by post_id
) lmax
on lmax.post_id = l.post_id and lmax.maxct = l.created_time;
Here is the quote from the documentation:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
I have a query.
SELECT * FROM users LEFT JOIN ranks ON ranks.minPosts <= users.postCount
This returns a row every time it is matched. By using a GROUP BY users.id I get each row as a individual id.
However, when they group I only get the first row. I would instead like the row with the highest value of ranks.minPosts
Is there a way to do this, also, would it be faster (less resources) to just use two different queries?
Assuming there is only one column in ranks that you want, you can do this using a correlated subquery:
SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u;
If you need the entire row from ranks, then join it back in:
SELECT ur.*, r.*
FROM (SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u
) ur join
ranks r
on ur.minPosts = r.minPosts;
(The * is for convenience; you should list out the columns you want.)
Because you're using mysql, this will work:
SELECT * FROM (
SELECT *, users.id user_id
FROM users
LEFT JOIN ranks ON ranks.minPosts <= users.postCount
ORDER BY ranks.minPosts DESC
) x
GROUP BY user_id
Mysql always returns the first row encountered for each unique group, so if you first order the data, then use the non-standard grouping behaviour, you'll get the row you want.
Disclaimer:
Although this works reliably in practice, the mysql documentation says not to rely on it. If you use this convenient approach (which will reliably pass any test you can write), you should consider that it is not recommended by mysql and that later releases of mysql may not continue behave in this way.
What we'd really like to do would be to order the rows by ranks.minPosts before the group by. Unfortunately MySQL doesn't support that without using a subquery of some form.
If the ranks are already ordered by their ids then you can extract the id by selecting MAX(ranks.id), and if they're not, you can still get the highest ranks.minPosts by selecting MAX(ranks.minPosts). However, it would be nice to be able to get the entire record. I guess you're left with the subquery solution, which is as follows:
SELECT <fields> FROM users LEFT JOIN
(SELECT * FROM ranks ORDER BY minPosts DESC) as r
ON r.minPosts <= users.postCount GROUP BY users.id
I have this query:
SELECT timestmp
FROM devicelog
WHERE device_id = 5
ORDER BY id DESC LIMIT 1
which takes less than 0.001 seconds to fetch, but once I put it in a subquery, it slows down to about 3.05 seconds. Any reason to why it does this, or how I can remedy it?
Here is the second query (which is the one I want to optimize):
SELECT device.id,
(SELECT timestmp
FROM devicelog
WHERE device_id = device.id
ORDER BY id DESC LIMIT 1) as timestmp
FROM device
Table "device" only has like 10-15 records in it (devicelog has several million), so I would assume it goes 1 by 1 through each record and then executes the subquery, but obviously it's doing something else. The PK of devicelog is the id, and the PK in device is its id as well. There is an index on devicelog for timestmp (which is a datetime) and device_id which is also a FK back to devicelog. There are other indices as well, but they are irrelevant (things like names, descriptions, etc).
I just need it to loop through devices, then display the last timestamp record.
If I list each device in PHP, then perform the first query separately, it will be perform extraordinary well, but I want to do this in one entire query. Like, I could do something like (pseudocode):
foreach($row in <devicelog>)
query('<first query> where id = $row[id]')
Doing an entire join would be too expensive on devicelog just because the high count.
Your question and query do not match what you are looking for per the comment to the first answer offered.
What you can do is a pre-aggregate on a per-device basis to get the max ID log, then join that to your master list of devices...
SELECT
d.name,
d.id,
DeviceMax.lastTime
from
device d
LEFT JOIN ( select dl.device_id,
max( dl.timestamp ) lastTime
from
devicelog dl
group by
dl.device_id ) as DeviceMax
ON d.id = DeviceMax.device_id
Now, if you needed other stuff from the device log for that specific entry, we could just add on to that...
LEFT JOIN devicelog dl2
on DeviceMax.Device_id = dl2.Device_id
AND DeviceMax.lastTime = dl2.timestamp
then you can get any other columns from the "dl2" alias added to your query.
Also, for your device log table, I would have a covering index on (device_id, timestamp)
COMMENT FROM FEEDBACK
Then I would offer this as a suggestion for you, which is something also common in the world of web development when someone needs "highest count", or "most" of something, or "most recent" etc.
Denormalize your Device table with only one respect... add a column for the LastDeviceLogID. Then, whenever your DeviceLog has an entry added to it, you just use an after insert trigger that does...
update Device
set LastDeviceLogID = newRecord.DeviceLogID
where ID = newRecord.Device_ID
Columns may not be exact, but the principle is there. This way, you never need to do a LIMIT 1, MAX(), etc and go through your millions of records, you can get as simply as doing
SELECT
d.name,
d.id,
dl.timestamp,
dl.othercolumns
from
device d
LEFT JOIN devicelog dl
on d.LastDeviceLogID = dl.DeviceLogID
try this with inner join
SELECT d.id, dl.timestamp
FROM device d
INNER JOIN devicelog dl
ON dl.device_id = d.id
ORDER BY d.id DESC LIMIT 1
to list all devices you may consider the GROUP BY
SELECT d.id, dl.timestamp
FROM device d
INNER JOIN devicelog dl
ON dl.device_id = d.id
GROUP BY d.id
ORDER BY d.id DESC
Edit:
if you have many indices , thn better to index your column id
try this
ALTER TABLE `device` ADD INDEX `id` (`id`)
I have these tables and queries as defined in sqlfiddle.
First my problem was to group people showing LEFT JOINed visits rows with the newest year. That I solved using subquery.
Now my problem is that that subquery is not using INDEX defined on visits table. That is causing my query to run nearly indefinitely on tables with approx 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?
Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, is using non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not sepend on the grouping items). This can give indeterminate (semi-random) results.
Second, ( to avoid the indeterminate results) you have added an ORDER BY inside a subquery which (non-standard or not) is not documented anywhere in MySQL documentation that it should work as expected. So, it may be working now but it may not work in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
The: SQL-fiddle
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;
Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0 But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.