How can I speed up a multiple inner join query? - mysql

I have two tables. The first table (users) is a simple "id, username" with 100,00 rows and the second (stats) is "id, date, stat" with 20M rows.
I'm trying to figure out which username went up by the most in stat and here's the query I have. On a powerful machine, this query takes minutes to complete. Is there a better way to write it to speed it up?
SELECT a.id, a.username, b.stat, c.stat, (b.stat - c.stat) AS stat_diff
FROM users AS a
INNER JOIN stats AS b ON (b.id=a.id)
INNER JOIN stats AS c ON (c.id=a.id)
WHERE b.date = '2016-01-10'
AND c.date = '2016-01-13'
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100
the other way i tried but it doesn't seem optimal is
SELECT a.id, a.username,
(SELECT b.stat FROM stats AS b ON (b.id=a.id) AND b.date = '2016-01-10') AS start,
(SELECT c.stat FROM stats AS c ON (c.id=a.id) AND c.date = '2016-01-14') AS end,
((SELECT b.stat FROM stats AS b ON (b.id=a.id) AND b.date = '2016-01-10') -
(SELECT c.stat FROM stats AS c ON (c.id=a.id) AND c.date = '2016-01-14')) AS stat_diff
FROM users AS a
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100

Introduction
Let's suppose we rewrite sentence like this:
SELECT a.id, a.username, b.stat, c.stat, (b.stat - c.stat) AS stat_diff
FROM users AS a
INNER JOIN stats AS b ON
b.date = STR_TO_DATE('2016-01-10', '%Y-%m-%d' ) and b.id=a.id
INNER JOIN stats AS c ON
c.date = STR_TO_DATE('2016-01-13', '%Y-%m-%d' ) and c.id=a.id
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100
And we ensure than:
users table has index on field id:
stats has index on composite field date, id: create index stats_idx_d_i on stats ( date, id );
Then
Database optimizer may use indexes to selected a Restricted Set of Date ('RSD'), that means, rows that match filtered dates. This is fast.
But
You are sorting by a calculated field:
(b.stat - c.stat) AS stat_diff #<-- calculated
ORDER BY stat_diff DESC #<-- this forces to calculate it
They are no possible optimization on this sort because you should to calculate one by one all results on your 'RSD' (restricted set of data).
Conclusion
The question is, how may rows they are on your 'RSD'? If only they are few hundreds rows you query may run fast, else, your query will be slow.
Any case, you should to be sure the first step of query ( without sorting ) is made by index and no fullscanning. Use Explain command to be sure.

All you need to do is to help optimizer.At a bare minimum.have a check list which looks like below
1.Are my join columns indexed ?
2.Are the where clauses Sargable
3.are there any implicit,explicit conversions
4.Am i seeing any statistics issues
one more interesting aspect to look at is how is your data distributed,once you understand the data,you will be able to intrepret the execution plan and alter it as per your need
EX:
Think like i have any customers table with 100rows,Each one has a minimum of 10 orders(total upto 10000 orders).Now if you need to find out only top 3 orders by date,you dont want a scan happening of orders table
Now in your case ,i may not go with second option,even though the optimizer may choose a good plan for this one as well,I will go first approach and try to see if the execution time is acceptable.if not then i will go through my check list and try to tune it further

The Query Seems OK, Verify your Indexes ..
Or
Try this Query
SELECT a.id, a.username, b.stat, c.stat, (b.stat - c.stat) AS stat_diff
FROM users AS a
INNER JOIN (select id,stat from stats where date = '2016-01-10') AS b ON (b.id=a.id)
INNER JOIN (select id,stat from stats where date = '2016-01-13') AS c ON (c.id=a.id)
GROUP BY a.id
ORDER BY stat_diff DESC
LIMIT 100

Related

How to Optimize mysql query the brings huge quantity of rows?

I really need your help. I´m doing one work from my universitiy and before I come here I read a lot of things from the documentations of mysql, searched and searched but none of this helped me in my sql query. Look I have this query:
SELECT a.nome, COUNT(*)
FROM publ p JOIN auth a on p.pubid = a.pubid
WHERE p.pubid IN (SELECT pubid
FROM auth
GROUP BY pubid
HAVING COUNT(*) < 3) // THIS VALUE 3 here I have to do with value 2, 4 and 5
GROUP BY a.nome // in different querys.
ORDER BY COUNT(*) DESC, a.nome ASC
I tried to put index in the where clause but I never get the results and takes to long time. What can I do to increase my query to bring me more faster the results? Thank you for the help
I would create these indexes and reorder the query
CREATE INDEX publ_pubid ON publ(pubid);
CREATE INDEX auth_pubid ON auth(pubid, nome);
SELECT a.nome, COUNT(*)
FROM (
SELECT pubid
FROM auth
GROUP BY pubid
HAVING COUNT(*) < 3
) L
LEFT JOIN publ p on L.pubid=publ.pubid
JOIN auth a on p.pubid = a.pubid
GROUP BY a.nome
ORDER BY COUNT(*) DESC, a.nome ASC;

Merging 3 Tables, Limiting 1 Table With Multiple Fields Needed

Been looking into this for awhile. Hoping someone might be able to provide some insight. I have 3 tables. All of which I'm grabbing multiple columns, but the 3rd I need to limit the output to just the most recent timestamp entry, BUT still display multiple columns.
If I have the following data [ Please see SQL Fiddle ]:
http://sqlfiddle.com/#!2/84b91/6
The fiddle is a list of (names) in Table1(users), (job_name,years) in Table2(job), and then (score, timestamp) in Table3(job_details). All linked together by the users id.
I am definitely not great at MYSQL. I know I'm missing something.. possibly a series of JOINs. I have been able to get Table 1, Table 2 and one column of Table 3 by doing this:
select a.id, a.name, b.job_name, b.years,
(select c.timestamp
from job_details as c
where c.user_id = a.id
order by c.timestamp desc limit 1) score
from users a, job as b where a.id = b.user_id;
At this point, I can get multiple column data on the first two columns, limit the 3rd to one value and sort that value on the last timestamp...
My question is: How does one go about adding a second column to the limit? In the example in the fiddle, I'd like to add the score as well as the timestamp to the output.
I'd like the output to be:
NAME, JOB, YEARS, SCORE, TIMESTAMP. The last two columns would only be the last entry in job_details sorted by the most recent TIMESTAMP.
Please let me know if more information is required! Thank you for your time!
T
Try this:
select a.id, a.name, b.job_name, b.years, c.timestamp, c.score
from users a
INNER JOIN job as b ON a.id = b.user_id
INNER JOIN (SELECT jd.user_id, jd.timestamp, jd.score
FROM job_details as jd
INNER JOIN (select user_id, MAX(timestamp) as tstamp
from job_details
GROUP BY user_id) as max_ts ON jd.user_id = max_ts.user_id
AND jd.timestamp = max_ts.tstamp
) as c ON a.id = c.user_id
;

Fast query slows down when within a subquery

I have this query:
SELECT timestmp
FROM devicelog
WHERE device_id = 5
ORDER BY id DESC LIMIT 1
which takes less than 0.001 seconds to fetch, but once I put it in a subquery, it slows down to about 3.05 seconds. Any reason to why it does this, or how I can remedy it?
Here is the second query (which is the one I want to optimize):
SELECT device.id,
(SELECT timestmp
FROM devicelog
WHERE device_id = device.id
ORDER BY id DESC LIMIT 1) as timestmp
FROM device
Table "device" only has like 10-15 records in it (devicelog has several million), so I would assume it goes 1 by 1 through each record and then executes the subquery, but obviously it's doing something else. The PK of devicelog is the id, and the PK in device is its id as well. There is an index on devicelog for timestmp (which is a datetime) and device_id which is also a FK back to devicelog. There are other indices as well, but they are irrelevant (things like names, descriptions, etc).
I just need it to loop through devices, then display the last timestamp record.
If I list each device in PHP, then perform the first query separately, it will be perform extraordinary well, but I want to do this in one entire query. Like, I could do something like (pseudocode):
foreach($row in <devicelog>)
query('<first query> where id = $row[id]')
Doing an entire join would be too expensive on devicelog just because the high count.
Your question and query do not match what you are looking for per the comment to the first answer offered.
What you can do is a pre-aggregate on a per-device basis to get the max ID log, then join that to your master list of devices...
SELECT
d.name,
d.id,
DeviceMax.lastTime
from
device d
LEFT JOIN ( select dl.device_id,
max( dl.timestamp ) lastTime
from
devicelog dl
group by
dl.device_id ) as DeviceMax
ON d.id = DeviceMax.device_id
Now, if you needed other stuff from the device log for that specific entry, we could just add on to that...
LEFT JOIN devicelog dl2
on DeviceMax.Device_id = dl2.Device_id
AND DeviceMax.lastTime = dl2.timestamp
then you can get any other columns from the "dl2" alias added to your query.
Also, for your device log table, I would have a covering index on (device_id, timestamp)
COMMENT FROM FEEDBACK
Then I would offer this as a suggestion for you, which is something also common in the world of web development when someone needs "highest count", or "most" of something, or "most recent" etc.
Denormalize your Device table with only one respect... add a column for the LastDeviceLogID. Then, whenever your DeviceLog has an entry added to it, you just use an after insert trigger that does...
update Device
set LastDeviceLogID = newRecord.DeviceLogID
where ID = newRecord.Device_ID
Columns may not be exact, but the principle is there. This way, you never need to do a LIMIT 1, MAX(), etc and go through your millions of records, you can get as simply as doing
SELECT
d.name,
d.id,
dl.timestamp,
dl.othercolumns
from
device d
LEFT JOIN devicelog dl
on d.LastDeviceLogID = dl.DeviceLogID
try this with inner join
SELECT d.id, dl.timestamp
FROM device d
INNER JOIN devicelog dl
ON dl.device_id = d.id
ORDER BY d.id DESC LIMIT 1
to list all devices you may consider the GROUP BY
SELECT d.id, dl.timestamp
FROM device d
INNER JOIN devicelog dl
ON dl.device_id = d.id
GROUP BY d.id
ORDER BY d.id DESC
Edit:
if you have many indices , thn better to index your column id
try this
ALTER TABLE `device` ADD INDEX `id` (`id`)

MySQL is not using INDEX in subquery

I have these tables and queries as defined in sqlfiddle.
First my problem was to group people showing LEFT JOINed visits rows with the newest year. That I solved using subquery.
Now my problem is that that subquery is not using INDEX defined on visits table. That is causing my query to run nearly indefinitely on tables with approx 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?
Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, is using non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not sepend on the grouping items). This can give indeterminate (semi-random) results.
Second, ( to avoid the indeterminate results) you have added an ORDER BY inside a subquery which (non-standard or not) is not documented anywhere in MySQL documentation that it should work as expected. So, it may be working now but it may not work in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
The: SQL-fiddle
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;
Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0 But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.

MySQL query performance problem - INNER JOIN, ORDER BY, DESC

I have got this query:
SELECT
t.type_id, t.product_id, u.account_id, t.name, u.username
FROM
types AS t
INNER JOIN
( SELECT user_id, username, account_id
FROM users WHERE account_id=$account_id ) AS u
ON
t.user_id = u.user_id
ORDER BY
t.type_id DESC
1st question:
It takes around 30seconds to do this at the moment with only 18k records in types table.
The only indexes at the moment are only a primary indexes with just id.
Would the long time be caused by a lack of more indexes? Or would it be more to do with the structure of this query?
2nd question:
How can I add the LIMIT so I only get 100 records with the highest type_id?
Without changing the results, I think it is a 100 times faster if you don't make a sub-select of your users table. It is not needed at all in this case.
You can just add LIMIT 100 to get only the first 100 results (or less if there aren't a 100).
SELECT SQL_CALC_FOUND_ROWS /* Calculate the total number of rows, without the LIMIT */
t.type_id, t.product_id, u.account_id, t.name, u.username
FROM
types t
INNER JOIN users u ON u.user_id = t.user_id
WHERE
u.account_id = $account_id
ORDER BY
t.type_id DESC
LIMIT 1
Then, execute a second query to get the total number of rows that is calculated.
SELECT FOUND_ROWS()
That sub select on MySQL is going to slow down your query. I'm assuming that this
SELECT user_id, username, account_id
FROM users WHERE account_id=$account_id
doesn't return many rows at all. If that's the case then the sub select alone won't explain the delay you're seeing.
Try throwing an index on user_id in your types table. Without it, you're doing a full table scan of 18k records for each record returned by that sub select.
Inner join the users table and add that index and I bet you see a huge increase in speed.