Is it possible to make this query more efficient?
SELECT DISTINCT(static.template.name)
FROM probedata.probe
INNER JOIN static.template ON probedata.probe.template_fk = static.template.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
Thanks.
First, I'm going to rewrite it using table aliases, so I can read it:
SELECT DISTINCT(t.name)
FROM probedata.probe p INNER JOIN
static.template t
ON p.template_fk = t.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH);
Let me make two assumptions:
name is unique in static.template
creation_time comes from probe
The first assumption is particularly useful. You can rewrite the query as:
SELECT t.name
FROM static.template t
WHERE EXISTS (SELECT 1
              FROM probedata.probe p
              WHERE p.template_fk = t.pk AND
                    p.creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
             );
The second assumption only affects the indexing. For this query, you want indexes on probe(template_fk, creation_time).
If template has wide records, then an index on template(pk, name) might also prove useful.
This changes the execution plan to a scan of the template table with a fast lookup into the probe table using the index. There is no additional processing to remove duplicates.
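The suggested indexes can be created like this (the index names are illustrative):

```sql
-- Supports the EXISTS probe: equality seek on template_fk, then range on creation_time
CREATE INDEX idx_probe_template_time ON probedata.probe (template_fk, creation_time);

-- Optional covering index in case template rows are wide
CREATE INDEX idx_template_pk_name ON static.template (pk, name);
```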
Two more things could help:
If you use this statement in a script, assign the result of DATE_SUB(NOW(), INTERVAL 6 MONTH) to a variable before the SELECT statement and use that variable in the WHERE condition, so the date calculation is executed only once.
Instead of DISTINCT, try the query with just the column in the SELECT clause (no DISTINCT) plus a GROUP BY static.template.name, and see whether there is an improvement.
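A sketch of both suggestions (the @cutoff variable name is illustrative):

```sql
-- Compute the cutoff once, before the statement runs
SET @cutoff = DATE_SUB(NOW(), INTERVAL 6 MONTH);

-- GROUP BY variant of the original DISTINCT query
SELECT t.name
FROM probedata.probe p
INNER JOIN static.template t ON p.template_fk = t.pk
WHERE p.creation_time >= @cutoff
GROUP BY t.name;
```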
In table1 we have ID and DOB (date of birth, e.g. 01/01/1980).
In table2 we have id and other columns.
How do I get all rows from table2 where the id corresponds to someone under the age of 20?
I currently have:
SELECT *
FROM table2
WHERE id IN (
SELECT id
FROM table1
WHERE TIMESTAMPDIFF(Year,DOB,curdate()) <= 20
)
Is my solution efficient?
You would be better off calculating a date 20 years ago and asking whether the table data is after that date. That way one calculation is needed, not a calculation for every row in the table. Any time you perform a calculation on row data, an index cannot be used, which is catastrophic for performance if DOB is indexed.
Also watch the boundary semantics: TIMESTAMPDIFF(YEAR, ...) returns the number of complete years between the two dates, so TIMESTAMPDIFF(YEAR, DOB, CURDATE()) <= 20 keeps people right up to their 21st birthday, whereas comparing DOB against a date exactly 20 years ago means strictly under 20. Decide which definition of "under 20" you actually want.
SELECT id
FROM table1
where DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
Personally I use JOIN rather than IN because, once you learn the pattern, it is easy to extend with LEFT joins to look for rows that don't exist or don't match. In practical terms the query optimizer usually rewrites IN and JOIN to execute the same way, although some databases perform poorly with IN because they execute it differently from a join.
SELECT *
FROM
table1 t1
INNER JOIN table2 t2
ON t1.id = t2.id
where t1.DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
Mech is making the point that SELECT * should be avoided in production code, and that's relevant for the most part: always select only the columns you need. If a table is indexed and you only need columns that are in the index, SELECT * is a performance hit, because the database has to use the index to find the matching rows and then look up the rows themselves. If you specify only the columns you need, it can decide whether it can answer the query purely from the index, for a speed boost. The only time I might consider SELECT * is in a subquery that the optimizer will rewrite anyway.
Always alias your tables and use the aliases. This prevents your query from breaking if you later add a column to either table with the same name as a column in the other table. Adding a column doesn't usually cause bugs or crashes, but if a query is just "SELECT name FROM a JOIN b ..." and only table a has a name column, it will start failing as soon as a name column is added to b. Specifying a.name prevents this.
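To illustrate, using the hypothetical tables a and b from above:

```sql
-- Breaks with "ambiguous column" as soon as both tables have a name column:
-- SELECT name FROM a JOIN b ON a.id = b.id;

-- Unambiguous either way:
SELECT a.name
FROM a
JOIN b ON a.id = b.id;
```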
For MySQL
SELECT table2.*
FROM table1
JOIN table2 ON table1.id = table2.id
WHERE table1.dob >= CURRENT_DATE - INTERVAL 20 YEAR
Historically, MySQL has implemented EXISTS more efficiently than IN. So, I would recommend:
SELECT t2.*
FROM table2 t2
WHERE EXISTS (SELECT 1
              FROM table1 t1
              WHERE t1.id = t2.id AND
                    TIMESTAMPDIFF(Year, t1.DOB, curdate()) <= 20
             );
For performance, you want an index on table1(id, DOB).
You can also change the year comparison to:
t1.DOB > curdate() - interval 20 year
That is presumably the logic you want and the index could take advantage of it.
I recommend this over a join because there is no risk of duplicate rows in the result set. Your question does not say that id is unique in table1, so duplicates are a risk. Even if there are no duplicates, this often has the best performance as well.
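The index suggested above can be created like this (the index name is illustrative):

```sql
-- Composite index so the correlated subquery can seek on (id, DOB)
CREATE INDEX idx_table1_id_dob ON table1 (id, DOB);
```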
I have the following query, which is quite complex. Even though I have tried to understand how to do this using various sources online, all the examples use simple queries while mine is more complex, so I haven't found a solution.
Here's my current query:
SELECT id, category_id, name
FROM orders AS u1
WHERE added < (UTC_TIMESTAMP() - INTERVAL 60 SECOND)
AND (executed IS NULL OR executed < (UTC_DATE() - INTERVAL 1 MONTH))
AND category_id NOT IN (SELECT category_id
FROM orders AS u2
WHERE executed > (UTC_TIMESTAMP() - INTERVAL 5 SECOND)
GROUP BY category_id)
GROUP BY category_id
ORDER BY added ASC
LIMIT 10;
The table orders is like this:
id
category_id
name
added
executed
The purpose of the query is to list n orders (here, 10) that belong to different categories (I have hundreds of categories), i.e. 10 distinct category_id values. The orders shown must be older than a minute (INTERVAL 60 SECOND) and either never executed (IS NULL) or executed more than a month ago.
The NOT IN subquery is there to avoid treating a category_id that has already been treated less than 5 seconds ago. So in the result I remove all the categories that have been treated within the last 5 seconds.
I've tried to change the NOT IN into a LEFT JOIN clause or a NOT EXISTS, but the switch returns a different set of entries, so I believe it's not correct.
Here's what I have so far :
SELECT u1.id, u1.category_id, u1.name, u1.added
FROM orders AS u1
LEFT JOIN orders AS u2
ON u1.category_id = u2.category_id
AND u2.executed > (UTC_TIMESTAMP() - INTERVAL 5 SECOND)
WHERE u1.added < (UTC_TIMESTAMP() - INTERVAL 60 SECOND)
AND (u1.executed IS NULL OR u1.executed < (UTC_DATE() - INTERVAL 1 MONTH))
AND u2.category_id IS NULL
GROUP BY u1.category_id
LIMIT 10
Thank you for your help.
Here's some sample data to try. It has no "executed less than 5 seconds ago" case, since it's near impossible to capture a correct value for that, but it gives you some data to help out :)
Your query is using a column which doesn't exist in the table as a join condition.
ON u1.domain = u2.category_id
There is no column called "domain" in your example data.
Your query also uses the incorrect operator in your second join condition.
AND u2.executed > (UTC_TIMESTAMP() - INTERVAL 5 SECOND)
should be
AND u2.executed < (UTC_TIMESTAMP() - INTERVAL 5 SECOND)
as is used in your first query
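For reference, here is a NOT EXISTS rewrite that keeps the semantics of the original NOT IN (exclude any category with an order executed in the last 5 seconds). It is a sketch based on the query in the question; note that the GROUP BY with non-aggregated columns follows the original query's pattern and relies on MySQL's permissive mode (ONLY_FULL_GROUP_BY would reject it).

```sql
SELECT u1.id, u1.category_id, u1.name
FROM orders AS u1
WHERE u1.added < (UTC_TIMESTAMP() - INTERVAL 60 SECOND)
  AND (u1.executed IS NULL OR u1.executed < (UTC_DATE() - INTERVAL 1 MONTH))
  AND NOT EXISTS (SELECT 1
                  FROM orders AS u2
                  WHERE u2.category_id = u1.category_id
                    AND u2.executed > (UTC_TIMESTAMP() - INTERVAL 5 SECOND))
GROUP BY u1.category_id
ORDER BY u1.added ASC
LIMIT 10;
```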
My table has 3 columns (thing1, thing2, datetime). What I want to do is pull all the records for any thing1 that has more than one unique thing2 entry.
SELECT thing1,thing2 FROM db WHERE datetime >= DATE_SUB(NOW(), INTERVAL 1 HOUR) GROUP BY thing1 HAVING COUNT(DISTINCT(thing2)) > 1;
That gets me almost what I need, but of course the GROUP BY means it returns only one row per thing1, and I need all the thing1, thing2 rows.
Any suggestions would be greatly appreciated.
I think you should use GROUP BY this way:
SELECT thing1,thing2
FROM db WHERE datetime >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY thing1, thing2 HAVING COUNT(*) > 1;
Shamelessly copying Matt S' original answer as a starting point to provide an alternative...
SELECT db.thing1, db.thing2
FROM db
INNER JOIN (
SELECT thing1, MIN(`datetime`) As `datetime`
FROM db
WHERE `datetime` >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY thing1
HAVING COUNT(DISTINCT thing2) > 1
) AS subQ ON db.thing1 = subQ.thing1 AND db.`datetime` >= subQ.`datetime`
;
MySQL is very finicky, performance-wise, when it comes to subqueries in WHERE clauses; this JOIN alternative may perform faster than such a query.
It may also perform faster than in its current form with the MIN removed from the subquery (and from the join condition) and a redundant datetime condition supplied on the outer WHERE instead.
Which is best will depend on data, hardware, configuration, etc...
Sidenote: I would caution against using keywords such as datetime as field (or table) names; they tend to bite when least expected, and at the very least should always be escaped with backticks, as in the example.
If I'm understanding what you're looking for, you'll want to use your current query as a sub-query:
SELECT thing1, thing2 FROM db WHERE thing1 IN (
SELECT thing1 FROM db
WHERE datetime >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY thing1
HAVING COUNT(DISTINCT thing2) > 1
);
The subquery is already getting the thing1s you want, so this lets you get the original rows back from the table, limited to just those thing1s.
I have the following query;
SELECT count(*) as aggregate
FROM cars
INNER JOIN snapshots ON cars.snapshot_id=snapshots.id
WHERE cars.snapshot_id=194340
That works correctly - counting all cars with a specific snapshot_id.
But now I want to only count cars who have been there for greater than 1 hour from when the snapshot was created.
I've tried this, but it doesn't work:
SELECT count(*) as aggregate
FROM cars
INNER JOIN snapshots ON cars.snapshot_id=snapshots.id
WHERE cars.snapshot_id=194340
AND time_arrive_destination <= DATE_SUB('snapshots.created_at', INTERVAL 1 HOUR)
I know I'm close - because if I hard code the timestamp from snapshots.created_at - it works:
SELECT count(*) as aggregate
FROM cars
INNER JOIN snapshots ON cars.snapshot_id=snapshots.id
WHERE cars.snapshot_id=194340
AND time_arrive_destination <= DATE_SUB("2015-08-31 20:29:49", INTERVAL 1 HOUR)
So how do I use a joined column field snapshots.created_at as a variable for date_sub()?
The problem is that 'snapshots.created_at' in single quotes is a string literal, not a column reference. If you need (or want) to escape an identifier (e.g. a column name that is a reserved word), enclose the identifier in backtick characters, not single quotes, e.g.
AND time_arrive_destination <= DATE_SUB(`snapshots`.`created_at`, INTERVAL 1 HOUR)
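Putting it together, the full query becomes (no quoting is actually required here, since neither identifier is a reserved word):

```sql
SELECT count(*) as aggregate
FROM cars
INNER JOIN snapshots ON cars.snapshot_id = snapshots.id
WHERE cars.snapshot_id = 194340
  AND time_arrive_destination <= DATE_SUB(snapshots.created_at, INTERVAL 1 HOUR)
```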
Note: I found this similar question but it does not address my issue, so I do not believe this is a duplicate.
I have two simple MySQL tables (created with the MyISAM engine), Table1 and Table2.
Both of the tables have 3 columns, a date-type column, an integer ID column, and a float value column. Both tables have about 3 million records and are very straightforward.
The contents of the tables looks like this (with Date and Id as primary keys):
Date Id Var1
2012-1-27 1 0.1
2012-1-27 2 0.5
2012-2-28 1 0.6
2012-2-28 2 0.7
(assume Var1 becomes Var2 for the second table).
Note that for each (year, month, ID) triplet, there will only be a single entry. But the actual day of the month that appears is not necessarily the final day, nor is it the final weekday, nor is it the final business day, etc... It's just some day of the month. This day is important as an observation day in other tables, but the day-of-month itself doesn't matter between Table1 and Table2.
Because of this, I cannot rely on Date + INTERVAL 1 MONTH to produce the matching day-of-month for the date it should match to that is one month ahead.
I'm looking to join the two tables on Date and Id but where the values from the second table (Var2) come from 1-month ahead than Var1.
This sort of code accomplishes it, but I am noticing significant performance degradation, as explained below.
-- This is exceptionally slow for me
SELECT b.Date,
b.Id,
a.Var1,
b.Var2
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date)
-- This returns quickly, but if I use it as a sub-query
-- then the parent query is very slow.
SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1
-- That is, the above is fast, but this is super slow:
select b.Date,
b.Id,
a.Var1,
b.Var2
FROM (SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1) a
JOIN Table2 b
ON YEAR(a.FutureDate) = YEAR(b.Date)
AND MONTH(a.FutureDate) = MONTH(b.Date)
AND a.Id = b.Id
I've tried re-ordering the JOIN criteria, thinking maybe that matching on Id first in the code would change the query execution plan, but it seems to make no difference.
When I say "super slow", I mean that option #1 above doesn't return results for all 3 million records even after more than an hour. Option #2 returns in under 10 minutes, but option #3 again takes longer than an hour.
I don't understand why the introduction of the date lag makes it take so long.
How can I
profile the queries to understand why it takes a long time?
write a better query for joining tables based on a 1-month date lag (where day-of-month that results from the 1-month lag may cause mismatches).
Here is an alternative approach:
SELECT b.Date, b.Id, b.Var2,
       (select a.var1
        from Table1 a
        where a.id = b.id and a.date < b.date
        order by a.date desc
        limit 1
       ) as var1
FROM Table2 b;
Be sure the primary index is set up with id first and then date on Table1. Otherwise, create another index Table1(id, date).
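If the primary key is not already in that order, the extra index can be added like this (the index name is illustrative):

```sql
-- Lets the correlated subquery seek on id and then range-scan backwards on date
ALTER TABLE Table1 ADD INDEX idx_table1_id_date (id, `Date`);
```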
Note that this assumes that the preceding date is for the preceding month.
Here's another alternative way to go about this:
SELECT thismonth.Date,
thismonth.Id,
thismonth.Var1 AS Var1_thismonth,
lastmonth.Var1 AS Var1_lastmonth
FROM Table2 AS thismonth
JOIN
(SELECT id, Var1,
DATE(DATE_FORMAT(Date,'%Y-%m-01')) as MonthStart
FROM Table2
) AS lastmonth
ON ( thismonth.id = lastmonth.id
AND thismonth.Date >= lastmonth.MonthStart + INTERVAL 1 MONTH
AND thismonth.Date < lastmonth.MonthStart + INTERVAL 2 MONTH
)
To get this to perform ideally, I think you're going to need a compound covering index on (id, Date, Var1).
It works by generating a derived table containing Id,MonthStart,Var1 and then joining the original table to it by a sequence of range scans. Hence the compound covering index.
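That covering index could be created like this (the index name is illustrative):

```sql
-- Covers the derived table and the range scans: id equality, Date range, Var1 read from the index
CREATE INDEX idx_id_date_var1 ON Table2 (id, `Date`, Var1);
```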
The other answers gave very useful tips, but ultimately, without making significant modifications to the index structure of my data (which is not feasible at the moment), those methods would not work faster (in any meaningful sense) than what I had already tried in the question.
Ollie Jones gave me the idea to use date formatting, and coupling that with the TIMESTAMPDIFF function seems to make it passably fast, though I still welcome any comments explaining why the use of YEAR, MONTH, DATE_FORMAT, and TIMESTAMPDIFF have such wildly different performance properties.
SELECT b.Date,
b.Id,
b.Var2,
a.Date,
a.Id,
a.Var1
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND (TIMESTAMPDIFF(MONTH,
DATE_FORMAT(a.Date, '%Y-%m-01'),
DATE_FORMAT(b.Date, '%Y-%m-01')) = 1)