In table1 we have ID and DOB (date of birth, e.g. 01/01/1980).
In table2 we have id and other columns.
How do I get all rows from table2 whose id belongs to someone under the age of 20?
I currently have:
SELECT *
FROM table2
WHERE id IN (
SELECT id
FROM table1
WHERE TIMESTAMPDIFF(Year,DOB,curdate()) <= 20
)
Is my solution efficient?
You would be better off calculating the date 20 years ago and asking whether the table data is after that date. That way only one calculation is needed, rather than a calculation for every row in the table. Any time you perform a calculation on row data, an index cannot be used, which is catastrophic for performance if DOB is indexed.
Also be careful with the semantics: TIMESTAMPDIFF(YEAR, ...) returns the number of complete years between the two dates (it only increments once the full anniversary has passed), so at the boundaries it may not match the "age" comparison you have in mind. An explicit date comparison makes the intent unambiguous:
SELECT id
FROM table1
where DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
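If DOB isn't already indexed, a minimal sketch of the index that lets this comparison use it (the index name is my own):
CREATE INDEX idx_table1_dob ON table1 (DOB);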
Personally I use a join rather than IN, because once you learn the pattern it is easy to extend it with LEFT joins to look for rows that don't exist or don't match (see the sketch after the join example below). In practical terms the query optimizer usually rewrites IN and JOIN so they execute the same way anyway, although some databases perform poorly with IN because they execute it differently from joins.
SELECT *
FROM
table1 t1
INNER JOIN table2 t2
ON t1.id = t2.id
where t1.DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
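And the LEFT JOIN extension mentioned above, sketched for this example: find the table2 rows that have no under-20 match in table1 (this variant is mine, not part of the question):
SELECT t2.*
FROM table2 t2
LEFT JOIN table1 t1
    ON t1.id = t2.id
    AND t1.DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
WHERE t1.id IS NULL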
Mech is making the point that SELECT * should be avoided in production code, and that is relevant for the most part: always select only the columns you need. If a table is indexed and you only need columns that are in the index, SELECT * hurts performance because the database has to use the index to find the matching rows and then look up the full rows. If you specify only the columns you need, it can decide whether it can answer the query purely from the index, for a speed boost. The only time I might consider SELECT * is in a subquery that the optimizer will rewrite anyway.
Always alias your tables and use the aliases. This prevents your query from breaking if you later add a column to one table that has the same name as a column in the other. Adding a column isn't usually a problem, but if a query reads "select name from a join b .." and only table a has a name column, it will start failing as soon as a name column is added to b. Specifying a.name prevents this.
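A tiny sketch of the safer form (the aliases and the join column a_id are illustrative only):
SELECT a.name
FROM table_a a
JOIN table_b b ON b.a_id = a.id;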
For MySQL
SELECT table2.*
FROM table1
JOIN table2 ON table1.id = table2.id
WHERE table1.dob >= CURRENT_DATE - INTERVAL 20 YEAR
Historically, MySQL has implemented EXISTS more efficiently than IN. So, I would recommend:
SELECT t2.*
FROM table2 t2
WHERE EXISTS (SELECT 1
FROM table1 t1
WHERE t1.id = t2.id AND
TIMESTAMPDIFF(Year, t1.DOB, curdate()) <= 20
);
For performance, you want an index on table1(id, DOB).
You can also change the year comparison to:
t1.DOB > curdate() - interval 20 year
That is presumably the logic you want and the index could take advantage of it.
I recommend this over a join because there is no risk of duplicate rows in the result set. Your question does not specify that id is unique in table1, so duplicates are a risk. Even if there are no duplicates, this would also have the best performance under many circumstances.
I have read through quite a few posts on greatest-n-per-group but still haven't found a good solution in terms of performance. I'm running 10.1.43-MariaDB.
I'm trying to get the change in data values over a given time frame, so I need the earliest and latest row from that period. The largest number of rows in a time frame that needs to be calculated right now is around 700k, and it's only going to grow. For now I have resorted to doing two queries, one for the latest and one for the earliest date, but even that is currently slow. The table looks like this:
user_id data date
4567 109 28/06/2019 11:04:45
4252 309 18/06/2019 11:04:45
4567 77 18/02/2019 11:04:45
7893 1123 22/06/2019 11:04:45
4252 303 11/06/2019 11:04:45
4252 317 19/06/2019 11:04:45
The date and user_id columns are indexed. Without an ORDER BY the rows aren't in any particular order in the database, if that makes a difference.
The furthest I have gotten with this issue is a query like this for a one-year period (700k data points):
SELECT user_id,
MIN(date) as date, data
FROM datapoint_table
WHERE date >= '2019-01-14'
GROUP BY user_id
This gives me the right date and user_id very fast, in around ~0.05s. But as is the common issue with greatest-n-per-group, the rest of the row (data in this case) does not come from the same row as the MIN(date). I have read other similar questions and tried a subquery like this:
SELECT a.user_id, a.date, a.data
FROM datapoint_table a
INNER JOIN (
SELECT datapoint_table.user_id,
MIN(date) as date, data
FROM datapoint_table
WHERE date >= '2019-01-01'
GROUP BY user_id
) b ON a.user_id = b.user_id AND a.date = b.date
This query takes around 15s to complete and gets the correct data value. 15s is just way too long, though, and I must be doing something wrong when the first query is so fast. I also tried computing MAX(data) - MIN(data) with GROUP BY user_id, but it also performed slowly.
What would be a more efficient way of getting the data value from the same row as the MIN/MAX date, or even the difference between the latest and earliest data for each user?
Assuming you are using a fairly recent version of either MariaDB or MySQL, then ROW_NUMBER would probably be the most efficient way to find the earliest record for each user:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) rn
FROM datapoint_table
WHERE date > '2019-01-14'
)
SELECT user_id, data, date
FROM cte
WHERE rn = 1;
To the above you could also consider adding the following index (MySQL and MariaDB require an index name; the names here are arbitrary):
CREATE INDEX idx_user_date ON datapoint_table (user_id, date);
You could also try the variant with the columns reversed:
CREATE INDEX idx_date_user ON datapoint_table (date, user_id);
It is not clear which version of the index would perform best; that depends on your data and the execution plan. Ideally one of the above two indices will help the database execute ROW_NUMBER along with the WHERE clause.
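If you also want the change between the latest and earliest data per user in one pass, here is a sketch along the same lines, again assuming window-function support (table and column names come from your question):
WITH cte AS (
    SELECT user_id, data, date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date ASC)  AS rn_first,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date DESC) AS rn_last
    FROM datapoint_table
    WHERE date > '2019-01-14'
)
SELECT user_id,
       MAX(CASE WHEN rn_last  = 1 THEN data END) -
       MAX(CASE WHEN rn_first = 1 THEN data END) AS data_change
FROM cte
GROUP BY user_id;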
If your database version does not support ROW_NUMBER, then you may continue with your current approach:
SELECT d1.user_id, d1.data, d1.date
FROM datapoint_table d1
INNER JOIN
(
SELECT user_id, MIN(date) AS min_date
FROM datapoint_table
WHERE date > '2019-01-14'
GROUP BY user_id
) d2
ON d1.user_id = d2.user_id AND d1.date = d2.min_date
WHERE
d1.date > '2019-01-14';
Again, the indices suggested should at least speed up the execution of the GROUP BY subquery.
I have two tables in MySQL: table2 contains ranges of serial numbers (unique, 17 digits, varchar(17)) and table1 contains serial values (same format as the ranges).
ex:
table 1:
serial_id serial
1 12345678123456799
table 2:
range_id date start end
1 2012-01-01 12345678123456789 12345678123456999
2 2012-01-01 12345678123457000 12345678123457099
3 2012-01-01 12345678123457100 12345678123457199
I want to find the range id that each serial belongs to. The simplest query that can be used is:
select *
from table1,table2
where table1.serial between table2.start and table2.end
but I want to optimize it to run faster, given the facts below:
The serials and ranges are unique, so each serial belongs to one and only one range; it is not necessary to search other ranges once a matching range is found.
The first 11 digits of each range are the same. For example, one range can be from 12345678120000000 to 12345678129999999.
Serials and ranges are ordered by date, and it is more likely to find matches in early dates. There are about 6,000,000 serial records and about 100,000 range records.
Any idea for a better query?
This is a bit challenging to speed up. Here is one method that I've used with IP address ranges:
select t1.*,
(select t2.range_id
from table2 t2
where t2.start <= t1.serial
order by t2.start desc
limit 1
) as range_id
from table1 t1;
This can take advantage of an index on table2(start, range_id).
Note: this does not check the end of the range. For that, I would add another join . . . although this (unhappily) requires materializing a subquery:
select *
from (select t1.*,
(select t2.range_id
from table2 t2
where t2.start <= t1.serial
order by t2.start desc
limit 1
) as range_id
from table1 t1
) t1 left join
table2 t2
on t1.range_id = t2.range_id and t2.end >= t1.serial;
The additional join wants an index on table2(range_id, end).
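For reference, a sketch of those two indexes (the index names are my own):
CREATE INDEX idx_table2_start_range ON table2 (start, range_id);
CREATE INDEX idx_table2_range_end ON table2 (range_id, end);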
I think a small change to the data model will give a big performance improvement: add a range_id column to table1 as a foreign key.
table 1:
serial_id serial range_id
1 12345678123456799 1
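A sketch of the schema change (the column type and constraint name are my assumptions, and table2.range_id must be indexed for the foreign key):
ALTER TABLE table1
    ADD COLUMN range_id INT,
    ADD CONSTRAINT fk_table1_range_id FOREIGN KEY (range_id) REFERENCES table2 (range_id);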
Then write the following query:
select *
from table1 join table2 using(range_id);
And if that change is impossible, you can use the LIKE operator as below:
select *
from table1 join table2
on(table2.start like concat(left(table1.serial,12),'%'))
where table1.serial between table2.start and table2.end;
The table2.start column must be indexed.
Edit:
Also, increase the number "12" to the maximum possible value, according to the relation between the serial field and the start field.
Is it possible to make this query more efficient?
SELECT DISTINCT(static.template.name)
FROM probedata.probe
INNER JOIN static.template ON probedata.probe.template_fk = static.template.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
Thanks.
First, I'm going to rewrite it using table aliases, so I can read it:
SELECT DISTINCT(t.name)
FROM probedata.probe p INNER JOIN
static.template t
ON p.template_fk = t.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH);
Let me make two assumptions:
name is unique in static.template
creation_time comes from probe
The first assumption is particularly useful. You can rewrite the query as:
SELECT t.name
FROM static.template t
WHERE EXISTS (SELECT 1
FROM probedata.probe p
WHERE p.template_fk = t.pk AND
p.creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
);
The second assumption only affects the indexing. For this query, you want an index on probe(template_fk, creation_time).
If template has wide records, then an index on template(pk, name) might also prove useful.
This will change the execution plan to be a scan of the template table with a fast look up using the index into the probe table. There will be no additional processing to remove duplicates.
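A sketch of those indexes (the index names are my own):
CREATE INDEX idx_probe_tfk_ctime ON probedata.probe (template_fk, creation_time);
CREATE INDEX idx_template_pk_name ON static.template (pk, name);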
These could help:
If you use this statement in a script, assign the result of DATE_SUB(NOW(), INTERVAL 6 MONTH) to a variable before the SELECT statement and use that variable in the WHERE condition (so the function that calculates the date six months ago executes just once).
Instead of DISTINCT, try using just the column in the SELECT clause (no DISTINCT) and add GROUP BY static.template.name to see if there is an improvement.
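A sketch combining both suggestions (the @cutoff variable name is my own):
SET @cutoff := DATE_SUB(NOW(), INTERVAL 6 MONTH);

SELECT t.name
FROM probedata.probe p
INNER JOIN static.template t ON p.template_fk = t.pk
WHERE p.creation_time >= @cutoff
GROUP BY t.name;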
Note: I found this similar question but it does not address my issue, so I do not believe this is a duplicate.
I have two simple MySQL tables (created with the MyISAM engine), Table1 and Table2.
Both of the tables have 3 columns, a date-type column, an integer ID column, and a float value column. Both tables have about 3 million records and are very straightforward.
The contents of the tables look like this (with Date and Id as primary keys):
Date Id Var1
2012-1-27 1 0.1
2012-1-27 2 0.5
2012-2-28 1 0.6
2012-2-28 2 0.7
(assume Var1 becomes Var2 for the second table).
Note that for each (year, month, ID) triplet, there will only be a single entry. But the actual day of the month that appears is not necessarily the final day, nor is it the final weekday, nor is it the final business day, etc... It's just some day of the month. This day is important as an observation day in other tables, but the day-of-month itself doesn't matter between Table1 and Table2.
Because of this, I cannot rely on Date + INTERVAL 1 MONTH to produce the matching day-of-month for the date it should match to that is one month ahead.
I'm looking to join the two tables on Date and Id, but where the values from the second table (Var2) come from one month ahead of Var1.
This sort of code will accomplish it, but I am noticing a significant performance degradation with this, explained below.
-- This is exceptionally slow for me
SELECT b.Date,
b.Id,
a.Var1,
b.Var2
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date)
-- This returns quickly, but if I use it as a sub-query
-- then the parent query is very slow.
SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1
-- That is, the above is fast, but this is super slow:
select b.Date,
b.Id,
a.Var1,
b.Var2
FROM (SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1) a
JOIN Table2 b
ON YEAR(a.FutureDate) = YEAR(b.Date)
AND MONTH(a.FutureDate) = MONTH(b.Date)
AND a.Id = b.Id
I've tried re-ordering the JOIN criteria, thinking maybe that matching on Id first in the code would change the query execution plan, but it seems to make no difference.
When I say "super slow", I mean that option #1 from the code above doesn't return the results for all 3 million records even if I wait for over an hour. Option #2 returns in less than 10 minutes, but then option #3 takes longer than an hour again.
I don't understand why the introduction of the date lag makes it take so long.
How can I
profile the queries to understand why it takes a long time?
write a better query for joining tables based on a 1-month date lag (where day-of-month that results from the 1-month lag may cause mismatches).
Here is an alternative approach:
SELECT b.Date, b.Id,
       (select a.var1
        from Table1 a
        where a.id = b.id and a.date < b.date
        order by a.date desc
        limit 1
       ) as var1,
       b.Var2
FROM Table2 b;
Be sure the primary index is set up with id first and then date on Table1. Otherwise, create another index Table1(id, date).
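For example (the index name is my own):
CREATE INDEX idx_table1_id_date ON Table1 (Id, Date);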
Note that this assumes that the preceding date is for the preceding month.
Here's another alternative way to go about this:
SELECT thismonth.Date,
       thismonth.Id,
       thismonth.Var2,
       lastmonth.Var1 AS Var1_lastmonth
FROM Table2 AS thismonth
JOIN
     (SELECT Id, Var1,
             DATE(DATE_FORMAT(Date,'%Y-%m-01')) as MonthStart
      FROM Table1
     ) AS lastmonth
  ON (    thismonth.Id = lastmonth.Id
      AND thismonth.Date >= lastmonth.MonthStart + INTERVAL 1 MONTH
      AND thismonth.Date <  lastmonth.MonthStart + INTERVAL 2 MONTH
     )
To get this to perform ideally, I think you're going to need a compound covering index on Table1 (Id, Date, Var1), plus an index on Table2 (Id, Date) for the range scans.
It works by generating a derived table containing Id, MonthStart, Var1 and then joining the original table to it with a sequence of range scans, hence the indexes above.
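A sketch of those indexes (the names are my own):
CREATE INDEX idx_table1_id_date_var1 ON Table1 (Id, Date, Var1);
CREATE INDEX idx_table2_id_date ON Table2 (Id, Date);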
The other answers gave very useful tips, but ultimately, without making significant modifications to the index structure of my data (which is not feasible at the moment), those methods would not work faster (in any meaningful sense) than what I had already tried in the question.
Ollie Jones gave me the idea to use date formatting, and coupling that with the TIMESTAMPDIFF function seems to make it passably fast, though I still welcome any comments explaining why the use of YEAR, MONTH, DATE_FORMAT, and TIMESTAMPDIFF has such wildly different performance properties.
SELECT b.Date,
b.Id,
b.Var2,
a.Date,
a.Id,
a.Var1
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND (TIMESTAMPDIFF(MONTH,
DATE_FORMAT(a.Date, '%Y-%m-01'),
DATE_FORMAT(b.Date, '%Y-%m-01')) = 1)
I have a PHP program with a MySQL database which contains many records. Two columns of particular relevance are incidentnumber and date. Both move forward only. However, sometimes a user enters data which is out of sequence; e.g.:
Incident Date
1 Jan 1 2000
2 Jan 1 2010
3 Jan 1 2002
It appears that incident 2 was entered with the wrong date, it should be Jan 1 2001.
Is there any way to query for records where the date is out of sequence? Or do I have to iterate through all records tracking last date to find the error?
ADDED NOTE: The incidents are not sequential (they might go 1,3,6,123, etc). Nor are the dates sequential. And these are columns in the same table.
This command selects any records for which there exists in the same table a record with a lower Incident number but a higher Date.
SELECT * FROM TableName T1 WHERE EXISTS
(SELECT * FROM TableName T2
WHERE T2.Incident < T1.Incident AND T2.Date > T1.Date)
This slightly more complex command will find only records that are out of order in "both directions", meaning they have a later-dated record earlier in the file and an earlier-dated record later in the file. This avoids the situation in which a mistake in a very early record makes all the subsequent records appear out of order. However, it will not catch a problem in the two records with the lowest or highest incident numbers.
SELECT * FROM TableName T1 WHERE EXISTS
(SELECT * FROM TableName T2
WHERE T2.Incident < T1.Incident AND T2.Date > T1.Date)
AND EXISTS
(SELECT * FROM TableName T2
WHERE T2.Incident > T1.Incident AND T2.Date < T1.Date)
Finally, as ruakh points out in the comments, the above query gives you ALL the out-of-order records. Although that is, technically, what you wanted, it makes it difficult to find the "point of breakage" in the chain of dates. The following query will give you only the records where the chain gets messed up, does not require IncidentID to increase monotonically, and allows deletions of incidents.
SELECT * FROM TableName T1 WHERE
Date < (SELECT Date FROM TableName T2 WHERE T2.IncidentID =
(SELECT MAX(IncidentID) FROM TableName T3 WHERE T3.IncidentID < T1.IncidentID))
OR Date > (SELECT Date FROM TableName T2 WHERE T2.IncidentID =
(SELECT MIN(IncidentID) FROM TableName T3 WHERE T3.IncidentID > T1.IncidentID))
(Not tested, since I don't have a copy of MySQL handy).
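If you happen to be on MySQL 8.0+ or MariaDB 10.2+, the same "compare with the previous incident" idea can be sketched with LAG (my own variant, also untested):
SELECT Incident, Date, prev_date
FROM (SELECT Incident, Date,
             LAG(Date) OVER (ORDER BY Incident) AS prev_date
      FROM TableName) t
WHERE Date < prev_date;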
select * from yourtable t1
inner join yourtable t2
on t1.incident=t2.incident-1
and t1.date>t2.date
This selects all of the rows where the date is greater than the next record's date. That should tell you which ones are out of order.
SELECT Incident FROM TableName a
WHERE a.Date > (SELECT b.Date FROM TableName b WHERE b.Incident = (a.Incident + 1))
In case the IncidentID column is always in a regular incremental sequence:
SELECT c.IncidentID AS cincID, p.IncidentID AS pincID,
c.Date AS cDate, p.Date AS pDate,
DATEDIFF(c.Date, p.Date)
FROM Incident c, Incident p
WHERE c.IncidentID = (p.IncidentID + 1)
AND datediff(c.Date, p.Date) < 1