MySQL join date columns with 1-month lag and performance issues

Note: I found this similar question but it does not address my issue, so I do not believe this is a duplicate.
I have two simple MySQL tables (created with the MyISAM engine), Table1 and Table2.
Both tables have 3 columns: a date-type column, an integer ID column, and a float value column. Both tables have about 3 million records and are very straightforward.
The contents of the tables look like this (with Date and Id as primary keys):
Date        Id   Var1
2012-1-27   1    0.1
2012-1-27   2    0.5
2012-2-28   1    0.6
2012-2-28   2    0.7
(assume Var1 becomes Var2 for the second table).
Note that for each (year, month, ID) triplet, there will only be a single entry. But the actual day of the month that appears is not necessarily the final day, nor is it the final weekday, nor is it the final business day, etc... It's just some day of the month. This day is important as an observation day in other tables, but the day-of-month itself doesn't matter between Table1 and Table2.
Because of this, I cannot rely on Date + INTERVAL 1 MONTH to produce a date that matches the corresponding row one month ahead.
I'm looking to join the two tables on Date and Id, but where the values from the second table (Var2) come from one month ahead of Var1.
This sort of code will accomplish it, but I am noticing a significant performance degradation with this, explained below.
-- This is exceptionally slow for me
SELECT b.Date,
       b.Id,
       a.Var1,
       b.Var2
FROM Table1 a
JOIN Table2 b
  ON a.Id = b.Id
 AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
 AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date)
-- This returns quickly, but if I use it as a sub-query
-- then the parent query is very slow.
SELECT Date + INTERVAL 1 MONTH AS FutureDate,
       Id,
       Var1
FROM Table1
-- That is, the above is fast, but this is super slow:
SELECT b.Date,
       b.Id,
       a.Var1,
       b.Var2
FROM (SELECT Date + INTERVAL 1 MONTH AS FutureDate,
             Id,
             Var1
      FROM Table1) a
JOIN Table2 b
  ON YEAR(a.FutureDate) = YEAR(b.Date)
 AND MONTH(a.FutureDate) = MONTH(b.Date)
 AND a.Id = b.Id
I've tried re-ordering the JOIN criteria, thinking maybe that matching on Id first in the code would change the query execution plan, but it seems to make no difference.
When I say "super slow", I mean that option #1 above doesn't return the results for all 3 million records even if I wait for over an hour. Option #2 returns in less than 10 minutes, but option #3 takes longer than an hour again.
I don't understand why the introduction of the date lag makes it take so long.
How can I
1. profile the queries to understand why they take so long?
2. write a better query for joining tables based on a 1-month date lag (where the day-of-month that results from the 1-month lag may cause mismatches)?
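One way to start on the profiling half is MySQL's EXPLAIN, which reports the join order, access type, usable keys, and row estimates for a query; a minimal sketch applied to the first query above:

EXPLAIN
SELECT b.Date, b.Id, a.Var1, b.Var2
FROM Table1 a
JOIN Table2 b
  ON a.Id = b.Id
 AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
 AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date);
-- Wrapping b.Date in YEAR()/MONTH() prevents index use on that column,
-- which typically shows up in the plan as a scan of Table2 for every row of Table1.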

Here is an alternative approach:
SELECT b.Date,
       b.Id,
       (SELECT a.Var1
        FROM Table1 a
        WHERE a.Id = b.Id AND a.Date < b.Date
        ORDER BY a.Date DESC
        LIMIT 1
       ) AS Var1,
       b.Var2
FROM Table2 b;
Be sure the primary key on Table1 is set up with Id first and then Date. Otherwise, create another index on Table1(Id, Date), as sketched below.
Note that this assumes that the preceding date is for the preceding month.
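A minimal sketch of that suggested index (the index name here is arbitrary):

-- Supports the correlated subquery's lookup by (Id, Date)
CREATE INDEX idx_table1_id_date ON Table1 (Id, Date);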

Here's another alternative way to go about this, shown as a self-join on Table1 (where Var1 lives); the same pattern applies when joining Table1 to Table2:
SELECT thismonth.Date,
       thismonth.Id,
       thismonth.Var1 AS Var1_thismonth,
       lastmonth.Var1 AS Var1_lastmonth
FROM Table1 AS thismonth
JOIN (SELECT Id, Var1,
             DATE(DATE_FORMAT(Date, '%Y-%m-01')) AS MonthStart
      FROM Table1
     ) AS lastmonth
  ON (thismonth.Id = lastmonth.Id
      AND thismonth.Date >= lastmonth.MonthStart + INTERVAL 1 MONTH
      AND thismonth.Date < lastmonth.MonthStart + INTERVAL 2 MONTH)
To get this to perform ideally, I think you're going to need a compound covering index on (id, Date, Var1).
It works by generating a derived table containing Id,MonthStart,Var1 and then joining the original table to it by a sequence of range scans. Hence the compound covering index.
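A sketch of that covering index (hypothetical name):

-- Lets both the derived table and the range scans be served entirely from the index
CREATE INDEX idx_table1_id_date_var1 ON Table1 (Id, Date, Var1);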

The other answers gave very useful tips, but ultimately, without making significant modifications to the index structure of my data (which is not feasible at the moment), those methods would not work faster (in any meaningful sense) than what I had already tried in the question.
Ollie Jones gave me the idea to use date formatting, and coupling that with the TIMESTAMPDIFF function seems to make it passably fast, though I still welcome any comments explaining why the use of YEAR, MONTH, DATE_FORMAT, and TIMESTAMPDIFF have such wildly different performance properties.
SELECT b.Date,
       b.Id,
       b.Var2,
       a.Date,
       a.Id,
       a.Var1
FROM Table1 a
JOIN Table2 b
  ON a.Id = b.Id
 AND (TIMESTAMPDIFF(MONTH,
                    DATE_FORMAT(a.Date, '%Y-%m-01'),
                    DATE_FORMAT(b.Date, '%Y-%m-01')) = 1)

Related

Select rows from tableA based on age calculation from tableB

In table1 we have ID and DOB (date of birth, e.g. 01/01/1980).
In table2 we have id and other columns.
How do I get all rows from table2 if the id is under the age of 20?
I currently have:
SELECT *
FROM table2
WHERE id IN (
    SELECT id
    FROM table1
    WHERE TIMESTAMPDIFF(YEAR, DOB, CURDATE()) <= 20
)
Is my solution efficient?
You would be better off calculating a date 20 years ago and asking if the table data is after that date. That way one calculation is needed, not a calculation for every row in the table. Any time you perform a calculation on row data, an index on that column cannot be used. This is a catastrophe for performance if DOB is indexed.
Note also that TIMESTAMPDIFF(YEAR, ...) doesn't simply count calendar-year rollovers; it gives you the number of complete years between the two dates, so the difference between 31 Dec and 1 Jan reports as 0 years even though the year number changed. Make sure that truncation matches the age logic you want.
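A quick check of that behavior:

SELECT TIMESTAMPDIFF(YEAR, '2019-12-31', '2020-01-01');  -- 0: no complete year has elapsed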
SELECT id
FROM table1
WHERE DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
Personally, I use JOIN rather than IN because once you learn the pattern it is easy to extend with LEFT JOINs to look for rows that don't exist or don't match the pattern, though in practical terms the query optimizer rewrites IN and JOIN to execute the same way anyway. Some databases do perform poorly with IN because they execute it differently from joins.
SELECT *
FROM table1 t1
INNER JOIN table2 t2
    ON t1.id = t2.id
WHERE t1.DOB > DATE_SUB(CURDATE(), INTERVAL 20 YEAR)
Mech is making the point that SELECT * should be avoided in production code, which is relevant for the most part: always select only the columns you need. If a table is indexed and you only need columns that are in the index, SELECT * hurts performance because the DB has to use the index to find the matching rows and then look up the rows themselves; if you specify only the columns you need, the DB can decide whether it can answer the query purely from the index, for a speed boost. The only time I might consider SELECT * is in a subquery that the optimizer will rewrite anyway.
Always alias your tables and use the aliases. This prevents your query from breaking if you later add a column to one table with the same name as a column in the other table. Adding a column doesn't usually cause bugs or crashes, but if a query just says "select name from a join b.." and only table a has a name column, it will start failing as soon as a name column is added to b. Specifying a.name prevents this.
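A tiny illustration of that point, using hypothetical tables a and b that both have an id column:

-- Breaks with an ambiguous-column error if b ever gains a `name` column:
SELECT name FROM a JOIN b ON a.id = b.id;
-- Safe: the alias pins the column to one table
SELECT t1.name FROM a t1 JOIN b t2 ON t1.id = t2.id;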
For MySQL
SELECT table2.*
FROM table1
JOIN table2 ON table1.id = table2.id
WHERE table1.dob >= CURRENT_DATE - INTERVAL 20 YEAR
Historically, MySQL has implemented EXISTS more efficiently than IN. So, I would recommend:
SELECT t2.*
FROM table2 t2
WHERE EXISTS (SELECT 1
              FROM table1 t1
              WHERE t1.id = t2.id AND
                    TIMESTAMPDIFF(YEAR, t1.DOB, CURDATE()) <= 20
             );
For performance, you want an index on table1(id, DOB).
You can also change the year comparison to:
t1.DOB > CURDATE() - INTERVAL 20 YEAR
That is presumably the logic you want and the index could take advantage of it.
I recommend this over a join because there is no risk of getting duplicate rows in the result set. Your question does not specify that id is unique in table1, so duplicates are a risk. Even if there are no duplicates, this would also have the best performance under many circumstances.
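A sketch of the index suggested above (the name is hypothetical):

CREATE INDEX idx_table1_id_dob ON table1 (id, DOB);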

SQL Performance on selecting first/last row for each user on bigger data table

I have read through quite a few posts with greatest-n-per-group but still don't seem to find a good solution in terms of performance. I'm running 10.1.43-MariaDB.
I'm trying to get the change in data values in a given time frame, and so I need to get the earliest and latest row from this period. The largest number of rows in a time frame that needs to be calculated right now is around 700k, and it's only going to grow. For now I have just resorted to doing two queries, one for the latest and one for the earliest date, but even this currently performs slowly. The table looks like this:
user_id  data  date
4567     109   28/06/2019 11:04:45
4252     309   18/06/2019 11:04:45
4567     77    18/02/2019 11:04:45
7893     1123  22/06/2019 11:04:45
4252     303   11/06/2019 11:04:45
4252     317   19/06/2019 11:04:45
The date and user_id columns are indexed. Without an ORDER BY, the rows aren't in any particular order in the database, if that makes a difference.
The furthest I have gotten with this issue is a query like this for a year-long period (700k datapoints):
SELECT user_id,
       MIN(date) AS date, data
FROM datapoint_table
WHERE date >= '2019-01-14'
GROUP BY user_id
This gives me the right date and user_id very fast, in around ~0.05s. But as is the common issue with greatest-n-per-group, the rest of the row (data in this case) is not from the same row as the MIN(date). I have read other similar questions and tried a subquery like this:
SELECT a.user_id, a.date, a.data
FROM datapoint_table a
INNER JOIN (
    SELECT datapoint_table.user_id,
           MIN(date) AS date, data
    FROM datapoint_table
    WHERE date >= '2019-01-01'
    GROUP BY user_id
) b ON a.user_id = b.user_id AND a.date = b.date
This query takes around 15s to complete and gets the correct data value. 15s, though, is just way too long, and I must be doing something wrong when the first query is so fast. I also tried taking MAX(data) - MIN(data) with GROUP BY user_id, but it also performed slowly.
What would be a more efficient way of getting the data value from the same row as the date, or even the difference between the latest and earliest data for each user?
Assuming you are using a fairly recent version of either MariaDB or MySQL, then ROW_NUMBER would probably be the most efficient way to find the earliest record for each user:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) rn
    FROM datapoint_table
    WHERE date > '2019-01-14'
)
SELECT user_id, data, date
FROM cte
WHERE rn = 1;
To the above you could also consider adding the following index:
CREATE INDEX idx_user_date ON datapoint_table (user_id, date);
You could also try the following variant index with the columns reversed:
CREATE INDEX idx_date_user ON datapoint_table (date, user_id);
It is not clear which version of the index would perform the best, which would depend on your data and the execution plan. Ideally one of the above two indices would help the database execute ROW_NUMBER, along with the WHERE clause.
If your database version does not support ROW_NUMBER, then you may continue with your current approach:
SELECT d1.user_id, d1.data, d1.date
FROM datapoint_table d1
INNER JOIN
(
    SELECT user_id, MIN(date) AS min_date
    FROM datapoint_table
    WHERE date > '2019-01-14'
    GROUP BY user_id
) d2
    ON d1.user_id = d2.user_id AND d1.date = d2.min_date
WHERE
    d1.date > '2019-01-14';
Again, the indices suggested should at least speed up the execution of the GROUP BY subquery.
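The question also asks about the difference between the latest and earliest data per user. If window functions are available (MariaDB 10.2+ / MySQL 8.0+), a hedged sketch of that calculation:

WITH ranked AS (
    SELECT user_id,
           FIRST_VALUE(data) OVER (PARTITION BY user_id ORDER BY date ASC)  AS first_data,
           FIRST_VALUE(data) OVER (PARTITION BY user_id ORDER BY date DESC) AS last_data
    FROM datapoint_table
    WHERE date > '2019-01-14'
)
-- first_data/last_data are constant per user, so DISTINCT collapses to one row each
SELECT DISTINCT user_id, last_data - first_data AS data_change
FROM ranked;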

How to summarize column price from 2 tables with no relationship?

In a MySQL database, I have tblA.price and tblB.price. There is no relationship between them.
I want to summarize all sales from tblA and tblB. It would be something like sum(tblA.price) + sum(tblB.price) AS total.
How could I perform that query?
The union that #cjsfj shows would work, and here are a couple of other options:
Do two scalar subqueries and add them together.
select (select sum(price) from tblA) + (select sum(price) from tblB) as total;
Do two queries from your application, get the results of each, and add them together.
Quick and dirty. Unions aren't great, but if you have a fixed number of tables, this will work. Performance might get tricky, and it's definitely not pretty, but it answers your question.
select sum(price) as totalprice from
    (select sum(a.price) as price
     from tblA a
     union all
     select sum(b.price) as price
     from tblB b) as ab
To complement the other answers: I had some issues when there was no result from one of the tables; it was returning NULL. For that reason I had to filter that result and turn it into 0. I just did IFNULL(SUM(field), 0).
Here is my final query:
SELECT
    IFNULL(SUM(tblA.price), 0) + (SELECT
                                      IFNULL(SUM(fieldB), 0)
                                  FROM
                                      tblB
                                  WHERE
                                      creation_date BETWEEN '$startDT' AND DATE_SUB(NOW(), INTERVAL 1 HOUR)
                                 ) AS amount
FROM
    tableA tblA
WHERE
    tblA.transaction_date BETWEEN '$startDT' AND DATE_SUB(NOW(), INTERVAL 1 HOUR)
    AND tblA.service_type <> 'service1'
    AND tblA.service_type <> 'service2'
    AND tblA.service_type <> 'service3';

more efficient inner join query

Is it possible to make this query more efficient?
SELECT DISTINCT(static.template.name)
FROM probedata.probe
INNER JOIN static.template ON probedata.probe.template_fk = static.template.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
Thanks.
First, I'm going to rewrite it using table aliases, so I can read it:
SELECT DISTINCT t.name
FROM probedata.probe p INNER JOIN
     static.template t
     ON p.template_fk = t.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH);
Let me make two assumptions:
name is unique in static.template
creation_time comes from probe
The first assumption is particularly useful. You can rewrite the query as:
SELECT t.name
FROM static.template t
WHERE EXISTS (SELECT 1
              FROM probedata.probe p
              WHERE p.template_fk = t.pk AND
                    p.creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
             );
The second assumption only affects the indexing. For this query, you want an index on probe(template_fk, creation_time).
If template has wide records, then an index on template(pk, name) might also prove useful.
This will change the execution plan to be a scan of the template table with a fast look up using the index into the probe table. There will be no additional processing to remove duplicates.
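A sketch of those indexes (hypothetical names, column lists as suggested above):

CREATE INDEX idx_probe_tfk_ctime ON probedata.probe (template_fk, creation_time);
CREATE INDEX idx_template_pk_name ON static.template (pk, name);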
Two things that could help:
1. If you use this statement in a script, assign the result of DATE_SUB(NOW(), INTERVAL 6 MONTH) to a variable before the SELECT statement and use that variable in the WHERE condition (so the last-6-months calculation executes just once).
2. Instead of DISTINCT, try and see if there is an improvement using just the column in the SELECT clause (so no DISTINCT) and adding GROUP BY static.template.name, as sketched below.
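A sketch of that GROUP BY variant, reusing the aliases from the first answer:

SELECT t.name
FROM probedata.probe p
INNER JOIN static.template t ON p.template_fk = t.pk
WHERE p.creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
GROUP BY t.name;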

SQL Work out the average time difference between total rows

I've searched around SO and can't seem to find a question with an answer that works fine for me. I have a table with almost 2 million rows in, and each row has a MySQL Date formatted field.
I'd like to work out (in seconds) how often a row was inserted, so work out the average difference between the dates of all the rows with a SQL query.
Any ideas?
-- EDIT --
Here's what my table looks like
id, name, date (datetime), age, gender
If you want to know how often (on average) a row was inserted, I don't think you need to calculate all the differences. You only need to sum up the differences between adjacent rows (adjacent based on the timestamp) and divide the result by the number of the summands.
The formula
((T1 - T0) + (T2 - T1) + ... + (TN - T(N-1))) / N
can obviously be simplified to merely
(TN - T0) / N
So, the query would be something like this:
SELECT TIMESTAMPDIFF(SECOND, MIN(date), MAX(date)) / (COUNT(*) - 1)
FROM atable
Make sure the number of rows is more than 1, or you'll run into division by zero. Still, if you like, you can guard against that with a simple trick:
SELECT
IFNULL(TIMESTAMPDIFF(SECOND, MIN(date), MAX(date)) / NULLIF(COUNT(*) - 1, 0), 0)
FROM atable
Now you can safely run the query against a table with a single row.
Give this a shot:
select AVG(theDelay) from (
    select TIMESTAMPDIFF(SECOND, a.date, b.date) as theDelay
    from myTable a
    join myTable b on b.date = (select MIN(x.date)
                                from myTable x
                                where x.date > a.date)
) p
The inner query joins each row with the next row (by date) and returns the number of seconds between them. That query is then encapsulated and is queried for the average number of seconds.
EDIT: If your ID column is auto-incrementing and they are in date order, you can speed it up a bit by joining to the next ID row rather than the MIN next date.
select AVG(theDelay) from (
    select TIMESTAMPDIFF(SECOND, a.date, b.date) as theDelay
    from myTable a
    join myTable b on b.id = (select MIN(x.id)
                              from myTable x
                              where x.id > a.id)
) p
EDIT2: As brilliantly commented by Mikael Eriksson, you may be able to just do:
select TIMESTAMPDIFF(SECOND, MIN(date), MAX(date)) / COUNT(*) from myTable
There's a lot you can do with this to eliminate off-peak hours or big spans without a new record, using the join syntax in my first example.
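For example, a hedged sketch that ignores any gap longer than an hour (the 3600-second cutoff is an illustrative assumption):

select AVG(theDelay) from (
    select TIMESTAMPDIFF(SECOND, a.date, b.date) as theDelay
    from myTable a
    join myTable b on b.date = (select MIN(x.date)
                                from myTable x
                                where x.date > a.date)
) p
where theDelay <= 3600;  -- drop spans with no new record for over an hour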
Try this:
select avg(diff) as AverageSecondsBetweenDates
from (
    select TIMESTAMPDIFF(SECOND, t1.MyDate, min(t2.MyDate)) as diff
    from MyTable t1
    inner join MyTable t2 on t2.MyDate > t1.MyDate
    group by t1.MyDate
) a