LEAD() and LAG() functionality in SQL Server 2008 - sql-server-2008

Hope all the SQL GURUS out there are doing great :)
I am trying to simulate LEAD() and LAG() functionality in SQL Server 2008.
This is my scenario: I have a temp table which is populated using the base query with the business logic for mileage. I want to calculate accumulated mileage for each user per day.
The temp table is set up using ROW_NUMBER(), so I have all the data needed in the temp table except the accumulated mileage.
I have tried using a CTE with the base query and joining it to itself, but I couldn't get it working. I am attaching a screenshot of the same.
Any help/suggestion would be appreciated.

You are on the right track by joining the table to itself. Below are two methods that should work fine here. The first trick is in your ROW_NUMBER(): be sure to partition by the user id and sort by the date. Then you can use either an INNER JOIN with aggregation or CROSS APPLY to build your running totals.
Setting up the data with the partitioned ROW_NUMBER():
CREATE TABLE #Data (
RowNum INT,
UserId INT,
Date DATE,
Miles INT
)
INSERT #Data
SELECT
ROW_NUMBER() OVER (PARTITION BY UserId
ORDER BY Date) AS RowNum,
*
FROM (
SELECT 1, '2015-01-01', 5
UNION ALL SELECT 1, '2015-01-02', 6
UNION ALL SELECT 2, '2015-01-01', 7
UNION ALL SELECT 2, '2015-01-02', 3
UNION ALL SELECT 2, '2015-01-03', 2
) T (UserId, Date, Miles)
Use INNER JOIN with Aggregation
SELECT
D1.UserId,
D1.Date,
D1.Miles,
SUM(D2.Miles) AS [Total]
FROM #Data D1
INNER JOIN #Data D2
ON D1.UserId = D2.UserId
AND D2.RowNum <= D1.RowNum
GROUP BY
D1.UserId,
D1.Date,
D1.Miles
Use CROSS APPLY for the running total
SELECT
UserId,
Date,
Miles,
Total
FROM #Data D1
CROSS APPLY (
SELECT SUM(Miles) AS Total
FROM #Data
WHERE UserId = D1.UserId
AND RowNum <= D1.RowNum
) RunningTotal
Output is the same for each method:
UserId      Date        Miles       Total
----------- ----------- ----------- -----------
1           2015-01-01  5           5
1           2015-01-02  6           11
2           2015-01-01  7           7
2           2015-01-02  3           10
2           2015-01-03  2           12
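For reference, from SQL Server 2012 onward both the running total and the LAG()/LEAD() functions from the title are built in, so the self-join is no longer needed. A minimal sketch of the windowed equivalent, using the #Data table set up above:
SELECT
    UserId,
    Date,
    Miles,
    SUM(Miles) OVER (PARTITION BY UserId
                     ORDER BY Date
                     ROWS UNBOUNDED PRECEDING) AS Total,
    LAG(Miles) OVER (PARTITION BY UserId ORDER BY Date) AS PrevMiles -- NULL on each user's first row
FROM #Data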

Related

Calculate how many times a particular day falls between two dates

I need an Amazon Redshift SQL query to calculate how many times a particular day of the month falls between two dates.
Date format - YYYY-MM-DD
For example - Start date = 2019-06-14, End date = 2019-10-09, Day - 2nd of every month
Now, I want to count how often the 2nd of the month falls between 2019-06-14 and 2019-10-09.
The actual result for the above example should be 4, since the 2nd falls between 2019-06-14 and 2019-10-09 four times.
I tried the DATE_DIFF and months_between functions of Redshift, but failed to build the logic, since I could not work out what the math or equation should be.
It seems to me that you want to select from a calendar table; that is how you can solve your problem. You'll notice that the query looks a little hacky because Redshift does not support any functions for generating sequences, which leaves you creating sequence tables yourself (see seq_10 and seq_1000). Once you have a sequence, you can easily create a calendar with all the information you need (e.g. day_of_month).
Here is the query answering your question:
WITH seq_10 as (
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1
), seq_1000 as (
select
row_number() over () - 1 as n
from
seq_10 a cross join
seq_10 b cross join
seq_10 c
), calendar as (
select '2018-01-01'::date + n as date,
extract(day from date) as day_of_month,
extract(dow from date) as day_of_week
from seq_1000
)
select count(*) from calendar
where day_of_month = 2
and date between '2019-06-14' and '2019-10-09'
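As a usage note, the same calendar CTEs can answer similar questions by just swapping the final filter; for example, counting how many Mondays fall in the range would look roughly like this (assuming Redshift's DOW numbering, where Sunday = 0 and Monday = 1):
select count(*) from calendar -- reuses the seq_10, seq_1000 and calendar CTEs from above
where day_of_week = 1
and date between '2019-06-14' and '2019-10-09'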

Selecting the latest row for each customer that matches these params

I have an SQL table that stores reports. Each row has a customer_id and a building_id and when I have the customer_id, I need to select the latest row (most recent create_date) for each building with that customer_id.
report_id  customer_id  building_id  create_date
1          1            4            1553561789
2          2            5            1553561958
3          1            4            1553561999
4          2            5            1553562108
5          3            7            1553562755
6          3            8            1553570000
I would expect to get report IDs 3, 4, 5 and 6 back.
How do I query this? I have tried a few sub-selects and group by and not gotten it to work.
If you are using MySQL 8+, then ROW_NUMBER is a good approach here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id, building_id
ORDER BY create_date DESC) rn
FROM yourTable
)
SELECT
report_id,
customer_id,
building_id,
create_date
FROM cte
WHERE rn = 1;
If there could be more than one customer/building pair tied for the latest creation date, and you want to capture all ties, then replace ROW_NUMBER with RANK, and use the same query.
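For illustration, the ties-included variant really is the same query with RANK swapped in:
WITH cte AS (
    SELECT *, RANK() OVER (PARTITION BY customer_id, building_id
                           ORDER BY create_date DESC) rn
    FROM yourTable
)
SELECT
    report_id,
    customer_id,
    building_id,
    create_date
FROM cte
WHERE rn = 1;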
Another variation:
SELECT a.*
FROM myTable a
WHERE a.create_date = (SELECT MAX(create_date)
FROM myTable b
WHERE b.customer_id = a.customer_id
AND b.building_id = a.building_id)
You can try searching for "effective dated records" to see various approaches.

SUM DISTINCT based off a specific column?

I am attempting to sum the balance of each customer only once. Normally I would use a SUM DISTINCT expression; however, one column is throwing it off and the row is no longer "distinct".
For example:
Customer Number - Customer Name - Exception Type - Balance
CIF13443 - Paul - 1 - 125
CIF13452 - Ryan - 2 - 85
CIF13443 - Paul - 3 - 125
CIF13765 - Linda - 1 - 90
In this case, if I use SUM DISTINCT, Paul's balance will be summed up twice simply because he has a different exception type, where in fact I only want SSRS to sum each customer once. One way would be to SUM the "Balance" of only DISTINCT customer numbers. Possibly by grouping the Customer Number? Is that possible on SSRS? I would rather not touch the SQL Dataset.
Thanks!
Any input is appreciated.
You can use UNION to de-duplicate the rows:
Query 1 (using UNION):
;WITH testdata(CustomerNumber,CustomerName,ExceptionType,Balance)AS(
SELECT 'CIF13443','Paul','1',125 UNION all
SELECT 'CIF13452','Ryan ','2',85 UNION all
SELECT 'CIF13443','Paul','3',125 UNION all
SELECT 'CIF13765','Linda','1',90
)
SELECT CustomerNumber,CustomerName,Balance FROM testdata UNION SELECT NULL,NULL,NULL WHERE 1!=1
/*
SELECT CustomerNumber,CustomerName,SUM(t.Balance) AS total_balance
FROM (
SELECT CustomerNumber,CustomerName,Balance FROM testdata UNION SELECT NULL,NULL,NULL
) AS t WHERE CustomerNumber IS NOT null
GROUP BY CustomerNumber,CustomerName
*/
CustomerNumber  CustomerName  Balance
--------------  ------------  -----------
CIF13443        Paul          125
CIF13452        Ryan          85
CIF13765        Linda         90
Using a window function to pick one row from each set of duplicates:
;WITH testdata(CustomerNumber,CustomerName,ExceptionType,Balance)AS(
SELECT 'CIF13443','Paul','1',125 UNION all
SELECT 'CIF13452','Ryan ','2',85 UNION all
SELECT 'CIF13443','Paul','3',125 UNION all
SELECT 'CIF13765','Linda','1',90
)
SELECT CustomerNumber,CustomerName,SUM(t.Balance) AS total_balance
FROM (
SELECT CustomerNumber,CustomerName,Balance,ROW_NUMBER()OVER(PARTITION BY CustomerNumber,CustomerName,Balance ORDER BY testdata.ExceptionType) seq FROM testdata
) AS t WHERE t.seq=1
GROUP BY CustomerNumber,CustomerName
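If all you need is the grand total, a shorter variant of the same idea (only a sketch, and it assumes duplicated rows for a customer always carry the same Balance) is to keep one distinct CustomerNumber/Balance pair per customer and sum those:
;WITH testdata(CustomerNumber,CustomerName,ExceptionType,Balance)AS(
SELECT 'CIF13443','Paul','1',125 UNION all
SELECT 'CIF13452','Ryan ','2',85 UNION all
SELECT 'CIF13443','Paul','3',125 UNION all
SELECT 'CIF13765','Linda','1',90
)
SELECT SUM(Balance) AS total_balance -- 125 + 85 + 90 = 300
FROM (
    SELECT DISTINCT CustomerNumber, Balance FROM testdata
) AS t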

How to find daily average over a time period in mysql?

I have a table with two columns:
MARKS
CREAT_TS
I want the daily average marks between two dates (e.g. startDate & endDate).
I've made the following query:
select SUM(MARKS)/ COUNT(date(CREAT_TS)) AS DAILY_AVG_MARKS,
date(CREAT_TS) AS DATE
from TABLENAME
group by date(CREAT_TS)
With this query I can get the daily average only if there's a row in the database for the date. But my requirement is that even if there's no row, I want to show 0 for that date.
I mean I want the query to return X rows if there are X days between (startDate, endDate)
Can anyone help me. :(
You need to create a set of integers that you can add to the dates. The following will give you an idea:
select d.thedate, coalesce(avg(t.MARKS), 0) as DAILY_AVG_MARKS
from (select startdate + interval num day as thedate
      from (select d1.d + 10 * d2.d + 100 * d3.d as num
            from (select 0 as d union select 1 union select 2 union select 3 union select 4 union
                  select 5 union select 6 union select 7 union select 8 union select 9
                 ) d1 cross join
                 (select 0 as d union select 1 union select 2 union select 3 union select 4 union
                  select 5 union select 6 union select 7 union select 8 union select 9
                 ) d2 cross join
                 (select 0 as d union select 1 union select 2 union select 3 union select 4 union
                  select 5 union select 6 union select 7 union select 8 union select 9
                 ) d3
           ) n cross join
           (select XXX as startdate, YYY as enddate) const
      where startdate + interval num day <= enddate
     ) d left outer join
     tablename t
     on date(t.CREAT_TS) = d.thedate
group by d.thedate
All the complication is in creating a set of sequential dates for the report. If you have a numbers table or a calendar table, then the SQL looks much simpler.
How does this work? The first big subquery has two parts. The first just generates the numbers from 0 to 999 by cross joining the digits 0-9 and doing some arithmetic. The second joins this to the two dates, startdate and enddate -- you need to put the correct values in for XXX and YYY. With this table, you have all the dates between the two values. If you need more than 999 days, just add in another cross join.
This is then LEFT JOINed to your data table. The result is that all dates appear in the GROUP BY output.
In terms of reporting, there are advantages and disadvantages to doing this in the presentation layer. Basically, the advantage to doing it in SQL is that the report layer is simpler. The advantage to doing it in the reporting layer is that the SQL is simpler. It is hard for an outsider to make that judgement.
My suggestion would be to create a numbers table that you can just use in reports like this. Then the query will look simpler and you won't have to change the reporting layer.
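For instance, with a permanent numbers table holding a column n with the values 0, 1, 2, ... (the table and column names here are just placeholders), the report query above shrinks to roughly this sketch:
select d.thedate, coalesce(avg(t.MARKS), 0) as DAILY_AVG_MARKS
from (select startdate + interval n day as thedate
      from numbers cross join
           (select XXX as startdate, YYY as enddate) const
      where startdate + interval n day <= enddate
     ) d left outer join
     tablename t
     on date(t.CREAT_TS) = d.thedate
group by d.thedate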

How to find missing data rows using SQL?

My problem:
I have a MySQL database that stores a great amount of meteorological data in chronological order (new data are inserted every 10 minutes). Unfortunately there have been several blackouts and hence certain rows are missing. I recently managed to obtain certain backup files from the weather station and now I want to use these to fill in the missing data.
The DB is structured like this (example):
date* the data
2/10/2009 10:00 ...
2/10/2009 10:10 ...
( Missing data!)
2/10/2009 10:40 ...
2/10/2009 10:50 ...
2/10/2009 11:00 ...
...
*= datetime type, primary key
My idea:
Since backup and database are located on different computers and traffic is quite slow, I thought of creating a MySQL-query that, when run, will return a list of all missing dates in a specified range of time. I could then extract these dates from the backup and insert them to the database.
The question:
How can I write such a query? I don't have permission to create any auxiliary tables. Is it possible to formulate a "virtual table" of all required dates in the specified interval and then use it in a JOIN? Or are there entirely different approaches to solving my problem?
Edit:
Yes, the timestamps are consistently in the form shown above (always 10 minutes), except that some are just missing.
Okay, what about the temporary tables? Is there an elegant way of populating them with the time-range automatically? What if two scripts try to run simultaneously, does this cause problems with the table?
select t1.ts as hival, t2.ts as loval
from metdata t1, metdata t2
where t2.ts = (select max(ts) from metdata t3
where t3.ts < t1.ts)
and not timediff(t1.ts, t2.ts) = '00:10:00'
This query will return couplets you can use to select the missing data. The missing data will have a timestamp between hival and loval for each couplet returned by the query.
EDIT - thx for checking, Craig
EDIT2 :
getting the missing timestamps - this SQL gets a bit harder to read, so I'll break it up a bit. First, we need a way to calculate a series of timestamp values between a given low value and a high value in 10 minute intervals. A way of doing this when you can't create tables is based on the following SQL, which produces, as a result set, all of the digits from 0 to 9.
select d1.* from
(select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d1
...now, by combining this table with a couple of copies of itself, we can dynamically generate a list of a specified length
select curdate() +
INTERVAL (d1.digit * 100 + d2.digit * 10 + d3.digit) * 10 MINUTE
as date
from (select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d1
join
(select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d2
join
(select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d3
where (d1.digit * 100 + d2.digit * 10 + d3.digit) between 1 and 42
order by 1
... now this piece of sql is getting close to what we need. It has 2 input variables:
- a starting timestamp (I used curdate() in the example); and
- a number of iterations - the where clause specifies 42 iterations in the example; the maximum with 3 digit tables is 1000 intervals
... which means we can use the original sql to drive the example from above to generate a series of timestamps for each hival lowval pair. Bear with me, this sql is a bit long now...
select daterange.loval + INTERVAL (d1.digit * 100 + d2.digit * 10 + d3.digit) * 10 MINUTE as date
from
(select t1.ts as hival, t2.ts as loval
from metdata t1, metdata t2
where t2.ts = (select max(ts) from metdata t3
where t3.ts < t1.ts)
and not timediff(t1.ts, t2.ts) = '00:10:00'
) as daterange
join
(select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d1
join
(select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d2
join
(select 1 as digit
union select 2
union select 3
union select 4
union select 5
union select 6
union select 7
union select 8
union select 9
union select 0
) as d3
where (d1.digit * 100 + d2.digit * 10 + d3.digit) between 1 and
round((time_to_sec(timediff(hival, loval))-600) /600)
order by 1
...now there's a bit of epic sql
NOTE: using the digits table 3 times gives a maximum gap it will cover of a bit over 6 days (1,000 intervals x 10 minutes is roughly 6.9 days)
If you can create a temporary table, you can solve the problem with a JOIN
CREATE TEMPORARY TABLE DateRange
(theDate DATETIME);
Populate the table with all 10 minute intervals between your dates, then use the following
SELECT theDate
FROM DateRange dr
LEFT JOIN Meteorological mm on mm.date = dr.theDate
WHERE mm.date IS NULL
The result will be all of the date/times that do not have entries in your weather table.
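If you need a way to populate DateRange without typing out every row, one option (a sketch only; the procedure name and the bounds are placeholders) is a small stored procedure that loops in 10-minute steps:
DELIMITER //
CREATE PROCEDURE fill_date_range(IN p_start DATETIME, IN p_end DATETIME)
BEGIN
    -- insert every 10-minute timestamp from p_start up to and including p_end
    WHILE p_start <= p_end DO
        INSERT INTO DateRange (theDate) VALUES (p_start);
        SET p_start = DATE_ADD(p_start, INTERVAL 10 MINUTE);
    END WHILE;
END //
DELIMITER ;

CALL fill_date_range('2009-02-10 00:00:00', '2009-02-11 00:00:00');
Since MySQL temporary tables are visible only to the session that created them, the CALL has to run in the same session as the CREATE TEMPORARY TABLE; the upside is that two scripts running at the same time each see their own DateRange, which also answers the concurrency concern in the question.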
If you need to quickly find days with missing data, you can use
select Date(mm.Date),144-count(*) as TotMissing
from Meteorological mm
group by Date(mm.Date)
having count(*) < 144
This assumes 24 hours a day, 6 entries per hour (hence 144 rows per day).
Create a temporary table (JOIN). Or take all the dates and query them locally, where you should have free rein (loop/hash).
For the JOIN, your generated reference of all dates is your base table and your data is your joined table. Seek out pairs where the joined data does not exist and select the generated date.
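A minimal sketch of that anti-join shape, with placeholder names for the generated-dates source and the data table:
SELECT g.gen_date
FROM generated_dates g
LEFT JOIN metdata m ON m.ts = g.gen_date
WHERE m.ts IS NULL -- gen_date has no matching row, i.e. it is missing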
As a quick solution using SQL Server, check for dates that do not have a follower at date + interval. I think MySQL does have some sort of date-add function, but you can try something like this. This will show you the ranges where you have missing data.
DECLARE @Table TABLE (
    DateValue DATETIME
)
INSERT INTO @Table SELECT '10 Feb 2009 10:00:00'
INSERT INTO @Table SELECT '10 Feb 2009 10:10:00'
INSERT INTO @Table SELECT '10 Feb 2009 10:40:00'
INSERT INTO @Table SELECT '10 Feb 2009 10:50:00'
INSERT INTO @Table SELECT '10 Feb 2009 11:00:00'
SELECT *
FROM @Table currentVal
WHERE ((SELECT * FROM @Table nextVal WHERE DATEADD(mi, 10, currentVal.DateValue) = nextVal.DateValue) IS NULL
       AND currentVal.DateValue != (SELECT MAX(DateValue) FROM @Table))
   OR ((SELECT * FROM @Table prevVal WHERE DATEADD(mi, -10, currentVal.DateValue) = prevVal.DateValue) IS NULL
       AND currentVal.DateValue != (SELECT MIN(DateValue) FROM @Table))
Note: uses MSSQL syntax. I think MySQL uses DATE_ADD(T1.date, INTERVAL 10 MINUTE) instead of DATEADD, but I haven't tested this.
You can get the missing timestamps with two self-joins:
SELECT T1.[date] AS DateFrom, MIN(T3.[date]) AS DateTo
FROM [test].[dbo].[WeatherData] T1
LEFT JOIN [test].[dbo].[WeatherData] T2 ON DATEADD(MINUTE, 10, T1.date) = T2.date
LEFT JOIN [test].[dbo].[WeatherData] T3 ON T3.date > T1.Date
WHERE T2.[value] IS NULL
GROUP BY T1.[date]
If you have a lot of data, you might want to restrict the range to one month at a time to avoid heavy load on your server, as this operation could be quite intensive.
The results will be something like this:
DateFrom                 DateTo
2009-10-02 10:10:00.000  2009-10-02 10:40:00.000
2009-10-02 11:00:00.000  NULL
The last row represents all data from the last timestamp into the future.
You can then use another join to get the rows from the other database that have a timestamp in between any of these intervals.
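That follow-up join could look something like this sketch (the backup table name is an assumption, and the NULL DateTo row is treated as an open-ended interval, so you may want to cap it):
SELECT b.*
FROM (
    SELECT T1.[date] AS DateFrom, MIN(T3.[date]) AS DateTo
    FROM [test].[dbo].[WeatherData] T1
    LEFT JOIN [test].[dbo].[WeatherData] T2 ON DATEADD(MINUTE, 10, T1.[date]) = T2.[date]
    LEFT JOIN [test].[dbo].[WeatherData] T3 ON T3.[date] > T1.[date]
    WHERE T2.[value] IS NULL
    GROUP BY T1.[date]
) gaps
JOIN [backup].[dbo].[WeatherData] b
    ON b.[date] > gaps.DateFrom
   AND (b.[date] < gaps.DateTo OR gaps.DateTo IS NULL)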
This solution uses sub-queries, and there is no need for any explicit temporary tables. I've assumed your backup data is in another database on the other machine; if not you'd only need to do up to step 2 for the result-set you need, and write your program to update the main database accordingly.
The idea is to start out by producing a 'compact' result-set summarising the gap-list. I.e. the following data:
MeasureDate
2009-12-06 13:00:00
2009-12-06 13:10:00
--missing data
2009-12-06 13:30:00
--missing data
2009-12-06 14:10:00
2009-12-06 14:20:00
2009-12-06 14:30:00
--missing data
2009-12-06 15:00:00
Would be transformed into the following where actual gaps are strictly between (i.e. exclusive of) the endpoints:
GapStart             GapEnd
2009-12-06 13:10:00  2009-12-06 13:30:00
2009-12-06 13:30:00  2009-12-06 14:10:00
2009-12-06 14:30:00  2009-12-06 15:00:00
2009-12-06 15:00:00  NULL
The solution query is built up as follows:
1. Obtain all MeasureDates that don't have an entry 10 minutes later, as these will be the starts of gaps. NOTE: The last entry will be included even though it is not strictly a gap, but this won't have any adverse effects.
2. Augment the above by adding the end of each gap, using the first MeasureDate after the start of the gap.
NOTE: The gap-list is compact, and unless you have an exceptionally high prevalence of fragmented gaps, it should not consume much bandwidth when passing that result-set to the backup machine.
3. Use an INNER JOIN with inequalities to identify any missing data that may be available in the backup. (Run tests and checks to verify the integrity of your backup data.)
4. Assuming your backup data is sound, and won't produce anomalous unfounded spikes in your measurements, INSERT the data into your main database.
The following query should be tested (preferably adjusted to run on the backup server for performance reasons).
/* TiC Copyright
This query is writtend (sic) by me, and cannot be used without
expressed (sic) written permission. (lol) */
/*Step 3*/
SELECT gap.GapStart, gap.GapEnd,
rem.MeasureDate, rem.Col1, ...
FROM (
/*Step 2*/
SELECT gs.GapStart, (
SELECT MIN(wd.MeasureDate)
FROM WeatherData wd
WHERE wd.MeasureDate > gs.GapStart
) AS GapEnd
FROM (
/*Step 1*/
SELECT wd.MeasureDate AS GapStart
FROM WeatherData wd
WHERE NOT EXISTS (
SELECT *
FROM WeatherData nxt
WHERE nxt.MeasureDate = DATEADD(mi, 10, wd.MeasureDate)
)
) gs
) gap
INNER JOIN RemoteWeatherData rem ON
rem.MeasureDate > gap.GapStart
AND rem.MeasureDate < gap.GapEnd
The insert...
INSERT INTO WeatherData (MeasureDate, Col1, ...)
SELECT /*gap.GapStart, gap.GapEnd,*/
rem.MeasureDate, rem.Col1, ...
...
Do a self join and then calculate the max values that are smaller and have a difference larger than your interval.
In Oracle I'd do it like this (with ts being the timestamp column):
Select t1.ts, max(t2.ts)
FROM atable t1 join atable t2 on t1.ts > t2.ts
GROUP BY t1.ts
HAVING (t1.ts - max(t2.ts))*24*60 > 10
There will be better ways to handle the difference calculation in MySQL, but I hope the idea comes across.
This query will give you the timestamps directly after and before each outage, and you can build from there.
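For MySQL, a rough translation of the same idea (untested sketch) could use TIMESTAMPDIFF for the minute difference:
SELECT t1.ts, MAX(t2.ts) AS prev_ts
FROM atable t1
JOIN atable t2 ON t1.ts > t2.ts
GROUP BY t1.ts
HAVING TIMESTAMPDIFF(MINUTE, MAX(t2.ts), t1.ts) > 10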