MySQL GROUP BY using 3 data points while skipping the first entry - mysql

I'm trying to write a query that selects data from a table, groups it by date, and calculates an average. It gets a bit complicated because I don't want to average all the data for a given date, but only 3 entries in descending order, also skipping the first one.
I managed to write a query for a single date:
select datum,datum_aladin,veter from napovedt where datum='2023-01-30' and spot='19' order by datum_aladin desc limit 1,3
which returns:
datum     |datum_aladin|veter|
----------+------------+-----+
2023-01-30|  2023013000|  4.0|
2023-01-30|  2023012918|  4.0|
2023-01-30|  2023012912|  4.5|
Now how do I get the average of "veter", and how do I write a query that returns this average for every date in the database?

With MySQL 8 you might try:
Schema (MySQL v8.0)
CREATE TABLE napovedt
(`datum` datetime, `datum_aladin` int, `veter` decimal(8,3))
;
INSERT INTO napovedt
(`datum`, `datum_aladin`, `veter`)
VALUES
('2023-01-30 00:00:00', 2023013100, 2.0),
('2023-01-30 00:00:00', 2023013000, 4.0),
('2023-01-30 00:00:00', 2023012918, 4.0),
('2023-01-30 00:00:00', 2023012912, 4.5),
('2023-01-30 00:00:00', 1023013000, 3.0),
('2023-01-31 00:00:00', 2023013100, 5.0),
('2023-01-31 00:00:00', 2023013000, 4.0),
('2023-01-31 00:00:00', 2023012918, 2.0),
('2023-01-31 00:00:00', 2023012912, 1.5),
('2023-01-31 00:00:00', 1023013000, 3.0)
;
Query #1
with sq as (
select n.datum, n.veter, row_number () over (
partition by datum
order by datum_aladin desc
) rn
from napovedt n
)
select sq.datum, avg(sq.veter) from sq
where sq.rn between 2 and 4
group by sq.datum
order by sq.datum;
datum               | avg(sq.veter)
--------------------+--------------
2023-01-30 00:00:00 | 4.1666667
2023-01-31 00:00:00 | 2.5000000
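As a cross-check, the single-date query from the question can be wrapped and averaged directly, and should match the CTE's 4.1666667 (a sketch; the sample schema above has no spot column, so that filter is dropped):
-- Average of rows 2-4 for one date, using the original LIMIT 1,3 trick:
select datum, avg(veter) as avg_veter
from (
  select datum, veter
  from napovedt
  where datum = '2023-01-30'
  order by datum_aladin desc
  limit 1, 3
) t
group by datum;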
If you do not have MySQL 8, you can emulate ROW_NUMBER() OVER (PARTITION BY ...) with user variables:
select sq.datum, avg(sq.veter) from (
select *, if(@prev <> datum, @rn:=0, @rn), @prev:=datum, @rn:=@rn+1 as rn
from napovedt, (select @rn:=0) as rn, (select @prev:='') as prev
order by datum, datum_aladin desc
) sq
where sq.rn between 2 and 4
group by sq.datum
order by sq.datum;
datum               | avg(sq.veter)
--------------------+--------------
2023-01-30 00:00:00 | 4.1666667
2023-01-31 00:00:00 | 2.5000000
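Note that assigning user variables inside expressions, as this emulation does, has been deprecated since MySQL 8.0.13, so treat it strictly as a pre-8.0 workaround.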

Related

GROUP BY function ret

start               | end                 | category
--------------------+---------------------+---------
2022:10:14 17:13:00 | 2022:10:14 17:19:00 | A
2022:10:01 16:29:00 | 2022:10:01 16:49:00 | B
2022:10:19 18:55:00 | 2022:10:19 19:03:00 | A
2022:10:31 07:52:00 | 2022:10:31 07:58:00 | A
2022:10:13 18:41:00 | 2022:10:13 19:26:00 | B
The table is sample data about trips. The target is to calculate the time consumed by each category, e.g. category A = 02:18:02.
First, I changed the timestamp format in the CSV file to YYYY/MM/DD HH:MM:SS to match MySQL, and removed the headers.
I created a table in MySQL Workbench with the following code:
CREATE TABLE trip (
`start` TIMESTAMP,
`end` TIMESTAMP,  -- backticks: END is a reserved word in MySQL
category VARCHAR(6)
);
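For reproducibility, the five sample rows above can be loaded as INSERT statements, with the timestamps already rewritten into MySQL's format as the question describes (a sketch; the real CSV import may differ):
-- Sample rows from the question, dates converted to YYYY-MM-DD:
INSERT INTO trip (`start`, `end`, category) VALUES
('2022-10-14 17:13:00', '2022-10-14 17:19:00', 'A'),
('2022-10-01 16:29:00', '2022-10-01 16:49:00', 'B'),
('2022-10-19 18:55:00', '2022-10-19 19:03:00', 'A'),
('2022-10-31 07:52:00', '2022-10-31 07:58:00', 'A'),
('2022-10-13 18:41:00', '2022-10-13 19:26:00', 'B');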
Then, to calculate the consumed time, I wrote:
SELECT category, SUM(TIMEDIFF(`end`, `start`)) as length
FROM trip
GROUP BY CATEGORY;
The result was plain numbers: A = 34900 and B = 38000. So I added a CONVERT(..., TIME) as follows:
SELECT category, CONVERT(SUM(TIMEDIFF(`end`, `start`)), TIME) as length
FROM trip
GROUP BY category;
The result looked great, with category A = 03:49:00, but unfortunately category B = NULL instead of 03:08:00.
What did I do wrong, and what different approach should I have taken?
You can do it as follows. This avoids MySQL's TIME value limit of 838:59:59:
SELECT category,
CONCAT(FLOOR(SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))/3600), ":",
       FLOOR((SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))%3600)/60), ":",
       (SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))%3600)%60) as `length`
FROM trip
GROUP BY category;
This version zero-pads, producing times like 00:20:00 instead of 0:20:0 (note the >= 10 comparisons, so that values of exactly 10 are not padded):
SELECT category,
CONCAT(
if(FLOOR(SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))/3600) >= 10, FLOOR(SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))/3600), CONCAT('0',FLOOR(SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))/3600)) ),
":",
if(FLOOR((SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))%3600)/60) >= 10, FLOOR((SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))%3600)/60), CONCAT('0', FLOOR((SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))%3600)/60) ) ),
":",
if( (SUM(TIMESTAMPDIFF(SECOND, `start`, `end`) )%3600)%60 >= 10, (SUM(TIMESTAMPDIFF(SECOND, `start`, `end`) )%3600)%60, concat('0', (SUM(TIMESTAMPDIFF(SECOND, `start`, `end`) )%3600)%60))
) as `length`
FROM trip
GROUP BY category;
You'd calculate the length of each separate trip in seconds, sum the lengths per category, then convert the seconds back to a time (note the argument order: TIMESTAMPDIFF(SECOND, start, end) returns end minus start):
SELECT category, SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND, `start`, `end`))) as `length`
FROM trip
GROUP BY category;
If the SUM() exceeds the limit of the TIME datatype (838:59:59), that maximum value is returned instead. For values that exceed the TIME limit, use:
SELECT category,
CONCAT_WS(':',
secs DIV (60 * 60),
LPAD(secs DIV 60 MOD 60, 2, 0),
LPAD(secs MOD 60, 2, 0)) AS `length`
FROM (
SELECT category, SUM(TIMESTAMPDIFF(SECOND, `start`, `end`)) AS secs
FROM trip
GROUP BY category
) subquery
;
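With the five sample rows shown above, the per-category totals come out to A = 20 minutes (6 + 8 + 6) and B = 65 minutes (20 + 45), i.e. 00:20:00 and 01:05:00 in the zero-padded format.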

How to find cumulative sum between two dates in MySQL?

How to find cumulative sum between two dates taking into account the previous state?
Putting the WHERE condition
WHERE date BETWEEN '2021-02-19 12:00:00' AND '2021-02-21 12:00:00';
doesn't do the job, because the sum then starts from the first date matched by the condition rather than from the first record. I would like to select only part of the whole result (between two dates), but calculate the cumulative sum from the first (initial) state.
I prepared a fiddle:
CREATE TABLE `table1` (
`id` int(11) NOT NULL,
`date` datetime NOT NULL DEFAULT current_timestamp(),
`payment` double NOT NULL
);
INSERT INTO `table1` (`id`, `date`, `payment`) VALUES
(1, '2021-02-16 12:00:00', 100),
(2, '2021-02-17 12:00:00', 200),
(3, '2021-02-18 12:00:00', 300),
(4, '2021-02-19 12:00:00', 400),
(5, '2021-02-20 12:00:00', 500),
(6, '2021-02-21 12:00:00', 600),
(7, '2021-02-22 12:00:00', 700);
SELECT DATE_FORMAT(date, "%Y-%m-%d") AS date,
payment, SUM(payment) OVER(ORDER BY id) AS balance
FROM table1
WHERE date BETWEEN '2021-02-19 12:00:00' AND '2021-02-21 12:00:00';
You must filter the table after you get the cumulative sums:
SELECT *
FROM (
SELECT DATE(date) AS date,
payment,
SUM(payment) OVER(ORDER BY id) AS balance
FROM table1
) t
WHERE date BETWEEN '2021-02-19' AND '2021-02-21';
or:
SELECT *
FROM (
SELECT DATE(date) AS date,
payment,
SUM(payment) OVER(ORDER BY id) AS balance
FROM table1
WHERE DATE(date) <= '2021-02-21'
) t
WHERE date >= '2021-02-19';
Results:
date       | payment | balance
-----------+---------+--------
2021-02-19 |     400 |    1000
2021-02-20 |     500 |    1500
2021-02-21 |     600 |    2100
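If you are on a pre-8.0 MySQL without window functions, the running total can be emulated with a user variable (a sketch; it assumes id reflects payment order, as in the sample data, and this idiom is deprecated in 8.0):
-- Running total built up in @bal, filtered only afterwards:
SELECT t.date, t.payment, t.balance
FROM (
  SELECT DATE(`date`) AS date, payment,
         @bal := @bal + payment AS balance
  FROM table1, (SELECT @bal := 0) init
  ORDER BY id
) t
WHERE t.date BETWEEN '2021-02-19' AND '2021-02-21';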

Getting values from the first row, last row, and an aggregation in MySQL Window function

For a marketing-related analysis I need to provide data on the first and last touchpoint and on the total number of interactions with our website.
A simplified version of our interaction table looks like this:
create table interaction (
id varchar(36) primary key,
session_id varchar(36) not null,
timestamp timestamp(3) not null,
utm_source varchar(255) null,
utm_medium varchar(255) null
)
Our current approach looks like this:
with interaction_ordered as (
select *,
row_number() over (partition by session_id order by timestamp asc) as row_num_asc,
row_number() over (partition by session_id order by timestamp desc) as row_num_desc
from interaction
)
select first_interaction.session_id as session_id,
first_interaction.timestamp as session_start,
timestampdiff(SECOND, first_interaction.timestamp, last_interaction.timestamp) as session_duration,
count(*) as interaction_count,
first_interaction.utm_source as first_touchpoint,
last_interaction.utm_source as last_touchpoint,
last_interaction.utm_medium as last_medium
from interaction_ordered as interaction
join interaction_ordered as first_interaction using (session_id)
join interaction_ordered as last_interaction using (session_id)
where first_interaction.row_num_asc = 1 and last_interaction.row_num_desc = 1
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day
Currently, we observe that the runtime scales approximately linearly with our data volume, which will soon become infeasible to compute.
An alternative idea is
select session_id,
min(timestamp) as session_start,
timestampdiff(
SECOND,
min(timestamp),
max(timestamp)
) as session_duration,
count(*) as interaction_count,
first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction
group by session_id
having session_start between ? - interval 1 day and ? + interval 1 day
but in our experiments the second query never completed. Hence, we are not 100% sure that it yields the same results.
We tried indices on timestamp and (session_id, timestamp), but according to EXPLAIN this didn't change the query plan.
Is there any fast way to retrieve individual properties from the first and last entry per session_id plus the count per session_id?
Note that in our real table there are more columns like utm_source and utm_medium that we are interested in.
EDIT
Sample data:
insert into interaction values
('a', 'session_1', '2020-06-15T12:00:00.000', 'search.com', 'search'),
('b', 'session_1', '2020-06-15T12:01:00.000', null, null),
('c', 'session_1', '2020-06-15T12:01:30.000', 'social.com', 'social'),
('d', 'session_1', '2020-06-15T12:02:00.250', 'ads.com', 'ads'),
('e', 'session_2', '2020-06-15T14:00:00.000', null, null),
('f', 'session_2', '2020-06-15T14:12:00.000', null, null),
('g', 'session_2', '2020-06-15T14:25:00.000', 'social.com', 'social'),
('h', 'session_3', '2020-06-16T12:05:00.000', 'ads.com', 'ads'),
('i', 'session_3', '2020-06-16T12:05:01.000', null, null),
('j', 'session_4', '2020-06-15T12:00:00.000', null, null),
('k', 'session_5', '2020-06-15T12:00:00.000', 'search.com', 'search');
Expected result:
session_id | session_start           | session_duration | interaction_count | first_touchpoint | last_touchpoint | last_medium
-----------+-------------------------+------------------+-------------------+------------------+-----------------+------------
session_1  | 2020-06-15T12:00:00.000 | 120              | 4                 | search.com       | ads.com         | ads
session_2  | 2020-06-15T14:00:00.000 | 1500             | 3                 | null             | social.com      | social
session_3  | 2020-06-16T12:05:00.000 | 1                | 2                 | ads.com          | null            | null
session_4  | 2020-06-15T12:00:00.000 | 0                | 1                 | null             | null            | null
session_5  | 2020-06-15T12:00:00.000 | 0                | 1                 | search.com       | search.com      | search
I noticed that my second query doesn't yield the expected result. The last_touchpoint and last_medium are filled with the first value instead.
I tried
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint, and
last_value(utm_source) over (partition by session_id order by timestamp range between unbounded preceding and unbounded following) as last_touchpoint,
The only way you are going to make the query scalable is by reducing the amount of data being processed using a where clause. If I assume that sessions never last more than a day, then I can expand the timeframe for the calculation by a day and use window functions. That results in something like this:
select s.*
from (select i.*,
min(timestamp) over (partition by session_id) as session_start,
count(*) over (partition by session_id) as interaction_count,
first_value(utm_source) over (partition by session_id order by timestamp) as first_touchpoint,
first_value(utm_source) over (partition by session_id order by timestamp desc) as last_touchpoint,
first_value(utm_medium) over (partition by session_id order by timestamp desc) as last_medium
from interaction i
where timestamp between ? - interval 2 day and ? + interval 2 day
) s
where timestamp = session_start and
session_start between ? - interval 1 day and ? + interval 1 day;
Your use of first_value() should be returning an error -- it violates the rules of "full group by" which MySQL 8+ has set by default. No surprise that syntactically incorrect code is not working.
WITH cte AS ( SELECT *,
FIRST_VALUE(utm_source) OVER (PARTITION BY session_id ORDER BY `timestamp` ASC) first_touchpoint,
FIRST_VALUE(utm_source) OVER (PARTITION BY session_id ORDER BY `timestamp` DESC) last_touchpoint,
FIRST_VALUE(utm_medium) OVER (PARTITION BY session_id ORDER BY `timestamp` DESC) last_medium
FROM interaction
)
SELECT session_id,
MIN(`timestamp`) session_start,
TIMESTAMPDIFF(SECOND, MIN(`timestamp`), MAX(`timestamp`)) session_duration,
COUNT(*) interaction_count,
ANY_VALUE( first_touchpoint ) first_touchpoint,
ANY_VALUE( last_touchpoint ) last_touchpoint,
ANY_VALUE( last_medium ) last_medium
FROM cte
GROUP BY session_id;
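As an aside on the asker's last_value attempt: last_value only sees the whole partition when given an explicit frame, because the default frame with ORDER BY stops at the current row. A minimal sketch of the corrected expression:
-- Explicit frame so LAST_VALUE reads the final row of the partition:
LAST_VALUE(utm_source) OVER (
    PARTITION BY session_id
    ORDER BY `timestamp`
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_touchpoint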

Possible to improve the performance of this SQL query?

I have a table that has over 100,000,000 rows and I have a query that looks like this:
SELECT
COUNT(IF(created_at >= '2015-07-01 00:00:00', 1, null)) AS 'monthly',
COUNT(IF(created_at >= '2015-07-26 00:00:00', 1, null)) AS 'weekly',
COUNT(IF(created_at >= '2015-06-30 07:57:56', 1, null)) AS '30day',
COUNT(IF(created_at >= '2015-07-29 17:03:44', 1, null)) AS 'recent'
FROM
items
WHERE
user_id = 123456;
The table looks like so:
CREATE TABLE `items` (
`user_id` int(11) NOT NULL,
`item_id` int(11) NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`user_id`,`item_id`),
KEY `user_id` (`user_id`,`created_at`),
KEY `created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The explain looks fairly harmless, minus the massive row counts:
1 SIMPLE items ref PRIMARY,user_id user_id 4 const 559864 Using index
I use the query to gather counts for a specific user for 4 segments of time.
Is there a smarter/faster way to obtain the same data or is my only option to tally these as new rows are put into this table?
If you have an index on created_at, I would also add created_at >= '2015-06-30 07:57:56' to the WHERE clause, since that is the lowest date among your segments.
Also, with the same index, it might help to split this into 4 queries:
select count(*) AS '30day'
FROM
items
WHERE
user_id = 123456
and created_at >= '2015-06-30 07:57:56'
union ....
And so on
I would add an index on created_at field:
ALTER TABLE items ADD INDEX idx_created_at (created_at)
or (as Thomas suggested), since you are also filtering on user_id, a composite index on user_id and created_at:
ALTER TABLE items ADD INDEX idx_user_created_at (user_id, created_at)
and then I would write your query as:
SELECT 'monthly' as description, COUNT(*) AS cnt FROM items
WHERE created_at >= '2015-07-01 00:00:00' AND user_id = 123456
UNION ALL
SELECT 'weekly' as description, COUNT(*) AS cnt FROM items
WHERE created_at >= '2015-07-26 00:00:00' AND user_id = 123456
UNION ALL
SELECT '30day' as description, COUNT(*) AS cnt FROM items
WHERE created_at >= '2015-06-30 07:57:56' AND user_id = 123456
UNION ALL
SELECT 'recent' as description, COUNT(*) AS cnt FROM items
WHERE created_at >= '2015-07-29 17:03:44' AND user_id = 123456
Yes, the output format is a little different. Or you can use inline queries:
SELECT
(SELECT COUNT(*) FROM items WHERE created_at>=... AND user_id=...) AS 'monthly',
(SELECT COUNT(*) FROM items WHERE created_at>=... AND user_id=...) AS 'weekly',
...
and if you want ratios, you could use a subquery:
SELECT
monthly,
weekly,
monthly / total,
weekly / total
FROM (
SELECT
(SELECT COUNT(*) FROM items WHERE created_at>=... AND user_id=...) AS 'monthly',
(SELECT COUNT(*) FROM items WHERE created_at>=... AND user_id=...) AS 'weekly',
...,
(SELECT COUNT(*) FROM items WHERE user_id=...) AS total
) s
INDEX(user_id, created_at) -- optimal
AND created_at >= '2015-06-30 07:57:56' -- helps because it cuts down on the number of index entries to touch
Doing a UNION does not help, since it leads to 4 times as much work.
Doing four subquery SELECTs does not help, for the same reason.
Also
COUNT(IF(created_at >= '2015-07-29 17:03:44', 1, null))
can be shortened to
SUM(created_at >= '2015-07-29 17:03:44')
(But it probably does not speed things up much.)
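Putting those pieces together, the whole query might look like this (a sketch using the question's own dates and the SUM() shorthand; the existing KEY user_id (user_id, created_at) already matches the suggested index):
SELECT SUM(created_at >= '2015-07-01 00:00:00') AS monthly,
       SUM(created_at >= '2015-07-26 00:00:00') AS weekly,
       SUM(created_at >= '2015-06-30 07:57:56') AS `30day`,
       SUM(created_at >= '2015-07-29 17:03:44') AS recent
FROM items
WHERE user_id = 123456
  AND created_at >= '2015-06-30 07:57:56';  -- earliest boundary; trims the index range scanned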
If the data does not change over time (only new rows are added), then summary tables of past data would give a significant speedup, but only if you can avoid boundaries like '07:57:56' for '30day'. (Why do only some of the cutoffs fall on '00:00:00'?) Perhaps the speedup would be another factor of 10 on top of the other changes. Want to discuss further?
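A sketch of such a summary table (all names here are hypothetical, not from the question):
-- Hypothetical daily rollup per user:
CREATE TABLE items_daily (
  user_id INT NOT NULL,
  day DATE NOT NULL,
  cnt INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, day)
);
-- Maintained as rows arrive (or by a nightly batch):
INSERT INTO items_daily (user_id, day, cnt)
VALUES (123456, CURRENT_DATE, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
-- Whole-day segments then read the small rollup instead of 100M rows:
SELECT SUM(cnt) FROM items_daily
WHERE user_id = 123456 AND day >= '2015-07-01';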
(I do not see any advantage in using PARTITION.)

mysql query using aggregate function to select field values

MySQL 5.5.29
Here is a MySQL query I am working on, without success:
SELECT ID, Bike,
(SELECT IF( MIN( ABS( DATEDIFF( '2011-1-1', Reading_Date ) ) ) = ABS( DATEDIFF( '2011-1-1', Reading_Date ) ) , Reading_Date, NULL ) FROM odometer WHERE Bike=10 ) AS StartDate,
(SELECT IF( MIN( ABS( DATEDIFF( '2011-1-1', Reading_Date ) ) ) = ABS( DATEDIFF( '2011-1-1', Reading_Date ) ) , Miles, NULL ) FROM odometer WHERE Bike=10 ) AS BeginMiles,
(SELECT IF( MIN( ABS( DATEDIFF( '2012-1-1', Reading_Date ) ) ) = ABS( DATEDIFF( '2012-1-1', Reading_Date ) ) , Reading_Date, NULL ) FROM odometer WHERE Bike=10 ) AS EndDate,
(SELECT IF( MIN( ABS( DATEDIFF( '2012-1-1', Reading_Date ) ) ) = ABS( DATEDIFF( '2012-1-1', Reading_Date ) ) , Miles, NULL ) FROM odometer WHERE Bike=10 ) AS EndMiles
FROM `odometer`
WHERE Bike =10;
And the result is:
ID | Bike | StartDate  | BeginMiles | EndDate | EndMiles
14 | 10   | 2011-04-15 | 27.0       | NULL    | NULL
15 | 10   | 2011-04-15 | 27.0       | NULL    | NULL
16 | 10   | 2011-04-15 | 27.0       | NULL    | NULL
Motorcycle owners enter odometer readings once a year, at or near January 1. I want to calculate the total mileage by motorcycle for each year.
Here is what the data in the table odometer looks like (screenshot from bmwmcindy.org omitted; the INSERT statements below contain the same data):
So to calculate the mileage for this bike for 2011, I need to determine which of these records is closest to Jan. 1, 2011, and that is record 14. The starting mileage would be 27. I then need to find the record closest to Jan. 1, 2012, and that is record 15. The ending mileage for 2011 is 10657 (which will also be the starting odometer reading when 2012 is calculated).
Here is the table:
DROP TABLE IF EXISTS `odometer`;
CREATE TABLE IF NOT EXISTS `odometer` (
`ID` int(3) NOT NULL AUTO_INCREMENT,
`Bike` int(3) NOT NULL,
`is_MOA` tinyint(1) NOT NULL,
`Reading_Date` date NOT NULL,
`Miles` decimal(8,1) NOT NULL,
PRIMARY KEY (`ID`),
KEY `Bike` (`Bike`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=22 ;
Data for table odometer:
INSERT INTO `odometer` (`ID`, `Bike`, `is_MOA`, `Reading_Date`, `Miles`) VALUES
(1, 1, 0, '2012-01-01', 5999.0),
(2, 6, 0, '2013-02-01', 14000.0),
(3, 7, 0, '2013-03-01', 53000.2),
(6, 1, 1, '2012-04-30', 10001.0),
(7, 1, 0, '2013-01-04', 31000.0),
(14, 10, 0, '2011-04-15', 27.0),
(15, 10, 0, '2011-12-31', 10657.0),
(16, 10, 0, '2012-12-31', 20731.0),
(19, 1, 1, '2012-09-30', 20000.0),
(20, 6, 0, '2011-12-31', 7000.0),
(21, 7, 0, '2012-01-03', 23000.0);
I am trying to get dates and miles from different records so that I can subtract the beginning miles from the ending miles to get total miles for a particular bike (in the example, Bike = 10) for a particular year (in this case, 2011).
I have read quite a bit about aggregate functions and the problem of getting values from the correct record. I thought the answer was somehow in subqueries. But when I try the query above, I get data from only the first record. In this case the ending miles should come from the second record.
I hope someone can point me in the right direction.
Miles should be steadily increasing. It would be nice if something like this worked:
select year(Reading_Date) as yr,
max(miles) - min(miles) as MilesInYear
from odometer o
where bike = 10
group by year(reading_date)
Alas, your logic is really much harder than you think. This would be easier in a database such as SQL Server 2012 or Oracle that has the lead and lag functions.
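(For what it's worth, on MySQL 8+, which does have them, LEAD() pairs each reading with the next one per bike; a hedged sketch of that building block only, not of the full closest-to-Jan-1 logic:)
-- Each reading alongside the next reading for the same bike:
select Bike, Reading_Date, Miles,
       lead(Reading_Date) over (partition by Bike order by Reading_Date) as next_date,
       lead(Miles)        over (partition by Bike order by Reading_Date) as next_miles
from odometer;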
My approach is to find the first and last reading dates for each year. You can calculate this using a correlated subquery:
select o.*,
       (select max(o2.Reading_Date)
        from odometer o2
        where o2.Bike = o.Bike
          and o2.Reading_Date <= makedate(year(o.Reading_Date), 1)  -- Jan 1 of this reading's year
       ) as ReadDateForYear
from odometer o
Next, summarize this at the bike and year level. If there is no read date on or before the beginning of the year, use the first date:
select Bike, year(Reading_Date) as yr,
       coalesce(min(ReadDateForYear), min(Reading_Date)) as FirstReadDate,
       coalesce(min(ReadDateForNextYear), max(Reading_Date)) as LastReadDate
from (select o.*,
             (select max(o2.Reading_Date)
              from odometer o2
              where o2.Bike = o.Bike
                and o2.Reading_Date <= makedate(year(o.Reading_Date), 1)
             ) as ReadDateForYear,
             (select max(o2.Reading_Date)
              from odometer o2
              where o2.Bike = o.Bike
                and o2.Reading_Date <= makedate(year(o.Reading_Date) + 1, 1)
             ) as ReadDateForNextYear
      from odometer o
     ) o
group by Bike, year(Reading_Date)
Let me call this query <q>. To get the final results, you need something like:
select <the fields you need>
from <q> q join
     odometer s
     on s.Bike = q.Bike and year(s.Reading_Date) = q.yr join
     odometer e
     on e.Bike = q.Bike and year(e.Reading_Date) = q.yr
Note: this SQL is untested. I'm sure there are syntax errors.