Compute an average number of transactions per user in a readable manner - mysql

I have always struggled with these types of queries, so I'd like someone to check my approach to handling them.
I am asked to find how many transactions, on average, each user executes during a 12-hour timespan starting from their first transaction.
This is the data:
CREATE TABLE IF NOT EXISTS `transactions` (
  `transactions_ts` timestamp,
  `user_id` int(6) unsigned NOT NULL,
  `transaction_id` bigint NOT NULL,
  `item` varchar(200),
  PRIMARY KEY (`transaction_id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`,`item` ) VALUES
('2016-06-18 13:46:51.0', 13811335,1322361417, 'glove'),
('2016-06-18 17:29:25.0', 13811335,3729362318, 'hat'),
('2016-06-18 23:07:12.0', 13811335,1322363995,'vase' ),
('2016-06-19 07:14:56.0',13811335,7482365143, 'cup'),
('2016-06-19 21:59:40.0',13811335,1322369619,'mirror' ),
('2016-06-17 12:39:46.0',3378024101,9322351612, 'dress'),
('2016-06-17 20:22:17.0',3378024101,9322353031,'vase' ),
('2016-06-20 11:29:02.0',3378024101,6928364072,'tie'),
('2016-06-20 18:59:48.0',13811335,1322375547, 'mirror');
My approach is the following (with the steps and the query itself below):
1) For each distinct user_id, find the timestamp of their first transaction and the timestamp 12 hours after it. This is accomplished by the inner query aliased as t1.
2) Then, by an inner join to the second inner query (t2), I augment each row of the transactions table with the two values from step 1, first_trans and right_trans.
3) With the where condition, I keep only those transactions whose timestamps fall within the interval between first_trans and right_trans.
4) The filtered table from step 3 is then aggregated as a count of distinct transaction ids per user.
5) The result of the four steps above is a table where each user has a count of the transactions falling within 12 hours of their first timestamp. I wrap it in another select that sums the users' transaction counts and divides by the number of users, giving the average count per user.
I am fairly certain that the end result is correct, but I keep thinking I might do without the fourth select. Or perhaps the whole query is somewhat clumsy; my aim was to make it as readable as possible, not necessarily computationally optimal.
select sum(dist_ts)/count(*) as avg_ts_per_user
from (
    select count(distinct transaction_id) as dist_ts, us_id
    from (
        select user_id as us_id,
               min(transactions_ts) as first_trans,
               min(transactions_ts) + interval 12 hour as right_trans
        from transactions
        group by us_id
    ) as t1
    inner join (select * from transactions) as t2
        on t1.us_id = t2.user_id
    where transactions_ts >= first_trans
      and transactions_ts < right_trans
    group by us_id
) as t3
Fiddle demo

I don't think there is a mistake per se. The code can be slightly simplified and neatened up a bit, as follows:
select sum(dist_ts)/count(*) as avg_ts_per_user
from (
    select count(distinct transaction_id) as dist_ts, us_id
    from (
        select user_id as us_id,
               min(transactions_ts) as first_trans,
               min(transactions_ts) + interval 12 hour as right_trans
        from transactions
        group by us_id
    ) as t1
    inner join transactions as t2
        on t1.us_id = t2.user_id
       and transactions_ts >= first_trans
       and transactions_ts < right_trans
    group by us_id
) as t3
The (select * from transactions) as t2 was simplified above, and I somewhat arbitrarily moved a where-clause condition to the on clause of the inner join.
My Fiddle Demo
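As an aside on whether the fourth select can go: the outer select itself is still needed, but its sum(dist_ts)/count(*) can be written as a single avg(dist_ts), since the inner group by yields exactly one non-NULL count per user. A sketch of that variant (same inner query as above):
select avg(dist_ts) as avg_ts_per_user
from (
    select count(distinct transaction_id) as dist_ts, us_id
    from (
        select user_id as us_id,
               min(transactions_ts) as first_trans,
               min(transactions_ts) + interval 12 hour as right_trans
        from transactions
        group by us_id
    ) as t1
    inner join transactions as t2
        on t1.us_id = t2.user_id
       and transactions_ts >= first_trans
       and transactions_ts < right_trans
    group by us_id
) as t3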
Here is a second way that does not use inner joins:
select sum(cnt)/count(*) as avg_ts_per_user
from (
    select count(*) as cnt, t.user_id
    from transactions t
    where t.transactions_ts >= (select min(transactions_ts) from transactions where user_id = t.user_id)
      and t.transactions_ts < (select min(transactions_ts) + interval 12 hour from transactions where user_id = t.user_id)
    group by t.user_id
) sq
Another Fiddle
You should probably run EXPLAIN against the two queries to see which one runs better on your server. Also note that min(transactions_ts) is computed twice for each user. Is MySQL able to avoid the redundant calculation? I don't know. One possibility would be to create a temporary table consisting of user_id and the minimum transactions_ts, so that the value is computed once. This would only make sense if your table had lots of rows, and maybe not even then.
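For illustration, a sketch of that temporary-table idea (untested; the names first_ts and min_transaction_ts are mine, not from the question):
-- compute each user's first transaction exactly once
create temporary table first_ts as
select user_id, min(transactions_ts) as min_transaction_ts
from transactions
group by user_id;
-- reuse the precomputed minimum for both interval bounds
select sum(cnt)/count(*) as avg_ts_per_user
from (
    select count(*) as cnt, t.user_id
    from transactions t
    join first_ts f on f.user_id = t.user_id
    where t.transactions_ts >= f.min_transaction_ts
      and t.transactions_ts < f.min_transaction_ts + interval 12 hour
    group by t.user_id
) sq;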

Related

How to use MYSQL to include missing rows in final table with default 0 values for all columns?

This question is part of a bigger MySQL query I have. I have a table of 'playerIds', 'dates', 'scores' and 'problems' (Table T0 in the image attached). I am running a SQL query on it to get the most recent row for each player where the 'date' is <= (2020-08-14 - 7 days). Not all players have a row with a date that satisfies that condition, so naturally those playerId rows will not appear in the resulting table (Table T1 in the pic).
Now what I want to do is to include those missing rows with 0 values for 'score' and 'problems' in the resulting table (See Table T2 in the pic). I am totally at a loss as to how to go about it since I am very new to SQL queries.
Here's the part of the SQL query which is producing Table T1 from T0, but I want to modify it such that it produces Table T2 from T0:
select *
from (
select *, row_number() over (partition by playerId order by date desc) as ranking
from player
where date<=date_add(date('2020-08-14'),interval -7 day)
) t
where t.ranking = 1
One option uses a subquery to list all the players, and then brings in your current resultset with a left join:
select p.playerId, t.date, coalesce(t.score, 0) score, coalesce(t.problem, 0) problem
from (select distinct playerId from player) p
left join (
select p.*, row_number() over (partition by playerId order by date desc) as rn
from player p
where date <= '2020-08-14' - interval 7 day
) t on t.playerId = p.playerId and t.rn = 1
If you have a referential table for all players, you can just replace the select distinct subquery with that table.
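For example, with a hypothetical players table holding one row per playerId, the select distinct subquery drops out (a sketch under that assumption):
select p.playerId, t.date, coalesce(t.score, 0) score, coalesce(t.problem, 0) problem
from players p  -- hypothetical reference table, one row per playerId
left join (
    select pl.*, row_number() over (partition by playerId order by date desc) as rn
    from player pl
    where date <= '2020-08-14' - interval 7 day
) t on t.playerId = p.playerId and t.rn = 1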

Writing SQL with timestamps

The data
CREATE TABLE IF NOT EXISTS `transactions` (
  `transactions_ts` timestamp,
  `user_id` int(6) unsigned NOT NULL,
  `transaction_id` bigint,
  `item` varchar(200),
  PRIMARY KEY (`transaction_id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`,`item` ) VALUES
('2016-06-18 13:46:51.0', 13811335,1322361417, 'glove'),
('2016-06-18 17:29:25.0', 13811335,3729362318, 'hat'),
('2016-06-18 23:07:12.0', 13811335,1322363995,'vase' ),
('2016-06-19 07:14:56.0',13811335,7482365143, 'cup'),
('2016-06-19 21:59:40.0',13811335,1322369619,'mirror' ),
('2016-06-17 12:39:46.0',3378024101,9322351612, 'dress'),
('2016-06-17 20:22:17.0',3378024101,9322353031,'vase' ),
('2016-06-20 11:29:02.0',3378024101,6928364072,'tie'),
('2016-06-20 18:59:48.0',13811335,1322375547, 'mirror');
The question: for each user, show the first item that they ordered (first by time). I treat the time as a whole timestamp (not time and date separately).
My attempt
select
min(transactions_ts) as first_trans,
user_id, item
from transactions
group by user_id
order by first_trans;
I am sorry if this is a simple question, but one person tells me that my query is entirely wrong, and I have no other means to test this claim of his.
demo fiddle
This is a little bit more complicated than you thought.
To start with: "for each user" would translate to GROUP BY user_id, not to GROUP BY user_id, item.
But with GROUP BY user_id, you'd need an aggregation function saying "the item for the minimum transactions_ts". MySQL doesn't feature such an aggregation function.
The obvious solution is to make this two steps:
Find the first transaction per user
Show the items for these transactions
The query:
select *
from transactions
where (user_id, transactions_ts) in
(
select user_id, min(transactions_ts)
from transactions
group by user_id
);
Another way to word the task is: "Give me the transactions for which no older transaction for the same user exists".
The query:
select *
from transactions t
where not exists
(
select *
from transactions t2
where t2.user_id = t.user_id
and t2.transactions_ts < t.transactions_ts
);
If you are using MySQL 8.0, the window function ROW_NUMBER() can be used to address your use case, as follows:
SELECT transactions_ts, user_id, item
FROM (
SELECT
transactions_ts,
user_id,
item,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY transactions_ts) rn
FROM transactions
) x WHERE rn = 1
The inner query ranks each record by ascending timestamp within groups of records having the same user_id. The outer query keeps only the first transaction of each user.
Demo on DB Fiddle:
transactions_ts | user_id | item
:------------------ | ---------: | :----
2016-06-18 13:46:51 | 13811335 | glove
2016-06-17 12:39:46 | 3378024101 | dress
You can do it using a subquery to get the first transaction_ts for each user:
select user_id, item, transactions_ts
from transactions a
where transactions_ts=(select min(transactions_ts)
from transactions b
where b.user_id=a.user_id)
So you get:
1) In the inner query, the first transaction time for each user
2) In the outer query, the row whose time matches the one from point 1

SQL - Keep only the first and last record of each day

I have a table that stores simple log data:
CREATE TABLE chronicle (
id INT auto_increment PRIMARY KEY,
data1 VARCHAR(256),
data2 VARCHAR(256),
time DATETIME
);
The table is approaching 1m records, so I'd like to start consolidating data.
I want to be able to take the first and last record of each DISTINCT(data1, data2) each day and delete all the rest.
I know how to just pull in the data and process it in whatever language I want, then delete the records with a huge IN (...) query, but it seems like a better alternative would be to use SQL directly (am I wrong?).
I have tried several queries, but I'm not very good with SQL beyond JOINs.
Here is what I have so far:
SELECT id, Max(time), Min(time)
FROM (SELECT id, data1 ,data2, time, Cast(time AS DATE) AS day
FROM chronicle) AS initial
GROUP BY day;
This gets me the first and last time for each day, but it's not separated out by the data (i.e. I get the last record of each day, not the last record for each distinct set of data for each day.) Additionally, the id is just for the Min(time).
The information I've found on this particular problem covers only finding the last record of the day, not the last record for each set of data.
IMPORTANT: I want the first/last record for each DISTINCT(data1, data2) for each day, not just the first/last record for each day in the table. There will be more than 2 records for each day.
Solution:
My solution thanks to Jonathan Dahan and Gordon Linoff:
SELECT o.data1, o.data2, o.time
FROM chronicle AS o
JOIN (
    SELECT Min(id) AS id FROM chronicle GROUP BY DATE(time), data1, data2
    UNION
    SELECT Max(id) AS id FROM chronicle GROUP BY DATE(time), data1, data2
) AS n ON o.id = n.id;
From here it's a simple matter of referencing the same table to delete rows.
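For what it's worth, a sketch of that delete as a single anti-join against the same derived table of ids to keep (untested against real data):
delete o
from chronicle as o
left join (
    select min(id) as id from chronicle group by date(time), data1, data2
    union
    select max(id) as id from chronicle group by date(time), data1, data2
) as keepers on o.id = keepers.id
-- remove every row that is neither the first nor the last of its (day, data1, data2) group
where keepers.id is null;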
This will improve performance when searching on dates:
ALTER TABLE chronicle
ADD INDEX `ix_chronicle_time` (`time` ASC);
This will delete the records:
CREATE TEMPORARY TABLE tmp_ids (
    `id` INT NOT NULL,
    PRIMARY KEY (`id`)
);
INSERT INTO tmp_ids (id)
SELECT Min(id)
FROM chronicle
GROUP BY CAST(time AS DATE), data1, data2
UNION
SELECT Max(id)
FROM chronicle
GROUP BY CAST(time AS DATE), data1, data2;
DELETE FROM chronicle
WHERE id NOT IN (SELECT id FROM tmp_ids)
  AND time <= '2015-01-01'; -- if you want to consider all dates, then remove this condition
You have the right idea. You just need to join back to get the original information.
SELECT c.*
FROM chronicle c JOIN
(SELECT date(time) as day, min(time) as mint, max(time) as maxt
FROM chronicle
GROUP BY date(time)
) cc
ON c.time IN (cc.mint, cc.maxt);
Note that the join condition doesn't need to include day explicitly because it is part of the time. Of course, you could add date(c.time) = cc.day if you wanted to.
Instead of deleting rows in your original table, I would suggest that you make a new table. Something like this:
create table ChronicleByDay like chronicle;
insert into ChronicleByDay
SELECT c.*
FROM chronicle c JOIN
(SELECT date(time) as day, min(time) as mint, max(time) as maxt
FROM chronicle
GROUP BY date(time)
) cc
ON c.time IN (cc.mint, cc.maxt);
That way, you can have the more detailed information if you ever need it.

What query will select rows for months that are more than a certain percent different from the previous month?

I have a table representing a car traveling a route, broken up into segments:
CREATE TABLE travels (
id int auto_increment primary key,
segmentID int,
month int,
year int,
avgSpeed int
);
-- sample data
INSERT INTO travels (segmentID, month, year, avgSpeed)
VALUES
(15,1,2014,80),
(15,1,2014,84),
(15,1,2014,82),
(15,2,2014,70),
(15,2,2014,68),
(15,2,2014,66);
The above schema and sample data is also available as a fiddle.
What query will identify segment IDs where average driving speed decreased by more than 10% compared to the previous month?
Here is my solution:
Sqlfiddle demo
The key is keeping track of the relation between a month and the previous one, so I compute year*100+month and, after grouping by segment and that mark, check for a difference of 1 (consecutive months within a year) or 89 (December to January, e.g. 201501 - 201412 = 89).
It is also a pity that MySQL does not support CTEs, which makes the query ugly with derived tables.
Code:
select s.month, s.speed, m.month as prevmonth, m.speed as sp,
       100 - s.speed/m.speed*100 as speeddiff
from (
    select segmentid, month, year*100+month as mark, avg(avgSpeed) as speed
    from travels
    group by segmentid, month, year*100+month
) as s,
(
    select segmentid, month, year*100+month as mark, avg(avgSpeed) as speed
    from travels
    group by segmentid, month, year*100+month
) as m
where s.segmentid = m.segmentid
  and (s.mark = m.mark + 1 or s.mark = m.mark + 89)
  and (m.speed - (m.speed/10)) > s.speed;
CTE code working on every DB except MySQL
with t as (
    select segmentid, month, year*100+month as mark, avg(avgSpeed) as speed
    from travels
    group by segmentid, month, year*100+month
)
select s.month, s.speed, m.month as prevmonth, m.speed as sp,
       100 - s.speed/m.speed*100 as speeddiff
from t s
inner join t m on s.segmentid = m.segmentid
    and (s.mark = m.mark + 1 or s.mark = m.mark + 89)
where (m.speed - (m.speed/10)) > s.speed;
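Side note, not part of the original answer: MySQL 8.0 added both CTEs and window functions, so the same idea can be sketched with lag() (my adaptation, untested on the fiddle):
with monthly as (
    select segmentID, year*100 + month as mark, avg(avgSpeed) as speed
    from travels
    group by segmentID, year*100 + month
)
select segmentID, mark, speed, prev_speed,
       100 - speed/prev_speed*100 as speeddiff
from (
    select m.*,
           lag(mark)  over (partition by segmentID order by mark) as prev_mark,
           lag(speed) over (partition by segmentID order by mark) as prev_speed
    from monthly m
) x
where mark - prev_mark in (1, 89)  -- consecutive months, including December to January
  and speed < prev_speed * 0.9;    -- more than a 10% drop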
You need to select each month, join the next month (which is a little convoluted given your table structure), and compute the decrease (or increase). Try the following query:
SELECT t1.segmentID, t1.month, t1.year,
       AVG(t1.avgSpeed) as avgSpeed1,
       AVG(t2.avgSpeed) as avgSpeed2,
       1 - (AVG(t1.avgSpeed)/AVG(t2.avgSpeed)) as decrease
FROM travels t1
LEFT JOIN travels t2
    ON CONCAT(t2.year,'-',LPAD(t2.month,2,'00'),'-',LPAD(1,2,'00')) =
       DATE_ADD(CONCAT(t1.year,'-',LPAD(t1.month,2,'00'),'-',LPAD(1,2,'00')), INTERVAL -1 MONTH)
GROUP BY segmentID, month, year
HAVING avgSpeed1/avgSpeed2 < .9
Here is the updated SQLFiddle - http://sqlfiddle.com/#!2/183c1/25
This requires a self join. This answer will get you started. You can work on the details.
select somefields
from yourtable t1 join yourtable t2 on t1.something = t2.something
where t1.month = whatever
and t2.month = t1.month + 1
and t2.speed <= t1.speed * .9

SELECT * FROM table while condition=true?

I want to select rows from a table while a condition is true:
SELECT * FROM (SELECT * FROM`table1` `t1` ORDER BY t1.date) `t2` WHILE t2.id!=5
When the while condition becomes false, it should stop selecting any further rows.
Please help me; I have searched a lot and found many similar questions on Stack Overflow, but I can't get this to work.
Please don't tell me about where; I want a solution in SQL, not in PHP or anything else.
OK, the real problem is here:
SELECT *,(SELECT SUM(t2.amount) FROM (select * from transaction as t1 order by t1.date) `t2`) as total_per_transition FROM transaction
Here I want to calculate the total balance as of each transaction.
First, find the first date where the condition fails, i.e. where id = 5:
SELECT date
FROM table1
WHERE id = 5
ORDER BY date
LIMIT 1
Then make the above a derived table (call it lim) and join it to the original table to get all rows with earlier dates, i.e. t.date < lim.date:
SELECT t.*
FROM table1 AS t
JOIN
( SELECT date
FROM table1
WHERE id = 5
ORDER BY date
LIMIT 1
) AS lim
ON t.date < COALESCE(lim.date, '9999-12-31') ;
The COALESCE() is for the case when there are no rows at all with id = 5; in that case we want all rows from the table.
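As for the running balance the question is really after: assuming the transaction table has date and amount columns (as the question's own query suggests), a window-function sketch for MySQL 8.0+ would be:
select t.*,
       sum(t.amount) over (order by t.date) as total_per_transaction
from transaction t;
If dates can tie, add a unique column to the order by; on versions before 8.0, a correlated subquery summing the amounts of rows with date <= t.date achieves the same.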