Accelerate large mysql join - mysql

I am writing a SQL query to list each day's active users along with the date each user first appears in the log table. The MySQL version is 5.7.
Like:
date active_users reg_date
2020-03-1 user1 2019-02-01
2020-03-1 user2 2019-03-04
2020-03-2 user3 2019-01-18
2020-03-2 user1 2019-02-01
I have written a query that achieves this, but as shown, I run 2 aggregations over the same table and then join them together. The login log table game_user_log contains 2 million rows, and I have added an index on column data_date and data_date, but my query takes about 1 minute.
Is there any way to optimize and accelerate the query? Any help is appreciated.
This is my query:
SELECT a.data_date, a.user_id, b.reg_date
-- List every day and de-duplicated users
from ( SELECT distinct data_date, user_id
from `game_user_log`) a
-- Get the first login date as reg_date
left outer join ( SELECT user_id, min(data_date) reg_date
FROM `game_user_log`
GROUP BY user_id) b
on a.user_id=b.user_id

SELECT data_date,
user_id,
MIN(data_date) OVER (PARTITION BY user_id) reg_date
FROM game_user_log
GROUP BY data_date, user_id
?
PS. An index on (user_id, data_date) is needed to accelerate this.

I would write your query as:
select du.data_date, du.user_id, u.reg_date
from (select distinct data_date, user_id
from game_user_log
) du join
(select user_id, min(data_date) as reg_date
from game_user_log
group by user_id
) u
on du.user_id = u.user_id;
For this query, you can try an index on game_user_log(user_id, data_date).
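A minimal sketch of the query and the suggested index, assuming an in-memory SQLite database as a stand-in for MySQL so the example is self-contained. The sample rows and the index name idx_user_date are made up for illustration, and timings on a real 2-million-row MySQL table will differ.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_user_log (data_date TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO game_user_log VALUES (?, ?)",
    [
        ("2019-02-01", "user1"),
        ("2020-03-01", "user1"),
        ("2020-03-01", "user1"),  # duplicate login on the same day
        ("2020-03-02", "user1"),
        ("2019-03-04", "user2"),
        ("2020-03-01", "user2"),
    ],
)
# Composite index from the answer: user_id first, then data_date.
# The index name is hypothetical.
conn.execute("CREATE INDEX idx_user_date ON game_user_log (user_id, data_date)")

query = """
SELECT du.data_date, du.user_id, u.reg_date
FROM (SELECT DISTINCT data_date, user_id FROM game_user_log) du
JOIN (SELECT user_id, MIN(data_date) AS reg_date
      FROM game_user_log
      GROUP BY user_id) u
  ON du.user_id = u.user_id
ORDER BY du.data_date, du.user_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With user_id as the leading column, both the GROUP BY user_id aggregation and the DISTINCT can in principle be answered from the index alone. Whether MySQL actually does so is worth checking with EXPLAIN on the real table.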

Related

A mySQL query for returning customers that make multiple purchases and the specific time of the 5th purchase

Customers are considered "loyal" if they have purchased at least 5 times.
I am trying to build an SQL query which returns only "loyal" customers along with the day on
which they become "loyal" customers (the day of their 5th transaction).
user_id purchase_ts
f594fsae 2021-07-21
........ ............
Ideally the desired output would be as follows
loyal_user_id Loyal_Moment
f594fsae 2021-07-29
.............. ............
I tried creating a new table as follows:
SELECT user_id, purchase_ts
FROM Customers
WHERE user_id IN (
SELECT user_id
FROM Customers
GROUP BY user_id
HAVING COUNT (user_id) >=5
)
But I am having trouble, any suggestions?
On MySQL 8 you could use:
with cte as
( select *,
row_number() over (partition by user_id order by purchase_ts asc) row_num
from accounts
)
select user_id,purchase_ts
from cte
where row_num = 5;
Result:
user_id purchase_ts
f594fsae 2021-07-25
f632fsae 2021-07-25
MySQL has supported subqueries for a long time, so this technique works regardless of MySQL version. In this case, we can use a correlated subquery to get exactly the fifth purchase_ts by using the LIMIT [OFFSET] clause. The WHERE clause excludes those user_id values that don't have a fifth purchase_ts.
select distinct user_Id,
(select purchase_ts from purchase where user_id=p.user_id order by purchase_ts limit 4,1) as loyal_time
from purchase p
where (select purchase_ts from purchase where user_id=p.user_id order by purchase_ts limit 4,1) is not null;
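A small sketch of the correlated-subquery approach, assuming an in-memory SQLite database in place of MySQL and invented sample rows. It uses LIMIT 1 OFFSET 4, which is the same thing as LIMIT 4,1 written in the longer form that both engines accept.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample purchases are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (user_id TEXT, purchase_ts TEXT)")
rows_in = [("a", f"2021-07-{day:02d}") for day in range(21, 27)]   # 6 purchases
rows_in += [("b", f"2021-07-{day:02d}") for day in range(21, 24)]  # 3 purchases
conn.executemany("INSERT INTO purchase VALUES (?, ?)", rows_in)

# Skip the first four purchases of each user and take the next one (the fifth).
# Users with fewer than five purchases yield NULL and are filtered out.
query = """
SELECT DISTINCT p.user_id,
       (SELECT purchase_ts FROM purchase
        WHERE user_id = p.user_id
        ORDER BY purchase_ts LIMIT 1 OFFSET 4) AS loyal_time
FROM purchase p
WHERE (SELECT purchase_ts FROM purchase
       WHERE user_id = p.user_id
       ORDER BY purchase_ts LIMIT 1 OFFSET 4) IS NOT NULL
"""
loyal = conn.execute(query).fetchall()
print(loyal)
```

Here user a has six purchases and user b only three, so only user a is returned, with the date of the fifth purchase.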

SQL query for listing values based on a column

I have a table with the columns member_id, status and created_at (timestamp), and I want to extract the latest status for each member_id based on the timestamp value.
member_id status created_at
1 ON 1641862225
1 OFF 1641862272
2 OFF 1641862397
3 OFF 1641862401
3 ON 1641862402
So, my ideal query result would be like this:
member_id status created_at
1 OFF 1641862272
2 OFF 1641862397
3 ON 1641862402
My go-to approach for this kind of task is to assign a row number to each row within a partition, sorted as needed, and then keep row number 1.
In MySQL, this is only available starting with MySQL 8.
SELECT ROW_NUMBER() OVER(PARTITION BY member_id ORDER BY created_at DESC) as row_num,
member_id, status, created_at FROM table
This will generate something like this.
row_num member_id status created_at
1 1 OFF 1641862272
2 1 ON 1641862225
1 2 OFF 1641862397
1 3 ON 1641862402
2 3 OFF 1641862401
Then you use that as a sub query and get the rows where row_num = 1
SELECT member_id, status, created_at FROM (
SELECT ROW_NUMBER() OVER(PARTITION BY member_id ORDER BY created_at DESC) as row_num,
member_id, status, created_at FROM table
) a WHERE row_num = 1
MySQL has supported window functions since v8.0. The solution from crimson589 is preferred for v8+; this solution applies to earlier versions of MySQL, or if you need an alternative to window functions.
After grouping by member_id, we can join back to the original set to get the status value that corresponds to MAX(created_at):
SELECT ByMember.member_id
, status.status
, ByMember.created_at
FROM (
SELECT member_id, max(created_at) as created_at
FROM MemberStatus
GROUP BY member_id
) ByMember
JOIN MemberStatus status ON ByMember.member_id = status.member_id AND ByMember.created_at = status.created_at;
Or you could use a sub query instead of the join:
SELECT ByMember.member_id
, (SELECT status.status FROM MemberStatus status WHERE ByMember.member_id = status.member_id AND ByMember.created_at = status.created_at) as status
, ByMember.created_at
FROM (
SELECT member_id, max(created_at) as created_at
FROM MemberStatus
GROUP BY member_id
) ByMember
The JOIN based solution allows you to query additional columns from the original set instead of having multiple sub-queries. I would almost always advocate for the JOIN solution, but sometimes the sub-query is simpler to maintain.
I've setup a fiddle to compare these options: http://sqlfiddle.com/#!9/0edb931/11
You can group by member_id and max of created_at, then a self join with member_id and created_at will give you the latest status.
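The group-then-self-join approach can be sketched as follows, assuming an in-memory SQLite database in place of MySQL. The rows are the sample data from the question; the table name MemberStatus follows the answer above, and the aliases latest and s are illustrative.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; data copied from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MemberStatus (member_id INTEGER, status TEXT, created_at INTEGER)"
)
conn.executemany(
    "INSERT INTO MemberStatus VALUES (?, ?, ?)",
    [
        (1, "ON", 1641862225),
        (1, "OFF", 1641862272),
        (2, "OFF", 1641862397),
        (3, "OFF", 1641862401),
        (3, "ON", 1641862402),
    ],
)

# Step 1 (derived table): newest created_at per member.
# Step 2 (join): fetch the full row that carries that timestamp.
query = """
SELECT latest.member_id, s.status, latest.created_at
FROM (SELECT member_id, MAX(created_at) AS created_at
      FROM MemberStatus
      GROUP BY member_id) latest
JOIN MemberStatus s
  ON s.member_id = latest.member_id
 AND s.created_at = latest.created_at
ORDER BY latest.member_id
"""
latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```

If two rows for the same member share the same maximum created_at, the join returns both, so a tie-breaker column may be needed on real data.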

Writing SQL with timestamps

The data
CREATE TABLE IF NOT EXISTS `transactions` (
`transactions_ts` timestamp ,
`user_id` int(6) unsigned NOT NULL,
`transaction_id` bigint,
`item` varchar(200), PRIMARY KEY(`transaction_id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`,`item` ) VALUES
('2016-06-18 13:46:51.0', 13811335,1322361417, 'glove'),
('2016-06-18 17:29:25.0', 13811335,3729362318, 'hat'),
('2016-06-18 23:07:12.0', 13811335,1322363995,'vase' ),
('2016-06-19 07:14:56.0',13811335,7482365143, 'cup'),
('2016-06-19 21:59:40.0',13811335,1322369619,'mirror' ),
('2016-06-17 12:39:46.0',3378024101,9322351612, 'dress'),
('2016-06-17 20:22:17.0',3378024101,9322353031,'vase' ),
('2016-06-20 11:29:02.0',3378024101,6928364072,'tie'),
('2016-06-20 18:59:48.0',13811335,1322375547, 'mirror');
The question: for each user, show the first item that they ordered (first by time). I assume time as a whole timestamp (not time and date separately).
My attempt
select
min(transactions_ts) as first_trans,
user_id, item
from transactions
group by user_id
order by first_trans;
I am sorry if this is a simple question, but one person tells me that my query is entirely wrong, and I have no other means of testing his claim.
This is a little bit more complicated than you thought.
To start with: "for each user" would translate to GROUP BY user_id, not to GROUP BY user_id, item.
But with GROUP BY user_id, you'd need an aggregation function saying "the item for the minimum transactions_ts". MySQL doesn't feature such an aggregation function.
The obvious solution is to make this two steps:
1. Find the first transaction per user
2. Show the items for these transactions
The query:
select *
from transactions
where (user_id, transactions_ts) in
(
select user_id, min(transactions_ts)
from transactions
group by user_id
);
Another way to word the task is: "Give me the transactions for which no older transaction for the same user exists".
The query:
select *
from transactions t
where not exists
(
select *
from transactions t2
where t2.user_id = t.user_id
and t2.transactions_ts < t.transactions_ts
);
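Both formulations above can be compared side by side in a short sketch. It assumes an in-memory SQLite database in place of MySQL and uses a subset of the question's rows; the column list and ORDER BY are added only to make the two results directly comparable.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; rows are a subset of the question's data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transactions_ts TEXT, user_id INTEGER, "
    "transaction_id INTEGER PRIMARY KEY, item TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        ("2016-06-18 13:46:51", 13811335, 1322361417, "glove"),
        ("2016-06-18 17:29:25", 13811335, 3729362318, "hat"),
        ("2016-06-17 12:39:46", 3378024101, 9322351612, "dress"),
        ("2016-06-17 20:22:17", 3378024101, 9322353031, "vase"),
    ],
)

# Variant 1: match (user_id, timestamp) pairs against each user's minimum.
in_query = """
SELECT user_id, item, transactions_ts
FROM transactions
WHERE (user_id, transactions_ts) IN
      (SELECT user_id, MIN(transactions_ts)
       FROM transactions
       GROUP BY user_id)
ORDER BY user_id
"""

# Variant 2: keep rows for which no older row of the same user exists.
not_exists_query = """
SELECT t.user_id, t.item, t.transactions_ts
FROM transactions t
WHERE NOT EXISTS
      (SELECT 1
       FROM transactions t2
       WHERE t2.user_id = t.user_id
         AND t2.transactions_ts < t.transactions_ts)
ORDER BY t.user_id
"""
first_by_in = conn.execute(in_query).fetchall()
first_by_not_exists = conn.execute(not_exists_query).fetchall()
print(first_by_in)
print(first_by_not_exists)
```

On this data both variants return the glove for user 13811335 and the dress for user 3378024101.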
If you are using MySQL 8.0, the window function ROW_NUMBER() can be used to address your use case, as follows:
SELECT transactions_ts, user_id, item
FROM (
SELECT
transactions_ts,
user_id,
item,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY transactions_ts) rn
FROM transactions
) x WHERE rn = 1
The inner query ranks each record by ascending timestamp, within groups of records having the same user_id. The outer query filters in the first transaction of each customer.
Demo on DB Fiddle:
transactions_ts | user_id | item
:------------------ | ---------: | :----
2016-06-18 13:46:51 | 13811335 | glove
2016-06-17 12:39:46 | 3378024101 | dress
You can do it using a subquery to get the first transaction_ts for each user:
select user_id, item, transactions_ts
from transactions a
where transactions_ts=(select min(transactions_ts)
from transactions b
where b.user_id=a.user_id)
So you get:
1. In the inner query, the first transaction time for each user
2. In the outer query, the row that has the time found in point 1

SQL Group by day from timestamp with two tables

I have two tables with timestamp columns.
Table #1 contains clicks, timestamp and Table #2 contains userid, timestamp. I want the counts of clicks and users by date. For example:
Date clicks_count users_count
2015-07-24 10 15
2015-07-24 04 06
I think this SQL will be useful to you.
select a.date1,clicks_count,users_count from
(select date(Table1.timestamp)as date1, count(clicks) as clicks_count
from Table1
group by date(Table1.timestamp)) as a
join
(
select date(Table2.timestamp) date2, count(userid) as users_count
from Table2
group by date(Table2.timestamp)) b on a.date1 = b.date2
Thank you.
select date(timestamp),
sum(is_click) as clicks,
sum(is_click = 0) as user_count
from
(
select timestamp, 1 as is_click from table1
union all
select timestamp, 0 from table2
) tmp
group by date(timestamp)
You can select the timestamps from both tables together and add a calculated column that indicates which table each timestamp came from.
Then you take that subquery result, group by the date, and count the users and clicks.
sum(is_click = 0) counts how many times the timestamp came from the users table.
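A minimal sketch of the UNION ALL approach, assuming an in-memory SQLite database in place of MySQL. The table names table1 and table2 follow the answer, while the sample rows and the alias day are invented for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (clicks INTEGER, timestamp TEXT)")
conn.execute("CREATE TABLE table2 (userid INTEGER, timestamp TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1, "2015-07-24 09:00:00"),
        (1, "2015-07-24 18:30:00"),
        (1, "2015-07-25 10:00:00"),
    ],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [
        (101, "2015-07-24 08:00:00"),
        (102, "2015-07-24 12:00:00"),
        (103, "2015-07-24 23:59:59"),
        (101, "2015-07-25 07:00:00"),
    ],
)

# Tag each timestamp with its source table, then count per calendar day.
query = """
SELECT date(timestamp) AS day,
       SUM(is_click) AS clicks,
       SUM(is_click = 0) AS user_count
FROM (
    SELECT timestamp, 1 AS is_click FROM table1
    UNION ALL
    SELECT timestamp, 0 FROM table2
) tmp
GROUP BY day
ORDER BY day
"""
daily = conn.execute(query).fetchall()
print(daily)
```

Note that this counts rows in each table per day. It does not sum the clicks column or count distinct userid values, so it matches the question only if each row represents one click or one user.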

Mysql get latest row for status

I have a log table with several statuses. It logs the position of physical objects in an external system. I want to get the latest rows for a status for each distinct physical object.
I need a list of typeids and their quantity for each status, minus the quantity of typeids that have an entry for another status that is later than the row with the status we are looking for.
E.g. each status move is recorded, but nothing else.
Here's the problem, I don't have a distinct ID for each physical object. I can only calculate how many there are from the state of the log table.
I've tried
SELECT dl.id, dl.status
FROM `log` AS dl
INNER JOIN (
SELECT MAX( `date` ) , id
FROM `log`
GROUP BY id ORDER BY `date` DESC
) AS dl2
WHERE dl.id = dl2.id
but this would require a distinct type id to work.
My table has a primary key id, datetime, status, product type_id. There are four different statuses.
A product must pass through all statuses.
Example Data.
date typeid status id
2014-01-13 PF0180 shopfloor 71941
2014-01-13 ND0355 shopfloor 71940
2014-01-10 ND0355 machine 71938
2014-01-10 ND0355 machine 71937
2014-01-10 ND0282 machine 7193
When selecting results for the status shopfloor I would want
quantity typeid
1 ND0355
1 PF0180
When selecting for status machine I would want
quantity typeid
1 ND0282
1 ND0355
The order of the statuses shouldn't matter; it only matters if there is a later entry for the product.
If I understood you correctly, this will give you the desired output:
select
l1.typeid,
l1.status,
count(1) - (
select count(1)
from log l2
where l2.typeid = l1.typeid and
l2.date > l1.date
)
from log l1
group by l1.typeid, l1.status;
Check this SQL Fiddle
TYPEID STATUS TOTAL
-----------------------------
ND0282 machine 1
ND0355 machine 1
ND0355 shopfloor 1
PF0180 shopfloor 1
You need to get the greatest date per status, not per id. Then join to the log table where the status and date are the same.
SELECT dl.id, dl.status
FROM `log` AS dl
INNER JOIN (
SELECT status, MAX( `date` ) AS date
FROM `log`
GROUP BY status ORDER BY NULL
) AS dl2 USING (status, date);
It would be helpful to have an index on (status, date) on this table, which would allow the subquery to run as an index-only query.
Everton Agner originally posted this solution, but the reply seems to have disappeared so I'm adding it (with slight modifications)
select
l1.typeid,
l1.status,
count(1) - (
select count(1)
from log l2
where l2.typeid = l1.typeid and
l2.`date` > l1.`date`
AND l2.status != 'dieshop'
) as quant
from log l1
WHERE l1.status = 'dieshop'
group by l1.typeid;