Getting the purchase number based on user ID in MySQL - mysql

I am trying to get a fourth column that holds the purchase number for that user. I have this data:
user date purchase_id
a 01-01-2018 1
b 02-01-2018 2
a 02-01-2018 3
a 03-01-2018 4
b 04-01-2018 5
a 04-01-2018 6
and would like to get something like this:
user date purchase_id purchase_order
a 01-01-2018 1 1
b 02-01-2018 2 1
a 02-01-2018 3 2
a 03-01-2018 4 3
b 04-01-2018 5 2
a 04-01-2018 6 4
The final use of this is to build a cohort analysis to check user retention.
Thanks

You seem to be looking for ROW_NUMBER() (available in MySQL 8.0). This window function can be used to rank records within groups sharing the same user.
SELECT
user,
date,
purchase_id,
ROW_NUMBER() OVER(PARTITION BY user ORDER BY purchase_id ) purchase_order
FROM mytable
NB: it is unclear which column you want to use for ordering. It could be purchase_id (as shown in the above query), or maybe date: you can change the query as per your requirement.
Demo on DB Fiddle:
| user | date | purchase_id | purchase_order |
| ---- | ---------- | ----------- | -------------- |
| a | 2018-01-01 | 1 | 1 |
| a | 2018-01-02 | 3 | 2 |
| a | 2018-01-03 | 4 | 3 |
| a | 2018-01-04 | 6 | 4 |
| b | 2018-01-02 | 2 | 1 |
| b | 2018-01-04 | 5 | 2 |
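If date is the column you want to order by, a minimal sketch could look like the following. It runs the same ROW_NUMBER() query through Python's sqlite3 module purely as a stand-in for MySQL 8.0 (an assumption made so the example is self-contained; it needs SQLite 3.25 or later for window functions). Using purchase_id as a tiebreaker for purchases on the same date is also an assumption.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (user TEXT, date TEXT, purchase_id INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("a", "2018-01-01", 1),
        ("b", "2018-01-02", 2),
        ("a", "2018-01-02", 3),
        ("a", "2018-01-03", 4),
        ("b", "2018-01-04", 5),
        ("a", "2018-01-04", 6),
    ],
)

# Number each user's purchases by date; purchase_id breaks ties on equal dates.
rows = conn.execute(
    """
    SELECT user, date, purchase_id,
           ROW_NUMBER() OVER (PARTITION BY user ORDER BY date, purchase_id)
               AS purchase_order
    FROM mytable
    ORDER BY purchase_id
    """
).fetchall()

for row in rows:
    print(row)
```

With the sample data from the question, ordering by date or by purchase_id produces the same numbering, so the choice only matters when the two columns disagree.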

Exclusively for versions prior to 8.0...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(purchase_id SERIAL PRIMARY KEY
,user CHAR(1) NOT NULL
,date DATE NOT NULL
);
INSERT INTO my_table VALUES
(1,'a','2018-01-01'),
(2,'b','2018-01-02'),
(3,'a','2018-01-02'),
(4,'a','2018-01-03'),
(5,'b','2018-01-04'),
(6,'a','2018-01-04');
SELECT a.purchase_id
, a.user
, a.date
, a.i rank
FROM
( SELECT x.*
, CASE WHEN @prev = user THEN @i:=@i+1 ELSE @i:=1 END i
, @prev := user
FROM my_table x
, (SELECT @prev:=null,@i:=0) vars
ORDER
BY user
, date
) a
ORDER
BY purchase_id;
+-------------+------+------------+------+
| purchase_id | user | date | rank |
+-------------+------+------------+------+
| 1 | a | 2018-01-01 | 1 |
| 2 | b | 2018-01-02 | 1 |
| 3 | a | 2018-01-02 | 2 |
| 4 | a | 2018-01-03 | 3 |
| 5 | b | 2018-01-04 | 2 |
| 6 | a | 2018-01-04 | 4 |
+-------------+------+------------+------+

Related

How to select all records for only the first 50 distinct values in a column

I am trying to create a classifier model for a dataset, but I have too many distinct values for my target variable. If I run something like this:
Create or replace model `model_name`
options (model_type="AUTOML_CLASSIFIER", input_label_cols=["ORIGIN_AIRPORT"]) as
select DAY_OF_WEEK, ARRIVAL_TIME, ARRIVAL_DELAY, ORIGIN_AIRPORT
from `table_name`
limit 1000
I end up getting
Error running query
Classification model currently only supports classification with up to 50 unique labels and the label column had 111 unique labels.
So how can I select, for example, all rows that have one of the first 50 values of ORIGIN_AIRPORT?
SELECT * FROM TABLE_NAME AS T1 LEFT OUTER JOIN (SELECT DISTINCT
COLUMN_NAME FROM TABLE_NAME ORDER BY COLUMN_NAME LIMIT 50) AS T2 ON
T1.COLUMN_NAME = T2.COLUMN_NAME
The inner query fetches the first 50 distinct values, and the outer query matches each record against them through the T1.COLUMN_NAME = T2.COLUMN_NAME join condition. Because this is a LEFT OUTER JOIN, every record is still returned, with NULL in the T2 column for records whose value is not among those 50; use an inner JOIN instead to keep only the matching records.
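To see the difference between the LEFT OUTER JOIN above and an inner JOIN, here is a small sketch. The flights table, its rows, and the limit of 2 (instead of 50) are made up for illustration, and SQLite (via Python's sqlite3) is used only as a convenient stand-in for the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE flights (origin_airport TEXT, arrival_delay INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("ATL", 5), ("BOS", 7), ("ATL", 1), ("DEN", 3), ("BOS", 2)],
)

# Subquery: the first 2 distinct airports in alphabetical order.
first_values = (
    "SELECT DISTINCT origin_airport FROM flights "
    "ORDER BY origin_airport LIMIT 2"
)

# Inner join: only rows whose airport is among the first 2 values.
inner_rows = conn.execute(
    f"""
    SELECT t1.origin_airport, t1.arrival_delay
    FROM flights AS t1
    JOIN ({first_values}) AS t2
      ON t1.origin_airport = t2.origin_airport
    """
).fetchall()

# Left outer join: every row is returned; unmatched rows get NULL from t2.
left_rows = conn.execute(
    f"""
    SELECT t1.origin_airport, t2.origin_airport
    FROM flights AS t1
    LEFT OUTER JOIN ({first_values}) AS t2
      ON t1.origin_airport = t2.origin_airport
    """
).fetchall()

unmatched = [r for r in left_rows if r[1] is None]
print(len(inner_rows), len(left_rows), unmatched)
```

For the model-training case in the question, the inner join is probably the variant you want, since the left join still leaves all 111 labels in the result.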
Given a table of values (origin_airport), with unique identifiers (id) and date, find the minimum date for each unique value (origin_airport) to decide which N origin_airport values are to be returned.
Return all rows which match the first 3 unique origin_airport values (densely ranked, by min(date) per origin_airport).
Updated: to use columns that more closely match the model, with origin_airport and a date column for ordering.
Full working test case
The test data:
CREATE TABLE airportlogs (
origin_airport int
, id int primary key auto_increment
, date date DEFAULT NULL
);
INSERT INTO airportlogs (origin_airport) VALUES
( 1 )
, ( 1 )
, ( 8 )
, ( 8 )
, ( 8 )
, ( 7 )
, ( 7 )
, ( 6 )
, ( 5 )
, ( 4 )
, ( 3 )
, ( 3 )
, ( 7 )
, ( 7 )
, ( 1 )
, ( 8 )
, ( 3 )
, ( 1 )
;
-- Create some dates to use for ordering.
-- Ordering can be as complicated as we need.
UPDATE airportlogs SET date = current_date + INTERVAL +id DAY;
-- Intermediate calculation to show the MIN(date) per origin_airport
WITH nvals (origin_airport, mdate) AS (
SELECT origin_airport, MIN(date) AS mdate FROM airportlogs GROUP BY origin_airport
)
SELECT *
FROM nvals
ORDER BY mdate
;
+----------------+------------+
| origin_airport | mdate |
+----------------+------------+
| 1 | 2021-08-05 |
| 8 | 2021-08-07 |
| 7 | 2021-08-10 |
| 6 | 2021-08-12 |
| 5 | 2021-08-13 |
| 4 | 2021-08-14 |
| 3 | 2021-08-15 |
+----------------+------------+
-- Calculation of ordered rank for the unique origin_airport values
-- by MIN(date) per origin_airport.
WITH nvals0 (origin_airport, date, mdate) AS (
SELECT origin_airport
, date
, MIN(date) OVER (PARTITION BY origin_airport) AS mdate
FROM airportlogs
)
, nvals (origin_airport, date, mdate, r) AS (
SELECT origin_airport
, date
, mdate
, DENSE_RANK() OVER (ORDER BY mdate) AS r
FROM nvals0
)
SELECT *
FROM nvals
ORDER BY r, date
;
Result:
+----------------+------------+------------+---+
| origin_airport | date | mdate | r |
+----------------+------------+------------+---+
| 1 | 2021-08-05 | 2021-08-05 | 1 |
| 1 | 2021-08-06 | 2021-08-05 | 1 |
| 1 | 2021-08-19 | 2021-08-05 | 1 |
| 1 | 2021-08-22 | 2021-08-05 | 1 |
| 8 | 2021-08-07 | 2021-08-07 | 2 |
| 8 | 2021-08-08 | 2021-08-07 | 2 |
| 8 | 2021-08-09 | 2021-08-07 | 2 |
| 8 | 2021-08-20 | 2021-08-07 | 2 |
| 7 | 2021-08-10 | 2021-08-10 | 3 |
| 7 | 2021-08-11 | 2021-08-10 | 3 |
| 7 | 2021-08-17 | 2021-08-10 | 3 |
| 7 | 2021-08-18 | 2021-08-10 | 3 |
| 6 | 2021-08-12 | 2021-08-12 | 4 |
| 5 | 2021-08-13 | 2021-08-13 | 5 |
| 4 | 2021-08-14 | 2021-08-14 | 6 |
| 3 | 2021-08-15 | 2021-08-15 | 7 |
| 3 | 2021-08-16 | 2021-08-15 | 7 |
| 3 | 2021-08-21 | 2021-08-15 | 7 |
+----------------+------------+------------+---+
The final solution:
WITH min_date (origin_airport, date, mdate) AS (
SELECT origin_airport
, date
, MIN(date) OVER (PARTITION BY origin_airport) AS mdate
FROM airportlogs
)
, ranks (origin_airport, date, mdate, r) AS (
SELECT origin_airport
, date
, mdate
, DENSE_RANK() OVER (ORDER BY mdate) AS r
FROM min_date
)
SELECT *
FROM ranks
WHERE r <= 3
ORDER BY r, date
;
The final result:
+----------------+------------+------------+---+
| origin_airport | date | mdate | r |
+----------------+------------+------------+---+
| 1 | 2021-08-05 | 2021-08-05 | 1 |
| 1 | 2021-08-06 | 2021-08-05 | 1 |
| 1 | 2021-08-19 | 2021-08-05 | 1 |
| 1 | 2021-08-22 | 2021-08-05 | 1 |
| 8 | 2021-08-07 | 2021-08-07 | 2 |
| 8 | 2021-08-08 | 2021-08-07 | 2 |
| 8 | 2021-08-09 | 2021-08-07 | 2 |
| 8 | 2021-08-20 | 2021-08-07 | 2 |
| 7 | 2021-08-10 | 2021-08-10 | 3 |
| 7 | 2021-08-11 | 2021-08-10 | 3 |
| 7 | 2021-08-17 | 2021-08-10 | 3 |
| 7 | 2021-08-18 | 2021-08-10 | 3 |
+----------------+------------+------------+---+
There are a number of other solutions.
The poster didn't mention the logic for this ordering. But with the above window function behavior, that's trivial to specify.
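For anyone who wants to try the MIN(date) plus DENSE_RANK() technique without a MySQL server, here is a compact sketch using Python's sqlite3 module (SQLite 3.25 or later). The airport codes, dates, and the cut-off of 2 are invented for this example and are not taken from the test case above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airportlogs (origin_airport TEXT, date TEXT)")
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO airportlogs VALUES (?, ?)",
    [
        ("JFK", "2021-08-05"),
        ("LAX", "2021-08-03"),
        ("JFK", "2021-08-01"),
        ("ORD", "2021-08-02"),
        ("LAX", "2021-08-09"),
    ],
)

# Rank airports by their earliest date, then keep all rows of the first 2.
rows = conn.execute(
    """
    WITH min_date AS (
        SELECT origin_airport,
               date,
               MIN(date) OVER (PARTITION BY origin_airport) AS mdate
        FROM airportlogs
    ),
    ranks AS (
        SELECT origin_airport,
               date,
               DENSE_RANK() OVER (ORDER BY mdate) AS r
        FROM min_date
    )
    SELECT origin_airport, date, r
    FROM ranks
    WHERE r <= 2
    ORDER BY r, date
    """
).fetchall()

for row in rows:
    print(row)
```

Note that if two airports share the same minimum date they also share a rank, so WHERE r <= N can return more than N distinct airports; adding origin_airport to the DENSE_RANK() ordering would be one way to break such ties.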

SQL - Select records that their columns do not follow the same order

Given we have the following table, where the series number and the date should both increase:
+----+--------+------------+
| id | series | date |
+----+--------+------------+
| 1 | 10 | 2020-08-13 |
| 2 | 9 | 2020-08-02 |
| 3 | 8 | 2020-06-23 |
| 4 | 7 | 2020-06-08 |
| 5 | 6 | 2020-05-20 |
| 6 | 5 | 2020-05-05 |
| 7 | 4 | 2020-05-01 |
+----+--------+------------+
Is there a way to check whether there are records that do not follow this pattern?
For example, row 2 has a bigger series number, but its date is before row 3's:
+----+--------+------------+
| id | series | date |
+----+--------+------------+
| 1 | 10 | 2020-08-13 |
| 2 | 9 | 2020-06-02 |
| 3 | 8 | 2020-07-23 |
| 4 | 7 | 2020-06-08 |
| 5 | 6 | 2020-05-20 |
| 6 | 5 | 2020-05-05 |
| 7 | 4 | 2020-05-01 |
+----+--------+------------+
You can use window functions:
select *
from (
select t.*, lead(date) over(order by series) lead_date
from mytable t
) t
where date > lead_date
Alternatively:
select *
from (
select t.*, lead(series) over(order by date) lead_series
from mytable t
) t
where series > lead_series
You can use lag():
select t.*
from (select t.*,
lag(id) over (order by series) as prev_id_series,
lag(id) over (order by date) as prev_id_date
from t
) t
where prev_id_series <> prev_id_date;
You can fetch problematic rows and their corresponding conflicting rows using SELF JOIN like this (assuming your table is called "series"):
SELECT s1.id AS row_id, s1.series AS row_series, s1.date AS row_date,
s2.id AS conflict_id, s2.series AS conflict_series, s2.date AS conflict_date
FROM series AS s1
JOIN series AS s2
ON s1.series > s2.series AND s1.date < s2.date;
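To compare what the window-function approach and the self join report, here is a sketch that runs both against the second table from the question. It uses Python's sqlite3 module as a stand-in for the real database (an assumption; SQLite 3.25 or later is needed for lead()), and the table name mytable is borrowed from the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, series INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, 10, "2020-08-13"),
        (2, 9, "2020-06-02"),
        (3, 8, "2020-07-23"),
        (4, 7, "2020-06-08"),
        (5, 6, "2020-05-20"),
        (6, 5, "2020-05-05"),
        (7, 4, "2020-05-01"),
    ],
)

# lead(): flag a row when the next row by series has an earlier date.
lead_rows = conn.execute(
    """
    SELECT id
    FROM (
        SELECT t.*, LEAD(date) OVER (ORDER BY series) AS lead_date
        FROM mytable t
    ) AS t
    WHERE date > lead_date
    """
).fetchall()

# Self join: list every pair of rows whose series and date orders disagree.
pair_rows = conn.execute(
    """
    SELECT s1.id AS row_id, s2.id AS conflict_id
    FROM mytable AS s1
    JOIN mytable AS s2
      ON s1.series > s2.series AND s1.date < s2.date
    ORDER BY s1.id, s2.id
    """
).fetchall()

print(lead_rows)
print(pair_rows)
```

The two approaches answer slightly different questions: lead() only compares neighbouring rows and here points at row 3, while the self join lists every conflicting pair and shows row 2 clashing with both row 3 and row 4.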

Query with dynamic date intervals

Given a statuses table that holds information about products availability, how do I select the date that corresponds to the 1st day in the latest 20 days that the product has been active?
Yes I know the question is hard to follow. I think another way to put it would be: I want to know how many times each product has been sold in the last 20 days that it was active, meaning the product could have been active for years, but I'd only want the sales count from the latest 20 days that it had a status of "active".
It's something easily doable in the server-side (i.e. getting any collection of products from the DB, iterating them, performing n+1 queries on the statuses table, etc), but I have hundreds of thousands of items so it's imperative to do it in SQL for performance reasons.
table : products
+-------+-----------+
| id | name |
+-------+-----------+
| 1 | Apple |
| 2 | Banana |
| 3 | Grape |
+-------+-----------+
table : statuses
+-------+-------------+---------------+---------------+
| id | name | product_id | created_at |
+-------+-------------+---------------+---------------+
| 1 | active | 1 | 2018-01-01 |
| 2 | inactive | 1 | 2018-02-01 |
| 3 | active | 1 | 2018-03-01 |
| 4 | inactive | 1 | 2018-03-15 |
| 6 | active | 1 | 2018-04-25 |
| 7 | active | 2 | 2018-03-01 |
| 8 | active | 3 | 2018-03-10 |
| 9 | inactive | 3 | 2018-03-15 |
+-------+-------------+---------------+---------------+
table : items (ordered products)
+-------+---------------+-------------+
| id | product_id | order_id |
+-------+---------------+-------------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
| 5 | 1 | 5 |
| 6 | 2 | 3 |
| 7 | 2 | 4 |
| 8 | 2 | 5 |
| 9 | 3 | 5 |
+-------+---------------+-------------+
table : orders
+-------+---------------+
| id | created_at |
+-------+---------------+
| 1 | 2018-01-02 |
| 2 | 2018-01-15 |
| 3 | 2018-03-02 |
| 4 | 2018-03-10 |
| 5 | 2018-03-13 |
+-------+---------------+
I want my final results to look like this:
+-------+-----------+----------------------+--------------------------------+
| id | name | recent_sales_count | date_to_start_counting_sales |
+-------+-----------+----------------------+--------------------------------+
| 1 | Apple | 3 | 2018-01-30 |
| 2 | Banana | 0 | 2018-04-09 |
| 3 | Grape | 1 | 2018-03-10 |
+-------+-----------+----------------------+--------------------------------+
So this is what I mean by latest 20 active days for e.g. Apple:
It was last activated at '2018-04-25'. That's 4 days ago.
Before that, it was inactive since '2018-03-15', so all these days until '2018-04-25' don't count.
Before that, it was active since '2018-03-01'. That's 14 more days until '2018-03-15'.
Before that, inactive since '2018-02-01'.
Finally, it was active since '2018-01-01', so it should only count the missing 2 days (4 + 14 + 2 = 20) backwards from '2018-02-01', resulting in date_to_start_counting_sales = '2018-01-30'.
With the '2018-01-30' date in hand, I'm then able to count Apple orders in the last 20 active days: 3.
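The backwards walk described above can be sketched in plain Python to check the arithmetic. This is only an illustration of the logic, not the SQL being asked for; the function name, the (start, end) period tuples, and the choice of '2018-04-29' as "today" (inferred from "That's 4 days ago") are assumptions.

```python
from datetime import date, timedelta


def start_of_last_active_days(periods, cap_days):
    """Return the date on which the latest `cap_days` active days begin.

    `periods` is a list of (start, end) active intervals, oldest first,
    where `end` is the day the product became inactive (or today).
    """
    remaining = cap_days
    # Walk backwards from the most recent active period.
    for start, end in reversed(periods):
        length = (end - start).days
        if length >= remaining:
            # The cut-off falls inside this period.
            return end - timedelta(days=remaining)
        remaining -= length
    # Fewer than cap_days active days in total: start at the first activation.
    return periods[0][0]


today = date(2018, 4, 29)  # assumed "current" date

apple = [
    (date(2018, 1, 1), date(2018, 2, 1)),
    (date(2018, 3, 1), date(2018, 3, 15)),
    (date(2018, 4, 25), today),
]
banana = [(date(2018, 3, 1), today)]
grape = [(date(2018, 3, 10), date(2018, 3, 15))]

apple_start = start_of_last_active_days(apple, 20)
banana_start = start_of_last_active_days(banana, 20)
grape_start = start_of_last_active_days(grape, 20)
print(apple_start, banana_start, grape_start)
```

For Apple this walks through 4 + 14 days and then takes the remaining 2 days from the period ending on '2018-02-01', which matches the expected date_to_start_counting_sales values in the table above.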
Hope that makes sense.
Here is a fiddle with the data provided above.
I've got a standard SQL solution that does not use any window functions, since you are on MySQL 5.
My solution requires 3 stacked views.
It would have been better with a CTE but your version doesn't support it. Same goes for the stacked Views... I don't like to stack views and always try to avoid it, but sometimes you have no other choice, because MySQL doesn't accept subqueries in FROM clause for Views.
CREATE VIEW VIEW_product_dates AS
(
SELECT product_id, created_at AS active_date,
(
SELECT created_at
FROM statuses ti
WHERE name = 'inactive' AND ta.created_at < ti.created_at AND ti.product_id=ta.product_id
GROUP BY product_id
) AS inactive_date
FROM statuses ta
WHERE name = 'active'
);
CREATE VIEW VIEW_product_dates_days AS
(
SELECT product_id, active_date, inactive_date, datediff(IFNULL(inactive_date, SYSDATE()),active_date) AS nb_days
FROM VIEW_product_dates
);
CREATE VIEW VIEW_product_dates_days_cumul AS
(
SELECT product_id, active_date, ifnull(inactive_date,sysdate()) AS inactive_date, nb_days,
IFNULL((SELECT SUM(V2.nb_days) + V1.nb_days
FROM VIEW_product_dates_days V2
WHERE V2.active_date >= IFNULL(V1.inactive_date, SYSDATE()) AND V1.product_id=V2.product_id
),V1.nb_days) AS cumul_days
FROM VIEW_product_dates_days V1
);
The final view produces this:
| product_id | active_date | inactive_date | nb_days | cumul_days |
|------------|----------------------|----------------------|---------|------------|
| 1 | 2018-01-01T00:00:00Z | 2018-02-01T00:00:00Z | 31 | 49 |
| 1 | 2018-03-01T00:00:00Z | 2018-03-15T00:00:00Z | 14 | 18 |
| 1 | 2018-04-25T00:00:00Z | 2018-04-29T11:28:39Z | 4 | 4 |
| 2 | 2018-03-01T00:00:00Z | 2018-04-29T11:28:39Z | 59 | 59 |
| 3 | 2018-03-10T00:00:00Z | 2018-03-15T00:00:00Z | 5 | 5 |
So it collects all active periods of all products, counts the number of days in each period, and accumulates the active days of that period and all later ones, counting back from the current date.
Then we can query this final view to get the desired date for each product. I set a variable for your 20 days, so you can change that number easily if you want.
SET @cap_days = 20 ;
SELECT PD.id, PD.name,
SUM(CASE WHEN o.created_at > PD.date_to_start_counting_sales THEN 1 ELSE 0 END) AS recent_sales_count ,
PD.date_to_start_counting_sales
FROM
(
SELECT P.*,
(CASE WHEN LowerCap.max_cumul_days IS NULL
THEN ADDDATE(ifnull(HigherCap.min_inactive_date,sysdate()),(-@cap_days))
ELSE
CASE WHEN LowerCap.max_cumul_days < @cap_days AND HigherCap.min_inactive_date IS NULL
THEN ADDDATE(ifnull(LowerCap.max_inactive_date,sysdate()),(-LowerCap.max_cumul_days))
ELSE ADDDATE(ifnull(HigherCap.min_inactive_date,sysdate()),(LowerCap.max_cumul_days-@cap_days))
END
END) as date_to_start_counting_sales
FROM products P
LEFT JOIN
(
SELECT product_id, MAX(cumul_days) AS max_cumul_days, MAX(inactive_date) AS max_inactive_date
FROM VIEW_product_dates_days_cumul
WHERE cumul_days <= @cap_days
GROUP BY product_id
) LowerCap ON P.id=LowerCap.product_id
LEFT JOIN
(
SELECT product_id, MIN(cumul_days) AS min_cumul_days, MIN(inactive_date) AS min_inactive_date
FROM VIEW_product_dates_days_cumul
WHERE cumul_days > @cap_days
GROUP BY product_id
) HigherCap ON P.id=HigherCap.product_id
) PD
LEFT JOIN items i ON PD.id = i.product_id
LEFT JOIN orders o ON o.id = i.order_id
GROUP BY PD.id, PD.name, PD.date_to_start_counting_sales
Returns
| id | name | recent_sales_count | date_to_start_counting_sales |
|----|--------|--------------------|------------------------------|
| 1 | Apple | 3 | 2018-01-30T00:00:00Z |
| 2 | Banana | 0 | 2018-04-09T20:43:23Z |
| 3 | Grape | 1 | 2018-03-10T00:00:00Z |
FIDDLE : http://sqlfiddle.com/#!9/804f52/24
Not sure which version of MySQL you're working with, but if you can use 8.0, that version came out with a lot of functionality that makes things slightly more doable (CTEs, row_number(), partition, etc.).
My recommendation would be to create a view like the one in this DB-Fiddle example, call the view on the server side and iterate programmatically. There are ways of doing it purely in SQL, but it would be a bear to write and test, and would likely be less efficient.
Assumptions:
Products cannot be sold during inactive date ranges
Statuses table will always alternate status active/inactive/active for each product. I.e. no date ranges where a certain product is both active and inactive.
View Results:
+------------+-------------+------------+-------------+
| product_id | active_date | end_date | days_active |
+------------+-------------+------------+-------------+
| 1 | 2018-01-01 | 2018-02-01 | 31 |
+------------+-------------+------------+-------------+
| 1 | 2018-03-01 | 2018-03-15 | 14 |
+------------+-------------+------------+-------------+
| 1 | 2018-04-25 | 2018-04-29 | 4 |
+------------+-------------+------------+-------------+
| 2 | 2018-03-01 | 2018-04-29 | 59 |
+------------+-------------+------------+-------------+
| 3 | 2018-03-10 | 2018-03-15 | 5 |
+------------+-------------+------------+-------------+
View:
CREATE OR REPLACE VIEW days_active AS (
WITH active_rn
AS (SELECT *, Row_number()
OVER ( partition BY NAME, product_id
ORDER BY created_at) AS rownum
FROM statuses
WHERE name = 'active'),
inactive_rn
AS (SELECT *, Row_number()
OVER ( partition BY NAME, product_id
ORDER BY created_at) AS rownum
FROM statuses
WHERE name = 'inactive')
SELECT x1.product_id,
x1.created_at AS active_date,
CASE WHEN x2.created_at IS NULL
THEN Curdate()
ELSE x2.created_at
END AS end_date,
CASE WHEN x2.created_at IS NULL
THEN Datediff(Curdate(), x1.created_at)
ELSE Datediff(x2.created_at,x1.created_at)
END AS days_active
FROM active_rn x1
LEFT OUTER JOIN inactive_rn x2
ON x1.rownum = x2.rownum
AND x1.product_id = x2.product_id ORDER BY
x1.product_id);

select a column of data then a count of that column's value when a certain condition is true

Let's say I have a table like this:
project_id | created_by | created
1 | 3 | 2015-04-01
2 | 3 | 2015-04-07
3 | 4 | 2015-05-01
4 | 4 | 2015-05-02
and I want to select these columns, then a count of how many projects were created by the created_by before each project, to look like this:
project_id | created_by | created | previous by created_by user
1 | 3 | 2015-04-01 | 0
2 | 3 | 2015-04-07 | 1
3 | 4 | 2015-05-01 | 0
4 | 4 | 2015-05-02 | 1
How do I select the count for that last column? I've tried count(case where [condition] then 1 else null end) but I keep only getting one row of results when I use that.
You can use a subquery, as I already mentioned in the comments.
For example, the query could look like this:
SELECT t1.*,
(SELECT COUNT(*)
FROM project t2
WHERE t2.created < t1.created
AND t2.created_by = t1.created_by) AS previous_count
FROM project t1
It returns the columns of the table (called project here) plus the result of the subquery as the column previous_count, which holds the number of rows created earlier by the same user.
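As a quick check of the correlated-subquery approach against the sample data, here is a sketch using Python's sqlite3 module as a stand-in for MySQL. The table name project and the alias previous_count are assumptions chosen for this example; the created column name comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project (project_id INTEGER, created_by INTEGER, created TEXT)"
)
conn.executemany(
    "INSERT INTO project VALUES (?, ?, ?)",
    [
        (1, 3, "2015-04-01"),
        (2, 3, "2015-04-07"),
        (3, 4, "2015-05-01"),
        (4, 4, "2015-05-02"),
    ],
)

# For each project, count earlier projects created by the same user.
rows = conn.execute(
    """
    SELECT t1.project_id, t1.created_by, t1.created,
           (SELECT COUNT(*)
            FROM project t2
            WHERE t2.created_by = t1.created_by
              AND t2.created < t1.created) AS previous_count
    FROM project t1
    ORDER BY t1.project_id
    """
).fetchall()

for row in rows:
    print(row)
```

Because the subquery is evaluated per row, projects with no earlier project still show up with a count of 0, which is what the desired output in the question needs.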
Is this what you are after ?
select
project_id,
created_by,
created,
rn as `previous by created_by user`
from(
select
project_id,
created_by,
created,
@rn:=if(@prev_created_by = created_by,@rn+1,0) as rn,
@prev_created_by := created_by
from project,(select @rn:=0,@prev_created_by:=null)x
order by created_by,created
)x;
Here is a test case
mysql> select * from project ;
+------------+------------+------------+
| project_id | created_by | created |
+------------+------------+------------+
| 1 | 3 | 2015-04-01 |
| 2 | 3 | 2015-04-07 |
| 3 | 4 | 2015-05-01 |
| 4 | 4 | 2015-05-02 |
+------------+------------+------------+
4 rows in set (0.00 sec)
The above query returns:
+------------+------------+------------+-----------------------------+
| project_id | created_by | created | previous by created_by user |
+------------+------------+------------+-----------------------------+
| 1 | 3 | 2015-04-01 | 0 |
| 2 | 3 | 2015-04-07 | 1 |
| 3 | 4 | 2015-05-01 | 0 |
| 4 | 4 | 2015-05-02 | 1 |
+------------+------------+------------+-----------------------------+
-- LEFT JOIN keeps projects that have no earlier project (count 0)
SELECT t1.project_id, t1.created_by, t1.created, COUNT(t2.created)
FROM t1
LEFT JOIN (SELECT created_by, created FROM t1) AS t2
ON t1.created_by = t2.created_by AND t1.created > t2.created
GROUP BY t1.project_id, t1.created_by, t1.created

Sum up values in SQL once all values are available

I have events flowing into a MySQL database and I need to group and sum the events to transactions and store away into another table. The data looks like:
+----+---------+------+-------+
| id | transid | code | value |
+----+---------+------+-------+
| 1 | 1 | b | 12 |
| 2 | 1 | i | 23 |
| 3 | 2 | b | 34 |
| 4 | 1 | e | 45 |
| 5 | 3 | b | 56 |
| 6 | 2 | i | 67 |
| 7 | 2 | e | 78 |
| 8 | 3 | i | 89 |
| 9 | 3 | i | 90 |
+----+---------+------+-------+
The events arrive in batches and I would like to create the transaction by summing up the values for each transid, like:
select transid, sum(value) from eventtable group by transid;
but only after all the events for that transid have arrived. That is determined by the event with the code e (b for the beginning, e for the end and i for a varying number of intermediates). Being a novice in SQL, how could I implement the requirement for the existence of the end code before the summing?
Perhaps with having:
select transid, sum(value)
from eventtable
group by transid
having max(case code when 'e' then 1 end)=1;
select transid, sum(value) from eventtable
group by transid
HAVING COUNT(*) = 3
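You should count the records in the group: once the (b)egin, (i)ntermediate and (e)nd events have all arrived, the group has three rows and is no longer filtered out. Note that this only works if every transaction has exactly one intermediate event.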
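The two HAVING clauses above can be compared on the sample data from the question. This sketch uses Python's sqlite3 module as a stand-in for MySQL (an assumption made so it runs anywhere); the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eventtable (id INTEGER, transid INTEGER, code TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO eventtable VALUES (?, ?, ?, ?)",
    [
        (1, 1, "b", 12),
        (2, 1, "i", 23),
        (3, 2, "b", 34),
        (4, 1, "e", 45),
        (5, 3, "b", 56),
        (6, 2, "i", 67),
        (7, 2, "e", 78),
        (8, 3, "i", 89),
        (9, 3, "i", 90),
    ],
)

# Keep only transactions that already contain an 'e' (end) event.
with_end = conn.execute(
    """
    SELECT transid, SUM(value)
    FROM eventtable
    GROUP BY transid
    HAVING MAX(CASE code WHEN 'e' THEN 1 END) = 1
    ORDER BY transid
    """
).fetchall()

# Keep transactions with exactly three events, whatever their codes are.
three_rows = conn.execute(
    """
    SELECT transid, SUM(value)
    FROM eventtable
    GROUP BY transid
    HAVING COUNT(*) = 3
    ORDER BY transid
    """
).fetchall()

print(with_end)
print(three_rows)
```

On this data the COUNT(*) = 3 variant also returns transid 3, which has three events (b, i, i) but no end event yet, so checking for the e code directly looks like the safer test when the number of intermediates varies.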