SQL: How to find top customers that pay 80% of revenue? - mysql

Let's say I have a table TRANSACTIONS:
desc customer_transactions;
+------------------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| transactionID | varchar(128) | YES | | NULL | |
| customerID | varchar(128) | YES | | NULL | |
| amountAuthorized | DECIMAL(5,2) | YES | | NULL | |
| createdDatetime | datetime | YES | | NULL | |
+------------------------------+--------------+------+-----+---------+----------------+
This table has records of credit card transactions for a SaaS business for the last 5 years. The business has a typical monthly subscription model, where customers are automatically charged based on their plan.
I need to find the top customers that are responsible for 80% of all revenue (per time period). The SAAS business is very uneven, because some customers pay 10/month, others may pay in thousands per month.
I will add a "time period" filter later, just need help with aggregation.
I want to generate a report where I only select the customers that generated 80% of revenue in this format:
+------------+-------+
| customerID | Total |
+------------+-------+
Not sure why this question was "on hold". I just need help writing a query and do not have enough experience with SQL. Basically, the title of the question states what is needed here:
I need to list customers and their corresponding totals, however, only need to select those customers that make up 80% of total revenue. The report needs to aggregate a total per customer.
Using MariaDB version 10.3.9

This is the kind of thing you need to use window functions for.
WITH
-- define some sample data,
-- where the sum total of amountAuthorized is 10,000
customer_transactions( `id`, transactionID, customerID,
amountAuthorized, createdDatetime) AS
(
SELECT 1, 1, 1, 5000, '2018-08-01'
UNION ALL SELECT 2, 2, 2, 2000, '2018-08-01'
UNION ALL SELECT 3, 3, 3, 1000, '2018-08-01'
UNION ALL SELECT 4, 4, 4, 1000, '2018-08-01'
UNION ALL SELECT 5, 5, 5, 1000, '2018-08-01'
)
-- a query that gives us the running total, sorted to give us the biggest customers first.
-- note that the additional sorts affect what customers might be returned.
,running_totals AS
(
SELECT *, SUM(amountAuthorized) OVER (ORDER BY amountAuthorized DESC, createdDatetime DESC, `id`) AS runningTotal
FROM customer_transactions
)
SELECT *
FROM running_totals
WHERE runningTotal <= ( SELECT 0.8 * SUM(amountAuthorized)
FROM customer_transactions)
Note that this takes into account (no pun intended) all data in the table. When you want to only look at a specific time period, you might want to create an intermediate CTE that filters out the dates you want.
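The query above ranks individual transactions, while the question asks for a total per customer. A minimal sketch of a per-customer variant follows: it aggregates first and then takes the running total over those aggregates. It uses Python's built-in sqlite3 module as a stand-in for MariaDB (an assumption made only so the example runs on its own; it needs SQLite 3.25 or later for window functions), and the sample rows are hypothetical. Unlike the `<=` filter above, this sketch also keeps the customer whose total crosses the 80% mark, which is a design choice rather than the only reasonable reading.

```python
import sqlite3

# In-memory SQLite database standing in for MariaDB (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_transactions ("
    " id INTEGER PRIMARY KEY, customerID TEXT,"
    " amountAuthorized NUMERIC, createdDatetime TEXT)"
)
# Hypothetical sample rows: five customers, 10,000 in total revenue.
conn.executemany(
    "INSERT INTO customer_transactions"
    " (customerID, amountAuthorized, createdDatetime) VALUES (?, ?, ?)",
    [
        ("c1", 3000, "2018-08-01"), ("c1", 2000, "2018-09-01"),
        ("c2", 2000, "2018-08-01"),
        ("c3", 1500, "2018-08-01"),
        ("c4", 1000, "2018-08-01"),
        ("c5", 500, "2018-08-01"),
    ],
)

TOP_CUSTOMERS_SQL = """
WITH totals AS (
    -- one row per customer
    SELECT customerID, SUM(amountAuthorized) AS Total
    FROM customer_transactions
    GROUP BY customerID
),
running AS (
    SELECT customerID, Total,
           SUM(Total) OVER (ORDER BY Total DESC, customerID) AS runningTotal,
           SUM(Total) OVER () AS grandTotal
    FROM totals
)
SELECT customerID, Total
FROM running
-- keep a customer while the revenue accumulated before them is below 80%
WHERE runningTotal - Total < 0.8 * grandTotal
ORDER BY Total DESC, customerID
"""

top_customers = conn.execute(TOP_CUSTOMERS_SQL).fetchall()
print(top_customers)
```

A time-period filter would go into the `totals` CTE as a WHERE clause on createdDatetime, so that both the per-customer totals and the grand total are computed over the same period.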

You will find that surprisingly close to 20% of the customers account for the 80%. See the 80/20 rule.
But, if you don't want to go that direction, you have 2 options:
Switch to MySQL 8.0 or MariaDB 10.2 in order to use 'windowing' functions; or
Use @variables to produce a running total, then (in an outer query) grab the desired rows.
Since you are using MariaDB 10.3.9, the windowing seems to be the way to go. But first, you need a separate query (or derived table) that computes the total revenue so you can get 80% of it.
Suggest
SELECT @revenue80 := 0.8 * SUM(amountAuthorized)
FROM customer_transactions;
Then use @revenue80 inside the WHERE that Zack suggests.
I see that each amount can be no more than 999.99. Really? Is this a coffee shop?

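Use the following to compute each customer's share of total revenue (note that the HAVING clause keeps only customers whose individual share is at least 80%, not the top customers that cumulatively make up 80%):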
SELECT
ct1.customerID,
SUM(ct1.amountAuthorized) as Total,
100 * (SUM(ct1.amountAuthorized) / ct3.total_revenue) as percent_revenue
FROM
customer_transactions ct1
CROSS JOIN (SELECT SUM(amountAuthorized) AS total_revenue
FROM customer_transactions ct2) AS ct3
GROUP BY
ct1.customerID
HAVING percent_revenue >= 80

Related

Return preferred record when there is more than one record for the same user

I have a table where it stores the types of discounts that a user can have.
Some users will get the standard discount, but some will get a bigger and better discount. For users who have the biggest and best discount, there will be two records in the database, one for the default discount and the other for the biggest and best discount. The biggest and best discount will be preferred in the search.
I would like to do a SELECT that would return the record with the highest discount and if you don't find it, return it with the standard discount for me to avoid making two queries in the database or having to filter in the source code.
Ex:
| id | user_id | country | discount | cashback | free_trial |
|-----------------------------------------------------------------------|
| 1 | 1 | EUA | DEFAULT | 10 | false |
| 2 | 1 | EUA | CHRISTMAS | 20 | true |
| 3 | 3 | EUA | DEFAULT | 10 | false |
SELECT *
FROM users
WHERE country = 'EUA'
AND (discount = 'CHRISTMAS' OR discount = 'DEFAULT');
In the example above, for user 1 it would return the record with the discount equal to "CHRISTMAS", and for user 3 it would return "DEFAULT" because that is the only record that user has. Can you help me please?
You can use the row_number() window function to do this. This function includes a PARTITION BY that lets you start the numbering over with each user, as well as its own ORDER BY that lets you determine which rows will sort first within each user/partition.
Then you nest this inside another SELECT to limit to rows where the row_number() result is 1 (the discount that sorted best):
SELECT *
FROM (
SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY cashback DESC) rn
FROM users
WHERE country = 'EUA'
) u
WHERE rn = 1
You could also use a LATERAL JOIN, which is usually better than the correlated join in the other answer, but not as good as the window function.
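As a rough, runnable illustration of the row_number() approach, here is a sketch that loads the three sample rows from the question into an in-memory SQLite database (used here only as a stand-in for the real server, and requiring SQLite 3.25 or later). The extra sort key that puts non-DEFAULT discounts first is an assumption about what "preferred" means; the answer above sorts by cashback alone.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " country TEXT, discount TEXT, cashback INTEGER, free_trial TEXT)"
)
# The three sample rows from the question.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "EUA", "DEFAULT", 10, "false"),
        (2, 1, "EUA", "CHRISTMAS", 20, "true"),
        (3, 3, "EUA", "DEFAULT", 10, "false"),
    ],
)

BEST_DISCOUNT_SQL = """
SELECT id, user_id, discount, cashback
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               -- non-DEFAULT discounts sort first, then the higher cashback
               ORDER BY (discount = 'DEFAULT'), cashback DESC
           ) AS rn
    FROM users
    WHERE country = 'EUA'
) u
WHERE rn = 1
ORDER BY user_id
"""

best = conn.execute(BEST_DISCOUNT_SQL).fetchall()
print(best)
```

Each user ends up with exactly one row: the special discount when one exists, otherwise the default one.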
You can use GROUP BY to do it:
SELECT u1.*
FROM users u1
JOIN
(
SELECT COUNT(id) AS cnt,user_id
FROM users WHERE country = 'EUA'
GROUP BY user_id
) u2 ON u1.user_id=u2.user_id
WHERE IF(u2.cnt=1,u1.discount='DEFAULT',u1.discount='CHRISTMAS')

Need some help optimising an SQL query

My client was given the following code, and he uses it daily to count the messages sent to businesses on his website. I have looked at the MYSQL.SLOW.LOG and it has the following stats for this query, which indicates to me that it needs optimising.
Count: 183 Time=44.12s (8073s) Lock=0.00s (0s)
Rows_sent=17337923391683297280.0 (-1), Rows_examined=382885.7 (70068089),
Rows_affected=0.0 (0), thewedd1[thewedd1]@localhost
The query is:
SELECT
businesses.name AS BusinessName,
messages.created AS DateSent,
messages.guest_sender AS EnquirersEmail,
strip_tags(messages.message) AS Message,
users.name AS BusinessName
FROM
messages
JOIN users ON messages.from_to = users.id
JOIN businesses ON users.business_id = businesses.id
My SQL is not very good, but would a LEFT JOIN rather than a JOIN help to reduce the number of rows returned? I have run an EXPLAIN query and there seems to be no difference between the LEFT JOIN and the JOIN.
Basically I think it would be good to reduce the number of rows returned, as it is absurdly big.
Short answer: There is nothing "wrong" with your query, other than the duplicate BusinessName alias.
Long answer: You can add indexes to the foreign / primary keys to speed up searching which will do more than changing the query.
If you're using SSMS (SQL management studio) you can right click on indexes for a table and use the wizard.
Just don't be tempted to index all the columns as that may slow down any inserts you do in future, stick to the ids and _ids unless you know what you're doing.
"he uses it daily to count the messages sent to businesses"
If this is done per day, why not limit this to messages sent in specific recent days?
As an example: To count messages sent per business per day, for just a few recent days (example: 3 or 4 days), try this:
SELECT businesses.name AS BusinessName
, messages.created AS DateSent
, COUNT(*) AS n
FROM messages
JOIN users ON messages.from_to = users.id
JOIN businesses ON users.business_id = businesses.id
WHERE messages.created BETWEEN current_date - INTERVAL '3' DAY AND current_date
GROUP BY businesses.id
, DateSent
ORDER BY DateSent DESC
, n DESC
, businesses.id
;
Note: businesses.name is functionally dependent on businesses.id (in the GROUP BY terms), which is the primary key of businesses.
Example result:
+--------------+------------+---+
| BusinessName | DateSent | n |
+--------------+------------+---+
| business1 | 2021-09-05 | 3 |
| business2 | 2021-09-05 | 1 |
| business2 | 2021-09-04 | 1 |
| business2 | 2021-09-03 | 1 |
| business3 | 2021-09-02 | 5 |
| business1 | 2021-09-02 | 1 |
| business2 | 2021-09-02 | 1 |
+--------------+------------+---+
7 rows in set
This assumes your basic join logic is correct, which might not be true.
Other data could be returned as aggregated results, if necessary, and since this is now limited to just recent data, the number of rows examined should be much more reasonable.

Time difference between adjacent rows in one column of one mysql table

I have a table with some 100.000 rows having this structure:
+------+---------------------+-----------+
| id | timestamp | eventType |
+------+---------------------+-----------+
| 12 | 2015-07-01 16:45:47 | 3001 |
| 103 | 2015-07-10 19:30:14 | 3001 |
| 1174 | 2015-09-03 12:57:08 | 3001 |
+------+---------------------+-----------+
For each row, I would like to calculate the days between the timestamp of this and the previous row.
As you can see, the id is not continuous, because the table contains different events, and I would like to compare only the timestamps of one specific event over time.
I know that DATEDIFF can be used to compare two dates, and I could pick the two rows with a query that selects each row by its specific id.
But as I have many thousands of rows, I am searching for a way to somehow loop through the whole table.
Unfortunately my SQL knowledge is limited, and searching did not reveal an example close enough to my question that I could continue from there.
I would be very thankful for any hint.
If you are running MySQL 8.0, you can just use lag(). Say you want the difference in seconds:
select t.*,
timestampdiff(
second,
lag(timestamp) over(partition by eventtype order by id),
timestamp
) diff
from mytable t
In earlier versions, one alternative is a correlated subquery:
select t.*,
timestampdiff(
second,
(select timestamp from mytable t1 where t1.eventtype = t.eventtype and t1.id < t.id order by t1.id desc limit 1),
timestamp
) diff
from mytable t
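To make the lag() idea concrete, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL 8.0 (an assumption for runnability; SQLite 3.25 or later is needed). SQLite has no timestampdiff(), so the difference in seconds is computed from strftime('%s', ...) instead. The three rows for event 3001 come from the question; the row for event 4000 is hypothetical and only shows that the partition keeps event types apart.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, timestamp TEXT, eventType INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (12, "2015-07-01 16:45:47", 3001),
        (50, "2015-07-05 00:00:00", 4000),  # hypothetical row of another event type
        (103, "2015-07-10 19:30:14", 3001),
        (1174, "2015-09-03 12:57:08", 3001),
    ],
)

LAG_SQL = """
SELECT id,
       eventType,
       -- seconds since the previous row of the same event type (NULL for the first)
       CAST(strftime('%s', timestamp) AS INTEGER)
         - LAG(CAST(strftime('%s', timestamp) AS INTEGER))
             OVER (PARTITION BY eventType ORDER BY id) AS diff_seconds
FROM mytable
ORDER BY id
"""

rows = conn.execute(LAG_SQL).fetchall()
for row in rows:
    print(row)
```

Dividing diff_seconds by 86400 gives the difference in days that the question asks for.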

Find sum of stacked/overlapping date intersections in SQL table

I have the following table which represents bookings of articles:
+---+------------+----------+-------------+-------------+
|id | article_id | quantity | starts_at | ends_at |
+---+------------+----------+-------------+-------------+
| 1 | 1 | 1 | 2015-03-01 | 2015-03-20 |
| 2 | 1 | 2 | 2015-03-02 | 2015-03-03 |
| 3 | 1 | 3 | 2015-03-04 | 2015-03-15 |
| 4 | 1 | 2 | 2015-03-16 | 2015-03-22 |
| 5 | 1 | 2 | 2015-03-11 | 2015-03-19 |
| 6 | 2 | 2 | 2015-03-06 | 2015-03-22 |
| 7 | 2 | 3 | 2015-03-02 | 2015-03-04 |
+---+------------+----------+-------------+-------------+
From this table I want to extract the following information:
+------------+----------+
| article_id | sum |
+------------+----------+
| 1 | 6 |
| 2 | 3 |
+------------+----------+
Sum represents the max sum of quantity of stacked/overlapping booked articles for the given time ranges. In the first table article with id=1 has its maximum from booking 1, 3 and 5.
Is there any MySQL solution to obtain this information from a table like this?
Thank you very much!
EDIT: The date intersections are crucial. If, say, booking 5 started at 2015-03-17, the sum for article_id=1 would be 5, because bookings 3 and 5 would no longer overlap. The SQL should automatically consider all possible overlaps.
My answer is going to seem crazy complicated, perhaps; but it isn't, if one accepts that the use of a calendar table is an excellent MySQL idiom for dealing with date range related issues. I've closely adapted calendar table code from Artful Software's calendar table article. Artful Software's query techniques are a wonderful resource for doing complicated things in MySQL. The calendar table gives you a row per individual date that you are working with, which makes many things much easier.
For the whole thing below, you can go to this sqlfiddle for a place to play around with the code. It'll take a while to load.
First, here is your data:
CREATE TABLE articles
(`id` int, `article_id` int, `quantity` int, `starts_at` datetime, `ends_at` datetime);
INSERT INTO articles
(`id`, `article_id`, `quantity`, `starts_at`, `ends_at`)
VALUES
(1, 1, 1, '2015-03-01 00:00:00', '2015-03-20 00:00:00'),
(2, 1, 2, '2015-03-02 00:00:00', '2015-03-03 00:00:00'),
(3, 1, 3, '2015-03-04 00:00:00', '2015-03-15 00:00:00'),
(4, 1, 2, '2015-03-16 00:00:00', '2015-03-22 00:00:00'),
(5, 1, 2, '2015-03-11 00:00:00', '2015-03-19 00:00:00'),
(6, 2, 2, '2015-03-06 00:00:00', '2015-03-22 00:00:00'),
(7, 2, 3, '2015-03-02 00:00:00', '2015-03-04 00:00:00');
Next, here is the creation of the calendar table--I've created somewhat more date rows than needed (going back to start of year, and forward to start of next year). Ideally you just permanently keep a more massive calendar table on hand, covering a span of dates that will handle anything you could ever need. All the stuff below is going to seem quite lengthy and complex. But if you already have a calendar table lying around, the whole next section is not necessary.
CREATE TABLE calendar ( dt datetime primary key );
/* the views below will be joined and rejoined to themselves to
get the effect of creating many rows. v ends up with 10 rows. */
CREATE OR REPLACE VIEW v3 as SELECT 1 n UNION ALL SELECT 1 UNION ALL SELECT 1;
CREATE OR REPLACE VIEW v as SELECT 1 n FROM v3 a, v3 b UNION ALL SELECT 1;
/* Going to limit the calendar table to first of year of min date
and first of year after max date */
SELECT @min := makedate(year(min(starts_at)),1) FROM articles;
SELECT @max := makedate(year(max(ends_at))+1,1) FROM articles;
SET @inc = -1;
/* below we work with @min date + @inc days successively, with @inc:=@inc+1
acting like ++variable, so we start with minus 1.
We insert as many individual date rows as we want by self-joining v,
and using some kind of limit via WHERE to keep the calendar table small
for our example. For n occurrences of v below, you get a max
of 10^n rows in the calendar table. We are using v as row-creation
engine. */
INSERT INTO calendar
SELECT @min + interval @inc:=@inc+1 day as dt
FROM v a, v b, v c, v d # , v e , v f
WHERE @inc < datediff(@max,@min);
Now we are ready to find the stackings. Assuming the above (big assumption, I know), this becomes pretty easy. I'm going to do it through a few views for readability.
/* now create a view that will let us easily view the articles
related to individual dates when we query.
Not necessary, just makes things easier to read. */
CREATE OR REPLACE VIEW articles_to_dates as
SELECT c.dt, article_id
FROM articles a
INNER JOIN calendar c on c.dt between (SELECT min(starts_at) FROM articles) and (SELECT max(ends_at) FROM articles)
GROUP BY article_id, c.dt;
-- SELECT * FROM articles_to_dates; -- this query would show the view's result
/* next view is the total amount of articles booked per individual date */
CREATE OR REPLACE VIEW booked_quantities_per_day AS
SELECT a2d.dt,a2d.article_id, SUM(a.quantity) as booked_quantity
FROM articles_to_dates a2d
INNER JOIN articles a on a2d.dt between a.starts_at and a.ends_at and a.article_id = a2d.article_id
GROUP BY a2d.dt, a2d.article_id
ORDER by a2d.article_id, a2d.dt;
-- SELECT * from booked_quantities_per_day; -- this query would show the view's result
Finally, here are the desired results:
SELECT article_id, max(booked_quantity) max_stacked
FROM booked_quantities_per_day
GROUP BY article_id;
Results:
article_id max_stacked
1 6
2 3
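For a quick cross-check of these results outside the database, the same maximum can be computed with a sweep over start and end events. This is a different technique from the calendar table (a plain Python sketch, not a translation of the SQL above), and it assumes, as the BETWEEN joins above do, that both starts_at and ends_at are inclusive days.

```python
from collections import defaultdict
from datetime import date, timedelta

# The bookings from the question: (id, article_id, quantity, starts_at, ends_at).
bookings = [
    (1, 1, 1, date(2015, 3, 1), date(2015, 3, 20)),
    (2, 1, 2, date(2015, 3, 2), date(2015, 3, 3)),
    (3, 1, 3, date(2015, 3, 4), date(2015, 3, 15)),
    (4, 1, 2, date(2015, 3, 16), date(2015, 3, 22)),
    (5, 1, 2, date(2015, 3, 11), date(2015, 3, 19)),
    (6, 2, 2, date(2015, 3, 6), date(2015, 3, 22)),
    (7, 2, 3, date(2015, 3, 2), date(2015, 3, 4)),
]


def max_stacked(rows):
    """Return {article_id: largest quantity booked on any single day}."""
    events = defaultdict(list)
    for _id, article_id, quantity, starts_at, ends_at in rows:
        # The quantity is taken on the start day ...
        events[article_id].append((starts_at, quantity))
        # ... and released the day after the (inclusive) end day.
        events[article_id].append((ends_at + timedelta(days=1), -quantity))

    result = {}
    for article_id, article_events in events.items():
        running = best = 0
        # Sorting puts releases before new bookings that fall on the same day.
        for _day, delta in sorted(article_events):
            running += delta
            best = max(best, running)
        result[article_id] = best
    return result


stacked = max_stacked(bookings)
print(stacked)
```

Moving booking 5's start to 2015-03-17, as in the question's edit, drops the value for article 1 from 6 to 5.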
This should work.
Use two levels of grouping: the first gets the distinct list of possible 'quantity' values per article, and the second sums them.
SELECT article_id, SUM(sub.quantity) FROM
(SELECT article_id, quantity FROM table GROUP BY article_id, quantity) as sub
GROUP BY article_id
select article_id, sum(quantity) from ...
where
... select your date range ...
group by article_id

complex sql query (GROUP BY)

I need some help building a query.
Here is what I need :
I have a table called data:
ID| PRODUCT | VALUE |COUNTRY| DEVICE | SYSTEM
-----+---------+-------+-------+---------+--------
48 | p1 | 0.4 | US | dev1 | system1
47 | p2 | 0.67 | IT | dev2 | system2
46 | p3 | 1.2 | GB | dev3 | system3
45 | p1 | 0.9 | ES | dev4 | system4
44 | p1 | 0.6 | ES | dev4 | system1
I need to show which products have produced the most revenue and which country, device and system contributed the most.
For example, the result I would get from the table would be:
PRODUCT | TOTAL COST |COUNTRY| DEVICE | SYSTEM
-------+------------+-------+---------+--------
p1 | 1.9 | ES | dev4 | system1
p2 | 0.67 | IT | dev2 | system2
p3 | 1.2 | GB | dev3 | system3
Top country is ES because ES contributed 0.9 + 0.6 = 1.5 > 0.4 (the contribution of US).
The same logic applies for top device and top system.
I guess for total revenue and product something like this will do :
SELECT SUM(value) as total_revenue,product FROM data GROUP BY product
But how can I add country,device and system?
Is this even feasible in a single query, if not what is the best way (performance wise) to do it?
Many thanks for your help.
EDIT
I edited the sample table to explain better.
Do it in separate queries:
SELECT country, -- change to device, system, etc. as required
SUM(value) AS amount
FROM data
GROUP BY country -- change to device, system, etc. as required
ORDER BY amount DESC
LIMIT 1
You are correct... it is not just a simple query... but 3 queries wrapped into one result.
I've posted my sample out on SQL Fiddle here...
First query -- the inner most. You need to get all revenue based on a per product/country and sort that by the product and DESCENDING on the total revenue to have highest revenue in first position per product.
Next query (where I've implemented use of MySQL @variable use). Since the first result order already has it in order of product and revenue rank, I set the rank to 1 every time a product changes from whatever the "@LastProd" is... This would create ES = Rank #1 for product 1, then US = Rank #2 for product 1, then continue on the other "products".
The final outermost query re-joins back to the raw Data table but gets a list of all the devices and systems that comprised the product sale in question, but ONLY where the product rank was #1.
select
pqRank.product,
pqRank.country,
pqRank.revenue,
group_concat( distinct d2.device ) as PartDevices,
group_concat( distinct d2.system ) as PartSystems
from
( select
pq.product,
pq.country,
pq.revenue,
@RevenueRank := if( @LastProd = pq.product, @RevenueRank +1, 1 ) as ProdRank,
@LastProd := pq.product
from
( select
d.product,
d.country,
sum( d.value ) as Revenue
from
data d
group by
d.product,
d.country
order by
d.product,
Revenue desc ) pq,
( select @RevenueRank := 0,
@LastProd := ' ') as sqlvars
) pqRank
JOIN data d2
on pqRank.product = d2.product
and pqRank.country = d2.country
where
pqRank.ProdRank = 1
group by
pqRank.product,
pqRank.country
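On servers that support window functions (MySQL 8.0 or later, MariaDB 10.2 or later), the ranking step can be written with ROW_NUMBER() instead of user variables. The sketch below is one possible shape of that, not the method used above. It runs against an in-memory SQLite database (a stand-in chosen for runnability, needing SQLite 3.25 or later) loaded with the question's sample rows, and it only picks the top country per product; device and system would follow the same pattern with their own grouping.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER PRIMARY KEY, product TEXT, value REAL,"
    " country TEXT, device TEXT, system TEXT)"
)
# The sample rows from the question.
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)",
    [
        (48, "p1", 0.4, "US", "dev1", "system1"),
        (47, "p2", 0.67, "IT", "dev2", "system2"),
        (46, "p3", 1.2, "GB", "dev3", "system3"),
        (45, "p1", 0.9, "ES", "dev4", "system4"),
        (44, "p1", 0.6, "ES", "dev4", "system1"),
    ],
)

TOP_COUNTRY_SQL = """
WITH by_country AS (
    SELECT product, country, SUM(value) AS revenue
    FROM data
    GROUP BY product, country
),
ranked AS (
    SELECT product, country, revenue,
           SUM(revenue) OVER (PARTITION BY product) AS total_revenue,
           ROW_NUMBER() OVER (
               PARTITION BY product
               ORDER BY revenue DESC, country  -- country breaks ties
           ) AS rn
    FROM by_country
)
SELECT product, ROUND(total_revenue, 2) AS total_revenue, country
FROM ranked
WHERE rn = 1
ORDER BY product
"""

top_countries = conn.execute(TOP_COUNTRY_SQL).fetchall()
for row in top_countries:
    print(row)
```

For p1 this picks ES, because ES contributes 0.9 + 0.6 = 1.5 against 0.4 for US, which matches the reasoning in the question.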
You could do something like this:
CREATE TABLE data
(
id int auto_increment primary key,
product varchar(20),
country varchar(4),
device varchar(20),
system varchar(20),
value decimal(5,2)
);
INSERT INTO data (product, country, device, system, value)
VALUES
('p1', 'US', 'dev1', 'system1', 0.4),
('p2', 'IT', 'dev2', 'system2', 0.67),
('p1', 'IT', 'dev1', 'system2', 0.23);
select 'p' as grouping_type, product, sum(value) as sumval
from data
group by product
union all
select 'c' as grouping_type, country, sum(value) as sumval
from data
group by country
union all
select 'd' as grouping_type, device, sum(value) as sumval
from data
group by device
union all
select 's' as grouping_type, system, sum(value) as sumval
from data
group by system
order by grouping_type, sumval
It's ugly, I wouldn't use it, but it should work.