I have been trying to optimise some SQL queries based on the assumption that joining tables is more efficient than nesting queries. I am joining the same table multiple times to perform a different analysis on the data.
I have 2 tables:
transactions:
id | date_add | merchant_id | transaction_type | amount
1 1488733332 108 add 20.00
2 1488733550 108 remove 5.00
and a calendar table which just lists dates so that I can create empty records where there are no transactions on particular days:
calendar:
id | datefield
1 2017-03-01
2 2017-03-02
3 2017-03-03
4 2017-03-04
I have many thousands of rows in the transactions table, and I'm trying to get an annual summary of total and different types of transactions per month (i.e. 12 rows in total), where
transactions = sum of all "amount"s,
additions = sum of all "amount"s where transaction_type = "add"
redemptions = sum of all "amount"s where transaction_type = "remove"
result:
month | transactions | additions | redemptions
Jan 15 12 3
Feb 20 15 5
...
My initial query looks like this:
SELECT COALESCE(tr.transactions, 0) AS transactions,
COALESCE(ad.additions, 0) AS additions,
COALESCE(re.redemptions, 0) AS redemptions,
calendar.date
FROM (SELECT DATE_FORMAT(datefield, '%b %Y') AS date FROM calendar WHERE datefield LIKE '2017-%' GROUP BY YEAR(datefield), MONTH(datefield)) AS calendar
LEFT JOIN (SELECT COUNT(transaction_type) as transactions, from_unixtime(date_add, '%b %Y') as date_t FROM transactions WHERE merchant_id = 108 GROUP BY from_unixtime(date_add, '%b %Y')) AS tr
ON calendar.date = tr.date_t
LEFT JOIN (SELECT COUNT(transaction_type = 'add') as additions, from_unixtime(date_add, '%b %Y') as date_a FROM transactions WHERE merchant_id = 108 AND transaction_type = 'add' GROUP BY from_unixtime(date_add, '%b %Y')) AS ad
ON calendar.date = ad.date_a
LEFT JOIN (SELECT COUNT(transaction_type = 'remove') as redemptions, from_unixtime(date_add, '%b %Y') as date_r FROM transactions WHERE merchant_id = 108 AND transaction_type = 'remove' GROUP BY from_unixtime(date_add, '%b %Y')) AS re
ON calendar.date = re.date_r
I tried optimising and cleaning it up a little, removing the nested statements and came up with this:
SELECT
DATE_FORMAT(cal.datefield, '%b %d') as date,
IFNULL(count(ct.amount),0) as transactions,
IFNULL(count(a.amount),0) as additions,
IFNULL(count(r.amount),0) as redemptions
FROM calendar as cal
LEFT JOIN transactions as ct ON cal.datefield = date(from_unixtime(ct.date_add)) && ct.merchant_id = 108
LEFT JOIN transactions as r ON r.id = ct.id && r.transaction_type = 'remove'
LEFT JOIN transactions as a ON a.id = ct.id && a.transaction_type = 'add'
WHERE cal.datefield like '2017-%'
GROUP BY month(cal.datefield)
I was surprised to see that the revised statement was about 20x slower than the original with my dataset. Have I missed some sort of logic? Is there a better way to achieve the same result with a more streamlined query, given I am joining the same table multiple times?
EDIT:
So to further explain the results I'm looking for - I'd like a single row for each month of the year (12 rows) each with a column for the total transactions, total additions, and total redemptions in each month.
With the first query I was getting a result in about 0.5 sec, but with the second it took about 9.5 sec.
Looking at your query, you could use a single LEFT JOIN with CASE WHEN:
SELECT COALESCE(t.transactions, 0) AS transactions,
COALESCE(t.additions, 0) AS additions,
COALESCE(t.redemptions, 0) AS redemptions,
calendar.date
FROM (SELECT DATE_FORMAT(datefield, '%b %Y') AS date
FROM calendar
WHERE datefield LIKE '2017-%'
GROUP BY YEAR(datefield), MONTH(datefield)) AS calendar
LEFT JOIN
( select
COUNT(transaction_type) as transactions
, sum( case when transaction_type = 'add' then 1 else 0 end ) as additions
, sum( case when transaction_type = 'remove' then 1 else 0 end ) as redemptions
, from_unixtime(date_add, '%b %Y') as date_t
FROM transactions
WHERE merchant_id = 108
GROUP BY from_unixtime(date_add, '%b %Y') ) t ON calendar.date = t.date_t
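To see the single-pass CASE WHEN idea in isolation, here is a minimal sketch that can be run without a MySQL server. It uses Python's built-in sqlite3 module as a stand-in, so strftime(..., 'unixepoch') replaces from_unixtime(); the third row (merchant 999) is made up for illustration, and SQLite works in UTC whereas MySQL uses the session time zone.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        date_add INTEGER,            -- Unix timestamp, as in the question
        merchant_id INTEGER,
        transaction_type TEXT,
        amount REAL
    );
    INSERT INTO transactions VALUES
        (1, 1488733332, 108, 'add',    20.00),
        (2, 1488733550, 108, 'remove',  5.00),
        (3, 1488733600, 999, 'add',     7.00);  -- hypothetical other merchant
""")

# One scan of transactions; each CASE expression counts one transaction type.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', date_add, 'unixepoch') AS month,
           COUNT(*) AS transactions,
           SUM(CASE WHEN transaction_type = 'add'    THEN 1 ELSE 0 END) AS additions,
           SUM(CASE WHEN transaction_type = 'remove' THEN 1 ELSE 0 END) AS redemptions
    FROM transactions
    WHERE merchant_id = 108
    GROUP BY strftime('%Y-%m', date_add, 'unixepoch')
""").fetchall()

print(monthly)
```

The table is read once instead of once per column, which is the point of replacing the three derived tables with one.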
First I would create a derived table with timestamp ranges for every month from your calendar table. This way a join with the transactions table will be efficient if date_add is indexed.
select month(c.datefield) as month,
unix_timestamp(timestamp(min(c.datefield), '00:00:00')) as ts_from,
unix_timestamp(timestamp(max(c.datefield), '23:59:59')) as ts_to
from calendar c
where c.datefield between '2017-01-01' and '2017-12-31'
group by month(c.datefield)
Join it with the transactions table and use conditional aggregation to get your data:
select c.month,
sum(t.amount) as transactions,
sum(case when t.transaction_type = 'add' then t.amount else 0 end) as additions,
sum(case when t.transaction_type = 'remove' then t.amount else 0 end) as redemptions
from (
select month(c.datefield) as m, date_format(c.datefield, '%b') as `month`,
unix_timestamp(timestamp(min(c.datefield), '00:00:00')) as ts_from,
unix_timestamp(timestamp(max(c.datefield), '23:59:59')) as ts_to
from calendar c
where c.datefield between '2017-01-01' and '2017-12-31'
group by month(c.datefield), date_format(c.datefield, '%b')
) c
left join transactions t on t.date_add between c.ts_from and c.ts_to
and t.merchant_id = 108
group by c.m, c.month
order by c.m
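A small runnable sketch of the same range-join idea, using Python's sqlite3 as a stand-in for MySQL. The calendar rows are a trimmed-down assumption (only the first and last day of two months), the half-open range (>= ts_from, < ts_to) is my variation on BETWEEN, and the merchant filter sits in the ON clause so that months without transactions still appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar (datefield TEXT);
    CREATE TABLE transactions (id INTEGER, date_add INTEGER, merchant_id INTEGER,
                               transaction_type TEXT, amount REAL);
    CREATE INDEX idx_tx_date ON transactions (date_add);  -- lets the range join use an index
    INSERT INTO calendar VALUES
        ('2017-02-01'), ('2017-02-28'), ('2017-03-01'), ('2017-03-31');
    INSERT INTO transactions VALUES
        (1, 1488733332, 108, 'add',    20.0),
        (2, 1488733550, 108, 'remove',  5.0);
""")

rows = conn.execute("""
    SELECT c.month,
           COALESCE(SUM(t.amount), 0) AS transactions,
           COALESCE(SUM(CASE WHEN t.transaction_type = 'add'    THEN t.amount END), 0) AS additions,
           COALESCE(SUM(CASE WHEN t.transaction_type = 'remove' THEN t.amount END), 0) AS redemptions
    FROM (
        -- one row per month with its timestamp range [ts_from, ts_to)
        SELECT strftime('%Y-%m', datefield) AS month,
               CAST(strftime('%s', MIN(datefield)) AS INTEGER) AS ts_from,
               CAST(strftime('%s', MAX(datefield), '+1 day') AS INTEGER) AS ts_to
        FROM calendar
        GROUP BY strftime('%Y-%m', datefield)
    ) AS c
    LEFT JOIN transactions AS t
           ON t.date_add >= c.ts_from AND t.date_add < c.ts_to
          AND t.merchant_id = 108      -- filter in ON keeps empty months
    GROUP BY c.month
    ORDER BY c.month
""").fetchall()

print(rows)
```

Because the comparison is on the raw date_add column rather than on a function of it, an index on date_add remains usable.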
Could you help me calculate the percentage of users who made payments?
I've got two tables:
activity
user_id login_time
201 01.01.2017
202 01.01.2017
255 04.01.2017
255 05.01.2017
256 05.01.2017
260 15.03.2017
payments
user_id payment_date
200 01.01.2017
202 01.01.2017
255 05.01.2017
I tried this query, but it calculates the wrong percentage:
SELECT activity.login_time, (select COUNT(distinct payments.user_id)
from payments where payments.payment_time between '2017-01-01' and
'2017-01-05') / COUNT(distinct activity.user_id) * 100
AS percent
FROM payments INNER JOIN activity ON
activity.user_id = payments.user_id and activity.login_time between
'2017-01-01' and '2017-01-05'
GROUP BY activity.login_time;
I need a result like this:
01.01.2017 100%
02.01.2017 0%
03.01.2017 0%
04.01.2017 0%
05.01.2017 50%
If you want the ratio of users who have made payments to those with activity, just summarize each table individually:
select p.cnt / a.cnt
from (select count(distinct user_id) as cnt from activity a) a cross join
(select count(distinct user_id) as cnt from payments) p;
EDIT:
You need a table with all dates in the range. That is the biggest problem.
Then I would recommend:
SELECT d.dte,
( ( SELECT COUNT(DISTINCT p.user_id)
FROM payments p
WHERE p.payment_date >= d.dte and p.payment_date < d.dte + INTERVAL 1 DAY
) /
NULLIF( (SELECT COUNT(DISTINCT a.user_id)
FROM activity a
WHERE a.login_time >= d.dte and a.login_time < d.dte + INTERVAL 1 DAY
), 0
) ) as ratio
FROM (SELECT date('2017-01-01') dte UNION ALL
SELECT date('2017-01-02') dte UNION ALL
SELECT date('2017-01-03') dte UNION ALL
SELECT date('2017-01-04') dte UNION ALL
SELECT date('2017-01-05') dte
) d;
Notes:
This returns NULL on days where there is no activity. That makes more sense to me than 0.
This uses logic on the dates that works for both dates and date/time values.
The logic for dates can make use of an index, which can be important for this type of query.
I don't recommend using LEFT JOINs. That will multiply the data which can make the query expensive.
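Here is a runnable sketch of the per-day correlated-subquery approach with the question's sample data, using Python's sqlite3 as a stand-in for MySQL. Two things are my own assumptions: the dates are stored as ISO strings (YYYY-MM-DD), and the list of days comes from a recursive CTE instead of the UNION ALL chain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (user_id INTEGER, login_time TEXT);
    CREATE TABLE payments (user_id INTEGER, payment_date TEXT);
    INSERT INTO activity VALUES
        (201, '2017-01-01'), (202, '2017-01-01'), (255, '2017-01-04'),
        (255, '2017-01-05'), (256, '2017-01-05'), (260, '2017-03-15');
    INSERT INTO payments VALUES
        (200, '2017-01-01'), (202, '2017-01-01'), (255, '2017-01-05');
""")

percent_by_day = conn.execute("""
    WITH RECURSIVE days(dte) AS (      -- every day in the range, one row each
        SELECT '2017-01-01'
        UNION ALL
        SELECT date(dte, '+1 day') FROM days WHERE dte < '2017-01-05'
    )
    SELECT d.dte,
           100.0 * (SELECT COUNT(DISTINCT p.user_id)
                    FROM payments p
                    WHERE p.payment_date >= d.dte
                      AND p.payment_date < date(d.dte, '+1 day'))
                 / NULLIF((SELECT COUNT(DISTINCT a.user_id)
                           FROM activity a
                           WHERE a.login_time >= d.dte
                             AND a.login_time < date(d.dte, '+1 day')), 0) AS percent
    FROM days d
    ORDER BY d.dte
""").fetchall()

for day, percent in percent_by_day:
    print(day, percent)   # percent is None (NULL) on days with no activity
```

Wrapping the expression in COALESCE(..., 0) would give 0 instead of NULL on the empty days, if that is the output you prefer.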
First you need a table with all days in the range. Since the range is small you can build an ad hoc derived table using UNION ALL. Then left join the payments and activities. Group by the day and calculate the percentage using the count()s.
SELECT x.day,
concat(CASE count(DISTINCT a.user_id)
WHEN 0 THEN
0
ELSE
count(DISTINCT p.user_id)
/
count(DISTINCT a.user_id)
END
*
100,
'%')
FROM (SELECT cast('2017-01-01' AS date) day
UNION ALL
SELECT cast('2017-01-02' AS date) day
UNION ALL
SELECT cast('2017-01-03' AS date) day
UNION ALL
SELECT cast('2017-01-04' AS date) day
UNION ALL
SELECT cast('2017-01-05' AS date) day) x
LEFT JOIN payments p
ON p.payment_date = x.day
LEFT JOIN activity a
ON a.login_time = x.day
GROUP BY x.day;
I am trying to find the number of sellers that made a sale last month but didn't make a sale this month.
I have a query that works, but I don't think it's efficient and I haven't figured out how to do this for all months.
SELECT count(distinct user_id) as users
FROM transactions
WHERE MONTH(date) = 12
AND YEAR(date) = 2015
AND transactions.status = 'COMPLETED'
AND transactions.amount > 0
AND transactions.user_id NOT IN
(
SELECT distinct user_id
FROM transactions
WHERE MONTH(date) = 1
AND YEAR(date) = 2016
AND transactions.status = 'COMPLETED'
AND transactions.amount > 0
)
The structure of the table is:
+---------+------------+-------------+--------+
| user_id | date | status | amount |
+---------+------------+-------------+--------+
| 1 | 2016-01-01 | 'COMPLETED' | 1.00 |
| 2 | 2015-12-01 | 'COMPLETED' | 1.00 |
| 3 | 2015-12-01 | 'COMPLETED' | 2.00 |
| 1 | 2015-12-01 | 'COMPLETED' | 3.00 |
+---------+------------+-------------+--------+
So in this case, the users with IDs 2 and 3 didn't make a sale this month.
Use conditional aggregation:
SELECT count(*) as users
FROM
(
SELECT user_id
FROM transactions
-- 1st of previous month
WHERE date BETWEEN SUBDATE(SUBDATE(CURRENT_DATE, DAYOFMONTH(CURRENT_DATE)-1), interval 1 month)
-- end of current month
AND LAST_DAY(CURRENT_DATE)
AND transactions.status = 'COMPLETED'
AND transactions.amount > 0
GROUP BY user_id
-- any row from previous month
HAVING MAX(CASE WHEN date < SUBDATE(CURRENT_DATE, DAYOFMONTH(CURRENT_DATE)-1)
THEN date
END) IS NOT NULL
-- no row in current month
AND MAX(CASE WHEN date >= SUBDATE(CURRENT_DATE, DAYOFMONTH(CURRENT_DATE)-1)
THEN date
END) IS NULL
) AS dt
SUBDATE(CURRENT_DATE, DAYOFMONTH(CURRENT_DATE)-1) = first day of current month
SUBDATE(first day of current month, interval 1 month) = first day of previous month
LAST_DAY(CURRENT_DATE) = end of current month
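To check the HAVING logic against the question's sample rows, here is a minimal sketch using Python's sqlite3 in place of MySQL. The month boundaries are passed in as fixed parameters (my assumption, instead of deriving them from CURRENT_DATE) so that the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (user_id INTEGER, date TEXT, status TEXT, amount REAL);
    INSERT INTO transactions VALUES
        (1, '2016-01-01', 'COMPLETED', 1.00),
        (2, '2015-12-01', 'COMPLETED', 1.00),
        (3, '2015-12-01', 'COMPLETED', 2.00),
        (1, '2015-12-01', 'COMPLETED', 3.00);
""")

# Fixed boundaries: previous month, current month, and the month after.
prev_start, cur_start, next_start = '2015-12-01', '2016-01-01', '2016-02-01'

lapsed = conn.execute("""
    SELECT user_id
    FROM transactions
    WHERE date >= ? AND date < ?
      AND status = 'COMPLETED'
      AND amount > 0
    GROUP BY user_id
    HAVING MAX(CASE WHEN date <  ? THEN 1 END) IS NOT NULL   -- sold last month
       AND MAX(CASE WHEN date >= ? THEN 1 END) IS NULL       -- nothing this month
    ORDER BY user_id
""", (prev_start, next_start, cur_start, cur_start)).fetchall()

print(lapsed)
```

Both months are read in one pass over the table, and each user is classified by the two CASE expressions.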
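If you want to generalise it, you can use curdate() to get the current month and DATE_SUB(curdate(), INTERVAL 1 MONTH) to get last month. Note that the query below compares only the month number, so you will need to add a YEAR(date) condition as well, which matters in particular around January/December: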
SELECT count(distinct user_id) as users
FROM transactions
WHERE MONTH(date) = MONTH(DATE_SUB(curdate(), INTERVAL 1 MONTH))
AND transactions.status = 'COMPLETED'
AND transactions.amount > 0
AND transactions.user_id NOT IN
(
SELECT distinct user_id
FROM transactions
WHERE MONTH(date) = MONTH(curdate())
AND transactions.status = 'COMPLETED'
AND transactions.amount > 0
)
As far as efficiency goes, I don't see a problem with this one.
The following should be pretty efficient. In order to make it even more so, you would need to provide the table definition and the EXPLAIN output.
SELECT COUNT(DISTINCT user_id) users
FROM transactions t
LEFT
JOIN transactions x
ON x.user_id = t.user_id
AND x.date BETWEEN '2016-01-01' AND '2016-01-31'
AND x.status = 'COMPLETED'
AND x.amount > 0
WHERE t.date BETWEEN '2015-12-01' AND '2015-12-31'
AND t.status = 'COMPLETED'
AND t.amount > 0
AND x.user_id IS NULL;
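The LEFT JOIN ... IS NULL pattern (an anti-join) can be tried out with the question's sample rows. This is only a sketch using Python's sqlite3 instead of MySQL, and I have written the date ranges as half-open intervals rather than BETWEEN, which is my own variation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (user_id INTEGER, date TEXT, status TEXT, amount REAL);
    INSERT INTO transactions VALUES
        (1, '2016-01-01', 'COMPLETED', 1.00),
        (2, '2015-12-01', 'COMPLETED', 1.00),
        (3, '2015-12-01', 'COMPLETED', 2.00),
        (1, '2015-12-01', 'COMPLETED', 3.00);
""")

# t = December sales; x = a matching January sale by the same user, if any.
# Rows where x is missing (x.user_id IS NULL) are sellers who lapsed.
(lapsed_count,) = conn.execute("""
    SELECT COUNT(DISTINCT t.user_id)
    FROM transactions t
    LEFT JOIN transactions x
           ON x.user_id = t.user_id
          AND x.date >= '2016-01-01' AND x.date < '2016-02-01'
          AND x.status = 'COMPLETED'
          AND x.amount > 0
    WHERE t.date >= '2015-12-01' AND t.date < '2016-01-01'
      AND t.status = 'COMPLETED'
      AND t.amount > 0
      AND x.user_id IS NULL
""").fetchone()

print(lapsed_count)
```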
Just some input for thought:
You could create aggregated lists of user-IDs per month, representing all the unique buyers in that month. In your application, you would then simply have to subtract the two months in question in order to get all user-IDs that have only made a sale in one of the two months.
See below for query and post-processing examples.
In order to make your query efficient, I would recommend at least a 2-column index for table transactions on [status, amount]. However, in order to prevent the query from having to look up data in the actual table, you could even create a 4-column index [status, amount, date, user_id], which should further improve the performance of your query.
Postgres (v9.0+, tested)
SELECT (DATE_PART('year', t.date) || '-' || DATE_PART('month', t.date)) AS d,
STRING_AGG( DISTINCT t.user_id::TEXT, ',' ) AS buyers
FROM transactions t
WHERE t.status = 'COMPLETED'
AND t.amount > 0
GROUP BY DATE_PART('year', t.date),
DATE_PART('month', t.date)
ORDER BY DATE_PART('year', t.date),
DATE_PART('month', t.date)
;
MySQL (not tested)
SELECT CONCAT(YEAR(t.date), '-', MONTH(t.date)) AS d,
GROUP_CONCAT( DISTINCT t.user_id ) AS buyers
FROM transactions t
WHERE t.status = 'COMPLETED'
AND t.amount > 0
GROUP BY YEAR(t.date), MONTH(t.date)
ORDER BY YEAR(t.date), MONTH(t.date)
;
Ruby (example for post-processing)
db_result = ActiveRecord::Base.connection_pool.with_connection { |con| con.execute( db_query ) }
unique_buyers = db_result.map{|e|[e['d'],e['buyers'].split(',')]}.to_h
buyers_dec15_but_not_jan16 = unique_buyers['2015-12'] - unique_buyers['2016-1']
buyers_nov15_but_not_dec15 = (unique_buyers['2015-11'] || []) - (unique_buyers['2015-12'] || [])
...(and so on)...
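For comparison, the same "one aggregated list per month, then subtract" idea as a self-contained Python sketch. It uses sqlite3 with the sample rows from the question; the zero-padded month keys such as '2016-01' are my own choice and differ from the '2016-1' style keys in the Ruby example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (user_id INTEGER, date TEXT, status TEXT, amount REAL);
    INSERT INTO transactions VALUES
        (1, '2016-01-01', 'COMPLETED', 1.00),
        (2, '2015-12-01', 'COMPLETED', 1.00),
        (3, '2015-12-01', 'COMPLETED', 2.00),
        (1, '2015-12-01', 'COMPLETED', 3.00);
""")

rows = conn.execute("""
    SELECT strftime('%Y-%m', date) AS d,
           GROUP_CONCAT(DISTINCT user_id) AS buyers
    FROM transactions
    WHERE status = 'COMPLETED' AND amount > 0
    GROUP BY strftime('%Y-%m', date)
""").fetchall()

# Month -> set of user-ID strings, e.g. {'2015-12': {'1', '2', '3'}, ...}
unique_buyers = {month: set(ids.split(',')) for month, ids in rows}

# Set difference; .get(..., set()) covers months that have no rows at all.
buyers_dec15_but_not_jan16 = (unique_buyers.get('2015-12', set())
                              - unique_buyers.get('2016-01', set()))
print(sorted(buyers_dec15_but_not_jan16))
```

One query then serves every pair of months, at the cost of shipping the ID lists to the application.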
I have a simple table with transactions in it and I want to get, for each month, how many consumers made transactions that total more than 0 and whose first transaction was not in that month. By first transaction I mean the first time the customer ever bought anything.
The result I am trying to get is in the following form:
+------+-------+----------------------------------+
| Year | Month | NumOfCustomersWithPositiveTotals |
+------+-------+----------------------------------+
| 2014 | 1     | 22                               |
+------+-------+----------------------------------+
| 2014 | 2     | 10                               |
+------+-------+----------------------------------+
I've got an SQL fiddle where I find the same thing, but for consumers that have their first transactions within that month. Practically, the query I am looking for is the same, but for the rest of the consumers.
This is the fiddle: http://sqlfiddle.com/#!9/31538/24
I think this is what you want:
SELECT count(consumerId) as NumOfCust, mm, yy FROM
(
SELECT consumerId, month(date) as mm, year(date) as yy, sum(amount) as total, mdate FROM beta.transaction as t
LEFT JOIN (
SELECT min(month(date)) as mdate, consumerId as con FROM beta.transaction
GROUP BY consumerId
) as MinDate ON con = t.consumerId
GROUP BY month(date), consumerId
HAVING mdate < mm AND total > 0
) as res
GROUP BY res.mm;
I will try to explain it from the inside out.
Let's see what the JOIN table named minDate has:
SELECT min(month(date)) as mdate, consumerId as con FROM beta.transaction
GROUP BY consumerId
-- Here we find the first date of transaction per consumerId
Next
SELECT consumerId, month(date) as mm, year(date) as yy, sum(amount) as total, mdate FROM beta.transaction as t
LEFT JOIN (
SELECT min(month(date)) as mdate, consumerId as con FROM beta.transaction
GROUP BY consumerId
) as MinDate ON con = t.consumerId
GROUP BY month(date), consumerId
HAVING mdate < mm AND total > 0
-- Here we find the total amount per consumerId per month and keep only the consumers whose first transaction (aka mdate) is earlier than the current month AND whose total is greater than 0
Finally, with the outer SELECT we count the above results grouped by month.
I hope it's what you wanted.
The query below returns the correct results. The answer was based on Akis's answer by adding the consumerId in the group by clause and adding years in the checks as well as months.
select
tyyyy, tmm, count(tcon) as oldkund_real
from
(
select
t.yyyy as tyyyy, t.mm as tmm, t.consumerId as tcon, sum(t.amount) as total, fp.yyyy as fpyyyy, fp.mm as fpmm, fp.consumerId as fpcon
from
(
select
year(date) as yyyy, month(date) as mm, consumerId, amount
from
transaction
) as t
left join
(
select
year(min(date)) as yyyy, month(min(date)) as mm, consumerId
from
transaction
group by
consumerid
) as fp
on
t.consumerId = fp.consumerId
group by
t.yyyy, t.mm, t.consumerId
having
STR_TO_DATE(CONCAT('01,', tmm, ',', tyyyy),'%d,%m,%Y') > STR_TO_DATE(CONCAT('01,', fpmm, ',', fpyyyy),'%d,%m,%Y') and total > 0
) as res
group by
tyyyy, tmm
order by
tyyyy, tmm;
Thank you very much Akis!
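A compact, runnable sketch of the year-aware version, using Python's sqlite3 rather than MySQL. The table name tx and its six rows are made up for illustration, and comparing 'YYYY-MM' strings is my shortcut for the STR_TO_DATE comparison above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tx (consumerId INTEGER, date TEXT, amount REAL);  -- hypothetical data
    INSERT INTO tx VALUES
        (1, '2014-01-10', 10), (1, '2014-02-05',  5),
        (2, '2014-02-07',  8), (2, '2014-03-01', -3),
        (3, '2014-01-20',  4), (3, '2014-03-15',  6);
""")

returning = conn.execute("""
    SELECT ym, COUNT(*) AS num_customers
    FROM (
        SELECT strftime('%Y-%m', t.date) AS ym, t.consumerId
        FROM tx t
        JOIN (SELECT consumerId, strftime('%Y-%m', MIN(date)) AS first_ym
              FROM tx
              GROUP BY consumerId) fp
          ON fp.consumerId = t.consumerId
        WHERE strftime('%Y-%m', t.date) > fp.first_ym   -- skip the first-purchase month
        GROUP BY strftime('%Y-%m', t.date), t.consumerId
        HAVING SUM(t.amount) > 0                        -- positive monthly total only
    ) AS res
    GROUP BY ym
    ORDER BY ym
""").fetchall()

print(returning)
```

In this data, consumer 2 is left out of March because the monthly total is negative, and nobody is counted in January because it is everyone's first month.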
I have a table emails
id date sent_to
1 2013-01-01 345
2 2013-01-05 990
3 2013-02-05 1000
table2 is responses
email_id email response
1 xyz#email.com xxxx
1 xyzw#email.com yyyy
.
.
.
I want a result with the following format:
Month total_number_of_subscribers_sent total_responded
2013-01 1335 2
.
.
this is my query:
SELECT
DATE_FORMAT(e.date, '%Y-%m') AS `Month`,
count(*) AS total_responded,
SUM(e.sent_to) AS total_sent
FROM
responses r
LEFT JOIN emails e ON e.id = r.email_id
WHERE
e.date > '2012-12-31' AND e.date < '2013-10-01'
GROUP BY
DATE_FORMAT(e.date, '%Y %m')
It works fine for total_responded, but total_sent runs into the millions, obviously because the joined result repeats each email's sent_to value once per response.
So basically, can I do a SUM and a COUNT in the same query on separate tables?
If you want to count duplicates in each table, then the query is a little complicated.
You need to aggregate the sends and responses separately, before joining them together. The join is on the date, which necessarily comes from the "sent" information:
select e.`Month`, coalesce(total_sent, 0) as total_sent, coalesce(total_emails, 0) as total_emails,
coalesce(total_responses, 0) as total_responses,
coalesce(total_email_responses, 0) as total_email_responses
from (select DATE_FORMAT(e.date, '%Y-%m') as `Month`,
count(*) as total_sent, count(distinct email) as total_emails
from emails e
where e.date > '2012-12-31' AND e.date < '2013-10-01'
group by DATE_FORMAT(e.date, '%Y-%m')
) e left outer join
(select DATE_FORMAT(e.date, '%Y-%m') as `Month`,
count(*) as total_responses, count(distinct r.email) as total_email_responses
from emails e join
responses r
on e.email = r.email
where e.date > '2012-12-31' AND e.date < '2013-10-01'
group by DATE_FORMAT(e.date, '%Y-%m')
) r
on e.`Month` = r.`Month`;
The apparent fact that your responses have no link to the "sent" information -- not even the date -- suggests a real problem with your operations and data.
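For what it's worth, here is a minimal sketch of "aggregate each side first, then join" written against the sample tables as they appear in the question. It assumes that responses.email_id references emails.id (which the sample data suggests but nobody has confirmed) and uses Python's sqlite3 as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (id INTEGER PRIMARY KEY, date TEXT, sent_to INTEGER);
    CREATE TABLE responses (email_id INTEGER, email TEXT, response TEXT);
    INSERT INTO emails VALUES
        (1, '2013-01-01', 345), (2, '2013-01-05', 990), (3, '2013-02-05', 1000);
    INSERT INTO responses VALUES
        (1, 'xyz#email.com', 'xxxx'), (1, 'xyzw#email.com', 'yyyy');
""")

summary = conn.execute("""
    SELECT s.month, s.total_sent, COALESCE(r.total_responded, 0) AS total_responded
    FROM (
        -- sends summed on their own, so no row is repeated by the join
        SELECT strftime('%Y-%m', date) AS month, SUM(sent_to) AS total_sent
        FROM emails
        GROUP BY strftime('%Y-%m', date)
    ) AS s
    LEFT JOIN (
        -- responses counted per month of the email they answer
        SELECT strftime('%Y-%m', e.date) AS month, COUNT(*) AS total_responded
        FROM responses r
        JOIN emails e ON e.id = r.email_id
        GROUP BY strftime('%Y-%m', e.date)
    ) AS r ON r.month = s.month
    ORDER BY s.month
""").fetchall()

print(summary)
```

Each derived table has at most one row per month by the time they are joined, so SUM(sent_to) is no longer multiplied by the number of responses.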
I have this quite long query that should give me some information about shipments, and it works, but it performs terribly. It takes about 4500 ms to load.
SELECT
DATE(paid_at) AS day,
COUNT(*) as order_count,
(
SELECT COUNT(*) FROM line_items
WHERE order_id IN (SELECT id from orders WHERE DATE(paid_at) = day)
) as product_count,
(
SELECT COUNT(*) FROM orders
WHERE shipping_method = 'colissimo'
AND DATE(paid_at) = day
AND state IN ('paid','shipped','completed')
) as orders_co,
(
SELECT COUNT(*) FROM orders
WHERE shipping_method = 'colissimo'
AND DATE(paid_at) = day
AND state IN ('paid','shipped','completed')
AND paid_amount < 70
) as co_less_70,
(
SELECT COUNT(*) FROM orders
WHERE shipping_method = 'colissimo'
AND DATE(paid_at) = day
AND state IN ('paid','shipped','completed')
AND paid_amount >= 70
) as co_plus_70,
(
SELECT COUNT(*) FROM orders
WHERE shipping_method = 'mondial_relais'
AND DATE(paid_at) = day
AND state IN ('paid','shipped','completed')
) as orders_mr,
(
SELECT COUNT(*) FROM orders
WHERE shipping_method = 'mondial_relais'
AND DATE(paid_at) = day
AND state IN ('paid','shipped','completed')
AND paid_amount < 70
) as mr_less_70,
(
SELECT COUNT(*) FROM orders
WHERE shipping_method = 'mondial_relais'
AND DATE(paid_at) = day
AND state IN ('paid','shipped','completed')
AND paid_amount >= 70
) as mr_plus_70
FROM orders
WHERE MONTH(paid_at) = 11
AND YEAR(paid_at) = 2011
AND state IN ('paid','shipped','completed')
GROUP BY day;
Any idea what I could be doing wrong or what I could be doing better? I have other queries of similar length that don't take as much time to load as this. I thought this would be faster than for example having an individual query for each day (in my programming instead of the SQL query).
It is because you are using sub-queries where you don't need them.
As a general rule, where you have a sub-query within a main SELECT clause, that sub-query will query the tables within it once for each row in the main SELECT clause - so if you have 7 subqueries and are selecting a date range of 30 days, you will effectively be running 210 separate subqueries (plus your main query).
(Some query optimisers can resolve sub-queries into the main query under some circumstances, but as a general rule you can't rely on this.)
In this case, you don't need any of the orders sub-queries, because all the orders data you require is included in the main query - so you can rewrite this as:
SELECT
DATE(paid_at) AS day,
COUNT(*) as order_count,
(
SELECT COUNT(*) FROM line_items
WHERE order_id IN (SELECT id from orders WHERE DATE(paid_at) = day)
) as product_count,
sum(case when shipping_method = 'colissimo' then 1 else 0 end) as orders_co,
sum(case when shipping_method = 'colissimo' AND
paid_amount < 70 then 1 else 0 end) as co_less_70,
sum(case when shipping_method = 'colissimo' AND
paid_amount >= 70 then 1 else 0 end) as co_plus_70,
sum(case when shipping_method = 'mondial_relais' then 1 else 0 end) as orders_mr,
sum(case when shipping_method = 'mondial_relais' AND
paid_amount < 70 then 1 else 0 end) as mr_less_70,
sum(case when shipping_method = 'mondial_relais' AND
paid_amount >= 70 then 1 else 0 end) as mr_plus_70
FROM orders
WHERE MONTH(paid_at) = 11
AND YEAR(paid_at) = 2011
AND state IN ('paid','shipped','completed')
GROUP BY day;
The problem with your query is that it scans the same table over and over. All of the scans (SELECTs, in your case) of the orders table can be replaced by multiple SUM + CASE or COUNT + CASE expressions, as in "SQL query with count and case statement".
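A small runnable sketch of the SUM + CASE rewrite, using Python's sqlite3 instead of MySQL. The five orders are invented for illustration, only some of the columns are shown, and the range condition on paid_at is my substitution for the MONTH()/YEAR() filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, paid_at TEXT, state TEXT,
                         shipping_method TEXT, paid_amount REAL);
    INSERT INTO orders VALUES            -- hypothetical sample rows
        (1, '2011-11-01 10:00:00', 'paid',      'colissimo',      50),
        (2, '2011-11-01 12:00:00', 'shipped',   'colissimo',      80),
        (3, '2011-11-01 13:00:00', 'completed', 'mondial_relais', 30),
        (4, '2011-11-02 09:00:00', 'paid',      'mondial_relais', 90),
        (5, '2011-11-02 09:30:00', 'cancelled', 'colissimo',      20);
""")

# One pass over orders; every former sub-query becomes a SUM(CASE ...) column.
daily = conn.execute("""
    SELECT date(paid_at) AS day,
           COUNT(*) AS order_count,
           SUM(CASE WHEN shipping_method = 'colissimo' THEN 1 ELSE 0 END) AS orders_co,
           SUM(CASE WHEN shipping_method = 'colissimo'
                     AND paid_amount < 70 THEN 1 ELSE 0 END) AS co_less_70,
           SUM(CASE WHEN shipping_method = 'mondial_relais' THEN 1 ELSE 0 END) AS orders_mr
    FROM orders
    WHERE paid_at >= '2011-11-01' AND paid_at < '2011-12-01'
      AND state IN ('paid', 'shipped', 'completed')
    GROUP BY date(paid_at)
    ORDER BY day
""").fetchall()

for row in daily:
    print(row)
```

The cancelled order is filtered out once by the WHERE clause, so the state condition does not have to be repeated in every CASE.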