SQL aggregation select using SUM and COUNT on different tables - mysql

I have a table emails
id date sent_to
1 2013-01-01 345
2 2013-01-05 990
3 2013-02-05 1000
table2 is responses
email_id email response
1 xyz#email.com xxxx
1 xyzw#email.com yyyy
.
.
.
I want a result with the following format:
Month total_number_of_subscribers_sent total_responded
2013-01 1335 2
.
.
this is my query:
SELECT
DATE_FORMAT(e.date, '%Y-%m')AS `Month`,
count(*) AS total_responded,
SUM(e.sent_to) AS total_sent
FROM
responses r
LEFT JOIN emails e ON e.id = r.email_id
WHERE
e.date > '2012-12-31' AND e.date < '2013-10-01'
GROUP BY
DATE_FORMAT(e.date, '%Y %m')
It works for total_responded, but total_sent runs into the millions, because the join repeats each email's sent_to value once per matching response.
So can I do a SUM and a COUNT in the same query on separate tables?

If you want to count duplicates in each table, then the query is a little complicated.
You need to aggregate the sends and responses separately, before joining them together. The join is on the date, which necessarily comes from the "sent" information:
select e.`Month`, coalesce(total_sent, 0) as total_sent, coalesce(total_emails, 0) as total_emails,
coalesce(total_responses, 0) as total_responses,
coalesce(total_email_responses, 0) as total_email_responses
from (select DATE_FORMAT(e.date, '%Y-%m') as `Month`,
count(*) as total_sent, count(distinct email) as total_emails
from emails e
where e.date > '2012-12-31' AND e.date < '2013-10-01'
group by DATE_FORMAT(e.date, '%Y-%m')
) e left outer join
(select DATE_FORMAT(e.date, '%Y-%m') as `Month`,
count(*) as total_responses, count(distinct r.email) as total_email_responses
from emails e join
responses r
on e.email = r.email
where e.date > '2012-12-31' AND e.date < '2013-10-01'
group by DATE_FORMAT(e.date, '%Y-%m')
) r
on e.`Month` = r.`Month`;
The apparent fact that your responses have no link to the "sent" information -- not even the date -- suggests a real problem with your operations and data.
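The "aggregate first, then join" idea can also be sketched against the schema shown in the question, where responses.email_id points at emails.id. This is a minimal illustration, not the answerer's query: it uses Python's sqlite3 module as a stand-in for MySQL, so strftime('%Y-%m', ...) takes the place of DATE_FORMAT(..., '%Y-%m'), and the table contents are just the sample rows from the question.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables in the question (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emails (id INTEGER PRIMARY KEY, date TEXT, sent_to INTEGER);
CREATE TABLE responses (email_id INTEGER, email TEXT, response TEXT);
INSERT INTO emails VALUES
  (1, '2013-01-01', 345), (2, '2013-01-05', 990), (3, '2013-02-05', 1000);
INSERT INTO responses VALUES
  (1, 'xyz#email.com', 'xxxx'), (1, 'xyzw#email.com', 'yyyy');
""")

# Collapse responses to one row per email first, so the join cannot
# repeat an email's sent_to value once per response.
query = """
SELECT strftime('%Y-%m', e.date)     AS month,
       SUM(e.sent_to)                AS total_sent,
       COALESCE(SUM(r.responded), 0) AS total_responded
FROM emails e
LEFT JOIN (SELECT email_id, COUNT(*) AS responded
           FROM responses
           GROUP BY email_id) r ON r.email_id = e.id
WHERE e.date > '2012-12-31' AND e.date < '2013-10-01'
GROUP BY strftime('%Y-%m', e.date)
ORDER BY month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the derived table has at most one row per email, each sent_to value is summed exactly once, and months without responses still appear with a count of 0.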

Related

How to set default value from mysql join interval yearmonth

I have a problem with my query. I have two tables and want to join them so that the results are based on the primary key of the first table, but one row from the first table is missing.
This is my fiddle.
As you can see, "xx3" is missing from month 1.
I have tried switching between left and right joins, but the results are still the same.
I use coalesce(sum(b.sd_qty),0) as total so that an item with no quantity gets 0 as its default.
You should cross join the table to the distinct dates also:
SELECT a.item_code,
COALESCE(SUM(b.sd_qty), 0) total,
DATE_FORMAT(d.sd_date, '%m-%Y') month_year
FROM item a
CROSS JOIN (
SELECT DISTINCT sd_date
FROM sales_details
WHERE sd_date >= '2020-04-01' - INTERVAL 3 MONTH AND sd_date < '2020-05-01'
) d
LEFT JOIN sales_details b
ON a.item_code = b.item_code AND b.sd_date = d.sd_date
GROUP BY month_year, a.item_code
ORDER BY month_year, a.item_code;
Or, for MySQL 8.0+, use a recursive CTE that returns the starting date of every month you want results for, and cross join it to the table:
WITH RECURSIVE dates AS (
SELECT '2020-04-01' - INTERVAL 3 MONTH AS sd_date
UNION ALL
SELECT sd_date + INTERVAL 1 MONTH
FROM dates
WHERE sd_date + INTERVAL 1 MONTH < '2020-05-01'
)
SELECT a.item_code,
COALESCE(SUM(b.sd_qty), 0) total,
DATE_FORMAT(d.sd_date, '%m-%Y') month_year
FROM item a CROSS JOIN dates d
LEFT JOIN sales_details b
ON a.item_code = b.item_code AND DATE_FORMAT(b.sd_date, '%m-%Y') = DATE_FORMAT(d.sd_date, '%m-%Y')
GROUP BY month_year, a.item_code
ORDER BY month_year, a.item_code;
See the demo.
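To see why the cross join restores the missing item, here is a small runnable sketch. It assumes SQLite (through Python's sqlite3) instead of MySQL, so date(x, '+1 month') and strftime replace INTERVAL and DATE_FORMAT, and the item and sales rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_code TEXT PRIMARY KEY);
CREATE TABLE sales_details (item_code TEXT, sd_date TEXT, sd_qty INTEGER);
-- Hypothetical sample data: xx3 has no sales at all.
INSERT INTO item VALUES ('xx1'), ('xx2'), ('xx3');
INSERT INTO sales_details VALUES
  ('xx1', '2020-01-10', 5), ('xx1', '2020-01-20', 3), ('xx2', '2020-02-05', 7);
""")

query = """
WITH RECURSIVE months(month_start) AS (
    SELECT '2020-01-01'
    UNION ALL
    SELECT date(month_start, '+1 month') FROM months
    WHERE month_start < '2020-04-01'
)
SELECT a.item_code,
       strftime('%m-%Y', d.month_start) AS month_year,
       COALESCE(SUM(b.sd_qty), 0)       AS total
FROM item a
CROSS JOIN months d              -- every item paired with every month
LEFT JOIN sales_details b
       ON b.item_code = a.item_code
      AND b.sd_date >= d.month_start
      AND b.sd_date <  date(d.month_start, '+1 month')
GROUP BY d.month_start, a.item_code
ORDER BY d.month_start, a.item_code
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The cross join produces one row per item and month before any sales are attached, so an item with no sales in a month still yields a row with a total of 0.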

Identifying Duplicate Transactions in SQL

Recently, because of an issue, multiple duplicate transactions were inserted into the database at different times. I need to find those duplicate transactions and remove them.
I tried grouping the members and transactions
select count(*),
member_id,
TRUNC(created, 'DDD')
from TXN
where created > TO_DATE('06/01/2019 00:00:00', 'MM/DD/YYYY HH24:MI:SS')
group by member_id,
TRUNC(created, 'DDD')
having count(*) > 2;
I need all the transactions that were created within 10 minutes of each other for the same member.
Examples:
MEMBER_ID ROW_ID ORG DEST Created
1-FRGD 1-FGTH YFG DFG 10-01-2019 00:00:00:00
1-FRGD 1-TYHG THU SEF 10-01-2019 00:00:09:12
1-FGHR 1-FTGH TGH DRF 10-01-2019 00:01:03:25
In this example, I need the first two transactions as output, because they are no more than 10 minutes apart and have the same member number.
You may want a self join:
select a.Member_Id as Member_Id,
a.Row_Id as Row_Id,
a.Org as Org,
a.Dest as Dest ,
a.Created as Created,
b.Row_Id as Duplicate_Row_Id,
b.Org as Duplicate_Org,
b.Dest as Duplicate_Dest,
b.Created as Duplicate_Created
from TXN a inner join
TXN b on a.Member_Id = b.Member_Id and
a.Created < b.Created and
TIMESTAMPDIFF(SECOND, a.Created, b.Created) <= 600
order by a.Member_Id
For each record in TXN, this returns its duplicate(s).
If you want to delete these transactions:
delete tnext
from txn tnext join
txn t
on tnext.member_id = t.member_id and
tnext.created > t.created and
tnext.created < t.created + interval 10 minute
where t.created > '2019-06-01';
Be sure you back up the table and test the logic using select before running this on your actual table.
If you simply want to select transactions without the duplicates, I would recommend not exists:
select t.*
from txn t
where not exists (select 1
from txn tprev
where tprev.member_id = t.member_id and
tprev.created < t.created and
tprev.created > t.created - interval 10 minute
) and
t.created >= '2019-06-01';
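The self-join approach can be tried on the three sample rows from the question. This is only a sketch under assumptions: it runs on SQLite via Python's sqlite3, so the time difference is computed with strftime('%s', ...) rather than TIMESTAMPDIFF, and the sample timestamps are read as 00:00:00, 00:09:12 and 01:03:25 on 2019-10-01.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (member_id TEXT, row_id TEXT, org TEXT, dest TEXT, created TEXT);
-- Sample rows from the question; the exact times are an interpretation.
INSERT INTO txn VALUES
  ('1-FRGD', '1-FGTH', 'YFG', 'DFG', '2019-10-01 00:00:00'),
  ('1-FRGD', '1-TYHG', 'THU', 'SEF', '2019-10-01 00:09:12'),
  ('1-FGHR', '1-FTGH', 'TGH', 'DRF', '2019-10-01 01:03:25');
""")

# Pair each transaction with any later one from the same member
# that was created at most 600 seconds (10 minutes) afterwards.
query = """
SELECT a.member_id, a.row_id, b.row_id AS duplicate_row_id
FROM txn a
JOIN txn b
  ON  a.member_id = b.member_id
  AND a.created < b.created
  AND CAST(strftime('%s', b.created) AS INTEGER)
    - CAST(strftime('%s', a.created) AS INTEGER) <= 600
ORDER BY a.member_id, a.created
"""
pairs = conn.execute(query).fetchall()
for pair in pairs:
    print(pair)
```

Only the two 1-FRGD rows pair up; the third row belongs to a different member and falls outside the 10-minute window.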

How to calculate percent?

Could you help me to calculate percent of users, which made payments?
I've got two tables:
activity
user_id login_time
201 01.01.2017
202 01.01.2017
255 04.01.2017
255 05.01.2017
256 05.01.2017
260 15.03.2017
payments
user_id payment_date
200 01.01.2017
202 01.01.2017
255 05.01.2017
I try to use this query, but it calculates wrong percent:
SELECT activity.login_time, (select COUNT(distinct payments.user_id)
from payments where payments.payment_time between '2017-01-01' and
'2017-01-05') / COUNT(distinct activity.user_id) * 100
AS percent
FROM payments INNER JOIN activity ON
activity.user_id = payments.user_id and activity.login_time between
'2017-01-01' and '2017-01-05'
GROUP BY activity.login_time;
I need a result
01.01.2017 100 %
02.01.2017 0%
03.01.2017 0%
04.01.2017 0%
05.01.2017 - 50%
If you want the ratio of users who have made payments to those with activity, just summarize each table individually:
select p.cnt / a.cnt
from (select count(distinct user_id) as cnt from activity a) a cross join
(select count(distinct user_id) as cnt from payments) p;
EDIT:
You need a table with all dates in the range. That is the biggest problem.
Then I would recommend:
SELECT d.dte,
( SELECT COUNT(DISTINCT p.user_id)
FROM payments p
WHERE p.payment_date >= d.dte and p.payment_date < d.dte + INTERVAL 1 DAY
) /
NULLIF( (SELECT COUNT(DISTINCT a.user_id)
FROM activity a
WHERE a.login_time >= d.dte and a.login_time < d.dte + INTERVAL 1 DAY
), 0
) as ratio
FROM (SELECT date('2017-01-01') dte UNION ALL
SELECT date('2017-01-02') dte UNION ALL
SELECT date('2017-01-03') dte UNION ALL
SELECT date('2017-01-04') dte UNION ALL
SELECT date('2017-01-05') dte
) d;
Notes:
This returns NULL on days where there is no activity. That makes more sense to me than 0.
This uses logic on the dates that works for both dates and date/time values.
The logic for dates can make use of an index, which can be important for this type of query.
I don't recommend using LEFT JOINs. That will multiply the data which can make the query expensive.
First you need a table with all days in the range. Since the range is small you can build an ad hoc derived table using UNION ALL. Then left join the payments and activities. Group by the day and calculate the percentage using the count()s.
SELECT x.day,
concat(CASE count(DISTINCT a.user_id)
WHEN 0 THEN
0
ELSE
count(DISTINCT p.user_id)
/
count(DISTINCT a.user_id)
END
*
100,
'%')
FROM (SELECT cast('2017-01-01' AS date) day
UNION ALL
SELECT cast('2017-01-02' AS date) day
UNION ALL
SELECT cast('2017-01-03' AS date) day
UNION ALL
SELECT cast('2017-01-04' AS date) day
UNION ALL
SELECT cast('2017-01-05' AS date) day) x
LEFT JOIN payments p
ON p.payment_date = x.day
LEFT JOIN activity a
ON a.login_time = x.day
GROUP BY x.day;
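A runnable sketch of the per-day calculation, using the sample data from the question with the dates rewritten in ISO format. The assumptions: SQLite through Python's sqlite3 stands in for MySQL, the list of days comes from a recursive CTE rather than a UNION ALL chain, and days without activity are reported as 0 to match the result the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (user_id INTEGER, login_time TEXT);
CREATE TABLE payments (user_id INTEGER, payment_date TEXT);
INSERT INTO activity VALUES
  (201, '2017-01-01'), (202, '2017-01-01'), (255, '2017-01-04'),
  (255, '2017-01-05'), (256, '2017-01-05'), (260, '2017-03-15');
INSERT INTO payments VALUES
  (200, '2017-01-01'), (202, '2017-01-01'), (255, '2017-01-05');
""")

# One row per day; each count is a separate correlated subquery, so
# payments and activity rows are never multiplied against each other.
query = """
WITH RECURSIVE days(day) AS (
    SELECT '2017-01-01'
    UNION ALL
    SELECT date(day, '+1 day') FROM days WHERE day < '2017-01-05'
)
SELECT d.day,
       COALESCE(
           100.0 * (SELECT COUNT(DISTINCT p.user_id)
                    FROM payments p WHERE p.payment_date = d.day)
           / NULLIF((SELECT COUNT(DISTINCT a.user_id)
                     FROM activity a WHERE a.login_time = d.day), 0),
           0.0) AS percent
FROM days d
ORDER BY d.day
"""
rows = conn.execute(query).fetchall()
for day, percent in rows:
    print(day, percent)
```

NULLIF turns a zero activity count into NULL so the division never fails, and COALESCE then maps that NULL to 0; drop the COALESCE if NULL is the preferred answer for days without activity.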

Optimise MySQL - JOIN vs Nested query

I have been trying to optimise some SQL queries based on the assumption that Joining tables is more efficient than nesting queries. I am joining the same table multiple times to perform a different analysis on the data.
I have 2 tables:
transactions:
id | date_add | merchant_id | transaction_type | amount
1 1488733332 108 add 20.00
2 1488733550 108 remove 5.00
and a calendar table which just lists dates so that I can create empty records where there are no transactions on particular days:
calendar:
id | datefield
1 2017-03-01
2 2017-03-02
3 2017-03-03
4 2017-03-04
I have many thousands of rows in the transactions table, and I'm trying to get an annual summary of total and different types of transactions per month (i.e 12 rows in total), where
transactions = sum of all "amount"s,
additions = sum of all "amounts" where transaction_type = "add"
redemptions = sum of all "amounts" where transaction_type = "remove"
result:
month | transactions | additions | redemptions
Jan 15 12 3
Feb 20 15 5
...
My initial query looks like this:
SELECT COALESCE(tr.transactions, 0) AS transactions,
COALESCE(ad.additions, 0) AS additions,
COALESCE(re.redemptions, 0) AS redemptions,
calendar.date
FROM (SELECT DATE_FORMAT(datefield, '%b %Y') AS date FROM calendar WHERE datefield LIKE '2017-%' GROUP BY YEAR(datefield), MONTH(datefield)) AS calendar
LEFT JOIN (SELECT COUNT(transaction_type) as transactions, from_unixtime(date_add, '%b %Y') as date_t FROM transactions WHERE merchant_id = 108 GROUP BY from_unixtime(date_add, '%b %Y')) AS tr
ON calendar.date = tr.date_t
LEFT JOIN (SELECT COUNT(transaction_type = 'add') as additions, from_unixtime(date_add, '%b %Y') as date_a FROM transactions WHERE merchant_id = 108 AND transaction_type = 'add' GROUP BY from_unixtime(date_add, '%b %Y')) AS ad
ON calendar.date = ad.date_a
LEFT JOIN (SELECT COUNT(transaction_type = 'remove') as redemptions, from_unixtime(date_add, '%b %Y') as date_r FROM transactions WHERE merchant_id = 108 AND transaction_type = 'remove' GROUP BY from_unixtime(date_add, '%b %Y')) AS re
ON calendar.date = re.date_r
I tried optimising and cleaning it up a little, removing the nested statements and came up with this:
SELECT
DATE_FORMAT(cal.datefield, '%b %d') as date,
IFNULL(count(ct.amount),0) as transactions,
IFNULL(count(a.amount),0) as additions,
IFNULL(count(r.amount),0) as redeptions
FROM calendar as cal
LEFT JOIN transactions as ct ON cal.datefield = date(from_unixtime(ct.date_add)) && ct.merchant_id = 108
LEFT JOIN transactions as r ON r.id = ct.id && r.transaction_type = 'remove'
LEFT JOIN transactions as a ON a.id = ct.id && a.transaction_type = 'add'
WHERE cal.datefield like '2017-%'
GROUP BY month(cal.datefield)
I was surprised to see that the revised statement was about 20x slower than the original with my dataset. Have I missed some sort of logic? Is there a better way to achieve the same result with a more streamlined query, given I am joining the same table multiple times?
EDIT:
So to further explain the results I'm looking for - I'd like a single row for each month of the year (12 rows) each with a column for the total transactions, total additions, and total redemptions in each month.
The first query I was getting a result in about 0.5 sec but with the second I was getting results in 9.5sec.
Looking at your query, you could use a single left join with conditional aggregation (CASE WHEN):
SELECT COALESCE(t.transactions, 0) AS transactions,
COALESCE(t.additions, 0) AS additions,
COALESCE(t.redemptions, 0) AS redemptions,
calendar.date
FROM (SELECT DATE_FORMAT(datefield, '%b %Y') AS date
FROM calendar
WHERE datefield LIKE '2017-%'
GROUP BY YEAR(datefield), MONTH(datefield)) AS calendar
LEFT JOIN
( select
COUNT(transaction_type) as transactions
, sum( case when transaction_type = 'add' then 1 else 0 end ) as additions
, sum( case when transaction_type = 'remove' then 1 else 0 end ) as redemptions
, from_unixtime(date_add, '%b %Y') as date_t
FROM transactions
WHERE merchant_id = 108
GROUP BY from_unixtime(date_add, '%b %Y')
) t ON calendar.date = t.date_t
First I would create a derived table with timestamp ranges for every month from your calendar table. This way a join with the transactions table will be efficient if date_add is indexed.
select month(c.datefield) as month,
unix_timestamp(timestamp(min(c.datefield), '00:00:00')) as ts_from,
unix_timestamp(timestamp(max(c.datefield), '23:59:59')) as ts_to
from calendar c
where c.datefield between '2017-01-01' and '2017-12-31'
group by month(c.datefield)
Join it with the transactions table and use conditional aggregation to get your data:
select c.month,
sum(t.amount) as transactions,
sum(case when t.transaction_type = 'add' then t.amount else 0 end) as additions,
sum(case when t.transaction_type = 'remove' then t.amount else 0 end) as redemptions
from (
select month(c.datefield) as m, date_format(c.datefield, '%b') as `month`,
unix_timestamp(timestamp(min(c.datefield), '00:00:00')) as ts_from,
unix_timestamp(timestamp(max(c.datefield), '23:59:59')) as ts_to
from calendar c
where c.datefield between '2017-01-01' and '2017-12-31'
group by month(c.datefield), date_format(c.datefield, '%b')
) c
left join transactions t on t.date_add between c.ts_from and c.ts_to
and t.merchant_id = 108
group by c.m, c.month
order by c.m
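Conditional aggregation over a single join can be checked on a small data set. The sketch below is an illustration under assumptions, not a benchmark: it runs on SQLite via Python's sqlite3, so strftime(..., 'unixepoch') replaces from_unixtime, the calendar rows and the merchant 200 row are made up, and the merchant filter sits in the ON clause so that months without transactions survive the left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (id INTEGER PRIMARY KEY, datefield TEXT);
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY, date_add INTEGER, merchant_id INTEGER,
    transaction_type TEXT, amount REAL);
INSERT INTO calendar (datefield) VALUES
  ('2017-02-28'), ('2017-03-01'), ('2017-03-02');
-- The first two rows come from the question; the third is hypothetical.
INSERT INTO transactions VALUES
  (1, 1488733332, 108, 'add',    20.00),
  (2, 1488733550, 108, 'remove',  5.00),
  (3, 1488733600, 200, 'add',     9.00);
""")

# One pass over transactions: each CASE picks the rows for one column.
query = """
SELECT c.ym,
       COUNT(t.id) AS transactions,
       COALESCE(SUM(CASE WHEN t.transaction_type = 'add'
                         THEN t.amount END), 0) AS additions,
       COALESCE(SUM(CASE WHEN t.transaction_type = 'remove'
                         THEN t.amount END), 0) AS redemptions
FROM (SELECT DISTINCT strftime('%Y-%m', datefield) AS ym FROM calendar) c
LEFT JOIN transactions t
       ON strftime('%Y-%m', t.date_add, 'unixepoch') = c.ym
      AND t.merchant_id = 108
GROUP BY c.ym
ORDER BY c.ym
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Moving t.merchant_id = 108 into a WHERE clause would discard the February row, because the NULL-extended row produced by the left join fails that test.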

Combining two similar queries together

I have two MySQL queries, both of which work fine independently, which I would like to combine so that three values are returned.
Query 1 checks how many accounts have been deleted:
SELECT
COUNT(1) AS deleted_count,
SUBDATE(e.timestamp, INTERVAL WEEKDAY(e.timestamp) DAY) AS display_date
FROM
exit_reasons e
WHERE
e.timestamp>='$sixmonths'
GROUP BY
WEEKOFYEAR(e.timestamp)
ORDER BY
display_date ASC
LIMIT 26
This returns a date and the number of accounts deleted in that week.
Query 2 checks how many of these have subsequently signed up again:
SELECT
COUNT(1) AS date_count,
SUBDATE(e.timestamp, INTERVAL WEEKDAY(e.timestamp) DAY) AS display_date
FROM
exit_reasons e
LEFT JOIN
companies c on e.email=c.email
WHERE
e.timestamp>='$sixmonths' AND c.email IS NOT NULL
GROUP BY
WEEKOFYEAR(e.timestamp)
ORDER BY
display_date ASC
LIMIT 26
This returns a date and the number of that week's deleted accounts that now have a new account.
I would like it to return a date and then the number deleted and number rejoined in one query so I tried:
SELECT
COUNT(1) AS date_count,
SUBDATE(e.timestamp, INTERVAL WEEKDAY(e.timestamp) DAY) AS display_date,
date_count as rejoined_count from
(SELECT
COUNT(1) AS date_count,
SUBDATE(e.timestamp, INTERVAL WEEKDAY(e.timestamp) DAY) AS display_date
FROM
exit_reasons e2
LEFT JOIN
companies c on e.email=c.email
LEFT JOIN
companies_users cu on e.email=cu.email
WHERE
e2.timestamp>='$sixmonths' AND c.email IS NOT NULL
GROUP BY
WEEKOFYEAR(e.timestamp)
ORDER BY
display_date ASC
LIMIT 26)
FROM
exit_reasons e
WHERE
e.timestamp>='$sixmonths'
GROUP BY
WEEKOFYEAR(e.timestamp)
ORDER BY
display_date ASC
LIMIT 26
but I am getting a syntax error - how can I combine these queries together into one query?
You should be able to combine the two queries into a single query by using an aggregate function along with some conditional logic like a CASE expression:
SELECT
COUNT(1) AS deleted_count,
SUM(CASE WHEN c.email IS NOT NULL THEN 1 ELSE 0 END) as date_count,
SUBDATE(e.timestamp, INTERVAL WEEKDAY(e.timestamp) DAY) AS display_date
FROM exit_reasons e
LEFT JOIN companies c
on e.email=c.email
WHERE e.timestamp>='$sixmonths'
GROUP BY WEEKOFYEAR(e.timestamp)
ORDER BY display_date ASC
LIMIT 26;
The c.email IS NOT NULL check from your second query moves into the SUM(CASE ...) expression, which counts only the rows where a matching company exists.
I think the following will do what you want:
SELECT COUNT(*) AS deleted_count,
COUNT(c.email) as date_count,
SUBDATE(e.timestamp, INTERVAL WEEKDAY(e.timestamp) DAY) AS display_date
FROM exit_reasons e LEFT JOIN
companies c
on e.email = c.email
WHERE e.timestamp >= '$sixmonths'
GROUP BY WEEKOFYEAR(e.timestamp)
ORDER BY display_date ASC
LIMIT 26;
In the event that someone can sign up more than once with the same email, you should change the count() to use distinct:
COUNT(DISTINCT e.email) as deleted_count,
COUNT(DISTINCT c.email) as date_count
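The difference between the plain and the DISTINCT counts shows up as soon as one email appears twice in companies. The following sketch assumes SQLite via Python's sqlite3 rather than MySQL, made-up rows, and a hypothetical helper weekly_counts; date(ts, 'weekday 0', '-6 days') is used to get the Monday of each week in place of SUBDATE(... WEEKDAY ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exit_reasons (email TEXT, timestamp TEXT);
CREATE TABLE companies (email TEXT);
-- Hypothetical data: a@example.com signed up again twice.
INSERT INTO exit_reasons VALUES
  ('a@example.com', '2024-01-02 10:00:00'),
  ('b@example.com', '2024-01-03 11:00:00'),
  ('c@example.com', '2024-01-09 09:00:00');
INSERT INTO companies VALUES ('a@example.com'), ('a@example.com');
""")


def weekly_counts(deleted_expr, rejoined_expr):
    """Run the combined query with the given pair of count expressions."""
    query = f"""
    SELECT date(e.timestamp, 'weekday 0', '-6 days') AS display_date,
           {deleted_expr}  AS deleted_count,
           {rejoined_expr} AS date_count
    FROM exit_reasons e
    LEFT JOIN companies c ON e.email = c.email
    GROUP BY display_date
    ORDER BY display_date
    """
    return conn.execute(query).fetchall()


# Plain counts are inflated by the duplicate companies row.
naive = weekly_counts("COUNT(*)", "COUNT(c.email)")
# DISTINCT counts each email once per week.
distinct = weekly_counts("COUNT(DISTINCT e.email)", "COUNT(DISTINCT c.email)")
print(naive)
print(distinct)
```

COUNT(c.email) skips the NULLs that the left join produces for accounts that never came back, which is why no separate IS NOT NULL filter is needed.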