MySQL InnoDB - GROUP BY on many items - mysql

I've got a table that has id, date, ad_id, ad_network, ad_event columns.
In my database there are millions of distinct ad_id values, each with a few events associated with it.
When I try to use GROUP BY on the ad_id to count each event, the query takes so long that the request fails with a 503 error.
I need to count distinct AdClickThru and AdImpression so that I can calculate the CTR.
The problem is that one user can click many times, so I must count only one AdClickThru.
The query is below:
SELECT
`ad_network`,
`ad_id`,
SUM(DISTINCT CASE WHEN `ad_event` = "AdImpression" THEN 1 ELSE 0 END) as AdImpression,
SUM(DISTINCT CASE WHEN `ad_event` = "AdClickThru" THEN 1 ELSE 0 END) as AdClickThru
FROM `ads`
WHERE 1
AND `ad_event` IN ("AdImpression", "AdClickThru")
AND SUBSTR(`date`, 1, 7) = "2020-08"
GROUP BY `ad_id`
I have indexes on ad_id and ad_event + date but it does not help much.
How can I optimize this query?
The database will grow to billions of entries and more.
#edit
Forgot to mention that the code above is the inner part of an outer query:
SELECT
`ad_network`,
SUM(`AdImpression`) as cnt_AdImpression,
SUM(`AdClickThru`) as cnt_AdClickThru,
100 * SUM(`AdClickThru`) / SUM(`AdImpression`) as ctr
FROM (
SELECT
`ad_network`,
`ad_id`,
SUM(DISTINCT CASE WHEN `ad_event` = "AdImpression" THEN 1 ELSE 0 END) as AdImpression,
SUM(DISTINCT CASE WHEN `ad_event` = "AdClickThru" THEN 1 ELSE 0 END) as AdClickThru
FROM `ads`
WHERE 1
AND `ad_event` IN ("AdImpression", "AdClickThru")
AND SUBSTR(`date`, 1, 7) = "2020-08" -- better performance
GROUP BY `ad_id`
) a
GROUP BY `ad_network`
ORDER BY ctr DESC

The problem is that one user can click many times, so I must count only one AdClickThru.
Then use MAX(), not SUM(DISTINCT ...). This gives the same result as your expression (1 if the ad has at least one matching event, otherwise 0) and is much more efficient. I would also recommend rewriting the date filter so it is index-friendly:
SELECT
`ad_network`,
`ad_id`,
MAX(`ad_event` = 'AdImpression') as AdImpression,
MAX(`ad_event` = 'AdClickThru') as AdClickThru
FROM `ads`
WHERE 1
AND `ad_event` IN ('AdImpression', 'AdClickThru')
AND `date` >= '2020-08-01'
AND `date` < '2020-09-01'
GROUP BY `ad_id`
Notes:
the presence of ad_network in the select clause bothers me: if there are several values per ad_id, it is undefined which will be picked. Either put this column in the group by clause as well, or use an aggregate function in the select clause (such as MAX(ad_network)); or, if you are ok with an arbitrary value, be explicit about it with any_value()
use single quotes for literal strings rather than double quotes (this is the SQL standard)
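The claim that MAX() matches the SUM(DISTINCT ...) expression can be checked on a toy data set. This is only a sketch: it uses Python's standard sqlite3 module as a stand-in for MySQL (an assumption; both evaluate a comparison to 1 or 0), and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (id INTEGER PRIMARY KEY, date TEXT, ad_id INTEGER,"
    " ad_network TEXT, ad_event TEXT)"
)
# Hypothetical sample rows: ad 1 is clicked twice, ad 2 is never clicked in August.
rows = [
    ("2020-08-01", 1, "net_a", "AdImpression"),
    ("2020-08-01", 1, "net_a", "AdClickThru"),
    ("2020-08-02", 1, "net_a", "AdClickThru"),
    ("2020-08-03", 2, "net_a", "AdImpression"),
    ("2020-09-01", 2, "net_a", "AdClickThru"),  # outside the date range
]
conn.executemany(
    "INSERT INTO ads (date, ad_id, ad_network, ad_event) VALUES (?, ?, ?, ?)", rows
)

# Half-open date range instead of SUBSTR(), so an index on date could be used.
where = (
    "WHERE ad_event IN ('AdImpression', 'AdClickThru')"
    " AND date >= '2020-08-01' AND date < '2020-09-01'"
)
with_sum_distinct = conn.execute(
    "SELECT ad_id,"
    " SUM(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN 1 ELSE 0 END),"
    " SUM(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN 1 ELSE 0 END)"
    f" FROM ads {where} GROUP BY ad_id ORDER BY ad_id"
).fetchall()
with_max = conn.execute(
    "SELECT ad_id, MAX(ad_event = 'AdImpression'), MAX(ad_event = 'AdClickThru')"
    f" FROM ads {where} GROUP BY ad_id ORDER BY ad_id"
).fetchall()

print(with_sum_distinct == with_max)  # both forms yield one 0/1 flag per ad
```

Both queries return one row per ad with a 0/1 flag per event type, so the outer SUM() per ad_network keeps working unchanged.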

There is no need for 2 separate aggregations in the main query and the subquery.
You want to count the distinct ad_ids for each of the 2 cases:
SELECT ad_network,
COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru,
100 *
COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) /
COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS ctr
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
GROUP BY ad_network
ORDER BY ctr DESC
The problem here is that you have to repeat the expressions for cnt_AdImpression and cnt_AdClickThru.
You can calculate these expressions in a subquery:
SELECT ad_network, cnt_AdImpression, cnt_AdClickThru,
100 * cnt_AdClickThru / cnt_AdImpression AS ctr
FROM (
SELECT ad_network,
COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
GROUP BY ad_network
) t
ORDER BY ctr DESC
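Here is a small, hedged way to try the single-aggregation query. It runs against an in-memory SQLite database from Python rather than MySQL (an assumption made so the example is self-contained), the rows are made up, and it uses a date range filter plus `100.0 *` because SQLite would otherwise do integer division.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (date TEXT, ad_id INTEGER, ad_network TEXT, ad_event TEXT)"
)
# Hypothetical sample rows.
rows = [
    ("2020-08-01", 1, "net_a", "AdImpression"),
    ("2020-08-01", 1, "net_a", "AdClickThru"),
    ("2020-08-02", 1, "net_a", "AdClickThru"),  # repeat click, counted once
    ("2020-08-02", 2, "net_a", "AdImpression"),
    ("2020-08-03", 3, "net_b", "AdImpression"),
    ("2020-08-03", 3, "net_b", "AdClickThru"),
    ("2020-08-04", 4, "net_b", "AdImpression"),
    ("2020-08-05", 5, "net_b", "AdImpression"),
    ("2020-08-06", 6, "net_b", "AdImpression"),
]
conn.executemany("INSERT INTO ads VALUES (?, ?, ?, ?)", rows)

ctr_by_network = conn.execute("""
SELECT ad_network, cnt_AdImpression, cnt_AdClickThru,
       100.0 * cnt_AdClickThru / cnt_AdImpression AS ctr
FROM (
    SELECT ad_network,
           COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
           COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru
    FROM ads
    WHERE ad_event IN ('AdImpression', 'AdClickThru')
      AND date >= '2020-08-01' AND date < '2020-09-01'
    GROUP BY ad_network
) t
ORDER BY ctr DESC
""").fetchall()

for row in ctr_by_network:
    print(row)
```

The CASE expression returns NULL for non-matching rows, and COUNT(DISTINCT ...) ignores NULLs, which is why each ad_id is counted at most once per event type.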

Related

Select column(s) corresponding to max/min of another column without joins

I have a table (id, employee_id, device_id, logged_time) [simplified] that logs attendances of employees from biometric devices.
I generate reports showing the first in and last out time of each employee by date.
Currently, I am able to fetch the first in and last out time of each employee by date, but I also need to fetch the first in and last out device_ids of each employee. The entries are not in sequential order of the logged time.
I do not want to (and probably cannot) use joins as in one of the reports the columns are dynamically generated and can lead to thousands of joins. Furthermore, these are subqueries and are joined to other queries to get further details.
A sample setup of the table and queries is at http://sqlfiddle.com/#!9/3bc755/4
The first one just lists the entry and exit time by date of every employee
select
attendance_logs.employee_id,
DATE(attendance_logs.logged_time) as date,
TIME(MIN(attendance_logs.logged_time)) as entry_time,
TIME(MAX(attendance_logs.logged_time)) as exit_time
from attendance_logs
group by date, attendance_logs.employee_id
The second one builds up an attendance chart given a date range
select
`attendance_logs`.`employee_id`,
DATE(MIN(case when DATE(`attendance_logs`.`logged_time`) = '2017-09-18' THEN `attendance_logs`.`logged_time` END)) as date_2017_09_18,
MIN(case when DATE(`attendance_logs`.`logged_time`) = '2017-09-18' THEN `attendance_logs`.`logged_time` END) as entry_2017_09_18,
MAX(case when DATE(`attendance_logs`.`logged_time`) = '2017-09-18' THEN `attendance_logs`.`logged_time` END) as exit_2017_09_18,
DATE(MIN(case when DATE(`attendance_logs`.`logged_time`) = '2017-09-19' THEN `attendance_logs`.`logged_time` END)) as date_2017_09_19,
MIN(case when DATE(`attendance_logs`.`logged_time`) = '2017-09-19' THEN `attendance_logs`.`logged_time` END) as entry_2017_09_19,
MAX(case when DATE(`attendance_logs`.`logged_time`) = '2017-09-19' THEN `attendance_logs`.`logged_time` END) as exit_2017_09_19
/*
* dynamically generated columns for dates in date range
*/
from `attendance_logs`
where `attendance_logs`.`logged_time` >= '2017-09-18 00:00:00' and `attendance_logs`.`logged_time` <= '2017-09-19 23:59:59'
group by `attendance_logs`.`employee_id`;
Tried:
Similar to max and min logged_time of each date using case, tried to select the device_id where logged_time is max/min.
MIN(case
when
`attendance_logs`.`logged_time` = MIN(
case when DATE(`attendance_logs`.`logged_time`)
= '2017-09-18' THEN `attendance_logs`.`logged_time` END
)
then `attendance_logs`.`device_id` end) as entry_device_2017_09_18
This results in an "Invalid use of group function" error, because aggregate functions cannot be nested.
A quick hack for your query is to pick the device id for the in and out entries by wrapping GROUP_CONCAT in SUBSTRING_INDEX, ordering the concatenated values by logged_time (DESC for the exit device, ASC for the entry device):
SUBSTRING_INDEX(GROUP_CONCAT(case when DATE(`l`.`logged_time`) = '2017-09-18' THEN `l`.`device_id` END ORDER BY `l`.`logged_time` desc),',',1) exit_device_2017_09_18,
Or, if the device id is always the same for an employee's in and out entries, it can be written with GROUP_CONCAT alone:
GROUP_CONCAT(DISTINCT case when DATE(`l`.`logged_time`) = '2017-09-18' THEN `l`.`device_id` END)
To avoid joins I suggest you try "correlated subqueries" instead:
select
employee_id
, logdate
, TIME(entry_time) entry_time
, (select MIN(l.device_id)
from attendance_logs l
where l.employee_id = d.employee_id
and l.logged_time = d.entry_time) entry_device
, TIME(exit_time) exit_time
, (select MAX(l.device_id)
from attendance_logs l
where l.employee_id = d.employee_id
and l.logged_time = d.exit_time) exit_device
from (
select
attendance_logs.employee_id
, DATE(attendance_logs.logged_time) as logdate
, MIN(attendance_logs.logged_time) as entry_time
, MAX(attendance_logs.logged_time) as exit_time
from attendance_logs
group by
attendance_logs.employee_id
, DATE(attendance_logs.logged_time)
) d
;
see: http://sqlfiddle.com/#!9/06e0e2/3
Note: I have used MIN() and MAX() on those subqueries only to avoid any possibility that these return more than one value. You could use limit 1 instead if you prefer.
Note also: I do not normally recommend correlated subqueries as they can cause performance issues, but they do supply the data you need.
Oh, and please try to avoid using date as a column name; it isn't good practice.
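To see that the correlated-subquery approach copes with rows that are not in chronological order, here is a minimal sketch. It runs the same query shape against SQLite from Python instead of MySQL (an assumption; DATE() and TIME() behave alike on 'YYYY-MM-DD HH:MM:SS' strings in both), and the device ids and timestamps are invented.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attendance_logs (id INTEGER PRIMARY KEY, employee_id INTEGER,"
    " device_id INTEGER, logged_time TEXT)"
)
# Hypothetical rows, deliberately inserted out of chronological order.
conn.executemany(
    "INSERT INTO attendance_logs (employee_id, device_id, logged_time)"
    " VALUES (?, ?, ?)",
    [
        (1, 30, "2017-09-18 17:05:00"),  # last out, device 30
        (1, 10, "2017-09-18 08:58:00"),  # first in, device 10
        (1, 20, "2017-09-18 12:30:00"),
        (2, 20, "2017-09-18 09:15:00"),  # single log: both first in and last out
    ],
)

query = """
SELECT d.employee_id, d.logdate,
       TIME(d.entry_time) AS entry_time,
       (SELECT MIN(l.device_id) FROM attendance_logs l
         WHERE l.employee_id = d.employee_id
           AND l.logged_time = d.entry_time) AS entry_device,
       TIME(d.exit_time) AS exit_time,
       (SELECT MAX(l.device_id) FROM attendance_logs l
         WHERE l.employee_id = d.employee_id
           AND l.logged_time = d.exit_time) AS exit_device
FROM (
    SELECT employee_id,
           DATE(logged_time) AS logdate,
           MIN(logged_time) AS entry_time,
           MAX(logged_time) AS exit_time
    FROM attendance_logs
    GROUP BY employee_id, DATE(logged_time)
) d
ORDER BY d.employee_id, d.logdate
"""
report = conn.execute(query).fetchall()

for row in report:
    print(row)
```

On a large table the two subqueries would presumably benefit from a composite index on (employee_id, logged_time), but that is a guess to verify with EXPLAIN on the real data.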

Union two SQL Queries with one same column ( using Eloquent )

I have two queries that should be combined with a union (using Laravel Eloquent), but both queries have a column called group_date and I want to show it only once.
SELECT
to_char(CREATE_UTC_DATETIME, 'yyyy-mm-dd') AS group_date,
COUNT(*) AS successful_transaction
FROM "REPORT_EVENTS"
WHERE "RESULT_CODE" = '0' AND "EVENT_TYPE" = 'BILL'
GROUP BY to_char(CREATE_UTC_DATETIME, 'yyyy-mm-dd')
ORDER BY "GROUP_DATE" DESC
SELECT
to_char(CREATE_UTC_DATETIME, 'yyyy-mm-dd') AS group_date,
COUNT(*) AS unsuccessful_transaction
FROM "REPORT_EVENTS"
WHERE "RESULT_CODE" = '1' AND "EVENT_TYPE" = 'BILL'
GROUP BY to_char(CREATE_UTC_DATETIME, 'yyyy-mm-dd')
ORDER BY "GROUP_DATE" DESC
You don't want a UNION here, but rather a single query which uses conditional aggregation:
SELECT
TO_CHAR(CREATE_UTC_DATETIME, 'yyyy-mm-dd') AS group_date,
SUM(CASE WHEN RESULT_CODE = '0' THEN 1 ELSE 0 END) AS successful_transaction,
SUM(CASE WHEN RESULT_CODE = '1' THEN 1 ELSE 0 END) AS unsuccessful_transaction
FROM "REPORT_EVENTS"
WHERE "EVENT_TYPE" = 'BILL'
GROUP BY TO_CHAR(CREATE_UTC_DATETIME, 'yyyy-mm-dd')
ORDER BY "GROUP_DATE" DESC
I am not giving any Eloquent/Laravel code here, but I am fairly certain that you would need a custom raw query to handle this. So, your actual PHP code would more or less just have the above query in its raw form.

Display the results in 1 row and different columns

Assume a simple case, e.g. a table bugs that has a column status that can be open, fixed, etc.
If I want to know how many bugs are open I simply do:
select count(*) as open_bugs from bugs where status = 'open';
If I want to know how many bugs are closed I simply do:
select count(*) as closed_bugs from bugs where status = 'closed';
If I want to know how many open and how many closed bugs there are, in a single query that returns the results in 2 columns, i.e.
Open | Closed
60   | 180
What is the best way to do it? UNION concatenates the results, so it is not what I want.
This can be done by using a CASE expression with your aggregate function. This will convert the rows into columns:
select
sum(case when status = 'open' then 1 else 0 end) open_bugs,
sum(case when status = 'closed' then 1 else 0 end) closed_bugs
from bugs
This could also be written using your original queries:
select
max(case when status = 'open' then total end) open_bugs,
max(case when status = 'closed' then total end) closed_bugs
from
(
select status, count(*) as total from bugs where status = 'open' group by status
union all
select status, count(*) as total from bugs where status = 'closed' group by status
) d
Besides the CASE variants that aggregate over the whole table, there is another way: take the queries you already have and put them inside another SELECT:
SELECT
( SELECT COUNT(*) FROM bugs WHERE status = 'open') AS open_bugs,
( SELECT COUNT(*) FROM bugs WHERE status = 'closed') AS closed_bugs
FROM dual -- this line is optional
;
It has the advantage that you can wrap counts from different tables or joins in a single query.
There may also be differences in efficiency (worse or better). Test with your tables and indexes.
You can also use GROUP BY to get all the counts in separate rows (like the UNION you mention) and then use another aggregation to pivot the results in one row:
SELECT
MIN(CASE WHEN status = 'open' THEN cnt END) AS open_bugs,
MIN(CASE WHEN status = 'closed' THEN cnt END) AS closed_bugs
FROM
( SELECT status, COUNT(*) AS cnt
FROM bugs
WHERE status IN ('open', 'closed')
GROUP BY status
) AS g
Try this
select count(case when status = 'open' then 1 end) open_bugs,
count(case when status = 'closed' then 1 end) closed_bugs
from bugs
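The variants suggested in this thread can be compared side by side. The sketch below runs them against an in-memory SQLite database from Python (an assumption standing in for the real server; `FROM dual` is dropped because SQLite has no such table) with made-up rows, and then repeats them on an empty table, where they no longer agree.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs (id INTEGER PRIMARY KEY, status TEXT)")
# Hypothetical sample rows: 3 open, 2 closed, 1 fixed.
conn.executemany(
    "INSERT INTO bugs (status) VALUES (?)",
    [("open",), ("open",), ("open",), ("closed",), ("closed",), ("fixed",)],
)

variants = {
    "sum_case": """
        SELECT SUM(CASE WHEN status = 'open' THEN 1 ELSE 0 END),
               SUM(CASE WHEN status = 'closed' THEN 1 ELSE 0 END)
        FROM bugs""",
    "count_case": """
        SELECT COUNT(CASE WHEN status = 'open' THEN 1 END),
               COUNT(CASE WHEN status = 'closed' THEN 1 END)
        FROM bugs""",
    "scalar_subqueries": """
        SELECT (SELECT COUNT(*) FROM bugs WHERE status = 'open'),
               (SELECT COUNT(*) FROM bugs WHERE status = 'closed')""",
    "group_then_pivot": """
        SELECT MIN(CASE WHEN status = 'open' THEN cnt END),
               MIN(CASE WHEN status = 'closed' THEN cnt END)
        FROM (SELECT status, COUNT(*) AS cnt FROM bugs
              WHERE status IN ('open', 'closed') GROUP BY status) AS g""",
}
results = {name: conn.execute(sql).fetchone() for name, sql in variants.items()}

# With no rows at all, SUM and MIN return NULL while COUNT returns 0.
conn.execute("DELETE FROM bugs")
empty_results = {name: conn.execute(sql).fetchone() for name, sql in variants.items()}

print(results)
print(empty_results)
```

If zero rather than NULL is wanted when nothing matches, the COUNT-based forms avoid the need for COALESCE.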

Getting a Rank out of a Total

I have been doing this for quite some time:
SELECT COUNT(*) AS 'Rank' FROM Table
WHERE Condition = 'Condition' AND Score >= 'Score';
SELECT COUNT(*) AS 'Total' FROM Table
WHERE Condition = 'Condition';
Is there a more efficient way of getting both Rank and Total?
You can calculate both at the same time with one pass through the data.
SELECT COUNT(*) AS 'Total',
SUM(CASE WHEN Score >= 'Score' THEN 1 ELSE 0 END) AS `Rank`
FROM Table
WHERE Condition = 'Condition';
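As a rough illustration of the one-pass version, the sketch below wraps it in a small Python helper over an in-memory SQLite table. The table name scores, its columns, the helper name rank_and_total and the sample data are all hypothetical, since the question only shows placeholders.

```python
import sqlite3

# In-memory SQLite database; table, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, category TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("ann", "a", 90),
        ("bob", "a", 80),
        ("cid", "a", 70),
        ("dee", "a", 60),
        ("eve", "b", 99),  # different category, must not be counted
    ],
)


def rank_and_total(conn, category, score):
    """Return (rank, total) for `score` within `category` using a single scan."""
    total, rank = conn.execute(
        "SELECT COUNT(*), SUM(CASE WHEN score >= ? THEN 1 ELSE 0 END)"
        " FROM scores WHERE category = ?",
        (score, category),
    ).fetchone()
    return rank, total


cid_rank = rank_and_total(conn, "a", 70)
print(cid_rank)
```

Passing the score and the condition as bound parameters, rather than pasting them into the SQL string, is a choice made here for safety and is not something the original queries show.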

SQL query not returning expected result

I wrote the following query to return some statistics about purchases made in a given amount of time. But for some reason every "COUNT" column returns the total number of rows. Did I organize the query incorrectly?
SELECT COUNT(*) as countTotal, SUM(`cost`) as cost, COUNT(`paymentType` = 'credit') as count_credit, COUNT(`paymentType` = 'cash') as count_cash
FROM `purchase` WHERE `date` >= '2011-5-4'
update
I just decided to use sub-queries. This is what I ended up with.
SELECT
COUNT(*) as countTotal,
SUM(`cost`) as cost,
(SELECT COUNT(*) FROM `purchase` WHERE `paymentType` = 'credit') as count_credit,
(SELECT COUNT(*) FROM `purchase` WHERE `paymentType` = 'cash') as count_cash
FROM `purchase` WHERE `date` >= '2011-5-4'
update2
Used ypercube's answer below.
COUNT does return the number of rows for the domain or group queried. It looks like you need to group by PaymentType to achieve what you are looking for.
SELECT PaymentType, COUNT(*) as countTotal, SUM(`cost`) as cost
FROM `purchase`
WHERE `date` >= '2011-5-4'
Group by PaymentType
here is a reference
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
It doesn't look correct but changing COUNT() to SUM() works fine:
SELECT COUNT(*) AS countTotal
, SUM(cost) AS cost
, SUM(paymentType = 'credit') AS count_credit -- SUM does counting here
, SUM(paymentType = 'cash') AS count_cash -- and here
FROM purchase
WHERE `date` >= '2011-05-04'
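Explanation: in MySQL a comparison evaluates to 1 (true) or 0 (false). COUNT(expr) counts every row where expr is not NULL, so it counts the 0s as well as the 1s, whereas SUM(expr) adds up only the 1s.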
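The difference is easy to reproduce. Below is a small sketch using Python's sqlite3 module in place of MySQL (an assumption; SQLite also evaluates a comparison to 1 or 0), with invented purchase rows.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (date TEXT, cost REAL, paymentType TEXT)")
# Hypothetical sample rows: 3 credit and 2 cash purchases on or after the cutoff.
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [
        ("2011-05-04", 10.0, "credit"),
        ("2011-05-05", 20.0, "credit"),
        ("2011-05-06", 5.0, "cash"),
        ("2011-05-07", 2.5, "cash"),
        ("2011-05-08", 12.5, "credit"),
        ("2011-04-30", 99.0, "cash"),  # before the cutoff, filtered out
    ],
)

# COUNT(expr) counts non-NULL values, so the 0s are counted too.
counted = conn.execute(
    "SELECT COUNT(*), COUNT(paymentType = 'credit'), COUNT(paymentType = 'cash')"
    " FROM purchase WHERE date >= '2011-05-04'"
).fetchone()

# SUM(expr) adds the 1s and 0s, which is the per-type count that was intended.
summed = conn.execute(
    "SELECT COUNT(*), SUM(cost), SUM(paymentType = 'credit'), SUM(paymentType = 'cash')"
    " FROM purchase WHERE date >= '2011-05-04'"
).fetchone()

print(counted)
print(summed)
```

The COUNT version reports the total row count in every column, matching the symptom described in the question, while the SUM version splits the total by payment type.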
You need a GROUP BY clause after your WHERE clause