Best way to index and query an analytics table in MySQL

I have an analytics table (5M rows and growing) with the following structure
Hits
id int() NOT NULL AUTO_INCREMENT,
hit_date datetime NOT NULL,
hit_day int(11) DEFAULT NULL,
gender varchar(255) DEFAULT NULL,
age_range_id int(11) DEFAULT NULL,
klout_range_id int(11) DEFAULT NULL,
frequency int(11) DEFAULT NULL,
count int(11) DEFAULT NULL,
location_id int(11) DEFAULT NULL,
source_id int(11) DEFAULT NULL,
target_id int(11) DEFAULT NULL,
Most queries against the table select rows between two datetimes for a particular subset of columns and then sum the count column across all matching rows. For example:
SELECT target_id,
SUM(CASE gender WHEN 'm' THEN count END) AS 'gender_male',
SUM(CASE gender WHEN 'f' THEN count END) AS 'gender_female',
SUM(CASE age_range_id WHEN 1 THEN count END) AS 'age_18 - 20',
SUM(CASE target_id WHEN 1 THEN count END) AS 'target_test',
SUM(CASE location_id WHEN 1 THEN count END) AS 'location_NY'
FROM Hits
WHERE (location_id = 1 OR location_id = 2)
AND (target_id = 40 OR target_id = 22)
AND CAST(hit_date AS date) BETWEEN '2012-5-4' AND '2012-5-10'
GROUP BY target_id
The interesting thing about queries to this table is that the WHERE clause can include any permutation of Hits column names and values, since those are what we're filtering against. So the particular query above gets the number of males and females between the ages of 18 and 20 (age_range_id 1) in NY that belong to a target called "test". However, there are over 8 age groups, 10 klout ranges, 45 locations, 10 sources, etc. (all foreign key references).
I currently have an index on hit_date and another one on target_id. What is the best way to properly index this table? Having a composite index on all columns seems inherently wrong.
Is there any other way to run this query without using a sub-query to sum up all counts? I did some research and this seems to be the best way to get the data-set I need but is there a more efficient way of handling this query?

Here's your optimized query. The idea is to get rid of the ORs and the CAST() function on hit_date so that MySQL can utilize a compound index that covers each of the subsets of data. You'll want a compound index on (location_id, target_id, hit_date) in that order.
SELECT target_id,
SUM(gender_male) AS gender_male,
SUM(gender_female) AS gender_female,
SUM(`age_18 - 20`) AS `age_18 - 20`,
SUM(target_test) AS target_test,
SUM(location_NY) AS location_NY
FROM
(
SELECT target_id,
SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 1)
AND (target_id = 40)
AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id
UNION ALL
SELECT target_id,
SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 2)
AND (target_id = 40)
AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id
UNION ALL
SELECT target_id,
SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 1)
AND (target_id = 22)
AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id
UNION ALL
SELECT target_id,
SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 2)
AND (target_id = 22)
AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id
) a
GROUP BY target_id
If your selection size is so large that this is no improvement, then you may as well keep scanning all rows like you're already doing.
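Note: surround aliases with backticks, not single quotes, which are deprecated. I also changed your CASE expressions to use 1 instead of count; if count refers to the count column of Hits rather than a row tally, keep count there so the stored values are summed.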
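To illustrate why dropping the CAST() matters, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption made only to keep the example self-contained; the two planners differ, so treat the output as illustrative), and the index name idx_hits_ltd is hypothetical. It compares the plan for a bare hit_date range against one where the column is wrapped in a function.

```python
import sqlite3

# SQLite stands in for MySQL here; this only illustrates the general idea.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Hits (
        id INTEGER PRIMARY KEY,
        hit_date TEXT NOT NULL,
        gender TEXT,
        location_id INTEGER,
        target_id INTEGER,
        count INTEGER
    );
    -- Hypothetical index name; the column order follows the answer above.
    CREATE INDEX idx_hits_ltd ON Hits (location_id, target_id, hit_date);
""")

def plan(date_predicate):
    """Return the planner's description for the query with the given date filter."""
    sql = ("EXPLAIN QUERY PLAN SELECT SUM(count) FROM Hits "
           "WHERE location_id = 1 AND target_id = 40 AND " + date_predicate)
    # The human-readable plan text is the last column of each row.
    return " ".join(row[-1] for row in conn.execute(sql))

# Bare column with a half-open range: the range can use the index's third column.
bare_plan = plan("hit_date >= '2012-05-04' AND hit_date < '2012-05-11'")
# Column wrapped in a function: the range cannot be matched against the index.
wrapped_plan = plan("date(hit_date) BETWEEN '2012-05-04' AND '2012-05-10'")

print(bare_plan)
print(wrapped_plan)
```

The half-open range (>= start, < day after end) is used here instead of BETWEEN ... '23:59:59' so that fractional seconds, if the column ever stores them, are not missed.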

Related

CASE query optimization

SELECT
COUNT(CASE WHEN VALUE = 1 THEN 1 END) AS score_1,
COUNT(CASE WHEN VALUE = 2 THEN 1 END) AS score_2,
COUNT(CASE WHEN VALUE = 3 THEN 1 END) AS score_3,
COUNT(CASE WHEN VALUE = 4 THEN 1 END) AS score_4,
COUNT(CASE WHEN VALUE = 5 THEN 1 END) AS score_5,
COUNT(CASE WHEN VALUE = 6 THEN 1 END) AS score_6,
COUNT(CASE WHEN VALUE = 7 THEN 1 END) AS score_7,
COUNT(CASE WHEN VALUE = 8 THEN 1 END) AS score_8,
COUNT(CASE WHEN VALUE = 9 THEN 1 END) AS score_9,
COUNT(CASE WHEN VALUE = 10 THEN 1 END) AS score_10
FROM
`answers`
WHERE
`created_at` BETWEEN '2017-01-01 00:00:00' AND '2019-11-30 23:59:59'
Is there a way to optimize this query, because I have 4 million answer records in my DB, and it runs very slowly?
Try running this one time to create an index:
CREATE INDEX ix_ca on answers(created_at)
That should speed your query up. If you are curious about why, see here:
What is an index in SQL?
You could try adding a redundant composite index:
CREATE INDEX idx1 ON answers (created_at, value)
Because the index contains every column the query needs, the query can be answered from the index alone, without reading the table rows.
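A quick way to see the "index only" effect is to ask the planner. The sketch below is an illustration under assumptions: it uses Python's sqlite3 module rather than MySQL, and the column types are guessed. In MySQL the corresponding hint would be "Using index" in the Extra column of EXPLAIN.

```python
import sqlite3

# SQLite stand-in for MySQL; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE answers (id INTEGER PRIMARY KEY, value INTEGER, created_at TEXT);
    CREATE INDEX idx1 ON answers (created_at, value);
""")

rows = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT COUNT(CASE WHEN value = 1 THEN 1 END) AS score_1,
           COUNT(CASE WHEN value = 2 THEN 1 END) AS score_2
    FROM answers
    WHERE created_at BETWEEN '2017-01-01 00:00:00' AND '2019-11-30 23:59:59'
""").fetchall()

# The plan text is the last column; a covering index means no table lookups.
plan_text = " ".join(row[-1] for row in rows)
print(plan_text)
```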
Want it to be 10 times as fast? Use the data warehousing technique of building and maintaining a "summary table". In this example the summary table might be
CREATE TABLE subtotals (
dy DATE NOT NULL,
`value` ... NOT NULL, -- TINYINT UNSIGNED ?
ct SMALLINT UNSIGNED NOT NULL, -- this is 2 bytes, max 65K; change if might be bigger
PRIMARY KEY(value, dy) -- or perhaps the opposite order
) ENGINE=InnoDB
Each night you summarize the day's data and build 10 new rows in subtotals.
Then the "report" query becomes
SELECT
SUM(CASE WHEN VALUE = 1 THEN ct END) AS score_1,
SUM(CASE WHEN VALUE = 2 THEN ct END) AS score_2,
SUM(CASE WHEN VALUE = 3 THEN ct END) AS score_3,
SUM(CASE WHEN VALUE = 4 THEN ct END) AS score_4,
SUM(CASE WHEN VALUE = 5 THEN ct END) AS score_5,
SUM(CASE WHEN VALUE = 6 THEN ct END) AS score_6,
SUM(CASE WHEN VALUE = 7 THEN ct END) AS score_7,
SUM(CASE WHEN VALUE = 8 THEN ct END) AS score_8,
SUM(CASE WHEN VALUE = 9 THEN ct END) AS score_9,
SUM(CASE WHEN VALUE = 10 THEN ct END) AS score_10
FROM
`subtotals`
WHERE dy >= '2017-01-01'
AND dy < '2019-12-01'
Based on what you have provided, there will be about 10K rows in subtotals; that's a lot less to wade through than 4M rows. It might run more than 10 times as fast.
More discussion: http://mysql.rjweb.org/doc.php/summarytables
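The nightly roll-up itself is not shown above, so here is one way it might look. This is a sketch under assumptions: it runs on Python's sqlite3 module instead of MySQL, the column types are guessed, and summarize_day is a hypothetical helper name. It fills subtotals one day at a time and then runs the report against the summary table only.

```python
import sqlite3

# SQLite stand-in for MySQL; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE answers (
        id INTEGER PRIMARY KEY,
        value INTEGER NOT NULL,
        created_at TEXT NOT NULL
    );
    CREATE TABLE subtotals (
        dy TEXT NOT NULL,
        value INTEGER NOT NULL,
        ct INTEGER NOT NULL,
        PRIMARY KEY (value, dy)
    );
""")
sample = [
    (1, "2017-01-01 09:00:00"), (1, "2017-01-01 18:30:00"),
    (2, "2017-01-01 10:15:00"), (1, "2017-01-02 08:00:00"),
    (2, "2019-12-01 00:00:00"),  # falls outside the report window
]
conn.executemany("INSERT INTO answers (value, created_at) VALUES (?, ?)", sample)

def summarize_day(day):
    """Hypothetical nightly job: one subtotals row per value for the given day."""
    conn.execute("""
        INSERT INTO subtotals (dy, value, ct)
        SELECT date(created_at), value, COUNT(*)
        FROM answers
        WHERE created_at >= ? AND created_at < date(?, '+1 day')
        GROUP BY date(created_at), value
    """, (day, day))

for day in ("2017-01-01", "2017-01-02", "2019-12-01"):
    summarize_day(day)

# The report reads only the small summary table.
report = conn.execute("""
    SELECT SUM(CASE WHEN value = 1 THEN ct END) AS score_1,
           SUM(CASE WHEN value = 2 THEN ct END) AS score_2
    FROM subtotals
    WHERE dy >= '2017-01-01' AND dy < '2019-12-01'
""").fetchone()
print(report)  # (3, 1)
```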

get specific data along with group by

I have a table named logs.
Table: logs
ID user_id time_of_action
I want to get result for each user for each date i.e. group by date,user_id.
So, here's the expected output structure:
user_id date occurred_in_afternoon occurred_at_night total_action_count
Explanation:
occurred_in_afternoon: whether any action of a user occurred between 12:00 PM and 4:00 PM
occurred_at_night: whether any action of a user occurred between 8:00 PM and 12:00 AM (next day)
Schema and sample data:
DROP TABLE IF EXISTS `logs`;
CREATE TABLE `logs` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`time_of_action` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`Id`)
);
INSERT INTO `logs` VALUES ('1', '71', '2016-03-10 10:07:34');
INSERT INTO `logs` VALUES ('2', '66', '2016-03-10 14:07:57');
INSERT INTO `logs` VALUES ('3', '71', '2016-03-10 22:08:27');
INSERT INTO `logs` VALUES ('4', '71', '2016-03-10 15:08:40');
And here's my current query:
SELECT
user_id,
DATE(time_of_action) `date`,
CASE WHEN time_of_action BETWEEN TIMESTAMPADD(HOUR,12,DATE(time_of_action)) AND TIMESTAMPADD(HOUR,16,DATE(time_of_action)) THEN 1 ELSE 0 END occurred_in_afternoon,
CASE WHEN time_of_action BETWEEN TIMESTAMPADD(HOUR,20,DATE(time_of_action)) AND TIMESTAMPADD(HOUR,24,DATE(time_of_action)) THEN 1 ELSE 0 END occurred_at_night,
COUNT(*) total_action_count
FROM `logs`
GROUP BY `date`,user_id
my current output:
user_id date occurred_in_afternoon occurred_at_night total_action_count
66 2016-03-10 1 0 1
71 2016-03-10 0 0 3
Expected output:
user_id date occurred_in_afternoon occurred_at_night total_action_count
66 2016-03-10 1 0 1
71 2016-03-10 1 1 3
The problem is that I am not getting the expected result. I guess the occurred_in_afternoon value is overwritten by another time_of_action that doesn't fall in the afternoon range.
And is it possible to implement it in a single query?
You forgot to use an aggregate function. You can use MAX() or BIT_OR() for your purpose:
SELECT
user_id,
DATE(time_of_action) `date`,
MAX(CASE WHEN time_of_action BETWEEN TIMESTAMPADD(HOUR,12,DATE(time_of_action)) AND TIMESTAMPADD(HOUR,16,DATE(time_of_action)) THEN 1 ELSE 0 END) occurred_in_afternoon,
MAX(CASE WHEN time_of_action BETWEEN TIMESTAMPADD(HOUR,20,DATE(time_of_action)) AND TIMESTAMPADD(HOUR,24,DATE(time_of_action)) THEN 1 ELSE 0 END) occurred_at_night,
COUNT(*) total_action_count
FROM `logs`
GROUP BY `date`,user_id
Update: I would also prefer a more readable version like
SELECT
user_id,
DATE(time_of_action) `date`,
BIT_OR(TIME(time_of_action) BETWEEN '12:00:00' AND '16:00:00') occurred_in_afternoon,
BIT_OR(TIME(time_of_action) BETWEEN '20:00:00' AND '23:59:59') occurred_at_night,
COUNT(*) total_action_count
FROM `logs`
GROUP BY `date`,user_id
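For anyone who wants to check the aggregate approach against the sample data, here is a small sketch. It assumes Python's sqlite3 module as a stand-in for MySQL, so MAX() over a boolean comparison is used (SQLite has no BIT_OR()), and the alias action_date is my own choice.

```python
import sqlite3

# SQLite stand-in for MySQL, loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (Id INTEGER PRIMARY KEY, user_id INTEGER, time_of_action TEXT)"
)
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    (1, 71, "2016-03-10 10:07:34"),
    (2, 66, "2016-03-10 14:07:57"),
    (3, 71, "2016-03-10 22:08:27"),
    (4, 71, "2016-03-10 15:08:40"),
])

# Each comparison yields 0 or 1 per row; MAX() turns that into
# "did any row in the group match?".
flags = conn.execute("""
    SELECT user_id,
           date(time_of_action) AS action_date,
           MAX(time(time_of_action) BETWEEN '12:00:00' AND '16:00:00') AS occurred_in_afternoon,
           MAX(time(time_of_action) BETWEEN '20:00:00' AND '23:59:59') AS occurred_at_night,
           COUNT(*) AS total_action_count
    FROM logs
    GROUP BY action_date, user_id
    ORDER BY user_id
""").fetchall()

for row in flags:
    print(row)
# (66, '2016-03-10', 1, 0, 1)
# (71, '2016-03-10', 1, 1, 3)
```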
I was thinking of wrapping the SUM-based result in an aliased derived table in order to get a binary value for those two fields:
SELECT
t.user_id,
t.date,
CASE WHEN t.occurred_in_afternoon > 0 THEN 1 ELSE 0 END AS occurred_in_afternoon,
CASE WHEN t.occurred_at_night > 0 THEN 1 ELSE 0 END AS occurred_at_night,
t.total_action_count
FROM
(SELECT
user_id,
DATE(time_of_action) `date`,
SUM(CASE WHEN time_of_action BETWEEN TIMESTAMPADD(HOUR,12,DATE(time_of_action)) AND TIMESTAMPADD(HOUR,16,DATE(time_of_action)) THEN 1 ELSE 0 END) occurred_in_afternoon,
SUM(CASE WHEN time_of_action BETWEEN TIMESTAMPADD(HOUR,20,DATE(time_of_action)) AND TIMESTAMPADD(HOUR,24,DATE(time_of_action)) THEN 1 ELSE 0 END) occurred_at_night,
COUNT(*) total_action_count
FROM `logs`
GROUP BY `date`,user_id) t

need absent and present count with month name

I need a month name with absent and present count. This is my database query:
SELECT sid,COUNT(CASE WHEN STATUS ='A' THEN 1 END) AS absent_count,COUNT(CASE WHEN STATUS ='P' THEN 1 END) AS present_count,
MONTHNAME(attendance_date) AS `Month_Name`
FROM attendance
WHERE SID = '2'
AND campus_id = 2
GROUP BY sid;
There's no point in group by sid - it will always be '2', as per your where clause. Instead, since you want to count per month name, that should appear in the group by clause:
SELECT MONTHNAME(attendance_date) AS `Month_Name`,
COUNT(CASE WHEN STATUS ='A' THEN 1 END) AS absent_count,
COUNT(CASE WHEN STATUS ='P' THEN 1 END) AS present_count
FROM attendance
WHERE sid = '2' AND campus_id = 2
GROUP BY MONTHNAME(attendance_date);
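A small runnable sketch of the per-month grouping follows. It assumes Python's sqlite3 module in place of MySQL, and since SQLite has no MONTHNAME(), it groups by the month number from strftime('%m', ...). The sample rows are made up for illustration. Note that grouping by month alone merges the same month from different years; add the year to the GROUP BY if the data spans more than one.

```python
import sqlite3

# SQLite stand-in for MySQL; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attendance (sid INTEGER, campus_id INTEGER, status TEXT, attendance_date TEXT)"
)
conn.executemany("INSERT INTO attendance VALUES (?, ?, ?, ?)", [
    (2, 2, "P", "2016-01-04"),
    (2, 2, "A", "2016-01-05"),
    (2, 2, "P", "2016-01-06"),
    (2, 2, "A", "2016-02-01"),
    (3, 2, "A", "2016-02-01"),  # different student, filtered out by WHERE
])

# Group by the month expression, not by sid (which is fixed by the WHERE clause).
per_month = conn.execute("""
    SELECT strftime('%m', attendance_date) AS month_no,
           COUNT(CASE WHEN status = 'A' THEN 1 END) AS absent_count,
           COUNT(CASE WHEN status = 'P' THEN 1 END) AS present_count
    FROM attendance
    WHERE sid = 2 AND campus_id = 2
    GROUP BY month_no
    ORDER BY month_no
""").fetchall()

print(per_month)  # [('01', 1, 2), ('02', 1, 0)]
```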

convert Rows to column

I'm looking for a way to turn rows into columns. (Commflag is of type bit and NOT NULL.) Help appreciated.
Table1
Id Commflag value
122 0 Ce
125 1 Cf
122 0 Cg
125 1 cs
Here is what I want in result
id ce cf cg cs cp
122 0 null 0 null null
125 null 1 null 1 null
The query below raises an error:
SELECT ID , [CE],[CF],[CG],[CS],[CP]
FROM TABLE1
PIVOT ((convert((Commflag)as varchar()) FOR value IN [CE],[CF],[CG],[CS],[CP] as pvt
ORDER BY date
This query does what you want:
select Id, pvt.Ce, pvt.Cf, pvt.CG, pvt.Cs, pvt.Cp
from
(
select Id, cast(Commflag as tinyint) Commflag, value
from Table1
) t
pivot (max(Commflag) for value in ([Ce],[Cf],[CG],[Cs],[Cp])) pvt
Here's another way to do it, without using PIVOT:
select Id,
max(case value when 'Ce' then CAST(Commflag as tinyint) else null end) Ce,
max(case value when 'Cf' then CAST(Commflag as tinyint) else null end) Cf,
max(case value when 'Cg' then CAST(Commflag as tinyint) else null end) Cg,
max(case value when 'Cs' then CAST(Commflag as tinyint) else null end) Cs,
max(case value when 'Cp' then CAST(Commflag as tinyint) else null end) Cp
from Table1
group by Id
order by Id
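The conditional-aggregation version is easy to try outside SQL Server. The sketch below is an assumption-laden illustration: it uses Python's sqlite3 module, stores Commflag as an integer (so no CAST is needed), and compares lower(value) because SQLite string comparison is case-sensitive while the sample data mixes 'Cs' and 'cs'.

```python
import sqlite3

# SQLite stand-in for SQL Server, loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Id INTEGER, Commflag INTEGER NOT NULL, value TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", [
    (122, 0, "Ce"),
    (125, 1, "Cf"),
    (122, 0, "Cg"),
    (125, 1, "cs"),
])

# One MAX(CASE ...) per target column; rows that do not match contribute NULL.
pivot = conn.execute("""
    SELECT Id,
           MAX(CASE lower(value) WHEN 'ce' THEN Commflag END) AS ce,
           MAX(CASE lower(value) WHEN 'cf' THEN Commflag END) AS cf,
           MAX(CASE lower(value) WHEN 'cg' THEN Commflag END) AS cg,
           MAX(CASE lower(value) WHEN 'cs' THEN Commflag END) AS cs,
           MAX(CASE lower(value) WHEN 'cp' THEN Commflag END) AS cp
    FROM Table1
    GROUP BY Id
    ORDER BY Id
""").fetchall()

for row in pivot:
    print(row)
# (122, 0, None, 0, None, None)
# (125, None, 1, None, 1, None)
```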

how to merge the values of a column in mysql

Let's consider this query:
select class_id,case when event_id=2 then sum(time_spent) end as timespent ,case when event_id=3 then sum(time_spent) end as visitedtimespent from class group by class_id,event_id;
The output looks like this:
class_id timespent visitedtimespent
1 2000 NULL
1 NULL 10
2 4000 NULL
2 NULL 5
When I use this query:
select class_id,case when event_id=2 then sum(time_spent) end as timespent ,case when event_id=3 then sum(time_spent) end as visitedtimespent from class group by class_id;
the output looks like this:
class_id timespent visitedtimespent
1 2000 NULL
2 4000 NULL
But I expected this output:
class_id timespent visitedtimespent
1 2000 10
2 4000 5
How can I achieve this?
select class_id,
sum(case when event_id=2 then time_spent else 0 end) as timespent,
sum(case when event_id=3 then time_spent else 0 end) as visitedtimespent
from class
group by class_id
Sum the CASE expression:
select class_id,
sum(case when event_id=2 then time_spent end) as timespent ,
sum(case when event_id=3 then time_spent end) as visitedtimespent
from class group by class_id;
To explain the difference:
case when id ... then sum(value) end is equivalent to
select case when id ... then value end from
(
select id, sum(value) as value from table
) subquery
which is an illegal grouping: id is neither aggregated nor included in the GROUP BY, so its value is picked arbitrarily from the rows in each group, and the id information is lost. If you then apply a CASE to that id, you will not get meaningful results.
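To see the SUM(CASE ...) form produce the expected output, here is a minimal sketch. It assumes Python's sqlite3 module as a stand-in for MySQL, and the rows are invented so that they add up to the totals in the question.

```python
import sqlite3

# SQLite stand-in for MySQL; rows are made up to match the question's totals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (class_id INTEGER, event_id INTEGER, time_spent INTEGER)")
conn.executemany("INSERT INTO class VALUES (?, ?, ?)", [
    (1, 2, 1500),
    (1, 2, 500),
    (1, 3, 10),
    (2, 2, 4000),
    (2, 3, 5),
])

# The CASE runs per row, before aggregation, so each event_id feeds its own sum.
merged = conn.execute("""
    SELECT class_id,
           SUM(CASE WHEN event_id = 2 THEN time_spent END) AS timespent,
           SUM(CASE WHEN event_id = 3 THEN time_spent END) AS visitedtimespent
    FROM class
    GROUP BY class_id
    ORDER BY class_id
""").fetchall()

print(merged)  # [(1, 2000, 10), (2, 4000, 5)]
```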