Optimization of a mysql query - mysql

I'm using MySQL and have a table user_data like this:
user_id int(10) unsigned
reg_date int(10) unsigned
carrier char(1)
The reg_date is the unix timestamp of the registration time (it could be any second of a day), and carrier is the carrier type, whose possible values are ONLY 'D', 'A' or 'V'.
I need to write a SQL statement that selects the number of registered users per carrier for each day from 2013/01/01 to 2013/01/31. So the desired result would be:
2013/01/01 D 10
2013/01/01 A 31
2013/01/01 V 24
2013/01/02 D 9
2013/01/02 A 23
2013/01/02 V 14
....
2013/01/31 D 11
2013/01/31 A 34
2013/01/31 V 22
Can anyone help me with this question? I'm required to give the BEST answer, which means I can add an index if necessary, but I need to keep the query efficient.
Currently, I created an index on (reg_date, carrier) and use the following query:
select FROM_UNIXTIME(reg_date, "%M %D %Y") as reg_day, carrier, count(carrier) as user_count
from user_data
where reg_date >= UNIX_TIMESTAMP('2013-01-01 00:00:00') and reg_date < UNIX_TIMESTAMP('2013-02-01 00:00:00')
group by reg_day, carrier
order by reg_date;
Thanks!

If you cannot change the table (storing individual dates would help a little) and can only add indexes, then:
Create a compound index on (carrier, reg_date), then group by carrier, reg_date and order by reg_date, carrier.
You can create another index just for the timestamp (it may work better for the WHERE clause, depending on how many records fall outside the date range).
Furthermore, you can work entirely with unix timestamps in a subquery and let an outer query convert them to human-readable dates (this way the conversion is done after the grouping, not for each individual record).
Creating indexes:
CREATE INDEX bytime ON user_data (reg_date);
CREATE INDEX daily_group ON user_data (carrier, reg_date);
Query:
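-- day_no counts whole days since the epoch, so day boundaries fall at 00:00 UTC;
-- multiplying back by 86400 gives a timestamp that FROM_UNIXTIME can format
-- (this assumes the session time zone is UTC).
SELECT FROM_UNIXTIME(a.day_no * 60 * 60 * 24, '%M %D %Y') AS reg_day
, a.carrier
, a.user_count
FROM (
SELECT FLOOR(reg_date / (60 * 60 * 24)) AS day_no
, carrier
, count(carrier) AS user_count
FROM user_data
WHERE reg_date >= UNIX_TIMESTAMP('2013-01-01 00:00:00')
AND reg_date < UNIX_TIMESTAMP('2013-02-01 00:00:00')
GROUP BY carrier, day_no
) AS a
ORDER BY a.day_no, a.carrier;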
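The idea of grouping on the integer day bucket first and formatting only the grouped rows afterwards can be tried out locally. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, so strftime replaces FROM_UNIXTIME, days are UTC days, and the sample rows and the ts helper are invented for the example.

```python
import sqlite3
from calendar import timegm

def ts(y, m, d, hh=0, mm=0, ss=0):
    """UTC unix timestamp for the given date and time (hypothetical helper)."""
    return timegm((y, m, d, hh, mm, ss))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_data (user_id INTEGER, reg_date INTEGER, carrier TEXT)")
conn.execute("CREATE INDEX daily_group ON user_data (carrier, reg_date)")
conn.executemany("INSERT INTO user_data VALUES (?, ?, ?)", [
    (1, ts(2013, 1, 1, 0, 0, 5), "D"),
    (2, ts(2013, 1, 1, 23, 59, 59), "D"),
    (3, ts(2013, 1, 1, 12, 0, 0), "A"),
    (4, ts(2013, 1, 2, 8, 30, 0), "V"),
    (5, ts(2013, 2, 1, 0, 0, 0), "D"),  # outside the range, must be ignored
])

# The inner query groups on the integer day number; only the grouped rows
# are converted to a readable date in the outer query.
query = """
SELECT strftime('%Y/%m/%d', a.day_no * 86400, 'unixepoch') AS reg_day,
       a.carrier,
       a.user_count
FROM (
    SELECT reg_date / 86400 AS day_no, carrier, COUNT(*) AS user_count
    FROM user_data
    WHERE reg_date >= ? AND reg_date < ?
    GROUP BY carrier, reg_date / 86400
) AS a
ORDER BY a.day_no, a.carrier
"""
result = conn.execute(query, (ts(2013, 1, 1), ts(2013, 2, 1))).fetchall()
for row in result:
    print(*row)
```

Whether this beats the single-level query on a real MySQL table depends on the data, so comparing EXPLAIN output for both variants is still worthwhile.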

Related

collect_set() distinct users by day from last 90 days only when user is older than last 90 days

for now I was able to collect_set() everyone that is active with no problem:
with aux as(
select date
,collect_set(user_id) over(
partition by feature
order by cast(timestamp(date) as float)
range between (-90*60*60*24) following and 0 preceding
) as user_id
,feature
--
from (
select date
,feature
,collect_set(user_id) as user_id
--
from table
--
group by date, feature
)
)
--
select date
,array_distinct(flatten(user_id))
,feature
--
from aux
The problem is that now I have to keep only users whose accounts were created more than 90 days before the date.
I tried this and it didn't work:
select date
,collect_set(case when user_created_at < date - interval 90 day
then user_id end) over(
partition by feature
order by cast(timestamp(date) as float)
range between (-90*60*60*24) following and 0 preceding
) as teste
,feature
from table
The reason it didn't work is that the filter inside collect_set() only filters the users of a single day instead of filtering all the users from the last 90 days, so the result contains more users than expected.
How can I get this right?
For reference, I'm using this query to verify the result:
select
count(distinct user_id) as total
,count(distinct case when user_created_at < date('2020-04-30') - interval 90 day then user_id end)
,count(distinct case when user_created_at >= date('2020-04-30') - interval 90 day then user_id end)
--
from table
--
where 1=1
and date >= date('2020-04-30') - interval 90 day
and date <= '2020-04-30'
and feature = 'a_feature'
pretty ugly workaround but:
select data
,feature
,collect_set(cus.client_id) as client
from (
select data
,explode(array_distinct(flatten(cliente))) as client
,feature
from(
select data
,collect_set(client_id) over(
partition by feature
order by cast(timestamp(data) as float)
range between (-90*60*60*24) following and 0 preceding
) as cliente
,feature
from (
select data
,feature
,collect_set(client_id) as cliente
from da_pandora.ds_transaction dtr
--
group by data, feature
)
)
)as dtr
left join costumer as cus
on cus.client_id = dtr.client and date(client_created_at) < data - interval 90 day
group by data, feature
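To make the intended semantics concrete, here is a minimal pure-Python sketch of the logic the workaround implements: first take the union of everyone active in the trailing 90-day window, then filter that union by account age relative to the window's end date. The function name, the argument shapes and the sample data are assumptions made for illustration, not part of the original queries.

```python
from datetime import date, timedelta

def established_users_by_day(events, created_at, window_days=90):
    """events: iterable of (day, feature, user_id); created_at: user_id -> date.

    Returns {(day, feature): users active within the trailing window ending
    on `day` whose account was created before the window started}.
    """
    # Per-day distinct users, like the inner GROUP BY with collect_set().
    active = {}
    for day, feature, user_id in events:
        active.setdefault((feature, day), set()).add(user_id)

    result = {}
    for feature, day in active:
        start = day - timedelta(days=window_days)
        # Union over the window, like the windowed collect_set() plus flatten().
        seen = set()
        for (f, d), users in active.items():
            if f == feature and start <= d <= day:
                seen |= users
        # Age filter applied once, against the end date of the window.
        result[(day, feature)] = {u for u in seen if created_at[u] < start}
    return result

created = {"old": date(2019, 1, 1), "new": date(2020, 4, 1)}
events = [
    (date(2020, 4, 29), "a_feature", "old"),
    (date(2020, 4, 30), "a_feature", "new"),
]
out = established_users_by_day(events, created)
print(out[(date(2020, 4, 30), "a_feature")])
```

The point of the two-step shape is that the age check uses the date of the row being reported, not the date of the row on which the user happened to be active.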

Rewrite sql query to pad empty month rows

I use this query to get statistics for blogs in our own tracking system.
I use UNION ALL over two tables because we aggregate data daily into one table and keep today's data in another table.
I want to show the last 10 months of traffic. This query does that, but if there is no traffic in a specific month, that row is missing from the result.
I have previously used a calendar table in MySQL to join against to avoid that, but I'm simply not skilled enough to rewrite this query to join against that calendar table.
The calendar table has one field called "datefield", which is a date in the format YYYY-MM-DD.
This is the query I currently use:
SELECT FORMAT(SUM(`count`),0) as `count`, DATE(`date`) as `date`
FROM
(
SELECT count(distinct(uniq_id)) as `count`, `timestamp` as `date`
FROM tracking
WHERE `timestamp` > now() - INTERVAL 1 DAY AND target_bid = 92
group by `datestamp`
UNION ALL
select sum(`count`),`datestamp` as `date`
from aggregate_visits
where `datestamp` > now() - interval 10 month
and target_bid = 92
group by `datestamp`
) a
GROUP BY MONTH(date)
Something like this?
select sum(COALESCE(t.`count`,0)),s.date as `date`
from DateTable s
LEFT JOIN (SELECT * FROM aggregate_visits
where `datestamp` > now() - interval 10 month
and target_bid = 92) t
ON(s.date = t.datestamp)
group by s.date
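A runnable sketch of the calendar-join idea, rolled up per month, may help here. It uses Python's sqlite3 module instead of MySQL (so strftime stands in for MySQL's date functions), and the table contents are invented; only the table and column names calendar, datefield, aggregate_visits, datestamp, target_bid and count come from the question.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar (datefield TEXT PRIMARY KEY)")
conn.execute(
    'CREATE TABLE aggregate_visits (datestamp TEXT, target_bid INTEGER, "count" INTEGER)'
)

# One calendar row per day, 2013-01-01 through 2013-03-31 (90 days).
days = [(date(2013, 1, 1) + timedelta(days=i)).isoformat() for i in range(90)]
conn.executemany("INSERT INTO calendar VALUES (?)", [(d,) for d in days])

conn.executemany("INSERT INTO aggregate_visits VALUES (?, ?, ?)", [
    ("2013-01-05", 92, 10),
    ("2013-01-20", 92, 5),
    ("2013-03-02", 92, 7),
    ("2013-02-10", 11, 99),  # a different blog, must not be counted
])

# The blog filter lives in the ON clause: putting it in WHERE would discard
# the NULL-extended rows and the empty months would disappear again.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', c.datefield) AS ym,
           COALESCE(SUM(v."count"), 0)    AS visits
    FROM calendar AS c
    LEFT JOIN aggregate_visits AS v
           ON v.datestamp = c.datefield
          AND v.target_bid = 92
    GROUP BY ym
    ORDER BY ym
""").fetchall()
for row in monthly:
    print(*row)
```

February has no traffic for blog 92 in this sample and still appears with a count of 0.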

MySql -- Determine periods of missing data with query

I have a database that's set up like this:
(Schema Name)
Historical
-CID int UQ AI NN
-ID Int PK
-Location Varchar(255)
-Status Varchar(255)
-Time datetime
So an entry might look like this
433275 | 97 | MyLocation | OK | 2013-08-20 13:05:54
My question is, if I'm expecting 5 minute interval data from each of my sites, how can I determine how long a site has been down?
Example, if MyLocation didn't send in the 5 minute interval data from 13:05:54 until 14:05:54 it would've missed 60 minutes worth of intervals, how could I find this downtime and report on it easily?
Thanks,
*Disclaimer: I'm assuming that your time column determines the order of the entries in your table and that you can't easily (and without heavy performance loss) self-join the table on auto_increment column since it can contain gaps.*
Either you create a table containing simply datetime values and do a
FROM datetime_table d
LEFT JOIN your_table y ON DATE_FORMAT(d.datetimevalue, '%Y-%m-%d %H:%i:00') = DATE_FORMAT(y.`time`, '%Y-%m-%d %H:%i:00')
WHERE y.some_column IS NULL
(date_format() function is used here to get rid of the seconds part in the datetime values).
Or you use user defined variables.
SELECT * FROM (
SELECT
y.*,
TIMESTAMPDIFF(MINUTE, @prevDT, `Time`) AS timedifference,
@prevDT := `Time`
FROM your_table y ,
(SELECT @prevDT:=(SELECT MIN(`Time`) FROM your_table)) vars
ORDER BY `Time`
) sq
WHERE timedifference > 5
EDIT: I thought you wanted to scan the whole table (or parts of it) for rows where the timedifference to the previous row is greater than 5 minutes. To check for a specific ID (and still having same assumptions as in the disclaimer) you'd have to do a different approach:
SELECT
TIMESTAMPDIFF(MINUTE, (SELECT `Time` FROM your_table sy WHERE sy.ID < y.ID ORDER BY ID DESC LIMIT 1), `Time`) AS timedifference
FROM your_table y
WHERE ID = whatever
EDIT 2:
When you say "if the ID is currently down" is there already an entry in your table or not? If not, you can simply check this via
SELECT TIMESTAMPDIFF(MINUTE, (SELECT MAX(`Time`) FROM your_table WHERE ID = whatever), NOW());
So I assume you are going to have some sort of cron job running to check this table. If that is the case you can simply check for the highest time value for each id/location and compare it against current time to flag any id's that have a most recent time that is older than the specified threshold. You can do that like this:
SELECT id, location, MAX(time) as most_recent_time
FROM Historical
GROUP BY id
HAVING most_recent_time < DATE_SUB(NOW(), INTERVAL 5 MINUTE)
Something like this:
SELECT h1.ID, h1.location, h1.time, min(h2.time)
FROM Historical h1 LEFT JOIN Historical h2
ON (h1.ID = h2.ID AND h2.CID > h1.CID)
WHERE now() > h1.time + INTERVAL 301 SECOND
GROUP BY h1.ID, h1.location, h1.time
HAVING min(h2.time) IS NULL
OR min(h2.time) > h1.time + INTERVAL 301 SECOND
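The logic shared by these queries (compare each reading with the previous one for the same site and flag pairs that are more than one interval apart) can be sketched outside the database. This is a minimal illustration in Python; the function name, the one-second tolerance (mirroring the 301 seconds used above) and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(minutes=5), tolerance=timedelta(seconds=1)):
    """Return (last_seen, next_seen, downtime, missed_reports) for every pair
    of consecutive readings further apart than expected + tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > expected + tolerance:
            # Reports that should have arrived strictly between prev and cur.
            missed = delta // expected - 1
            gaps.append((prev, cur, delta, missed))
    return gaps

readings = [
    datetime(2013, 8, 20, 13, 0, 54),
    datetime(2013, 8, 20, 13, 5, 54),
    datetime(2013, 8, 20, 14, 5, 54),  # an hour after the previous reading
    datetime(2013, 8, 20, 14, 10, 54),
]
gaps = find_gaps(readings)
for last_seen, next_seen, downtime, missed in gaps:
    print(last_seen, next_seen, downtime, missed)
```

For a site that is still down there is no later reading to pair with, so the comparison against the current time shown earlier is still needed for that case.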

GROUP BY MONTH() hide result

I'm trying to count how many results there are in each month.
This is my query :
SELECT
COUNT(*) as nb,
CONCAT(MONTH(t.date),0x3a,YEAR(t.date)) as period
FROM table1 t
WHERE t.criteria = 'value'
GROUP BY MONTH(t.date)
ORDER BY YEAR(t.date)
My Result:
nb period
---------------
7 6:2009
46 8:2009
2 10:2009
1 11:2009
14 1:2009
9 9:2010
161 7:2010
5 2:2010
88 3:2010
28 4:2010
4 5:2011
2 12:2011
The problem is that I'm sure I have results between 5:2011 and 12:2011, and in every other period since 2009 ... :/
Is this a problem with my query or with the MySQL configuration?
Thx a lot
You have to group by both the year and the month. Otherwise your April 2012 rows are grouped with April 2011 (and April 2010 ...) rows as well.
SELECT
COUNT(*) AS nb,
CONCAT(MONTH(t.date), ':', YEAR(t.date)) AS period
FROM table1 AS t
WHERE t.criteria = 'value'
GROUP BY YEAR(t.date)
, MONTH(t.date) ;
(and is there a reason you used 0x3a and not ':'?)
You could also use some other DATE and TIME functions of MySQL so there are fewer function calls per row and probably a more efficient query:
SELECT
COUNT(*) AS nb,
DATE_FORMAT(t.date, '%m:%Y') AS period
FROM table1 AS t
WHERE t.criteria = 'value'
GROUP BY EXTRACT( YEAR_MONTH FROM t.date) ;
For several queries, it's useful to have a permanent Calendar table in your database (with all dates or all year-months) or even several Calendar tables. Example:
CREATE TABLE CalendarYear
( Year SMALLINT UNSIGNED NOT NULL
, PRIMARY KEY (Year)
) ENGINE = InnoDB ;
INSERT INTO CalendarYear
(Year)
VALUES
(1900), (1901), ..., (2099) ;
CREATE TABLE CalendarMonth
( Month TINYINT UNSIGNED NOT NULL
, PRIMARY KEY (Month)
) ENGINE = InnoDB ;
INSERT INTO CalendarMonth
(Month)
VALUES
(1), (2), ..., (12) ;
Those can also help us make the one we'll need here:
CREATE TABLE CalendarYearMonth
( Year SMALLINT UNSIGNED NOT NULL
, Month TINYINT UNSIGNED NOT NULL
, FirstDay DATE NOT NULL
, NextMonth_FirstDay DATE NOT NULL
, PRIMARY KEY (Year, Month)
) ENGINE = InnoDB ;
INSERT INTO CalendarYearMonth
(Year, Month, FirstDay, NextMonth_FirstDay)
SELECT
y.Year
, m.Month
, MAKEDATE(y.Year, 1) + INTERVAL (m.Month-1) MONTH
, MAKEDATE(y.Year, 1) + INTERVAL (m.Month) MONTH
FROM
CalendarYear AS y
CROSS JOIN
CalendarMonth AS m ;
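To see what rows the cross join above is meant to produce, here is a small Python sketch that builds the same (Year, Month, FirstDay, NextMonth_FirstDay) tuples. The function name is hypothetical and the code only mirrors the intent of the INSERT ... SELECT; it does not touch a database.

```python
from datetime import date
from itertools import product

def year_month_rows(years, months=range(1, 13)):
    """Build (year, month, first_day, next_month_first_day) tuples,
    one per combination of year and month."""
    rows = []
    for y, m in product(years, months):
        first_day = date(y, m, 1)
        # December rolls over into January of the following year.
        next_first = date(y + 1, 1, 1) if m == 12 else date(y, m + 1, 1)
        rows.append((y, m, first_day, next_first))
    return rows

rows = year_month_rows([2011])
print(rows[0])
print(rows[-1])
```

Storing both boundaries lets later joins use the half-open range FirstDay <= date < NextMonth_FirstDay, which avoids calling a function on every row of the data table.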
Then you can use the Calendar tables to write more complex queries, like the variation you want (with missing months) and probably more efficiently. Tested in SQL-Fiddle:
SELECT
COUNT(t.date) AS nb,
CONCAT(cal.Month, ':', cal.Year) AS period
FROM
CalendarYearMonth AS cal
JOIN
( SELECT
YEAR(MIN(date)) AS min_year
, MONTH(MIN(date)) AS min_month
, YEAR(MAX(date)) AS max_year
, MONTH(MAX(date)) AS max_month
FROM table1
WHERE criteria = 'value'
) AS mm
ON (cal.Year, cal.Month) >= (mm.min_year, mm.min_month)
AND (cal.Year, cal.Month) <= (mm.max_year, mm.max_month)
LEFT JOIN
table1 AS t
ON t.criteria = 'value'
AND t.date >= cal.FirstDay
AND t.date < cal.NextMonth_FirstDay
GROUP BY
cal.Year, cal.Month ;
You must also GROUP BY the year:
GROUP BY MONTH(t.date), YEAR(t.date)
Your original query uses YEAR(t.date) in the SELECT clause outside of any aggregate function without grouping by it -- as a result, you get exactly 12 groups (one for each possible month) and for each group (that possibly contains dates across many years) a "random" year is chosen by MySql for selection. Strictly speaking, this is meaningless and the query should never have been allowed to execute. But MySql... sigh.
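The effect of grouping by the month alone is easy to reproduce on a few rows. The sketch below uses Python's sqlite3 module rather than MySQL (strftime stands in for MONTH() and YEAR()), and the sample dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (date TEXT, criteria TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, 'value')", [
    ("2010-04-03",), ("2011-04-10",), ("2011-04-11",), ("2011-05-01",),
])

# Grouping by month only: April 2010 and April 2011 collapse into one group.
by_month = conn.execute("""
    SELECT strftime('%m', date) AS mon, COUNT(*) AS nb
    FROM table1 WHERE criteria = 'value'
    GROUP BY mon ORDER BY mon
""").fetchall()

# Grouping by year and month keeps the two Aprils apart.
by_year_month = conn.execute("""
    SELECT strftime('%Y', date) AS yr, strftime('%m', date) AS mon, COUNT(*) AS nb
    FROM table1 WHERE criteria = 'value'
    GROUP BY yr, mon ORDER BY yr, mon
""").fetchall()

print(by_month)
print(by_year_month)
```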

Mysql nested query optimization

I have a table that logs various transactions for a CMS. It logs the username, action, and time. I have made the following query to tell me how many transactions each user made in the past two days, but it is so slow that it's faster for me to send a bunch of separate queries at this point. Am I missing a fundamental rule for writing nested queries?
SELECT DISTINCT
`username`
, ( SELECT COUNT(*)
FROM `ActivityLog`
WHERE `username`=`top`.`username`
AND `time` > CURRENT_TIMESTAMP - INTERVAL 2 DAY
) as `count`
FROM `ActivityLog` as `top`
WHERE 1;
You could use:
SELECT username
, COUNT(*) AS count
FROM ActivityLog
WHERE time > CURRENT_TIMESTAMP - INTERVAL 2 DAY
GROUP BY username
An index on (username, time) would be helpful regarding speed.
If you want to include users with 0 transactions (in the last 2 days), use this:
SELECT DISTINCT
act.username
, COALESCE(grp.cnt, 0) AS cnt
FROM ActivityLog act
LEFT JOIN
( SELECT username
, COUNT(*) AS cnt
FROM ActivityLog
WHERE time > CURRENT_TIMESTAMP - INTERVAL 2 DAY
GROUP BY username
) AS grp
ON grp.username = act.username
or, if you have a users table:
SELECT
u.username
, COALESCE(grp.cnt, 0) AS cnt
FROM users u
LEFT JOIN
( SELECT username
, COUNT(*) AS cnt
FROM ActivityLog
WHERE time > CURRENT_TIMESTAMP - INTERVAL 2 DAY
GROUP BY username
) AS grp
ON grp.username = u.username
Another way, similar to yours, would be:
SELECT username
, SUM(IF(time > CURRENT_TIMESTAMP - INTERVAL 2 DAY, 1, 0))
AS count
FROM ActivityLog
GROUP BY username
or even this (because true=1 and false=0 for MySQL):
SELECT username
, SUM(time > CURRENT_TIMESTAMP - INTERVAL 2 DAY)
AS count
FROM ActivityLog
GROUP BY username
No need for nesting...
SELECT `username`, COUNT(`username`) as `count` FROM `ActivityLog` WHERE `time` > CURRENT_TIMESTAMP - INTERVAL 2 DAY GROUP BY `username`
Also don't forget to add an INDEX on time if you want to make it even faster
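The variants in this thread differ in one observable way: the plain GROUP BY with a WHERE filter drops users who have no recent rows, while the correlated subquery and the conditional SUM report them with 0. A small sketch with Python's sqlite3 module (a stand-in for MySQL, with invented rows and a fixed cutoff string in place of CURRENT_TIMESTAMP - INTERVAL 2 DAY) shows this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ActivityLog (username TEXT, action TEXT, time TEXT)")
conn.execute("CREATE INDEX idx_user_time ON ActivityLog (username, time)")
conn.executemany("INSERT INTO ActivityLog VALUES (?, ?, ?)", [
    ("alice", "edit",   "2013-08-20 10:00:00"),
    ("alice", "delete", "2013-08-19 09:00:00"),
    ("alice", "edit",   "2013-08-01 12:00:00"),  # older than the cutoff
    ("bob",   "login",  "2013-08-02 08:00:00"),  # older than the cutoff
])
cutoff = "2013-08-18 12:00:00"  # assumed stand-in for "two days ago"

# Original shape: one correlated subquery per output row.
nested = conn.execute("""
    SELECT DISTINCT username,
           (SELECT COUNT(*) FROM ActivityLog
             WHERE username = top.username AND time > ?) AS cnt
    FROM ActivityLog AS top
    ORDER BY username
""", (cutoff,)).fetchall()

# Single pass with a WHERE filter: users without recent rows vanish.
grouped = conn.execute("""
    SELECT username, COUNT(*) AS cnt
    FROM ActivityLog WHERE time > ?
    GROUP BY username ORDER BY username
""", (cutoff,)).fetchall()

# Single pass with conditional aggregation: zero-count users are kept.
conditional = conn.execute("""
    SELECT username, SUM(time > ?) AS cnt
    FROM ActivityLog
    GROUP BY username ORDER BY username
""", (cutoff,)).fetchall()

print(nested, grouped, conditional)
```

Note that the conditional form has to read every row of the table, so on a large log the filtered GROUP BY with a suitable index is likely the cheaper choice when zero-count users are not needed.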