Let's say we have a database table with two columns, entry_time and value. entry_time is a timestamp, while value can be any other datatype. The records are relatively consistent, entered at roughly x-minute intervals. Sometimes, however, many intervals pass without an entry being made, producing a 'gap' in the data.
In terms of efficiency, what is the best way to find these gaps of at least time Y (both new and old) with a query?
To start with, let us summarize the number of entries by hour in your table.
SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
COUNT(*) samplecount
FROM table
GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)
Now, if you log something every six minutes (ten times an hour) all your samplecount values should be ten. This expression: CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy but it simply truncates your timestamps to the hour in which they occur by zeroing out the minute and second.
This is reasonably efficient, and will get you started. It's very efficient if you can put an index on your entry_time column and restrict your query to, let's say, yesterday's samples as shown here.
SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
COUNT(*) samplecount
FROM table
WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
AND entry_time < CURRENT_DATE
GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)
But it isn't much good at detecting whole hours that go by with missing samples, because an hour with no rows produces no summary row at all. It's also a little sensitive to jitter in your sampling. That is, if your top-of-the-hour sample is sometimes half a minute early (10:59:30) and sometimes half a minute late (11:00:30), your hourly summary counts will be off. So, this hour summary thing (or day summary, or minute summary, etc.) is not bulletproof.
You need a self-join query to get stuff perfectly right; it's a bit more of a hairball and not nearly as efficient.
Let's start by creating ourselves a virtual table (subquery) like this with numbered samples. (This is a pain in MySQL; some other expensive DBMSs make it easier. No matter.)
SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
FROM (
SELECT entry_time, value
FROM table
ORDER BY entry_time
) c,
(SELECT @sample:=0) s
This little virtual table gives entry_num, entry_time, value.
Next step, we join it to itself.
SELECT one.entry_num, one.entry_time, one.value,
TIMEDIFF(two.entry_time, one.entry_time) gap
FROM (
/* virtual table */
) one
JOIN (
/* same virtual table */
) two ON (two.entry_num - 1 = one.entry_num)
This lines up the tables next to each other, offset by a single entry, governed by the ON clause of the JOIN.
Finally we choose the rows from this table with an interval larger than your threshold, and these are the times of the samples right before the missing ones.
The overall self-join query is this. I told you it was a hairball.
SELECT one.entry_num, one.entry_time, one.value,
TIMEDIFF(two.entry_time, one.entry_time) gap
FROM (
SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
FROM (
SELECT entry_time, value
FROM table
ORDER BY entry_time
) c,
(SELECT @sample:=0) s
) one
JOIN (
SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time, c.value
FROM (
SELECT entry_time, value
FROM table
ORDER BY entry_time
) c,
(SELECT @sample2:=0) s
) two ON (two.entry_num - 1 = one.entry_num)
/* 12 minutes is only an example threshold; substitute your own Y */
WHERE TIMESTAMPDIFF(MINUTE, one.entry_time, two.entry_time) > 12
If you have to do this in production on a large table you may want to do it for a subset of your data. For example, you could do it each day for the previous two days' samples. This would be decently efficient, and would also make sure you didn't overlook any missing samples right at midnight. To do this your little row-numbered virtual tables would look like this.
SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
FROM (
SELECT entry_time, value
FROM table
WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
AND entry_time < CURRENT_DATE /*the two previous days, but not today*/
ORDER BY entry_time
) c,
(SELECT @sample:=0) s
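On servers that have window functions (MySQL 8.0 and later, and recent MariaDB releases), the row-numbering self-join can probably be replaced by LAG(). This is only a sketch under assumptions: your_table stands in for the real table name, and the 12-minute threshold is an example value for Y, not something from the question.

```sql
-- Sketch only: requires window function support (e.g. MySQL 8.0+).
-- your_table and the 12-minute threshold are placeholders.
SELECT prev_time  AS gap_begin,
       entry_time AS gap_end,
       TIMESTAMPDIFF(MINUTE, prev_time, entry_time) AS gap_minutes
FROM (
    -- Pair every row with the entry_time of the row just before it
    SELECT entry_time,
           LAG(entry_time) OVER (ORDER BY entry_time) AS prev_time
    FROM your_table
) AS ordered
WHERE prev_time IS NOT NULL   -- the very first row has no predecessor
  AND TIMESTAMPDIFF(MINUTE, prev_time, entry_time) > 12;
```

Each result row describes one gap: the last sample before it, the first sample after it, and its length in minutes. The same date-range WHERE clause shown above can be added to the inner query to limit the scan.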
A very efficient way to do this is with a stored procedure using cursors. I think this is simpler and more efficient than the other answers.
This procedure creates a cursor and iterates it through the datetime records that you are checking. If there is ever a gap of more than what you specify, it will write the gap's begin and end to a table.
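DELIMITER $$
CREATE PROCEDURE findgaps()
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE a,b DATETIME;
DECLARE cur CURSOR FOR SELECT dateTimeCol FROM targetTable
ORDER BY dateTimeCol ASC;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur;
FETCH cur INTO a;
read_loop: LOOP
-- b keeps the previous (earlier) row, a receives the next (later) row
SET b = a;
FETCH cur INTO a;
IF done THEN
LEAVE read_loop;
END IF;
-- DATEDIFF counts whole days; replace the bracketed placeholder with a plain number
IF DATEDIFF(a,b) > [range you specify] THEN
-- the gap begins at the earlier row (b) and ends at the later row (a)
INSERT INTO tmp_table (gap_begin, gap_end)
VALUES (b,a);
END IF;
END LOOP;
CLOSE cur;
END$$
DELIMITER ;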
In this case it is assumed that 'tmp_table' exists. You could easily define this as a TEMPORARY table in the procedure, but I left it out of this example.
I'm trying this on MariaDB 10.3.27, so this procedure may not work as written, but I'm getting an error when creating the procedure and I can't figure out why! I have a table called electric_use with a field Intervaldatetime DATETIME that I want to find gaps in. I created a target table electric_use_gaps with the fields gap_begin datetime and gap_end datetime.
The data are taken every hour and I want to know if I'm missing even an hour's worth of data across 5 years.
DELIMITER $$
CREATE PROCEDURE findgaps()
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE a,b DATETIME;
DECLARE cur CURSOR FOR SELECT Intervaldatetime FROM electric_use
ORDER BY Intervaldatetime ASC;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur;
FETCH cur INTO a;
read_loop: LOOP
SET b = a;
FETCH cur INTO a;
IF done THEN
LEAVE read_loop;
END IF;
IF TIMESTAMPDIFF(MINUTE,a,b) > [60] THEN
INSERT INTO electric_use_gaps(gap_begin, gap_end)
VALUES (a,b);
END IF;
END LOOP;
CLOSE cur;
END&&
DELIMITER ;
This is the error:
Query: CREATE PROCEDURE findgaps() BEGIN DECLARE done INT DEFAULT FALSE; DECLARE a,b DATETIME; DECLARE cur CURSOR FOR SELECT Intervalda...
Error Code: 1064
You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '[60] THEN
INSERT INTO electric_use_gaps(gap_begin, gap_end)
...' at line 16
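A likely reading of this error, offered as an untested sketch rather than a confirmed fix: the square brackets in the template are placeholder notation, not SQL, so the threshold has to be written as a bare 60. Two other details are worth checking at the same time. TIMESTAMPDIFF(unit, start, end) returns end minus start, so with a holding the later row the arguments need to be (b, a), otherwise the result is negative and never exceeds 60. And the terminator after END has to match the one set by DELIMITER ($$ here).

```sql
DELIMITER $$
CREATE PROCEDURE findgaps()
BEGIN
  DECLARE done INT DEFAULT FALSE;
  DECLARE a, b DATETIME;
  DECLARE cur CURSOR FOR
    SELECT Intervaldatetime FROM electric_use ORDER BY Intervaldatetime ASC;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
  OPEN cur;
  FETCH cur INTO a;
  read_loop: LOOP
    SET b = a;          -- b holds the earlier row
    FETCH cur INTO a;   -- a holds the later row
    IF done THEN
      LEAVE read_loop;
    END IF;
    -- TIMESTAMPDIFF(unit, start, end) returns end - start
    IF TIMESTAMPDIFF(MINUTE, b, a) > 60 THEN
      INSERT INTO electric_use_gaps (gap_begin, gap_end) VALUES (b, a);
    END IF;
  END LOOP;
  CLOSE cur;
END$$
DELIMITER ;
```

After creating it, CALL findgaps(); should fill electric_use_gaps with one row per pair of consecutive readings that are more than 60 minutes apart.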
Related
I have a table named tbl_tmp_trans.
It contains every user transaction ever made (and it's up to 6 million rows right now!).
We have decided to keep only the last 100 transactions per user in our database so we can keep the db clean.
Here is the query that I have come up with:
delete from tbl_tmp_trans
where trans_id in
(
select trans_id
from
(
select trans_id
from tbl_faucets_transactions
order by date
group by user_id
limit 100
) foo
)
What am I doing wrong?
After running this, my CPU reached 100% and MySQL crashed.
Thanks in advance.
P.S.: our db is MySQL and the table engine is InnoDB.
P.S. 2: We have about 120k users, and the transaction table has nearly 6 million records.
I have a proposal... Hopefully, it might help you.
Alter your table:
alter table tbl_tmp_trans add column todel tinyint(1);
Implement a stored procedure to iterate through the table with a cursor and mark (set todel to 1) records that should be deleted. Example procedure to do that:
delimiter //
drop procedure if exists mark_old_transactions //
create procedure mark_old_transactions()
begin
declare done int default false;
declare tid int;
declare uid int;
declare last_uid int default 0;
declare count int default 0;
declare cur cursor for select trans_id, user_id from tbl_tmp_trans order by user_id, date desc;
declare continue handler for not found set done = true;
open cur;
repeat
fetch cur into tid, uid;
if (!done) then
if (uid!=last_uid) then
set count = 0;
end if;
set last_uid = uid;
set count = count + 1;
if (count > 100) then
update tbl_tmp_trans set todel=1 where trans_id=tid;
end if;
end if;
until done
end repeat;
close cur;
end //
delimiter ;
Invoke the procedure, maybe do some simple checks (how many transactions you delete from the table, etc.), and delete the marked records.
call mark_old_transactions;
-- select count(*) from tbl_tmp_trans where todel=1;
-- select count(*) from tbl_tmp_trans;
delete from tbl_tmp_trans where todel=1;
Finally, remove the column that we just added.
alter table tbl_tmp_trans drop column todel;
Some notes:
You probably have to iterate through all the records of the table anyway, so you don't lose performance with the cursor.
If you have ~120K users and ~6M transactions, you have ~50 transactions per user on average. That means you probably don't have many users with more than 100 transactions, so the number of updates (hopefully) won't be too large, and the procedure should run relatively fast.
The delete should be fast as well with the new column.
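If the server supports window functions (MySQL 8.0 or later), the marking pass could possibly be folded into a single statement. This is a sketch under assumptions, not something taken from the question: it assumes trans_id is the primary key and that the ordering column is named date, as in the cursor above. Try it on a copy of the table first.

```sql
-- Sketch only: needs window functions (MySQL 8.0+).
-- Ranks each user's transactions from newest to oldest and deletes
-- everything past the 100th.
DELETE t
FROM tbl_tmp_trans AS t
JOIN (
    SELECT trans_id,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY `date` DESC) AS rn
    FROM tbl_tmp_trans
) AS ranked ON ranked.trans_id = t.trans_id
WHERE ranked.rn > 100;
```

On a table of this size it may still be kinder to the server to delete in batches, for example by restricting the inner query to a range of user_id values per run.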
I have a Query that could do with optimization if possible as it's taking 15 seconds to run.
It is querying a large db with approx 1000000 records and is slowed down by grouping by hour (which is derived from DATE_FORMAT()).
I indexed all relevant fields in all the tables which improved the performance significantly but I don't know how to or if it's even possible to create an index for the hour group since it's not a field...
I do realise the dataset is very large but I'd like to know if I have any options.
Any help would be appreciated!
Thanks!
SELECT `id`,
tbl1.num,
name,
DATE_FORMAT(`timestamp`,'%x-%v') AS wknum,
DATE_FORMAT(`timestamp`,'%Y-%m-%d') AS date,
DATE_FORMAT(`timestamp`,'%H') as hour,
IF(code<>0,codedescription,'') AS status,
SUM(TIME_TO_SEC(`timeblock`))/60 AS time,
SUM(`distance`) AS distance,
SUM(`distance`)/(SUM(TIME_TO_SEC(`timeblock`))/60) AS speed
FROM `tbl1`
LEFT JOIN `tbl2` ON tbl1.code = tbl2.code
LEFT JOIN `tbl3` ON tbl1.status = tbl3.status
LEFT JOIN `tbl4` ON tbl1.conditionnum = tbl4.conditionnum
LEFT JOIN `tbl5` ON tbl1.num = edm_mc_list.num
WHERE `timestamp`>'2013-07-28 00:00:00'
GROUP BY `num`,DATE_FORMAT(`timestamp`,'%H'),`mcstatus`
MySQL generally can't use indexes on columns unless the columns are isolated in the query. Isolating the column means it should not be part of an expression or be inside a function in the query.
Solutions:
1-You can store the hour in its own column, separate from the timestamp column. For example, you can populate it with BEFORE INSERT and BEFORE UPDATE triggers.
DELIMITER $$
CREATE TRIGGER `before_update_hour`
BEFORE UPDATE ON `tbl1`
FOR EACH ROW
BEGIN
-- Recompute the stored hour only when the timestamp actually changes
IF NEW.`timestamp` != OLD.`timestamp` THEN
SET NEW.`hour` = DATE_FORMAT( NEW.`timestamp`,'%H');
END IF;
END $$
DELIMITER ;
DELIMITER $$
CREATE TRIGGER `before_insert_hour`
BEFORE INSERT ON `tbl1`
FOR EACH ROW
BEGIN
SET NEW.`hour` = DATE_FORMAT( NEW.`timestamp`,'%H');
END $$
DELIMITER ;
2-If you can use MariaDB, you can use MariaDB virtual columns.
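A rough idea of what option 2 might look like, as a sketch under assumptions: it assumes `timestamp` is a DATETIME column and that the server accepts the GENERATED ALWAYS ... STORED syntax (recent MariaDB releases do, as does MySQL 5.7 and later; older MariaDB versions use the PERSISTENT keyword instead). The names hour_of_day and idx_num_hour are made up for illustration.

```sql
-- Sketch only: hour_of_day and idx_num_hour are hypothetical names.
-- The server maintains the column itself, so no triggers are needed.
ALTER TABLE tbl1
    ADD COLUMN hour_of_day TINYINT UNSIGNED
        GENERATED ALWAYS AS (HOUR(`timestamp`)) STORED,
    ADD INDEX idx_num_hour (num, hour_of_day);
```

The query would then group by hour_of_day instead of DATE_FORMAT(`timestamp`,'%H'). Whether the optimizer actually picks the new index depends on the rest of the query, so it is worth checking with EXPLAIN.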
I have written a stored procedure in MySQL which is very slow. There are a million records in the database.
DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `FetchEnergyLine`(IN From_Time INT, IN
To_Time INT, IN Meter_Id INT, IN Device_Id VARCHAR(10), IN ct INT)
BEGIN
DECLARE c INT(5) default 0;
DECLARE Count INT default 1;
SET autocommit=0;
SET @c=0;
SET Count = (SELECT COUNT(TimeStamp) FROM Meter_Data
WHERE
TimeStamp > From_Time
AND TimeStamp < To_Time
AND MeterID = Meter_Id
AND DeviceID = Device_Id );
IF Count > (2*ct) THEN SET Count=Count/ct;
ELSE SET COUNT = 20;
END IF;
SELECT * FROM ( SELECT TimeStamp, FwdHr, W , @c:=@c+1 as counter
FROM
Meter_Data
WHERE
TimeStamp > From_Time
AND TimeStamp < To_Time
AND MeterID = Meter_Id
AND DeviceID = Device_Id ORDER BY TimeStamp
) as tmp
WHERE
counter % Count =1;
END$$
DELIMITER ;
When I had less data it was very fast. My other queries against the same database run fine, but the stored procedure is slow.
1) It could be the COUNT statement at the beginning, which counts the number of readings, but I am not sure.
Can anybody help? Thanks in advance.
First thing I do after building my query is making sure I didn't forget anything. (Indexes, temporary tables, keys etc...)
At first look, I would guess you have some table scans in your plan which could take 80-90% of the process time.
To make sure it doesn't happen, create the required indexes in the tables you query.
To make sure nothing is taking too long, you should study how to use the execution plan and see what takes the longest.
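As a concrete starting point, something like the following could be tried. It is a sketch under assumptions: the index name is made up, the literal values are placeholders standing in for the procedure's parameters, and whether this index helps depends on the real data distribution.

```sql
-- Hypothetical index: equality-filtered columns first,
-- the range/sort column last.
CREATE INDEX idx_meter_device_time
    ON Meter_Data (MeterID, DeviceID, `TimeStamp`);

-- Inspect the plan for the procedure's inner query.
-- The literals below are placeholders for the procedure arguments.
EXPLAIN
SELECT `TimeStamp`, FwdHr, W
FROM Meter_Data
WHERE `TimeStamp` > 1375000000
  AND `TimeStamp` < 1375086400
  AND MeterID = 1
  AND DeviceID = 'dev01'
ORDER BY `TimeStamp`;
```

In the EXPLAIN output, a type of ALL with a large rows estimate indicates a full table scan, while a key value naming the index suggests it is being used.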
i have table data like this:
id,time,otherdata
a,1,fsdfas
a,2,fasdfag
a,3,fasdfas
a,7,asfdsaf
b,8,fasdf
a,8,asdfasd
a,9,afsadfa
b,10,fasdf
...
so essentially, i can select all the data in the order i want by saying something like:
select * from mytable order by id,time;
so i get all the records in the order i want, sorted by id first, and then by time. but instead of getting all the records, i need only the latest 3 times for each id.
Answer:
Well, I figured out how to do it. I'm surprised at how quick it was, as I'm operating on a couple million rows of data and it took about 11 seconds. I wrote a procedure in a sql script to do it, and here's what it looks like. --Note that instead of getting the last 3, it gets the last "n" number of rows of data.
use my_database;
drop procedure if exists getLastN;
drop table if exists lastN;
-- Create a procedure that gets the last n records for each id
delimiter //
create procedure getLastN(n int)
begin
# Declare cursor for data iterations, and variables for storage
declare idData varchar(32);
declare done int default 0;
declare curs cursor for select distinct id from my_table;
declare continue handler for not found set done = 1;
open curs;
# Create a temporary table to contain our results
create temporary table lastN like my_table;
# Iterate through each id
DATA_LOOP: loop
fetch curs into idData;
# Check the flag right after the fetch so the last id is not inserted twice
if done then leave DATA_LOOP; end if;
insert into lastN select * from my_table where id = idData order by time desc limit n;
end loop;
close curs;
end//
delimiter ;
call getLastN(3);
select * from lastN;
sorry if this doesn't exactly work, I've had to change variable names and stuff to obfuscate my work's work, but i ran this exact piece of code and got what i needed!
I think it's as simple as:
SELECT * FROM `mytable`
GROUP BY `id`
ORDER BY `time` DESC
LIMIT 3
Two approaches that I'm aware of are (1) to use a set of unions, each one containing a "limit 3", or (2) to use a temporary variable. These approaches, along with other useful links and discussion can be found here.
Try this:
select *
from mytable as m1
where (
select count(*) from mytable as m2
where m1.id = m2.id
and m2.time >= m1.time
) <= 3 ORDER BY id, time
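On a server with window functions (MySQL 8.0 or later), the same "latest 3 per id" result can probably be obtained with ROW_NUMBER(), which avoids running a correlated subquery for every row. This is a sketch that assumes the three columns shown in the question (id, time, otherdata).

```sql
-- Sketch only: needs window functions (MySQL 8.0+).
SELECT id, `time`, otherdata
FROM (
    -- Number each id's rows from newest (1) to oldest
    SELECT id, `time`, otherdata,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY `time` DESC) AS rn
    FROM mytable
) AS ranked
WHERE rn <= 3
ORDER BY id, `time`;
```

With the sample data from the question, id a would keep times 7, 8 and 9, and id b would keep 8 and 10, since it has only two rows among those shown.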