I have the following table in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_bin NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_bin NOT NULL,
`has_data` tinyint(1) NOT NULL,
`data` json DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=COMPRESSED;
ALTER TABLE `events`
ADD PRIMARY KEY (`pv_name`,`time_stamp`), ADD KEY `has_data` (`has_data`,`pv_name`,`time_stamp`);
I have been struggling to construct an efficient query to find each pv_name that has at least one change in value in a given time interval.
I believe that the query I currently have is inefficient because it finds all of the distinct values in the given time interval for each pv_name, instead of stopping as soon as it finds more than one:
SELECT events.pv_name
FROM events
WHERE events.time_stamp > 0 AND events.time_stamp < 9999999999999999999
GROUP BY events.pv_name
HAVING COUNT(DISTINCT JSON_EXTRACT(events.data, '$.value')) > 1;
To avoid this I am considering breaking the count and distinct parts into separate steps, since the documentation says that:
When combining LIMIT row_count with DISTINCT, MySQL stops as soon as
it finds row_count unique rows.
Is there an efficient query to find a pair of distinct values for each pv_name in a given time interval, that does not have to find all of the distinct values for each pv_name in a given time interval?
EDIT @Rick James
I am essentially trying to find a faster, non-cursor-based solution for this:
SET @old_sql_mode=@@sql_mode, sql_mode='STRICT_ALL_TABLES';
DELIMITER //
DROP PROCEDURE IF EXISTS check_for_change;
CREATE PROCEDURE check_for_change(IN t0_in bigint(20) unsigned, IN t1_in bigint(20) unsigned)
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE current_pv_name VARCHAR(60);
DECLARE cur CURSOR FOR SELECT DISTINCT pv_name FROM events;
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = TRUE;
SET @t0_in := t0_in;
SET @t1_in := t1_in;
IF @t0_in > @t1_in THEN
SET @temp := @t0_in;
SET @t0_in := @t1_in;
SET @t1_in := @temp;
END IF;
DROP TEMPORARY TABLE IF EXISTS has_change;
CREATE TEMPORARY TABLE has_change (
pv_name varchar(60) NOT NULL,
PRIMARY KEY (pv_name)
) ENGINE=Memory DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
OPEN cur;
label1: LOOP
FETCH cur INTO current_pv_name;
IF done THEN
LEAVE label1;
END IF;
INSERT INTO has_change
SELECT current_pv_name
FROM (
SELECT DISTINCT JSON_EXTRACT(events.data, '$.value') AS distinct_value
FROM events
WHERE events.pv_name = current_pv_name
AND events.has_data = 1
AND events.time_stamp > @t0_in AND events.time_stamp < @t1_in
LIMIT 2 ) AS t
HAVING COUNT(t.distinct_value) = 2;
END LOOP;
CLOSE cur;
END //
DELIMITER ;
SET sql_mode=@old_sql_mode;
The optimization here is in the application of the limit on the number of distinct values to find for each pv_name.
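A set-based way to express the same "stop as soon as a second distinct value is seen" idea would be a correlated EXISTS; this is only an untested sketch (@t0/@t1 stand for the interval bounds) and may well not be any faster than the procedure:
SELECT DISTINCT e1.pv_name
FROM events e1
WHERE e1.has_data = 1
  AND e1.time_stamp > @t0 AND e1.time_stamp < @t1
  AND EXISTS (
    -- stops at the first row whose value differs from e1's value
    SELECT 1
    FROM events e2
    WHERE e2.pv_name = e1.pv_name
      AND e2.has_data = 1
      AND e2.time_stamp > @t0 AND e2.time_stamp < @t1
      AND JSON_EXTRACT(e2.data, '$.value') <> JSON_EXTRACT(e1.data, '$.value'));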
There is no LIMIT, so the quote does not apply. (Or at least, I think not.)
COUNT(DISTINCT ...) will, in some cases, do a "loose scan", which is better than reading every row. For example,
SELECT name
FROM tbl
GROUP BY name
HAVING COUNT(DISTINCT foo) > 3;
together with INDEX(name, foo) would probably leapfrog through the index to do the COUNT DISTINCT of foos for each name. Granted, this is not "stopping at 3" as you requested.
You can demonstrate the above by doing
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
to see whether the Handler_read counts are on the order of the table size (a full scan) or much smaller.
The loose scan is not applicable to your particular query for multiple reasons.
Bottom line: "No, you can't achieve your goal".
Also, the stored routine you wrote will probably be much slower than simply accepting the overhead of a full scan.
Related
I have a lot of different tables in my database, and I need to somehow get the last inserted rows from those tables, like social network feeds. Also, those tables have unknown (not random) names, because they are all generated by users.
For example:
I have tables A, B, C and D with 5k rows in each table.
I need to somehow get the last rows from those tables, ordered by id, like we do in a simple query: "SELECT * FROM A ORDER BY id DESC", but I'm looking for something like: "SELECT * FROM A,B,C,D ORDER BY id DESC".
Tables have same structure.
You can use union and order by if your tables have the same structure. Something like:
select *
from (
select * from A
union all
select * from B
union all
select * from C
) t order by id desc
If the tables don't have the same structure then you cannot select * from all of them and order the result, and you might want to do two queries. The first would be:
select id, tableName
from (
select id, 'tableA' as tableName from A
union all
select id, 'tableB' as tableName from B
union all
select id, 'tableC' as tableName from C
) t order by id desc
This will give you the last IDs and the tables they were inserted into. You then need to get the rows from each respective table.
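For example, if the first row returned were id 4999 with tableName 'tableB' (a made-up example), the application would follow up with:
select * from B where id = 4999;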
With pure MySQL this will be a bit hard. You can select the table names like:
SELECT table_name FROM information_schema.tables; but to use them in the statement you will need to generate the query dynamically.
A procedure to generate the query dynamically could look something like this (I haven't tested it, but with some debugging it should work):
DELIMITER $$
CREATE PROCEDURE buildQuery (OUT v_query VARCHAR(10000))
BEGIN
DECLARE v_finished INTEGER DEFAULT 0;
DECLARE v_table_count INTEGER DEFAULT 0;
DECLARE v_table varchar(100) DEFAULT "";
-- declare cursor for tables (this is for all tables but can be changed)
DECLARE table_cursor CURSOR FOR
SELECT table_name FROM information_schema.tables;
-- declare NOT FOUND handler
DECLARE CONTINUE HANDLER
FOR NOT FOUND SET v_finished = 1;
OPEN table_cursor;
SET v_query="select * from ( ";
get_table: LOOP
FETCH table_cursor INTO v_table;
IF v_finished = 1 THEN
LEAVE get_table;
END IF;
SET v_table_count = v_table_count + 1;
IF v_table_count > 1 THEN
SET v_query = CONCAT(v_query, " UNION ALL ");
END IF;
SET v_query = CONCAT(v_query, " select * from ", v_table);
END LOOP get_table;
SET v_query = CONCAT(v_query, " ) t order by id desc ");
-- here v_query should be the created query with UNION_ALL
CLOSE table_cursor;
SELECT v_query;
END$$
DELIMITER ;
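Since the procedure only builds the query string into the OUT parameter, you would still need to run it yourself, for example with a prepared statement (untested sketch):
CALL buildQuery(@q);
PREPARE stmt FROM @q;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;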
If each table's id is counted separately you can't order by ID, so you'll need to calculate a global id and use it in all of your tables.
You can do it as follows:
Assuming you have 2 tables A,B:
Create Table A (id int NOT NULL auto_increment, name varchar(255), value varchar(255), PRIMARY KEY (id));
Create Table B (id int NOT NULL auto_increment, name varchar(255), value varchar(255), PRIMARY KEY (id));
Add another table IDS with id as auto increment primary key.
Create table IDS (id int NOT NULL auto_increment, ts Timestamp default CURRENT_TIMESTAMP, PRIMARY_KEY(id));
The id column of each of your tables should now use the id from the IDS table as a foreign key instead of its own auto increment.
Create Table A (id int NOT NULL, name varchar(255), value varchar(255), PRIMARY KEY (id), CONSTRAINT fk_A_id FOREIGN KEY (id) REFERENCES IDS(id) ON DELETE CASCADE ON UPDATE CASCADE);
Create Table B (id int NOT NULL, name varchar(255), value varchar(255), PRIMARY KEY (id), CONSTRAINT fk_B_id FOREIGN KEY (id) REFERENCES IDS(id) ON DELETE CASCADE ON UPDATE CASCADE);
For each table, add a BEFORE INSERT trigger; the trigger should first insert a row into the IDS table and then put the LAST_INSERT_ID into the new row's id.
DELIMITER //
Create TRIGGER before_insert_A BEFORE INSERT On A
FOR EACH ROW
BEGIN
insert into IDS() values ();
set new.id = LAST_INSERT_ID();
END //
Create TRIGGER before_insert_B BEFORE INSERT On B
FOR EACH ROW
BEGIN
insert into IDS() values ();
set new.id = LAST_INSERT_ID();
END //
DELIMITER ;
Now you can create a view over all the tables with UNION ALL; the rows of V can then be sorted by id, which gives the chronological order of insertion.
Create view V AS select * from A UNION ALL select * from B
For example you can query on V the latest 10 ids:
select * from V Order by id desc LIMIT 10
Another option is to add a timestamp column to each table and sort the view by that timestamp.
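A rough, untested sketch of that idea (created_at and V2 are made-up names):
ALTER TABLE A ADD COLUMN created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE B ADD COLUMN created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
Create view V2 AS select * from A UNION ALL select * from B;
select * from V2 Order by created_at desc LIMIT 10;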
Are you looking for this? However, id is not a good column for determining the latest updates across different tables.
select *
from A
join B
on 1=1
join C
on 1=1
join D
on 1=1
order by A.id desc
These are the tables I have (simplified):
mytable
--------------------
`pid` int(11) NOT NULL AUTO_INCREMENT
`mydate` date NOT NULL,
pictures
--------------------
`pid` int(11) NOT NULL AUTO_INCREMENT
`image_name` varchar(255)
pictures_2014
--------------------
`pid` int(11) NOT NULL AUTO_INCREMENT
`image_name` varchar(255)
This is the code I have:
SET AUTOCOMMIT=0;
START TRANSACTION;
SELECT @pids := GROUP_CONCAT(CONVERT(pid , CHAR(8)))
FROM `mytable`
WHERE `mydate`
BETWEEN '2014-01-01'
AND '2014-10-31';
INSERT INTO pictures_2014 SELECT * FROM `pictures` where pid in(@pids);
COMMIT;
Only one item is being inserted into pictures_2014: the first item in the comma-separated list.
The result I expect is that all 4111 rows that match my first query (at least those in pictures) will be copied.
I do know that at least the first 6 pids in the GROUP_CONCAT shown in phpMyAdmin exist in the table pictures.
I also tried adding LIMIT 0,100 to my first query to see if it was a variable-length issue, but I still only get one result.
You're doing more work than you have to: you're going to all the trouble of combining multiple values into a single string, and now you're having problems because IN wants to work with multiple values.
Why not something like:
SET AUTOCOMMIT=0;
START TRANSACTION;
INSERT INTO pictures_2014 SELECT * FROM `pictures`
where pid in(
SELECT pid
FROM `mytable`
WHERE `mydate`
BETWEEN '2014-01-01'
AND '2014-10-31');
COMMIT;
OK, so this is what I learned:
The variable being set is interpreted as the single string '321,654,987' rather than the list '321','654','987'.
(Thanks @John Woo.)
By changing the IN() to FIND_IN_SET() I get the right result!
SET AUTOCOMMIT=0;
START TRANSACTION;
SELECT @pids := GROUP_CONCAT(CONVERT(pid , CHAR(8)))
FROM `mytable`
WHERE `mydate`
BETWEEN '2014-01-01'
AND '2014-10-31';
INSERT INTO pictures_2014 SELECT * FROM `pictures` where FIND_IN_SET(pid,@pids);
COMMIT;
I am developing a small jumbled-words game for users on a PtokaX DC hub I manage. For this, I'm storing the list of words inside a MySQL table. Table schema is as follows:
CREATE TABLE `jumblewords` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`word` CHAR(15) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `word` (`word`)
)
COMMENT='List of words to be used for jumble game.'
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
Now, in the game engine, I want to fetch 20 random words as a string. I can achieve this with a query similar to this:
SELECT GROUP_CONCAT(f.word SEPARATOR ', ' )
FROM ( SELECT j.word AS word
FROM jumblewords j
ORDER BY RAND()
LIMIT 20) f
but I have to execute this statement every time the list expires (all 20 words have been put before the user).
Can I modify this query so that I can fetch more than one row with the results as generated from the query I have above?
Probably an easier way to solve this problem is to store the random words in a temporary table and extract the values later. A stored procedure would be perfect for that.
DELIMITER //
DROP PROCEDURE IF EXISTS sp_jumblewords //
CREATE PROCEDURE sp_jumblewords(no_lines INT)
BEGIN
DROP TEMPORARY TABLE IF EXISTS tmp_jumblewords;
CREATE TEMPORARY TABLE tmp_jumblewords (
`word` VARCHAR(340) NOT NULL);
REPEAT
INSERT INTO tmp_jumblewords
SELECT GROUP_CONCAT(f.word SEPARATOR ', ' )
FROM ( SELECT j.word AS word
FROM jumblewords j
ORDER BY RAND()
LIMIT 20) f;
SET no_lines = no_lines - 1;
UNTIL no_lines = 0
END REPEAT;
SELECT * FROM tmp_jumblewords;
END //
DELIMITER ;
CALL sp_jumblewords(20);
We have a system that has a database-based queue for processing items in threads instead of real time. It's currently implemented in MyBatis calling this stored procedure in MySQL:
DROP PROCEDURE IF EXISTS pop_invoice_queue;
DELIMITER ;;
CREATE PROCEDURE pop_invoice_queue(IN compId int(11), IN limitRet int(11)) BEGIN
SELECT LAST_INSERT_ID(id) as value, InvoiceQueue.* FROM InvoiceQueue
WHERE companyid = compId
AND (lastPopDate is null OR lastPopDate < DATE_SUB(NOW(), INTERVAL 3 MINUTE)) LIMIT limitRet FOR UPDATE;
UPDATE InvoiceQueue SET lastPopDate=NOW() WHERE id=LAST_INSERT_ID();
END;;
DELIMITER ;
The problem is that this pops N items from the queue but only updates the lastPopDate value for the last item popped off the queue. So if we call this stored procedure with limitRet = 5, it will pop five items off the queue and start working on them, but only the fifth item will have lastPopDate set, so when the next thread comes and pops off the queue it will get items 1-4 and item 6.
How can we get this to update all N records 'popped' off the database?
If you are willing to add a BIGINT field to the table via:
ALTER TABLE InvoiceQueue
ADD uuid BIGINT NULL DEFAULT NULL,
INDEX ix_uuid (uuid);
then you can do the update first, and select the records updated, via:
CREATE PROCEDURE pop_invoice_queue(IN compId int(11), IN limitRet int(11))
BEGIN
SET @uuid = UUID_SHORT();
UPDATE InvoiceQueue
SET uuid = @uuid,
lastPopDate = NOW()
WHERE companyid = compId
AND uuid IS NULL
AND (lastPopDate IS NULL OR lastPopDate < NOW() - INTERVAL 3 MINUTE)
ORDER BY
id
LIMIT limitRet;
SELECT *
FROM InvoiceQueue
WHERE uuid = @uuid
FOR UPDATE;
END;;
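A hypothetical invocation (the company id and batch size here are made up):
CALL pop_invoice_queue(42, 5);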
For the UUID_SHORT() function to return unique values, it should be called no more than 16 million times a second per machine. See the MySQL documentation for UUID_SHORT() for more details.
For performance, you may want to alter the lastPopDate field to be NOT NULL as the OR clause will cause your query to not use an index, even if one is available:
ALTER TABLE InvoiceQueue
MODIFY lastPopDate DATETIME NOT NULL DEFAULT '0000-00-00';
Then, if you do not already have one, you could add an index on the companyid/lastPopDate/uuid fields, as follows:
ALTER TABLE InvoiceQueue
ADD INDEX ix_company_lastpop (companyid, lastPopDate, uuid);
Then you can remove the OR clause from your UPDATE query:
UPDATE InvoiceQueue
SET uuid = @uuid,
lastPopDate = NOW()
WHERE companyid = compId
AND lastPopDate < NOW() - INTERVAL 3 MINUTE
ORDER BY
id
LIMIT limitRet;
which will use the index you just created.
Since MySQL has neither collections nor an OUTPUT/RETURNING clause, my suggestion is to use temporary tables. Something like:
CREATE TEMPORARY TABLE temp_data
SELECT LAST_INSERT_ID(id) as value, InvoiceQueue.* FROM InvoiceQueue
WHERE companyid = compId
AND (lastPopDate is null OR lastPopDate < DATE_SUB(NOW(), INTERVAL 3 MINUTE)) LIMIT limitRet FOR UPDATE;
UPDATE InvoiceQueue
INNER JOIN temp_data ON (InvoiceQueue.PKColumn = temp_data.PKColumn)
SET lastPopDate=NOW();
SELECT * FROM temp_data ;
DROP TEMPORARY TABLE temp_data;
Also, I suspect such SELECT ... FOR UPDATE can cause deadlocks (certainly if the procedure is called from different sessions): as far as I know, the order in which rows get locked is not guaranteed (even with ORDER BY, rows might be locked in a different order). I'd recommend double-checking the documentation.
I have one mysql table:
CREATE TABLE IF NOT EXISTS `test` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`SenderId` int(10) unsigned NOT NULL,
`ReceiverId` int(10) unsigned NOT NULL,
`DateSent` datetime NOT NULL,
`Notified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`Id`),
KEY `ReceiverId_SenderId` (`ReceiverId`,`SenderId`),
KEY `SenderId` (`SenderId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
The table is populated with 10,000 random rows for testing by using the following procedure:
DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE `FillTest`(IN `cnt` INT)
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE intSenderId INT;
DECLARE intReceiverId INT;
DECLARE dtDateSent DATE;
DECLARE blnNotified INT;
WHILE (i<=cnt) DO
SET intSenderId = FLOOR(1 + (RAND() * 50));
SET intReceiverId = FLOOR(51 + (RAND() * 50));
SET dtDateSent = str_to_date(concat(floor(1 + rand() * (12-1)),'-',floor(1 + rand() * (28 -1)),'-','2008'),'%m-%d-%Y');
SET blnNotified = FLOOR(1 + (RAND() * 2))-1;
INSERT INTO test (SenderId, ReceiverId, DateSent, Notified)
VALUES(intSenderId,intReceiverId,dtDateSent, blnNotified);
SET i=i+1;
END WHILE;
END//
DELIMITER ;
CALL `FillTest`(10000);
The problem:
I need to write a query which will group by 'SenderId, ReceiverId' and return the highest Id of each group, limited to the first 100 rows ordered by Id in ascending order.
I played with GROUP BY, ORDER BY and MAX(Id), but the query was too slow, so I came up with this query:
SELECT SQL_NO_CACHE t1.*
FROM test t1
LEFT JOIN test t2 ON (t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
WHERE t2.Id IS NULL
ORDER BY t1.Id ASC
LIMIT 100;
The above query returns the correct data, but it becomes too slow when the test table has more than 150,000 rows. On 150,000 rows the above query needs 7 seconds to complete. I expect the test table to have between 500,000 and 1M rows, and the query needs to return the correct data in less than 3 sec. If it's not possible to fetch the correct data in less than 3 sec, then I need it to fetch the data using the fastest query possible.
So, how can the above query be optimized so that it runs faster?
Reasons why this query may be slow:
It's a lot of data, and lots of it may be returned: the query returns the last record for each SenderId/ReceiverId combination.
The distribution of the data matters: there may be many Sender/Receiver combinations, or relatively few of them but each with multiple 'versions'.
The whole result set must be sorted by MySQL, because you need the first 100 records, sorted by Id.
These make it hard to optimize this query without restructuring the data. A few suggestions to try:
- You could try using NOT EXISTS, although I doubt if it would help.
SELECT SQL_NO_CACHE t1.*
FROM test t1
WHERE NOT EXISTS
(SELECT 'x'
FROM test t2
WHERE t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
ORDER BY t1.Id ASC
LIMIT 100;
- You could try using proper indexes on ReceiverId, SenderId and Id. Experiment with creating a combined index on the three columns. Try two versions, one with Id being the first column, and one with Id being the last.
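For example, the two variants could look like this (index names are made up):
ALTER TABLE test ADD INDEX ix_id_first (Id, ReceiverId, SenderId);
ALTER TABLE test ADD INDEX ix_id_last (ReceiverId, SenderId, Id);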
With slight database modifications:
- You could save a combination of SenderId/ReceiverId in a separate table with a LastId pointing to the record you want.
- You could save a 'PreviousId' with each record, keeping it NULL for the last record per Sender/Receiver. You only need to query the records where previousId is null.
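A rough sketch of that last idea (column and index names are made up, and maintaining PreviousId on insert is left to the application or a trigger):
ALTER TABLE test ADD COLUMN PreviousId int unsigned NULL,
                 ADD INDEX ix_previous_id (PreviousId, Id);
SELECT t.*
FROM test t
WHERE t.PreviousId IS NULL  -- only the last record per Sender/Receiver pair
ORDER BY t.Id ASC
LIMIT 100;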