I have one MySQL table:
CREATE TABLE IF NOT EXISTS `test` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`SenderId` int(10) unsigned NOT NULL,
`ReceiverId` int(10) unsigned NOT NULL,
`DateSent` datetime NOT NULL,
`Notified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`Id`),
KEY `ReceiverId_SenderId` (`ReceiverId`,`SenderId`),
KEY `SenderId` (`SenderId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
The table is populated with 10,000 random rows for testing using the following procedure:
DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE `FillTest`(IN `cnt` INT)
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE intSenderId INT;
DECLARE intReceiverId INT;
DECLARE dtDateSent DATE;
DECLARE blnNotified INT;
WHILE (i<=cnt) DO
SET intSenderId = FLOOR(1 + (RAND() * 50));
SET intReceiverId = FLOOR(51 + (RAND() * 50));
SET dtDateSent = str_to_date(concat(floor(1 + rand() * (12-1)),'-',floor(1 + rand() * (28 -1)),'-','2008'),'%m-%d-%Y');
SET blnNotified = FLOOR(1 + (RAND() * 2))-1;
INSERT INTO test (SenderId, ReceiverId, DateSent, Notified)
VALUES(intSenderId,intReceiverId,dtDateSent, blnNotified);
SET i=i+1;
END WHILE;
END//
DELIMITER ;
CALL `FillTest`(10000);
The problem:
I need to write a query which will group by SenderId, ReceiverId and return the highest Id of each group, then take the first 100 of those rows ordered by Id in ascending order.
I played with GROUP BY, ORDER BY and MAX(Id), but the query was too slow, so I came up with this query:
SELECT SQL_NO_CACHE t1.*
FROM test t1
LEFT JOIN test t2 ON (t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
WHERE t2.Id IS NULL
ORDER BY t1.Id ASC
LIMIT 100;
The above query returns the correct data, but it becomes too slow when the test table has more than 150,000 rows. At 150,000 rows the above query needs 7 seconds to complete. I expect the test table to have between 500,000 and 1M rows, and the query needs to return the correct data in less than 3 seconds. If it's not possible to fetch the correct data in less than 3 seconds, then I need it to fetch the data using the fastest query possible.
So, how can the above query be optimized so that it runs faster?
Reasons why this query may be slow:
- It's a lot of data, and much of it may be returned: the query returns the last record for each SenderId/ReceiverId combination.
- The distribution of the data (many Sender/Receiver combinations, or relatively few of them but with multiple 'versions').
- The whole result set must be sorted by MySQL, because you need the first 100 records, sorted by Id.
These make it hard to optimize this query without restructuring the data. A few suggestions to try:
- You could try using NOT EXISTS, although I doubt it would help.
SELECT SQL_NO_CACHE t1.*
FROM test t1
WHERE NOT EXISTS
(SELECT 'x'
FROM test t2
WHERE t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
ORDER BY t1.Id ASC
LIMIT 100;
- You could try using proper indexes on ReceiverId, SenderId and Id. Experiment with creating a combined index on the three columns. Try two versions, one with Id being the first column, and one with Id being the last.
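For example, the two combined-index variants could look like this (a sketch; the index names are arbitrary):
ALTER TABLE test ADD INDEX idx_id_first (Id, ReceiverId, SenderId);
ALTER TABLE test ADD INDEX idx_id_last (ReceiverId, SenderId, Id);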
With slight database modifications:
- You could save a combination of SenderId/ReceiverId in a separate table with a LastId pointing to the record you want (see the sketch after this list).
- You could save a 'PreviousId' with each record, keeping it NULL for the last record per Sender/Receiver. You only need to query the records where previousId is null.
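A sketch of the first suggestion, with hypothetical table and trigger names; a trigger keeps the LastId pointer current, assuming Id values only ever grow:
CREATE TABLE test_last (
SenderId int(10) unsigned NOT NULL,
ReceiverId int(10) unsigned NOT NULL,
LastId int(10) unsigned NOT NULL,
PRIMARY KEY (SenderId, ReceiverId)
) ENGINE=InnoDB;
CREATE TRIGGER test_track_last AFTER INSERT ON test
FOR EACH ROW
INSERT INTO test_last (SenderId, ReceiverId, LastId)
VALUES (NEW.SenderId, NEW.ReceiverId, NEW.Id)
ON DUPLICATE KEY UPDATE LastId = NEW.Id;
The slow query then becomes a simple join:
SELECT t.*
FROM test_last l
JOIN test t ON t.Id = l.LastId
ORDER BY t.Id ASC
LIMIT 100;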
Related
My database has a staging table with the following structure:
CREATE TABLE featureMappings (
id bigint(20) NOT NULL AUTO_INCREMENT,
visitId bigint(20) NOT NULL,
featureId bigint(20) NOT NULL,
textValue text DEFAULT NULL,
hashTextValue char(32) GENERATED ALWAYS AS (MD5(textValue)) VIRTUAL,
PRIMARY KEY (id));
ALTER TABLE featureMappings
ADD INDEX fsHashTextValue (featureId, hashTextValue);
In a typical run this table has approximately 40 - 100 million rows. There are a lot of duplicate text values so I am using the hashTextValue key to be able to index on this column.
The following query takes about 25 seconds to run:
CREATE TEMPORARY TABLE temp AS
SELECT
featureId,
hashTextValue
FROM
featureMappings
GROUP BY featureId, hashTextValue
Question
I'd like to extract the value in the textValue column alongside the featureId and hashTextValue columns.
I have tried two approaches. Both of these dramatically increased the query time, so I'm looking for a better solution.
Slow Option 1 - Adding textValue to the query
When running the below change to the query, the processing time went from 25 seconds to about 10 minutes. I've tried to google how textValue is retrieved when not using an aggregate function, but could not find a clear answer.
CREATE TEMPORARY TABLE temp AS
SELECT
featureId,
hashTextValue,
textValue # I also tried MIN(textValue)
FROM
featureMappings
GROUP BY featureId, hashTextValue
Complicated Option 2: Iterative Update
My preferred approach is to iterate over the unique combinations of the first query and then run a loop over the following queries:
SELECT featureId, hashTextValue INTO @fid, @htv
FROM temp
WHERE textValue is NULL and hashTextValue IS NOT NULL
LIMIT 1;
SELECT textValue
INTO @textValue
FROM featureMappings
WHERE featureId = @fid and hashTextValue = @htv
LIMIT 1;
UPDATE temp
SET textValue = @textValue
WHERE featureId = @fid AND hashTextValue = @htv;
Server Configuration
This is being run on AWS RDS Aurora based on MySQL 5.7. The server has limited (2GB) memory and usually has less freeable memory than the index size on the table.
Plan A: Dedup as you load. This is trivially done by making the PK of featureMappings be PRIMARY KEY(featureId, hashTextValue) and using INSERT IGNORE when loading the staging table.
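A minimal sketch of Plan A, assuming the surrogate id can be dropped and textValue is never NULL (primary key columns cannot be NULL, and hashTextValue must be a STORED generated column rather than VIRTUAL to take part in the primary key):
CREATE TABLE featureMappings (
visitId bigint(20) NOT NULL,
featureId bigint(20) NOT NULL,
textValue text NOT NULL,
hashTextValue char(32) GENERATED ALWAYS AS (MD5(textValue)) STORED,
PRIMARY KEY (featureId, hashTextValue)
);
-- duplicate (featureId, hashTextValue) rows are silently skipped on load:
INSERT IGNORE INTO featureMappings (visitId, featureId, textValue)
VALUES (1, 42, 'example text');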
Plan B: (Assuming there is something preventing Plan A) Change the table to have these indexes:
PRIMARY KEY (featureId, hashTextValue, id),
INDEX(id)
This still has the dups, but I am unclear on what needs to happen next.
Further...
SELECT featureId, hashTextValue INTO @fid, @htv
FROM temp
WHERE textValue is NULL and hashTextValue IS NOT NULL
LIMIT 1;
This has the problem of getting slower and slower as you eat through the items that match. It would be better to add an explicit PRIMARY KEY and walk through temp. In fact, it will be an order of magnitude faster (if temp is large). Let's say id is the PK; then:
SELECT @id := id, @fid := featureId, @htv := hashTextValue
FROM temp
WHERE textValue is NULL and hashTextValue IS NOT NULL
AND id > @id -- this picks up 'where you left off'
LIMIT 1;
(Initialize with SET @id := 0;)
Now that you have the id, the UPDATE becomes simpler and faster.
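Putting the pieces together, each pass of the loop would look like this (a sketch, assuming temp has PRIMARY KEY(id) and a textValue column to fill):
SET @id := 0; -- run once, before the loop
-- repeat the following three statements until the first SELECT finds no row:
SELECT @id := id, @fid := featureId, @htv := hashTextValue
FROM temp
WHERE textValue IS NULL AND hashTextValue IS NOT NULL
AND id > @id
ORDER BY id
LIMIT 1;
SELECT textValue INTO @textValue
FROM featureMappings
WHERE featureId = @fid AND hashTextValue = @htv
LIMIT 1;
UPDATE temp
SET textValue = @textValue
WHERE id = @id; -- single-row update by primary key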
I have the following table in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_bin NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_bin NOT NULL,
`has_data` tinyint(1) NOT NULL,
`data` json DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=COMPRESSED;
ALTER TABLE `events`
ADD PRIMARY KEY (`pv_name`,`time_stamp`), ADD KEY `has_data` (`has_data`,`pv_name`,`time_stamp`);
I have been struggling to construct an efficient query to find each pv_name that has at least one change in value in a given time interval.
I believe that the query I currently have is inefficient because it finds all of the distinct values in the given time interval for each pv_name, instead of stopping as soon as it finds more than one:
SELECT events.pv_name
FROM events
WHERE events.time_stamp > 0 AND events.time_stamp < 9999999999999999999
GROUP BY events.pv_name
HAVING COUNT(DISTINCT JSON_EXTRACT(events.data, '$.value')) > 1;
To avoid this I am considering breaking the count and distinct parts into separate steps, since the documentation says that:
When combining LIMIT row_count with DISTINCT, MySQL stops as soon as
it finds row_count unique rows.
Is there an efficient query to find a pair of distinct values for each pv_name in a given time interval, that does not have to find all of the distinct values for each pv_name in a given time interval?
EDIT @Rick James
I am essentially trying to find a faster, non-cursor-based solution for this:
SET @old_sql_mode=@@sql_mode, sql_mode='STRICT_ALL_TABLES';
DELIMITER //
DROP PROCEDURE IF EXISTS check_for_change;
CREATE PROCEDURE check_for_change(IN t0_in bigint(20) unsigned, IN t1_in bigint(20) unsigned)
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE current_pv_name VARCHAR(60);
DECLARE cur CURSOR FOR SELECT DISTINCT pv_name FROM events;
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = TRUE;
SET @t0_in := t0_in;
SET @t1_in := t1_in;
IF @t0_in > @t1_in THEN
SET @temp := @t0_in;
SET @t0_in := @t1_in;
SET @t1_in := @temp;
END IF;
DROP TEMPORARY TABLE IF EXISTS has_change;
CREATE TEMPORARY TABLE has_change (
pv_name varchar(60) NOT NULL,
PRIMARY KEY (pv_name)
) ENGINE=Memory DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
OPEN cur;
label1: LOOP
FETCH cur INTO current_pv_name;
IF done THEN
LEAVE label1;
END IF;
INSERT INTO has_change
SELECT current_pv_name
FROM (
SELECT DISTINCT JSON_EXTRACT(events.data, '$.value') AS distinct_value
FROM events
WHERE events.pv_name = current_pv_name
AND events.has_data = 1
AND events.time_stamp > @t0_in AND events.time_stamp < @t1_in
LIMIT 2 ) AS t
HAVING COUNT(t.distinct_value) = 2;
END LOOP;
CLOSE cur;
END //
DELIMITER ;
SET sql_mode=@old_sql_mode;
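For reference, the procedure would be invoked like this (the bounds here are placeholders); has_change is a session temporary table, so it can be queried right after the call:
CALL check_for_change(0, 9999999999999999999);
SELECT pv_name FROM has_change;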
The optimization here is in the application of the limit on the number of distinct values to find for each pv_name.
There is no LIMIT, so the quote does not apply. (Or at least, I think not.)
COUNT(DISTINCT ...) will, in some cases, do a "loose scan", which is better than reading every row. For example,
SELECT name
FROM tbl
GROUP BY name
HAVING COUNT(DISTINCT foo) > 3;
together with INDEX(name, foo) would probably leapfrog through the index to do the COUNT DISTINCT of foos for each name. Granted, this is not "stopping at 3" as you requested.
You can demonstrate the above by doing
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
This shows whether the Handler_read counts are the size of the table (a full scan) or much smaller (a loose scan).
The loose scan is not applicable to your particular query for multiple reasons.
Bottom line: "No, you can't achieve your goal".
Also, the stored routine you wrote will probably be much slower than simply accepting the overhead of a full scan.
The base query works as intended, but when I try to sum the first column the result is supposed to be 5, yet instead I get 4. Why?
base query:
SET @last_task = 0;
SELECT
IF(@last_task = RobotShortestPath, 0, 1) AS new_task,
@last_task := RobotShortestPath
FROM rob_log
ORDER BY rog_log_id;
1 1456
0 1456
0 1456
1 1234
0 1234
1 1456
1 2556
1 1456
sum query
SET @last_task = 0;
SELECT SUM(new_task) AS tasks_performed
FROM (
SELECT
IF(@last_task = RobotShortestPath, 0, 1) AS new_task,
@last_task := RobotShortestPath
FROM rob_log
ORDER BY rog_log_id
) AS tmp
4
table structure
CREATE TABLE rob_log (
rog_log_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
# RobotPosX FLOAT NOT NULL,
# RobotPosY FLOAT NOT NULL,
# RobotPosDir TINYINT UNSIGNED NOT NULL,
RobotShortestPath MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(rog_log_id),
KEY (rog_log_id, RobotShortestPath)
);
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 1234;
INSERT INTO rob_log(RobotShortestPath) SELECT 1234;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 2556;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
testing it at sqlfiddle: http://sqlfiddle.com/#!2/e80f5/3
as an answer for Counting changes in timeline with MySQL
but got really confused.
Here's the reason (as discussed on Twitter):
The variable @last_task was defined in a separate query "batch". I break up the queries on SQL Fiddle into individual batches, executed separately. I do this so you can see the output from each batch as a distinct result set below. In your Fiddle, you can see that there are two sets of output: http://sqlfiddle.com/#!2/e80f5/3/0 and http://sqlfiddle.com/#!2/e80f5/3/1. These map to the two statements you are running (the set and the select). The problem is, your set statement defines a variable that only exists in the first batch; when the select statement runs, it is a separate batch and your variable isn't defined within that context.
To correct this problem, all you have to do is define a different query terminator. Note the dropdown box/button under both the schema and the query panels ( [ ; ] ) - click on that, and you can choose something other than semicolon (the default). Then your two statements will be included together as part of the same batch, and you'll get the result you want. For example:
http://sqlfiddle.com/#!2/e80f5/9
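An alternative that sidesteps the batch issue entirely is to initialize the variable inside the query itself through a derived table, so no separate SET statement is needed:
SELECT
IF(@last_task = RobotShortestPath, 0, 1) AS new_task,
@last_task := RobotShortestPath
FROM rob_log, (SELECT @last_task := 0) AS init
ORDER BY rog_log_id;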
It's probably some bug in an older version of MySQL.
I have tried it on MySQL 5.5 and it's working perfectly.
Consider a structure where you have a many-to-one (or one-to-many) relationship with a condition (where, order by, etc.) on both tables. For example:
CREATE TABLE tableTwo (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
eventTime DATETIME NOT NULL,
INDEX (eventTime)
) ENGINE=InnoDB;
CREATE TABLE tableOne (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
tableTwoId INT UNSIGNED NOT NULL,
objectId INT UNSIGNED NOT NULL,
INDEX (objectID),
FOREIGN KEY (tableTwoId) REFERENCES tableTwo (id)
) ENGINE=InnoDB;
and for an example query:
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where objectId = '..'
order by eventTime;
Let's say you index tableOne.objectId and tableTwo.eventTime. If you then explain on the above query, it will show "Using filesort". Essentially, it first applies the tableOne.objectId index, but it can't apply the tableTwo.eventTime index because that index is for the entirety of tableTwo (not the limited result set), and thus it must do a manual sort.
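You can confirm this by prefixing the query with EXPLAIN and checking the Extra column for "Using filesort":
EXPLAIN select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where t1.objectId = '..'
order by t2.eventTime;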
Thus, is there a way to do a cross-table index so it wouldn't have to filesort each time results are retrieved? Something like:
create index ind_t1oi_t2et on tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
(t1.objectId, t2.eventTime);
Also, I've looked into creating a view and indexing that, but indexing is not supported for views.
The solution I've been leaning towards if cross-table indexing isn't possible is replicating the conditional data in one table. In this case that means eventTime would be replicated in tableOne and a multi-column index would be set up on tableOne.objectId and tableOne.eventTime (essentially manually creating the index). However, I thought I'd seek out other people's experience first to see if that was the best way.
Thanks much!
Update:
Here are some procedures for loading test data and comparing results:
drop procedure if exists populate_table_two;
delimiter #
create procedure populate_table_two(IN numRows int)
begin
declare v_counter int unsigned default 0;
while v_counter < numRows do
insert into tableTwo (eventTime)
values (CURRENT_TIMESTAMP - interval 0 + floor(0 + rand()*1000) minute);
set v_counter=v_counter+1;
end while;
end #
delimiter ;
drop procedure if exists populate_table_one;
delimiter #
create procedure populate_table_one
(IN numRows int, IN maxTableTwoId int, IN maxObjectId int)
begin
declare v_counter int unsigned default 0;
while v_counter < numRows do
insert into tableOne (tableTwoId, objectId)
values (floor(1 +(rand() * maxTableTwoId)),
floor(1 +(rand() * maxObjectId)));
set v_counter=v_counter+1;
end while;
end #
delimiter ;
You can use these as follows to populate 10,000 rows in tableTwo and 20,000 rows in tableOne (with random references to tableTwo and random objectIds between 1 and 5), which took 26.2 and 70.77 seconds respectively to run for me:
call populate_table_two(10000);
call populate_table_one(20000, 10000, 5);
Update 2 (Tested Triggering SQL):
Below is the tried and tested SQL based on daniHp's triggering method. This keeps the eventTime in sync on tableOne when a tableOne row is inserted or a tableTwo row is updated. This method should also work for many-to-many relationships if the condition columns are copied to the joining table. In my testing with 300,000 rows in tableOne and 200,000 rows in tableTwo, the old query with similar limits took 0.12 sec, while the new query still shows as 0.00 seconds. Thus, there is a clear improvement, and this method should perform well into the millions of rows and beyond.
alter table tableOne add column tableTwo_eventTime datetime;
create index ind_t1_oid_t2et on tableOne (objectId, tableTwo_eventTime);
drop TRIGGER if exists t1_copy_t2_eventTime;
delimiter #
CREATE TRIGGER t1_copy_t2_eventTime
BEFORE INSERT ON tableOne
for each row
begin
set NEW.tableTwo_eventTime = (select eventTime
from tableTwo t2
where t2.id = NEW.tableTwoId);
end #
delimiter ;
drop TRIGGER if exists upd_t1_copy_t2_eventTime;
delimiter #
CREATE TRIGGER upd_t1_copy_t2_eventTime
BEFORE UPDATE ON tableTwo
for each row
begin
update tableOne
set tableTwo_eventTime = NEW.eventTime
where tableTwoId = NEW.id;
end #
delimiter ;
And the updated query:
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where t1.objectId = 1
order by t1.tableTwo_eventTime desc limit 0,10;
As you know, SQL Server achieves this with indexed views:
indexed views provide additional performance benefits that cannot be
achieved using standard indexes. Indexed views can increase query
performance in the following ways:
Aggregations can be precomputed and stored in the index to minimize
expensive computations during query execution.
Tables can be prejoined and the resulting data set stored.
Combinations of joins or aggregations can be stored.
In SQL Server, to take advantage of this technique, you must query over the view and not over the tables. That means that you should know about the view and its indexes.
MySQL does not have indexed views, but you can simulate the behavior with table + triggers + indexes.
Instead of creating a view, you must create an indexed table, a trigger to keep the data table up to date, and then you must query your new table instead of your normalized tables.
You must evaluate if the overhead of write operations offsets the improvement in read operations.
Edited:
Note that it is not always necessary to create a new table. For example, in a 1:N (master-detail) relationship, a trigger can keep a copy of a field from the 'master' table in the 'detail' table. In your case:
CREATE TABLE tableOne (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
tableTwoId INT UNSIGNED NOT NULL,
objectId INT UNSIGNED NOT NULL,
desnormalized_eventTime DATETIME NOT NULL,
INDEX (objectID),
FOREIGN KEY (tableTwoId) REFERENCES tableTwo (id)
) ENGINE=InnoDB;
CREATE TRIGGER tableOne_desnormalized_eventTime
BEFORE INSERT ON tableOne
for each row
begin
DECLARE v_eventTime DATETIME;
SET v_eventTime =
(select t2.eventTime
from tableTwo t2
where t2.id = NEW.tableTwoId);
SET NEW.desnormalized_eventTime = v_eventTime;
end;
Notice that this is a before insert trigger.
Now, the query is rewritten as follows:
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where t1.objectId = '..'
order by t1.desnormalized_eventTime;
Disclaimer: not tested.
Cross-table indexing is not possible in MySQL except via the now-defunct Akiban(?) Engine.
I have a rule: "Do not normalize 'continuous' values such as INTs, FLOATs, DATETIMEs, etc." The cost of the JOIN when you need to sort or range-test on the continuous value will kill performance.
DATETIME takes 5 bytes; INT takes 4. So any 'space' argument toward normalizing a datetime is rather poor. It is rare that you would need to 'normalize' a datetime in the off chance that all uses of a particular value were to change.
Maybe I'm wrong, but if this were my application I would not duplicate the data unless I needed to order by two columns in two different tables and the query were hot (required many times). Since there is no clear-cut solution to avoid the filesort, what about this little trick (forcing the optimizer to use the index on the ORDER BY column, eventTime)?
select * from tableOne t1
inner join tableTwo t2 use index (eventTime) on t1.tableTwoId = t2.id and t2.eventTime > 0
where t1.objectId = 1
order by t2.eventTime desc limit 0,10;
Notice the use index (eventTime) hint and the extra condition t2.eventTime > 0.
Its EXPLAIN shows that the optimizer used the index on eventTime instead of a filesort:
1 SIMPLE t2 range eventTime eventTime 5 5000 Using where; Using index
1 SIMPLE t1 ref objectId,tableTwoId tableTwoId 4 tests.t2.id 1 Using where
The following MySQL statement is working fine, and it returns the row number as row for each result. But now, what I want to do is set the column pos to the value of row using an UPDATE statement, since I don't want to loop over thousands of records with single queries.
Any ideas?
SELECT @row := @row + 1 AS row, u.ID, u.pos
FROM user u, (SELECT @row := 0) r
WHERE u.year<=2010
ORDER BY u.pos ASC LIMIT 0,10000
There is a risk in using user-defined variables:
In a SELECT statement, each select expression is evaluated only when sent to the client. This means that in a HAVING, GROUP BY, or ORDER BY clause, referring to a variable that is assigned a value in the select expression list does not work as expected:
A safer method would be:
create table tmp_table
(
pos int(10) unsigned not null auto_increment,
user_id int(10) not null default 0,
primary key (pos)
);
insert into tmp_table
select null, u.ID
from user u
where u.year<=2010
order by YOUR_ORDERING_DECISION
limit 0, 10000;
alter table tmp_table add index (user_id);
update user, tmp_table
set user.pos=tmp_table.pos
where user.id=tmp_table.user_id;
drop table tmp_table;