Consider a structure where you have a many-to-one (or one-to-many) relationship with a condition (where, order by, etc.) on both tables. For example:
CREATE TABLE tableTwo (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
eventTime DATETIME NOT NULL,
INDEX (eventTime)
) ENGINE=InnoDB;
CREATE TABLE tableOne (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
tableTwoId INT UNSIGNED NOT NULL,
objectId INT UNSIGNED NOT NULL,
INDEX (objectID),
FOREIGN KEY (tableTwoId) REFERENCES tableTwo (id)
) ENGINE=InnoDB;
and for an example query:
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where objectId = '..'
order by eventTime;
Let's say you index tableOne.objectId and tableTwo.eventTime (as in the DDL above). If you then run EXPLAIN on the above query, it will show "Using filesort". Essentially, it first applies the tableOne.objectId index, but it can't use the tableTwo.eventTime index for the ordering because that index covers the entirety of tableTwo (not the limited result set), and thus it must do a manual sort.
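For reference, the check can be reproduced like this (using an example objectId value):
EXPLAIN
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where t1.objectId = 1
order by t2.eventTime;
-- the Extra column of the plan includes "Using filesort" (exact wording varies by version)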
Thus, is there a way to do a cross-table index so it wouldn't have to filesort each time results are retrieved? Something like:
create index ind_t1oi_t2et on tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
(t1.objectId, t2.eventTime);
Also, I've looked into creating a view and indexing that, but indexing is not supported for views.
The solution I've been leaning towards if cross-table indexing isn't possible is replicating the conditional data in one table. In this case that means eventTime would be replicated in tableOne and a multi-column index would be set up on tableOne.objectId and tableOne.eventTime (essentially manually creating the index). However, I thought I'd seek out other people's experience first to see if that was the best way.
Thanks much!
Update:
Here are some procedures for loading test data and comparing results:
drop procedure if exists populate_table_two;
delimiter #
create procedure populate_table_two(IN numRows int)
begin
declare v_counter int unsigned default 0;
while v_counter < numRows do
insert into tableTwo (eventTime)
values (CURRENT_TIMESTAMP - interval 0 + floor(0 + rand()*1000) minute);
set v_counter=v_counter+1;
end while;
end #
delimiter ;
drop procedure if exists populate_table_one;
delimiter #
create procedure populate_table_one
(IN numRows int, IN maxTableTwoId int, IN maxObjectId int)
begin
declare v_counter int unsigned default 0;
while v_counter < numRows do
insert into tableOne (tableTwoId, objectId)
values (floor(1 +(rand() * maxTableTwoId)),
floor(1 +(rand() * maxObjectId)));
set v_counter=v_counter+1;
end while;
end #
delimiter ;
You can use these as follows to populate 10,000 rows in tableTwo and 20,000 rows in tableOne (with random references to tableTwo and random objectIds between 1 and 5), which took 26.2 and 70.77 seconds respectively to run for me:
call populate_table_two(10000);
call populate_table_one(20000, 10000, 5);
Update 2 (Tested Triggering SQL):
Below is the tried and tested SQL based on daniHp's triggering method. This keeps the eventTime in sync on tableOne when a tableOne row is added or tableTwo is updated. This method should also work for many-to-many relationships if the condition columns are copied to the joining table. In my testing of 300,000 rows in tableOne and 200,000 rows in tableTwo, the old query with similar limits took 0.12 sec while the new query still shows as 0.00 seconds. Thus, there is a clear improvement, and this method should perform well into the millions of rows and beyond.
alter table tableOne add column tableTwo_eventTime datetime;
create index ind_t1_oid_t2et on tableOne (objectId, tableTwo_eventTime);
drop TRIGGER if exists t1_copy_t2_eventTime;
delimiter #
CREATE TRIGGER t1_copy_t2_eventTime
BEFORE INSERT ON tableOne
for each row
begin
set NEW.tableTwo_eventTime = (select eventTime
from tableTwo t2
where t2.id = NEW.tableTwoId);
end #
delimiter ;
drop TRIGGER if exists upd_t1_copy_t2_eventTime;
delimiter #
CREATE TRIGGER upd_t1_copy_t2_eventTime
BEFORE UPDATE ON tableTwo
for each row
begin
update tableOne
set tableTwo_eventTime = NEW.eventTime
where tableTwoId = NEW.id;
end #
delimiter ;
And the updated query:
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where t1.objectId = 1
order by t1.tableTwo_eventTime desc limit 0,10;
As you know, SQL Server achieves this with indexed views:
indexed views provide additional performance benefits that cannot be
achieved using standard indexes. Indexed views can increase query
performance in the following ways:
Aggregations can be precomputed and stored in the index to minimize
expensive computations during query execution.
Tables can be prejoined and the resulting data set stored.
Combinations of joins or aggregations can be stored.
In SQL Server, to take advantage of this technique, you must query the view and not the underlying tables. That means that you need to know about the view and its indexes.
MySQL does not have indexed views, but you can simulate the behavior with table + triggers + indexes.
Instead of creating a view, you create a table with the appropriate indexes, triggers to keep it up to date, and then you query that new table instead of your normalized tables.
You must evaluate whether the overhead added to write operations outweighs the improvement in read operations.
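As a rough sketch of that idea for the tables above (untested; the table and trigger names are invented, and matching UPDATE/DELETE triggers would also be needed to keep everything in sync):
-- "materialized view" table holding the prejoined, indexed result
CREATE TABLE tableOne_tableTwo_mv (
  tableOneId INT UNSIGNED PRIMARY KEY,
  objectId INT UNSIGNED NOT NULL,
  eventTime DATETIME NOT NULL,
  INDEX (objectId, eventTime)
) ENGINE=InnoDB;

delimiter #
CREATE TRIGGER t1_mv_insert AFTER INSERT ON tableOne
for each row
begin
  insert into tableOne_tableTwo_mv (tableOneId, objectId, eventTime)
  select NEW.id, NEW.objectId, t2.eventTime
  from tableTwo t2
  where t2.id = NEW.tableTwoId;
end #
delimiter ;

-- queries then read the new table instead of performing the join:
-- select * from tableOne_tableTwo_mv where objectId = 1 order by eventTime;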
Edited:
Note that it is not always necessary to create a new table. For example, in a 1:N (master-detail) relationship, a trigger can keep a copy of a field from the 'master' table in the 'detail' table. In your case:
CREATE TABLE tableOne (
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
tableTwoId INT UNSIGNED NOT NULL,
objectId INT UNSIGNED NOT NULL,
desnormalized_eventTime DATETIME NOT NULL,
INDEX (objectID),
FOREIGN KEY (tableTwoId) REFERENCES tableTwo (id)
) ENGINE=InnoDB;
delimiter #
CREATE TRIGGER tableOne_desnormalized_eventTime
BEFORE INSERT ON tableOne
for each row
begin
DECLARE v_eventTime DATETIME;
SET v_eventTime =
(select t2.eventTime
from tableTwo t2
where t2.id = NEW.tableTwoId);
SET NEW.desnormalized_eventTime = v_eventTime;
end #
delimiter ;
Notice that this is a before insert trigger.
Now, the query is rewritten as follows:
select * from tableOne t1
inner join tableTwo t2 on t1.tableTwoId = t2.id
where t1.objectId = '..'
order by t1.desnormalized_eventTime;
Disclaimer: not tested.
Cross-table indexing is not possible in MySQL except via the now-defunct Akiban(?) Engine.
I have a rule: "Do not normalize 'continuous' values such as INTs, FLOATs, DATETIMEs, etc." The cost of the JOIN when you need to sort or range-test on the continuous value will kill performance.
DATETIME takes 5 bytes; INT takes 4, so any 'space' argument for normalizing a datetime is rather weak. It is rare that you would need to 'normalize' a datetime on the off chance that all uses of a particular value had to change together.
Maybe I'm wrong, but if this were my application I would not duplicate the data unless I needed to order by two columns in two different tables and this were a hot query (one that is run many times). But since there is no clear-cut solution to avoid the filesort, what about this little trick: force the optimizer to use the index on the ORDER BY column (eventTime)?
select * from tableOne t1
inner join tableTwo t2 use index (eventTime) on t1.tableTwoId = t2.id and t2.eventTime > 0
where t1.objectId = 1
order by t2.eventTime desc limit 0,10;
Notice the use index (eventTime) hint and the t2.eventTime > 0 condition.
Its EXPLAIN shows that the optimizer used the index on eventTime instead of a filesort:
id  select_type  table  type   possible_keys        key         key_len  ref          rows  Extra
--------------------------------------------------------------------------------------------------------------
1   SIMPLE       t2     range  eventTime            eventTime   5        NULL         5000  Using where; Using index
1   SIMPLE       t1     ref    objectId,tableTwoId  tableTwoId  4        tests.t2.id  1     Using where
Related
My database has a staging table with the following structure:
CREATE TABLE featureMappings (
id bigint(20) NOT NULL AUTO_INCREMENT,
visitId bigint(20) NOT NULL,
featureId bigint(20) NOT NULL,
textValue text DEFAULT NULL,
hashTextValue char(32) GENERATED ALWAYS AS (MD5(textValue)) VIRTUAL,
PRIMARY KEY (id));
ALTER TABLE featureMappings
ADD INDEX fsHashTextValue (featureId, hashTextValue)
In a typical run this table has approximately 40 - 100 million rows. There are a lot of duplicate text values so I am using the hashTextValue key to be able to index on this column.
The following query takes about 25 seconds to run:
CREATE TEMPORARY TABLE temp AS
SELECT
featureId,
hashTextValue
FROM
featureMappings
GROUP BY featureId, hashTextValue
Question
I'd like to extract the value in the textValue column alongside the featureId and hashTextValue columns.
I have tried two approaches. Both of these dramatically increased the query time, so I'm looking for a better solution.
Slow Option 1 - Adding textValue to the query
When running the below change to the query, the time to process went from 25 seconds to about 10 minutes. I've tried to google how textValue is retrieved when not using an aggregate function, but could not find a clear answer.
CREATE TEMPORARY TABLE temp AS
SELECT
featureId,
hashTextValue,
textValue # I also tried MIN(textValue)
FROM
featureMappings
GROUP BY featureId, hashTextValue
Complicated Option 2: Iterative Update
My preferred approach is to iterate over the unique combinations of the first query and then run a loop over the following queries:
SELECT featureId, hashTextValue INTO @fid, @htv
FROM temp
WHERE textValue is NULL and hashTextValue IS NOT NULL
LIMIT 1;
SELECT textValue
INTO @textValue
FROM featureMappings
WHERE featureId = @fid and hashTextValue = @htv
LIMIT 1;
UPDATE temp
SET textValue = @textValue
WHERE featureId = @fid AND hashTextValue = @htv;
Server Configuration
This is being run on AWS RDS Aurora based on MySQL 5.7. The server has limited (2GB) memory and usually has less freeable memory than the size of the index on the table.
Plan A: Dedup as you load. This is trivially done by making the PK of featureMappings be PRIMARY KEY(featureId, hashTextValue) and using INSERT IGNORE when loading the staging table.
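A sketch of Plan A (untested; note that for an InnoDB primary key the generated hash column has to be STORED rather than VIRTUAL, and rows with a NULL textValue would need separate handling since PK columns cannot be NULL):
CREATE TABLE featureMappings (
  visitId bigint(20) NOT NULL,
  featureId bigint(20) NOT NULL,
  textValue text,
  hashTextValue char(32) GENERATED ALWAYS AS (MD5(textValue)) STORED,
  PRIMARY KEY (featureId, hashTextValue)
);

-- INSERT IGNORE silently skips rows whose (featureId, hashTextValue)
-- already exists, so the staging table is deduplicated as it is loaded
INSERT IGNORE INTO featureMappings (visitId, featureId, textValue)
VALUES (123, 45, 'example text');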
Plan B: (Assuming there is something preventing Plan A) Change the table to have these indexes:
PRIMARY KEY (featureId, hashTextValue, id),
INDEX(id)
This still has the dups, but I am unclear on what needs to happen next.
Further...
SELECT featureId, hashTextValue INTO @fid, @htv
FROM temp
WHERE textValue is NULL and hashTextValue IS NOT NULL
LIMIT 1;
This has the problem of getting slower and slower as you eat through the items that match. It would be better to add an explicit PRIMARY KEY and walk through temp. In fact, it will be an order of magnitude faster (if temp is large). Let's say id is the PK; then:
SELECT @id := id, @fid := featureId, @htv := hashTextValue
FROM temp
WHERE textValue is NULL and hashTextValue IS NOT NULL
AND id > @id -- this picks up 'where you left off'
LIMIT 1;
(Initialize with SET @id := 0;)
Now that you have the id, the UPDATE becomes simpler and faster.
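For instance, roughly (a sketch; @textValue is still fetched by the second SELECT from the question, unchanged):
UPDATE temp
SET textValue = @textValue
WHERE id = @id;   -- primary-key lookup instead of the two-column match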
Once upon a time, I had a table like this:
CREATE TABLE `Events` (
`EvtId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`AlarmId` INT UNSIGNED,
-- Other fields omitted for brevity
PRIMARY KEY (`EvtId`)
);
AlarmId was permitted to be NULL.
Now, because I want to expand from zero-or-one alarm per event to zero-or-more alarms per event, in a software update I'm changing instances of my database to have this instead:
CREATE TABLE `Events` (
`EvtId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
-- Other fields omitted for brevity
PRIMARY KEY (`EvtId`)
);
CREATE TABLE `EventAlarms` (
`EvtId` INT UNSIGNED NOT NULL,
`AlarmId` INT UNSIGNED NOT NULL,
PRIMARY KEY (`EvtId`, `AlarmId`),
CONSTRAINT `fk_evt` FOREIGN KEY (`EvtId`) REFERENCES `Events` (`EvtId`)
ON DELETE CASCADE ON UPDATE CASCADE
);
So far so good.
The data is easy to migrate, too:
INSERT INTO `EventAlarms`
SELECT `EvtId`, `AlarmId` FROM `Events` WHERE `AlarmId` IS NOT NULL;
ALTER TABLE `Events` DROP COLUMN `AlarmId`;
Thing is, my system requires that a downgrade also be possible. I accept that downgrades will sometimes be lossy in terms of data, and that's okay. However, they do need to work where possible, and result in the older database structure while making a best effort to keep as much original data as is reasonably possible.
In this case, that means going from zero-or-more alarms per event, to zero-or-one alarm per event. I could do it like this:
ALTER TABLE `Events` ADD COLUMN `AlarmId` INT UNSIGNED;
UPDATE `Events`
LEFT JOIN `EventAlarms` USING(`EvtId`)
SET `Events`.`AlarmId` = `EventAlarms`.`AlarmId`;
DROP TABLE `EventAlarms`;
… which is kind of fine, since I don't really care which one gets kept (it's best-effort, remember). However, as warned, this is not good for replication as the result may be unpredictable:
> SHOW WARNINGS;
Unsafe statement written to the binary log using statement format since
BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-
increment column after selecting from another table are unsafe because the
order in which rows are retrieved determines what (if any) rows will be
written. This order cannot be predicted and may differ on master and the
slave.
Is there a way to somehow "order" or "limit" the join in the update, or shall I just skip this whole enterprise and stop trying to be clever? If the latter, how can I leave the downgraded AlarmId as NULL iff there were multiple rows in the new table between which we cannot safely distinguish? I do want to migrate the AlarmId if there is only one.
As a downgrade is a "one-time" maintenance operation, it doesn't have to be exactly real-time, but speed would be nice. Both tables could potentially have thousands of rows.
(MariaDB 5.5.56 on CentOS 7, but must also work on whatever ships with CentOS 6.)
First, we can perform a bit of analysis, with a self-join:
SELECT `A`.`EvtId`, COUNT(`B`.`EvtId`) AS `N`
FROM `EventAlarms` AS `A`
LEFT JOIN `EventAlarms` AS `B` ON (`A`.`EvtId` = `B`.`EvtId`)
GROUP BY `B`.`EvtId`
The result will look something like this:
EvtId N
--------------
370 1
371 1
372 4
379 1
380 1
382 16
383 1
384 1
Now you can, if you like, drop all the rows representing events that map to more than one alarm (which you suggest as a fallback solution; I think this makes sense, though you could modify the below to leave one of them in place if you really wanted).
Instead of actually DELETEing anything, though, it's easier to introduce a new table, populated using the self-joining query shown above:
CREATE TEMPORARY TABLE `_migrate` (
`EvtId` INT UNSIGNED,
`n` INT UNSIGNED,
PRIMARY KEY (`EvtId`),
KEY `idx_n` (`n`)
);
INSERT INTO `_migrate`
SELECT `A`.`EvtId`, COUNT(`B`.`EvtId`) AS `n`
FROM `EventAlarms` AS `A`
LEFT JOIN `EventAlarms` AS `B` ON(`A`.`EvtId` = `B`.`EvtId`)
GROUP BY `B`.`EvtId`;
Then your update becomes:
UPDATE `Events`
LEFT JOIN `_migrate` ON (`Events`.`EvtId` = `_migrate`.`EvtId` AND `_migrate`.`n` = 1)
LEFT JOIN `EventAlarms` ON (`_migrate`.`EvtId` = `EventAlarms`.`EvtId`)
SET `Events`.`AlarmId` = `EventAlarms`.`AlarmId`
WHERE `EventAlarms`.`AlarmId` IS NOT NULL
And, finally, clean up after yourself:
DROP TABLE `_migrate`;
DROP TABLE `EventAlarms`;
MySQL still kicks out the same warning as before, but since we know that at most one value will be pulled from the source tables, we can basically just ignore it.
It should even be reasonably efficient, as we can tell from the equivalent EXPLAIN SELECT:
EXPLAIN SELECT `Events`.`EvtId` FROM `Events`
LEFT JOIN `_migrate` ON (`Events`.`EvtId` = `_migrate`.`EvtId` AND `_migrate`.`n` = 1)
LEFT JOIN `EventAlarms` ON (`_migrate`.`EvtId` = `EventAlarms`.`EvtId`)
WHERE `EventAlarms`.`AlarmId` IS NOT NULL
id select_type table type possible_keys key key_len ref rows Extra
---------------------------------------------------------------------------------------------------------------------
1 SIMPLE _migrate ref PRIMARY,idx_n idx_n 5 const 6 Using index
1 SIMPLE EventAlarms ref PRIMARY,fk_AlarmId PRIMARY 8 db._migrate.EvtId 1 Using where; Using index
1 SIMPLE Events eq_ref PRIMARY PRIMARY 8 db._migrate.EvtId 1 Using where; Using index
Use a subquery and user variables to select just one EventAlarms row per event.
In your UPDATE, instead of EventAlarms, use:
( SELECT `EvtId`, `AlarmId`
  FROM ( SELECT `EvtId`, `AlarmId`,
                @rn := if ( @EvtId = `EvtId`,
                            @rn + 1,
                            if ( @EvtId := `EvtId`, 1, 1)
                          ) as rn
         FROM `EventAlarms`
         CROSS JOIN ( SELECT @EvtId := 0, @rn := 0 ) as vars
         ORDER BY EvtId, AlarmId
       ) as t
  WHERE rn = 1
) as SingleEventAlarms
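Plugged into the downgrade statement, that would look roughly like this (untested sketch):
UPDATE `Events`
LEFT JOIN (
    SELECT `EvtId`, `AlarmId`
    FROM ( SELECT `EvtId`, `AlarmId`,
                  @rn := if ( @EvtId = `EvtId`,
                              @rn + 1,
                              if ( @EvtId := `EvtId`, 1, 1)
                            ) as rn
           FROM `EventAlarms`
           CROSS JOIN ( SELECT @EvtId := 0, @rn := 0 ) as vars
           ORDER BY EvtId, AlarmId
         ) as t
    WHERE rn = 1
) as SingleEventAlarms USING (`EvtId`)
SET `Events`.`AlarmId` = SingleEventAlarms.`AlarmId`;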
I know deleting duplicates from MySQL is often discussed here, but none of the solutions works for my case.
So, I have a DB with address data roughly like this:
ID; Anrede; Vorname; Nachname; Strasse; Hausnummer; PLZ; Ort; Nummer_Art; Vorwahl; Rufnummer
ID is the primary key and unique.
And I have entries, for example, like this:
1;Herr;Michael;Müller;Testweg;1;55555;Testhausen;Mobile;012345;67890
2;Herr;Michael;Müller;Testweg;1;55555;Testhausen;Fixed;045678;877656
The different phone numbers are not the problem, because they are not relevant for me. So I just want to delete the duplicates in last name, street and zip code. In this case that means ID 1 or ID 2; which of the two doesn't matter.
I actually tried it like this with a DELETE:
DELETE db
FROM Import_Daten db,
Import_Daten dbl
WHERE db.id > dbl.id AND
db.Lastname = dbl.Lastname AND
db.Strasse = dbl.Strasse AND
db.PLZ = dbl.PLZ;
And insert into a copy table:
INSERT INTO Import_Daten_1
SELECT MIN(db.id),
db.Anrede,
db.Firstname,
db.Lastname,
db.Branche,
db.Strasse,
db.Hausnummer,
db.Ortsteil,
db.Land,
db.PLZ,
db.Ort,
db.Kontaktart,
db.Vorwahl,
db.Durchwahl
FROM Import_Daten db,
Import_Daten dbl
WHERE db.lastname = dbl.lastname AND
db.Strasse = dbl.Strasse And
db.PLZ = dbl.PLZ;
The complete table contains over 10 million rows. The size is actually my problem. MySQL runs on a MAMP server on a MacBook with 1.5 GHz and 4 GB RAM, so it is not really fast. The SQL statements are run in phpMyAdmin. I have no other system options at the moment.
You can write a stored procedure that each time selects a different chunk of data (for example by row number between two values) and deletes only from that range. This way you will delete your duplicates bit by bit.
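A rough sketch of that idea, using id ranges as the chunks and the columns from your DELETE attempt (hypothetical procedure name, not tested; an index on (Lastname, Strasse, PLZ) would still help the self-join):
drop procedure if exists delete_duplicates_chunked;
delimiter #
create procedure delete_duplicates_chunked(IN chunkSize int)
begin
  declare v_from int unsigned default 0;
  declare v_max int unsigned;
  select max(id) into v_max from Import_Daten;
  while v_from <= v_max do
    -- delete duplicates whose id falls inside the current chunk only
    delete db
    from Import_Daten db
    join Import_Daten dbl
      on  db.Lastname = dbl.Lastname
      and db.Strasse  = dbl.Strasse
      and db.PLZ      = dbl.PLZ
      and db.id > dbl.id
    where db.id between v_from and v_from + chunkSize - 1;
    set v_from = v_from + chunkSize;
  end while;
end #
delimiter ;

call delete_duplicates_chunked(100000);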
A more effective two-table solution can look like the following.
We can store only the data we really need to delete, and only the fields that contain duplicate information.
Let's assume we are looking for duplicate data in the Lastname, Branche and Hausnummer fields.
Create a table to hold the duplicate data, dropping any previous copy first:
DROP TABLE IF EXISTS data_to_delete;
Populate the table with the data we need to delete (I assume all fields have the VARCHAR(255) type):
CREATE TABLE data_to_delete (
id BIGINT COMMENT 'this field will contain ID of row that we will not delete',
cnt INT,
Lastname VARCHAR(255),
Branche VARCHAR(255),
Hausnummer VARCHAR(255)
) AS SELECT
min(t1.id) AS id,
count(*) AS cnt,
t1.Lastname,
t1.Branche,
t1.Hausnummer
FROM Import_Daten AS t1
GROUP BY t1.Lastname, t1.Branche, t1.Hausnummer
HAVING count(*)>1 ;
Now let's delete the duplicate data, leaving only one record from each duplicate set:
DELETE Import_Daten
FROM Import_Daten LEFT JOIN data_to_delete
ON Import_Daten.Lastname=data_to_delete.Lastname
AND Import_Daten.Branche=data_to_delete.Branche
AND Import_Daten.Hausnummer = data_to_delete.Hausnummer
WHERE Import_Daten.id != data_to_delete.id;
DROP TABLE data_to_delete;
You can add a new column, e.g. uq, and make it UNIQUE.
ALTER TABLE Import_Daten
ADD COLUMN `uq` BINARY(16) NULL,
ADD UNIQUE INDEX `uq_UNIQUE` (`uq` ASC);
When this is done you can execute an UPDATE query like this
UPDATE IGNORE Import_Daten
SET
uq = UNHEX(
MD5(
CONCAT(
Import_Daten.Lastname,
Import_Daten.Street,
Import_Daten.Zipcode
)
)
)
WHERE
uq IS NULL;
Once all entries are updated, running the query again affects no further rows; the remaining duplicates still have uq = NULL and can be removed.
The result then is:
0 row(s) affected, 1 warning(s): 1062 Duplicate entry...
For newly added rows, always create the uq hash, and consider using this column as the primary key once all entries are unique.
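The removal step mentioned above could then be as simple as this (a sketch; note that rows where one of the concatenated fields is NULL also end up with uq = NULL, so such rows would need to be excluded or handled first):
-- rows whose hash collided with an already-set uq value remain NULL
DELETE FROM Import_Daten WHERE uq IS NULL;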
I have got an older database for which (for some really questionable and obscure reasons I'd rather not go into here) I want to randomize or shuffle the primary keys.
Right now I have auto-increment fields in the MySQL database tables.
I do not have many relations, and those that exist are not defined as foreign keys. The relationships do not need to be preserved.
All I'm looking for is to take the current values of the primary keys and replace them with a random value drawn from the same set, like:
ID := new(ID)
Where the new() function returns a value from the set of all old IDs, as a 1:1 mapping. E.g.
2 := 3
3 := 2
But not
2 := 3
3 := 3
Is there a way to change the data in the database with (ideally) a single SQL query per table?
Edit: I do not have really strict requirements. Assume exclusive access to the database if it helps, including changing constraints on the primary key back and forth, e.g. alter the table, do the operation, then alter the table back to the previous schema. It is also possible to add another column for the new (or old) PK value.
Just a sketch of the procedure. Create two helper tables:
CREATE TABLE temp_old
( ai INT NOT NULL AUTO_INCREMENT
, id INT NOT NULL
, PRIMARY KEY (ai)
, INDEX old_idx (id, ai)
) ENGINE = InnoDB ;
CREATE TABLE temp_new
( ai INT NOT NULL AUTO_INCREMENT
, id INT NOT NULL
, PRIMARY KEY (ai)
, INDEX new_idx (id, ai)
) ENGINE = InnoDB ;
Copy the id values in different orders to the two tables (randomly in the 2nd table):
INSERT INTO temp_old
(id)
SELECT id
FROM tableX
ORDER BY id ;
INSERT INTO temp_new
(id)
SELECT id
FROM tableX
ORDER BY RAND() ;
Then we drop the primary key:
ALTER TABLE tableX
DROP PRIMARY KEY ;
to run the actual UPDATE statement:
UPDATE tableX AS t
JOIN temp_old AS o
ON o.id = t.id
JOIN temp_new AS n
ON n.ai = o.ai
SET t.id = n.id ;
Then recreate the primary key and drop the temp tables:
ALTER TABLE tableX
ADD PRIMARY KEY (id) ;
DROP TABLE temp_old ;
DROP TABLE temp_new ;
Tested in SQL-Fiddle
Here's a technique that creates a list of your ids in table order, along with a sequential number from 1. It also creates a list of your ids in a random order, along with a sequential number from 1. It then updates the ids based on matching the sequential numbers.
There are issues with the performance of ORDER BY RAND() (and with its randomness).
If your keys are already sequential starting from 1, you can simplify this.
Update
Test as t
Inner Join (
Select
@rownum2 := @rownum2 + 1 as rank,
t2.id
From
Test t2,
(Select @rownum2:= 0) a1
) as o on t.id = o.id
Inner Join (
Select
@rownum := @rownum + 1 as rank,
t3.id
From
(Select id from Test order by Rand()) t3,
(Select @rownum:= 0) a2
) as n on o.rank = n.rank
Set
t.id = n.id
http://sqlfiddle.com/#!2/3f354/1
You could create a stored procedure that builds a temporary table containing all of the ids, then loops over each record, replacing its id with an id taken from the temp table and removing that id from the temp table. I don't believe there is a way to do what you are talking about in a single query, though.
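A rough sketch of that approach, reusing the tableX placeholder from the answer above (hypothetical and not tested; it adds one step not mentioned above: the ids are first shifted out of their original range so that assigning an id from the pool can never collide with a not-yet-processed row):
drop procedure if exists shuffle_ids;
delimiter #
create procedure shuffle_ids()
begin
  declare v_done int default 0;
  declare v_cur int unsigned;
  declare v_new int unsigned;
  declare v_offset int unsigned;
  declare cur cursor for select id from tableX;
  declare continue handler for not found set v_done = 1;

  -- pool of the original ids, to be handed out in random order
  drop temporary table if exists id_pool;
  create temporary table id_pool as select id from tableX;

  -- shift every id above the current maximum, so reassignments cannot collide
  select max(id) into v_offset from tableX;
  update tableX set id = id + v_offset;

  open cur;
  shuffle_loop: loop
    fetch cur into v_cur;
    if v_done then leave shuffle_loop; end if;
    -- take one random unused id from the pool and give it to this row
    set v_new = (select id from id_pool order by rand() limit 1);
    delete from id_pool where id = v_new;
    update tableX set id = v_new where id = v_cur;
  end loop;
  close cur;

  drop temporary table id_pool;
end #
delimiter ;

call shuffle_ids();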
I have one MySQL table:
CREATE TABLE IF NOT EXISTS `test` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`SenderId` int(10) unsigned NOT NULL,
`ReceiverId` int(10) unsigned NOT NULL,
`DateSent` datetime NOT NULL,
`Notified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`Id`),
KEY `ReceiverId_SenderId` (`ReceiverId`,`SenderId`),
KEY `SenderId` (`SenderId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
The table is populated with 10.000 random rows for testing by using the following procedure:
DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE `FillTest`(IN `cnt` INT)
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE intSenderId INT;
DECLARE intReceiverId INT;
DECLARE dtDateSent DATE;
DECLARE blnNotified INT;
WHILE (i<=cnt) DO
SET intSenderId = FLOOR(1 + (RAND() * 50));
SET intReceiverId = FLOOR(51 + (RAND() * 50));
SET dtDateSent = str_to_date(concat(floor(1 + rand() * (12-1)),'-',floor(1 + rand() * (28 -1)),'-','2008'),'%m-%d-%Y');
SET blnNotified = FLOOR(1 + (RAND() * 2))-1;
INSERT INTO test (SenderId, ReceiverId, DateSent, Notified)
VALUES(intSenderId,intReceiverId,dtDateSent, blnNotified);
SET i=i+1;
END WHILE;
END//
DELIMITER ;
CALL `FillTest`(10000);
The problem:
I need to write a query which will group by SenderId, ReceiverId, return the highest Id of each group, and then return the first 100 of those rows ordered by Id in ascending order.
I played with GROUP BY, ORDER BY and MAX(Id), but the query was too slow, so I came up with this query:
SELECT SQL_NO_CACHE t1.*
FROM test t1
LEFT JOIN test t2 ON (t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
WHERE t2.Id IS NULL
ORDER BY t1.Id ASC
LIMIT 100;
The above query returns the correct data, but it becomes too slow when the test table has more than 150.000 rows. On 150.000 rows the above query needs 7 seconds to complete. I expect the test table to have between 500.000 – 1M rows, and the query needs to return the correct data in less than 3 sec. If it’s not possible to fetch the correct data in less than 3 sec, then I need it to fetch the data using the fastest query possible.
So, how can the above query be optimized so that it runs faster?
Reasons why this query may be slow:
It's a lot of data. Lots of it may be returned. It returns the last record for each SenderId/ReceiverId combination.
The distribution of the data (many Sender/Receiver combinations, or relatively few of them, but each with multiple 'versions').
The whole result set must be sorted by MySQL, because you need the first 100 records, sorted by Id.
These make it hard to optimize this query without restructuring the data. A few suggestions to try:
- You could try using NOT EXISTS, although I doubt if it would help.
SELECT SQL_NO_CACHE t1.*
FROM test t1
WHERE NOT EXISTS
(SELECT 'x'
FROM test t2
WHERE t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
ORDER BY t1.Id ASC
LIMIT 100;
- You could try using proper indexes on ReceiverId, SenderId and Id. Experiment with creating a combined index on the three columns. Try two versions, one with Id being the first column, and one with Id being the last.
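For example (index names are just for illustration):
ALTER TABLE test ADD INDEX idx_id_recv_send (Id, ReceiverId, SenderId);
ALTER TABLE test ADD INDEX idx_recv_send_id (ReceiverId, SenderId, Id);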
With slight database modifications:
- You could save a combination of SenderId/ReceiverId in a separate table with a LastId pointing to the record you want (see the sketch after this list).
- You could save a 'PreviousId' with each record, keeping it NULL for the last record per Sender/Receiver. You only need to query the records where previousId is null.
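A sketch of the first suggestion (untested; the lookup table and trigger names are made up, and the backfill assumes no concurrent inserts while it runs):
-- small lookup table that always points at the newest row per pair
CREATE TABLE last_per_pair (
  SenderId INT UNSIGNED NOT NULL,
  ReceiverId INT UNSIGNED NOT NULL,
  LastId INT UNSIGNED NOT NULL,
  PRIMARY KEY (ReceiverId, SenderId)
) ENGINE=InnoDB;

-- backfill for existing rows
INSERT INTO last_per_pair (SenderId, ReceiverId, LastId)
SELECT SenderId, ReceiverId, MAX(Id) FROM test GROUP BY SenderId, ReceiverId;

DELIMITER //
CREATE TRIGGER test_track_last AFTER INSERT ON test
FOR EACH ROW
BEGIN
  -- remember the newest Id for this Sender/Receiver combination
  INSERT INTO last_per_pair (SenderId, ReceiverId, LastId)
  VALUES (NEW.SenderId, NEW.ReceiverId, NEW.Id)
  ON DUPLICATE KEY UPDATE LastId = NEW.Id;
END//
DELIMITER ;

-- the original query then reduces to a join on primary keys
SELECT t.*
FROM last_per_pair lp
JOIN test t ON t.Id = lp.LastId
ORDER BY t.Id ASC
LIMIT 100;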