I ran into a case where I realized that adding a NOT EXISTS clause was making a query run faster.
If we have a table Test (with about 500k records):
CREATE TABLE Test
(
ID bigint
, Code varchar(20)
, AddedDate datetime
)
and I want to insert the resulting data into a table variable @A:
DECLARE @A TABLE
(
ID bigint
)
then we have 3 scenarios as follows:
1). Inserting all data without any restriction
INSERT @A
SELECT dh.ID
FROM TEST dh
The query runs in 7 - 8 seconds.
2). Adding NOT EXISTS with a code that does not exist
INSERT @A
SELECT dh.ID
FROM TEST dh
WHERE NOT EXISTS (SELECT 1 FROM TEST d WHERE d.Code = 'bbbb')
The query runs in about 4 seconds.
3). Adding NOT EXISTS with a code that exists
INSERT @A
SELECT dh.ID
FROM TEST dh
WHERE NOT EXISTS (SELECT 1 FROM TEST d WHERE d.Code = 'AX8')
The query runs in 0 seconds.
The result for 3) makes sense: since the inner query returns data, the NOT EXISTS predicate is false for every row, so nothing is inserted into the @A table and the time is the lowest, 0 seconds in this case.
But why does 2) run faster than 1)?
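(One way to dig into the difference, as a sketch: run both scenarios in the same session with SET STATISTICS enabled and compare the reported reads and CPU time, as well as the actual execution plans, rather than wall-clock time alone.)
DECLARE @A TABLE (ID bigint);
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- scenario 1)
INSERT @A
SELECT dh.ID
FROM TEST dh;
-- scenario 2)
INSERT @A
SELECT dh.ID
FROM TEST dh
WHERE NOT EXISTS (SELECT 1 FROM TEST d WHERE d.Code = 'bbbb');
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;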
I have a table with about 50K rows. I need to multiply this data tenfold to have at least 5M rows for testing performance. It's already taken me several long minutes to import the 50K rows from a CSV file, so I don't want to create a 5M-record file and then import it into SQL.
Is there a way to duplicate the existing rows over and over again to create 5M records? I don't mind if the rows are identical; they should just have a different id, which is the primary (auto-increment) column.
I'm currently doing this on XAMPP with phpMyAdmin.
Insert into my_table (y,z) select y, z from my_table;
where x is your auto-incrementing id column (it is left out of the insert so a new value is generated for each copied row).
Repeat a (remarkably small) number of times.
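For example (a sketch, assuming you start from roughly 50K rows): each run doubles the row count, so seven runs are enough to pass 5M.
-- 50,000 -> 100,000 -> 200,000 -> 400,000 -> 800,000 -> 1.6M -> 3.2M -> 6.4M
Insert into my_table (y,z) select y, z from my_table; -- run this statement 7 times
SELECT COUNT(*) FROM my_table; -- check where you are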
Option 1 : Use union
insert into your_table (col1,col2)
select col1,col2 from your_table
union all
select col1,col2 from your_table
union all
select col1,col2 from your_table
union all
select col1,col2 from your_table
...and so on, adding as many UNION ALL branches as you need (each branch adds another full copy of the table).
Option 2 : Use a dummy table with 10 records and do a cross join
Create a dummy table with 10 rows
insert into your_table (col1,col2)
select col1,col2 from your_table, dummy_table
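A minimal sketch of that dummy table (the names are placeholders):
CREATE TABLE dummy_table (n INT);
INSERT INTO dummy_table (n) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
Note that the cross join inserts ten extra copies of every existing row, so ~50K rows become ~550K in one pass; with roughly 100 rows in the dummy table you would reach ~5M in a single insert.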
If you have ~50K rows, each INSERT ... SELECT pass below doubles the table, so seven passes will give you ~6.4M rows.
To do so, you can create a procedure and use a loop:
DELIMITER $$
CREATE PROCEDURE populate()
BEGIN
DECLARE counter INT DEFAULT 1;
-- each pass doubles the table: ~50K rows become ~6.4M after 7 passes
WHILE counter <= 7 DO
insert into mytable(colA, colB) select colA, colB from mytable;
SET counter = counter + 1;
END WHILE;
END $$
DELIMITER ;
Then you can call the procedure using
call populate();
I have a stored procedure that opens a CURSOR on a select statement and iterates over a table of 15M rows (this table is a simple import of a large CSV).
I need to normalize that data by inserting various pieces of each row into 3 different tables (capturing auto-increment IDs, using them in foreign key constraints, and so on).
So I wrote a simple stored procedure: open the CURSOR, FETCH the fields into variables, and do the 3 insert statements.
I'm on a small DB server with a default MySQL installation (1 CPU, 1.7GB RAM), and I had hoped the task would take a few hours. I'm at 24+ hours and top shows about 85% of the CPU time in I/O wait (the wa column below).
I think I have some kind of terrible inefficiency. Any ideas on improving the efficiency of the task, or just on determining where the bottleneck is?
root@devapp1:/mnt/david_tmp# vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 256 13992 36888 1466584 0 0 9 61 1 1 0 0 98 1
1 2 256 15216 35800 1466312 0 0 57 7282 416 847 2 1 12 85
0 1 256 14720 35984 1466768 0 0 42 6154 387 811 2 1 10 87
0 1 256 13736 36160 1467344 0 0 51 6979 439 934 2 1 9 89
DROP PROCEDURE IF EXISTS InsertItemData;
DELIMITER $$
CREATE PROCEDURE InsertItemData() BEGIN
DECLARE spd TEXT;
DECLARE lpd TEXT;
DECLARE pid INT;
DECLARE iurl TEXT;
DECLARE last_id INT UNSIGNED;
DECLARE done INT DEFAULT FALSE;
DECLARE raw CURSOR FOR select t.shortProductDescription, t.longProductDescription, t.productID, t.productImageURL
from frugg.temp_input t;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN raw;
read_loop: LOOP
FETCH raw INTO spd, lpd, pid, iurl;
IF done THEN
LEAVE read_loop;
END IF;
INSERT INTO item (short_description, long_description) VALUES (spd, lpd);
SET last_id = LAST_INSERT_ID();
INSERT INTO item_catalog_map (catalog_id, catalog_unique_item_id, item_id) VALUES (1, CAST(pid AS CHAR), last_id);
INSERT INTO item_images (item_id, original_url) VALUES (last_id, iurl);
END LOOP;
CLOSE raw;
END$$
DELIMITER ;
MySQL will almost always perform better executing straight SQL statements than looping inside a stored procedure.
That said, if you are using InnoDB tables, your procedure will run faster inside a START TRANSACTION / COMMIT block.
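For example, a sketch of that, wrapping the existing call so the whole run commits once instead of once per INSERT under autocommit:
START TRANSACTION;
CALL InsertItemData();
COMMIT;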
Even better would be to give the rows in frugg.temp_input an AUTO_INCREMENT id and query against that table:
DROP TABLE IF EXISTS temp_input2;
CREATE TABLE temp_input2 (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
shortProductDescription TEXT,
longProductDescription TEXT,
productID INT,
productImageURL TEXT,
PRIMARY KEY (id)
);
START TRANSACTION;
INSERT INTO
temp_input2
SELECT
NULL AS id,
shortProductDescription,
longProductDescription,
productID,
productImageURL
FROM
frugg.temp_input;
INSERT
INTO item
(
id,
short_description,
long_description
)
SELECT
id,
shortProductDescription AS short_description,
longProductDescription AS long_description
FROM
temp_input2
ORDER BY
id;
INSERT INTO
item_catalog_map
(
catalog_id,
catalog_unique_item_id,
item_id
)
SELECT
1 AS catalog_id,
CAST(productID AS CHAR) AS catalog_unique_item_id,
id AS item_id
FROM
temp_input2
ORDER BY
id;
INSERT INTO
item_images
(
item_id,
original_url
)
SELECT
id AS item_id,
productImageURL AS original_url
FROM
temp_input2
ORDER BY
id;
COMMIT;
Even better than the above: before loading the .CSV file into frugg.temp_input, add an AUTO_INCREMENT field to that table, saving you the extra step of creating and loading temp_input2 shown above.
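A sketch of that idea (the file path, delimiters, and column order are assumptions): give frugg.temp_input its own AUTO_INCREMENT key and list only the CSV columns in LOAD DATA, so the id is generated at load time.
CREATE TABLE frugg.temp_input (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    shortProductDescription TEXT,
    longProductDescription TEXT,
    productID INT,
    productImageURL TEXT,
    PRIMARY KEY (id)
);
LOAD DATA LOCAL INFILE '/mnt/david_tmp/products.csv'  -- assumed file name
INTO TABLE frugg.temp_input
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(shortProductDescription, longProductDescription, productID, productImageURL);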
I'm of a similar mind to what Ross offered, but without knowing more about your tables, indexes, and what the "auto-increment" column names are, I would just do direct inserts. You will have an issue, however, if you encounter any duplicates, which I didn't see any checking for. I would insert as follows and have appropriate indexes to help the re-join (based on the short and long product descriptions).
Insert straight from the source table, then insert from selects that join back to what was just inserted, such as...
INSERT INTO item
( short_description,
long_description )
SELECT
t.ShortProductDescription,
t.LongProductDescription
from
frugg.temp_input t;
Done: 15 million rows inserted into the item table. Now, add to the catalog map table...
INSERT INTO item_catalog_map
( catalog_id,
catalog_unique_item_id,
item_id )
SELECT
1 as Catalog_id,
CAST( t.productID as CHAR) as catalog_unique_item_id,
item.AutoIncrementIDColumn as item_id
from
frugg.temp_input t
JOIN item on t.ShortProductDescription = item.short_description
AND t.LongProductDescription = item.long_description
Done: all catalog map entries inserted with their corresponding item IDs. Now the image URLs...
INSERT INTO item_images
( item_id,
original_url )
SELECT
item.AutoIncrementIDColumn as item_id,
t.productImageURL as original_url
from
frugg.temp_input t
JOIN item on t.ShortProductDescription = item.short_description
AND t.LongProductDescription = item.long_description
Done with the image URLs.
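To make those re-joins workable on 15M rows, the description columns need supporting indexes; a sketch (TEXT columns require a prefix length, and 100 is an arbitrary choice here):
ALTER TABLE item ADD INDEX idx_item_desc (short_description(100), long_description(100));
ALTER TABLE frugg.temp_input ADD INDEX idx_temp_desc (shortProductDescription(100), longProductDescription(100));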
The base query works as intended, but when I try to sum the first column, the result is supposed to be 5; instead I get 4. Why?
base query:
SET @last_task = 0;
SELECT
IF(@last_task = RobotShortestPath, 0, 1) AS new_task,
@last_task := RobotShortestPath
FROM rob_log
ORDER BY rog_log_id;
1 1456
0 1456
0 1456
1 1234
0 1234
1 1456
1 2556
1 1456
sum query
SET @last_task = 0;
SELECT SUM(new_task) AS tasks_performed
FROM (
SELECT
IF(@last_task = RobotShortestPath, 0, 1) AS new_task,
@last_task := RobotShortestPath
FROM rob_log
ORDER BY rog_log_id
) AS tmp
4
table structure
CREATE TABLE rob_log (
rog_log_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
# RobotPosX FLOAT NOT NULL,
# RobotPosY FLOAT NOT NULL,
# RobotPosDir TINYINT UNSIGNED NOT NULL,
RobotShortestPath MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(rog_log_id),
KEY (rog_log_id, RobotShortestPath)
);
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 1234;
INSERT INTO rob_log(RobotShortestPath) SELECT 1234;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
INSERT INTO rob_log(RobotShortestPath) SELECT 2556;
INSERT INTO rob_log(RobotShortestPath) SELECT 1456;
I'm testing it at SQL Fiddle: http://sqlfiddle.com/#!2/e80f5/3
as an answer for Counting changes in timeline with MySQL,
but I got really confused.
Here's the reason (as discussed on Twitter):
The variable #last_task was defined in a separate query "batch". I break up the queries on SQL Fiddle into individual batches, executed separately. I do this so you can see the output from each batch as a distinct result set below. In your Fiddle, you can see that there are two sets of output: http://sqlfiddle.com/#!2/e80f5/3/0 and http://sqlfiddle.com/#!2/e80f5/3/1. These map to the two statements you are running (the set and the select). The problem is, your set statement defines a variable that only exists in the first batch; when the select statement runs, it is a separate batch and your variable isn't defined within that context.
To correct this problem, all you have to do is define a different query terminator. Note the dropdown box/button under both the schema and the query panels ( [ ; ] ) - click on that, and you can choose something other than semicolon (the default). Then your two statements will be included together as part of the same batch, and you'll get the result you want. For example:
http://sqlfiddle.com/#!2/e80f5/9
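Alternatively (a sketch, with the usual caveats about user-variable evaluation order): you can sidestep the batch issue entirely by initializing the variable inside the query itself, so no separate SET statement is needed:
SELECT SUM(new_task) AS tasks_performed
FROM (
    SELECT
        IF(@last_task = RobotShortestPath, 0, 1) AS new_task,
        @last_task := RobotShortestPath
    FROM rob_log, (SELECT @last_task := 0) AS init
    ORDER BY rog_log_id
) AS tmp;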
It's probably some bug in an older version of MySQL.
I have tried it on MySQL 5.5 and it's working perfectly.
I have a table that contains computer login and logoff events. Each row is a separate event with a timestamp, machine name, login or logoff event code, and other details. I need to create a SQL procedure that goes through this table, locates corresponding login and logoff events, and inserts new rows into another table containing the machine name, login time, logout time, and duration.
So, should I use a cursor to do this or is there a better way to go about this? The database is pretty huge so efficiency is certainly a concern. Any suggested pseudo code would be great as well.
[edit : pulled from comment]
Source table:
History (
mc_id
, hs_opcode
, hs_time
)
Existing data interpretation:
Login_Event = unique mc_id, hs_opcode = 1, and hs_time is the timestamp
Logout_Event = unique mc_id, hs_opcode = 2, and hs_time is the timestamp
First, your query will be simpler (and faster) if you can order the data in such a way that you don't need a complex subquery to pair up the rows. Since MySQL doesn't support CTEs to do this on the fly, you'll need to create a temporary table:
CREATE TABLE history_ordered (
seq INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
hs_id INT,
mc_id VARCHAR(255),
mc_loggedinuser VARCHAR(255),
hs_time DATETIME,
hs_opcode INT
);
Then, pull and sort from your original table into the new table:
INSERT INTO history_ordered (
hs_id, mc_id, mc_loggedinuser,
hs_time, hs_opcode)
SELECT
hs_id, mc_id, mc_loggedinuser,
hs_time, hs_opcode
FROM history ORDER BY mc_id, hs_time;
You can now use this query to correlate the data:
SELECT li.mc_id,
li.mc_loggedinuser,
li.hs_time as login_time,
lo.hs_time as logout_time
FROM history_ordered AS li
JOIN history_ordered AS lo
ON lo.seq = li.seq + 1
AND li.hs_opcode = 1;
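If you want to materialize that result into the duration table the question describes, a sketch of the insert (the login_duration table and its columns are assumptions, matching the trigger below plus a duration column):
INSERT INTO login_duration (machine, user, login, logout, duration)
SELECT li.mc_id,
       li.mc_loggedinuser,
       li.hs_time,
       lo.hs_time,
       TIMEDIFF(lo.hs_time, li.hs_time)
FROM history_ordered AS li
JOIN history_ordered AS lo
    ON lo.seq = li.seq + 1
    AND li.hs_opcode = 1;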
For future inserts, you can use a trigger like below to keep your duration table updated automatically:
DELIMITER $$
CREATE TRIGGER `match_login` AFTER INSERT ON `history`
FOR EACH ROW
BEGIN
-- DECLARE must appear at the start of the BEGIN ... END block
DECLARE _user VARCHAR(255);
DECLARE _login DATETIME;
IF NEW.hs_opcode = 2 THEN
SELECT mc_loggedinuser, hs_time FROM history
WHERE hs_time = (
SELECT MAX(hs_time) FROM history
WHERE hs_opcode = 1
AND mc_id = NEW.mc_id
) INTO _user, _login;
INSERT INTO login_duration
SET machine = NEW.mc_id,
logout = NEW.hs_time,
user = _user,
login = _login;
END IF;
END$$
DELIMITER ;
CREATE TABLE dummy (fields you'll select data into, + additional fields as needed)
INSERT INTO dummy (columns from your source)
SELECT * FROM <all the tables where you need data for your target data set>
UPDATE dummy SET col1 = CASE WHEN <condition> THEN <value> ... END, etc.
INSERT INTO targetTable
SELECT all columns FROM dummy
Without the actual code you're working on, it's hard to tell whether this approach will be useful. There are some instances when you really need to loop through things, and others when this approach can be used instead.
[EDIT: based on poster's comment]
Can you try executing this and see if you get the desired results?
INSERT INTO <your_target_table_here_with_the_three_columns_required>
SELECT li.mc_id, li.hs_time AS login_time, lo.hs_time AS logout_time
FROM
history AS li
INNER JOIN history AS lo
ON li.mc_id = lo.mc_id
AND li.hs_opcode = 1
AND lo.hs_opcode = 2
AND lo.hs_time = (
SELECT min(hs_time) AS hs_time
FROM history
WHERE hs_time > li.hs_time
AND mc_id = li.mc_id
)
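Since the table is huge and efficiency is a concern, both the join and the MIN(hs_time) subquery benefit from a composite index; a sketch (the index name is arbitrary, and I'm assuming no equivalent index exists yet):
ALTER TABLE history ADD INDEX idx_mc_opcode_time (mc_id, hs_opcode, hs_time);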
I have one mysql table:
CREATE TABLE IF NOT EXISTS `test` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`SenderId` int(10) unsigned NOT NULL,
`ReceiverId` int(10) unsigned NOT NULL,
`DateSent` datetime NOT NULL,
`Notified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`Id`),
KEY `ReceiverId_SenderId` (`ReceiverId`,`SenderId`),
KEY `SenderId` (`SenderId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
The table is populated with 10,000 random rows for testing, using the following procedure:
DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE `FillTest`(IN `cnt` INT)
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE intSenderId INT;
DECLARE intReceiverId INT;
DECLARE dtDateSent DATE;
DECLARE blnNotified INT;
WHILE (i<=cnt) DO
SET intSenderId = FLOOR(1 + (RAND() * 50));
SET intReceiverId = FLOOR(51 + (RAND() * 50));
SET dtDateSent = str_to_date(concat(floor(1 + rand() * (12-1)),'-',floor(1 + rand() * (28 -1)),'-','2008'),'%m-%d-%Y');
SET blnNotified = FLOOR(1 + (RAND() * 2))-1;
INSERT INTO test (SenderId, ReceiverId, DateSent, Notified)
VALUES(intSenderId,intReceiverId,dtDateSent, blnNotified);
SET i=i+1;
END WHILE;
END//
DELIMITER ;
CALL `FillTest`(10000);
The problem:
I need to write a query that groups by (SenderId, ReceiverId), keeps the row with the highest Id in each group, and returns the first 100 of those rows ordered by Id in ascending order.
I played with GROUP BY, ORDER BY and MAX(Id), but the query was too slow, so I came up with this query:
SELECT SQL_NO_CACHE t1.*
FROM test t1
LEFT JOIN test t2 ON (t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
WHERE t2.Id IS NULL
ORDER BY t1.Id ASC
LIMIT 100;
The above query returns the correct data, but it becomes too slow when the test table has more than 150,000 rows; at 150,000 rows it needs 7 seconds to complete. I expect the test table to have between 500,000 and 1M rows, and the query needs to return the correct data in less than 3 seconds. If it's not possible to fetch the correct data in less than 3 seconds, then I need it to fetch the data using the fastest query possible.
So, how can the above query be optimized so that it runs faster?
Reasons why this query may be slow:
It's a lot of data. Lots of it may be returned. It returns the last record for each SenderId/ReceiverId combination.
The distribution of the data (many Sender/Receiver combinations, or relatively few of them but with many 'versions' each).
The whole result set must be sorted by MySQL, because you need the first 100 records, sorted by Id.
These make it hard to optimize this query without restructuring the data. A few suggestions to try:
- You could try using NOT EXISTS, although I doubt if it would help.
SELECT SQL_NO_CACHE t1.*
FROM test t1
WHERE NOT EXISTS
(SELECT 'x'
FROM test t2
WHERE t1.ReceiverId = t2.ReceiverId AND t1.SenderId = t2.SenderId AND t1.Id < t2.Id)
ORDER BY t1.Id ASC
LIMIT 100;
- You could try using proper indexes on ReceiverId, SenderId and Id. Experiment with creating a combined index on the three columns. Try two versions, one with Id being the first column, and one with Id being the last.
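For example (index names are placeholders; keep whichever variant the execution plan actually uses and drop the other):
ALTER TABLE test ADD INDEX idx_recv_send_id (ReceiverId, SenderId, Id);
ALTER TABLE test ADD INDEX idx_id_recv_send (Id, ReceiverId, SenderId);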
With slight database modifications:
- You could save a combination of SenderId/ReceiverId in a separate table, with a LastId pointing to the record you want (see the sketch below).
- You could save a 'PreviousId' with each record, keeping it NULL for the last record per Sender/Receiver. You only need to query the records where previousId is null.
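A sketch of the first suggestion (the table name is an assumption, and keeping it current is left to the application or to triggers on test):
CREATE TABLE last_test (
    SenderId INT UNSIGNED NOT NULL,
    ReceiverId INT UNSIGNED NOT NULL,
    LastId INT UNSIGNED NOT NULL,
    PRIMARY KEY (SenderId, ReceiverId),
    KEY (LastId)
);
INSERT INTO last_test (SenderId, ReceiverId, LastId)
SELECT SenderId, ReceiverId, MAX(Id)
FROM test
GROUP BY SenderId, ReceiverId
ON DUPLICATE KEY UPDATE LastId = VALUES(LastId);
-- the original query then becomes a simple join
SELECT t.*
FROM last_test lt
JOIN test t ON t.Id = lt.LastId
ORDER BY t.Id ASC
LIMIT 100;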