Can this query or table schema be optimized? - mysql

I am running this procedure a few million times, and although each call takes only a few milliseconds, running all of them eventually takes a couple of weeks. I was wondering if anyone could help me optimize it or improve its performance. Any improvement might save days!
CREATE PROCEDURE process_parameters(IN parameter1 VARCHAR(128), IN parameter2 VARCHAR(128), IN combination_type CHAR(1))
BEGIN
    SET @parameter1_id := NULL, @parameter2_id := NULL;
    SET @parameter1_hash := "", @parameter2_hash := "";
    IF parameter1 IS NOT NULL THEN
        SET @parameter1_hash := parameter1;
        INSERT IGNORE INTO `collection1` (`parameter`) VALUES (parameter1);
        SET @parameter1_id := (SELECT `id` FROM `collection1` WHERE `parameter` = parameter1);
    END IF;
    IF parameter2 IS NOT NULL THEN
        SET @parameter2_hash := parameter2;
        INSERT IGNORE INTO `collection2` (`parameter`) VALUES (parameter2);
        SET @parameter2_id := (SELECT `id` FROM `collection2` WHERE `parameter` = parameter2);
    END IF;
    SET @hash := MD5(CONCAT(@parameter1_hash, @parameter2_hash));
    INSERT IGNORE INTO `combinations` (`hash`,`type`,`parameter1`,`parameter2`) VALUES (@hash, combination_type, @parameter1_id, @parameter2_id);
END
The logic behind it is: I store unique combinations of (parameter1, parameter2) in combinations, where parameter1 or parameter2 can be NULL (but never both at the same time). I store a type in combinations to know later which parameter has a value. To ensure that a combination is unique I added an MD5 hash field (a primary key on (parameter1, parameter2) will not work because comparisons with NULL always return NULL). Each parameter has a separate table (collection1 and collection2 respectively) to store its unique id. There are hundreds/thousands of unique parameter1 and parameter2 values, but their combinations are highly repeated and number far fewer than the full Cartesian product.
As an example, ("A", "1"), ("A", "2"), ("B", "1"), ("A", "1"), ("A", NULL), (NULL, "2") would yield:
`collection1` (`id`, `parameter`)
1, "A"
2, "B"
`collection2` (`id`, `parameter`)
1, "1"
2, "2"
`combinations` (`type`, `parameter1`, `parameter2`)
"P1andP2", 1, 1
"P1andP2", 1, 2
"P1andP2", 2, 1
"P1Only", 1, NULL
"P2Only", NULL, 2
These are the definitions of the tables:
DESCRIBE `combinations`;
+-------------+-----------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-----------------------------------+------+-----+---------+----------------+
| combination | int(11) | NO | PRI | NULL | auto_increment |
| hash | char(32) | NO | UNI | NULL | |
| type | enum('P1andP2','P1Only','P2Only') | NO | | NULL | |
| parameter1 | int(11) | YES | | NULL | |
| parameter2 | int(11) | YES | | NULL | |
+-------------+-----------------------------------+------+-----+---------+----------------+
DESCRIBE `collection1`; (`collection2` is identical)
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| parameter | varchar(255) | NO | UNI | NULL | |
+-----------+--------------+------+-----+---------+----------------+
Any help will be appreciated!

Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.
Use LAST_INSERT_ID()
SET @parameter1_id := (SELECT `id` FROM `collection1`
                       WHERE `parameter` = parameter1);
can be replaced by
SELECT @parameter1_id := LAST_INSERT_ID();
It will avoid a round trip to the server.
Oops... The OP points out that the id won't be returned if the row is a dup. This is a workaround that might run faster:
INSERT INTO `collection1` (`parameter`)
    VALUES (parameter1)
    ON DUPLICATE KEY UPDATE
        id = LAST_INSERT_ID(id);
SELECT @parameter1_id := LAST_INSERT_ID();
It's a kludgy trick, but it is documented somewhere in the manual. More below...
Shrink table
Do you really need the combination column? You have another UNIQUE key that could be used as the PRIMARY KEY. This might cut the time taken for the final INSERT in half.
This may (or may not) speed things up, but only because the row size shrinks: Instead of storing the md5 into CHAR(32), store UNHEX(md5) into BINARY(16).
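A minimal sketch of that change (assuming the column keeps its UNIQUE index; existing CHAR(32) hex values would need an explicit UNHEX() conversion rather than relying on an implicit cast):
ALTER TABLE `combinations` MODIFY `hash` BINARY(16) NOT NULL;
-- and in the procedure, store the 16-byte digest instead of the 32-char hex string:
SET @hash := UNHEX(MD5(CONCAT(@parameter1_hash, @parameter2_hash)));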
Batch INSERT
Can you gather a bunch of these to INSERT at once? If you gather 1000 rows and string them into a single INSERT (actually 3 INSERTs, since 3 tables are involved), it will run literally 10 times as fast.
Because of needing the ids, it gets more complicated. You would need to batch things into collection1 and collection2; then work on combinations.
Since the "combination*" tables are essentially "normalization", see my discussion of how to batch them very efficiently: http://mysql.rjweb.org/doc.php/staging_table#normalization It involves 2 statements, one to insert new rows, the other to grab all the ids for the batch.
COALESCE
Get rid of @parameter*_hash and @hash completely. Change the use of @hash to:
INSERT IGNORE INTO combinations (...) VALUES
    ( MD5(CONCAT(COALESCE(parameter1, ''), COALESCE(parameter2, ''))),
      ...)
Think of it this way... Each statement takes a non-trivial amount of time. (This shows up significantly in the batching of inserts.) I'm getting rid of 4 statements at the expense of adding complexity to one statement.
Settings
The most important might be innodb_flush_log_at_trx_commit = 2.
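For example (this needs the SUPER privilege when set at runtime, or it can go in my.cnf; it trades a little crash durability for far fewer disk flushes):
SET GLOBAL innodb_flush_log_at_trx_commit = 2;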
3 Streams
Write 3 procedures, each one with the code simplified to the particular type. Combining this with batching should further speed things up.
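As an illustration only (not the OP's tested code), a P1-only variant combining the LAST_INSERT_ID trick above might collapse to something like:
CREATE PROCEDURE process_p1_only(IN parameter1 VARCHAR(128))
BEGIN
    INSERT INTO `collection1` (`parameter`) VALUES (parameter1)
        ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
    INSERT IGNORE INTO `combinations` (`hash`, `type`, `parameter1`, `parameter2`)
        VALUES (MD5(parameter1), 'P1Only', LAST_INSERT_ID(), NULL);
END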
Potential issues
I think these two will get the same hash. Hence, only one row for these two:
("xyz", NULL)
(NULL, "xyz")
Be aware that INSERT IGNORE will burn ids if there is already a row with the given unique key. Because of this, keep an eye on running out of values with INT (only 2 billion). Changing to INT UNSIGNED would up it to 4B, still in 4 bytes.

Related

Creating Primary key from 2 autonumber and constant letter when creating table

I am new to MySQL and would like to create a table where a constant letter depicting the department is added to an auto-increment number. This way I would be able to identify the category of the worker upon viewing the ID.
Ex. Dept A and employee 135: the ID I am imagining should read A135 or something similar. I have created the table, the auto increment works fine, and the constant letter has been declared and is present. However, I would like to concatenate them in order to use A135 as the primary key.
Any Help Please?
This is quite tricky, and you would probably be better off doing manual concatenation in a select query.
But since you asked for it...
In normal usage you would have used a computed column for this, but they do not support using autoincremented columns in their declaration. So you would need to use triggers:
on insert, query information_schema.tables to retrieve the autoincremented id that is about to be assigned and use it to generate the custom id
on update, reset the custom id
Consider the following table structure:
create table workers (
    id int auto_increment primary key,
    name varchar(50) not null,
    dept varchar(1) not null,
    custom_id varchar(12)
);
Here is the trigger for insert:
delimiter //
create trigger trg_workers_insert before insert ON workers
for each row
begin
    if new.custom_id is null then
        select auto_increment into @nextid
        from information_schema.tables
        where table_name = 'workers' and table_schema = database();
        set new.custom_id = CONCAT(new.dept, lpad(@nextid, 11, 0));
    end if;
end
//
delimiter ;
And the trigger for update:
delimiter //
create trigger trg_workers_update before update ON workers
for each row
begin
    if new.dept is not null then
        set new.custom_id = CONCAT(new.dept, lpad(old.id, 11, 0));
    end if;
end
//
delimiter ;
Let's run a couple of inserts for testing:
insert into workers (dept, name) values ('A', 'John');
insert into workers (dept, name) values ('B', 'Jim');
select * from workers;
| id | name | dept | custom_id |
| --- | ---- | ---- | ------------ |
| 1 | John | A | A00000000001 |
| 2 | Jim | B | B00000000002 |
And let's test the update trigger
update workers set dept = 'C' where name = 'Jim';
select * from workers;
| id | name | dept | custom_id |
| --- | ---- | ---- | ------------ |
| 1 | John | A | A00000000001 |
| 2 | Jim | C | C00000000002 |
Demo on DB Fiddle
Sorry, my answer does not fit in a comment.
I agree with @GMB.
This is a tricky situation, and in some cases (mainly SELECTs) it will lead to a performance risk, because you'll have to split the PK in WHERE clauses, which is not recommended.
Having a column for the department and another for the auto_increment is more logical. The only drawback is that to know the number of employees per department you'll have to do a COUNT grouping by dept, instead of a MAX() over your split concatenated PK, which comes at a high performance cost.
Let atomic and logical data remain in separate columns. I would suggest creating a third column with the concatenated value.
If, for some company reason, you need B1 and A1 values for employees of different departments, I'd suggest having 3 columns, as sketched below:
Col1 - letter (not null)
Col2 - ID (not auto-increment, but calculated as in @GMB's solution) (not null)
Col3 - concatenation of Col1 and Col2 (not null)
PK (Col1, Col2)
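A rough sketch of that 3-column layout (all names here are illustrative, and Col2 would be maintained by the application or by a trigger, as in @GMB's answer):
create table workers (
    dept      char(1)     not null,  -- Col1: department letter
    dept_id   int         not null,  -- Col2: number calculated per department
    custom_id varchar(12) not null,  -- Col3: concatenation of the two, e.g. 'A135'
    name      varchar(50) not null,
    primary key (dept, dept_id)
);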

Moving hex data from a varchar type field to bigint type (mysql)

I am trying to insert data from one table into another, and each table has an 'id' field that should be the same but is stored as a different datatype. This 'id' field should represent the same unique value, allowing me to update from one to another.
In one table (the new.table one) the 'id' is stored as datatype varchar(35), and in the old.table it is datatype bigint(20) -- I believe this older table represents the integer version of the hex value stored in the new one. I am trying to update data from the new.table back into the old.table.
After searching about this for a while, when I try this simple MySQL query it fails:
INSERT INTO old.table (id, field2)
SELECT CAST(CONV(id,16,10) AS UNSIGNED INTEGER), field2
FROM new.table;
It fails with this error:
Out of range value for column 'id' at row 1
I have also tried a simple
SELECT CAST(CONV(id, 16, 10) AS UNSIGNED INTEGER) FROM new.table;
and the result is mostly all the same integer, even though each hex value in new.table is unique. I've googled this for two days and could really use some help figuring out what is wrong. Thanks.
EDIT: Some example data from the console output of SELECT id FROM new.table:
| 1d2353560110956e1b3e8610a35d903a |
| ec526762556c4f92a3ea4584a7cebfe1.11 |
| 34b8c838c18a4c5690514782b7137468.16 |
| 1233fa2813af44ca9f25bb8cac05b5b5.16 |
| 37f396d9c6e04313b153a34ab1e80304.16 |
The problem is that the id values are too high.
When CONV() overflows, MySQL returns the upper limit of BIGINT UNSIGNED (18446744073709551615), so different inputs collapse to the same value.
Query 1:
select CONV('FFFFFFFFFFFFFFFF1',16,10)
Results:
| CONV('FFFFFFFFFFFFFFFF1',16,10) |
|---------------------------------|
| 18446744073709551615 |
Query 2:
select CONV('FFFFFFFFFFFFFFFF',16,10)
Results:
| CONV('FFFFFFFFFFFFFFFF',16,10) |
|--------------------------------|
| 18446744073709551615 |
I would suggest you implement the id-conversion logic for your case in your own function instead of using the CONV() function.
EDIT
I would use a variable to generate a new row number and insert it into the old table.
CREATE TABLE new(
Id varchar(35)
);
insert into new values ('1d2353560110956e1b3e8610a35d903a');
insert into new values ('ec526762556c4f92a3ea4584a7cebfe1.11');
insert into new values ('34b8c838c18a4c5690514782b7137468.16');
insert into new values ('1233fa2813af44ca9f25bb8cac05b5b5.16');
insert into new values ('37f396d9c6e04313b153a34ab1e80304.16');
CREATE TABLE old(
Id bigint(20),
val varchar(35)
);
INSERT INTO old (id, val)
SELECT rn, id
FROM (
    SELECT *, (@Rn := @Rn + 1) rn
    FROM new CROSS JOIN (SELECT @Rn := 0) v
) t1
Query 1:
SELECT * FROM old
Results:
| Id | val |
|----|-------------------------------------|
| 1 | 1d2353560110956e1b3e8610a35d903a |
| 2 | ec526762556c4f92a3ea4584a7cebfe1.11 |
| 3 | 34b8c838c18a4c5690514782b7137468.16 |
| 4 | 1233fa2813af44ca9f25bb8cac05b5b5.16 |
| 5 | 37f396d9c6e04313b153a34ab1e80304.16 |

Is it possible to do a select and update it with new value in the same query?

I have got the following table structure:
mysql> desc test;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id | varchar(19) | NO | PRI | | |
| name | varchar(19) | YES | | NULL | |
| age | varchar(19) | YES | | NULL | |
+-------+-------------+------+-----+---------+-------+
3 rows in set (0.05 sec)
Initially I have done an insert as shown:
insert into test (id, name, age) values("1", "A", 19);
My requirement is that I need to extract the age of id "1" and add some integer to the existing age.
I have seen the example below; can this be useful in my case?
insert into test (id, name, age) values("1", "A", 30) on duplicate key update age=values(age)
I am using Java; I have 300 symbols for which I need to update continuously.
Is it possible to do a select and update the existing column with a new value in the same query?
For example,
how can I get the existing age 19 and add 30 to it in the same query for id 1?
(This question has already been answered by Marc B in the comments section.)
Yes, the statement the OP has posted will work just fine, because there is a primary key constraint on the id column, and INSERT ... ON DUPLICATE KEY UPDATE will cause an UPDATE of the existing row.
To "add" the value being inserted to the value that already exists in the column, we'd assign an expression that does that operation to the column:
e.g.
insert into test (id, name, age) values("1", "A", 30)
on duplicate key update age = age + values(age)
Note that the only change required in the OP's statement is the reference to the existing column value and the addition operation.
N.B. If either the existing value in the column, or the new value being supplied in the INSERT statement is NULL, the result of the expression will be NULL. A different expression would be needed if this is undesired behavior.
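If NULLs should simply be treated as zero, one such expression (a sketch only, not something the OP's data necessarily requires) could be:
insert into test (id, name, age) values("1", "A", 30)
on duplicate key update age = coalesce(age, 0) + coalesce(values(age), 0)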
This should work:
UPDATE test SET age=age+1 WHERE id=1;

How to get a unique primary key from a range select in concurrent table reads

Morning,
I've got multiple clients trying to get a unique primary key on a table.
A row identified by that PK is considered "valid" only if it matches a successful range scan. The range scan is SELECT id FROM lookup WHERE allowed='Y' AND updated<=NOW() LIMIT 1
+------------+---------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| fullname | varchar(250) | NO | UNI | 0 | |
| allowed | enum('Y','N') | NO | MUL | N | |
| updated | timestamp | NO | | CURRENT_TIMESTAMP | |
| hits | smallint(6) | NO | MUL | 0 | |
| stop_allow | enum('Y','N') | NO | MUL | N | |
+------------+---------------+------+-----+-------------------+----------------+
Once that first select is done, another SELECT is executed in order to retrieve the content.
The problem is that many clients are doing the same thing at the same time (or they do randomly find a way to match each other grrrr...).
So far, I've tried:
1)
start transaction;
*range scan* LIMIT 1 FOR UPDATE;
SELECT * from lookup WHERE id=(result of the range scan);
*perform stuff*
commit;
This is a performance killer. Stuff is locked forever and "Mysql server goes to heaven" after some time.
2)
start transaction;
*range scan*
SELECT * from lookup WHERE id=(result of the previous query) FOR UPDATE;
*perform stuff*
commit;
This fails miserably with autocommit=0, but it is quite fast
3) At this point, I'm starting to think that transactions are the problem
no transaction;
//get a row that is not being processed
*range scan* LEFT OUTER JOIN temp_mem_table WHERE temp_mem_table.id IS NULL
$rid = (result of the range scan)
//check if another client is doing the same thing, if so then stop here
select 1 from temp_mem_table WHERE id=$rid
//if there is a result => return null; this is not enough to block stuff going through
//signal to other client that this ID is being processed
insert into temp_mem_table(id) values($rid)
//get the content
SELECT * from lookup WHERE id=($rid);
*perform time intensive operations*
Edit: the temp_mem_table is in fact a memory table that is flushed once in a while. It looks like this:
CREATE TABLE temp_mem_table(id int(11), primary key(id)) engine=memory
Thought process is: if what's being processed is stored on a memory table accessible to all clients, then they should be able to know what their friends are doing. The check should stop any further processing. But somehow they find a way to go through :(
After a short period of time, it appears that almost 50% of those primary keys were processed at least twice.
I'm going to find a way of doing this, but maybe some of you encountered a similar situation and can help.
thanx
OK, for those who encountered the famous "How do I select an unlocked row in MySQL?" as seen here http://bugs.mysql.com/bug.php?id=49763 and in a lot of other places, here is a dirty hack to solve it.
This is done in REPEATABLE READ mode, which should be ACID over 9000, or at least won't break anything (maybe).
The starting point is to have some kind of 'range' of rows that need to be locked for reads so other clients won't get them no matter what.
SELECT pk FROM tbl LIMIT 0,10
SELECT pk FROM tbl where *large range scan*
I do create a memory table (because it should be faster) such as:
CREATE TABLE `jobs` (
    `pid` smallint(6) DEFAULT NULL,
    `tid` int(11) DEFAULT NULL,
    UNIQUE KEY `pid` (`pid`),
    UNIQUE KEY `tid` (`tid`)
) ENGINE=MEMORY DEFAULT CHARSET=utf8
Pid is a unique identifier of the client; in my case it's the actual process id.
Tid is the task id, which matches the primary key of that huge table on which we perform some kind of range scan.
Then the pseudo-code is like this:
SELECT pk from tbl WHERE (range scan) or limit 0,100
delete from jobs where pid=$my_pid
foreach of those pk do
    if (insert IGNORE into jobs(pid,tid) values($my_pid, pk)) break;
done;
select pk from jobs where pid=$my_pid
select * from big_tbl where id=pk
Have tested this with 2,10,25,50 and 100 concurrent clients and got 100% unique distribution of tasks across each client.
Now this might not be super complicated, or might not look elegant but I don't give a damn as long as CPU stays cool.
Can you add another column to the table to indicate that the row is being processed, and by whom? Then you could do:
START TRANSACTION;
UPDATE lookup SET owner=<client id>
WHERE id=( SELECT id FROM *range scan* ...
AND owner IS NULL
AND completed = false
FOR UPDATE);
COMMIT;
*do stuff*
UPDATE lookup SET owner=NULL, completed=true,... WHERE owner=<client id>;
The final UPDATE will never cause a conflict as long as every client has its own unique ID, and the initial SELECT can be LIMITed, and with proper indexing ought to be quite fast.
It is important that the last UPDATE keeps the row unselectable by the other clients. That is, the initial SELECT gets those rows where owner is NULL and completed is false; the first UPDATE makes them unselectable in that they now have an owner; the final UPDATE keeps them unselectable in that they are now completed.
Note: I hadn't realized this solution had already been proposed in a comment by user Kickstart.
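A minimal sketch of the same claim-then-read idea against the lookup table shown above, written without the self-referencing subquery (which MySQL tends to reject with error 1093); owner and completed are the assumed new columns, and 'client-42' stands in for the client id:
-- claim one available row for this client
UPDATE lookup
   SET owner = 'client-42'
 WHERE allowed = 'Y' AND updated <= NOW()
   AND owner IS NULL AND completed = 0
 LIMIT 1;
-- fetch the claimed row and work on it
SELECT * FROM lookup WHERE owner = 'client-42' AND completed = 0;
-- release it as done
UPDATE lookup SET owner = NULL, completed = 1 WHERE owner = 'client-42';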

Fastest way to diff datasets and update/insert lots of rows into large MySQL table?

The schema
I have a MySQL database with one large table (5 million rows say). This table has several fields for actual data, an optional comment field, and fields to record when the row was first added and when the data is deleted. To simplify to one "data" column, it looks a bit like this:
+----+------+---------+---------+----------+
| id | data | comment | created | deleted |
+----+------+---------+---------+----------+
| 1 | val1 | NULL | 1 | 2 |
| 2 | val2 | nice | 1 | NULL |
| 3 | val3 | NULL | 2 | NULL |
| 4 | val4 | NULL | 2 | 3 |
| 5 | val5 | NULL | 3 | NULL |
This schema allows us to look at any past version of the data thanks to the created and deleted fields e.g.
SET @version=1;
SELECT data, comment FROM MyTable
WHERE created <= @version AND
      (deleted IS NULL OR deleted > @version);
+------+---------+
| data | comment |
+------+---------+
| val1 | NULL |
| val2 | nice |
The current version of the data can be fetched more simply:
SELECT data, comment FROM MyTable WHERE deleted IS NULL;
+------+---------+
| data | comment |
+------+---------+
| val2 | nice |
| val3 | NULL |
| val5 | NULL |
DDL:
CREATE TABLE `MyTable` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(32) NOT NULL,
`comment` varchar(32) DEFAULT NULL,
`created` int(11) NOT NULL,
`deleted` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `data` (`data`,`comment`)
) ENGINE=InnoDB;
Updating
Periodically a new set of data and comments arrives. This can be fairly large, half a million rows say. I need to update MyTable so that this new data set is stored in it. This means:
"Deleting" old rows. Note the "scare quotes" - we don't actually delete rows from MyTable. We have to set the deleted field to the new version N. This has to be done for all rows in MyTable that are in the previous version N-1, but are not in the new set.
Inserting new rows. All rows that are in the new set and are not in version N-1 in MyTable must be added as new rows with the created field set to the new version N, and deleted as NULL.
Some rows in the new set may match existing rows in MyTable at version N-1 in which case there is nothing to do.
My current solution
Given that we have to "diff" two sets of data to work out the deletions, we can't just read over the new data and do insertions as appropriate. I can't think of a way to do the diff operation without dumping all the new data into a temporary table first. So my strategy goes like this:
-- temp table uses MyISAM for speed.
CREATE TEMPORARY TABLE tempUpdate (
`data` char(32) NOT NULL,
`comment` char(32) DEFAULT NULL,
PRIMARY KEY (`data`),
KEY (`data`, `comment`)
) ENGINE=MyISAM;
-- Bulk insert thousands of rows
INSERT INTO tempUpdate VALUES
('some new', NULL),
('other', 'comment'),
...
-- Start transaction for the update
BEGIN;
SET @newVersion = 5; -- Worked out out-of-band
-- Do the "deletions". The join selects all non-deleted rows in MyTable for
-- which the matching row in tempUpdate does not exist (tempUpdate.data is NULL)
UPDATE MyTable
LEFT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
SET MyTable.deleted = @newVersion
WHERE tempUpdate.data IS NULL AND
MyTable.deleted IS NULL;
-- Delete all rows from the tempUpdate table that match rows in the current
-- version (deleted is null) to leave just new rows.
DELETE tempUpdate.*
FROM MyTable RIGHT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
WHERE MyTable.id IS NOT NULL AND
MyTable.deleted IS NULL;
-- All rows left in tempUpdate are new so add them.
INSERT INTO MyTable (data, comment, created)
SELECT DISTINCT tempUpdate.data, tempUpdate.comment, @newVersion
FROM tempUpdate;
COMMIT;
DROP TEMPORARY TABLE IF EXISTS tempUpdate;
The question (at last)
I need to find the fastest way to do this update operation. I can't change the schema for MyTable, so any solution must work with that constraint. Can you think of a faster way to do the update operation, or suggest speed-ups to my existing method?
I have a Python script for testing the timings of different update strategies and checking their correctness over several versions. It's fairly long, but I can edit it into the question if people think it would be useful.
One speed-up, for the loading step, is LOAD DATA INFILE.
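For instance, assuming the new data arrives as a tab-separated file (the path and format here are illustrative, and local_infile must be enabled):
LOAD DATA LOCAL INFILE '/tmp/new_data.tsv'
INTO TABLE tempUpdate
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(data, comment);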
Insofar as I've experienced audit logging, you'll be better off with two tables, e.g.:
yourtable (id, col1, col2, version) -- pkey on id
yourtable_logs (id, col1, col2, version) -- pkey on (id, version)
Then add an update trigger on yourtable, which inserts the previous version in yourtable_logs.
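A rough sketch of such a trigger, using the illustrative column names above (not the OP's actual schema):
DELIMITER //
CREATE TRIGGER trg_yourtable_update
BEFORE UPDATE ON yourtable
FOR EACH ROW
BEGIN
    -- archive the row as it was before this update
    INSERT INTO yourtable_logs (id, col1, col2, version)
    VALUES (OLD.id, OLD.col1, OLD.col2, OLD.version);
    SET NEW.version = OLD.version + 1;
END //
DELIMITER ;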