Using MySQL's INSERT IGNORE to prevent duplicate entries (performance issues?) - mysql

I have a table where measurements of a sensor are saved. A row contains the value of the measurement, the id (PK, auto-increment), and a random number num (about 10 digits long, or even longer).
CREATE TABLE `table` (
`id` int(10) UNSIGNED NOT NULL,
`value` float NOT NULL,
`num` int(10) UNSIGNED NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Now after some weeks/months the table could contain thousands and thousands of rows.
My Question:
My system requires that the random number is unique (two or more measurements/rows with the same random number are not acceptable).
Now I have done some research and there is the neat INSERT IGNORE statement.
But I'm not sure whether it's smart to use it in my case: after some time there might be very many rows, and checking every row in the table for a match against the random number that is about to be inserted might become overkill and drastically impact performance.
Any thoughts?

Use the INSERT IGNORE command rather than the INSERT command. If a record doesn't duplicate an existing record, then MySQL inserts it as usual. If the record is a duplicate, then the IGNORE keyword tells MySQL to discard it silently without generating an error.
Also add a UNIQUE constraint on the random-number column; INSERT IGNORE only skips duplicates when a UNIQUE (or PRIMARY KEY) constraint exists. The constraint is backed by an index, so the duplicate check is a fast index lookup rather than a scan of the whole table, and the same index speeds up retrieval by that column.
https://www.w3schools.com/sql/sql_create_index.asp
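A minimal sketch of the idea, assuming the table is as posted and id is the AUTO_INCREMENT primary key as described (the constraint name and sample values are placeholders):
-- Make `num` unique; the backing index is what keeps the duplicate check cheap
ALTER TABLE `table` ADD CONSTRAINT `uq_num` UNIQUE (`num`);

-- A duplicate `num` is now silently skipped instead of raising an error
INSERT IGNORE INTO `table` (`value`, `num`) VALUES (23.5, 4282657349);
With the unique index in place each insert costs one index lookup rather than a scan of the whole table, so it stays fast even with millions of rows.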

Related

MySQL Large Table Sharding to Smaller Table based on Unique ID

We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always filter on serial_number and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily reference as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
The application always accesses the data by device first.
We have checked that there is no query that accesses the data without specifying the device serial number first.
The table for each device will be relatively small (9000 rows per day).
A few challenges that we think we will face is:
We have a lot of devices. This means there will be a lot of device_data_<serial_number> tables too. I have checked that MySQL does not impose a limit on the number of tables in a database. Will this impact performance compared to keeping them all in one table?
How will this affect us later when we want to scale MySQL (e.g. using master/slave replication, etc.)?
Are there other alternatives / solutions for resolving this?
Update: below is the SHOW CREATE TABLE result for our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).
Other Comments:
There are very few use cases for which partitioning actually speeds up processing.
Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&A on Stack Overflow shouting not to do that.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data. This is because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). If you are interested in that, my PK suggestion still stands, and a sketch follows after these comments. (And the query in question will run about the same speed with or without this partitioning.)
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
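Purge-partitioning sketch (hypothetical), expanding on the purge comment above: dt here is a TIMESTAMP column, so MySQL only permits UNIX_TIMESTAMP() in the partitioning expression (rather than TO_DAYS()), and the PRIMARY KEY must first be changed to include dt as suggested. Partition names and date boundaries are examples only.
ALTER TABLE test_udp_new
  PARTITION BY RANGE (UNIX_TIMESTAMP(dt)) (
    PARTITION p2018_01 VALUES LESS THAN (UNIX_TIMESTAMP('2018-02-01 00:00:00')),
    PARTITION p2018_02 VALUES LESS THAN (UNIX_TIMESTAMP('2018-03-01 00:00:00')),
    PARTITION p_max VALUES LESS THAN MAXVALUE
  );
-- Purging a month is then a quick metadata operation instead of a huge DELETE:
ALTER TABLE test_udp_new DROP PARTITION p2018_01;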
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id)
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
By starting the PK with device_sn, dt, you get the clustering benefit for the query with WHERE device_sn = .. AND dt BETWEEN ...
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because it is probably helped enough by the new PK.
lng decimal(10,5) -- You don't need 5 digits to the left of the decimal point; only 3 (for lng) or 2 (for lat). So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.
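A hedged sketch of applying those key changes in a single statement (assuming the table as shown; the rebuild copies the whole 44-million-row table, so run it in a maintenance window):
ALTER TABLE test_udp_new
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (device_sn, dt, id),
  ADD INDEX (id),                                    -- keeps AUTO_INCREMENT happy
  DROP INDEX dt,                                     -- redundant: device_sn_2 (dt, device_sn) already starts with dt
  DROP INDEX test_udp_new_device_sn_dt_index,        -- covered by the new PK
  DROP INDEX test_udp_new_device_sn_data_dt_index;   -- probably helped enough by the new PK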

MySQL Partitioning a VARCHAR(60)

I have a very large table (500 million rows) with the following columns:
id - Bigint - Autoincrementing primary index.
date - Datetime - Approximately 1.5 million rows per date; data older than 1 year is deleted.
uid - VARCHAR(60) - A user ID
sessionNumber - INT
start - INT - epoch of start time.
end - INT - epoch of end time.
More columns not relevant for this query.
The combination of uid and sessionNumber forms a unique index. I also have an index on date.
Due to the sheer size, I'd like to partition the table.
Most of my accesses would be by date, so partitioning by date ranges seems intuitive, but as the date is not part of the unique index, this is not an option.
Option 1: RANGE PARTITION on Date and BEFORE INSERT TRIGGER
I don't really have a regular issue with the uid and sessionNumber uniqueness being violated. The source data is consistent, but sessions that span two days may be inserted on two consecutive days with midnight being the end time of the first and start time of the second.
I'm trying to understand if I could remove the unique key and instead use a trigger that would
check if there is a session with the same identifiers on the previous day and, if so,
update the end date, and
cancel the actual insert.
However, I am not sure if I can 1) trigger an update on the same table, or 2) prevent the actual insert.
Option 2: LINEAR HASH PARTITION on UID
My second option is to use a linear hash partition on the UID. However, I cannot find any example that takes a VARCHAR and converts it to an INTEGER for use in HASH partitioning, and I cannot find a permitted way to do that conversion. For example
ALTER TABLE mytable
PARTITION BY HASH (CAST(md5(uid) AS UNSIGNED integer))
PARTITIONS 20
returns that the partition function is not allowed.
HASH partitioning must work with a 32-bit integer. But you can't convert an MD5 string to an integer simply with CAST().
Instead of MD5, CRC32() can take an arbitrary string and convert it to a 32-bit integer. But this is also not a valid function for partitioning.
mysql> alter table v partition by hash(crc32(uid));
ERROR 1564 (HY000): This partition function is not allowed
You could partition by the string using KEY partitioning instead of HASH partitioning. KEY partitioning accepts strings. It passes whatever string you give it through MySQL's built-in PASSWORD() function, which is basically based on SHA1.
However, this leads to another problem with your partitioning strategy:
mysql> alter table v partition by key(uid);
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
Your table's primary key id does not include the column uid that you want to partition by. This is a restriction of MySQL's partitioning:
every unique key on the table must use every column in the table's partitioning expression.
Here's the table I'm testing with (it would have been a good idea for you to include this in your question):
CREATE TABLE `v` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`uid` varchar(60) NOT NULL,
`sessionNumber` int(11) NOT NULL,
`start` int(11) NOT NULL,
`end` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`,`sessionNumber`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Before going any further, I have to wonder why you want to use partitioning anyway? "Sheer size" is not a reason to partition a table.
Partitioning, like any optimization, is done for the sake of specific queries you want to optimize for. Any optimization improves one query at the expense of other queries. Optimization has nothing to do with the table. The table is happy to sit there with 5 billion rows, and it doesn't care. Optimization is for the queries.
So you need to know which queries you want to optimize for. Then decide on a strategy. Partitioning might not be the best strategy for the set of queries you need to optimize!
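For completeness, if you do decide to partition this table by uid, a hypothetical sketch of what MySQL would accept: widen the primary key so it contains uid (which means id alone is no longer declared unique), after which KEY partitioning on the string column works. The partition count is an arbitrary example.
ALTER TABLE v
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, uid);   -- every unique key must now include uid; (uid, sessionNumber) already does

ALTER TABLE v
  PARTITION BY KEY (uid)
  PARTITIONS 20;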
I'll assume your 'uid' is a 128-bit UUID kind of value, which can be stored as a BINARY(16), because that is generally worth the trouble.
Next, stay away from the 'datetime' type, as it is stored like a packed string, and doesn't hold any timezone information. Store date-time-values either as pure numerical values (the number of seconds since the UNIX-epoch), or let MySQL do that for you and use the timestamp(N) type.
Also don't call a column 'date', not just because that is a reserved word, but also because the value contains time details too.
Next, stay away from using anything other than latin1 as the CHARSET of (all) your tables. Only ever do UTF-8-ness at the column level. This is to prevent unnecessarily wide columns and indexes creeping in over time. Adopt this habit and you'll happily look back on it after some years, I promise.
This makes the table look like:
CREATE TABLE `v` (
`uuid` binary(16) NOT NULL,
`mysql_created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`visitor_uuid` BINARY(16) NOT NULL,
`sessionNumber` int NOT NULL,
`start` int NOT NULL,
`end` int NOT NULL,
PRIMARY KEY (`uuid`),
UNIQUE KEY (`visitor_uuid`,`sessionNumber`),
KEY (`mysql_created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS (`uuid`)
( PARTITION `p_0` VALUES LESS THAN (X'10')
, PARTITION `p_1` VALUES LESS THAN (X'20')
...
, PARTITION `p_9` VALUES LESS THAN (X'A0')
, PARTITION `p_A` VALUES LESS THAN (X'B0')
...
, PARTITION `p_F` VALUES LESS THAN (MAXVALUE)
);
To make the KEY (mysql_created_at) cover only the date part needs a generated column, which can be added in place, and an index on it is also cheap to add, so I'll leave that as homework.
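As a hedged sketch of that homework (assumes MySQL 5.7+ generated columns; the column name created_date is made up):
ALTER TABLE v
  ADD COLUMN created_date DATE AS (DATE(mysql_created_at)) VIRTUAL;
ALTER TABLE v
  ADD KEY (created_date);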

Concatenating a string to an auto-incremented column which functions as primary key

Having some trouble putting together a table with a unique value. The current setup I have is for two tables which, for all intents and purposes, can be the same as the one below. My problem is that I'm trying to use the auto-incremented value as the primary key due to redundancies in the data pulls, but since it's for two tables, I want to concatenate a string to the auto-incremented value so my ID column would be:
Boop1, Boop2, Boop3 and Beep1, Beep2, Beep3, instead of 1, 2, 3 for both tables, so they are differentiated and thus do not have duplicate values when I put in constraints.
CREATE TABLE IF NOT EXISTS `beep`.`boop` (
`ID` INT NOT NULL AUTO_INCREMENT,
`a` VARCHAR(15) NOT NULL,
`b` VARCHAR(255) NOT NULL,
`c` VARCHAR(50) NOT NULL,
`d` VARCHAR(50) NOT NULL,
PRIMARY KEY(`ID`))
ENGINE = InnoDB;
LOAD DATA LOCAL INFILE 'blah.csv'
INTO TABLE `beep`.`boop`
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES SET DCMID = CONCAT('DCM'+`DCMID`);
The SET clause at the end is optional and was only there to try concatenating, which I already know does not work.
I realize that this would not work since my datatype is an INT, so what would I have to do to keep my auto-increment while differentiating the two tables?
For reference, I am using LOAD DATA LOCAL INFILE, not INSERT (and I don't think bulk insert is available with MySQL Workbench); otherwise, I would bulk insert and just utilize LAST_INSERT_ID.
The goal is plug-and-play for a data pull I perform, so I can archive my data quickly and run queries to grab the data I need in the future. Using one INSERT line per row of data I have would be extremely inefficient.
I was using a DELIMITER block with a trigger earlier, which in theory would have worked by altering the table after the LOAD DATA INFILE, but that requires SUPER privileges, which I do not have.
Is what I'm asking for even possible, or should I give up and find a workaround, or try to get SUPER privileges and use the delimiter/trigger approach?
I'm not sure why you would do that, but you could use two different indexes. The first one is the auto-increment, and is populated by MySQL. The second one is your "prefixed" key, and is created by a trigger called after insert, where you update the column based on the first key and the prefix that you want.
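One wrinkle with the trigger idea: MySQL will not let an AFTER INSERT trigger modify the table the insert fired on, so a practical variant is to populate the prefixed column with a plain UPDATE once the LOAD DATA has finished. A hypothetical sketch (the display_id column name is made up, and the CSV layout is assumed to match columns a-d):
ALTER TABLE `beep`.`boop` ADD COLUMN `display_id` VARCHAR(20) NULL UNIQUE;

LOAD DATA LOCAL INFILE 'blah.csv'
INTO TABLE `beep`.`boop`
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES
(`a`, `b`, `c`, `d`);

UPDATE `beep`.`boop`
SET `display_id` = CONCAT('Boop', `ID`)
WHERE `display_id` IS NULL;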

How to improve INSERT performance on a very large MySQL table

I am working on a large MySQL database and I need to improve INSERT performance on a specific table. This one contains about 200 million rows and its structure is as follows:
(a little premise: I am not a database expert, so the code I've written could be based on wrong foundations. Please help me to understand my mistakes :) )
CREATE TABLE IF NOT EXISTS items (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(200) NOT NULL,
`key` VARCHAR(10) NOT NULL,
busy TINYINT(1) NOT NULL DEFAULT 1,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL,
PRIMARY KEY (id, name),
UNIQUE KEY name_key_unique_key (name, `key`),
INDEX name_index (name)
) ENGINE=MyISAM
PARTITION BY LINEAR KEY(name)
PARTITIONS 25;
Every day I receive many CSV files in which each line is composed of the pair "name;key", so I have to parse these files (adding the values created_at and updated_at for each row) and insert the values into my table. In this one, the combination of "name" and "key" MUST be UNIQUE, so I implemented the insert procedure as follows:
CREATE TEMPORARY TABLE temp_items (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(200) NOT NULL,
`key` VARCHAR(10) NOT NULL,
busy TINYINT(1) NOT NULL DEFAULT 1,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL,
PRIMARY KEY (id)
)
ENGINE=MyISAM;
LOAD DATA LOCAL INFILE 'file_to_process.csv'
INTO TABLE temp_items
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
(name, `key`, created_at, updated_at);
INSERT INTO items (name, `key`, busy, created_at, updated_at)
(
SELECT temp_items.name, temp_items.`key`, temp_items.busy, temp_items.created_at, temp_items.updated_at
FROM temp_items
)
ON DUPLICATE KEY UPDATE busy=1, updated_at=NOW();
DROP TEMPORARY TABLE temp_items;
The code just shown allows me to reach my goal, but it takes about 48 hours to complete, and this is a problem.
I think this poor performance is caused by the fact that, for each insertion, the script must check against a very large table (200 million rows) whether the pair "name;key" is unique.
How can I improve the performance of my script?
Thanks to all in advance.
You can use the following methods to speed up inserts:
If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster.
When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements.
Take advantage of the fact that columns have default values. Insert values explicitly only when the value to be inserted differs from the default. This reduces the parsing that MySQL must do and improves the insert speed.
Reference: MySQL.com: 8.2.4.1 Optimizing INSERT Statements
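For example, the first point above means batching several rows into one statement instead of one INSERT per row; a hypothetical sketch against the items table (sample values made up), combined with the duplicate handling the question already uses:
INSERT INTO items (name, `key`, created_at, updated_at) VALUES
  ('alpha', 'k1', NOW(), NOW()),
  ('beta',  'k2', NOW(), NOW()),
  ('gamma', 'k3', NOW(), NOW())
ON DUPLICATE KEY UPDATE busy = 1, updated_at = NOW();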
Your linear key on name and the large indexes slow things down.
LINEAR KEY needs to be calculated on every insert.
http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html
Can you show us some example data from file_to_process.csv? Maybe a better schema should be built.
Edit: looked more closely
INSERT INTO items (name, `key`, busy, created_at, updated_at)
(
SELECT temp_items.name, temp_items.`key`, temp_items.busy, temp_items.created_at, temp_items.updated_at
FROM temp_items
)
This will probably create an on-disk temporary table; this is very, very slow, so you should not use it if you want more performance. Or maybe you should check some MySQL config settings like tmp_table_size and max_heap_table_size; maybe these are misconfigured.
There is a piece of documentation I would like to point out, Speed of INSERT Statements.
Thinking in Java:
Divide the object list into partitions and generate a batch insert statement for each partition.
Utilize CPU cores and available DB connections efficiently; nice new Java features can help to achieve parallelism easily (e.g. parallel streams, fork/join), or you can create your own thread pool sized to the number of CPU cores you have and feed your threads from a centralized blocking queue in order to invoke batch insert prepared statements.
Decrease the number of indexes on the target table if possible. If a foreign key is not really needed, just drop it. Fewer indexes, faster inserts.
Avoid using Hibernate except for CRUD operations; always write SQL for complex selects.
Decrease the number of joins in your query; instead of forcing the DB, use Java streams for filtering, aggregating and transformation.
If you do not have to, do not combine selects and inserts in one SQL statement.
Add rewriteBatchedStatements=true to your JDBC string; it will help to decrease TCP-level communication between app and DB.
Use @Transactional for the methods that carry out the batch inserts and write rollback methods yourself.
You could use
load data local infile ''
REPLACE
into table
etc...
The REPLACE ensures that any duplicate value is overwritten with the new values.
Add a SET updated_at=now() at the end and you're done.
There is no need for the temporary table.
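A hedged sketch of that one-step load, reusing the file layout from the question (the fourth CSV field is read into a throwaway variable so updated_at can be set server-side):
LOAD DATA LOCAL INFILE 'file_to_process.csv'
REPLACE INTO TABLE items
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
(name, `key`, created_at, @file_updated_at)
SET updated_at = NOW();
Note that REPLACE deletes and re-inserts any row that collides on the unique (name, key) index, so unlike ON DUPLICATE KEY UPDATE it also rewrites created_at with the value from the file.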

MySQL Auto-Inc Bug?

In my MySQL table I've created an ID column which I'm hoping to auto-increment in order for it to be the primary key.
I've created my table:
CREATE TABLE `test` (
`id` INT( 11 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`name` VARCHAR( 50 ) NOT NULL ,
`date_modified` DATETIME NOT NULL ,
UNIQUE (
`name`
)
) TYPE = INNODB;
then Inserted my records:
INSERT INTO `test` ( `id` , `name` , `date_modified` )
VALUES (
NULL , 'TIM', '2011-11-16 12:36:30'
), (
NULL , 'FRED', '2011-11-16 12:36:30'
);
I'm expecting that my ID's for the above are 1 and 2 (respectively). And so far this is true.
However when I do something like this:
insert into test (name) values ('FRED')
on duplicate key update date_modified=now();
and then insert a new record, I'm expecting its ID to be 3; however, I'm now shown an ID of 4, skipping the spot for 3.
Normally this wouldn't be an issue, but I'm dealing with millions of records that get thousands of updates every day, and I don't want to even have to think about running out of IDs simply because I'm skipping a ton of numbers.
Any clue why this is happening?
MySQL version: 5.1.44
Thank you
My guess is that the INSERT itself kicks off the code that generates the next ID number. When the duplicate key is detected, and ON DUPLICATE KEY UPDATE is executed, the ID number is abandoned. (No SQL dbms guarantees that automatic sequences will be without gaps, AFAIK.)
MySQL docs say
In general, you should try to avoid using an ON DUPLICATE KEY UPDATE
clause on tables with multiple unique indexes.
That page also says
If a table contains an AUTO_INCREMENT column and INSERT ... ON
DUPLICATE KEY UPDATE inserts or updates a row, the LAST_INSERT_ID()
function returns the AUTO_INCREMENT value.
which stops far short of describing the internal behavior I guessed at above.
Can't test here; will try later.
Is it possible to change your key to BIGINT UNSIGNED? 18,446,744,073,709,551,615 is a lot of records, which would delay running out of IDs.
Found this in the MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/example-auto-increment.html
Use a large enough integer data type for the AUTO_INCREMENT column to hold the
maximum sequence value you will need. When the column reaches the upper limit of
the data type, the next attempt to generate a sequence number fails. For example,
if you use TINYINT, the maximum permissible sequence number is 127.
For TINYINT UNSIGNED, the maximum is 255.
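A minimal sketch of widening the column (hypothetical; this rebuilds the table, so plan for it on millions of rows):
ALTER TABLE `test`
MODIFY `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;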
More reading here: http://dev.mysql.com/doc/refman/5.6/en/information-functions.html#function_last-insert-id. It could be inferred that the failed insert into a transactional table acts like a rollback, and the manual says "the value of LAST_INSERT_ID() is not restored to that before the transaction".
As a possible solution, what about using a separate table to generate the IDs and then inserting them into your main table as the PK using LAST_INSERT_ID()?
From the manual:
Create a table to hold the sequence counter and initialize it:
mysql> CREATE TABLE sequence (id INT NOT NULL);
mysql> INSERT INTO sequence VALUES (0);
Use the table to generate sequence numbers like this:
mysql> UPDATE sequence SET id=LAST_INSERT_ID(id+1);
mysql> SELECT LAST_INSERT_ID();
The UPDATE statement increments the sequence counter and causes the next call to
LAST_INSERT_ID() to return the updated value. The SELECT statement retrieves that
value. The mysql_insert_id() C API function can also be used to get the value.
See Section 20.9.3.37, “mysql_insert_id()”.
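Applied to the test table from the question, a hypothetical sketch of that pattern (run on one connection; LAST_INSERT_ID() is per-connection):
UPDATE sequence SET id = LAST_INSERT_ID(id + 1);

INSERT INTO test (id, name, date_modified)
VALUES (LAST_INSERT_ID(), 'FRED', NOW())
ON DUPLICATE KEY UPDATE date_modified = NOW();
Because id is supplied explicitly, no value is generated by AUTO_INCREMENT for this insert, at the cost of serializing inserts through the single row in the sequence table.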
It really is a bug, as you can see here: http://bugs.mysql.com/bug.php?id=26316
But, apparently, they fixed it in 5.1.47, and it was declared an InnoDB plugin problem.
A duplicate report of the same problem can be seen here too: http://bugs.mysql.com/bug.php?id=53791, which references the first report mentioned in this answer.