How to improve INSERT performance on a very large MySQL table

I am working on a large MySQL database and I need to improve INSERT performance on a specific table. It contains about 200 million rows and its structure is as follows:
(A small premise: I am not a database expert, so the code I've written may be built on wrong foundations. Please help me understand my mistakes :) )
CREATE TABLE IF NOT EXISTS items (
    id INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(200) NOT NULL,
    `key` VARCHAR(10) NOT NULL,
    busy TINYINT(1) NOT NULL DEFAULT 1,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    PRIMARY KEY (id, name),
    UNIQUE KEY name_key_unique_key (name, `key`),
    INDEX name_index (name)
) ENGINE=MyISAM
PARTITION BY LINEAR KEY(name)
PARTITIONS 25;
Every day I receive many CSV files in which each line consists of the pair "name;key", so I have to parse these files (adding the values created_at and updated_at for each row) and insert the values into my table. In this table, the combination of "name" and "key" MUST be UNIQUE, so I implemented the insert procedure as follows:
CREATE TEMPORARY TABLE temp_items (
    id INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(200) NOT NULL,
    `key` VARCHAR(10) NOT NULL,
    busy TINYINT(1) NOT NULL DEFAULT 1,
    created_at DATETIME NOT NULL,
    updated_at DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;
LOAD DATA LOCAL INFILE 'file_to_process.csv'
INTO TABLE temp_items
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
(name, `key`, created_at, updated_at);
INSERT INTO items (name, `key`, busy, created_at, updated_at)
SELECT name, `key`, busy, created_at, updated_at
FROM temp_items
ON DUPLICATE KEY UPDATE busy = 1, updated_at = NOW();
DROP TEMPORARY TABLE temp_items;
The code just shown achieves my goal but, to complete the execution, it takes about 48 hours, and this is a problem.
I think this poor performance is caused by the fact that, for each insertion, the script must check whether the pair "name;key" is unique against a very large table (200 million rows).
How can I improve the performance of my script?
Thanks to all in advance.

You can use the following methods to speed up inserts:
If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster.
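For illustration, here is a minimal sketch of the multi-row form against the items table from the question; the bulk_insert_buffer_size value is only an illustrative assumption, tune it to your memory budget:
-- One statement carrying several rows instead of one INSERT per row.
SET SESSION bulk_insert_buffer_size = 256 * 1024 * 1024; -- 256 MB, illustrative
INSERT INTO items (name, `key`, busy, created_at, updated_at)
VALUES
    ('name1', 'key1', 1, NOW(), NOW()),
    ('name2', 'key2', 1, NOW(), NOW()),
    ('name3', 'key3', 1, NOW(), NOW());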
When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements.
Take advantage of the fact that columns have default values. Insert values explicitly only when the value to be inserted differs from the default. This reduces the parsing that MySQL must do and improves the insert speed.
Reference: MySQL.com: 8.2.4.1 Optimizing INSERT Statements

Your linear key on name and the large indexes slow things down.
LINEAR KEY needs to be calculated on every insert.
http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html
Can you show us some example data from file_to_process.csv? Maybe a better schema should be built.
Edit: I looked more closely.
INSERT INTO items (name, `key`, busy, created_at, updated_at)
SELECT name, `key`, busy, created_at, updated_at
FROM temp_items
This will probably create an on-disk temporary table, which is very slow, so you should avoid it if you want more performance. Alternatively, check some MySQL config settings such as tmp_table_size and max_heap_table_size; maybe these are misconfigured.
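For example, you could inspect (and, with sufficient privileges, raise) those settings like this; the 256 MB figure is only an illustrative assumption:
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
-- In-memory temporary tables are capped by the smaller of the two,
-- so raise them together:
SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;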

There is a piece of documentation I would like to point out: Speed of INSERT Statements.

Thinking in Java:
Divide the object list into partitions and generate a batch insert statement for each partition.
Utilize CPU cores and available DB connections efficiently; nice new Java features can help to achieve parallelism easily (e.g. parallel streams, fork/join), or you can create your own thread pool optimized for the number of CPU cores you have and feed your threads from a centralized blocking queue in order to invoke batch-insert prepared statements.
Decrease the number of indexes on the target table if possible. If a foreign key is not really needed, just drop it. Fewer indexes means faster inserts.
Avoid using Hibernate except for CRUD operations; always write SQL for complex selects.
Decrease the number of joins in your query; instead of forcing the DB, use Java streams for filtering, aggregating and transformation.
If you don't have to, do not combine selects and inserts in one SQL statement.
Add rewriteBatchedStatements=true to your JDBC connection string; it helps to decrease TCP-level communication between the app and the DB (see the example after this list).
Use @Transactional for the methods that carry out the insert batch, and write rollback methods yourself.
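For example, a connection string with that flag might look like this (host, port, and database name are placeholders):
jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true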

You could use
LOAD DATA LOCAL INFILE ''
REPLACE
INTO TABLE
etc...
The REPLACE ensures that any duplicate value is overwritten with the new values.
Add a SET updated_at = NOW() at the end and you're done.
There is no need for the temporary table.
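Put together, a sketch of the whole statement might look like this (assuming the semicolon separator described in the question; created_at is also set here because the column is NOT NULL, but note that REPLACE deletes and re-inserts duplicates, so unlike ON DUPLICATE KEY UPDATE it resets created_at as well):
LOAD DATA LOCAL INFILE 'file_to_process.csv'
REPLACE INTO TABLE items
FIELDS TERMINATED BY ';'
(name, `key`)
SET busy = 1, created_at = NOW(), updated_at = NOW();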


Using MySQL's INSERT IGNORE to prevent duplicate entries (performance issues?)

I have a table where measurements of a sensor are saved. A row contains the value of the measurement, the id (pk and auto-increment) and a random number = num (about 10 digits long, or even longer).
CREATE TABLE `table` (
    `id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `value` float NOT NULL,
    `num` int(10) UNSIGNED NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Now after some weeks/months the table could contain thousands and thousands of rows.
My Question:
My system requires that the random number is unique (two or more measurements/rows with the same random number are not acceptable).
Now I have done some research and there is the neat INSERT IGNORE statement.
But I'm not sure if it's smart to use it in my case: after some time there might be many, many rows, and checking all rows in the table for the random number, to see whether it matches the one to be newly inserted, might be overkill and drastically impact performance?
Any thoughts?
Use the INSERT IGNORE command rather than the INSERT command. If a record doesn't duplicate an existing record, then MySQL inserts it as usual. If the record is a duplicate, then the IGNORE keyword tells MySQL to discard it silently without generating an error.
Also add a UNIQUE constraint on the random-number column. To increase performance when you retrieve data from the table, create an INDEX for it.
https://www.w3schools.com/sql/sql_create_index.asp
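A minimal sketch of both pieces of advice, using the table from the question (the index name num_unique is arbitrary):
ALTER TABLE `table` ADD UNIQUE KEY `num_unique` (`num`);
INSERT IGNORE INTO `table` (`value`, `num`)
VALUES (23.5, 1234567890); -- a duplicate `num` is silently discarded
Note that with the UNIQUE index in place, the duplicate check is a B-tree lookup rather than a scan of all rows, so it stays fast as the table grows.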

Increase in deadlocks when adding primary key. Why?

First, a bit of necessary background (please bear with me). I work as a developer on a web application that uses MySQL for persistence. We have implemented audit logging by creating an audit-trail table for each data table. We might, for example, have the following table definitions for a Customer entity:
-- Data table definition.
CREATE TABLE my_database.customers (
CustomerId INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
FirstName VARCHAR(255) NOT NULL,
LastName VARCHAR(255) NOT NULL,
-- More data columns, removed for simplicity.
...
);
-- Audit table definition in separate schema.
CREATE TABLE my_database_audittrail.customers (
CustomerId INT(11) DEFAULT NULL,
FirstName VARCHAR(255) DEFAULT NULL,
LastName VARCHAR(255) DEFAULT NULL,
-- More data columns.
...
-- Audit meta data columns.
ChangeTime DATETIME NOT NULL,
ChangeByUser VARCHAR(255) NOT NULL
);
As you can see, the audit table is simply a copy of the data table plus some metadata. Note that the audit table doesn't have any keys. When, for example, we update a customer, our ORM generates SQL similar to the following:
-- Insert a copy of the customer entity, before the update, into the audit table.
INSERT INTO my_database_audittrail.customers (
    CustomerId,
    FirstName,
    LastName,
    ...
    ChangeTime,
    ChangeByUser
)
SELECT
CustomerId,
FirstName,
LastName,
...
NOW(),
#ChangeByUser
FROM my_database.customers
WHERE CustomerId = #CustomerId;
-- Then update the data table.
UPDATE
my_database.customers
SET
FirstName = #FirstName,
LastName = #LastName,
...
WHERE CustomerId = #CustomerId;
This has worked well enough. Recently, however, we needed to add a primary key column to the audit tables for various reasons, changing the audit table definition to something similar to the following:
CREATE TABLE my_database_audittrail.customers (
__auditId INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
CustomerId INT(11) DEFAULT NULL,
FirstName VARCHAR(255) DEFAULT NULL,
LastName VARCHAR(255) DEFAULT NULL,
...
ChangeTime DATETIME NOT NULL,
ChangeByUser VARCHAR(255) NOT NULL
);
The SQL generated by our ORM when updating data tables has not been modified. This change seems to have increased the risk of deadlock very much. The system in question is a web application with a number of nightly batch jobs. The increase in deadlocks doesn't show up in the day-to-day use of the system by our web users. The nightly batch jobs, however, suffer from the deadlocks very much, as they do intense work on a few database tables. Our "solution" has been to add a retry-upon-deadlock strategy (hardly controversial), and while this seems to work fine, I would very much like to understand why the above change has increased the risk of deadlocks that much (and whether we can somehow remedy the problem).
Further information:
Our nightly batch jobs do INSERTS, UPDATES and DELETES on our data tables. Only INSERTS are performed on the audit tables.
We use the REPEATABLE READ isolation level on our database transactions.
Before this change, we had not seen a single deadlock when running our nightly batch jobs.
UPDATE: Checked SHOW ENGINE INNODB STATUS to determine the cause of the deadlocks and found this:
*** WAITING FOR THIS LOCK TO BE GRANTED:
TABLE LOCK table `my_database_audittrail`.`customers` trx id 24972756464 lock mode AUTO-INC waiting
I was under the impression that auto-increment values were handled outside of any transaction in order to avoid using the same auto-increment value in different transactions? But I guess the AUTO_INCREMENT property on the primary key we introduced seems to be the problem?
This is speculation.
Inserting or updating into a table with indexes not only locks the data pages but also the index pages, including the higher levels of the index. When multiple threads are affecting records at the same time, they may lock different portions of the index.
This would not generally show up with single record inserts. However, two statements that are updating multiple records might start acquiring locks on the index and find that they are deadlocking each other. Retry may be sufficient for fixing this problem. Alternatively, it seems that "too much" may be running at one time and you may want to consider how the nightly update work is laid out.
When inserting into tables with auto-increment columns, MySQL uses different strategies to acquire values for the auto-increment column(s). Depending on which type of insert is made and on how MySQL is configured to handle auto-increment columns, an insert may result in a complete table lock.
With "simple inserts", i.e. inserts where MySQL can determine beforehand the number of rows that will be inserted into a table (e.g. INSERT INTO table (col1, col2) VALUES (val1, val2);), auto-increment column values are acquired using a lightweight lock on the auto-increment counter. This lightweight lock is released as soon as the auto-increment values are acquired, so one won't have to wait for the actual insert to complete. I.e. no table lock.
However, with "bulk inserts", where MySQL cannot determine the number of inserted rows beforehand (e.g. INSERT INTO table (col1, col2) SELECT col1, col2 FROM table2 WHERE ...;), a table-level lock is taken to acquire auto-increment column values and is not relinquished until the insert completes.
The above is per MySQL's default configuration. MySQL can be configured not to use table locks on bulk inserts, but this may cause auto-increment columns to have different values on masters and slaves (if replication is set up) and thus may or may not be a viable option; see the sketch below.
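The setting involved is innodb_autoinc_lock_mode; here is a sketch of how to inspect it, with the meaning of the values per the MySQL manual:
SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';
-- 0 = traditional, 1 = consecutive (the default in MySQL 5.x), 2 = interleaved.
-- The variable is read-only at runtime; to switch to interleaved mode
-- (no AUTO-INC table lock for bulk inserts, but only safe with
-- row-based replication), set it in my.cnf and restart:
-- [mysqld]
-- innodb_autoinc_lock_mode = 2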

Concatenating a string to an auto-incremented column which functions as primary key

Having some trouble putting together a table with a unique value. The current setup I have is for two tables, which for all intents and purposes can be the same as the one below. My problem is that I'm trying to use the auto-incremented value as the primary key due to redundancies in the data pulls, but since it's for two tables, I want to concatenate a string to the auto-incremented value, so my ID column would be:
Boop1, Boop2, Boop3 and Beep1, Beep2, Beep3, instead of 1, 2, 3 for both tables, so they are differentiated and thus do not have duplicate values when I put in constraints.
CREATE TABLE IF NOT EXISTS `beep`.`boop` (
`ID` INT NOT NULL AUTO_INCREMENT,
`a` VARCHAR(15) NOT NULL,
`b` VARCHAR(255) NOT NULL,
`c` VARCHAR(50) NOT NULL,
`d` VARCHAR(50) NOT NULL,
PRIMARY KEY(`ID`))
ENGINE = InnoDB;
LOAD DATA LOCAL INFILE 'blah.csv'
INTO TABLE `beep`.`boop`
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 LINES
SET DCMID = CONCAT('DCM', `DCMID`);
The SET clause in the last line is optional and was only there to try concatenating, which I already know does not work.
I realize that this cannot work since my datatype is an INT, so what would I have to do to keep my auto-increment while differentiating the two tables?
For reference, I am using LOAD DATA LOCAL INFILE, not INSERT (and I don't think bulk insert is available with MySQL Workbench); otherwise, I would bulk insert and just utilize LAST_INSERT_ID().
The goal is plug-and-play for a data pull I perform, so I can archive my data quickly and run queries to grab the data I need in the future. Using one insert line per row of data I have would be extremely inefficient.
I was using a DELIMITER block with a trigger earlier, which in theory would have worked by altering the table after the LOAD DATA INFILE, but that requires SUPER privileges, which I do not have.
Is what I'm asking for even possible, or should I give up and find a workaround, or try getting SUPER privileges and trying the DELIMITER trigger?
I'm not sure why you would do that, but you could use two different keys. The first one is the auto-increment, populated by MySQL. The second one is your "prefixed" key, created by a trigger fired after insert, where you update the column based on the first key and the prefix that you want.
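One caveat to this suggestion: MySQL does not allow a trigger to update the table it fired on, so a plain AFTER INSERT trigger on the same table will fail. A simpler alternative, sketched below under the assumption that the prefix is fixed per table, is to derive the prefixed key at read time (the view and column names here are mine, not from the question):
CREATE VIEW boop_prefixed AS
SELECT `ID`, CONCAT('Boop', `ID`) AS prefixed_id, `a`, `b`, `c`, `d`
FROM `beep`.`boop`;
-- Usage: the prefixed value is computed on the fly.
SELECT prefixed_id FROM boop_prefixed WHERE `ID` = 1;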

MySQL InnoDB deadlock problems on REPLACE INTO

I want to update the statistic count in MySQL.
The SQL is as follows:
REPLACE INTO `record_amount`(`source`,`owner`,`day_time`,`count`) VALUES (?,?,?,?)
Schema:
CREATE TABLE `record_amount` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'id',
`owner` varchar(50) NOT NULL ,
`source` varchar(50) NOT NULL ,
`day_time` varchar(10) NOT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `src_time` (`owner`,`source`,`day_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
However, it caused DEADLOCK exceptions when run from multiple processes (i.e. map-reduce).
I've read some materials online and am confused about these locks. I know InnoDB uses row-level locking. I could just use table locks to solve the business problem, but that is a little extreme. I found some possible solutions:
change REPLACE INTO to a transaction with SELECT id FOR UPDATE and UPDATE
change REPLACE INTO to INSERT ... ON DUPLICATE KEY UPDATE
I have no idea which is practical and better. Can someone explain it, or offer some links for me to read and study? Thank you!
Are you building a summary table, one source row at a time, effectively doing UPDATE ... count = count + 1? Throw away the code and start over. Map-reduce on that is like using a sledgehammer on a thumbtack.
INSERT INTO summary (source, owner, day_time, count)
SELECT source, owner, day_time, COUNT(*)
FROM raw
GROUP BY source, owner, day_time
ON DUPLICATE KEY UPDATE count = count + VALUES(count);
A single statement approximately like that will do all the work at virtually disk I/O speed. No SELECT ... FOR UPDATE. No deadlocks. No multiple threads. Etc.
Further improvements:
Get rid of the AUTO_INCREMENT; turn the UNIQUE into PRIMARY KEY.
day_time -- is that a DATETIME truncated to an hour? (Or something like that.) Use DATETIME; you will have much more flexibility in querying.
To discuss further, please elaborate on the source data (CREATE TABLE, number of rows, frequency of processing, etc.) and other details. If this is really a Data Warehouse application with a Summary table, I may have more suggestions.
If the data is coming from a file, do LOAD DATA to shovel it into a temp table raw so that the above INSERT .. SELECT can work. If it is of manageable size, make raw ENGINE=MEMORY to avoid any I/O for it; a sketch follows.
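A sketch of that staging step; the file name and column list are assumptions based on the schema above:
CREATE TEMPORARY TABLE raw (
    source VARCHAR(50) NOT NULL,
    owner VARCHAR(50) NOT NULL,
    day_time VARCHAR(10) NOT NULL
) ENGINE=MEMORY;
LOAD DATA LOCAL INFILE 'feed.csv'
INTO TABLE raw
FIELDS TERMINATED BY ','
(source, owner, day_time);
-- Now the INSERT .. SELECT above can aggregate from raw in one pass.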
If you have multiple feeds, my high-speed-ingestion blog discusses how to have multiple threads without any deadlocks.

How can I select a set of IDs from large table fast?

I have a large table with ID as the primary key, about 3 million rows, and I need to extract a small set of rows based on a given ID list.
Currently I am doing it with WHERE ... IN, but it's very slow, like 5 to 10 s.
My code:
select id,fa,fb,fc
from db1.t1
where id in(15,213,156,321566,13,165,416,132163,6514361,... );
I tried to query one ID at a time, but it is still slow, like:
select id,fa,fb,fc from db1.t1 where id =25;
I also tried to use a temp table, inserting the ID list into it and joining. But no improvement.
select t1.id, fa, fb, fc from db1.t1 inner join db1.temp on t1.id = temp.id
Is there any way to make it faster?
Here is the table:
CREATE TABLE `db1`.`t1` (
`id` int(9) NOT NULL,
`url` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
`title` varchar(1024) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastUpdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`lastModified` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
OK, here is the EXPLAIN SELECT output:
id: 1
select_type: SIMPLE
table: t1
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: ''
rows: 9
Extra: Using where
Here are some tips on how you can speed up the performance of your table:
Try to avoid complex SELECT queries on MyISAM tables that are updated frequently, to avoid problems with table locking that occur due to contention between readers and writers.
To sort an index and data according to an index, use myisamchk --sort-index --sort-records=1 (assuming that you want to sort on index 1). This is a good way to make queries faster if you have a unique index from which you want to read all rows in order according to the index. The first time you sort a large table this way, it may take a long time.
For MyISAM tables that change frequently, try to avoid all variable-length columns (VARCHAR, BLOB, and TEXT). The table uses dynamic row format if it includes even a single variable-length column.
Strings are automatically prefix- and end-space compressed in MyISAM indexes. See "CREATE INDEX Syntax".
You can increase performance by caching queries or answers in your application and then executing many inserts or updates together. Locking the table during this operation ensures that the index cache is only flushed once after all updates. You can also take advantage of MySQL's query cache to achieve similar results; see "The MySQL Query Cache".
You can read further in these articles on optimizing your queries:
MySQL Query Cache
Query Cache SELECT Options
Optimizing MySQL queries with IN operator
Optimizing MyISAM Queries
First of all, clustered indexes are faster than non-clustered indexes, if I am not wrong.
Also, even when you have an index on a table, it sometimes helps to rebuild the index or recreate its statistics.
I saw in an SQL explain plan that WHERE ID IN (...) gets converted to WHERE (ID = 1) OR (ID = 2) OR (ID = 3) ...; the bigger the list, the more ORs, so for very big tables avoid IN ().
Try EXPLAIN on this SQL; it can tell you where the actual bottleneck is.
Check this link: http://dev.mysql.com/doc/refman/5.5/en/explain.html
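For example, against the query from the question:
EXPLAIN SELECT id, fa, fb, fc
FROM db1.t1
WHERE id IN (15, 213, 156, 321566);
The output already posted (type: range on the PRIMARY key, rows: 9) indicates the IN() list is being resolved through the primary key index.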
Hope this will work.
Looks like the original SQL statement using IN should be fine, since the id column is indexed.
I think you basically need a faster computer - are you doing this query on shared hosting?