I am trying to implement a simple program in Java that will be used to populate a MySQL database from a CSV source file. For each row in the CSV file, I need to execute the following sequence of SQL statements (example in pseudocode):
execute("INSERT INTO table_1 VALUES(?, ?)");
String id = execute("SELECT LAST_INSERT_ID()");
execute("INSERT INTO table_2 VALUES(?, ?)");
String id2 = execute("SELECT LAST_INSERT_ID()");
execute("INSERT INTO table_3 values("some value", id1, id2)");
execute("INSERT INTO table_3 values("some value2", id1, id2)");
...
There are three basic problems:
1. The database is not on localhost, so every single INSERT/SELECT incurs network latency; this is the main problem.
2. The CSV file contains millions of rows (around 15,000,000), so it takes too long.
3. I cannot modify the database structure (add extra tables, disable keys, etc.).
I was wondering how I can speed up the INSERT/SELECT process. Currently 80% of the execution time is consumed by communication.
I already tried grouping the above statements and executing them as a batch, but because of LAST_INSERT_ID this does not work. In every other case it takes too long (see point 1).
The fastest way is to let MySQL parse the CSV and load the records into the table. For that, you can use LOAD DATA INFILE:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
It works even better if you can transfer the file to the server or keep it in a shared directory that the server can access.
Once that is done, you can have a column that indicates whether a record has been processed or not. Its value should be false by default.
Once the data is loaded, you can pick up all records where processed = false.
For all such records you can populate tables 2 and 3.
Since all these operations happen on the server, server-client latency does not come into the picture.
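A rough sketch of that approach; the staging table csv_staging, its columns, and the processed flag are assumptions, so adjust the names and column list to your schema:
-- csv_staging is assumed to look like (col_a ..., col_b ..., processed TINYINT DEFAULT 0)
LOAD DATA INFILE '/path/on/server/source.csv'   -- use LOAD DATA LOCAL INFILE if the file stays on the client
INTO TABLE csv_staging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(col_a, col_b);

-- then populate the other tables entirely on the server side
INSERT INTO table_1 (col_a, col_b)
SELECT col_a, col_b FROM csv_staging WHERE processed = 0;

UPDATE csv_staging SET processed = 1 WHERE processed = 0;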
Feed the data into a blackhole
CREATE TABLE `test`.`blackhole` (
`t1_f1` int(10) unsigned NOT NULL,
`t1_f2` int(10) unsigned NOT NULL,
`t2_f1` ... and so on for all the tables and all the fields.
) ENGINE=BLACKHOLE DEFAULT CHARSET=latin1;
Note that this is a blackhole table, so the data is going nowhere.
However you can create a trigger on the blackhole table, something like this.
And pass it on using a trigger
delimiter $$
create trigger ai_blackhole_each after insert on blackhole for each row
begin
  declare lastid_t1 integer;
  declare lastid_t2 integer;
  insert into table1 values(new.t1_f1, new.t1_f2);
  select last_insert_id() into lastid_t1;
  insert into table2 values(new.t2_f1, new.t2_f2, lastid_t1);
  -- ... and so on for the remaining tables and fields
end$$
delimiter ;
Now you can feed the blackhole table with a single insert statement at full speed and even insert multiple rows in one go.
insert into blackhole values(a,b,c,d,e,f,g,h),(....),(...)...
Disable index updates to speed things up
ALTER TABLE $tbl_name DISABLE KEYS;
....Lot of inserts
ALTER TABLE $tbl_name ENABLE KEYS;
This will disable all non-unique index updates and speed up the insert. (An auto-increment key is unique, so it is not affected.)
If you have any unique keys and you don't want MySQL to check them during the mass insert, drop the unique key with an ALTER TABLE and add it back afterwards, as sketched below.
Note that the ALTER TABLE to put the unique key back will take a long time.
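For example, a minimal sketch with a hypothetical index name and column list:
ALTER TABLE table_3 DROP INDEX uq_table_3;
-- ... bulk inserts ...
ALTER TABLE table_3 ADD UNIQUE INDEX uq_table_3 (col_a, col_b);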
This may seem like a dumb question. I want to set up an SQL database with records containing numbers. I would like to run a query to select a group of records, take the values in that group, do some basic arithmetic on the numbers, and then save the results to a different table while still linking them with a foreign key to the original records. Is that possible to do in SQL without taking the data to another application and then importing it back? If so, what is the basic function/procedure to accomplish this?
I'm coming from an Excel/macro/basic Python background and want to investigate whether it's worth the switch to SQL.
P.S. I want to stay open source.
A tiny example using PostgreSQL (9.6):
-- Create tables
CREATE TABLE initialValues(
id serial PRIMARY KEY,
value int
);
CREATE TABLE addOne(
id serial,
id_init_val int REFERENCES initialValues(id),
value int
);
-- Init values
INSERT INTO initialValues(value)
SELECT a.n
FROM generate_series(1, 100) as a(n);
-- Insert values into the second table by selecting the ones from the
-- first one.
WITH init_val as (SELECT i.id,i.value FROM initialValues i)
INSERT INTO addOne(id_init_val,value)
(SELECT id,value+1 FROM init_val);
In MySQL you can use CREATE TABLE ... SELECT (https://dev.mysql.com/doc/refman/8.0/en/create-table-select.html)
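For instance, a rough MySQL sketch mirroring the PostgreSQL example above (table and column names are carried over and purely illustrative):
CREATE TABLE initialValues (
    id INT AUTO_INCREMENT PRIMARY KEY,
    value INT
);
INSERT INTO initialValues (value) VALUES (1), (2), (3);

-- create and populate the derived table in one statement
CREATE TABLE addOne (
    id INT AUTO_INCREMENT PRIMARY KEY
)
SELECT i.id AS id_init_val, i.value + 1 AS value
FROM initialValues AS i;

-- keep the link back to the original rows
ALTER TABLE addOne
    ADD FOREIGN KEY (id_init_val) REFERENCES initialValues(id);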
I need to "update" some table data I receive from an external source (each time I receive "all" the data, with some fields updated for some records).
There's no unique field or combination of fields, so I figured the best approach would be to wipe all the data from the DB each time and write all the (now updated) data back in. There are up to 1000 records (there will never be more than that), with about 15 short fields each: text, numbers, datetime. And I'm writing to a remote DB (so it's slow).
Currently I'm doing:
delete from `table` where `date_dt` > ?
and then for each row
INSERT INTO `table` ( `field_0`,`field_1`,... ) VALUES (?,?,...)
It's not only slow, but it's possible that the end user may not see the complete data while I'm still inserting.
I figured I could do:
CREATE TEMPORARY TABLE `temp_table` ( ... ); -- same structure as in main table
INSERT INTO `temp_table` ( `field_0`,`field_1`,... ) VALUES (?,?,...) -- repeat 1000x
START TRANSACTION;
DELETE FROM `table`;
INSERT INTO `table` SELECT * FROM `temp_table`;
DROP TEMPORARY TABLE `temp_table`;
COMMIT;
Does this make any sense? What is a better way of solving this?
The speed of filling the temp table with data is not crucial, but filling the main table is (so that users don't see incomplete data, or the period of time when they do is minimal).
mysqlimport --delete will truncate the table first, and then load your external data from a CSV file. It runs many times faster than doing INSERT one row at a time.
See https://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html
I did a presentation in April 2017 about performance of bulk data loads for MySQL:
https://www.slideshare.net/billkarwin/load-data-fast
P.S.: Don't use the temp table solution if you have a MySQL replication environment. This is a well-known way of breaking replication. If the slave restarts in between your creation of the temp table and the INSERT...SELECT that reads from the temp table, then the slave will find the temp table is gone, and this will result in an error and stop replication. This might seem unlikely, but it does happen eventually.
I was just trying to add a column called "location" to a table (main_table) in a database. The command I ran was
ALTER TABLE main_table ADD COLUMN location varchar (256);
The main_table contains more than 2,000,000 rows. The statement has kept running for more than 2 hours and still hasn't completed.
I tried to use mytop to monitor the activity of this database and make sure the query wasn't blocked by some other process, but it doesn't seem to be. Is it supposed to take this long? Actually, I rebooted the machine just before running this command, and the command is still running. I am not sure what to do.
Your ALTER TABLE statement implies MySQL will have to rewrite every single row of the table, including the new column. Since you have more than 2 million rows, I would definitely expect it to take a significant amount of time, during which your server will likely be mostly IO-bound. You'd usually find it more performant to do the following:
CREATE TABLE main_table_new LIKE main_table;
ALTER TABLE main_table_new ADD COLUMN location VARCHAR(256);
INSERT INTO main_table_new SELECT *, NULL FROM main_table;
RENAME TABLE main_table TO main_table_old, main_table_new TO main_table;
DROP TABLE main_table_old;
This way you add the column on the empty table and basically write the data into a new table that you are sure no one else will be looking at, without locking as many resources.
I think the appropriate answer for this is to use a tool like pt-online-schema-change or gh-ost.
We have done migrations of over 4 billion rows with this; it can take up to 10 days, but with less than a minute of downtime.
Percona's tool works in a very similar fashion to the approach above (a rough sketch of the mechanism follows this list):
Create a temp table
Create triggers on the first table (for inserts, updates, and deletes) so that the changes are replicated to the temp table
Migrate the data in small batches
When done, rename the temp table to the original name and drop the old table
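A rough SQL sketch of that mechanism, assuming main_table(id INT PRIMARY KEY, name VARCHAR(100)) for illustration; the real tools generate the full column lists and all three triggers automatically:
CREATE TABLE main_table_shadow LIKE main_table;
ALTER TABLE main_table_shadow ADD COLUMN location VARCHAR(256);

-- keep the shadow copy in sync while rows are being copied
DELIMITER $$
CREATE TRIGGER main_table_ai AFTER INSERT ON main_table FOR EACH ROW
BEGIN
  REPLACE INTO main_table_shadow (id, name, location)
  VALUES (NEW.id, NEW.name, NULL);
END$$
DELIMITER ;
-- (similar triggers are needed for UPDATE and DELETE)

-- copy the existing rows in small primary-key ranges
INSERT IGNORE INTO main_table_shadow (id, name, location)
SELECT id, name, NULL FROM main_table WHERE id BETWEEN 1 AND 10000;
-- ... repeat for the remaining ranges ...

-- swap the tables atomically and clean up
RENAME TABLE main_table TO main_table_old, main_table_shadow TO main_table;
DROP TABLE main_table_old;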
You can speed up the process by temporarily turning off unique checks and foreign key checks. You can also change the algorithm that gets used.
If you want the new column to be at the end of the table, use algorithm=instant:
SET unique_checks = 0;
SET foreign_key_checks = 0;
ALTER TABLE main_table ADD location varchar(256), algorithm=instant;
SET unique_checks = 1;
SET foreign_key_checks = 1;
Otherwise, if you need the column to be in a specific location, use algorithm=inplace:
SET unique_checks = 0;
SET foreign_key_checks = 0;
ALTER TABLE main_table ADD location varchar(256) AFTER othercolumn, algorithm=inplace;
SET unique_checks = 1;
SET foreign_key_checks = 1;
For reference, it took my PC about 2 minutes to alter a table with 20 million rows using the inplace algorithm. If you're using a program like Workbench, then you may want to increase the default timeout period in your settings before starting the operation.
If you find that the operation is hanging indefinitely, then you may need to look through the list of processes and kill whatever process has a lock on the table. You can do that using these commands:
SHOW FULL PROCESSLIST;
KILL PROCESS_NUMBER_GOES_HERE;
ALTER TABLE takes a long time with big data like in your case, so avoid using it in such situations and use something like this instead:
create table new_table as
select main_table.*,
       cast(null as char(256)) as null_location,   -- any column you want that accepts null
       cast('' as char(256)) as not_null_location, -- any column that doesn't accept null
       cast(0 as signed) as not_null_int           -- int column that doesn't accept null
from main_table;
drop table main_table;
rename table new_table to main_table;
DB2 for z/OS does a virtual add of the column instantly and puts the table into advisory-reorg status. Anything that runs before the reorg gets the default value, or null if there is no default. When rows are updated, the updated rows are expanded, and inserts are written expanded. The next reorg expands every unexpanded row and assigns the default value to anything it expands.
Only a real database handles this well. DB2 z/OS.
I am collecting readings from several thousand sensors and storing them in a MySQL database. There are several hundred inserts per second. To improve the insert performance I am storing the values initially into a MEMORY buffer table. Once a minute I run a stored procedure which moves the inserted rows from the memory buffer to a permanent table.
Basically I would like to do the following in my stored procedure to move the rows from the temporary buffer:
INSERT INTO data SELECT * FROM data_buffer;
DELETE FROM data_buffer;
Unfortunately the above is not usable because the data collection processes insert additional rows into "data_buffer" between the INSERT and the DELETE. Thus those rows would get deleted without ever being inserted into the "data" table.
How can I make the operation atomic, or make the DELETE statement delete only the rows which were SELECTed and INSERTed in the preceding statement?
I would prefer doing this in a standard way which works on different database engines if possible.
I would prefer not adding any additional "id" columns because of performance overhead and storage requirements.
I wish there were a SELECT_AND_DELETE or MOVE statement in standard SQL, or something similar...
I believe this will work, but it will block until the insert is done:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
INSERT INTO data SELECT * FROM data_buffer FOR UPDATE;
DELETE FROM data_buffer;
COMMIT;
A possible way to avoid all those problems, and also stay fast, would be to use two data_buffer tables (let's call them data_buffer1 and data_buffer2): while the collection processes insert into data_buffer1, you do the insert and delete on data_buffer2; then you switch, so collected data goes into data_buffer2 while data is inserted and deleted from data_buffer1 into data.
How about having a row id: get the max value before the insert, do the insert, and then delete the records with id <= that max?
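A minimal sketch of that, assuming both tables share the same columns and data_buffer has an auto-increment id column (names are illustrative):
SELECT MAX(id) INTO @max_id FROM data_buffer;
INSERT INTO data SELECT * FROM data_buffer WHERE id <= @max_id;
DELETE FROM data_buffer WHERE id <= @max_id;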
This is a similar solution to #ammoQ's answer. The difference is that instead of having the INSERTing process figure out which table to write to, you can transparently swap the tables in the scheduled procedure.
Use RENAME in the scheduled procedure to swap tables:
CREATE TABLE IF NOT EXISTS data_buffer_new LIKE data_buffer;
RENAME TABLE data_buffer TO data_buffer_old, data_buffer_new TO data_buffer;
INSERT INTO data SELECT * FROM data_buffer_old;
DROP TABLE data_buffer_old;
This works because the RENAME statement swaps the tables atomically, so the INSERTing processes will not fail with "table not found". This is MySQL-specific, though.
I assume the tables are identical, with the same columns and primary key(s)? If that is the case, you could nest a select inside a where clause... something like this:
DELETE FROM data_buffer
WHERE primarykey IN (SELECT primarykey FROM data)
This is a MySQL specific solution. You can use locking to prevent the INSERTing processes from adding new rows while you are moving rows.
The procedure which moves the rows should be as follows:
LOCK TABLES data WRITE, data_buffer WRITE;
INSERT INTO data SELECT * FROM data_buffer;
DELETE FROM data_buffer;
UNLOCK TABLES;
The code which INSERTs new rows in the buffer should be changed as follows:
LOCK TABLES data_buffer WRITE;
INSERT INTO data_buffer VALUES (1, 2, 3);
UNLOCK TABLES;
The INSERT process will obviously block while the lock is in place.
We're using MySQL with InnoDB storage engine and transactions a lot, and we've run into a problem: we need a nice way to emulate Oracle's SEQUENCEs in MySQL. The requirements are:
- concurrency support
- transaction safety
- max performance (meaning minimizing locks and deadlocks)
We don't care if some of the values won't be used, i.e. gaps in the sequence are OK. There is an easy way to achieve that by creating a separate InnoDB table with a counter; however, this means it will take part in the transaction and will introduce locks and waiting. I am thinking of trying a MyISAM table with manual locks. Any other ideas or best practices?
If auto-increment isn't good enough for your needs, you can create an atomic sequence mechanism with n named sequences like this:
Create a table to store your sequences:
CREATE TABLE sequence (
seq_name varchar(20) unique not null,
seq_current int unsigned not null
);
Assuming you have a row for 'foo' in the table you can atomically get the next sequence id like this:
UPDATE sequence SET seq_current = (@next := seq_current + 1) WHERE seq_name = 'foo';
SELECT @next;
No locks required. Both statements need to be executed in the same session, so that the user variable @next is actually defined when the SELECT happens.
The right way to do this is given in the MySQL manual:
UPDATE child_codes SET counter_field = LAST_INSERT_ID(counter_field + 1);
SELECT LAST_INSERT_ID();
We are a high-transaction gaming company and need this sort of solution for our needs. One of the features of Oracle sequences was the increment value, which could also be set.
The solution uses INSERT ... ON DUPLICATE KEY UPDATE.
CREATE TABLE sequences (
id BIGINT DEFAULT 1,
name CHAR(20),
increment TINYINT NOT NULL DEFAULT 1,
UNIQUE KEY(name)
);
To get the next index:
Abstract the following with a stored procedure or a function sp_seq_next_val(VARCHAR):
INSERT INTO sequences (name) VALUES ("user_id") ON DUPLICATE KEY UPDATE id = id + increment;
SELECT id FROM sequences WHERE name = "user_id";
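A hedged sketch of such a routine, built only from the two statements above (the name sp_seq_next_val comes from the text; the procedure form, parameters, and everything else are illustrative, not a tested implementation):
DELIMITER $$
CREATE PROCEDURE sp_seq_next_val(IN p_name VARCHAR(20), OUT p_next BIGINT)
BEGIN
  -- bump the named counter, creating it on first use
  INSERT INTO sequences (name) VALUES (p_name)
    ON DUPLICATE KEY UPDATE id = id + increment;
  -- read back the current value
  SELECT id INTO p_next FROM sequences WHERE name = p_name;
END$$
DELIMITER ;

-- usage
CALL sp_seq_next_val('user_id', @next_id);
SELECT @next_id;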
Won't the MySQL Identity column on the table handle this?
CREATE TABLE table_name
(
id INTEGER AUTO_INCREMENT PRIMARY KEY
)
Or are you looking to use it for something other than just inserting into another table?
If you're also writing in a procedural language (rather than just SQL), the other option would be to create a table containing a single integer (or long integer) value and a stored procedure which locks it, selects from it, increments it, and unlocks it before returning the value.
(Note: always increment before you return the value; it maximises the chance of not getting duplicates if there are errors. Or wrap the whole thing in a transaction.)
You would then call this independently of your main insert/update (so it doesn't get caught up in any transactions automatically created by the calling mechanism) and then pass it as a parameter to wherever you want to use it.
Because it's independent of the rest of the work you're doing, it should be quick and should avoid locking issues. Even if you did see an error caused by locking (unlikely unless you're overloading the database), you could just call it a second or third time.
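A minimal sketch of that idea in MySQL (table, procedure, and column names are illustrative; it uses a short transaction with a row lock rather than explicit LOCK TABLES, since the latter isn't allowed inside stored programs):
CREATE TABLE seq_counter (
    current_value BIGINT NOT NULL
);
INSERT INTO seq_counter VALUES (0);

DELIMITER $$
CREATE PROCEDURE next_seq_value(OUT p_value BIGINT)
BEGIN
  START TRANSACTION;
  -- lock the counter row, increment first, then hand back the new value
  SELECT current_value INTO p_value FROM seq_counter FOR UPDATE;
  SET p_value = p_value + 1;
  UPDATE seq_counter SET current_value = p_value;
  COMMIT;
END$$
DELIMITER ;

-- usage: call it outside your main transaction, then pass the value on
CALL next_seq_value(@next_id);
SELECT @next_id;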