JDBC batch insert performance - mysql

I need to insert a couple hundred million records into the MySQL DB. I'm batch inserting them 1 million at a time. Please see my code below. It seems to be slow. Is there any way to optimize it?
try {
    // Disable auto-commit
    connection.setAutoCommit(false);
    // Create a prepared statement
    String sql = "INSERT INTO mytable (xxx) VALUES (?)";
    PreparedStatement pstmt = connection.prepareStatement(sql);
    Object[] vals = set.toArray();
    for (int i = 0; i < vals.length; i++) {
        pstmt.setString(1, vals[i].toString());
        pstmt.addBatch();
    }
    // Execute the batch and commit
    int[] updateCounts = pstmt.executeBatch();
    connection.commit();
    System.out.println("inserted " + updateCounts.length);

I had a similar performance issue with MySQL and solved it by setting the useServerPrepStmts and rewriteBatchedStatements properties in the connection URL.
Connection c = DriverManager.getConnection("jdbc:mysql://host:3306/db?useServerPrepStmts=false&rewriteBatchedStatements=true", "username", "password");

I'd like to expand on Bertil's answer, as I've been experimenting with the connection URL parameters.
rewriteBatchedStatements=true is the important parameter. useServerPrepStmts is already false by default, and even changing it to true doesn't make much difference in terms of batch insert performance.
Now is a good time to explain how rewriteBatchedStatements=true improves performance so dramatically: it rewrites batched prepared INSERT statements into multi-value inserts when executeBatch() is called (Source). That means that instead of sending the following n INSERT statements to the MySQL server each time executeBatch() is called:
INSERT INTO X VALUES (A1,B1,C1)
INSERT INTO X VALUES (A2,B2,C2)
...
INSERT INTO X VALUES (An,Bn,Cn)
It would send a single INSERT statement :
INSERT INTO X VALUES (A1,B1,C1),(A2,B2,C2),...,(An,Bn,Cn)
You can observe this by turning on the MySQL general query log (SET GLOBAL general_log = 1), which logs every statement sent to the MySQL server to a file.
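To see this in action from Java, here is a minimal sketch (my own illustration, not part of the answer above) that toggles the general log around a batch; it assumes connection is an open java.sql.Connection with privileges to change global variables:
try (Statement st = connection.createStatement()) {
    st.execute("SET GLOBAL general_log = 1");  // start logging every statement the server receives
    // ... run pstmt.addBatch()/pstmt.executeBatch() here, then inspect the log file ...
    st.execute("SET GLOBAL general_log = 0");  // turn logging back off when done
}
With rewriteBatchedStatements=true the log should show a single multi-value INSERT per executeBatch() call; with it off, you will see one INSERT per row.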

You can insert multiple rows with one INSERT statement; doing a few thousand at a time can greatly speed things up. That is, instead of doing e.g. 3 inserts of the form INSERT INTO tbl_name (a,b,c) VALUES(1,2,3);, you do INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(1,2,3),(1,2,3); (It might be that JDBC's .addBatch() does a similar optimization now - though the MySQL addBatch used to be entirely un-optimized and just issued individual queries anyway - I don't know if that's still the case with recent drivers.)
If you really need speed, load your data from a comma-separated file with LOAD DATA INFILE; we get around a 7-8x speedup doing that vs. doing tens of millions of individual inserts.
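For illustration, a rough sketch of issuing LOAD DATA LOCAL INFILE through JDBC; the file path, table name, and column list are placeholders, and with Connector/J the connection typically needs allowLoadLocalInfile=true for the LOCAL variant to be permitted:
// Sketch only: path, table and column names are placeholders
String load = "LOAD DATA LOCAL INFILE '/tmp/mytable.csv' "
            + "INTO TABLE mytable "
            + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
            + "(xxx)";
try (Statement st = connection.createStatement()) {
    st.execute(load);
}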

If:
It's a new table, or the amount to be inserted is greater than the data already inserted
There are indexes on the table
You do not need other access to the table during the insert
Then ALTER TABLE tbl_name DISABLE KEYS can greatly improve the speed of your inserts. When you're done, run ALTER TABLE tbl_name ENABLE KEYS to start building the indexes, which can take a while, but not nearly as long as doing it for every insert.
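A minimal sketch of doing this around the batch from JDBC (the table name is a placeholder; note that DISABLE KEYS only affects non-unique indexes, and on InnoDB tables it is essentially a no-op, so this mainly helps MyISAM):
try (Statement st = connection.createStatement()) {
    st.execute("ALTER TABLE mytable DISABLE KEYS"); // stop maintaining non-unique indexes
}
// ... run the batched INSERTs here ...
try (Statement st = connection.createStatement()) {
    st.execute("ALTER TABLE mytable ENABLE KEYS");  // rebuild the indexes in one pass
}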

You may try using a DDBulkLoad object (a bulk-load API provided with the DataDirect JDBC drivers).
// Get a DDBulkLoad object
DDBulkLoad bulkLoad = DDBulkLoadFactory.getInstance(connection);
bulkLoad.setTableName("mytable");
bulkLoad.load("data.csv");

Another option is to flush the batch every N rows rather than accumulating everything before a single executeBatch():
try {
    // Disable auto-commit
    connection.setAutoCommit(false);
    int maxInsertBatch = 10000;
    // Create a prepared statement
    String sql = "INSERT INTO mytable (xxx) VALUES (?)";
    PreparedStatement pstmt = connection.prepareStatement(sql);
    Object[] vals = set.toArray();
    int count = 0;
    for (int i = 0; i < vals.length; i++) {
        pstmt.setString(1, vals[i].toString());
        pstmt.addBatch();
        count++;
        if (count % maxInsertBatch == 0) {
            // Flush this chunk to the server
            pstmt.executeBatch();
        }
    }
    // Execute whatever is left in the batch and commit
    pstmt.executeBatch();
    connection.commit();
    System.out.println("inserted " + count);

Related

EF6 & MySql - Avoid selecting Id after inserting record

I have a program that inserts thousands of records in a MySql DB. The operation cannot be done in bulk for a variety of reasons. Overall, the operations are very slow.
After looking at the SQL that is being generated, I can see that EF is calling a select to get the Id of the recently inserted record.
SET SESSION sql_mode = 'ANSI'; INSERT INTO `table`(
'blah') VALUES(
86613784);
SELECT
`id`
FROM `table`
WHERE row_count() > 0 AND `id`= last_insert_id()
Since I don't need that Id, how can I tell EF to avoid the call and save me the time?
FYI - I am already using the following statements to speed things up as well.
Configuration.ValidateOnSaveEnabled = false;
Configuration.AutoDetectChangesEnabled = false;
As requested, here is the code used to create the record. Not much to it, but if it helps...
using (var ctx = new Tc_TrademarkEntities(_entityFrameworkConnectionString))
{
ctx.case_file.Add(request.Trademark);
ctx.SaveChanges();
}

Performance of batched statements using INSERT ... SET vs INSERT ... VALUES

I recently wrote a simple Java program that processed some data and inserted it in a MyISAM table. About 35000 rows had to be inserted. I wrote the INSERT statement using INSERT ... SET syntax and executed it for all rows with PreparedStatement.executeBatch(). So:
String sql = "INSERT INTO my_table"
+ " SET "
+ " my_column_1 = ? "
+ " my_column_2 = ? "
...
+ " my_column_n = ? ";
try(PreparedStatement pst = con.prepareStatement(sql)){
for(Object o : someCollection){
pst.setInt(1, ...);
pst.setInt(2, ...);
...
pst.setInt(n, ...);
pst.addBatch();
}
pst.executeBatch();
}
I tried inserting all rows in a single batch and in batches of 1000, but in all cases the execution was VERY slow (about 1 minute per 1000 rows). After some tinkering I found that changing the syntax to INSERT ... VALUES improved the speed dramatically, by at least 100x (I didn't measure it accurately).
String sql = "INSERT INTO my_table (my_column_1, my_column_2, ... , my_column_n)"
+ " VALUES (?, ?, ... , ?)";
What's going on here? Can it be that the JDBC driver cannot rewrite the batches when using INSERT ... SET? I didn't find any documentation about this. I am creating my connections with options rewriteBatchedStatements=true&useServerPrepStmts=false.
I first noticed this problem when accessing a database in another host. That is, I have used the INSERT ... SET approach before without any noticeable performance issue in applications that were executing in the same host as the database. So I guess the problem may be that many more statements are sent over the network with INSERT ... SET than with INSERT ... VALUES.
If you examine the INSERT ... SET syntax, you'll see it's only meant for inserting a single row. INSERT ... VALUES is meant for inserting multiple rows at one time.
In other words - even though you set rewriteBatchedStatements=true, the JDBC driver can't optimize the SET variation the way it can the VALUES variation, because the SET syntax isn't built for the multi-row case you have. Use VALUES so the driver can compress N inserts into one statement.
Bonus tip - if you use ON DUPLICATE KEY UPDATE, the JDBC driver currently can't rewrite those statements either. (Edit: this statement is false - my mistake.)
There's a connection option you can set to verify all of this for yourself (I think it's profileSQL).
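For reference, a sketch of a connection URL with that option enabled (host, database and credentials are placeholders); with profileSQL=true Connector/J logs the statements it actually sends, so you can verify whether the batch was rewritten into a single multi-value INSERT:
// Placeholders for host/db/credentials; profileSQL=true logs the SQL the driver sends
Connection c = DriverManager.getConnection(
        "jdbc:mysql://host:3306/db"
            + "?rewriteBatchedStatements=true"
            + "&useServerPrepStmts=false"
            + "&profileSQL=true",
        "username", "password");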

How to run 'SELECT FOR UPDATE' in Laravel 3 / MySQL

I am trying to execute SELECT ... FOR UPDATE query using Laravel 3:
SELECT * from projects where id = 1 FOR UPDATE;
UPDATE projects SET money = money + 10 where id = 1;
I have tried several things for several hours now:
DB::connection()->pdo->exec($query);
and
DB::query($query)
I have also tried adding START TRANSACTION; ... COMMIT; to the query
and I tried to separate the SELECT from the UPDATE in two different parts like this:
DB::query($select);
DB::query($update);
Sometimes I get 0 rows affected, sometimes I get an error like this one:
SQLSTATE[HY000]: General error: 2014 Cannot execute queries while other unbuffered queries are active. Consider using PDOStatement::fetchAll(). Alternatively, if your code is only ever going to run against mysql, you may enable query buffering by setting the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute.
SQL: UPDATE `sessions` SET `last_activity` = ?, `data` = ? WHERE `id` = ?
I want to lock the row in order to update sensitive data, using Laravel's database connection.
Thanks.
In case all you need to do is increase money by 10, you don't need to lock the row before update. Simply executing the update query will do the job. The SELECT query will only slow down your script and doesn't help in this case.
UPDATE projects SET money = money + 10 where id = 1;
I would use different queries for sure, so you have control over what you are doing.
I would use a transaction.
If we read this simple explanation, PDO transactions are quite straightforward. It gives us this simple but complete example, which illustrates how everything works as we should expect (consider $db to be your DB::connection()->pdo).
try {
$db->beginTransaction();
$db->exec("SOME QUERY");
$stmt = $db->prepare("SOME OTHER QUERY?");
$stmt->execute(array($value));
$stmt = $db->prepare("YET ANOTHER QUERY??");
$stmt->execute(array($value2, $value3));
$db->commit();
}
catch(PDOException $ex) {
//Something went wrong rollback!
$db->rollBack();
echo $ex->getMessage();
}
Let's go to your real statements. For the first of them, the SELECT ..., I wouldn't use exec but query, since as stated here
PDO::exec() does not return results from a SELECT statement. For a
SELECT statement that you only need to issue once during your program,
consider issuing PDO::query(). For a statement that you need to issue
multiple times, prepare a PDOStatement object with PDO::prepare() and
issue the statement with PDOStatement::execute().
And assign its result to some temp variable like
$result= $db->query ($select);
After this execution, I would call $result->fetchAll() or $result->closeCursor(), since as we can read here
If you do not fetch all of the data in a result set before issuing
your next call to PDO::query(), your call may fail. Call
PDOStatement::closeCursor() to release the database resources
associated with the PDOStatement object before issuing your next call
to PDO::query().
Then you can exec the update
$result= $db->exec($update);
And after all that, just in case, I would again call $result->fetchAll() or $result->closeCursor().
If the aim is
to lock the row in order to update sensitive data, using Laravel's database connection.
Maybe you can use PDO transactions:
DB::connection()->pdo->beginTransaction();
DB::connection()->pdo->commit();
DB::connection()->pdo->rollBack();

Is this database dump design ok?

I have written a Java program to do the following and would like opinions on my design:
Read data from a CSV file. The file is a database dump with 6 columns.
Write data into a MySQL database table.
The database table is as follows:
CREATE TABLE MYTABLE
(
ID int PRIMARY KEY not null auto_increment,
ARTICLEID int,
ATTRIBUTE varchar(20),
VALUE text,
LANGUAGE smallint,
TYPE smallint
);
1. I created an object to store each row.
2. I used OpenCSV to read each row into a list of the objects created in step 1.
3. I iterate this list of objects and, using PreparedStatements, write each row to the database.
The solution should be highly amenable to the changes in requirements and demonstrate good approach, robustness and code quality.
Does that design look ok?
Another method I tried was to use the 'LOAD DATA LOCAL INFILE' SQL statement. Would that be a better choice?
EDIT: I'm now using OpenCSV and it's handling the issue of having commas inside actual fields. The issue now is nothing is writing to the DB. Can anyone tell me why?
public static void exportDataToDb(List<Object> data) {
    Connection conn = connect("jdbc:mysql://localhost:3306/datadb","myuser","password");
    try{
        PreparedStatement preparedStatement = null;
        String query = "INSERT into mytable (ID, X, Y, Z) VALUES(?,?,?,?);";
        preparedStatement = conn.prepareStatement(query);
        for(Object o : data){
            preparedStatement.setString(1, o.getId());
            preparedStatement.setString(2, o.getX());
            preparedStatement.setString(3, o.getY());
            preparedStatement.setString(4, o.getZ());
        }
        preparedStatement.executeBatch();
    }catch (SQLException s){
        System.out.println("SQL statement is not executed!");
    }
}
From a purely algorithmic perspective, and unless your source CSV file is small, it would be better to:
1. prepare your insert statement
2. start a transaction
3. load one (or a few) line(s) from it
4. insert the small batch into your database
5. return to 3. while there are some lines remaining
6. commit
This way, you avoid loading the entire dump in memory.
But basically, you'd probably be better off using LOAD DATA.
If the number of rows is huge, the code will fail at Step 2 with an out-of-memory error. You need to figure out a way to read rows in chunks and perform a batch insert with the prepared statement for each chunk, continuing until all the rows are processed. This will work for any number of rows, and the batching will also improve performance. Other than this I don't see any issue with the design.
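To make the chunked approach described in both answers concrete, here is a rough sketch (my own illustration) that streams the CSV line by line and flushes a batch every few thousand rows instead of loading the whole file into memory; the file name, column mapping and batch size are assumptions, and the split(",") parsing is deliberately naive compared to OpenCSV:
int batchSize = 5000; // illustrative chunk size
String sql = "INSERT INTO MYTABLE (ARTICLEID, ATTRIBUTE, VALUE, LANGUAGE, TYPE) VALUES (?,?,?,?,?)";
try (BufferedReader in = new BufferedReader(new FileReader("data.csv"));
     PreparedStatement ps = conn.prepareStatement(sql)) {
    conn.setAutoCommit(false);
    String line;
    int count = 0;
    while ((line = in.readLine()) != null) {
        String[] f = line.split(",");          // naive; OpenCSV handles quoted commas properly
        ps.setInt(1, Integer.parseInt(f[0]));
        ps.setString(2, f[1]);
        ps.setString(3, f[2]);
        ps.setInt(4, Integer.parseInt(f[3]));
        ps.setInt(5, Integer.parseInt(f[4]));
        ps.addBatch();
        if (++count % batchSize == 0) {
            ps.executeBatch();                 // flush this chunk
            conn.commit();
        }
    }
    ps.executeBatch();                         // flush the remainder
    conn.commit();
}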

MySQL - Fastest way to check if data in InnoDB table has changed

My application is very database intensive. Currently, I'm running MySQL 5.5.19 and using MyISAM, but I'm in the process of migrating to InnoDB. The only problem left is checksum performance.
My application does about 500-1000 "CHECKSUM TABLE" statements per second in peak times, because the client's GUI is polling the database constantly for changes (it is a monitoring system, so it must be very responsive and fast).
With MyISAM, there are live checksums that are precalculated on table modification and are VERY fast. However, there is no such thing in InnoDB, so CHECKSUM TABLE is very slow...
I hoped to be able to check the last update time of the table, but unfortunately that is not available in InnoDB either. I'm stuck now, because tests have shown that the performance of the application drops drastically...
There are simply too many lines of code that update the tables, so implementing logic in the application to log table changes is out of the question...
The database ecosystem consists of one master and 3 slaves, so local file checks are not an option.
I thought of a method to mimic a checksum cache - a lookup table with two columns - table_name, checksum - updated by triggers when changes in a table occur. But I have around 100 tables to monitor and that means 3 triggers per table = 300 triggers. Hard to maintain, and I'm not sure this won't be a performance hog again.
So is there any FAST method to detect changes in InnoDB tables?
Thanks!
The simplest way is to add a nullable column of type TIMESTAMP with the ON UPDATE CURRENT_TIMESTAMP attribute.
Therefore, the inserts will not change because the column accepts NULLs, and you can select only new and changed rows by saying:
SELECT * FROM `table` WHERE `mdate` > '2011-12-21 12:31:22'
Every time you update a row this column will change automatically.
Here is some more information: http://dev.mysql.com/doc/refman/5.0/en/timestamp.html
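As an illustration in Java/JDBC (the column name mdate is an assumption), the column can be added once and then polled for changes:
// Run once: add the auto-updating column (name `mdate` is an assumption)
try (Statement st = connection.createStatement()) {
    st.execute("ALTER TABLE `table` ADD COLUMN mdate TIMESTAMP NULL ON UPDATE CURRENT_TIMESTAMP");
}
// Poll for rows changed since the last check (`lastCheck` is a java.sql.Timestamp you keep around)
try (PreparedStatement ps = connection.prepareStatement("SELECT * FROM `table` WHERE mdate > ?")) {
    ps.setTimestamp(1, lastCheck);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // ... handle changed row ...
        }
    }
}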
To see deleted rows, simply create a trigger which logs every deletion to another table:
DELIMITER $$
CREATE TRIGGER MyTable_Trigger
AFTER DELETE ON MyTable
FOR EACH ROW
BEGIN
INSERT INTO MyTable_Deleted VALUES(OLD.id, NOW());
END$$
I think I've found the solution. For some time I was looking at Percona Server as a replacement for my MySQL servers, and now I think there is a good reason to do so.
Percona server introduces many new INFORMATION_SCHEMA tables like INNODB_TABLE_STATS, which isn't available in standard MySQL server.
When you do:
SELECT rows, modified FROM information_schema.innodb_table_stats WHERE table_schema='db' AND table_name='table'
You get the actual row count and a counter. The official documentation says the following about this field:
If the value of modified column exceeds “rows / 16” or 2000000000, the
statistics recalculation is done when innodb_stats_auto_update == 1.
We can estimate the oldness of the statistics by this value.
So this counter wraps around every once in a while, but you can make a checksum from the row count and the counter together; then every modification of the table produces a new checksum. E.g.:
SELECT MD5(CONCAT(rows,'_',modified)) AS checksum FROM information_schema.innodb_table_stats WHERE table_schema='db' AND table_name='table';
I was going to upgrade my servers to Percona Server anyway, so this wrapping is not an issue for me. Managing hundreds of triggers and adding fields to tables is a major pain for this application, because it's very late in development.
This is the PHP function I've come up with to make sure that tables can be checksummed whatever engine and server is used:
function checksum_table($input_tables){
    if(!$input_tables) return false; // Sanity check
    $tables = (is_array($input_tables)) ? $input_tables : array($input_tables); // Make $tables always an array
    $where = "";
    $checksum = "";
    $found_tables = array();
    $tables_indexed = array();
    foreach($tables as $table_name){
        $tables_indexed[$table_name] = true; // Indexed array for faster searching
        if(strstr($table_name,".")){ // If we are passing db.table_name
            $table_name_split = explode(".",$table_name);
            $where .= "(table_schema='".$table_name_split[0]."' AND table_name='".$table_name_split[1]."') OR ";
        }else{
            $where .= "(table_schema=DATABASE() AND table_name='".$table_name."') OR ";
        }
    }
    if($where != ""){ // Sanity check
        $where = substr($where,0,-4); // Remove the last "OR "
        $get_chksum = mysql_query("SELECT table_schema, table_name, rows, modified FROM information_schema.innodb_table_stats WHERE ".$where);
        while($row = mysql_fetch_assoc($get_chksum)){
            if($tables_indexed[$row['table_name']]){ // Not entirely foolproof, but saves some queries like "SELECT DATABASE()" to find out the current database
                $found_tables[$row['table_name']] = true;
            }elseif($tables_indexed[$row['table_schema'].".".$row['table_name']]){
                $found_tables[$row['table_schema'].".".$row['table_name']] = true;
            }
            $checksum .= "_".$row['rows']."_".$row['modified']."_";
        }
    }
    foreach($tables as $table_name){
        if(!$found_tables[$table_name]){ // Table was not found in information_schema.innodb_table_stats (probably not an InnoDB table, or not using Percona Server)
            $get_chksum = mysql_query("CHECKSUM TABLE ".$table_name); // Checksum the old-fashioned way
            $chksum = mysql_fetch_assoc($get_chksum);
            $checksum .= "_".$chksum['Checksum']."_";
        }
    }
    $checksum = sprintf("%s",crc32($checksum)); // Using crc32 because it's faster than md5(). Must be returned as a string to prevent PHP's signed integer problems.
    return $checksum;
}
You can use it like this:
// checksum a single table in the current db
$checksum = checksum_table("test_table");
// checksum a single table in a db other than the current one
$checksum = checksum_table("other_db.test_table");
// checksum multiple tables at once. It's faster when using Percona Server, because all tables are checksummed via one select.
$checksum = checksum_table(array("test_table", "other_db.test_table"));
I hope this saves some trouble to other people having the same problem.