Distributed database insertion speed is very slow - MySQL

@Test
public void transaction() throws Exception {
    Connection conn = null;
    PreparedStatement ps = null;
    try {
        String sql = "insert into `1` values(?, ?, ?, ?)";
        conn = JDBCUtils.getConnection();
        ps = conn.prepareStatement(sql);
        conn.setAutoCommit(false);
        for (int i = 1; i <= 10000; i++) {
            ps.setObject(1, i);
            ps.setObject(2, 10.12345678);
            ps.setObject(3, "num_" + i);
            ps.setObject(4, "2021-12-24 19:00:00");
            ps.addBatch();
        }
        ps.executeBatch();
        ps.clearBatch();
        conn.commit();
    } catch (Exception e) {
        if (conn != null) {
            conn.rollback();
        }
        e.printStackTrace();
    } finally {
        JDBCUtils.closeResources(conn, ps);
    }
}
With setAutoCommit(true), both local MySQL and distributed MySQL insert speeds are very slow.
When I commit the transaction manually, as in the code above, the local MySQL speed improves a lot, but the insert speed of the distributed MySQL is still very slow.
Are there any additional parameters I need to set?

Setting parameters probably won't help (much).
There are a couple of reasons for the slowness:
With autocommit=true you are committing on every insert statement. That means each new row must be written to disk before the database server returns the response to the client.
With autocommit=false there is still a client -> server -> client round trip for each insert statement. Those round trips add up to a significant amount of time.
One way to make this faster is to insert multiple rows with each insert statement, but that is messy because you would need to generate complex (multi-row) insert statements.
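For illustration, such a multi-row insert might look like this (table and column names are made up for the example, not taken from the question):

```sql
-- Hypothetical example: one statement inserting three rows at once.
INSERT INTO employees (first, last) VALUES
    ('John', 'Doe'),
    ('Dave', 'Smith'),
    ('Ann', 'Jones');
```

The messiness comes from having to build the VALUES list (and bind the right number of parameters) dynamically for each batch.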
A better way is to use JDBC's batch feature to reduce the number of round-trips. For example:
PreparedStatement ps = c.prepareStatement("INSERT INTO employees VALUES (?, ?)");
ps.setString(1, "John");
ps.setString(2, "Doe");
ps.addBatch();
ps.clearParameters();
ps.setString(1, "Dave");
ps.setString(2, "Smith");
ps.addBatch();
ps.clearParameters();
int[] results = ps.executeBatch();
(Attribution: above code copied from this answer by @Tusc)
If that still isn't fast enough, you should get even better performance using MySQL's native bulk insert mechanism; e.g. load data infile; see High-speed inserts with MySQL
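A minimal sketch of that approach, assuming the rows have been written out to a CSV file first (the file path and column names are illustrative):

```sql
-- Load a CSV file in one pass on the server side. Requires the FILE
-- privilege, and secure_file_priv must permit reading the directory.
LOAD DATA INFILE '/tmp/rows.csv'
INTO TABLE `1`
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(col1, col2, col3, col4);
```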
For completeness, I am adding this suggestion from @Wilson Hauck
"In your configuration [mysqld] section, innodb_change_buffer_max_size=50 # from 25 (percent) for improved INSERT rate per second. SHOW FULL PROCESSLIST; to monitor when the instance has completed adjustment, then do your inserts and put it back to 25 percent for typical processing speed."
This may increase the insert rate depending on your table and its indexes, and on the order in which you are inserting the rows.
But the flip-side is that you may be able to achieve the same speedup (or more!) by other means; e.g.
by sorting your input so that rows are inserted in index order, or
by dropping the indexes, inserting the records and then recreating the indexes.
You can read about the change buffer here and make your own judgements.
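The drop-and-recreate approach mentioned above can be sketched as follows (the table, index, and column names are illustrative):

```sql
-- Illustrative only: drop a secondary index, bulk-load, then rebuild it
-- with a single full-table pass at the end.
ALTER TABLE mytable DROP INDEX idx_mycol;
-- ... perform the bulk inserts here ...
ALTER TABLE mytable ADD INDEX idx_mycol (mycol);
```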

Related

How to bulk update in Hibernate

I need to update multiple rows in my MySQL database using Hibernate. I have done this using JDBC, where we have support for batched queries. I want something like this in Hibernate.
Does Hibernate support batched queries?
Batched query example in JDBC:
// Create statement object
Statement stmt = conn.createStatement();
String SQL = "INSERT INTO Employees (id, first, last, age) " +
             "VALUES (200, 'Zia', 'Ali', 30)";
// Add above SQL statement to the batch.
stmt.addBatch(SQL);
// Create one more SQL statement (reassigning the variable; a second
// `String SQL` declaration would not compile)
SQL = "INSERT INTO Employees (id, first, last, age) " +
      "VALUES (201, 'Raj', 'Kumar', 35)";
// Add above SQL statement to the batch.
stmt.addBatch(SQL);
int[] count = stmt.executeBatch();
Now when we call stmt.executeBatch(), both SQL queries will be executed in a single JDBC round trip.
You may check the Hibernate documentation. Hibernate has some configuration properties that control (or disable) the use of JDBC batching.
If you issue the same INSERT multiple times and your entity does not use an identity generator, Hibernate will use JDBC batching transparently.
The configuration must enable the use of JDBC batching. Batching is disabled by default.
Configuring Hibernate
The hibernate.jdbc.batch_size property defines the number of statements that Hibernate will batch before asking the driver to execute the batch. Zero or a negative number will disable the batching.
You can define a global configuration, e.g. in the persistence.xml, or define a session-specific configuration. To configure the session, you can use code like the following
entityManager
    .unwrap( Session.class )
    .setJdbcBatchSize( 10 );
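For the global variant, the property can be set in persistence.xml (the persistence-unit name here is illustrative):

```xml
<persistence-unit name="my-unit">
    <properties>
        <!-- Batch up to 10 statements per JDBC round trip -->
        <property name="hibernate.jdbc.batch_size" value="10"/>
    </properties>
</persistence-unit>
```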
Using JDBC batching
As mentioned before, Hibernate uses JDBC batching transparently. If you want to control the batching, you can use the flush() and clear() methods on the session.
The following is an example from the documentation. It calls flush() and clear() when the number of insertions reaches a batchSize value. It works efficiently if batchSize is less than or equal to the configured hibernate.jdbc.batch_size.
EntityManager entityManager = null;
EntityTransaction txn = null;
try {
    entityManager = entityManagerFactory().createEntityManager();
    txn = entityManager.getTransaction();
    txn.begin();
    // define a batch size less than or equal to the JDBC batch size
    int batchSize = 25;
    for ( int i = 0; i < entityCount; ++i ) {
        Person person = new Person( String.format( "Person %d", i ) );
        entityManager.persist( person );
        if ( i > 0 && i % batchSize == 0 ) {
            // flush a batch of inserts and release memory
            entityManager.flush();
            entityManager.clear();
        }
    }
    txn.commit();
} catch (RuntimeException e) {
    if ( txn != null && txn.isActive() ) {
        txn.rollback();
    }
    throw e;
} finally {
    if ( entityManager != null ) {
        entityManager.close();
    }
}

SQL Deadlock with Python Data Insert

I'm currently trying to build a database interface with Python to store stock data. This data is in the form of a list of tuples, with each element consisting of "date, open, high, low, close, volume". date represents a UNIX timestamp and has to be unique in combination with the ticker symbol in the database. Below is an example of a typically processed output (company_stock):
[(1489780560, 'NYSE:F', 12.5, 12.505, 12.49, 12.495, 567726),
(1489780620, 'NYSE:F', 12.495, 12.5, 12.48, 12.48, 832487),
(1489780680, 'NYSE:F', 12.485, 12.49, 12.47, 12.475, 649818),
(1489780740, 'NYSE:F', 12.475, 12.48, 12.47, 12.47, 700579),
(1489780800, 'NYSE:F', 12.47, 12.48, 12.47, 12.48, 567798)]
I'm using the pymysql package to insert this list into a local MySQL database (version 5.5). While the code runs through and the values get inserted, the database will crash - or rather stop - after reaching about ~250k rows. Below is the export part of the stock data processing function, which gets called about once every 20 seconds and inserts about 400 values.
# SQL Export
def tosql(company_stock, ticker, interval, amount_period, period):
    try:
        conn = pymysql.connect(host="localhost", user="root",
                               passwd="pw", db="db", charset="utf8",
                               autocommit=True,
                               cursorclass=pymysql.cursors.DictCursor)
        cur = conn.cursor()
        # To temp table
        query = "INSERT INTO stockdata_import "
        query += "(date, tickersymbol, open, high, low, close, volume) "
        query += "VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cur.executemany(query, company_stock)
        # Duplicate check with temp table and existing database storage
        query = "INSERT INTO stockdata (date, tickersymbol, open, high, low, close, volume) "
        query += "SELECT i.date, i.tickersymbol, i.open, i.high, i.low, "
        query += "i.close, i.volume FROM stockdata_import i "
        query += "WHERE NOT EXISTS(SELECT dv.date, dv.tickersymbol FROM "
        query += "stockdata dv WHERE dv.date = i.date "
        query += "AND dv.tickersymbol = i.tickersymbol)"
        cur.execute(query)
        print(": ".join([datetime.now().strftime("%d.%m.%Y %H:%M:%S"),
                         "Data stored in Vault. Ticker", str(ticker),
                         "Interval", str(interval),
                         "Last", str(amount_period), str(period)]))
    finally:
        # Clear temp import table and close connection
        query = "DELETE FROM stockdata_import"
        cur.execute(query)
        cur.close()
        conn.close()
I suspect that the check for already-existing values takes too long as the database grows, and that it eventually breaks down due to the lock on the tables (?) while checking for uniqueness of the date/ticker combination. Since I expect this database to grow rather fast (about 1 million rows per week), it seems that a different solution is required to ensure that there is only one date/ticker pair. This is the SQL CREATE statement for the import table (the real table with which it gets compared looks the same):
CREATE TABLE stockdata_import (
    id_stock_imp BIGINT(12) NOT NULL AUTO_INCREMENT,
    date INT(10),
    tickersymbol VARCHAR(16),
    open FLOAT(12,4),
    high FLOAT(12,4),
    low FLOAT(12,4),
    close FLOAT(12,4),
    volume INT(12),
    crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(id_stock_imp));
I have already looked into setting a constraint for the date/tickersymbol pair and handling the resulting exceptions in Python, but my research so far suggests that this would be even slower, plus I am not even sure whether this will work with the bulk insert of the pymysql cursor function executemany(query, data).
Context information:
The SQL export shown above is the final part of a python script handling the stock data response. This script, in turn, gets called by another script which is timed by a crontab to run at a specific time each day.
Once the crontab starts the control script, it will call the sub-script about 500 times, with a sleep time of about 20-25 seconds between each run.
The error which I see in the logs is: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
Questions:
How can I optimize the query or alter the storage table to ensure uniqueness for a given date/ticker combination?
Is this even the problem or do I fail to see some other problem here?
Any further advice is also welcome.
If you would like to ensure uniqueness of your data, then just add a unique index on the relevant date and ticker fields. A unique index prevents duplicate values from being inserted, so there is no need to check for the existence of data before the insertion.
Since you do not want to insert duplicate data, just use insert ignore instead of plain insert to suppress duplicate-key errors. Based on the number of affected rows, you can still detect and log duplicate insertions.
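Sketched in SQL against the tables from the question (the index name is illustrative):

```sql
-- One-time schema change: enforce uniqueness of the date/ticker pair.
ALTER TABLE stockdata
    ADD UNIQUE INDEX uq_date_ticker (date, tickersymbol);

-- Then insert directly; rows that would violate the unique index
-- are silently skipped instead of raising an error.
INSERT IGNORE INTO stockdata
    (date, tickersymbol, open, high, low, close, volume)
VALUES (1489780560, 'NYSE:F', 12.5, 12.505, 12.49, 12.495, 567726);
```

This removes the need for the staging table and the NOT EXISTS subquery entirely, and INSERT IGNORE works with executemany() the same way a plain INSERT does.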

Safety of not catching SQL Exception

Let's say I have a program that puts email addresses into a database where the email attribute is a primary key.
If I have a duplicate email address, I could deal with it in two ways.
1) run a "select email from table" query. If the email is currently in there, don't add it.
2) don't check if email is in the table. catch(SQLException e), but don't print the stack trace, simply skip over it. This way, if I'm inserting a duplicate it effectively ignores it.
Granted with method 1, I'm only executing a simple select query (no joins or anything fancy) so performance isn't really a huge issue. But if I wanted to optimize performance, would method 2 be a viable, safe way of doing this?
So instead of running a "select ..." every time, I just add it.
Are there any safety issues with skipping over the exception?
Java Example (with JDBC):
try {
    String sql = "insert into emails values(?)";
    PreparedStatement pstmt = conn.prepareStatement(sql);
    pstmt.setString(1, email);
    pstmt.execute();
    return true;
}
catch (SQLException e) {
    // e.printStackTrace(); // skip; don't print out error
    return false;
}

Update table on mysql after a bigdecimal is declared

I have the following code in my application, in which I am trying to update the value total in my MySQL database table called "porcobrar2012". However, the only value that gets updated is the last one generated in the while loop. Why? All values are printed out on the screen with no problem, but those values are not getting updated in the database.
Here is the code:
BigDecimal total = new BigDecimal("0");
try
{
    //Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
    //Connection connection=DriverManager.getConnection("jdbc:odbc:db1","","");
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/etoolsco_VecinetSM?user=etoolsco&password=g7Xm2heD41");
    Statement statement = connection.createStatement();
    String query = "SELECT * FROM porcobrar2012";
    ResultSet resultSet = statement.executeQuery(query);
    while (resultSet.next())
    {
        out.println(resultSet.getString(2) + "");
        out.println(resultSet.getBigDecimal(3) + "");
        out.println(resultSet.getBigDecimal(4) + "");
        out.println(resultSet.getBigDecimal(5) + "");
        out.println(resultSet.getBigDecimal(6) + "");
        out.println(resultSet.getBigDecimal(7) + "");
        out.println(resultSet.getBigDecimal(8) + "");
        out.println(resultSet.getBigDecimal(9) + "");
        out.println(resultSet.getBigDecimal(10) + "");
        out.println(resultSet.getBigDecimal(11) + "");
        out.println(resultSet.getBigDecimal(12) + "");
        out.println(resultSet.getBigDecimal(13) + "");
        out.println(resultSet.getBigDecimal(14) + "");
        out.println(resultSet.getBigDecimal(15) + "");
        total = resultSet.getBigDecimal(3)
                .add(resultSet.getBigDecimal(4)).add(resultSet.getBigDecimal(5))
                .add(resultSet.getBigDecimal(6)).add(resultSet.getBigDecimal(7))
                .add(resultSet.getBigDecimal(8)).add(resultSet.getBigDecimal(9))
                .add(resultSet.getBigDecimal(10)).add(resultSet.getBigDecimal(11))
                .add(resultSet.getBigDecimal(12)).add(resultSet.getBigDecimal(13))
                .add(resultSet.getBigDecimal(14)).add(resultSet.getBigDecimal(15));
        String query1 = "UPDATE porcobrar2012 SET total=total";
        PreparedStatement ps = connection.prepareStatement(query1);
        ps.executeUpdate();
        out.println(total);
    }
    connection.close();
    statement.close();
}
catch (Exception e)
{
    //e.printStackTrace();
    out.println(e.toString());
}
It's because the update closes the existing result set. But I would ask why you aren't doing the addition in a single UPDATE statement at the database, with no prior query, no loops, and no BigDecimals. Rule one of database programming is 'don't move the data further than you need to'. It would be many times as efficient to just write "UPDATE porcobrar2012 SET a=b+c+d+...". And you can remove the Class.forName() call too: it hasn't been required for years.
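A sketch of that single-statement approach, with illustrative column names since the real ones aren't shown in the question:

```sql
-- Hypothetical column names: compute and store the total for every
-- row in one pass, entirely inside the database.
UPDATE porcobrar2012
SET total = col3 + col4 + col5 + col6 + col7 + col8 + col9
          + col10 + col11 + col12 + col13 + col14 + col15;
```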

speed up sql INSERTs

I have the following method to insert millions of rows of data into a table (I use SQL Server 2008), and it seems slow. Is there any way to speed up INSERTs?
Here is the code snippet - I use MS Enterprise Library:
public void InsertHistoricData(List<DataRow> dataRowList)
{
    string sql = string.Format(@"INSERT INTO [MyTable] ([Date],[Open],[High],[Low],[Close],[Volumn])
        VALUES( @DateVal, @OpenVal, @High, @Low, @CloseVal, @Volumn )");
    DbCommand dbCommand = VictoriaDB.GetSqlStringCommand(sql);
    DB.AddInParameter(dbCommand, "DateVal", DbType.Date);
    DB.AddInParameter(dbCommand, "OpenVal", DbType.Currency);
    DB.AddInParameter(dbCommand, "High", DbType.Currency);
    DB.AddInParameter(dbCommand, "Low", DbType.Currency);
    DB.AddInParameter(dbCommand, "CloseVal", DbType.Currency);
    DB.AddInParameter(dbCommand, "Volumn", DbType.Int32);
    foreach (NasdaqHistoricDataRow dataRow in dataRowList)
    {
        DB.SetParameterValue(dbCommand, "DateVal", dataRow.Date);
        DB.SetParameterValue(dbCommand, "OpenVal", dataRow.Open);
        DB.SetParameterValue(dbCommand, "High", dataRow.High);
        DB.SetParameterValue(dbCommand, "Low", dataRow.Low);
        DB.SetParameterValue(dbCommand, "CloseVal", dataRow.Close);
        DB.SetParameterValue(dbCommand, "Volumn", dataRow.Volumn);
        DB.ExecuteNonQuery(dbCommand);
    }
}
Consider using bulk insert instead.
SqlBulkCopy lets you efficiently bulk load a SQL Server table with data from another source. The SqlBulkCopy class can be used to write data only to SQL Server tables. However, the data source is not limited to SQL Server; any data source can be used, as long as the data can be loaded to a DataTable instance or read with an IDataReader instance. For this example the file will contain roughly 1000 records, but this code can handle large amounts of data.
This example first creates a DataTable and fills it with the data. This is kept in memory.
DataTable dt = new DataTable();
string line = null;
bool firstRow = true;
using (StreamReader sr = File.OpenText(@"c:\temp\table1.csv"))
{
    while ((line = sr.ReadLine()) != null)
    {
        string[] data = line.Split(',');
        if (data.Length > 0)
        {
            if (firstRow)
            {
                foreach (var item in data)
                {
                    dt.Columns.Add(new DataColumn());
                }
                firstRow = false;
            }
            DataRow row = dt.NewRow();
            row.ItemArray = data;
            dt.Rows.Add(row);
        }
    }
}
Then we push the DataTable to the server in one go.
using (SqlConnection cn = new SqlConnection(ConfigurationManager.ConnectionStrings["ConsoleApplication3.Properties.Settings.daasConnectionString"].ConnectionString))
{
    cn.Open();
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.ColumnMappings.Add(0, 0);
        copy.ColumnMappings.Add(1, 1);
        copy.ColumnMappings.Add(2, 2);
        copy.ColumnMappings.Add(3, 3);
        copy.ColumnMappings.Add(4, 4);
        copy.DestinationTableName = "Censis";
        copy.WriteToServer(dt);
    }
}
One general tip for any relational database when doing a large number of inserts, or indeed any data change, is to drop all your secondary indexes first and then recreate them afterwards.
Why does this work? Well, with secondary indexes the index data lives elsewhere on the disk from the table data, forcing at best an additional read/write update for each record written to the table, per index. In fact it may be much worse than this, as from time to time the database will decide it needs to carry out a more serious reorganisation operation on the index.
When you recreate the index at the end of the insert run the database will perform just one full table scan to read and process the data. Not only do you end up with a better organised index on disk, but the total amount of work required will be less.
When is this worth doing? That depends on your database, index structure and other factors (such as whether you have your indexes on a separate disk from your data), but my rule of thumb is to consider it if I am processing more than 10% of the records in a table of a million records or more - and then check with test inserts to see if it is worthwhile.
Of course on any particular database there will be specialist bulk insert routines, and you should also look at those.
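For SQL Server, for example, the native routine might be used like this (the file path, table name, and format details are illustrative):

```sql
-- Illustrative T-SQL: load a CSV file directly into the table
-- in one bulk operation.
BULK INSERT MyTable
FROM 'c:\temp\table1.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);
```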
FYI - looping through a record set and doing a million+ inserts on a relational DB is the worst-case scenario when loading a table. Some languages now offer record-set objects. For fastest performance SMINK is right: use BULK INSERT. Millions of rows loaded in minutes, rather than hours. Orders of magnitude faster than any other method.
As an example, I worked on an eCommerce project that required a product list refresh each night. 100,000 rows inserted into a high-end Oracle DB took 10 hours. If I remember correctly, the top speed when doing row-by-row inserts is approx. 10 records/sec. Painfully slow and completely unnecessary. With bulk insert, 100K rows should take less than a minute.
Hope this helps.
Where does the data come from? Could you run a bulk insert? If so, that is the best option you could take.