Speed up SQL INSERTs - sql-server-2008

I have the following method to insert millions of rows of data into a table (I use SQL Server 2008) and it seems slow. Is there any way to speed up the INSERTs?
Here is the code snippet - I use the Microsoft Enterprise Library.
public void InsertHistoricData(List<DataRow> dataRowList)
{
    string sql = string.Format( @"INSERT INTO [MyTable] ([Date],[Open],[High],[Low],[Close],[Volumn])
        VALUES( @DateVal, @OpenVal, @High, @Low, @CloseVal, @Volumn )");

    DbCommand dbCommand = VictoriaDB.GetSqlStringCommand( sql );
    DB.AddInParameter(dbCommand, "DateVal", DbType.Date);
    DB.AddInParameter(dbCommand, "OpenVal", DbType.Currency);
    DB.AddInParameter(dbCommand, "High", DbType.Currency);
    DB.AddInParameter(dbCommand, "Low", DbType.Currency);
    DB.AddInParameter(dbCommand, "CloseVal", DbType.Currency);
    DB.AddInParameter(dbCommand, "Volumn", DbType.Int32);

    foreach (NasdaqHistoricDataRow dataRow in dataRowList)
    {
        DB.SetParameterValue( dbCommand, "DateVal", dataRow.Date );
        DB.SetParameterValue( dbCommand, "OpenVal", dataRow.Open );
        DB.SetParameterValue( dbCommand, "High", dataRow.High );
        DB.SetParameterValue( dbCommand, "Low", dataRow.Low );
        DB.SetParameterValue( dbCommand, "CloseVal", dataRow.Close );
        DB.SetParameterValue( dbCommand, "Volumn", dataRow.Volumn );
        DB.ExecuteNonQuery( dbCommand );
    }
}

Consider using a bulk insert instead.
SqlBulkCopy lets you efficiently bulk load a SQL Server table with data from another source. The SqlBulkCopy class can be used to write data only to SQL Server tables. However, the data source is not limited to SQL Server; any data source can be used, as long as the data can be loaded into a DataTable instance or read with an IDataReader instance. For this example the file will contain roughly 1000 records, but this code can handle large amounts of data.
This example first creates a DataTable and fills it with the data. This is kept in memory.
DataTable dt = new DataTable();
string line = null;
bool firstRow = true;

using (StreamReader sr = File.OpenText(@"c:\temp\table1.csv"))
{
    while ((line = sr.ReadLine()) != null)
    {
        string[] data = line.Split(',');
        if (data.Length > 0)
        {
            if (firstRow)
            {
                foreach (var item in data)
                {
                    dt.Columns.Add(new DataColumn());
                }
                firstRow = false;
            }

            DataRow row = dt.NewRow();
            row.ItemArray = data;
            dt.Rows.Add(row);
        }
    }
}
Then we push the DataTable to the server in one go.
using (SqlConnection cn = new SqlConnection(ConfigurationManager.ConnectionStrings["ConsoleApplication3.Properties.Settings.daasConnectionString"].ConnectionString))
{
    cn.Open();
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.ColumnMappings.Add(0, 0);
        copy.ColumnMappings.Add(1, 1);
        copy.ColumnMappings.Add(2, 2);
        copy.ColumnMappings.Add(3, 3);
        copy.ColumnMappings.Add(4, 4);
        copy.DestinationTableName = "Censis";
        copy.WriteToServer(dt);
    }
}

One general tip on any relational database when doing a large number of inserts, or indeed any data change, is to drop all your secondary indexes first and then recreate them afterwards.
Why does this work? With secondary indexes, the index data lives elsewhere on disk than the table data, so each record written to the table forces, at best, an additional read/write per index. In practice it can be much worse, because from time to time the database will decide it needs to carry out a more serious reorganisation of the index.
When you recreate the index at the end of the insert run, the database performs just one full table scan to read and process the data. Not only do you end up with a better organised index on disk, but the total amount of work required is less.
When is this worthwhile? That depends on your database, index structure and other factors (such as whether your indexes are on a separate disk from your data), but my rule of thumb is to consider it if I am processing more than 10% of the records in a table of a million records or more - and then check with test inserts to see if it is worthwhile.
Of course on any particular database there will be specialist bulk insert routines, and you should also look at those.
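For the SQL Server case in this question, the drop/recreate step can be scripted from .NET around the bulk load itself. A minimal sketch, assuming a hypothetical secondary index IX_MyTable_Date on [MyTable], a DataTable dt already filled with the rows, and a connectionString variable (requires System.Data and System.Data.SqlClient):
using (SqlConnection cn = new SqlConnection(connectionString))
{
    cn.Open();

    // Drop the secondary index before the heavy insert (hypothetical index name).
    using (SqlCommand cmd = new SqlCommand("DROP INDEX IX_MyTable_Date ON [MyTable]", cn))
    {
        cmd.ExecuteNonQuery();
    }

    // Bulk load while only the base table has to be maintained.
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.DestinationTableName = "MyTable";
        copy.WriteToServer(dt);
    }

    // Recreate the index in a single pass over the table.
    using (SqlCommand cmd = new SqlCommand("CREATE INDEX IX_MyTable_Date ON [MyTable] ([Date])", cn))
    {
        cmd.ExecuteNonQuery();
    }
}
Whether this pays off depends on how large the load is relative to the table, as discussed above.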

FYI - looping through a record set and doing a million+ inserts on a relational DB is the worst-case scenario when loading a table. Some languages now offer record-set objects. For fastest performance SMINK is right: use BULK INSERT. Millions of rows loaded in minutes rather than hours - orders of magnitude faster than any other method.
As an example, I worked on an eCommerce project that required a product list refresh each night. 100,000 rows inserted into a high-end Oracle DB took 10 hours. If I remember correctly, the top speed for row-by-row inserts is approximately 10 records/sec - painfully slow and completely unnecessary. With bulk insert, 100K rows should take less than a minute.
Hope this helps.

Where does the data come from? Could you run a bulk insert? If so, that is the best option you could take.

Related

Distributed database insertion speed is very slow

@Test
public void transaction() throws Exception {
    Connection conn = null;
    PreparedStatement ps = null;
    try {
        String sql = "insert into `1` values(?, ?, ?, ?)";
        conn = JDBCUtils.getConnection();
        ps = conn.prepareStatement(sql);
        conn.setAutoCommit(false);
        for (int i = 1; i <= 10000; i++) {
            ps.setObject(1, i);
            ps.setObject(2, 10.12345678);
            ps.setObject(3, "num_" + i);
            ps.setObject(4, "2021-12-24 19:00:00");
            ps.addBatch();
        }
        ps.executeBatch();
        ps.clearBatch();
        conn.commit();
    } catch (Exception e) {
        conn.rollback();
        e.printStackTrace();
    } finally {
        JDBCUtils.closeResources(conn, ps);
    }
}
When setAutoCommit = true, both local MySQL and distributed MySQL insert speeds are very slow.
When I set the transaction to commit manually, as in the code above, the local MySQL speed increases a lot, but the insertion speed of distributed MySQL is still very slow.
Are there any additional parameters I need to set?
Setting parameters probably won't help (much).
There are a couple of reasons for the slowness:
With autocommit=true you are committing on every insert statement. That means each new row must be written to disk before the database server returns the response to the client.
With autocommit=false there is still a client -> server -> client round trip for each insert statement. Those round trips add up to a significant amount of time.
One way to make this faster is to insert multiple rows with each insert statement, but that is messy because you would need to generate complex (multi-row) insert statements.
A better way is to use JDBC's batch feature to reduce the number of round-trips. For example:
PreparedStatement ps = c.prepareStatement("INSERT INTO employees VALUES (?, ?)");
ps.setString(1, "John");
ps.setString(2,"Doe");
ps.addBatch();
ps.clearParameters();
ps.setString(1, "Dave");
ps.setString(2,"Smith");
ps.addBatch();
ps.clearParameters();
int[] results = ps.executeBatch();
(Attribution: above code copied from this answer by @Tusc)
If that still isn't fast enough, you should get even better performance using MySQL's native bulk insert mechanism; e.g. load data infile; see High-speed inserts with MySQL
For completeness, I am adding this suggestion from @Wilson Hauck:
"In your configuration [mysqld] section, innodb_change_buffer_max_size=50 # from 25 (percent) for improved INSERT rate per second. SHOW FULL PROCESSLIST; to monitor when the instance has completed adjustment, then do your inserts and put it back to 25 percent for typical processing speed."
This may increase the insert rate depending on your table and its indexes, and on the order in which you are inserting the rows.
But the flip-side is that you may be able to achieve the same speedup (or more!) by other means; e.g.
by sorting your input so that rows are inserted in index order, or
by dropping the indexes, inserting the records and then recreating the indexes.
You can read about the change buffer here and make your own judgements.

Mahout 0.7 failed to get recommendation with a large data set using MysqlJdbcDataModel

I am using Mahout to build an item-based CF recommendation engine.
I created a MahoutHelper class which has a constructor:
public MahoutHelper(String serverName, String user, String password,
        String DatabaseName, String tableName) {
    source = new MysqlConnectionPoolDataSource();
    source.setServerName(serverName);
    source.setUser(user);
    source.setPassword(password);
    source.setDatabaseName(DatabaseName);
    source.setCachePreparedStatements(true);
    source.setCachePrepStmts(true);
    source.setCacheResultSetMetadata(true);
    source.setAlwaysSendSetIsolation(true);
    source.setElideSetAutoCommits(true);

    DBmodel = new MySQLJDBCDataModel(source, tableName, "userId", "itemId",
            "value", null);
    similarity = new TanimotoCoefficientSimilarity(DBmodel);
}
and the recommend method is:
public List<RecommendedItem> recommendation() throws TasteException {
    Recommender recommender = null;
    recommender = new GenericItemBasedRecommender(DBmodel, similarity);
    List<RecommendedItem> recommendations = null;
    recommendations = recommender.recommend(userId, maxNum);
    System.out.println("query completed");
    return recommendations;
}
It's using the data source to build the data model, but the problem is that when MySQL holds only a little data (fewer than 100 rows) the program works fine for me, while when the scale grows to over 1,000,000 rows, the program gets stuck doing the recommendation and never goes forward. I have no idea how this happens. By the way, I used the same data to build a FileDataModel from a .dat file, and it takes only 2~3 seconds to complete the analysis. I am confused.
Using the database directly will only work for tiny data sets, maybe up to a hundred thousand data points. Beyond that, the overhead means such data-intensive operations will never run quickly; a single recommendation can take thousands of SQL queries or more.
Instead you must load, and periodically re-load, the data into memory. You can still pull it from the database; look at ReloadFromJDBCDataModel as a wrapper.

LINQ InsertOnSubmit very slow compared to legacy SQL Insert statement

I have a large inserting job to perform, say 300000 Inserts.
If I do it the legacy way, I just write a SQL string with blocks of 100 INSERT statements and execute the command against the DB for each block of 100 records.
That yields roughly 100 inserts every 3 seconds or so.
Now of course there are issues with single quotes and CrLf's within the inserted values. So rather than writing code to double the single quotes and so on, since I'm lazy I had a go with LINQ InsertOnSubmit and one context.SubmitChanges every 100 rows.
And that takes some 20x longer than the legacy way!
Why?
You're not using the right tool for the job. LINQ-to-SQL and most other ORMs (at least Entity Framework and NHibernate) are meant for OLTP scenarios; they are not designed for bulk data operations and will perform slowly when used that way.
You should be using SqlBulkCopy.
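If you would rather not pull in a helper library, the same idea only needs a DataTable mapped to the destination columns. A rough sketch under assumptions: products is a List<Product> of the entities to insert, the column names and connectionString are placeholders rather than the asker's actual schema (requires System.Data and System.Data.SqlClient):
DataTable table = new DataTable();
table.Columns.Add("ProductName", typeof(string));
table.Columns.Add("UnitPrice", typeof(decimal));
table.Columns.Add("UnitsInStock", typeof(short));

foreach (Product p in products)
{
    table.Rows.Add(p.ProductName, p.UnitPrice, p.UnitsInStock);
}

using (SqlConnection cn = new SqlConnection(connectionString))
{
    cn.Open();
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.DestinationTableName = "Products";
        // Map by column name so the column order in the destination table does not matter.
        copy.ColumnMappings.Add("ProductName", "ProductName");
        copy.ColumnMappings.Add("UnitPrice", "UnitPrice");
        copy.ColumnMappings.Add("UnitsInStock", "UnitsInStock");
        copy.WriteToServer(table);
    }
}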
I had the same issues, with InsertOnSubmit() taking a long time.
However, using the DataTableHelper class (downloadable from the link below), and changing just 1 or 2 lines of your code, you can easily use a Bulk Insert instead.
Bulk-inserts
For example:
const int RECORDS_TO_INSERT = 5000;

List<Product> recordsToBeInserted = new List<Product>();
using (NorthwindDataContext dc = new NorthwindDataContext())
{
    for (int n = 0; n < RECORDS_TO_INSERT; n++)
    {
        Product newProduct = new Product()
        {
            ProductName = "Product " + n.ToString(),
            UnitPrice = 3999,
            UnitsInStock = 2,
            UnitsOnOrder = 0,
            Discontinued = false
        };
        recordsToBeInserted.Add(newProduct);
    }

    // Insert this List<> of records into the [Products] table in our database, using a Bulk Insert
    DataTableHelper.BulkCopyToDatabase(recordsToBeInserted, "Products", dc);
}
Hope this helps.

How can I speed up updating lots of rows

I have a table that has 1,400,000 entries. It is a simple list of documents:
Table - Document
    ID int
    DocumentPath nvarchar
    DocumentValid bit
I scan a directory and set any document found in the directory as valid.
public void SetReportsToValidated(List<int> validatedReports)
{
    SqlConnection myCon = null;
    try
    {
        myCon = new SqlConnection(_conn);
        myCon.Open();

        foreach (int id in validatedReports)
        {
            SqlDataAdapter myAdap = new SqlDataAdapter("update_DocumentValidated", myCon);
            myAdap.SelectCommand.CommandType = CommandType.StoredProcedure;
            SqlParameter pId = new SqlParameter("@Id", SqlDbType.Int);
            pId.Value = id;
            myAdap.SelectCommand.Parameters.Add(pId);
            myAdap.SelectCommand.ExecuteNonQuery();
        }
    }
    catch (SystemException ex)
    {
        _log.Error(ex);
        throw;
    }
    finally
    {
        if (myCon != null)
        {
            myCon.Close();
        }
    }
}
The performance of the updates is OK, but I want more. It takes more than 1 hour to update 1,000,000 of the documents to valid. Is there any good way to speed up the updates? I am thinking of using some kind of batch (like table-valued parameters).
Each update takes some 5-10 ms when profiled on SQL Server.
Read the reports in and append them together in a DataTable (since they have the same dimensions), then use the SqlBulkCopy object to upload the entire thing. That will probably work better for you. I don't think you will have memory issues given the small number of columns and rows.
At the moment you are calling the db for each record individually. You can use the SqlDataAdapter to do bulk updates by (in a very brief nutshell):
1) define one SqlDataAdapter
2) set the .UpdateCommand on the adapter to your update sproc
3) call the .Update method on the adapter, passing it a DataTable containing the ids of the documents to be updated. This will batch up the updated rows from the DataTable to the DB, calling the sproc for each record in a batched manner. You can control the batch size via the .UpdateBatchSize property.
4) This removes the manual, row-by-row looping, which is inefficient for batched updates (a sketch follows the example links below).
See examples:
http://support.microsoft.com/kb/308055
http://www.c-sharpcorner.com/UploadFile/61b832/4430/
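A rough sketch of that adapter approach, reusing the question's update_DocumentValidated sproc and _conn connection string; the batch size and details are illustrative rather than definitive (requires System.Data and System.Data.SqlClient):
DataTable ids = new DataTable();
ids.Columns.Add("Id", typeof(int));
foreach (int id in validatedReports)
{
    ids.Rows.Add(id);
}
ids.AcceptChanges();
foreach (DataRow row in ids.Rows)
{
    row.SetModified();   // mark every row as changed so Update() routes it through UpdateCommand
}

using (SqlConnection cn = new SqlConnection(_conn))
using (SqlDataAdapter adapter = new SqlDataAdapter())
{
    SqlCommand cmd = new SqlCommand("update_DocumentValidated", cn);
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@Id", SqlDbType.Int).SourceColumn = "Id";
    cmd.UpdatedRowSource = UpdateRowSource.None;   // required when batching commands

    adapter.UpdateCommand = cmd;
    adapter.UpdateBatchSize = 100;                 // send 100 sproc calls per round trip
    adapter.Update(ids);
}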
Alternatively, you could:
1) Use SqlBulkCopy to bulk insert all the IDs into a new table in the database (highly efficient)
2) Once loaded into that staging table, run a single SQL statement to update your main table from that staging table to validate the documents.
See examples:
http://www.adathedev.co.uk/2010/02/sqlbulkcopy-bulk-load-to-sql-server.html
http://www.adathedev.co.uk/2011/01/sqlbulkcopy-to-sql-server-in-parallel.html
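And a sketch of the staging-table alternative; the staging table dbo.DocumentIdsStaging (a single int Id column) is hypothetical and would need to exist, while dbo.Document follows the schema from the question (requires System.Data and System.Data.SqlClient):
DataTable ids = new DataTable();
ids.Columns.Add("Id", typeof(int));
foreach (int id in validatedReports)
{
    ids.Rows.Add(id);
}

using (SqlConnection cn = new SqlConnection(_conn))
{
    cn.Open();

    // 1) Bulk load the ids into the staging table (highly efficient).
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.DestinationTableName = "dbo.DocumentIdsStaging";
        copy.WriteToServer(ids);
    }

    // 2) One set-based statement validates every matching document, then clears the staging table.
    const string updateSql = @"
        UPDATE d SET d.DocumentValid = 1
        FROM dbo.Document d
        JOIN dbo.DocumentIdsStaging s ON s.Id = d.ID;
        TRUNCATE TABLE dbo.DocumentIdsStaging;";
    using (SqlCommand cmd = new SqlCommand(updateSql, cn))
    {
        cmd.ExecuteNonQuery();
    }
}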
Instead of creating the adapter and parameter every time in the loop, just create them once and assign a different value to the parameter:
SqlDataAdapter myAdap = new SqlDataAdapter("update_DocumentValidated", myCon);
myAdap.SelectCommand.CommandType = CommandType.StoredProcedure;
SqlParameter pId = new SqlParameter("@Id", SqlDbType.Int);
myAdap.SelectCommand.Parameters.Add(pId);

foreach (int id in validatedReports)
{
    myAdap.SelectCommand.Parameters[0].Value = id;
    myAdap.SelectCommand.ExecuteNonQuery();
}
This might not result in a very dramatic improvement, but it is better than the original code. Also, since you are executing the command manually, you do not need the adapter at all - just use a SqlCommand directly.

Do MERGE using Linq to SQL

SQL Server 2008 Ent
ASP.NET MVC 2.0
Linq-to-SQL
I am building a gaming site that tracks when a particular player (toon) downed a particular monster (boss). The table looks something like:
int ToonId
int BossId
datetime LastKillTime
I use a 3rd party service that gives me back the latest information (toon, boss, time).
Now I want to update my database with that new information.
The brute-force approach is to do a line-by-line upsert. But it looks ugly (code-wise), and is probably slow too.
I think better solution would be to insert new data (using temp table?) and then run MERGE statement.
Is it good idea? I know temp tables are "better-to-avoid". Should I create a permanent "temp" table just for this operation?
Or should I just read entire current set (100 rows at most), do merge and put it back from within application?
Any pointers/suggestions are always appreciated.
An ORM is the wrong tool for performing batch operations, and Linq-to-SQL is no exception. In this case I think you have picked the right solution: store all entries in a temporary table quickly, then do the upsert using MERGE.
The fastest way to get the data into the temporary table is to use SqlBulkCopy to write it all to a table of your choice.
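A sketch of that flow, assuming a staging table dbo.KillsStaging with the same three columns as the target dbo.Kills; the table names, connectionString, and linesFromService (the rows returned by the 3rd-party service) are placeholders (requires System.Data and System.Data.SqlClient):
DataTable staging = new DataTable();
staging.Columns.Add("ToonId", typeof(int));
staging.Columns.Add("BossId", typeof(int));
staging.Columns.Add("LastKillTime", typeof(DateTime));
foreach (var line in linesFromService)
{
    staging.Rows.Add(line.ToonId, line.BossId, line.LastKillTime);
}

using (SqlConnection cn = new SqlConnection(connectionString))
{
    cn.Open();

    // 1) Bulk load the latest data into the staging table.
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.DestinationTableName = "dbo.KillsStaging";
        copy.WriteToServer(staging);
    }

    // 2) One MERGE statement performs the upsert, then the staging table is cleared.
    const string mergeSql = @"
        MERGE dbo.Kills AS target
        USING dbo.KillsStaging AS source
           ON target.ToonId = source.ToonId AND target.BossId = source.BossId
        WHEN MATCHED THEN
            UPDATE SET target.LastKillTime = source.LastKillTime
        WHEN NOT MATCHED THEN
            INSERT (ToonId, BossId, LastKillTime)
            VALUES (source.ToonId, source.BossId, source.LastKillTime);
        TRUNCATE TABLE dbo.KillsStaging;";
    using (SqlCommand cmd = new SqlCommand(mergeSql, cn))
    {
        cmd.ExecuteNonQuery();
    }
}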
If you're using Linq-to-SQL, upserts aren't that ugly..
foreach (var line in linesFromService) {
    var kill = db.Kills.FirstOrDefault(t => t.ToonId == line.ToonId && t.BossId == line.BossId);
    if (kill == null) {
        kill = new Kills() { ToonId = line.ToonId, BossId = line.BossId };
        db.Kills.InsertOnSubmit(kill);
    }
    kill.LastKillTime = line.LastKillTime;
}
db.SubmitChanges();
Not a work of art, but nicer than in SQL. Also, with only 100 rows, I wouldn't be too concerned about performance.
Looks like a straightforward insert.
private ToonModel _db = new ToonModel();

Toon t = new Toon();
t.ToonId = 1;
t.BossId = 2;
t.LastKillTime = DateTime.Now;
_db.Toons.InsertOnSubmit(t);
_db.SubmitChanges();
To update without querying the records first, you can do the following. It will still hit the db once to check whether the record exists, but it will not pull the record:
var blob = new Blob { Id = "some id", Value = "some value" }; // Id is the primary key (PK)

if (dbContext.Blobs.Contains(blob)) // if the blob exists by PK then update
{
    // This will update all columns that are not set in the 'original' object. For
    // this to work, Blob has to have UpdateCheck=Never for all properties except
    // for primary keys. This will update the record without querying it first.
    dbContext.Blobs.Attach(blob, original: new Blob { Id = blob.Id });
}
else // insert
{
    dbContext.Blobs.InsertOnSubmit(blob);
}
dbContext.SubmitChanges();
See here for an extension method for this.