I have a large insert job to perform, say 300,000 inserts.
If I do it the legacy way, I just build a SQL string containing blocks of 100 INSERT statements and execute the command against the DB once per 100 records.
That comes to roughly 100 inserts every 3 seconds or so.
Now of course there are issues with single quotes and CrLfs within the inserted values. So rather than writing code to double the single quotes and so on, being lazy I had a go with LINQ InsertOnSubmit and one context.SubmitChanges() every 100 rows.
And that takes some 20x longer than the legacy way!
Why?
You're not using the right tool for the job. LINQ to SQL and most other ORMs (at least Entity Framework and NHibernate) are designed for OLTP scenarios; they are not meant for bulk data operations and will perform poorly when used that way.
You should be using SqlBulkCopy.
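A minimal sketch of what that could look like, assuming your rows can be staged in a DataTable (the table name, connection string, and batch size below are placeholders, not taken from the question):

using System.Data;
using System.Data.SqlClient;

public static void BulkInsert(DataTable rows, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "MyTable"; // placeholder destination table
            bulkCopy.BatchSize = 5000;                 // rows sent per batch
            bulkCopy.BulkCopyTimeout = 0;              // no timeout for a large load
            bulkCopy.WriteToServer(rows);              // one bulk operation instead of per-row INSERTs
        }
    }
}

SqlBulkCopy also sidesteps the quoting problem from the question, since values are sent as typed data rather than spliced into SQL text.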
I had the same issues, with InsertOnSubmit() taking a long time.
However, using the DataTableHelper class (downloadable from the link below), and changing just 1 or 2 lines of your code, you can easily use a Bulk Insert instead.
Bulk-inserts
For example:
const int RECORDS_TO_INSERT = 5000;

List<Product> recordsToBeInserted = new List<Product>();
using (NorthwindDataContext dc = new NorthwindDataContext())
{
    for (int n = 0; n < RECORDS_TO_INSERT; n++)
    {
        Product newProduct = new Product()
        {
            ProductName = "Product " + n.ToString(),
            UnitPrice = 3999,
            UnitsInStock = 2,
            UnitsOnOrder = 0,
            Discontinued = false
        };
        recordsToBeInserted.Add(newProduct);
    }

    // Insert this List<> of records into the [Products] table in our database, using a Bulk Insert
    DataTableHelper.BulkCopyToDatabase(recordsToBeInserted, "Products", dc);
}
Hope this helps.
I am writing a small application with:

- Mojolicious
- DBIx::Class
- Hypnotoad (a pre-forking web server)
- MySQL
In my application I need to do the following:

1. Do some complex processing (takes a minute or so to complete)
2. Insert the resulting data from the above processing into tables
3. Obtain the last auto-increment IDs of some tables, then do some more processing
4. Use the values from (3) as part of an insert into another table (a junction table)
Here is some sample code starting at step 2
#step 2
my $device = $device_rs->create(
    {
        devicename    => $deviceName,
        objects       => \@objects,
        object_groups => \@objectgroups,
    }
);

#step 3
my $lastogid = $db->resultset('ObjectGroup')->get_column('objectgroupid')->max;
my $lastobid = $db->resultset('Object')->get_column('objectid')->max;

my $obgcount = scalar(@objectgroups);
my $objcount = scalar(@objects);

my $ogoffset = $lastogid - $obgcount;
my $oboffset = $lastobid - $objcount;

#now increment the object/group ids by the offset, which will be inserted into the many-to-many table
foreach my $hash (@childobjects) {
    $hash->{'objectgroup_objectgroupid'} += $ogoffset;
    $hash->{'object_objectid'} += $oboffset;
}

#step 4 - populate the junction table
$db->resultset('ObjectGroupHasObjects')->populate(\@childobjects);
Now, due to having multiple threads going at once, the values obtained in step 3 may not be correct (for the current 'device').
I'm trying to find a way around this issue. The only thing I can think of at the moment is putting a lock on the database tables before step 2 and unlocking after step 4.
How can I do this in DBIx::Class and is this likely to resolve my issue?
Thank you.
Something like
$schema->dbh_do("LOCK TABLES names");
...
...
$schema->dbh_do("UNLOCK TABLES");
Source: http://www.perlmonks.org/?node_id=854538
Also see: How to avoid race conditions when using the find_or_create method of DBIx::Class::ResultSet?
and SQLHackers::SELECT#SELECT_..._FOR_UPDATE
I have a very large MySQL table (billions of rows, with dozens of columns) that I would like to convert into a ColumnFamily in Cassandra. I'm using Hector.
I first create my schema like so:
String clusterName = "Test Cluster";
String host = "cassandra.lanhost.com:9160";
String newKeyspaceName = "KeyspaceName";
String newColumnFamilyName = "CFName";
ThriftCluster cassandraCluster;
CassandraHostConfigurator cassandraHostConfigurator;
cassandraHostConfigurator = new CassandraHostConfigurator(host);
cassandraCluster = new ThriftCluster(clusterName, cassandraHostConfigurator);
BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition();
columnFamilyDefinition.setKeyspaceName(newKeyspaceName);
columnFamilyDefinition.setName(newColumnFamilyName);
columnFamilyDefinition.setDefaultValidationClass("UTF8Type");
columnFamilyDefinition.setKeyValidationClass(ComparatorType.UTF8TYPE.getClassName());
columnFamilyDefinition.setComparatorType(ComparatorType.UTF8TYPE);
BasicColumnDefinition columnDefinition = new BasicColumnDefinition();
columnDefinition.setName(StringSerializer.get().toByteBuffer("id"));
columnDefinition.setIndexType(ColumnIndexType.KEYS);
columnDefinition.setValidationClass(ComparatorType.INTEGERTYPE.getClassName());
columnDefinition.setIndexName("id_index");
columnFamilyDefinition.addColumnDefinition(columnDefinition);
columnDefinition = new BasicColumnDefinition();
columnDefinition.setName(StringSerializer.get().toByteBuffer("status"));
columnDefinition.setIndexType(ColumnIndexType.KEYS);
columnDefinition.setValidationClass(ComparatorType.ASCIITYPE.getClassName());
columnDefinition.setIndexName("status_index");
columnFamilyDefinition.addColumnDefinition(columnDefinition);
.......
ColumnFamilyDefinition cfDef = new ThriftCfDef(columnFamilyDefinition);
KeyspaceDefinition keyspaceDefinition =
HFactory.createKeyspaceDefinition(newKeyspaceName, "org.apache.cassandra.locator.SimpleStrategy", 1, Arrays.asList(cfDef));
cassandraCluster.addKeyspace(keyspaceDefinition);
Once that's done, I load my data, which is stored in a List since I'm fetching the MySQL data with a NamedParameterJdbcTemplate, like so:
String clusterName = "Test Cluster";
String host = "cassandra.lanhost.com:9160";
String KeyspaceName = "KeyspaceName";
String ColumnFamilyName = "CFName";
final StringSerializer serializer = StringSerializer.get();
public void insert(List<SqlParameterSource> dataToInsert) throws ExceptionParserInterrupted {
Keyspace workingKeyspace = null;
Cluster cassandraCluster = HFactory.getOrCreateCluster(clusterName, host);
workingKeyspace = HFactory.createKeyspace(KeyspaceName, cassandraCluster);
Mutator<String> mutator = HFactory.createMutator(workingKeyspace, serializer);
ColumnFamilyTemplate<String, String> template = new ThriftColumnFamilyTemplate<String, String>(workingKeyspace, ColumnFamilyName, serializer, serializer);
long t1 = System.currentTimeMillis();
for (SqlParameterSource data : dataToInsert) {
String keyId = "id" + (Integer) data.getValue("id");
mutator.addInsertion(keyId, ColumnFamilyName, HFactory.createColumn("id", (Integer) data.getValue("id"), StringSerializer.get(), IntegerSerializer.get()));
mutator.addInsertion(keyId,ColumnFamilyName, HFactory.createStringColumn("status", data.getValue("status").toString()));
...............
}
mutator.execute();
    System.out.println(System.currentTimeMillis() - t1); // elapsed milliseconds
}
I'm inserting 100,000 rows in approximately 1 hour, which is really slow. I've heard about multi-threading my inserts, but in this particular case I don't know what to do. Should I use BatchMutate?
Yes, you should run your insertion code from multiple threads. Take a look at the following stress-testing code for an example of how to do this efficiently with Hector:
https://github.com/zznate/cassandra-stress
An additional source of your insert performance issue may be the number of secondary indexes you are applying on the column family (each secondary index creates an additional column family 'under the hood').
Correctly designed data models should not really need a large number of secondary indexes. The following article provides a good overview of data modeling in Cassandra:
http://www.datastax.com/docs/1.0/ddl/index
There is one alternate way of achieving this. You can try exploring https://github.com/impetus-opensource/Kundera. You would love it.
Kundera is a JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL Datastores and currently supports Cassandra, HBase, MongoDB and all relational datastores (Kundera internally uses Hibernate for all relational datastores).
In your case you can use your existing objects along with JPA annotations to store them in Cassandra. Since Kundera supports polyglot persistence, you can also use a MySQL + Cassandra combination, with MySQL for most of your data and Cassandra for transactional data. And since all you need to care about is objects and JPA annotations, your job becomes much easier.
For performance you can have a look at https://github.com/impetus-opensource/Kundera/wiki/Kundera-Performance
Is there a way to do a bulk update on a collection with LINQ? Currently if I have a List<myObject> and I want to update column1 to equal TEST for every row in the List, I would set up a foreach loop and then for each individual object I would set the value and then save it. This works fine, but I was just wondering if there was some LINQ method out there where I could do something like myObject.BulkUpdate(columnName, value)?
Your requirement here is entirely possible using LINQ expressions and Terry Aney's excellent library on this topic:
Batch Updates and Deletes with LINQ to SQL
An update in terms of the example you gave would be as follows:
using BTR.Core.Linq;
...
Context.myObjects.UpdateBatch
(
Context.myObjects.Where(x => x.columnName != value),
x => new myObject { columnName = value}
);
Edit (2017-01-20): It's worth noting this is now available in the form of a NuGet package at https://www.nuget.org/packages/LinqPost/.
Install-Package LinqPost
Sounds like you're using LINQ To SQL, and you've got the basics laid out already.
LINQ To SQL is about abstracting tables into classes, and doesn't really provide the 'silver bullet' or one-liner you are looking for.
The only way to achieve your one-liner would be to make a stored proc that takes the column name and new value, and implement that logic yourself.
db.MassUpdateTableColumn("Customer", "Name", "TEST");
....
CREATE PROC MassUpdateTableColumn
@TableName varchar(100), @ColumnName varchar(100), @NewVal varchar(100)
AS
/*your dynamic SQL to update a table column with a new val. */
Otherwise, it's as you describe:
List<Customer> myCusts = db.Customers.ToList();
foreach(Customer c in myCusts)
{
c.Name = "TEST";
}
db.SubmitChanges();
LINQ to SQL (or EF, for that matter) is all about bringing objects into memory, manipulating them, and then updating them with separate database requests for each row.
In cases where you don't need to hydrate the entire object on the client, it is much better to use server-side operations (stored procs, T-SQL) instead of LINQ. You can use the LINQ providers to issue T-SQL against the database. For example, with LINQ to SQL you can use context.ExecuteCommand("Update table set field=value where condition"); just watch out for SQL injection.
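One way to stay safe is to pass values through ExecuteCommand's parameter list instead of concatenating them into the SQL string; the provider binds them as SQL parameters. A rough sketch (the Customers/Name names are placeholders, not from the question):

// Hedged sketch: table and column names are placeholders.
using (var db = new MyDataContext())   // assumed DataContext type
{
    // {0} is bound as a parameter, so the value is never spliced into the SQL text.
    int rowsAffected = db.ExecuteCommand(
        "UPDATE Customers SET Name = {0} WHERE Name <> {0}", "TEST");
}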
EF Core 7.0 introduces Bulk Update and Bulk Delete.
For example, consider the following LINQ query terminated with a call to ExecuteUpdateAsync:
var priorToDateTime = new DateTime(priorToYear, 1, 1);
await context.Tags
.Where(t => t.Posts.All(e => e.PublishedOn < priorToDateTime))
.ExecuteUpdateAsync(s => s.SetProperty(t => t.Text, t => t.Text + " (old)"));
This generates SQL that immediately updates the “Text” column of every tag whose posts were all published before the given date:
UPDATE [t]
SET [t].[Text] = [t].[Text] + N' (old)'
FROM [Tags] AS [t]
WHERE NOT EXISTS (
SELECT 1
FROM [PostTag] AS [p]
INNER JOIN [Posts] AS [p0] ON [p].[PostsId] = [p0].[Id]
WHERE [t].[Id] = [p].[TagsId] AND [p0].[PublishedOn] < @__priorToDateTime_1)
- SQL Server 2008 Ent
- ASP.NET MVC 2.0
- Linq-to-SQL
I am building a gaming site that tracks when a particular player (toon) downed a particular monster (boss). The table looks something like:
int ToonId
int BossId
datetime LastKillTime
I use a 3rd-party service that gives me back the latest information (toon, boss, time).
Now I want to update my database with that new information.
The brute-force approach is to do a line-by-line upsert, but it looks ugly (code-wise) and is probably slow too.
I think a better solution would be to insert the new data (using a temp table?) and then run a MERGE statement.
Is that a good idea? I know temp tables are "better to avoid". Should I create a permanent "temp" table just for this operation?
Or should I just read the entire current set (100 rows at most), do the merge, and write it back from within the application?
Any pointers/suggestions are always appreciated.
An ORM is the wrong tool for performing batch operations, and LINQ to SQL is no exception. In this case I think you have picked the right solution: store all entries in a temporary table quickly, then do the upsert using MERGE.
The fastest way to get the data into the temporary table is to use SqlBulkCopy to load it all into a table of your choice.
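A rough sketch of that shape, assuming a staging table named KillStaging with the same three columns (all of these names are placeholders, not from the question):

// Hedged sketch: "KillStaging", "Kills" and the column names are placeholders.
using (var cn = new SqlConnection(connectionString))
{
    cn.Open();

    // 1. Bulk-load the fresh rows from the service into the staging table.
    using (var bulk = new SqlBulkCopy(cn) { DestinationTableName = "KillStaging" })
    {
        bulk.WriteToServer(stagingTable); // stagingTable is a DataTable of (ToonId, BossId, LastKillTime)
    }

    // 2. Upsert from staging into the real table in one server-side statement.
    var merge = new SqlCommand(@"
        MERGE Kills AS target
        USING KillStaging AS source
            ON target.ToonId = source.ToonId AND target.BossId = source.BossId
        WHEN MATCHED THEN
            UPDATE SET target.LastKillTime = source.LastKillTime
        WHEN NOT MATCHED THEN
            INSERT (ToonId, BossId, LastKillTime)
            VALUES (source.ToonId, source.BossId, source.LastKillTime);", cn);
    merge.ExecuteNonQuery();

    // 3. Clear the staging table for the next run.
    new SqlCommand("TRUNCATE TABLE KillStaging;", cn).ExecuteNonQuery();
}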
If you're using Linq-to-SQL, upserts aren't that ugly..
foreach (var line in linesFromService)
{
    var kill = db.Kills.FirstOrDefault(t => t.ToonId == line.ToonId && t.BossId == line.BossId);
    if (kill == null)
    {
        kill = new Kills() { ToonId = line.ToonId, BossId = line.BossId };
        db.Kills.InsertOnSubmit(kill);
    }
    kill.LastKillTime = line.LastKillTime;
}
db.SubmitChanges();
Not a work of art, but nicer than in SQL. Also, with only 100 rows, I wouldn't be too concerned about performance.
Looks like a straight-forward insert.
private ToonModel _db = new ToonModel();
Toon t = new Toon();
t.ToonId = 1;
t.BossId = 2;
t.LastKillTime = DateTime.Now;
_db.Toons.InsertOnSubmit(t);
_db.SubmitChanges();
To update without querying the records first, you can do the following. It will still hit the DB once to check whether the record exists, but it will not pull the record:
var blob = new Blob { Id = "some id", Value = "some value" }; // Id is primary key (PK)
if (dbContext.Blobs.Contains(blob)) // if blob exists by PK then update
{
// This will update all columns that are not set in 'original' object. For
// this to work, Blob has to have UpdateCheck=Never for all properties except
// for primary keys. This will update the record without querying it first.
dbContext.Blobs.Attach(blob, original: new Blob { Id = blob.Id });
}
else // insert
{
dbContext.Blobs.InsertOnSubmit(blob);
}
dbContext.Blobs.SubmitChanges();
See here for an extension method for this.
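As a rough idea of what such an extension method might look like (a hypothetical sketch, not the code behind that link; the method name and the originalStub parameter are illustrative assumptions):

// Hypothetical sketch of an insert-or-update extension for LINQ to SQL.
using System;
using System.Data.Linq;
using System.Linq;

public static class TableExtensions
{
    public static void InsertOrUpdateOnSubmit<TEntity>(
        this Table<TEntity> table, TEntity entity, Func<TEntity> originalStub)
        where TEntity : class
    {
        if (table.Contains(entity))               // translated to a primary-key existence check
            table.Attach(entity, originalStub()); // update path: attach against a PK-only "original"
        else
            table.InsertOnSubmit(entity);         // insert path
    }
}

Usage would then be something like dbContext.Blobs.InsertOrUpdateOnSubmit(blob, () => new Blob { Id = blob.Id }); followed by SubmitChanges().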
I have the following method to insert millions of rows of data into a table (I use SQL Server 2008) and it seems slow. Is there any way to speed up the INSERTs?
Here is the code snippet - I use the MS Enterprise Library:
public void InsertHistoricData(List<NasdaqHistoricDataRow> dataRowList)
{
    string sql = @"INSERT INTO [MyTable] ([Date],[Open],[High],[Low],[Close],[Volumn])
                   VALUES( @DateVal, @OpenVal, @High, @Low, @CloseVal, @Volumn )";

    DbCommand dbCommand = VictoriaDB.GetSqlStringCommand(sql);
    DB.AddInParameter(dbCommand, "DateVal", DbType.Date);
    DB.AddInParameter(dbCommand, "OpenVal", DbType.Currency);
    DB.AddInParameter(dbCommand, "High", DbType.Currency);
    DB.AddInParameter(dbCommand, "Low", DbType.Currency);
    DB.AddInParameter(dbCommand, "CloseVal", DbType.Currency);
    DB.AddInParameter(dbCommand, "Volumn", DbType.Int32);

    foreach (NasdaqHistoricDataRow dataRow in dataRowList)
    {
        DB.SetParameterValue(dbCommand, "DateVal", dataRow.Date);
        DB.SetParameterValue(dbCommand, "OpenVal", dataRow.Open);
        DB.SetParameterValue(dbCommand, "High", dataRow.High);
        DB.SetParameterValue(dbCommand, "Low", dataRow.Low);
        DB.SetParameterValue(dbCommand, "CloseVal", dataRow.Close);
        DB.SetParameterValue(dbCommand, "Volumn", dataRow.Volumn);

        // One round trip to the server per row.
        DB.ExecuteNonQuery(dbCommand);
    }
}
Consider using bulk insert instead.
SqlBulkCopy lets you efficiently bulk load a SQL Server table with data from another source. The SqlBulkCopy class can be used to write data only to SQL Server tables. However, the data source is not limited to SQL Server; any data source can be used, as long as the data can be loaded into a DataTable instance or read with an IDataReader instance. For this example the file will contain roughly 1000 records, but this code can handle large amounts of data.
This example first creates a DataTable and fills it with the data. This is kept in memory.
DataTable dt = new DataTable();
string line = null;
bool firstRow = true;

using (StreamReader sr = File.OpenText(@"c:\temp\table1.csv"))
{
    while ((line = sr.ReadLine()) != null)
    {
        string[] data = line.Split(',');
        if (data.Length > 0)
        {
            if (firstRow)
            {
                foreach (var item in data)
                {
                    dt.Columns.Add(new DataColumn());
                }
                firstRow = false;
            }

            DataRow row = dt.NewRow();
            row.ItemArray = data;
            dt.Rows.Add(row);
        }
    }
}
Then we push the DataTable to the server in one go.
using (SqlConnection cn = new SqlConnection(ConfigurationManager.ConnectionStrings["ConsoleApplication3.Properties.Settings.daasConnectionString"].ConnectionString))
{
    cn.Open();
    using (SqlBulkCopy copy = new SqlBulkCopy(cn))
    {
        copy.ColumnMappings.Add(0, 0);
        copy.ColumnMappings.Add(1, 1);
        copy.ColumnMappings.Add(2, 2);
        copy.ColumnMappings.Add(3, 3);
        copy.ColumnMappings.Add(4, 4);
        copy.DestinationTableName = "Censis";
        copy.WriteToServer(dt);
    }
}
One general tip on any relational database when doing a large number of inserts, or indeed any data change, is to drop all your secondary indexes first, then recreate them afterwards.
Why does this work? With secondary indexes, the index data lives elsewhere on disk from the table data, forcing at best an additional read/write per index for each record written to the table. In fact it may be much worse than this, as from time to time the database will decide it needs to carry out a more serious reorganisation operation on the index.
When you recreate the index at the end of the insert run, the database performs just one full table scan to read and process the data. Not only do you end up with a better-organised index on disk, but the total amount of work required is less.
When is this worthwhile doing? That depends upon your database, index structure and other factors (such as if you have your indexes on a separate disk to your data) but my rule of thumb is to consider it if I am processing more than 10% of the records in a table of a million records or more - and then check with test inserts to see if it is worthwhile.
Of course on any particular database there will be specialist bulk insert routines, and you should also look at those.
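On SQL Server, for instance, the drop-and-recreate pattern can be scripted from the loader itself. A rough sketch (the index, table, and column names here are made up for illustration):

// Hedged sketch: "IX_MyTable_SecondaryCol", "MyTable" and "SecondaryCol" are placeholder names.
using (var cn = new SqlConnection(connectionString))
{
    cn.Open();

    // Drop the secondary index before the bulk load...
    new SqlCommand("DROP INDEX IX_MyTable_SecondaryCol ON MyTable;", cn).ExecuteNonQuery();

    // ...perform the bulk insert here (SqlBulkCopy, BULK INSERT, etc.)...

    // ...then rebuild the index afterwards in a single pass over the table.
    new SqlCommand("CREATE INDEX IX_MyTable_SecondaryCol ON MyTable (SecondaryCol);", cn).ExecuteNonQuery();
}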
FYI - looping through a record set and doing a million+ inserts on a relational DB is the worst-case scenario when loading a table. Some languages now offer record-set objects. For the fastest performance SMINK is right: use BULK INSERT. Millions of rows load in minutes rather than hours, orders of magnitude faster than any other method.
As an example, I worked on an eCommerce project that required a product-list refresh each night. Inserting 100,000 rows into a high-end Oracle DB took 10 hours. If I remember correctly, the top speed when doing row-by-row inserts is approximately 10 recs/sec. Painfully slow and completely unnecessary. With bulk insert, 100K rows should take less than a minute.
Hope this helps.
Where does the data come from? Could you run a bulk insert? If so, that is the best option you could take.