Grails 3: Bulk insert performance issue - mysql

Grails Version: 3.3.2
I have 100k records I am loading from a CSV file and trying to do a bulk save. The issue I am having is that the bulk save is performing worse than a non-bulk save.
Everything I found online basically uses the same approach as this site, which I referenced:
http://krixisolutions.com/bulk-insert-grails-gorm/
I tried all three solutions on that page; here is an example of one of them:
def saveFsRawData(List<FactFsRawData> rawData) {
    int startTime = DateTime.newInstance().secondOfDay;
    println("Start Save");
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    rawData.eachWithIndex { FactFsRawData entry, int i ->
        session.save(entry);
        // flush the batch to the DB and clear the session to keep it small
        if(i % 1000 == 0) {
            session.flush();
            session.clear();
        }
    }
    tx.commit(); // the commit flushes whatever is left in the session
    session.close();
    println("End Save - " + (DateTime.newInstance().secondOfDay - startTime));
}
I have tried various bulk sizes from 100 to 5k (using 1k in the example). All of them average around 80 seconds.
If I remove the batch processing completely then I get an average of 65 seconds.
I am unsure of what the issue is or where I am going wrong. Any ideas?
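One thing worth checking (an educated guess, not something confirmed in the question): flush()/clear() only manages the Hibernate session; unless JDBC batching is switched on, every save() is still sent to MySQL as its own INSERT, so the extra bookkeeping can easily make the loop slower than the plain version. A minimal configuration sketch, assuming Grails 3 defaults and MySQL Connector/J:

# grails-app/conf/application.yml - a hedged sketch, not the poster's config
hibernate:
    jdbc:
        batch_size: 1000   # actually batch the prepared statements
dataSource:
    # rewriteBatchedStatements lets Connector/J collapse a batch into
    # a single multi-row INSERT on the wire
    url: jdbc:mysql://localhost/mydb?rewriteBatchedStatements=true

Also note that Hibernate silently disables insert batching when the entity uses an identity (auto-increment) id generator, which is the usual default on MySQL, so the domain class may need a different generator before any of this takes effect.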

Related

JDBC result set iteration sometimes goes fast and sometimes goes slow (75x speed difference)

I'm developing a GAS script to retrieve data (~15,000 rows) from an Azure SQL database table into a Sheets spreadsheet. The code works fine, but there are huge run-to-run speed differences in the results.next() loop.
Below is my code (some variable declarations and private stuff removed), followed by logs from three executions:
function readData() {
  Logger.log('Establishing DB connection')
  let conn = Jdbc.getConnection(connectionString, user, userPwd);
  Logger.log('Executing query')
  let stmt = conn.createStatement();
  let results = stmt.executeQuery("SELECT * FROM VIEW");
  let contents = []
  let i = 0
  Logger.log("Iterating result set and adding into array")
  while (results.next()) {
    contents.push([
      results.getInt(1),
      results.getString(2),
      results.getInt(3),
      results.getString(4),
      results.getInt(5),
      results.getString(6),
      results.getString(7),
      results.getString(8),
      results.getFloat(9),
      results.getFloat(10),
      results.getInt(11),
      results.getString(12),
      results.getInt(13),
      results.getInt(14),
      results.getInt(15),
    ])
    // Make a log entry every 100th iteration and display the average elapsed ms per iteration
    i++
    if (i % 100 == 0) {
      Logger.log(i)
      // getTime() gives epoch milliseconds; getMilliseconds() would only return
      // the 0-999 ms component. 'start' is set the same way before the loop
      // (its declaration was removed with the others above).
      finish = new Date().getTime();
      Logger.log((finish - start) / i)
    }
  }
  sheet.getRange(2, 1, sheet.getLastRow(), 15).clearContent()
  sheet.getRange(2, 1, contents.length, 15).setValues(contents)
  results.close();
  stmt.close();
}
Fast run:
8:41:47 AM Info 11100 Records added
8:41:47 AM Info 8.43ms on average per record
8:41:47 AM Info
8:41:47 AM Info 11200 Records added
8:41:47 AM Info 8.42ms on average per record
8:41:47 AM Info
8:41:48 AM Info 11300 Records added
8:41:48 AM Info 8.42ms on average per record
Slow run:
8:48:01 AM Info 100 Records added
8:48:01 AM Info 162.30ms on average per record
8:48:01 AM Info
8:48:17 AM Info 200 Records added
8:48:17 AM Info 162.84ms on average per record
8:48:17 AM Info
8:48:34 AM Info 300 Records added
8:48:34 AM Info 163.11ms on average per record
Extremely slow run:
8:56:46 AM Info 300 Records added
8:56:46 AM Info 629.08ms on average per record
8:56:46 AM Info
8:57:49 AM Info 400 Records added
8:57:49 AM Info 628.95ms on average per record
8:57:49 AM Info
8:58:52 AM Info 500 Records added
8:58:52 AM Info 629.70ms on average per record
So as seen from the above logs, one run of the script can go roughly 75x faster than another, while the time per iteration stays the same within a specific run. I'm pretty baffled as to how that's possible. Is there something about the result set object I don't know?
You can submit a bug on Google's Issue Tracker using the following template for Apps Script:
https://issuetracker.google.com/issues/new?component=191640&template=823905
If you have a workspace account, you can also contact Google Workspace support so they can take a look at your issue.
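In the meantime, one knob worth experimenting with (this assumes the variance comes from per-row network round trips, which the logs do not prove): pin the fetch size. Apps Script's Jdbc service mirrors java.sql, and the statement accepts a fetch-size hint:

let stmt = conn.createStatement();
stmt.setFetchSize(1000); // hint: pull rows from Azure SQL in blocks of 1000
let results = stmt.executeQuery("SELECT * FROM VIEW");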

How to ensure the database is updated at a certain interval

I am working on a Spring application (Java 8).
I have a function that generates labels (PDF generation) asynchronously.
It contains a loop that usually runs more than 1,000 times, generating more than 1,000 PDF labels.
After every iteration we update the database to record the status: initially we save numberOfgeneratedCount = 0, then after each label we increment the variable and update the table.
It is not necessary to save the incremented count to the DB at the end of every iteration; we only need to update the database at a fixed interval, to reduce the load from database writes.
Currently my code looks like this:
// Label is a database model class; labeldb is a variable of that type
// commonDao.saveLabelToDb saves a Label object
int numberOfgeneratedCount = 0;
labeldb.setProcessedOrderCount(numberOfgeneratedCount);
commonDao.saveLabelToDb(labeldb);
for (Order order : orders) {
    generated = true;
    try {
        // pdf generation code
    } catch (Exception e) {
        // catch block here
        generated = false;
    }
    if (generated) {
        numberOfgeneratedCount++;
        labeldb.setProcessedOrderCount(numberOfgeneratedCount);
        commonDao.saveLabelToDb(labeldb);
    }
}
To improve performance, we want to update the database only every 10 seconds. Any help would be appreciated.
I have done it with the following code. I am not sure whether this is a good solution; can someone improve it using built-in functions?
int numberOfgeneratedCount = 0;
labeldb.setProcessedOrderCount(numberOfgeneratedCount);
commonDao.saveLabelToDb(labeldb);
int nowSecs = LocalTime.now().toSecondOfDay();
int lastSecs = nowSecs;
for (Order order : orders) {
    nowSecs = LocalTime.now().toSecondOfDay();
    generated = true;
    try {
        // pdf generation code
    } catch (Exception e) {
        // catch block here
        generated = false;
    }
    if (generated) {
        numberOfgeneratedCount++;
        labeldb.setProcessedOrderCount(numberOfgeneratedCount);
        if (nowSecs - lastSecs > 10) {
            lastSecs = nowSecs;
            commonDao.saveLabelToDb(labeldb);
        }
    }
}
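A hedged alternative sketch (Order, Label, commonDao, and the PDF step come from the question; everything else is assumed). It throttles on System.nanoTime(), which is monotonic, unlike LocalTime.toSecondOfDay(), which wraps at midnight, and it persists the final count after the loop, which the version above can silently drop:

private static final long SAVE_INTERVAL_NANOS = 10_000_000_000L; // 10 seconds

void generateLabels(List<Order> orders, Label labeldb) {
    int numberOfGeneratedCount = 0;
    long lastSave = System.nanoTime();
    for (Order order : orders) {
        try {
            // pdf generation code
        } catch (Exception e) {
            continue; // failed label: don't count it, move on
        }
        numberOfGeneratedCount++;
        labeldb.setProcessedOrderCount(numberOfGeneratedCount);
        if (System.nanoTime() - lastSave >= SAVE_INTERVAL_NANOS) {
            commonDao.saveLabelToDb(labeldb); // at most one write per interval
            lastSave = System.nanoTime();
        }
    }
    commonDao.saveLabelToDb(labeldb); // always persist the final count
}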

Big database - doctrine query slow even with index

I'm building an app with Symfony 4 + Doctrine, where people can upload big CSV files and those records then get stored in a database. Before inserting, I'm checking that the entry doesn't already exist...
On a sample CSV file with only 1,000 records, it takes 16 seconds without an index and 8 seconds with an index (MacBook, 3 GHz, 16 GB memory). My intuition tells me this is quite slow and should take under 1 second, especially with the index.
The index is set on the email column.
My code:
$ssList = $this->em->getRepository(EmailList::class)->findOneBy(["id" => 1]);

foreach ($csv as $record) {
    $subscriber_exists = $this->em->getRepository(Subscriber::class)
        ->findOneByEmail($record['email']);

    if ($subscriber_exists === NULL) {
        $subscriber = (new Subscriber())
            ->setEmail($record['email'])
            ->setFirstname($record['first_name'])
            ->addEmailList($ssList)
        ;

        $this->em->persist($subscriber);
        $this->em->flush();
    }
}
My Question:
How can I speed up this process?
Use LOAD DATA INFILE.
LOAD DATA INFILE has IGNORE and REPLACE options for handling duplicates if you put a UNIQUE KEY or PRIMARY KEY on your email column.
Look at settings for making the import faster.
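A hedged sketch of such a statement (the table and column names are illustrative, not from the question, and it assumes a UNIQUE KEY on email so that IGNORE skips duplicates):

-- skips rows whose email already exists, thanks to IGNORE + UNIQUE KEY
LOAD DATA LOCAL INFILE '/path/to/subscribers.csv'
IGNORE INTO TABLE subscriber
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(email, firstname);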
Like Cid said, move the flush() outside of the loop, or put a batch counter inside the loop and only flush at certain intervals:
$batchSize = 1000;
$i = 1;

foreach ($csv as $record) {
    $subscriber_exists = $this->em->getRepository(Subscriber::class)
        ->findOneByEmail($record['email']);

    if ($subscriber_exists === NULL) {
        $subscriber = (new Subscriber())
            ->setEmail($record['email'])
            ->setFirstname($record['first_name'])
            ->addEmailList($ssList)
        ;

        $this->em->persist($subscriber);

        if (($i % $batchSize) === 0) {
            $this->em->flush();
        }
        $i++;
    }
}

$this->em->flush();
Or if that's still slow, you can grab the Connection via $this->em->getConnection() and use DBAL directly, as stated here: https://www.doctrine-project.org/projects/doctrine-dbal/en/2.8/reference/data-retrieval-and-manipulation.html#insert
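A minimal sketch of that route (it assumes a table named subscriber with email and firstname columns, which may not match the real schema); DBAL's Connection::insert() builds the INSERT and skips the unit-of-work bookkeeping entirely:

$conn = $this->em->getConnection();

foreach ($csv as $record) {
    $conn->insert('subscriber', [
        'email'     => $record['email'],
        'firstname' => $record['first_name'],
    ]);
}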

Bulk insert performance issue in EF ObjectContext

I am trying to insert a large number of rows (>10,000,000) into a MySQL
database using EF ObjectContext (db-first). After reading the answer to this question,
I wrote this code (batch save) to insert about 10,000 contacts (actually about 30k rows, including related rows):
// var myContactGroupId = ...;
const int maxContactToAddInOneBatch = 100;
var numberOfContactsAdded = 0;
// IEnumerable<ContactDTO> contacts = ...
foreach (var contact in contacts)
{
    var newContact = AddSingleContact(contact); // method excerpt below
    if (newContact == null)
    {
        return;
    }
    if (++numberOfContactsAdded % maxContactToAddInOneBatch == 0)
    {
        LogAction(Action.ContactCreated, "Batch #" + numberOfContactsAdded / maxContactToAddInOneBatch);
        _context.SaveChanges();
        _context.Dispose(); // renew the context after each batch
        // _context = new ...
    }
}
// ...

private Contact AddSingleContact(ContactDTO contact)
{
    Validate(contact); // Simple input validations
    // ...
    // ...
    var newContact = Contact.New(contact); // Creates a Contact entity
    // Add cell numbers
    foreach (var cellNumber in contact.CellNumbers)
    {
        var existingContactCell = _context.ContactCells.FirstOrDefault(c => c.CellNo == cellNumber);
        if (existingContactCell != null)
        {
            // Set some error message and return
            return null;
        }
        newContact.ContactCells.Add(new ContactCell
        {
            CellNo = cellNumber,
        });
    }
    _context.Contacts.Add(newContact);
    _context.ContactsInGroups.Add(new ContactsInGroup
    {
        Contact = newContact,
        // GroupId = some group id
    });
    return newContact;
}
But it seems that the more contacts are added (batch-wise), the more time it takes, non-linearly.
Here is the log for batch size 100 (10k contacts). Notice the increasing time needed as the batch number increases:
12:16:48 Batch #1
12:16:49 Batch #2
12:16:49 Batch #3
12:16:50 Batch #4
12:16:50 Batch #5
12:16:50 Batch #6
12:16:51 Batch #7
12:16:52 Batch #8
12:16:53 Batch #9
12:16:54 Batch #10
...
...
12:21:26 Batch #89
12:21:32 Batch #90
12:21:38 Batch #91
12:21:44 Batch #92
12:21:50 Batch #93
12:21:57 Batch #94
12:22:03 Batch #95
12:22:10 Batch #96
12:22:16 Batch #97
12:22:23 Batch #98
12:22:29 Batch #99
12:22:36 Batch #100
It took 6 min 48 sec. If I increase the batch size to 10,000 (a single batch), it takes about 26 sec (for 10k contacts). But when I try to insert 100k contacts (10k per batch), it takes a long time (because of the increasing time per batch, I guess).
Can you explain why it takes an increasing amount of time despite the context being renewed?
Is there any other idea except raw SQL?
Most answers on the question you linked to use context.Configuration.AutoDetectChangesEnabled = false;. I don't see that in your example, so you should try it. You might want to consider EF6 too; it has an AddRange method on the context for this purpose, see INSERTing many rows with Entity Framework 6 beta 1.
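A hedged sketch of both suggestions combined (this is the EF6 DbContext-style API; the context and collection names are illustrative, not the poster's):

using (var context = new ContactsContext())
{
    // stop EF from re-scanning every tracked entity on each Add
    context.Configuration.AutoDetectChangesEnabled = false;
    context.Configuration.ValidateOnSaveEnabled = false;

    context.Contacts.AddRange(newContacts); // EF6: one call instead of an Add() loop
    context.SaveChanges();
}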
Finally got it. It looks like the Validate() method was the culprit. It ran an existence-check query to see whether the contact already existed. So, as contacts were added, the database grew and the check took longer as the batch number increased, mainly because the cell number field it was comparing was not indexed.
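A hedged follow-up in SQL (ContactCells and CellNo are taken from the code above; the index name is made up): indexing the compared column keeps that existence check fast as the table grows.

CREATE INDEX IX_ContactCells_CellNo ON ContactCells (CellNo);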

SQL Bulk Insert vs Update - Deadlock issues

I have two processes that sometimes run concurrently.
The first one is a bulk insert:
using (var connection = new SqlConnection(connectionString))
{
    var bulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock)
        { DestinationTableName = "CTRL.CTRL_DATA_ERROR_DETAIL", };
    connection.Open();
    try
    {
        bulkCopy.WriteToServer(errorDetailsDt);
    }
    catch (Exception e)
    {
        throw new Exception("Error Bulk writing Error Details for Data Stream ID: " + dataStreamId +
                            " Details of Error : " + e.Message);
    }
    connection.Close();
}
The second one is a bulk update from a stored procedure:
-- Part of the code from the stored procedure --
UPDATE [CTL].[CTRL].[CTRL_DATA_ERROR_DETAIL]
SET [MODIFIED_CONTAINER_SEQUENCE_NUMBER] = @containerSequenceNumber
   ,[MODIFIED_DATE] = GETDATE()
   ,[CURRENT_FLAG] = 'N'
WHERE [DATA_ERROR_KEY] = @DataErrorKey
  AND [CURRENT_FLAG] = 'Y'
The first process runs for a while (depending on the incoming record load), and the second process always ends up as the deadlock victim.
Should I set SqlBulkCopyOptions.TableLock so that the second process waits until the resources are released?
By default SqlBulkCopy doesn't take exclusive locks, so while it's doing its thing and inserting data, your update process kicks off and causes a deadlock. To get around this you can instruct SqlBulkCopy to take an exclusive table lock, as you already suggested, or you can set the batch size of the bulk insert to a reasonable number.
If you can get away with it I think the table lock idea is the best option.
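A hedged sketch of the batch-size alternative (the connection string, table name, and DataTable are reused from the question; the numbers are illustrative). Each batch commits as its own transaction, giving the update a window between lock acquisitions:

using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "CTRL.CTRL_DATA_ERROR_DETAIL";
    bulkCopy.BatchSize = 5000;    // rows copied per internal transaction
    bulkCopy.BulkCopyTimeout = 0; // no timeout; tune for your workload
    bulkCopy.WriteToServer(errorDetailsDt);
}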