Big database - doctrine query slow even with index - mysql

I'm building an app with Symfony 4 + Doctrine, where people can upload big CSV files and those records then get stored in a database. Before inserting, I'm checking that the entry doesn't already exist...
On a sample CSV file with only 1000 records, it takes 16 seconds without an index and 8 seconds with an index (MacBook, 3 GHz, 16 GB memory). My intuition tells me this is quite slow and should take under 1 second, especially with the index.
The index is set on the email column.
My code:
$ssList = $this->em->getRepository(EmailList::class)->findOneBy(["id" => 1]);

foreach ($csv as $record) {
    $subscriber_exists = $this->em->getRepository(Subscriber::class)
        ->findOneByEmail($record['email']);

    if ($subscriber_exists === NULL) {
        $subscriber = (new Subscriber())
            ->setEmail($record['email'])
            ->setFirstname($record['first_name'])
            ->addEmailList($ssList)
        ;

        $this->em->persist($subscriber);
        $this->em->flush();
    }
}
My Question:
How can I speed up this process?

Use LOAD DATA INFILE.
LOAD DATA INFILE has IGNORE and REPLACE options for handling duplicates if you put a UNIQUE KEY or PRIMARY KEY on your email column.
Look at settings for making the import faster.
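For illustration, a rough sketch of what that could look like from the Symfony side, run through the Doctrine connection. The table name, column names, and file path here are assumptions, and LOCAL requires local_infile to be enabled on both the server and the driver:

// Hedged sketch only: `subscriber`, its columns, and the path are assumptions.
// A UNIQUE KEY on `email` lets IGNORE silently skip duplicate rows.
$conn = $this->em->getConnection();
$conn->executeStatement("
    LOAD DATA LOCAL INFILE '/path/to/upload.csv'
    IGNORE INTO TABLE subscriber
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    IGNORE 1 LINES
    (email, firstname)
"); // executeUpdate() or exec() on older DBAL versions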

Like Cid said, move the flush() outside of the loop, or put a batch counter inside the loop and only flush at certain intervals:
$batchSize = 1000;
$i = 1;

foreach ($csv as $record) {
    $subscriber_exists = $this->em->getRepository(Subscriber::class)
        ->findOneByEmail($record['email']);

    if ($subscriber_exists === NULL) {
        $subscriber = (new Subscriber())
            ->setEmail($record['email'])
            ->setFirstname($record['first_name'])
            ->addEmailList($ssList)
        ;

        $this->em->persist($subscriber);

        if (($i % $batchSize) === 0) {
            $this->em->flush();
        }

        $i++;
    }
}

$this->em->flush();
Or, if that's still slow, you could grab the Connection via $this->em->getConnection() and use DBAL directly, as described here: https://www.doctrine-project.org/projects/doctrine-dbal/en/2.8/reference/data-retrieval-and-manipulation.html#insert
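A minimal sketch of that DBAL approach, assuming the table is called subscriber with email and firstname columns (names guessed from the entity above, not verified):

$conn = $this->em->getConnection();

foreach ($csv as $record) {
    // fetchOne() is fetchColumn() on DBAL 2.x; both return false when no row matches
    $exists = $conn->fetchOne(
        'SELECT 1 FROM subscriber WHERE email = ?',
        [$record['email']]
    );

    if ($exists === false) {
        $conn->insert('subscriber', [
            'email'     => $record['email'],
            'firstname' => $record['first_name'],
        ]);
        // the subscriber/email-list join row would still need its own insert
    }
}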

Related

Yii2 mysql deal with bigger data

Is there any way we can deal with a large amount of data?
Here is my code, which failed with "Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes)":
$amount_needed = 0;

$openWo = Production::find()->where(['fulfilled' => 0])->all(); // returns 10k records

foreach ($openWo as $value) {
    $amount_needed += $value->amount_needed;

    $item = Item::findOne($value->item_id); // fires a query 10k times
    $materials = Materials::find()->where(['item_id' => $value->item_id])->all(); // fires a query 10k times

    foreach ($materials as $val) {
        // much more here....
    }

    // much more here....
}
So the first foreach runs 10k times, and inside it there are more queries and more foreach loops...
So is there any way I can work with large data sets in MySQL + Yii2, or do I need to switch from MySQL to another database?
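For illustration, a hedged sketch of how that loop could avoid both the memory exhaustion and the per-row queries. It assumes item and materials relations are defined on the Production model (the relation names are assumptions) and a Yii 2 release where batched queries populate with() relations:

$amount_needed = 0;

// each() streams rows in batches of 500 instead of loading all 10k at once;
// with() eager-loads the assumed `item` and `materials` relations in a few
// queries instead of two extra queries per row.
$query = Production::find()
    ->where(['fulfilled' => 0])
    ->with('item', 'materials');

foreach ($query->each(500) as $value) {
    $amount_needed += $value->amount_needed;

    $item = $value->item;               // already eager-loaded, no extra query
    foreach ($value->materials as $val) {
        // much more here....
    }

    // much more here....
}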

Grails 3: Bulk insert performance issue

Grails Version: 3.3.2
I have 100k records I am loading from a CSV file and trying to do a bulk save. The issue I am having is that the bulk save is performing worse than a non-bulk save.
All the online searches I did basically turn up the same methods as the site I referenced:
http://krixisolutions.com/bulk-insert-grails-gorm/
I tried all 3 solutions on that page; here is an example of one of them:
def saveFsRawData(List<FactFsRawData> rawData) {
    int startTime = DateTime.newInstance().secondOfDay;
    println("Start Save");

    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();

    rawData.eachWithIndex { FactFsRawData entry, int i ->
        session.save(entry);

        if (i % 1000 == 0) {
            session.flush();
            session.clear();
        }
    }

    tx.commit();
    session.close();

    println("End Save - " + (DateTime.newInstance().secondOfDay - startTime));
}
I have tried various bulk sizes from 100 to 5k (using 1k in the example). All of them average around 80 seconds.
If I remove the batch processing completely then I get an average of 65 seconds.
I am unsure of what the issue is or where I am going wrong. Any ideas?

Fast analytics on a database table with thousands of rows for display in PHP

I have a table with thousands of rows and I want to create analytics with a chart displayed on my PHP front end.
My table structure is:
And this is how I display the data:
From the user_agent column I display operating systems, browsers, and devices.
For now I am still using the old approach of looping over every row with a for () loop and parsing each one, and it takes a long time to respond and display the data.
Does anyone know how I can display this data without such a long response time on my website? Any ideas, either for the database structure or for my PHP script?
Thank you in advance.
Assuming you're loading all your data in a PHP script and postprocessing it in a for-loop in PHP, you should alter your database query. A GROUP BY statement might help. Of course, you need to alter your script to work with the new data. Revisiting your database structure is a good idea, too. A better approach might be not to save the whole user-agent string in one column but to use several columns.
Example before:
$data = $db->query('SELECT * FROM table');

for ($i = 0; $i <= $data->max(); $i++) {
    $row = $data->getRow($i);
    postprocessRow($row); /* $sum += 1; */
}
Example after:
$data = $db->query('SELECT *, COUNT(*) AS weight FROM table GROUP BY user_agent');

for ($i = 0; $i <= $data->max(); $i++) {
    $row = $data->getRow($i);
    postprocessRowWeighted($row); /* $sum += $row['weight']; */
}
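If the user-agent string were parsed into separate columns at insert time (hypothetical browser, os, and device columns), the aggregation could be pushed entirely into SQL, for example:

// Hypothetical column names; the table would need to store these parsed values.
$data = $db->query(
    'SELECT browser, os, device, COUNT(*) AS weight
     FROM table
     GROUP BY browser, os, device'
);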

How to optimize the search for the next free "slot" in MySQL?

I have a problem and I can't find an easy solution.
I have a self-expanding structure organized this way:
database1 | table1
          | table2
          ...
          | tableN
.
.
.
databaseN | table1
          | table2
          | tableN
Each table has a structure like this:
id|value
Each time a number is generated, it is put into the right database/table (it is split up this way for scalability; it would be impossible to manage a table with billions of records in a fast way).
The problem is that N is not fixed, but is like a base for calculating numbers (to be precise, N is known: 62, but I can only use a subset of "digits", and that subset can change over time).
For example, I can work only with 0, 1 and 2, and after a while (when I've used up all the possibilities) I want to add 4, and so on (up to base 62).
I would like to find a simple way to find the first free slot to put the next randomly generated id into, in a way that can be reversed.
Example:
I have 0 1 2 3 as the digits I want to use.
The element 2313 is put in database 2, table 3, and the row 13|value goes into that table.
The element 1301 is put in database 1, table 3, and the row 01|value goes into that table.
I would like to generate another number based on the next free slot.
I could test every slot starting from 0 up to the biggest number, but when there are millions of records in every database and table this will be impossible.
The next element after the first example would be 2323 (and not 2314, since I'm using only the 0 1 2 3 digits).
I would like some sort of inverse lookup in MySQL to give me the free 23 slot in table 3 of database 2, so I can transform it back into the number. I could randomly generate a number and try to find the nearest free slot up or down, but since the digit set is variable that is probably not a good choice.
I hope this is clear enough for you to give me a suggestion ;-)
Use
show databases like 'database%' and a loop to find non-existent databases
show tables like 'table%' and a loop for tables
select count(*) from tableN to see if a table is "full" or not.
To find a free slot, walk the database with count in chunks.
This untested PHP/MySQL implementation will first fill up all existing databases and tables to base N+1 before creating new tables or databases.
The if(!$base) part should be altered if another behaviour is wanted.
findFreeChunk() can also be written iteratively, but I leave that effort to you.
define('DB_PREFIX', 'database');
define('TABLE_PREFIX', 'table');
define('ID_LENGTH', 2);

function findFreeChunk($base, $db, $table, $prefix = '')
{
    // number of ids a chunk with this prefix can hold
    $maxRecordCount = $base ** (ID_LENGTH - strlen($prefix));

    for ($i = -1; ++$i < $base;) {
        list($n) = mysql_fetch_row(mysql_query(
            "select count(*) from `$db`.`$table` where `id` like '"
            . ($tmp = $prefix . base_convert($i, 10, 62))
            . "%'"));

        if ($n < $maxRecordCount) {
            // incomplete chunk found: recursion
            for ($k = -1; ++$k < $base;) {
                if ($ret = findFreeChunk($base, $db, $table, $tmp)) {
                    return $ret;
                }
            }
        }
    }
}

function findFreeSlot($base = NULL)
{
    // find current base if not given
    if (!$base) {
        for ($base = 1; !($ret = findFreeSlot(++$base)););
        return $ret;
    }

    $maxRecordCount = $base ** ID_LENGTH;

    // walk existing DBs
    $res = mysql_query("show databases like '" . DB_PREFIX . "%'");
    $dbs = array();

    while (list($db) = mysql_fetch_row($res)) {
        // walk existing tables
        $res2   = mysql_query("show tables in `$db` like '" . TABLE_PREFIX . "%'");
        $tables = array();

        while (list($table) = mysql_fetch_row($res2)) {
            list($n) = mysql_fetch_row(mysql_query("select count(*) from `$db`.`$table`"));

            if ($n < $maxRecordCount) {
                return findFreeChunk($base, $db, $table);
            }

            $tables[] = $table;
        }

        // no table with empty slot found: all available table names used?
        if (count($tables) < $base) {
            for ($i = -1; in_array($tmp = TABLE_PREFIX . base_convert(++$i, 10, 62), $tables););
            if ($i < $base) return [$db, $tmp, 0];
        }

        $dbs[] = $db;
    }

    // no database with empty slot found: all available database names used?
    if (count($dbs) < $base) {
        for ($i = -1; in_array($tmp = DB_PREFIX . base_convert(++$i, 10, 62), $dbs););
        if ($i < $base) return [$tmp, TABLE_PREFIX . 0, 0];
    }

    // none: return false
    return false;
}
If you are not reusing your slots and are not deleting anything, you can of course drop all this and simply remember the last ID to calculate the next one.
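A minimal sketch of that idea, treating the whole id (database digit, table digit, slot digits) as one counter over the currently allowed digit set; the function and parameter names here are made up for illustration:

// Hypothetical helper: increment an id within a restricted digit set.
// $digits is the currently allowed set, e.g. ['0', '1', '2', '3'].
function nextId(string $lastId, array $digits): string
{
    $base = count($digits);
    $pos  = array_flip($digits);           // digit character => numeric value

    // convert the last id to an integer in the restricted base
    $n = 0;
    foreach (str_split($lastId) as $d) {
        $n = $n * $base + $pos[$d];
    }
    $n++;                                   // next slot

    // convert back, keeping the original length
    $out = '';
    for ($len = strlen($lastId); $len > 0; $len--) {
        $out = $digits[$n % $base] . $out;
        $n   = intdiv($n, $base);
    }
    return $out;
}

// e.g. nextId('1301', ['0', '1', '2', '3']) === '1302'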

Bulk insert performance issue in EF ObjectContext

I am trying to insert a large number of rows (>10,000,000) into a MySQL database using EF ObjectContext (db-first). After reading the answer to this question, I wrote this code (batch save) to insert about 10,000 contacts (30k rows actually, including related rows):
// var myContactGroupId = ...;
const int maxContactToAddInOneBatch = 100;
var numberOfContactsAdded = 0;

// IEnumerable<ContactDTO> contacts = ...
foreach (var contact in contacts)
{
    var newContact = AddSingleContact(contact); // method excerpt below

    if (newContact == null)
    {
        return;
    }

    if (++numberOfContactsAdded % maxContactToAddInOneBatch == 0)
    {
        LogAction(Action.ContactCreated, "Batch #" + numberOfContactsAdded / maxContactToAddInOneBatch);
        _context.SaveChanges();
        _context.Dispose();
        // _context = new ...
    }
}
// ...
private Contact AddSingleContact(ContactDTO contact)
{
    Validate(contact); // Simple input validations
    // ...
    // ...
    var newContact = Contact.New(contact); // Creates a Contact entity

    // Add cell numbers
    foreach (var cellNumber in contact.CellNumbers)
    {
        var existingContactCell = _context.ContactCells.FirstOrDefault(c => c.CellNo == cellNumber);

        if (existingContactCell != null)
        {
            // Set some error message and return
            return null;
        }

        newContact.ContactCells.Add(new ContactCell
        {
            CellNo = cellNumber,
        });
    }

    _context.Contacts.Add(newContact);
    _context.ContactsInGroups.Add(new ContactsInGroup
    {
        Contact = newContact,
        // GroupId = some group id
    });

    return newContact;
}
But it seems that the more contacts are added (batch-wise), the more time each batch takes (non-linearly).
Here is the log for batch size 100 (10k contacts). Notice the increasing time needed as the batch# increases:
12:16:48 Batch #1
12:16:49 Batch #2
12:16:49 Batch #3
12:16:50 Batch #4
12:16:50 Batch #5
12:16:50 Batch #6
12:16:51 Batch #7
12:16:52 Batch #8
12:16:53 Batch #9
12:16:54 Batch #10
...
...
12:21:26 Batch #89
12:21:32 Batch #90
12:21:38 Batch #91
12:21:44 Batch #92
12:21:50 Batch #93
12:21:57 Batch #94
12:22:03 Batch #95
12:22:10 Batch #96
12:22:16 Batch #97
12:22:23 Batch #98
12:22:29 Batch #99
12:22:36 Batch #100
It took 6 min 48 sec. If I increase the batch size to 10,000 (so a single batch is enough), it takes about 26 sec (for 10k contacts). But when I try to insert 100k contacts (10k per batch), it takes a long time (because of the increasing time per batch, I guess).
Can you explain why it is taking an increasing amount of time despite the context being renewed?
Are there any other ideas, apart from raw SQL?
Most answers on the question you linked to use context.Configuration.AutoDetectChangesEnabled = false; I don't see that in your example, so you should try it. You might want to consider EF6 too; it has an AddRange method on the context for this purpose. See "INSERTing many rows with Entity Framework 6 beta 1".
Finally got it. Looks like the method Validate() was the culprit. It had an existence check query to check if the contact already existed. So, as the contacts are being added, the db grows and it takes more time to check as the batch# increases; mainly because the cell number field (which it was comparing) was not indexed.