MySQL search on a high number of rows - mysql

I am trying to import a relatively large amount of data into a MySQL database (around 6 million entries coming from text files).
For each entry, I have to check whether a similar record already exists in the database by comparing two text fields:
`ref` varchar(30) COLLATE utf8_unicode_ci NOT NULL
`labelCanonical` varchar(15) COLLATE utf8_unicode_ci DEFAULT NULL
Files are processed in batches of N entries (10 in this example), and I run a single query to check for all duplicates in the batch, like so:
SELECT p.`ref`, p.`labelCanonical`
FROM `rtd_piece` p
WHERE (p.`ref` = "6569GX" AND p.`labelCanonical` = "fsc-principal")
OR (p.`ref` = "6569GY" AND p.`labelCanonical` = "fsc-principal")
OR (p.`ref` = "6569GZ" AND p.`labelCanonical` = "fsc-principal")
OR (p.`ref` = "6569H0" AND p.`labelCanonical` = "fsc-habitacle")
OR (p.`ref` = "6569H1" AND p.`labelCanonical` = "support-fsc")
OR (p.`ref` = "6569H2" AND p.`labelCanonical` = "fsc-injection")
OR (p.`ref` = "6569H4" AND p.`labelCanonical` = "fsc-injection")
OR (p.`ref` = "6569H8" AND p.`labelCanonical` = "faisceau-mot")
OR (p.`ref` = "6569H9" AND p.`labelCanonical` = "faisceau-mot")
OR (p.`ref` = "6569HA" AND p.`labelCanonical` = "fsc-principal")
I use Doctrine 2 (without Symfony), and I do this query using "NativeQuery".
The problem is that, even with only 600k entries in the database, this query takes 730 ms (or 6.7 seconds for a batch of 100 records) to execute, and it gets dramatically slower as records are added to the database.
I have no index on the "ref" or "labelCanonical" fields for now, and I'm not sure whether adding one would help with this kind of query.
What am I doing wrong with this method that makes it so slow?
Edit to add more information about the process.
I send one Ajax request per batch, which also lets me give feedback to the user.
On the server side (PHP), I follow this procedure:
1) I seek to the current position in the file being processed and extract the next N records
2) I parse each line and add the references and slugified labels to two different arrays
3) I try to get these records from the database to avoid duplicates:
$existing = array();
$results = getRepository('Piece')->findExistingPieces($refs, $labels);
for ($i = 0, $c = count($results); $i < $c; ++$i) {
$existing[] = $results[$i]['ref'].'|'.$results[$i]['labelCanonical'];
}
public function findExistingPieces(array $refs, array $labels)
{
$sql = '';
$where = array();
$params = array();
for ($i = 0, $c = count($refs); $i < $c; ++$i) {
$params[] = $refs[$i];
$params[] = $labels[$i];
$where[] = '(p.`ref` = ? AND p.`labelCanonical` = ?)';
}
$sql = 'SELECT p.`ref`, p.`labelCanonical` '.
'FROM `rtd_piece` p '.
'WHERE '.implode(' OR ', $where);
$rsm = new ResultSetMapping;
$rsm->addScalarResult('ref', 'ref');
$rsm->addScalarResult('labelCanonical', 'labelCanonical');
$query = $this->getEntityManager()
->createNativeQuery($sql, $rsm)
->setParameters($params);
return $query->getScalarResult();
}
4) I iterate through previously parsed data and check for duplicates :
for ($i = 0; $i < $nbParsed; ++$i) {
$data = $parsed[$i];
if (in_array($data['ref'].'|'.$data['labelCanonical'], $existing)) {
// ...
continue ;
}
// Add record
$piece = new PieceEntity;
$piece->setRef($data['ref']);
//...
$em->persist($piece);
}
5) I flush at the end of the batch
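Step 4 calls in_array() for every parsed record, which scans the whole $existing list each time. Below is a minimal sketch, assuming the same "ref|labelCanonical" key format as above, that stores the keys as array keys so each check is a single isset() lookup. The function names buildExistingSet and filterNewRecords are hypothetical, the sample arrays stand in for the query result and the parsed lines, and skipping repeats inside the same batch is an added behaviour that the original code does not have.

```php
<?php
// Hypothetical helper: turn the rows returned by the duplicate check into a set
// keyed by "ref|labelCanonical", so membership tests do not scan a list.
function buildExistingSet(array $results): array
{
    $existing = [];
    foreach ($results as $row) {
        $existing[$row['ref'] . '|' . $row['labelCanonical']] = true;
    }
    return $existing;
}

// Hypothetical helper: keep only the parsed records that are not in the set.
// Kept records are added to the set, so a repeat inside the same batch is skipped too.
function filterNewRecords(array $parsed, array $existing): array
{
    $new = [];
    foreach ($parsed as $data) {
        $key = $data['ref'] . '|' . $data['labelCanonical'];
        if (isset($existing[$key])) {
            continue; // already in the database, or already seen in this batch
        }
        $existing[$key] = true;
        $new[] = $data;
    }
    return $new;
}

// Sample data standing in for the query result and the parsed lines.
$results = [['ref' => '6569GX', 'labelCanonical' => 'fsc-principal']];
$parsed = [
    ['ref' => '6569GX', 'labelCanonical' => 'fsc-principal'],
    ['ref' => '6569H0', 'labelCanonical' => 'fsc-habitacle'],
    ['ref' => '6569H0', 'labelCanonical' => 'fsc-habitacle'],
];
$toInsert = filterNewRecords($parsed, buildExistingSet($results));
echo count($toInsert), " new record(s)\n";
```

Only the records left in $toInsert would then be turned into entities and persisted before the flush.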
I've added some "profiling" code to track the time spent on each step; here is the result:
0.00024509429931641 (0.245 ms) : Initialized
0.00028896331787109 (0.289 ms) : Start doProcess
0.00033092498779297 (0.331 ms) : Read and parse lines
0.0054769515991211 (5.477 ms) : Check existence in database
6.9432899951935 (6,943.290 ms) : Process parsed data
6.9459540843964 (6,945.954 ms) : Finilize
6.9461529254913 (6,946.153 ms) : End of process
6.9464020729065 (6,946.402 ms) : End doProcess
6.9464418888092 (6,946.442 ms) : Return result
The first number shows the seconds elapsed since the beginning of the request, then the same time in milliseconds, and then what is being done.
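For reference, a trace in this layout can be produced with microtime(true), which returns seconds as a float. This is only a sketch: the helper name profileLine is hypothetical and is not the profiling code actually used above.

```php
<?php
// Hypothetical helper: format one profiling line in the layout shown above,
// i.e. "<seconds> (<milliseconds> ms) : <label>".
function profileLine(float $start, float $now, string $label): string
{
    $elapsed = $now - $start; // seconds, as returned by microtime(true)
    return sprintf('%s (%s ms) : %s', $elapsed, number_format($elapsed * 1000, 3), $label);
}

$start = microtime(true);
echo profileLine($start, microtime(true), 'Initialized'), "\n";
usleep(2000); // stand-in for real work
echo profileLine($start, microtime(true), 'Read and parse lines'), "\n";
```

Each line shows the time at which a step starts, so the duration of a step is the difference between its timestamp and the next one.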

So after some refactoring, here is what I came up with:
I check for duplicates using a new field named "hash", like so:
$existing = array();
$results = getRepository('Piece')->findExistingPiecesByHashes($hashes);
for ($i = 0, $c = count($results); $i < $c; ++$i) {
$existing[] = $results[$i]['hash'];
}
public function findExistingPiecesByHashes(array $hashes)
{
$sql = 'SELECT p.`ref`, p.`labelCanonical`, p.`hash` '.
'FROM `rtd_piece` p '.
'WHERE (p.`hash`) IN (?)';
$rsm = new ResultSetMapping;
$rsm->addScalarResult('ref', 'ref');
$rsm->addScalarResult('hash', 'hash');
$rsm->addScalarResult('labelCanonical', 'labelCanonical');
$query = $this->getEntityManager()
->createNativeQuery($sql, $rsm)
->setParameters(array($hashes));
return $query->getScalarResult();
}
The hash is automatically updated in the model like so :
// Entities/Piece.class.php
private function _updateHash()
{
$this->hash = md5($this->ref.'|'.$this->labelCanonical);
}
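The lookup needs the same hash computed for every parsed record before the query runs. Here is a minimal sketch, assuming the parsed records are arrays with 'ref' and 'labelCanonical' keys as in the earlier steps; pieceHash and indexByHash are hypothetical names, and pieceHash simply mirrors _updateHash().

```php
<?php
// Hypothetical helper mirroring _updateHash(): same separator and same md5() call,
// so values computed during the import match the ones stored by the entity.
function pieceHash(string $ref, string $labelCanonical): string
{
    return md5($ref . '|' . $labelCanonical);
}

// Hypothetical helper: index the parsed records by hash.
// A record repeated inside the batch collapses onto a single key.
function indexByHash(array $parsed): array
{
    $byHash = [];
    foreach ($parsed as $data) {
        $byHash[pieceHash($data['ref'], $data['labelCanonical'])] = $data;
    }
    return $byHash;
}

// Sample data standing in for the parsed lines of one batch.
$parsed = [
    ['ref' => '6569GX', 'labelCanonical' => 'fsc-principal'],
    ['ref' => '6569H0', 'labelCanonical' => 'fsc-habitacle'],
    ['ref' => '6569GX', 'labelCanonical' => 'fsc-principal'],
];
$byHash = indexByHash($parsed);
$hashes = array_keys($byHash); // the list to pass to findExistingPiecesByHashes()
echo count($hashes), " distinct hash(es)\n";
```

Keeping the records keyed by hash means the hashes returned by the query can be removed from $byHash directly, leaving only the records to insert.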
My hash field has no FULLTEXT index because I use the InnoDB engine with MySQL 5.5, and from what I've read InnoDB has only supported FULLTEXT indexes since MySQL 5.6.
I don't feel like updating MySQL right now: too many databases and websites run on it, and it would be disastrous if the update went wrong.
BUT, even without indexing the field, the performance gain is incredible:
0.00024199485778809 (0.242) : Initialized
0.00028181076049805 (0.282) : Start doProcess
0.0003199577331543 (0.320) : Read and parse lines
0.088779926300049 (88.780) : Check existence in database
0.8656108379364 (865.611) : Process parsed data
0.94273900985718 (942.739) : Finilize
1.3771109580994 (1,377.111) : End of process
1.3795168399811 (1,379.517) : End doProcess
1.3795938491821 (1,379.594) : Return result
And this is for a batch of 1000 with 650k records in the table.
Before this optimization, it took 6.7 s to check 100 records, so the check is now around 9 times faster, on a batch 10 times larger!
At this speed I should be able to import all the data in 1h30 to 2h.
Thank you very much for your help.

First, let me suggest that you write this using row constructors:
SELECT p.`ref`, p.`labelCanonical`
FROM `rtd_piece` p
WHERE (p.`ref`, p.`labelCanonical`) IN ( ('6569GX', 'fsc-principal'),
('6569GY', 'fsc-principal'),
. . .
);
This will not affect performance, but it is easier to read. Then, you need an index, either rtd_piece(ref, labelCanonical) or rtd_piece(labelCanonical, ref).
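The row-constructor form can be generated with placeholders in much the same way as the OR chain in findExistingPieces(). This is a sketch only: buildRowConstructorQuery is a hypothetical name, and the caller is assumed to skip the query when the batch is empty, since an empty IN () list is not valid SQL.

```php
<?php
// Hypothetical helper: build the row-constructor query with one "(?, ?)" per pair
// and a flat parameter list in the same order as the placeholders.
function buildRowConstructorQuery(array $refs, array $labels): array
{
    $tuples = [];
    $params = [];
    foreach ($refs as $i => $ref) {
        $tuples[] = '(?, ?)';
        $params[] = $ref;
        $params[] = $labels[$i];
    }
    $sql = 'SELECT p.`ref`, p.`labelCanonical` '
         . 'FROM `rtd_piece` p '
         . 'WHERE (p.`ref`, p.`labelCanonical`) IN (' . implode(', ', $tuples) . ')';
    return [$sql, $params];
}

[$sql, $params] = buildRowConstructorQuery(
    ['6569GX', '6569GY'],
    ['fsc-principal', 'fsc-principal']
);
echo $sql, "\n";
```

The returned $sql and $params can be passed to createNativeQuery() and setParameters() exactly as in the original method.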

Related

Unique Profile Slug with PHP and PDO

I am using a class to turn a profile name into a slug, and then an SQL query to tell me which unique value to use in the INSERT command. The problem is that the query isn't working properly: sometimes it returns a value that already exists...
This is the class I am using to generate the slug: (composer require channaveer/slug)
And this is the example code:
use Channaveer\Slug\Slug;
$string = "john doe";
$slug = Slug::create($string);
$profile_count_stmt = $pdo->prepare("
SELECT
COUNT(`id`) slug_count
FROM
`advogados_e_escritorios`
WHERE
`slug_perfil` LIKE :slug
");
$profile_count_stmt->execute([
":slug" => "%".$slug."%"
]);
$profile_count = $profile_count_stmt->fetchObject();
if ($profile_count && $profile_count->slug_count > 0) {
$profile_increment = $profile_count->slug_count + 1;
$slug = $slug . '-' . $profile_increment;
}
echo 'Your unique slug: '. $slug;
// Your unique slug: john-doe-5
This is the content of the table when the script runs:
Do you know how I can improve the SELECT command to prevent it from returning slugs that already exist in the DB?
OK, I finally found a solution... Here's the code for anyone who wants to generate unique profile slugs using PHP, PDO and MySQL:
$string = "John Doe";
$string = mb_strtolower(preg_replace('/\s+/', '-', $string));
$slug = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
$pdo = Conectar();
$sql = "
SELECT slug_perfil
FROM advogados_e_escritorios
WHERE slug_perfil
LIKE :slug
";
$statement = $pdo->prepare($sql);
// Bind the prefix as a parameter instead of interpolating it into the SQL
if($statement->execute([':slug' => $slug . '%']))
{
$total_row = $statement->rowCount();
if($total_row > 0)
{
$result = $statement->fetchAll();
foreach($result as $row)
{
$data[] = $row['slug_perfil'];
}
if(in_array($slug, $data))
{
$count = 0;
while( in_array( ($slug . '-' . ++$count ), $data) );
$slug = $slug . '-' . $count;
}
}
}
echo $slug;
//john-doe-1
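The first version fails for two reasons: LIKE '%slug%' also counts unrelated slugs that merely contain the text (for example john-doe-smith), and a row count stops matching the highest suffix as soon as a row is deleted. The suffix search from the solution above can be isolated into a pure function. This is a sketch; nextFreeSlug is a hypothetical name, and $existing stands in for the slug_perfil values returned by the LIKE query.

```php
<?php
// Hypothetical helper: given the base slug and the stored slugs that start with it,
// return the base itself if it is free, otherwise the first free "base-N".
function nextFreeSlug(string $base, array $existing): string
{
    $taken = array_flip($existing); // slug => index, for isset() lookups
    if (!isset($taken[$base])) {
        return $base;
    }
    $n = 1;
    while (isset($taken[$base . '-' . $n])) {
        $n++;
    }
    return $base . '-' . $n;
}

echo nextFreeSlug('john-doe', ['john-doe', 'john-doe-1', 'john-doe-smith']), "\n";
```

Two requests running at the same moment can still pick the same slug, so a unique index on slug_perfil, with a retry when the INSERT fails, is probably still needed as the final safeguard.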
You should check whether the slug already exists in your database. If it does, you can append a suffix to it, such as the current timestamp, like the following:
$slug = Slug::create($string);
$slugExists = "DB query to check if the slug exists in your database then you may return the count of rows";
//If the count of rows is more than 0, then add some random string
if($slugExists) {
/** NOTE: you can append the primary key (id) after the slug instead, but that has to be done after you create the user record. This avoids the concurrency problem that @YourCommonSense pointed out. */
$slug = $slug.time(); //time() function will return time in number of seconds
}
//DB query to insert into database
I have followed the same for my blog articles (StackCoder) too. Even LinkedIn follows the same fashion.
Following is screenshot from LinkedIn URL

Is there any way to set a special character in MySQL database record?

I use CodeIgniter as my back-end framework. I have just created a bulk insertion module that reads CSV data and inserts it into the database.
But the insertion does not store the proper value, for example:
Wrong value: Brhl Castle
Right value: Brühl Castle
Is there any way to store the right value in the database?
My controller code looks like this:
$dataIn['title'] = utf8_encode($data[1]);
$dataIn['latitude'] = $data[2];
$dataIn['longitude'] = $data[3];
$dataIn['country'] = utf8_encode($data[4]);
$dataIn['city'] = utf8_encode($data[5]);
$this->model->bulkInsert($dataIn);
My model code looks like this:
function bulkInsert($dataIn, $id = '') {
$this->db->where(array('id' => $dataIn['id']));
$query = $this->db->get($this->TABLE);
$result = $query->row_array();
if ($query->num_rows() > 0) {
$this->db->where($this->PK, $dataIn['id']);
$this->db->update($this->TABLE, $dataIn);
} else {
$this->db->insert($this->TABLE, $dataIn);
}
setMessage('Record inserted successfully.', 'green');
}

Statement works in SQL, but not in PDO

I'm converting from SQL to PDO, and everything has gone well until this statement.
My SQL does what it should and does NOT output the message "This user has no private images". But for some reason, when changing to PDO, the same message is shown when it should not be.
Any ideas?
Original SQL:
$result = mysql_query("SELECT * FROM tbl_private_photos WHERE profile = $usernum AND photo_deleted != 'Yes' LIMIT 1");
if (mysql_num_rows($result)!==1) { die("This user has no private images");}
My PDO:
$sql = "SELECT * FROM tbl_private_photos WHERE profile = :usernum AND photo_deleted != 'Yes' LIMIT 1";
$q = $conn->prepare($sql); // the default way of PDO to manage errors is quite the same as `or die()` so no need for that
$q->bindValue(':usernum',$usernum,PDO::PARAM_INT);
$q->execute();
if($r = $q->fetch(PDO::FETCH_ASSOC)!==1)
{
die("This user has no private images");
}
PDO::fetch() returns, in this case, an array or false. You don't want to compare the fetch result to the integer 1 and assign the outcome to a variable: the comparison is evaluated first, so $r receives a boolean rather than the row, and that boolean is always true because 1 !== array() is always true and 1 !== false is always true.
Instead, you should check whether your result is empty or false.
Try this instead:
$r = $q->fetch(PDO::FETCH_ASSOC);
if(empty($r))
{
die("This user has no private images");
}
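The precedence issue can be reproduced without a database. In this sketch the two variables stand in for what PDOStatement::fetch(PDO::FETCH_ASSOC) returns; the column names in the sample row and the helper name hasPrivateImage are hypothetical.

```php
<?php
// Stand-ins for the two possible results of PDOStatement::fetch(PDO::FETCH_ASSOC).
$noRow = false;                           // no matching row
$oneRow = ['id' => 7, 'profile' => 42];   // one matching row (hypothetical columns)

// `!==` binds more tightly than `=`, so $r receives the comparison result, not the row.
$r = $oneRow !== 1;
var_dump($r); // bool(true), even though a row was found

// Hypothetical helper: fetch() signals "no row" with false, so test for exactly that.
function hasPrivateImage($fetched): bool
{
    return $fetched !== false;
}

var_dump(hasPrivateImage($oneRow)); // bool(true)
var_dump(hasPrivateImage($noRow));  // bool(false)
```

The empty($r) check in the answer gives the same outcome here, because a fetched row is a non-empty array and false counts as empty.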

Doctrine Native SQL many-to-many query

I have a many-to-many relationship between Students and Programs with tables student, program, and student_program in my database.
I'm trying to join the two entities and perform some custom queries that require subqueries. This means that the Doctrine QueryBuilder cannot work because it does not support subqueries.
Instead, I'm trying the NativeSQL function and am making decent progress. However, when I try to SELECT something from the Program entity, I get the error Notice: Undefined index: Bundle\Entity\Program in vendor/doctrine/orm/lib/Doctrine/ORM/Internal/Hydration/ObjectHydrator.php line 180.
$mapping = new \Doctrine\ORM\Query\ResultSetMappingBuilder($em);
$mapping->addRootEntityFromClassMetadata('Student', 's');
$mapping->addJoinedEntityFromClassMetadata('Program', 'p', 's', 'programs', array('id' => 'program_id'));
// Query based on form
$sql = 'SELECT s.id, s.last_name, p.name <---- problem when this is added
FROM student s
JOIN program p
';
$query = $em->createNativeQuery($sql, $mapping);
$students = $query->getResult();
Not a direct answer, but Doctrine 2 does indeed support subqueries. Just create a query, then feed its DQL into a where clause. This example is somewhat verbose, but it works just fine:
public function queryGames($search)
{
// Pull params
$ages = $this->getValues($search,'ages');
$genders = $this->getValues($search,'genders');
$regions = $this->getValues($search,'regions');
$sortBy = $this->getValues($search,'sortBy',1);
$date1 = $this->getValues($search,'date1');
$date2 = $this->getValues($search,'date2');
$time1 = $this->getValues($search,'time1');
$time2 = $this->getValues($search,'time2');
$projectId = $this->getValues($search,'projectId');
// Build query
$em = $this->getEntityManager();
$qbGameId = $em->createQueryBuilder(); // ### SUB QUERY ###
$qbGameId->addSelect('distinct gameGameId.id');
$qbGameId->from('ZaysoCoreBundle:Event','gameGameId');
$qbGameId->leftJoin('gameGameId.teams', 'gameTeamGameId');
$qbGameId->leftJoin('gameTeamGameId.team','teamGameId');
if ($projectId) $qbGameId->andWhere($qbGameId->expr()->in('gameGameId.projectId',$projectId));
if ($date1) $qbGameId->andWhere($qbGameId->expr()->gte('gameGameId.date',$date1));
if ($date2) $qbGameId->andWhere($qbGameId->expr()->lte('gameGameId.date',$date2));
if ($time1) $qbGameId->andWhere($qbGameId->expr()->gte('gameGameId.time',$time1));
if ($time2) $qbGameId->andWhere($qbGameId->expr()->lte('gameGameId.time',$time2));
if ($ages) $qbGameId->andWhere($qbGameId->expr()->in('teamGameId.age', $ages));
if ($genders) $qbGameId->andWhere($qbGameId->expr()->in('teamGameId.gender',$genders));
if ($regions)
{
// $regions[] = NULL;
// $qbGameId->andWhere($qbGameId->expr()->in('teamGameId.org', $regions));
$qbGameId->andWhere($qbGameId->expr()->orX(
$qbGameId->expr()->in('teamGameId.org',$regions),
$qbGameId->expr()->isNull('teamGameId.org')
));
}
//$gameIds = $qbGameId->getQuery()->getArrayResult();
//Debug::dump($gameIds);die();
//return $gameIds;
// Games
$qbGames = $em->createQueryBuilder();
$qbGames->addSelect('game');
$qbGames->addSelect('gameTeam');
$qbGames->addSelect('team');
$qbGames->addSelect('field');
$qbGames->addSelect('gamePerson');
$qbGames->addSelect('person');
$qbGames->from('ZaysoCoreBundle:Event','game');
$qbGames->leftJoin('game.teams', 'gameTeam');
$qbGames->leftJoin('game.persons', 'gamePerson');
$qbGames->leftJoin('game.field', 'field');
$qbGames->leftJoin('gameTeam.team', 'team');
$qbGames->leftJoin('gamePerson.person', 'person');
$qbGames->andWhere($qbGames->expr()->in('game.id',$qbGameId->getDQL())); // ### THE TRICK ###
switch($sortBy)
{
case 1:
$qbGames->addOrderBy('game.date');
$qbGames->addOrderBy('game.time');
$qbGames->addOrderBy('field.key1');
break;
case 2:
$qbGames->addOrderBy('game.date');
$qbGames->addOrderBy('field.key1');
$qbGames->addOrderBy('game.time');
break;
case 3:
$qbGames->addOrderBy('game.date');
$qbGames->addOrderBy('team.age');
$qbGames->addOrderBy('game.time');
$qbGames->addOrderBy('field.key1');
break;
}
// Always get an array even if no records found
$query = $qbGames->getQuery();
$items = $query->getResult();
return $items;
}

PHP PDO succinct mySQL SELECT object

Using PDO I have built a succinct object for retrieving rows from a database as a PHP object with the first column value being the name and the second column value being the desired value.
$sql = "SELECT * FROM `site`"; $site = array();
foreach($sodb->query($sql) as $sitefield){
$site[$sitefield['name']] = $sitefield['value'];
}
I now want to apply it to a function with 2 parameters, the first containing the table and the second containing any where clauses to then produce the same result.
function select($table,$condition){
$sql = "SELECT * FROM `$table`";
if($condition){
$sql .= " WHERE $condition";
}
foreach($sodb->query($sql) as $field){
return $table[$field['name']] = $field['value'];
}
}
The idea that this could be called something like this:
<?php select("options","class = 'apples'");?>
and then be used on page in the same format as the first method.
<?php echo $option['green'];?>
Giving me the value of the column named value that is in the same row as the value called 'green' in the column named field.
The problem, of course, is that the function will not return the foreach data like that. That is, this bit:
foreach($sodb->query($sql) as $field){
return $table[$field['name']] = $field['value'];
}
cannot return data like that.
Is there a way to make it?
Well, this:
$sql = "SELECT * FROM `site`"; $site = array();
foreach($sodb->query($sql) as $sitefield){
$site[$sitefield['name']] = $sitefield['value'];
}
Can easily become this:
$sql = "SELECT * FROM `site`";
$site = array();
foreach( $sodb->query($sql) as $row )
{
$site[] = $row;
}
print_r($site);
// or, where 0 is the index you want, etc.
echo $site[0]['name'];
So, you should be able to get a map of all of your columns into the multidimensional array $site.
Also, don't forget to sanitize your inputs before you dump them right into that query. One of the benefits of PDO is using placeholders to protect yourself from malicious users.
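As for the function in the question, it returns during the first iteration of the loop, and $sodb is not defined inside the function's scope. Below is a minimal sketch of the name => value map, assuming the rows expose 'name' and 'value' columns as in the first snippet. rowsToMap is a hypothetical name, and the sample array stands in for the result of $sodb->query($sql), which is iterable and could be passed in directly.

```php
<?php
// Hypothetical helper: collect name => value pairs from any iterable of rows.
function rowsToMap(iterable $rows): array
{
    $map = [];
    foreach ($rows as $row) {
        $map[$row['name']] = $row['value'];
    }
    return $map; // return once, after the loop has seen every row
}

// Sample rows standing in for the result of $sodb->query($sql).
$rows = [
    ['name' => 'green', 'value' => 'granny smith'],
    ['name' => 'red', 'value' => 'gala'],
];
$option = rowsToMap($rows);
echo $option['green'], "\n";
```

Passing the rows in, rather than a table name and a raw WHERE string, also leaves room to build the query with a prepared statement and bound parameters.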