Optimal way to find similar value from a large table - mysql

I have a MySQL database in which I am storing more than 1,000,000 names. The task of my application is a bit tricky: it not only searches for names in the database, but also finds similar names. Suppose the name entered is christian; the application should then suggest names like christine, chris, etc. What is the optimal way to do this without using the LIKE clause? The suggestions only need to differ in the last part of the name.

If you also want similar-sounding names, something like SOUNDEX() could help: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Otherwise, LIKE 'chri%' doesn't seem like a bad idea to me.
If you really want to match just the first characters without LIKE, you can use SUBSTRING().

You could use PHP's metaphone() function to generate the metaphone code for each name and store the codes along with the names.
<?php
print "chris" . "\t" . metaphone("chris") . "\n";
print "christian" . "\t" . metaphone("christian") . "\n";
print "christine" . "\t" . metaphone("christine") . "\n";
# prints (in the same order as the print statements):
# chris XRS
# christian XRSXN
# christine XRSTN
You can then use a Levenshtein distance algorithm (either in PHP [http://php.net/manual/en/function.levenshtein.php] or MySQL [http://www.artfulsoftware.com/infotree/queries.php#552]) to calculate the distance between the metaphone codes. In my test below, a distance of 2 or less seemed to indicate the level of similarity that you are seeking.
<?php
$names = array(
array('mike',metaphone('mike')),
array('chris',metaphone('chris')),
array('christian',metaphone('christian')),
array('christine',metaphone('christine')),
array('michelle',metaphone('michelle')),
array('mick',metaphone('mick')),
array('john',metaphone('john')),
array('joseph',metaphone('joseph'))
);
foreach ($names as $name) {
_compare($name);
}
function _compare($n) {
global $names;
$name = $n[0];
$meta = $n[1];
foreach ($names as $cname) {
printf("The distance between $name and {$cname[0]} is %d\n",
levenshtein($meta, $cname[1]));
}
}
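The same comparison can be sketched in Python. This is only an illustration: the phonetic codes are copied from the sample output earlier in this answer rather than recomputed (Python has no built-in metaphone), and the names `CODES` and `similar_names` are made up for the example. The edit-distance function is a plain dynamic-programming implementation, not PHP's levenshtein() itself.

```python
def levenshtein(a, b):
    """Edit distance between a and b; insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


# Assumption: codes taken from the sample output in the answer, not recomputed.
CODES = {"chris": "XRS", "christian": "XRSXN", "christine": "XRSTN"}


def similar_names(query, max_distance=2):
    """Names whose stored code is within max_distance of the query's code."""
    target = CODES[query]
    return sorted(
        name for name, code in CODES.items()
        if name != query and levenshtein(target, code) <= max_distance
    )
```

With a million rows you would not want to compute the distance against every stored code on each lookup; narrowing the candidates first (for example by a prefix of the code) is probably necessary.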

LIKE is generally a good solution, but another way to improve performance might be to create a prefix index on the column (an index on only the first N characters) and then submit queries whose prefix has the same length as the index. See the MySQL documentation regarding col_name(length).

You could use a regular expression, I think. I'm not good at them, but MySQL has an operator called REGEXP that you can put in a WHERE clause.

You can use SOUNDS LIKE, I think it should be quite fast as well.
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#operator_sounds-like

Using LIKE with a fixed left-hand side (a prefix pattern) won't require a table scan, provided the name column is indexed. I'm assuming a table scan is the reason you don't want to use LIKE. SELECT * FROM table WHERE name LIKE CONCAT(?, "%") is fast because it can use the index to find rows. The CONCAT lets you use prepared queries with the % syntax.
You could also do something like:
SELECT * FROM table WHERE name < 'christian' ORDER BY name DESC LIMIT 20
and
SELECT * FROM table WHERE name > 'christian' ORDER BY name ASC LIMIT 20
to find neighbors in the sorted list. Without the ORDER BY, LIMIT 20 would return arbitrary rows rather than the nearest ones.
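If LIKE really has to be avoided, a prefix lookup can also be written as a range over the sorted names. Below is a rough sketch in Python, using an in-memory SQLite database as a stand-in for MySQL; the table name `names`, the index name and the sample rows are assumptions made for the example.

```python
import sqlite3


def prefix_bounds(prefix):
    """Return (low, high) so that low <= name < high matches names starting with prefix."""
    # Bump the last character to get the first string that no longer has the prefix.
    # Assumes a non-empty prefix whose last character is not the highest code point.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def suggest(conn, prefix, limit=20):
    """Return up to `limit` names starting with `prefix`, in sorted order."""
    low, high = prefix_bounds(prefix)
    rows = conn.execute(
        "SELECT name FROM names WHERE name >= ? AND name < ? ORDER BY name LIMIT ?",
        (low, high, limit),
    )
    return [name for (name,) in rows]


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_names_name ON names (name)")
conn.executemany(
    "INSERT INTO names (name) VALUES (?)",
    [("chris",), ("christian",), ("christine",), ("chad",), ("mike",), ("john",)],
)
```

Calling suggest(conn, "chri") returns the names that share that prefix. The comparison here is byte-wise; in MySQL the column's collation decides the ordering, so the upper bound may need adjusting for case-insensitive collations.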

Related

is there a better way to use mysql_insert_id()

I have the following sql statement written in PHP.
$sql='INSERT INTO pictures (`picture`) VALUES ("'.$imgs[$i]['name'].'")';
$db->query($sql);
$imgs[$i]['sqlID'] = $this->id=mysql_insert_id();
$imgs[$i]['newImgName'] = $imgs[$i]['sqlID'].'_'.$imgs[$i]['name'];
$sql='UPDATE pictures SET picture="'.$imgs[$i]['newImgName'].'" WHERE id='.$imgs[$i]['sqlID'];
$db->query($sql);
This writes the image name to the database table pictures. After that is done, I get the mysql_insert_id() and then update the picture name, putting the last ID in front of the name with an underscore.
I do this to make sure no two picture names can be the same, because all those pictures get saved in the same folder. Is there a way to save that ID with the first SQL query already? Or are there other, better ways to achieve this result?
Thanks for any advice.
With the native auto_increment there is no other way; you need to do the 3 steps you described.
As Dan Bracuk mentioned, you can create a stored proc to do the 3 queries (you can still get the insert id after executing it).
Other possible options are:
not storing the id in the filename - you can concatenate it later if you want (when selecting)
using an ad-hoc auto increment instead of the native one - I would not recommend it in this case, but it's possible
using some sort of UUID instead of auto increment
generating unique file names using the file system (Marcell Fülöp's answer)
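The first option in the list (not storing the ID in the file name and concatenating it when selecting) could look roughly like this. The sketch uses Python with an in-memory SQLite database in place of MySQL, and the helper names are invented for the example; in MySQL the concatenation would be CONCAT(id, '_', picture) rather than the || operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pictures (id INTEGER PRIMARY KEY AUTOINCREMENT, picture TEXT NOT NULL)"
)


def add_picture(conn, name):
    """Insert the original name once and return the generated id."""
    cur = conn.execute("INSERT INTO pictures (picture) VALUES (?)", (name,))
    return cur.lastrowid


def stored_file_name(conn, picture_id):
    """Build '<id>_<name>' at select time instead of storing it with an UPDATE."""
    row = conn.execute(
        "SELECT id || '_' || picture FROM pictures WHERE id = ?", (picture_id,)
    ).fetchone()
    return row[0] if row else None
```

The file on disk would then be saved under the name returned by stored_file_name(), so two uploads with the same original name still end up with different file names and no second query is needed at insert time.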
I don't think in this particular case it's reasonable to use MySQL insert Id in the file name. It might be required in some cases but you provided no information why it would be in this one.
Consider something like:
while( file_exists( $folder . $filename ) ) {
$filename = rand(1,999) . '_' . $filename;
}
$imgs[$i]['newImgName'] = $filename;
Of course you can use a larger range for rand, or a loop counter if you want to systematically increment the number used to prefix the original file name.
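The loop-counter variant could be sketched like this in Python (the function name and the temporary folder are just for illustration). Unlike the rand() loop, it always prefixes the original name, so prefixes do not pile up.

```python
import os
import tempfile


def unique_name(folder, filename):
    """Return filename, or '<n>_filename' with the smallest counter n that is still free."""
    candidate = filename
    counter = 1
    while os.path.exists(os.path.join(folder, candidate)):
        candidate = f"{counter}_{filename}"
        counter += 1
    return candidate


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    first = unique_name(folder, "photo.jpg")        # nothing exists yet
    open(os.path.join(folder, first), "w").close()  # pretend the upload was saved
    second = unique_name(folder, "photo.jpg")       # the original name is now taken
```

Note that there is a gap between checking for the file and writing it, so two simultaneous uploads could still pick the same name. If that matters, the database ID or a UUID is the safer choice.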

Perl/MySQL Relationship Query

I have the following perl code that will eventually be a webpage:
my($dbh) = DBI->connect("DBI:mysql:host=dbsrv;database=database","my_sqlu","my_sqlp") or die "Canny Connect";
my($sql) = "SELECT * FROM hardware where srv_name = \"$srv_name\"";
my($sth) = $dbh->prepare($sql);
$sth->execute();
$sth->bind_col( 1, \my($db_id));
$sth->bind_col( 2, \my($db_srv_name));
$sth->bind_col( 5, \my($db_site));
$sth->fetchrow();
$sth->finish ();
my($sql) = "SELECT sites.\`site_code\`, sites.\`long_name\` FROM \`hardware\` JOIN \`sites\` ON \`sites\`.id=\`hardware\`.\`site\` where \`hardware\`.\`id\`=\'$db_id\'";
my($sth) = $dbh->prepare($sql);
$sth->execute();
$sth->bind_col( 1, \my($db_site_code));
$sth->bind_col( 2, \my($db_long_name));
$sth->fetchrow();
$sth->finish ();
$dbh->disconnect;
print "$db_site_code<br>$db_long_name";
The query above does work; however, what I'm trying to find out is whether there is any way to run one SQL query and get db_site_code and db_long_name from the sites table without running the second query. The site column in the hardware table is a foreign key to id in the sites table.
When you read anything about relational DBs they all say it's by far the most efficient method of getting data from your database but I just can't see how this is any quicker than just running 2 select queries. What I've done above would surely take longer than "select from hardware where srv_name = $srv_name" then "select from sites where id = db_site_id"? Any comments are greatly appreciated.
Here's an example of how to do this with placeholders as well as a combined query. If I understand your DB correctly, you can just omit the first query and add the server name instead of the ID in the second query. I might be mistaken there, but my example will still be of value for the Perl suggestions.
use strict;
use warnings;
use DBI;
# Create DB connection
my $dbh = DBI->connect("DBI:mysql:host=dbsrv;database=database","my_sqlu","my_sqlp")
or die "Cannot connect to database";
# Create the statement handle
my $sth = $dbh->prepare(<<'SQLQUERY') or die $dbh->errstr;
SELECT s.site_code, s.long_name
FROM hardware h
JOIN sites s ON s.id=h.site
WHERE h.srv_name=?
SQLQUERY
$sth->execute('Server Name'); # There's the parameter
my $res = $sth->fetchrow_hashref; # $res now has a hash ref with the first row
print "$res->{'site_code'}<br>$res->{'long_name'}";
There were a few issues with your code I'd like to point out to you:
You should always use strict and use warnings. They make your life easier!
You can leave the parens ( and ) out with my. Saves you keystrokes and makes your code more readable.
You can (but do not have to, this is preference!) leave out the parens after method calls that do not have arguments. Decide this for yourself.
As was already pointed out, always use placeholders with DBI. They are very simple. Now you don't have to escape the " with backslashes. Instead, just use ?.
Once you've combined your query, you can put it in a heredoc (<<'SQLQUERY'). It's a string that lasts from the next line to the delimiter (SQLQUERY). That way, your query is easier to read.
You can use one of the ref-fetchrow-methods to get all your result's columns into one hash. I used $sth->fetchrow_hashref because I find it most convenient. You've got the complete row and all the columns are named hash keys.
If called in a small scope (like a short sub), you don't need to finish a statement handle. It will be finished and destroyed by Perl automatically once it goes out of scope.
Another thing about performance: If this is just run occasionally, don't worry about it. You can profile your queries with DBI::Profile to see which way it is faster, but you should only do that if you really need to.
In my experience, especially with very large queries and a very busy database, two or three queries can be a lot better than a single big one because they do not take over the server's resources. But again, that is something you need to profile and benchmark (if the need arises).
Aside from @tadman's recommendation to use placeholders, I'd tag this as a SQL question as well, but your solution is simply to use
srv_name = \"$srv_name\"
in your second where clause instead of the id comparison, so that your statement is:
"SELECT sites.\`site_code\`, sites.\`long_name\` FROM \`hardware\` JOIN \`sites\` ON \`sites\`.id=\`hardware\`.\`site\` where \`hardware\`.\`srv_name\`=\"$srv_name\"";
I strongly second @tadman's suggestion though: use prepared statements and/or placeholders whenever possible.
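For anyone who wants to try the combined query without a MySQL server, here is a small self-contained sketch in Python with an in-memory SQLite database. The table and column names follow the question; the sample rows and the function name are invented, and the ? placeholder plays the same role as in DBI.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (id INTEGER PRIMARY KEY, site_code TEXT, long_name TEXT);
    CREATE TABLE hardware (
        id INTEGER PRIMARY KEY,
        srv_name TEXT,
        site INTEGER REFERENCES sites(id)
    );
    -- Hypothetical sample rows
    INSERT INTO sites VALUES (1, 'LON', 'London'), (2, 'NYC', 'New York');
    INSERT INTO hardware VALUES (10, 'web01', 1), (11, 'db01', 2);
""")


def site_for_server(conn, srv_name):
    """One query: look up the server by name and join straight to its site."""
    return conn.execute(
        """
        SELECT s.site_code, s.long_name
        FROM hardware h
        JOIN sites s ON s.id = h.site
        WHERE h.srv_name = ?
        """,
        (srv_name,),
    ).fetchone()
```

site_for_server(conn, "web01") returns one (site_code, long_name) tuple, or None if no server has that name, so there is no intermediate ID to carry from a first query into a second one.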

CakePHP database query - am I overcomplicating things?

So, I need to search a real estate database for all homes belonging to realtors who are part of the same real estate agency as the current realtor. I'm currently doing this something like this:
$agency_data = $this->Realtor->find('all',array(
'conditions'=>
array('business_name'=>$realtor_settings['Realtor']['business_name']),
'fields'=>array('num'),
'recursive'=> -1
));
foreach($agency_data as $k=>$v){
foreach($v as $k=>$v1){
$agency_nums[] = $v1['num'];
}
}
$conditions = array(
'realtor_num'=>$agency_nums
);
It seems a bit crazy to me that I'm having to work so hard to break down the results of my first query, just to get a simple, one-dimensional array of ids that I can use to build a condition for my subsequent query. Am I doing this in an insanely roundabout way? Is there an easy way to write a single CakePHP query to communicate "select * from homes where realtor_num in (select num from realtors where business_name = 'n')"? If so, would it be any more efficient?
It is certainly more complicated than it needs to be :)
Depending on the results you need, you can do the following:
$agency_data = $this->Realtor->find('list',array(
'conditions'=>array('business_name'=>$realtor_settings['Realtor']['business_name']),
'fields'=>array('num', 'num'),
'recursive'=> -1
));
$agency_data; // this already contains an array of ids
Method 2: building a subquery. There are two ways, strict and not so strict :) The first one you can see here (search for Sub-queries).
The other option is to have the following conditions parameter:
$this->Realtor->find('all', array('conditions'=>array('field in (select num from realtors where business_name like "'.$some_variable.'")')));
Of course, be careful with $some_variable in the subquery. You should escape it, for example with the Sanitize class.
$agency_data = $this->Realtor->find('all',array(
'conditions'=>
array('business_name'=>$realtor_settings['Realtor']['business_name']),
'fields'=>array('num'),
'recursive'=> -1
));
$conditions = Set::extract("{n}.Realtor.num", $agency_data);
I would use something like Set::extract to grab the list of data you are looking for. The advantage of doing it this way is that you can reuse the same dataset in other places and save queries. You could also write the Set::extract statement in this format:
$conditions = Set::extract("/Realtor/num", $agency_data);
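To see what the single-query form from the question does, independent of CakePHP, here is a minimal sketch in Python with an in-memory SQLite database. The homes and realtors tables and the realtor_num, num and business_name columns come from the question; the other columns, the sample rows and the function name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE realtors (num INTEGER PRIMARY KEY, business_name TEXT);
    CREATE TABLE homes (id INTEGER PRIMARY KEY, address TEXT, realtor_num INTEGER);
    -- Hypothetical sample rows
    INSERT INTO realtors VALUES (1, 'Acme Realty'), (2, 'Acme Realty'), (3, 'Other Homes');
    INSERT INTO homes VALUES (100, '1 Main St', 1), (101, '2 Oak Ave', 2), (102, '3 Pine Rd', 3);
""")


def homes_for_agency(conn, business_name):
    """All homes whose realtor belongs to the given agency, in one query."""
    return conn.execute(
        """
        SELECT id, address
        FROM homes
        WHERE realtor_num IN (SELECT num FROM realtors WHERE business_name = ?)
        ORDER BY id
        """,
        (business_name,),
    ).fetchall()
```

Whether this is faster than two queries depends on the database and the indexes, so it is worth measuring; the main gain is that the list of realtor numbers never has to be unpacked in application code.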

MySQL string replace

Hey, what's the most effective way to remove beginning and ending slashes from all rows in a particular column using MySQL?
Before:
/hello/world/
foo/bar/
/something/else
/one/more/*
After:
hello/world
foo/bar
something/else
one/more/*
...or maybe this should be done in PHP instead?
See TRIM()
UPDATE MY_TABLE SET my_field=TRIM(BOTH '/' FROM my_field);
You could definitely make this work using the MySQL string functions, but I think this would be best handled outside of the database, using PHP or whatever programming language you prefer.
Your PHP option: (I'm assuming the fetched row is in $row)
$row['Field'] = explode('/', $row['Field']);
//Remove the empty elements
$row['Field'] = array_filter($row['Field']);
$row['Field'] = implode('/', $row['Field']);
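For comparison, here is the same clean-up outside the database as a short Python sketch, run against the four sample values from the question (the function name is made up). It only strips slashes at the two ends, like TRIM(BOTH '/' FROM ...), and leaves everything in between untouched.

```python
paths = ["/hello/world/", "foo/bar/", "/something/else", "/one/more/*"]


def trim_slashes(path):
    """Remove leading and trailing slashes; inner slashes are left alone."""
    return path.strip("/")


trimmed = [trim_slashes(p) for p in paths]
```

One difference from the explode/array_filter/implode version: that approach also drops empty segments in the middle of the string (and array_filter would drop a segment that is just "0"), whereas stripping only the ends keeps a value like a//b intact.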

Using enum in drupal

I have a MySQL table with the columns id,name,gender,age religion( enum('HIN','CHR','MUS') ,category(enum('IND','AMR','SPA') where the last 2 are of the enum data type, and my code in Drupal was:
$sql="SELECT * FROM {emp} WHERE age=".$age." and religion=".$rel." and category=".$categ;
$result=db_query_range($sql,0,10);
while($data=db_fetch_object($result))
{
print $data->id." ".$data->name."<br>";
}
I get no result and no error. I have tried a different query for each field, and all of them work fine except the ones using enum,
for example: $sql='SELECT * FROM {emp} WHERE religion="'.$rel.'"';
Is there any problem with using the enum data type in Drupal?
Enum is not something that I believe Drupal can create with the schema API, which is what you want to use in most cases for modules and the like. Also, you are lacking an ending ) in your reference to it, but I'm sure you did it right when you made the table.
Enum is only a constraint that the database applies when inserting values. If you try to insert an invalid value, an empty string is inserted instead (at least when MySQL is not running in strict mode). So it won't have any effect on Drupal querying to get data. It also won't have any effect when Drupal inserts values, other than converting invalid values to empty strings. You might want to check your data to see if it is as expected. You might just get no results because your query doesn't match anything.
Another thing: the way you construct your queries is a big no-no, as it's very insecure. What you should do is this:
db_query("SELECT ... '%s' ...", $var);
Drupal will replace %s with your var and make sure there is no SQL injection or other nasty things. %s indicates the var is a string; use %d for ints, and there are a few others I can't remember just now. You can have several placeholders like this, and they will be inserted in order, much like with the t function.
Seconding Googletorp's advice on using parameterized queries (+1). That would not only be more secure, but also make it easier to spot errors ;)
Your original query is missing quotes around the string comparison values. The following should work (note the added single quotes):
$sql = "SELECT * FROM {emp} WHERE age='" . $age . "' and religion='" . $rel . "' and category='" . $categ . "'";
The right way to do it would be something like this:
$sql = "SELECT * FROM {emp} WHERE age='%s' and religion='%s' and category='%s'";
$args = array($age, $rel, $categ);
$result = db_query_range($sql, $args ,0 , 10);
// ...
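The effect of placeholders is easy to try outside Drupal. The sketch below uses Python with an in-memory SQLite database, so it is not Drupal's API: SQLite has no enum type, so CHECK constraints stand in for the two enum columns, and the sample rows and the function name are made up. The point is only that string values bound through placeholders need no hand-written quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER,
        -- CHECK constraints stand in for MySQL's enum columns
        religion TEXT CHECK (religion IN ('HIN', 'CHR', 'MUS')),
        category TEXT CHECK (category IN ('IND', 'AMR', 'SPA'))
    );
    -- Hypothetical sample rows
    INSERT INTO emp VALUES (1, 'Asha', 30, 'HIN', 'IND'), (2, 'Carlos', 30, 'CHR', 'SPA');
""")


def find_emps(conn, age, rel, categ, offset=0, count=10):
    """Bind every value as a parameter; the driver handles quoting and escaping."""
    return conn.execute(
        "SELECT id, name FROM emp "
        "WHERE age = ? AND religion = ? AND category = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (age, rel, categ, count, offset),
    ).fetchall()
```

With concatenated strings, an unquoted HIN would be read as a column name rather than a value; with bound parameters that class of mistake cannot happen.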