Searching for phone numbers in mysql - mysql

I have a table which is full of arbitrarily formatted phone numbers, like this
027 123 5644
021 393-5593
(07) 123 456
042123456
I need to search for a phone number in a similarly arbitrary format ( e.g. 07123456 should find the entry (07) 123 456
The way I'd do this in a normal programming language is to strip all the non-digit characters out of the 'needle', then go through each number in the haystack, strip all non-digit characters out of it, then compare against the needle, eg (in ruby)
digits_only = lambda{ |n| n.gsub /[^\d]/, '' }
needle = digits_only[input_phone_number]
haystack.map(&digits_only).include?(needle)
The catch is, I need to do this in MySQL. It has a host of string functions, none of which really seem to do what I want.
Currently I can think of 2 'solutions'
Hack together a franken-query of CONCAT and SUBSTR
Insert a % between every character of the needle ( so it's like this: %0%7%1%2%3%4%5%6% )
However, neither of these seem like particularly elegant solutions.
Hopefully someone can help or I might be forced to use the %%%%%% solution
Update: This is operating over a relatively fixed set of data, with maybe a few hundred rows. I just didn't want to do something ridiculously bad that future programmers would cry over.
If the dataset grows I'll take the 'phoneStripped' approach. Thanks for all the feedback!
could you use a "replace" function to strip out any instances of "(", "-" and " ",
I'm not concerned about the result being numeric.
The main characters I need to consider are +, -, (, ) and space
So would that solution look like this?
SELECT * FROM people
WHERE
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phonenumber, '('),')'),'-'),' '),'+')
LIKE '123456'
Wouldn't that be terribly slow?

This looks like a problem from the start. Any kind of searching you do will require a table scan and we all know that's bad.
How about adding a column with a hash of the current phone numbers after stripping out all formatting characters. Then you can at least index the hash values and avoid a full blown table scan.
Or is the amount of data small and not expected to grow much?
Then maybe just sucking all the numbers into the client and running a search there.

I know this is ancient history, but I found it while looking for a similar solution.
A simple REGEXP may work:
select * from phone_table where phone1 REGEXP "07[^0-9]*123[^0-9]*456"
This would match the phonenumber column with or without any separating characters.

As John Dyer said, you should consider fixing the data in the DB and store only numbers. However, if you are facing the same situation as mine (I cannot run a update query) the workaround I found was combining 2 queries.
The "inside" query will retrieve all the phone numbers and format them removing the non-numeric characters.
SELECT REGEXP_REPLACE(column_name, '[^0-9]', '') phone_formatted FROM table_name
The result of it will be all phone numbers without any special character. After that the "outside" query just need to get the entry you are looking for.
The 2 queries will be:
SELECT phone_formatted FROM (
SELECT REGEXP_REPLACE(column_name, '[^0-9]', '') phone_formatted FROM table_name
) AS result WHERE phone_formatted = 9999999999
Important: the AS result is not used but it should be there to avoid erros.

An out-of-the-box idea, but could you use a "replace" function to strip out any instances of "(", "-" and " ", and then use an "isnumeric" function to test whether the resulting string is a number?
Then you could do the same to the phone number string you're searching for and compare them as integers.
Of course, this won't work for numbers like 1800-MATT-ROCKS. :)

Is it possible to run a query to reformat the data to match a desired format and then just run a simple query? That way even if the initial reformatting is slow you it doesn't really matter.

My solution would be something along the lines of what John Dyer said. I'd add a second column (e.g. phoneStripped) that gets stripped on insert and update. Index this column and search on it (after stripping your search term, of course).
You could also add a trigger to automatically update the column, although I've not worked with triggers. But like you said, it's really difficult to write the MySQL code to strip the strings, so it's probably easier to just do it in your client code.
(I know this is late, but I just started looking around here :)

i suggest to use php functions, and not mysql patterns, so you will have some code like this:
$tmp_phone = '';
for ($i=0; $i < strlen($phone); $i++)
if (is_numeric($phone[$i]))
$tmp_phone .= '%'.$phone[$i];
$tmp_phone .= '%';
$search_condition .= " and phone LIKE '" . $tmp_phone . "' ";

This is a problem with MySQL - the regex function can match, but it can't replace. See this post for a possible solution.

See
http://www.mfs-erp.org/community/blog/find-phone-number-in-database-format-independent
It is not really an issue that the regular expression would become visually appalling, since only mysql "sees" it. Note that instead of '+' (cfr. post with [\D] from the OP) you should use '*' in the regular expression.
Some users are concerned about performance (non-indexed search), but in a table with 100000 customers, this query, when issued from a user interface returns immediately, without noticeable delay.

Here is a working Solution for PHP users.
This uses a loop in PHP to build the Regular Expression. Then searches the database in MySQL with the RLIKE operator.
$phone = '(456) 584-5874' // can be any format
$phone = preg_replace('/[^0-9]/', '', $phone); // strip non-numeric characters
$len = strlen($phone); // get length of phone number
for ($i = 0; $i < $len - 1; $i++) {
$regex .= $phone[$i] . "[^[:digit:]]*";
}
$regex .= $phone[$len - 1];
This creates a Regular Expression that looks like this: 4[^[:digit:]]*5[^[:digit:]]*6[^[:digit:]]*5[^[:digit:]]*8[^[:digit:]]*4[^[:digit:]]*5[^[:digit:]]*8[^[:digit:]]*7[^[:digit:]]*4
Now formulate your MySQL something like this:
$sql = "SELECT Client FROM tb_clients WHERE Phone RLIKE '$regex'"
NOTE: I tried several of the other posted answers but found performance issues. For example, on our large database, it took 16 seconds to run the IsNumeric example. But this solution ran instantly. And this solution is compatible with older MySQL versions.

MySQL can search based on regular expressions.
Sure, but given the arbitrary formatting, if my haystack contained "(027) 123 456" (bear in mind position of spaces can change, it could just as easily be 027 12 3456 and I wanted to match it with 027123456, would my regex therefore need to be this?
"^[\D]+0[\D]+2[\D]+7[\D]+1[\D]+2[\D]+3[\D]+4[\D]+5[\D]+6$"
(actually it'd be worse as the mysql manual doesn't seem to indicate it supports \D)
If that is the case, isn't it more or less the same as my %%%%% idea?

Just an idea, but couldn't you use Regex to quickly strip out the characters and then compare against that like #Matt Hamilton suggested?
Maybe even set up a view (not sure of mysql on views) that would hold all phone numbers stripped by regex to a plain phone number?

Woe is me. I ended up doing this:
mre = mobile_number && ('%' + mobile_number.gsub(/\D/, '').scan(/./m).join('%'))
find(:first, :conditions => ['trim(mobile_phone) like ?', mre])

if this is something that is going to happen on a regular basis perhaps modifying the data to be all one format and then setup the search form to strip out any non-alphanumeric (if you allow numbers like 310-BELL) would be a good idea. Having data in an easily searched format is half the battle.

a possible solution can be found at http: //udf-regexp.php-baustelle.de/trac/
additional package need to be installed, then you can play with REGEXP_REPLACE

Create a user defined function to dynamically creates Regex.
DELIMITER //
CREATE FUNCTION udfn_GetPhoneRegex
(
var_Input VARCHAR(25)
)
RETURNS VARCHAR(200)
BEGIN
DECLARE iterator INT DEFAULT 1;
DECLARE phoneregex VARCHAR(200) DEFAULT '';
DECLARE output VARCHAR(25) DEFAULT '';
WHILE iterator < (LENGTH(var_Input) + 1) DO
IF SUBSTRING(var_Input, iterator, 1) IN ( '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ) THEN
SET output = CONCAT(output, SUBSTRING(var_Input, iterator, 1));
END IF;
SET iterator = iterator + 1;
END WHILE;
SET output = RIGHT(output,10);
SET iterator = 1;
WHILE iterator < (LENGTH(output) + 1) DO
SET phoneregex = CONCAT(phoneregex,'[^0-9]*',SUBSTRING(output, iterator, 1));
SET iterator = iterator + 1;
END WHILE;
SET phoneregex = CONCAT(phoneregex,'$');
RETURN phoneregex;
END//
DELIMITER ;
Call that User Defined Function in your stored procedure.
DECLARE var_PhoneNumberRegex VARCHAR(200);
SET var_PhoneNumberRegex = udfn_GetPhoneRegex('+ 123 555 7890');
SELECT * FROM Customer WHERE phonenumber REGEXP var_PhoneNumberRegex;

I would use Google's libPhoneNumber to format a number to E164 format. I would add a second column called "e164_number" to store the e164 formatted number and add an index on it.

In my case, I needed to identify Swiss (CH) mobile phone numbers in the phone column and move them in mobile column.
As all mobile phone numbers starts with 07x or +417x here is the regex to use :
/^(\+[0-9][0-9]\s*|0|)7.*/mgix
It find all numbers like the following :
+41 79 123 456 78
+417612345678
076 123 456 78
07812345678
7712345678
and ignore all others like theese :
+41 47 123 456 78
+413212345678
021 123 456 78
02212345678
3412345678
In MySQL it gives the following code :
UPDATE `contact`
SET `mobile` = `phone`,
`phone` = ''
WHERE `phone` REGEXP '^(\\+[\D+][0-9]\\s*|0|)(7.*)$'
You'll need to clean your number from special chars like -/.() before.
https://regex101.com/r/AiWFX8/1

Related

What is the purpose of using WHERE COLUMN like '%[_][01][7812]' in SQL statements?

What is the purpose of using WHERE COLUMN like '%[_][01][7812]' in SQL statements?
I get some result, but don't know how to use properly.
I see that it is searching through the base, but I don't understand the pattern.
Like selects strings similar to a pattern. The pattern you're looking at uses several wildcards, which you can review here: https://www.w3schools.com/SQL/sql_wildcards.asp
Briefly, the query seems to ba matching any row where COLUMN ends in an _ then a 0 or a 1, then a 7,8,1, or 2. (So it would match 'blah_07' but not 'blah_81', 'blah_0172', or 'blah18')
First thing as you might be aware that where clause is used for filtering rows.
In your case (Where column Like %[_][01][7812]) Means find the column ending with [_][01][7812] and there could be anything place of %
declare
#searchString varchar(50) = '[_][01][7812]',
#testString varchar(50) = 'BeginningOfString' + '[_][01][7812]' + 'EndofString'
select CHARINDEX(#searchString, #testString), #testString, LEN(#testString) as [totalLength]
set #testString = '[_][01][7812]' + 'EndofString'
select CHARINDEX(#searchString, #testString), #testString, LEN(#testString) as [totalLength]
set #testString = 'BeginningOfString' + '[_][01][7812]'
select CHARINDEX(#searchString, #testString), #testString, LEN(#testString) as [totalLength]
Although you've tagged your post MySQL, that code seems unlikely to have been written for it. That LIKE pattern, to me, resembles Microsoft SQL Server's variation on the syntax, where it would match anything ending with an underscore followed by a zero or a one, followed by a 7, an 8 a 1 or a 2.
So your example 'TA01_55_77' would not match, but 'TA01_55_18' would, as would 'GZ01_55_07'
(In SQL Server, enclosing a wildcard character like '_' in square brackets escapes it, turning it into a literal underscore.)
Of course, there may be other RDBMSes with similar syntax, but what you've presented doesn't seem like it would work on the data you've got if running in MySQL.

How to replace the exact word in MySQL?

I have this query:
UPDATE tbl SET fldname = replace(fldname,'St','Stats');
However if anything begins with the letters St, such as Stump, it would then look like Statsump. How can I just have the St replaced with the exact match? Thanks!
AFAIK MySQL does not directly support regexp replacing (http://bugs.mysql.com/bug.php?id=27389), unless you use user defined functions.
You could try adding a WHERE char_length(fldname) == 5, though - this would match only the strings of length 5, and the only string of length 5 containing 'Stats' is... 'Stats'.
Or of course ditch the whole thing and just do SET fldname='St' WHERE fldname == 'Stats', which would probably be the sane thing to do unless your query is more complex than your example.

MySQL - Confusing RegEx Variable Issue

I need some help with a RegEx. The concept is simple, but the actual solution is well beyond anything I know how to figure out. If anyone could explain how I could achieve my desired effect (and provide an explanation with any example code) it'd be much appreciated!
Basically, imagine a database table that stores the following string:
'My name is $1. I wonder who $2 is.'
First, bear in mind that the dollar sign-number format IS set in stone. That's not just for this example--that's how these wildcards will actually be stored. I would like an input like the following to be able to return the above string.
'My name is John. I wonder who Sarah is.'
How would I create a query that searches with wildcards in this format, and then returns the applicable rows? I imagine a regular expression would be the best way. Bear in mind that, theoretically, any number of wildcards should be acceptable.
Right now, this is the part of my existing query that drags the content out of the database. The concatenation, et cetera, is there because in a single database cell, there are multiple strings concatenated by a vertical bar.
AND CONCAT('|', content, '|')
LIKE CONCAT('%|', '" . mysql_real_escape_string($in) . "', '|%')
I need to modify ^this line to work with the variables that are a part of the query, while keeping the current effect (vertical bars, etc) in place. If the RegEx also takes into account the bars, then the CONCAT() functions can be removed.
Here is an example string with concatenation as it might appear in the database:
Hello, my name is $1.|Hello, I'm $1.|$1 is my name!
The query should be able to match with any of those chunks in the string, and then return that row if there is a match. The variables $1 should be treated as wildcards. Vertical bars will always delimit chunks.
For MySQL, this article is a nice guide which should help you. The Regexp would be "(\$)(\d+)". Here's a query I ripped off the article:
SELECT * FROM posts WHERE content REGEXP '(\\$)(\\d+)';
After retrieving data, use this handy function:
function ParseData($query,$data) {
$matches=array();
while(preg_match("/(\\$)(\\d+)/",$query,$matches)) {
if (array_key_exists(substr($matches[0],1),$data))
$query=str_replace($matches[0],"'".mysql_real_escape_string($data[substr($matches[0],1)])."'",$query);
else
$query=str_replace($matches[0],"''",$query);
}
return $query;
}
Usage:
$query="$1 went to $2's house";
$data=array(
'1' => 'Bob',
'2' => 'Joe'
);
echo ParseData($query,$data); //Returns "Bob went to Joe's house
If you aren't sticky about using the $1 and $2 and could change them around a bit, you could take a look at this:
http://php.net/manual/en/function.sprintf.php
E.G.
<?php
$num = 5;
$location = 'tree';
$format = 'There are %d monkeys in the %s';
printf($format, $num, $location);
?>
If you want to find entries in the database, then you can use a LIKE statement:
SELECT statement FROM myTable WHERE statement LIKE '%$1%'
Which will find all statements that include $1. I'm assuming that the first number to replace will always be $1 - it doesn't matter, in that case, that the total number of wildcards is arbitrary, as we're just looking for the first one.
The PHP replacement is a little trickier. You could probably do something like:
$count = 1;
while (strpos($statement, "$" . $count)) {
$statement = str_replace("$" . $count, $array[$count], $statement);
}
(I've not tested that, so there might be typos in there, but it should be enough to give the general idea.)
The one downside is that it will fail if you have more than ten parameters in your string to replace - the first runthrough will replace the first two characters of $10, as it's looking for $1.
I asked a different, but similar, question, and I think the solution applies to this question just as well.
https://stackoverflow.com/a/10763476/1382779

Converting a upper case database to proper case

I am new to SQL and I have several large database with upper case first and last names that I need to convert to proper case in SQL sever 2008.
I am using the following to do this:
update database
Set FirstNames = upper(substring(FirstNames, 1, 1))
+ lower(substring(FirstNames, 2, (len(FirstNames) - 1) ))
I was wondering if there was any way to adapt this so that a field with two first names is also updated (currently I make the change and then go through and manually change the second name).
I have looked over the other answers in this field and they all seem quit long, compared to the query above.
Also is there any way to assist with converting the Mc suranmes ( I will manually change the others)? MCDONALD to McDonald, again I am just using the about query but replacing the FirstNames with LastName.
This is probably best done outside of SQL. However, if there is a requirement to do it on the server or if speed isn't an issue (because it will be an issue so you need to figure out if you care), the way you are going about it is probably the best way of doing so. If you want, you could create a UDF that puts all of the logic in one area.
Here is some code I came across (with attribution and more information below it):
CREATE FUNCTION dbo.fCapFirst(#input NVARCHAR(4000)) RETURNS NVARCHAR(4000)
AS
BEGIN
DECLARE #position INT
WHILE IsNull(#position,Len(#input)) > 1
SELECT #input = Stuff(#input,IsNull(#position,1),1,upper(substring(#input,IsNull(#position,1),1))),
#position = charindex(' ',#input,IsNull(#position,1)) + 1
RETURN (#input)
END
--Call it like so
select dbo.fCapFirst(Lower(Column)) From MyTable
I got this code from http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=37760 There is more information and other suggestions in this forum as well.
As for dealing with cases like the McDonald, I would suggest one of two ways to handle this. One would be to put a search in the above UDF for key names ('McDonald', 'McGrew', etc.) or for patterns (the first two letters are Mc then make the next one capital, etc.) The second way would be to put these cases (the full names) in a table and have their replacement value in a second column. Then simply do a replace. Most likely, however, it will be easiest to identify rules like Mc then capitalize instead of trying to list every last-name possibility.
Don't forget you may want to modify the above UDF to include dashes, not just spaces.
Maybe this is too long but it is very easy and can be adapted for -, ', etc:
UPDATE tbl SET LastName = Case when (CharIndex(' ',lastname,1)<>0) then (Upper(Substring(lastname,1,1))+Lower(Substring(lastname,2,CharIndex(' ',lastname,1)-1)))+
(Upper(Substring(lastname,CharIndex(' ',lastname,1)+1,1))+
Lower(Substring(lastname,CharIndex(' ',lastname,1)+2,Len(lastname)-(CharIndex(' ',lastname,1)-1))))
else (Upper(Substring(lastname,1,1))+Lower(Substring(lastname,2,Len(lastname)-1))) end,
FirstName = Case when (CharIndex(' ',firstname,1)<>0) then (Upper(Substring(firstname,1,1))+Lower(Substring(firstname,2,CharIndex(' ',firstname,1)-1)))+
(Upper(Substring(firstname,CharIndex(' ',firstname,1)+1,1))+
Lower(Substring(firstname,CharIndex(' ',firstname,1)+2,Len(firstname)-(CharIndex(' ',firstname,1)-1))))
else (Upper(Substring(firstname,1,1))+Lower(Substring(firstname,2,Len(firstname)-1))) end;
Tony Rogerson has code that deals with:
double barrelled names eg Arthur Bentley-Smythe
Control characters
I haven't used it myself though...

MySQL Find similar strings

I have an InnoDB database table of 12 character codes which often need to be entered by a user.
Occasionally, a user will enter the code incorrectly (for example typing a lower case L instead of a 1 etc).
I'm trying to write a query that will find similar codes to the one they have entered but using LIKE '%code%' gives me way too many results, many of which contain only one matching character.
Is there a way to perform a more detailed check?
Edit - Case sensitive not required.
Any advice appreciated.
Thanks.
Have a look at soundex. Commonly misspelled strings have the same soundex code, so you can query for:
where soundex(Code) like soundex(UserInput)
use without wildcard % for that
SELECT `code` FROM table where code LIKE 'user_input'
thi wil also check the space
SELECT 'a' = 'a ', return 1 whereas SELCET 'a' LIKE 'a ' return 0
reference