Using REGEX to alter field data in a MySQL query

I have two databases, both containing phone numbers. I need to find all instances of duplicate phone numbers, but the formats of database 1 vary wildly from the format of database 2.
I'd like to strip out all non-digit characters and just compare the two 10-digit strings to determine if it's a duplicate, something like:
SELECT b.phone as barPhone, sp.phone as SPPhone FROM bars b JOIN single_platform_bars sp ON sp.phone.REGEX = b.phone.REGEX
Is such a thing even possible in a mysql query? If so, how do I go about accomplishing this?
EDIT: Looks like it is, in fact, a thing you can do! Hooray! The following query returned exactly what I needed:
SELECT b.phone, b.id, sp.phone, sp.id
FROM bars b JOIN single_platform_bars sp ON REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(b.phone,' ',''),'-',''),'(',''),')',''),'.','') = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')',''),'.','')

MySQL versions before 8.0 don't support returning the "match" of a regular expression (8.0 added REGEXP_REPLACE and REGEXP_SUBSTR). The MySQL REGEXP operator returns 1 or 0, depending on whether the string matched the regular expression.
You can use the REPLACE function to replace a specific character, and you can nest those calls, but that would be unwieldy for all "non-digit" characters. For example, to remove spaces, dashes, and open and close parens:
REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')','')
One approach is to create a user-defined function that returns just the digits from a string. But if you don't want to create a user-defined function...
This can be done in native MySQL. This approach is a bit unwieldy, but it is workable for strings of "reasonable" length.
SELECT CONCAT(IF(SUBSTR(sp.phone,1,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,1,1),'')
,IF(SUBSTR(sp.phone,2,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,2,1),'')
,IF(SUBSTR(sp.phone,3,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,3,1),'')
,IF(SUBSTR(sp.phone,4,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,4,1),'')
,IF(SUBSTR(sp.phone,5,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,5,1),'')
) AS phone_digits
FROM single_platform_bars sp
To unpack that a bit: we extract a single character from the first position in the string and check whether it's a digit. If it is, we return the character; otherwise we return an empty string. We repeat this for the second, third, etc. characters, then concatenate all of the returned characters and empty strings back into a single string.
Obviously, the expression above checks only the first five characters of the string. You would need to extend it, adding a line for each position you want to check.
And unwieldy expressions like this can be included in a predicate (in a WHERE clause). (I've just shown it in the SELECT list for convenience.)

MySQL doesn't support such string operations natively. You will either need to use a UDF like this, or else create a stored function that iterates over a string parameter concatenating to its return value every digit that it encounters.
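For readers on MySQL 8.0 or later, REGEXP_REPLACE offers a shorter route than nested REPLACE calls or per-character checks. The following is a minimal sketch under that version assumption. It reuses the bars and single_platform_bars tables from the question; the sample phone number and the column aliases are made up for illustration.

```sql
-- Requires MySQL 8.0+; earlier versions have no REGEXP_REPLACE.
-- Strip every non-digit character from an illustrative sample value.
SELECT REGEXP_REPLACE('(555) 123-4567', '[^0-9]', '') AS phone_digits;
-- phone_digits: 5551234567

-- The same expression applied to both sides of the join from the question.
SELECT b.id AS bar_id, b.phone AS bar_phone,
       sp.id AS sp_id, sp.phone AS sp_phone
FROM bars b
JOIN single_platform_bars sp
  ON REGEXP_REPLACE(b.phone, '[^0-9]', '')
   = REGEXP_REPLACE(sp.phone, '[^0-9]', '');
```

As with the nested REPLACE version, the join compares computed values, so an index on phone will not help. Note also that two phone values containing no digits at all both normalize to an empty string and would be reported as duplicates.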

Related

confusion about mysql like search and = search

I ran into this question while searching for something in MySQL. Here is the detailed information.
Say I have a table named test with a column named content. In a specific record, the content column holds:
["
/^\w{2,}/","
/^[a-z][a-z0-9]+$/","
/^[a-z0-9]+$/","
/^[a-z]\d+$/"]
There is a linefeed character at the end of each line (the last line excluded).
So when I used LIKE to search for this record, I wrote a SQL statement like this:
select * from test where `content` like
'[\"\n/^\\\\w{2,}/\",\"\n/^[a-z][a-z0-9]+$/\",\"\n/^[a-z0-9]+$/\",\"\n/^[a-z]\\\\d+$/\"]'
and it returned the right result. But when I changed LIKE to =, the statement didn't work. After several tries, I got this statement that worked:
select * from test where `content` =
'[\"\n/^\\w{2,}/\",\"\n/^[a-z][a-z0-9]+$/\",\"\n/^[a-z0-9]+$/\",\"\n/^[a-z]\\d+$/\"]'
So here is the question:
why on earth do LIKE and = have different escaping strategies? In the LIKE statement I have to use \\\\w and \\\\d, while in the = statement \\w and \\d do just fine.
MySQL LIKE operator to select data based on patterns.
The LIKE operator is commonly used to select data based on patterns. Using the LIKE operator in the right way is essential to increase the query performance.
The LIKE operator allows you to select data from a table based on a specified pattern. Therefore, the LIKE operator is often used in the WHERE clause of the SELECT statement.
MySQL provides two wildcard characters for using with the LIKE operator, the percentage % and underscore _.
The percentage (%) wildcard allows you to match any string of zero or more characters.
The underscore (_) wildcard allows you to match any single character.
Comparison operations result in a value of 1 (TRUE), 0 (FALSE), or NULL. These operations work for both numbers and strings. Strings are automatically converted to numbers and numbers to strings as necessary.
The following relational comparison operators can be used to compare not only scalar operands, but also row operands:
= > < >= <= <> !=
Note: = is the equality operator, while LIKE performs simple pattern matching.
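The difference seen in the question comes from LIKE treating the backslash as its own escape character, on top of the escaping that the string-literal parser already applies; = only ever sees the parsed string. Here is a minimal sketch using one fragment of the question's data, assuming the default sql_mode (NO_BACKSLASH_ESCAPES not enabled). The column aliases are made up for illustration.

```sql
SELECT
  -- Both literals parse to /^\w{2,}/ and = compares them as-is: 1.
  '/^\\w{2,}/' =    '/^\\w{2,}/'   AS eq_two_backslashes,
  -- The pattern parses to /^\\w{2,}/ and LIKE reads \\ as one literal backslash: 1.
  '/^\\w{2,}/' LIKE '/^\\\\w{2,}/' AS like_four_backslashes,
  -- The pattern parses to /^\w{2,}/ and LIKE reads \w as an escaped "w",
  -- so the backslash in the data is left unmatched: 0, as the asker observed.
  '/^\\w{2,}/' LIKE '/^\\w{2,}/'   AS like_two_backslashes;
```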

MySQL - search for patterns

I'm trying to figure out if someone has an elegant way to look for patterns in data stored in a varchar field where a value is not known -- meaning I can't use LIKE. For example, say a table called test looked like this:
id, str
and the data looked like this:
1, YUUUY
2, DDDMM
3, MMMMT
4, XMXMX
and I want to do a select that will return anything where the value of str has a pattern that matches the pattern ABABA. ABABA here shows a pattern and not literal letters. So the only one that matches this pattern would be id = 4. Is there a regular expression that I can use to pattern match like this? To make sure I'm clear regarding the patterns:
The pattern for id=1 is ABBBA.
The pattern for id=2 is AAABB.
The pattern for id=3 is AAAAB.
When running the query, all I will know is the pattern to search for.
Alternatively, if it makes it easier, I can have the table set up like:
id,c1,c2,c3,c4,c5
and the data would look like this:
1,Y,U,U,U,Y
2,D,D,D,M,M
3,M,M,M,M,T
4,X,M,X,M,X
Not sure if that makes it easier, but I think regexp is out the window if the data is set up like that.
No regular expression support in MySQL to do that kind of pattern matching, no.
SQL wasn't specifically designed for pattern matching of strings (or patterns of values in separate columns.)
But... we could come up with something workable, even if it's not a regular expression and it's not elegant.
Assuming we don't have a custom-built user-defined function, and we want to use native MySQL functions and expressions...
And assuming that the patterns we are looking for are guaranteed to consist of only two distinct characters...
And assuming that we're looking at exactly five character positions...
And assuming that the pattern string we're matching to will always begin with the letter 'A', and the "other" letter in the pattern will always be 'B'...
It wouldn't be overly ugly to do something like this:
SELECT t.id
, t.str
FROM test t
WHERE CONCAT('A'
,IF(MID(t.str,2,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,3,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,4,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,5,1)=MID(t.str,1,1),'A','B')
) = 'ABBBA'
The first character in the string is automatically converted to an 'A'.
The second character, if that matches the first character, then it's also an 'A' otherwise it's a 'B'.
We do the same thing for the third, fourth and fifth characters.
Concatenate the 'A' and 'B' characters into a single string, and we can now perform an equality comparison to a pattern string, consisting of 'A' and 'B', starting with an 'A'.
But that is going to fall apart if the stated assumptions aren't true, for example if str is less than five characters in length or if it contains more than two distinct characters. This would see str=XYYZX as matching pattern ABBBA: the first character is an automatic match to 'A', the fifth character matches the first, so it's also an 'A', and all of the other characters don't match the first, so they are 'B', even though they aren't the same as each other.
And so on.
We could add some additional checks.
For example, to guarantee that str is exactly five characters in length...
AND CHAR_LENGTH(t.str)=5
Note that the default collation in MySQL is case insensitive. That means a str value of MmmmM would be converted to 'AAAAA', not 'ABBBA'. And a str value of MmmKk would match 'AAABB'.
Unfortunately, it doesn't look like MySQL supports regex groups. I was hoping you could do something like this to match ABBBA for example:
([A-Z])([A-Z])\2\2\1
Example here: http://regexr.com/3d8gu
It looks like there is a MySQL plugin that might support it:
https://github.com/mysqludf/lib_mysqludf_preg
Here is a real hacky way to do it.
ABBBA (or YUUUY, etc):
SELECT id, str FROM test WHERE
substring(str,1,1) = substring(str,5,1) AND
substring(str,2,1) = substring(str,3,1) AND
substring(str,3,1) = substring(str,4,1) AND
substring(str,1,1) != substring(str,2,1);
AAABB (or DDDMM, etc):
SELECT id, str FROM test WHERE
substring(str,1,1) = substring(str,2,1) AND
substring(str,2,1) = substring(str,3,1) AND
substring(str,4,1) = substring(str,5,1) AND
substring(str,3,1) != substring(str,4,1);
AAAAB (or MMMMT, etc):
SELECT id, str FROM test WHERE
substring(str,1,1) = substring(str,2,1) AND
substring(str,2,1) = substring(str,3,1) AND
substring(str,3,1) = substring(str,4,1) AND
substring(str,4,1) != substring(str,5,1);
You get the picture...
It would be similar if you separated the data into different columns. Instead of comparing substrings you would just compare the columns.
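The same substring-comparison idea can be sketched for ABABA, the pattern the question actually asks about. The pattern_demo table below is a hypothetical stand-in for the question's (id, str) table, created only so the example is self-contained.

```sql
-- Hypothetical demo table mirroring the question's (id, str) layout.
CREATE TEMPORARY TABLE pattern_demo (id INT PRIMARY KEY, str VARCHAR(5));
INSERT INTO pattern_demo VALUES
  (1, 'YUUUY'), (2, 'DDDMM'), (3, 'MMMMT'), (4, 'XMXMX');

-- ABABA: positions 1, 3 and 5 agree; positions 2 and 4 agree;
-- and the two groups differ from each other.
SELECT id, str
FROM pattern_demo
WHERE CHAR_LENGTH(str) = 5
  AND SUBSTRING(str, 1, 1) = SUBSTRING(str, 3, 1)
  AND SUBSTRING(str, 3, 1) = SUBSTRING(str, 5, 1)
  AND SUBSTRING(str, 2, 1) = SUBSTRING(str, 4, 1)
  AND SUBSTRING(str, 1, 1) <> SUBSTRING(str, 2, 1);
-- Returns only the row (4, 'XMXMX').
```

With the c1..c5 layout, the same predicate would read c1 = c3 AND c3 = c5 AND c2 = c4 AND c1 <> c2.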

how do I inner-join an integer substring of a URL to an integer?

I have two MySQL tables in Joomla: categories and Menu.
The field menu.link has values like index.php?option=com_content&view=category&id=175.
The number after the very last equal sign is equal to the field categories.id.
I would like to create INNER JOIN between two tables so that categories.id will be equal to the number in menu.link.
I understand I have to remove all before the number, but how shall I do that?
It seems you are looking for a SQL expression that will extract the id value from your URL string. This is always a dicey proposition because it depends on unpredictable details of the format of the URL.
It's a doubly dicey proposition in MySQL because there aren't any regexp functions that return actual string values. They only return true/false. So you need to use non-regexp string processing functions to extract your data.
That being said, let us hack away. This expression will get that number.
CAST(SUBSTRING_INDEX(menu.link,'view=category&id=',-1) AS UNSIGNED) AS cat_id
The heart of this string-processing hack is the string 'view=category&id='. The SUBSTRING_INDEX function retrieves everything to the right of that string, and the CAST operation takes just the integer.
If the substring is not found, the expression returns zero. That might or might not be what you want. (I said this was dicey!)
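Both behaviors can be checked against literal strings before touching real tables. This is a small sketch: the first URL is the one from the question, the second is a made-up example without the marker string. It uses UNSIGNED because MySQL's CAST accepts SIGNED or UNSIGNED rather than INT.

```sql
-- Marker present: everything after it is '175', which casts cleanly.
SELECT CAST(SUBSTRING_INDEX(
         'index.php?option=com_content&view=category&id=175',
         'view=category&id=', -1) AS UNSIGNED) AS cat_id;
-- cat_id: 175

-- Marker absent: SUBSTRING_INDEX returns the whole string,
-- and casting that non-numeric text yields 0 (with a truncation warning).
SELECT CAST(SUBSTRING_INDEX(
         'index.php?option=com_users&view=login',
         'view=category&id=', -1) AS UNSIGNED) AS cat_id;
-- cat_id: 0
```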
So, to perform the join you'd do something like this:
SELECT menu.whatever,
categories.whatever
FROM menu
JOIN categories
ON categories.id = CAST(SUBSTRING_INDEX(menu.link,'view=category&id=',-1) AS UNSIGNED)
This will perform poorly. But that's probably OK because you won't have tens of thousands of rows in either table.

Mysql query returns no data with escaped \

I'm attempting to query our MySQL database but I'm getting no data when there clearly is data there.
First I query
SELECT id, instruction_link FROM work_instructions WHERE instruction_link LIKE "%\\\\cots-sbs%";
Which returns 100+ lines.
http://tinypic.com/r/ief8td/8
(sorry couldn't post as actual picture, don't have enough rep :(
However if I query
SELECT id, instruction_link FROM work_instructions WHERE instruction_link LIKE "%\\\\cots-sbs\\%";
http://tinypic.com/r/33ksw3q/8
I get no results with the 2nd query. I have no idea what I'm doing wrong here. Seems pretty simple but I can't make any sense of it..
Thanks in advance.
As documented under LIKE:
Note
Because MySQL uses C escape syntax in strings (for example, “\n” to represent a newline character), you must double any “\” that you use in LIKE strings. For example, to search for “\n”, specify it as “\\n”. To search for “\”, specify it as “\\\\”; this is because the backslashes are stripped once by the parser and again when the pattern match is made, leaving a single backslash to be matched against.
\\% is parsed as a string containing a literal backslash followed by a percentage character, which is then interpreted as a pattern containing only a literal percentage sign.
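Following that rule, the trailing backslash needs four backslashes in the literal as well, so that the final % stays a wildcard. Here is a sketch against a made-up sample value (the real instruction_link contents are not shown in the question), assuming the default sql_mode without NO_BACKSLASH_ESCAPES:

```sql
-- The sample value parses to: x\cots-sbs\file
SELECT
  -- \\\\ before the final %: literal backslash, then wildcard. Matches (1).
  'x\\cots-sbs\\file' LIKE '%\\\\cots-sbs\\\\%' AS four_backslashes,
  -- \\% at the end: LIKE sees an escaped, literal percent sign. No match (0).
  'x\\cots-sbs\\file' LIKE '%\\\\cots-sbs\\%'   AS two_backslashes;
```

Applied to the question's query, the pattern would become "%\\\\cots-sbs\\\\%".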

mysql replace last character if match

I have a table called media with a column called accounts_used in which the rows appear in the following format
68146, 67342, 60577, 61506, 67194, 67034, 63484, 49113, 61518, 66971, 67511,
67351, 63621, 67725, 63638, 68141, 66114, 67262, 67537, 67537, 61765, 63701,
67087, 62641, 61294, 67063, 67049, 67038, 67170, 67147, 67289, 61264, 67091,
63690, 63505, 63505, 49172, 52313, 67070, 66945, 67234, 62265, 61368, 67870,
67211, 67586, 49240, 67538, 67538, 67809, 67183, 67164, 62712, 67519, 66895,
67693, 60266, 60266, 67593, 67031, 67137, 62570, 60682, 61195, 67569, 67569,
67069, 62082, 67345, 61748, 61553, 52029, 66877, 62630, 67196, 67196, 67196,
67196, 67196, 67196, 66873, 63677, 68174, 67127, 63594, 67107, 60419, 66601,
68156, 67203, 68161, 60233, 66586, 52654, 63570, 66887, 67191, 60877, 52108,
67131, 61784, 67566, 67162, 67073, 67092, 67064, 60133, 66907, 67559, 66846,
60490, 60347, 66558, 48737, 61539, 67236, 68135, 67238 , 63656, 67585, 67512
If the row has a comma at the end I want to remove this, so for example if the row looks like the following
1,2,3,4,5,6,
I want to replace it to just this
1,2,3,4,5,6
Is this possible to do using just a simple query?
Storing lists of ids in a single column is a bad idea, but since you are doing it, you can fix this with:
update media
set accounts_used = left(accounts_used, length(accounts_used) - 1)
where accounts_used like '%,';
Instead, you should have a MediaAccounts table, with one row per media/account pair.
EDIT:
Possibly, the row ends with a ', ' rather than just a comma:
update media
set accounts_used = left(accounts_used, length(accounts_used) - 2)
where accounts_used like '%, ';
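A different technique, not the one used above, is MySQL's TRIM(TRAILING ... FROM ...) combined with RTRIM, which handles both the ',' and ', ' endings in one statement. This is a sketch only; note that it removes every trailing comma, not just one.

```sql
-- Literal check: trailing space removed first, then the trailing comma.
SELECT TRIM(TRAILING ',' FROM RTRIM('1,2,3,4,5,6, ')) AS cleaned;
-- cleaned: 1,2,3,4,5,6

-- The same expression as an update on the question's table.
UPDATE media
SET accounts_used = TRIM(TRAILING ',' FROM RTRIM(accounts_used))
WHERE RTRIM(accounts_used) LIKE '%,';
```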
We faced a similar string-replacement issue with a large dataset of bibliographic entries, where we needed to trim extraneous punctuation from many strings that had been imported verbatim from another system. Many of the records also contained Unicode characters, so we needed a query that would find the records to update and then update them in a way that was compatible with multibyte characters under MySQL.
In testing with our dataset, I found that searching for the relevant records with MySQL's LEFT() and RIGHT() substring functions performed better than a LIKE pattern match. Additionally, MySQL's LENGTH() function returns the number of bytes in a string rather than the number of characters. The distinction matters for string fields that may contain multibyte sequences, because MySQL's substring functions count characters, not bytes. Using LENGTH() therefore did not work in our case, where many of the strings contained multibyte characters. These requirements resulted in an UPDATE query of the following form:
UPDATE media
SET accounts_used = LEFT(accounts_used, CHAR_LENGTH(accounts_used) - 1)
WHERE RIGHT(accounts_used, 1) = ',';
The query selects records in the media table where the accounts_used column ends with a comma, using the WHERE RIGHT(accounts_used, 1) = ',' clause; RIGHT() returns a substring of the specified length counted from the right of the string. It then trims the last character with LEFT(accounts_used, CHAR_LENGTH(accounts_used) - 1), where LEFT() returns a substring of the specified length counted from the left of the string.
Using the multibyte-aware CHAR_LENGTH() function rather than the basic LENGTH() function was important in our case because so many records contained multibyte characters. If you are only dealing with ASCII or another single-byte character set, LENGTH() would work perfectly well; in that case CHAR_LENGTH() and LENGTH() return the same count and can be used interchangeably. When dealing with data that could contain multibyte characters, or if in doubt, use CHAR_LENGTH() instead, as it returns an accurate character count in either case.
Please note that the column and field names used in the example query above match those noted in the original question, and should be modified as needed to suit your own dataset needs.
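The byte-versus-character difference can be seen on a two-character string. The sketch below builds 'é,' from its UTF-8 bytes with a character set introducer, so it does not depend on the client's connection character set; the variable name and aliases are made up for illustration.

```sql
-- Bytes C3 A9 2C are 'é,' in UTF-8: two characters, three bytes.
SET @s = _utf8mb4 X'C3A92C';

SELECT LENGTH(@s)      AS byte_len,   -- 3
       CHAR_LENGTH(@s) AS char_len,   -- 2
       -- LEFT(@s, 3 - 1) keeps both characters, so the comma survives: 1
       LEFT(@s, LENGTH(@s) - 1) = @s   AS length_version_unchanged,
       -- LEFT(@s, 2 - 1) keeps only the first character: 'é'
       LEFT(@s, CHAR_LENGTH(@s) - 1)   AS char_length_version;
```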