SQL RegEx to handle comma separated IDs - mysql

I have a string that denotes which users are allowed to access something. For instance, if user 1, user 2, and user 3 could access it, the accessibility column would contain 1,2,3. If only user 1 could access it, it would only be 1 and so forth.
I know I can't do a simple CONTAINS clause because searching for 1 could return true for 14,2,3. How would I get a regex to accommodate when there is a comma on both sides, on one side, or neither of the ID number?
Here is a sample of what I'm trying to do
DataID: 1
Accessibility: "1,2,3,4,5"
Data: "secret stuff"
DataID: 2
Accessibility: "5,6,7,8,9"
Data: "more secret stuff"
I need to tell the regex to search for a number and to make sure it's at both the beginning and the end of the string if it has no commas around it, at the beginning of the string if it only has a comma after it, or at the end of the string if it only has a comma before it. If it has commas on both sides, that's fine because it's in the middle of the string.
I know what I need to do, but don't know how to achieve it. Thanks.

First, you have a really bad data structure for several reasons:
The proper way to store lists in SQL is using tables, not strings.
The proper way to store integers in SQL is as integers, not strings.
Ids should be defined with a proper foreign key relationship, which you cannot do when the id is stored in a string.
Sometimes, we are stuck with other people's bad design decisions. That is, we are unable to create a proper junction table, with one row for each DataId and each user who has access to it.
In that situation, you can use the find_in_set() functionality in MySQL. This does not require a regular expression. You can just write:
where find_in_set($user, accessibility) > 0
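For comparison, the junction-table design this answer recommends can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than MySQL so that it runs anywhere, and the table and column names (data, data_access, data_id, user_id) and the helper accessible_data_ids are made up for the example.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (data_id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE data_access (
        data_id INTEGER NOT NULL REFERENCES data(data_id),
        user_id INTEGER NOT NULL,
        PRIMARY KEY (data_id, user_id)
    );
    INSERT INTO data VALUES (1, 'secret stuff'), (2, 'more secret stuff');
""")

# One row per (data item, user) pair instead of a "1,2,3,4,5" string.
rows = [(1, u) for u in (1, 2, 3, 4, 5)] + [(2, u) for u in (5, 6, 7, 8, 9)]
conn.executemany("INSERT INTO data_access VALUES (?, ?)", rows)


def accessible_data_ids(user_id):
    """Return the data IDs a user may access, using a plain integer comparison."""
    cur = conn.execute(
        "SELECT data_id FROM data_access WHERE user_id = ? ORDER BY data_id",
        (user_id,),
    )
    return [r[0] for r in cur]
```

With this layout, looking up user 1 can never accidentally match user 14, and the lookup column can be indexed.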

Since A-Z, a-z, 0-9, and underscore are considered word characters, a comma or the start or end of the string next to an ID forms a word boundary, so you could generalize like this:
-- word-bound the user ID, e.g. 1 becomes \b1\b
WHERE Accessibility REGEXP CONCAT('\\b', $user, '\\b')
(This assumes MySQL 8.0.4 or later, whose regex engine supports \b; older versions use [[:<:]] and [[:>:]] as the word-boundary markers instead.)
That way it doesn't matter if there is a comma leading, trailing, or if it's the sole occupant of the search subject. But it definitely cannot match 14 or 21, etc. \b1\b will only match a solo 1, \b14\b will only match the whole word 14, etc.
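The word-boundary behaviour is easy to check outside the database. Here is a minimal sketch using Python's re module, whose \b has the same meaning; the function name user_in_list is made up for the illustration.

```python
import re


def user_in_list(user_id, accessibility):
    """Return True if user_id appears as a whole number in the comma-separated list."""
    # \b matches between a word character (digit) and a non-word character
    # (comma) or the start/end of the string.
    pattern = r"\b" + re.escape(str(user_id)) + r"\b"
    return re.search(pattern, accessibility) is not None
```

For instance, user_in_list(1, "14,2,3") is False, while user_in_list(1, "1,2,3") and user_in_list(1, "1") are both True.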

Related

Mysql compare comma-separated field with single string

So a field called schools in the database might have a value of:
'13,121,112,1212'
I'm using that to show the potential for a mistake.
Suppose I'm looking for a value of 12 in that field. The commas denote a "whole number" and I don't want to match 112 or 1212
Is there a more elegant match than this?
#compare = 12;
WHERE CONCAT(schools,',') LIKE CONCAT('%',compare,',%')
I was recently impressed by the GROUP_CONCAT function but this is kind of in reverse of that. Thanks!
For this simple case you can use FIND_IN_SET();
WHERE FIND_IN_SET('13', schools);
Note though that there is no good indexing for columns with comma separated text, so the queries will be much slower than a normalized database.
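If FIND_IN_SET() is not available, a LIKE-based match can be made safe by wrapping both the column and the search value in commas. The sketch below demonstrates the idea with Python's sqlite3 module (so it uses || for concatenation where MySQL would use CONCAT); the table name t and the helper rows_with_school are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, schools TEXT)")
conn.executemany(
    "INSERT INTO t (schools) VALUES (?)",
    [("13,121,112,1212",), ("12,40",), ("7,12",)],
)


def rows_with_school(school_id):
    """Return ids of rows whose comma-separated list contains school_id exactly."""
    # ',13,121,112,1212,' LIKE '%,12,%' is false, because every element is
    # delimited by a comma on BOTH sides before the comparison.
    cur = conn.execute(
        "SELECT id FROM t "
        "WHERE ',' || schools || ',' LIKE '%,' || ? || ',%' "
        "ORDER BY id",
        (str(school_id),),
    )
    return [r[0] for r in cur]
```

Searching for 12 returns only the second and third rows here, not the first one containing 112 and 1212. Like FIND_IN_SET(), this still cannot use an ordinary index.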

Delete all characters before and after quotation marks

I have a CSV file, which has two columns and 4500 rows. In one column, I have several phrases that are surrounded by quotation marks. I need to delete all the text that comes before and after the quotation marks.
For example:
How would you say "Hello, my Friend" when speaking outside?
should become "Hello, my Friend"
I also have several rows that have the word NULL in the second column. I need these rows deleted in full.
What's the best way of doing something like this? I have been looking at regular expressions, but I'm not sure if they are flexible enough to do what I want to do, or how you would use them on a CSV file (I need the table structure to remain).
EDIT:
1) At the moment I am just using Apple Numbers, but I know that won't do it, so I am open to any suggestions. It must support Kanji characters.
2) I have removed all the NULL rows, so that is no longer needed (I simply added a column of numbers, sorted the table so all the NULLs were together, deleted them and then sorted back by the column of numbers).
Find a text editor that supports regular expression search and replace.
Something like this would match ,NULL in the second column: ^.*,NULL.*$. Replace it with "DELETEMEDELETEME" to mark the line, or with an empty string, or find a way to have it match on '\n' or '\r' to catch the line break and remove the entire line completely.
Stripping out parts of the quoted string might work like this:
^((?:.*,){n})(.*)(\".*\")(.*)(,.*)$ replaced with \1\3\5 where n is the number of columns preceding the one you want to edit. The outer group is there because a repeated group only captures its last repetition. Repeat (.*,) if {n} is not available. It will depend on the regex flavor of your tool.
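If a scripting language is an option, the same clean-up can be sketched in a few lines of Python. This is only an illustration: it assumes the rows have already been parsed (for example with csv.reader on a file opened with encoding="utf-8"), that each cell holds at most one quoted phrase, and that the second column is the one that may contain NULL. The names keep_quoted and clean_rows are made up.

```python
import re

# A double-quoted phrase: a quote, any non-quote characters, a closing quote.
QUOTED = re.compile(r'"[^"]*"')


def keep_quoted(text):
    """Return only the first quoted phrase in text, or text unchanged if none."""
    match = QUOTED.search(text)
    return match.group(0) if match else text


def clean_rows(rows, column=0):
    """Drop rows whose second column is NULL and trim the chosen column."""
    cleaned = []
    for row in rows:
        if len(row) > 1 and row[1] == "NULL":
            continue  # skip the whole row
        row = list(row)
        row[column] = keep_quoted(row[column])
        cleaned.append(row)
    return cleaned
```

Python strings are Unicode, so Kanji inside or around the quotation marks are handled the same way as Latin text. Writing the result back with csv.writer keeps the table structure.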

Performance of LIKE 'xyz%' v/s LIKE '%xyz'

I was wondering how the LIKE operator actually works.
Does it simply start from first character of the string and try matching pattern, one character moving to the right? Or does it look at the placement of the %, i.e. if it finds the % to be the first character of the pattern, does it start from the right most character and starts matching, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow
If there is an index on the column, putting constant characters in the front will lead your dbms to use a more efficient searching/seeking algorithm. But even at the simplest form, the dbms has to test characters. If it is able to find it doesn't match early on, it can discard it and move onto the next test.
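To see why a constant prefix helps, here is a rough sketch in Python that treats a sorted list as a stand-in for an index. It is an analogy under that assumption, not a description of how any particular DBMS implements LIKE; the function names are made up.

```python
import bisect


def prefix_matches(sorted_values, prefix):
    """Emulate LIKE 'prefix%' on a sorted 'index': seek, then read a contiguous run."""
    start = bisect.bisect_left(sorted_values, prefix)  # binary search, O(log n)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees no later value can match
        matches.append(value)
    return matches


def suffix_matches(values, suffix):
    """Emulate LIKE '%suffix': sort order does not help, so every value is tested."""
    return [value for value in values if value.endswith(suffix)]
```

The prefix search touches only the matching run plus one extra value, while the suffix search has to examine every entry.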
The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey', optionally followed by any number of characters. Whether the match is case sensitive or accent sensitive depends on the collation, not on the % itself. You can use multiple % wildcards, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' in them (and, under a case-insensitive, accent-insensitive collation, also 'Mouse' or 'mousé').
The % matches zero or more characters, meaning that
WHERE name LIKE '%A%'
will return all values that start with 'A', contain 'A', or end with 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
will find any value beginning with a letter between c and f, inclusive, meaning it will return any value that starts with c, d, e or f. This [] syntax is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with a '-' (dash) in the 3rd position, use two underscores:
WHERE name LIKE '__-%'
It will return for example: 'Lo-Res'
To find the values whose names end in 'xyz', use:
WHERE name LIKE '%xyz'
returns anything that ends with 'xyz'
To find a % sign in a name, use brackets:
WHERE name LIKE '%[%]%'
will return for example: 'Top%Movies'
To search for [, put brackets around it:
WHERE name LIKE '%[[]%'
gives results as: 'New York [NY]'
The database collation's sort order determines both case sensitivity and the sort order for the range of characters. You can optionally use COLLATE to specify the collation sort order used by the LIKE operator.
Usually the main performance bottleneck is I/O. The efficiency of the LIKE operator matters only if your whole table fits in memory; otherwise I/O will take most of the time.
AFAIK Oracle can use indexes for prefix matching (LIKE 'abc%'), but these indexes cannot be used for more complex expressions.
Anyway, if you have only this kind of query, you should consider using a simple index on the related column. (This is probably true for other RDBMSs as well.)
Otherwise the LIKE operator is generally slow, but most RDBMSs have some kind of full-text search solution. I think the main reason for the slowness is that LIKE is too general. Full-text indexes usually have lots of options that tell the database what you really want to search for, and with this additional information the DB can do its task more efficiently.
As a rule of thumb, if you want to search in a text field and you think performance can be an issue, you should consider your RDBMS's full-text search solution. If the real goal is not text searching and this is some kind of "design side effect" (for example XML/JSON/statuses stored in a field as text), then you should probably consider a more efficient data storage option (if there is any).

How do I assign a variable to each letter of a string in MySQL?

I am trying to figure out a way of doing an "anagram" function as a stored procedure on MySQL. Let's say I have a database containing all the words in the dictionary - I want to enter a parameter of some letters as a VARCHAR and get back a list of words which make up an anagram of those letters.
I guess what I'm sort of saying is, how do I run an SQL command to say "Select all words which are the same length as the parameter AND contain each of the letters in the parameter".
I have explored the string functions available (http://www.hscripts.com/tutorials/mysql/string-function.php). I'm sure these can be used in conjunction in some way but can't quite get the syntax right when it gets complicated.
I am new to SQL, and it just seems like the String functions available are very limited. Any help would be greatly appreciated :)
You don't; it's not a sensible thing to ask a relational database to do.
However, if someone was forcing me at gunpoint to implement anagram finding using a relational database, I would denormalize it like this:
word | sorted
-----|-------
bar | abr
bra | abr
keel | eekl
leek | eekl
Where "sorted" consists of all of the letters in "word", sorted using any rule you like as long as it's a total order. You would use something other than SQL to compute that part.
Then you could find anagrams with something like this:
SELECT w2.word AS anagram
FROM words w1
JOIN words w2 ON w1.sorted=w2.sorted
WHERE w1.word = 'leek'
AND w2.word <> w1.word
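The "sorted" column is the part that would be computed outside SQL. A minimal sketch of that computation in Python, along with an in-memory equivalent of the self-join, might look like this; the function names are made up for the illustration.

```python
from collections import defaultdict


def signature(word):
    """Return the letters of word in sorted order, e.g. 'leek' -> 'eekl'."""
    return "".join(sorted(word))


def find_anagrams(words, target):
    """Return the words that share target's signature, excluding target itself."""
    by_signature = defaultdict(list)
    for word in words:
        by_signature[signature(word)].append(word)
    return [word for word in by_signature[signature(target)] if word != target]
```

Each word's signature would be stored once in the "sorted" column when the word is inserted, so the query itself only needs an equality join.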
SQL is probably not the right place to do this; you should do it on the front end.
First of all, consider the properties of an anagram: it will be the same length as the letters you were given. You can start by retrieving the dictionary words of that length.
Instead of creating a variable per letter, consider using an array.
Each letter maps to an index (a=0, b=1, c=2, d=3, etc.). Each time you run into that letter, increase the value for that bucket, so for the word "dad" you'll end up with a structure that looks like this:
arr[0]=1, arr[1]=0, arr[2]=0, arr[3]=2, arr[4]=0 and so on...
Now you can just check whether each candidate word produces the same value for every item in the array.
This is not impossible in SQL: you can represent that kind of logic in the database, for example with another table that references the dictionary word and stores the array in each tuple. Then you can just retrieve all the items with the same values.
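The bucket-counting idea can be sketched in Python as follows. This assumes plain a-z letters (anything else is ignored), and the function names are made up for the example.

```python
def letter_counts(word):
    """Return a 26-element list where index 0 counts 'a', index 1 counts 'b', etc."""
    counts = [0] * 26
    for ch in word.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return counts


def is_anagram(first, second):
    """Two words are anagrams if they have equal length and equal letter counts."""
    return len(first) == len(second) and letter_counts(first) == letter_counts(second)
```

For "dad" the first five buckets come out as 1, 0, 0, 2, 0, matching the structure described above.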

Using REGEX to alter field data in a mysql query

I have two databases, both containing phone numbers. I need to find all instances of duplicate phone numbers, but the formats of database 1 vary wildly from the format of database 2.
I'd like to strip out all non-digit characters and just compare the two 10-digit strings to determine if it's a duplicate, something like:
SELECT b.phone as barPhone, sp.phone as SPPhone FROM bars b JOIN single_platform_bars sp ON sp.phone.REGEX = b.phone.REGEX
Is such a thing even possible in a mysql query? If so, how do I go about accomplishing this?
EDIT: Looks like it is, in fact, a thing you can do! Hooray! The following query returned exactly what I needed:
SELECT b.phone, b.id, sp.phone, sp.id
FROM bars b JOIN single_platform_bars sp ON REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(b.phone,' ',''),'-',''),'(',''),')',''),'.','') = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')',''),'.','')
MySQL doesn't support returning the "match" of a regular expression (at least before version 8.0, which added REGEXP_REPLACE and REGEXP_SUBSTR). The MySQL REGEXP operator returns a 1 or 0, depending on whether an expression matched a regular expression test or not.
You can use the REPLACE function to replace a specific character, and you can nest those. But it would be unwieldy for all "non-digit" characters. If you want to remove spaces, dashes, open and close parens e.g.
REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')','')
One approach is to create user defined function to return just the digits from a string. But if you don't want to create a user defined function...
This can be done in native MySQL. This approach is a bit unwieldy, but it is workable for strings of "reasonable" length.
SELECT CONCAT(IF(SUBSTR(sp.phone,1,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,1,1),'')
,IF(SUBSTR(sp.phone,2,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,2,1),'')
,IF(SUBSTR(sp.phone,3,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,3,1),'')
,IF(SUBSTR(sp.phone,4,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,4,1),'')
,IF(SUBSTR(sp.phone,5,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,5,1),'')
) AS phone_digits
FROM single_platform_bars sp
To unpack that a bit... we extract a single character from the first position in the string, check if it's a digit, if it is a digit, we return the character, otherwise we return an empty string. We repeat this for the second, third, etc. characters in the string. We concatenate all of the returned characters and empty strings back into a single string.
Obviously, the expression above is checking only the first five characters of the string, you would need to extend this, basically adding a line for each position you want to check...
And unwieldy expressions like this can be included in a predicate (in a WHERE clause). (I've just shown it in the SELECT list for convenience.)
MySQL doesn't support such string operations natively. You will either need to use a UDF, or else create a stored function that iterates over a string parameter, appending to its return value every digit that it encounters.
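The logic of such a function is simple. Here it is sketched in Python rather than as a MySQL stored function, purely to show the iteration; digits_only and same_phone are hypothetical names, and only the ASCII digits 0-9 are kept.

```python
def digits_only(text):
    """Walk the string and keep only the characters '0' through '9'."""
    result = ""
    for ch in text:
        if "0" <= ch <= "9":
            result += ch
    return result


def same_phone(first, second):
    """Compare two phone numbers after stripping every non-digit character."""
    return digits_only(first) == digits_only(second)
```

With this, "(555) 123-4567" and "555.123.4567" both normalize to "5551234567" and compare equal.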