Finding small letter between two capital letters - MySQL - mysql

I've got problem - I need to find every single phrase like AbC (small b, between two Capital letters).
For Example a statement:
Little John had a ProBlEm and need to know how to do tHiS.
I need to select ProBlEm and tHiS (you see, BlE and HiS, one small letter in between two capital).
How can I select this?

In MySQL you can use a binary (to ensure case sensitivity) regular expression to filter for those records that contain such a pattern:
WHERE my_column REGEXP BINARY '[[:upper:]][[:lower:]][[:upper:]]'
However, it is not so straightforward to extract the substrings which match such a pattern from within MySQL. One can use a UDF, e.g. lib_mysqludf_preg, but it's probably a task more suited to being performed within your application layer. In either case, regular expressions can again help to simplify this task.

Firstly you have split the String. Please refer this SO Question
and then search each retrive word like
substring(word,2) LIKE '[A-Z]' COLLATE latin1_general_cs

Related

Performance of LIKE 'xyz%' v/s LIKE '%xyz'

I was wondering how the LIKE operator actually work.
Does it simply start from first character of the string and try matching pattern, one character moving to the right? Or does it look at the placement of the %, i.e. if it finds the % to be the first character of the pattern, does it start from the right most character and starts matching, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow
If there is an index on the column, putting constant characters in the front will lead your dbms to use a more efficient searching/seeking algorithm. But even at the simplest form, the dbms has to test characters. If it is able to find it doesn't match early on, it can discard it and move onto the next test.
The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey' optionally followed by any number of characters. The % is not case sensitive and not accent sensitive and you can use multiple %, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' (or 'Mouse' or 'mousé') in it.
The % is inclusive, meaning that
WHERE name like '%A%'
will return all that starts with an 'A', contain 'A' or end with 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
it will find any value beginning with letter between c and f, inclusive. Meaning it will return any value that start with c, d, e or f. This [] is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with the 3rd position to be a '-' (dash) use two underscores:
WHERE NAME '__-%'
It will return for example: 'Lo-Res'
Finding the values with names ends in 'xyz' use:
WHERE name LIKE '%xyz'
returns anything that ends with 'xyz'
Finding a % sign in a name use brackets:
WHERE name LIKE '%[%]%'
will return for example: 'Top%Movies'
Searching for [ use brackets around it:
WHERE name LIKE '%[[]%'
gives results as: 'New York [NY]'
The database collation's sort order determines both case sensitivety and the sort order for the range of characters. You can optionally use COLLATE to specify collation sort order used by the LIKE operator.
Usually the main performance bottleneck is IO. The efficiency of the LIKE operator can be only important if your whole table fits in the memory otherwise IO will take most of the time.
AFAIK oracle can use indexes for prefix matching. (like 'abc%'), but these index cannot be used for more complex expressions.
Anyway if you have only this kind of queries you should consider using a simple index on the related column. (Probably this is true for other RDBMS's as well.)
Otherwise LIKE operator is generally slow, but most of the RDBMS have some kind of full text searching solution. I think the main reason of the slowness is that LIKE is too general. Usually full text indexes has lots of different options which can tell the database what you really want to search for, and with these additional information the DB can do its task in a more efficient way.
As a rule of thumb I think if you want to search in a text field and you think performance can be an issue, you should consider your RDBMS's full text searching solution, or the real goal is not text searching, but this is some kind of "design side effect", for example xml/json/statuses stored in a field as text, then probably you should consider choosing a more efficient data storing option. (if there is any...)

Ideas for Find and Replace character

I need to search address fields and change one character to upper case if there is an apartment number. So '521 Main St. #3b' would change to '521 Main St. #3B'.
The way I know to do this would be to write a program that loops through the recordset, looks at the address field for the last character to see if it's an alpha, then if the character before it is a numeric, change the case of the last char and update the record.
Is this something that would be quicker/simpler with regular expressions (haven't ever used)?
If so, is this best done from within a programming environmnet or using a text editor such as Textmate or vi ? The data is in MySQL and Excel, but I can export it to a text file.
Thanks.
I solved this using TextMate which, once I began to understand a little regex, was simple. (details here Regex Syntax for making the last character Uppercase in TextMate)
Still, I wonder if something like sed or awk, (which I started to try out) might be a better tool. And the SQL solution that Olexa provided works. I just don't know how to have it apply to the entire recordset.
If the data is stored in MySQL, then it is better to process it there:
UPDATE addresses
SET address = CONCAT(LEFT(address, CHAR_LENGTH(address) - 1), UPPER(RIGHT(address, 1)))
WHERE address REGEXP BINARY '#[[:digit:]]+[[:lower:]]{1}$'
;
I've added BINARY because otherwise REGEXP is not case-sensitive, but BINARY may need to be omitted to support multi-byte strings. In this case, surplus updates will be made, but the result would be correct anyway.
P. S. An example on SQL Fiddle showing which values are affected, and how they are affected: http://sqlfiddle.com/#!2/b29326/1

Mysql searching records containing wildcards

I have a table with tens of thousands of VIN numbers. Many of them look along the lines of this:
6MMTL#A423T######
WVWZZZ3BZ1?######
MPATFS27H??######
SCA2D680?7UH#####
SAJAC871?68H06###
The # represents a digit and the ? a letter (A-Z).
I want to search for the following: 6MMTL8A423T000000.
I am struggling to work out the logic. Should I use a function? Should I use mysql regex?
A regular expression match would be a good way to approach this problem. What you need to do is convert the vin expressions into valid regular expressions that represent the logic you've indicated. Here's a simple way to do that:
replace(replace(vin,'#','[0-9]'),'?','[A-Z]')
This would convert 6MMTL#A423T###### into 6MMTL[0-9]A423T[0-9][0-9][0-9][0-9][0-9][0-9]. Now using this converted format, do a regular expression match query:
select vin
from vins
where '6MMTL8A423T000000' regexp replace(replace(vin,'#','[0-9]'),'?','[A-Z]')
Sample Output: 6MMTL#A423T######
Demo: http://www.sqlfiddle.com/#!2/ee4de/4

Querying a mysql database fetching words with a regexp

I'm using a regexp for fetching a set of words that accomplish the next syntax:
SELECT * FROM words WHERE word REGEXP '^[dcqaahii]{5}$'
My first impression gave me the sensation that it was good till I realized that some letters were used more than contained in the regexp.
The question is that I want to get all words (i.e. of 5 letters) that can be formed with the letters within the brackets, so if I have two 'a' resulting words can have no 'a', one 'a' or even two 'a', but no more.
What should i add to my regexp for avoiding this?
Thanks in advance.
It would probably be better to retrieve all candidates first and post-process, as others have suggested:
SELECT * FROM words WHERE word REGEXP '^[dcqahi]{5}$'
However, nothing is stopping you from doing multiple REGEXPs. You can select 0, 1, or 2 incidences of the letter 'a' with this grungy expression:
'^[^a]*a?[^a]*a?[^a]*$'
So do the pre-filter first and then combine additional REGEXP requirements with AND:
SELECT * FROM words
WHERE word REGEXP '^[dcqahi]{5}$'
AND word REGEXP '^[^a]*a?[^a]*a?[^a]*$'
AND word REGEXP '^[^i]*i?[^i]*i?[^i]*$'
[edit] As an afterthought, I have inferred that for the non-vowels you also want to restrict to 0 or 1 occurrance. So if that's the case, you'd keep going...
AND word REGEXP '^[^d]*d?[^d]*$'
AND word REGEXP '^[^c]*c?[^c]*$'
AND word REGEXP '^[^q]*q?[^q]*$'
AND word REGEXP '^[^h]*h?[^h]*$'
Yuck.
Only solution I can think of would be to use the above SQL you have to get an initial filtered set of data but then loop through it and further filter with some server side code (PHP etc.) which is better suited to doing that kind of logic.
In regular expressions, square brackets [] are merely a character class, like a list of allowed characters. Specifying the same letter twice within the brackets is therefore redundant.
For example the pattern [sed] will match sed, and seed because e is part of the allowed characters. Specifying a character count afterward in braces {} is merely a total count of characters previously allowed by the character class.
The pattern [sed]{3} therefore will match sed but not seed.
I would recommend moving the logic for testing the validity of words from SQL into your program.

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Except for ASCII, the second field also contains Unicode codepoints like U+02C8 (MODIFIED LETTER VERTICAL LINE) and U+02D0 (MODIFIED LETTER TRIANGULAR COLON).
word | ipa
--------+----------
Hallo | haˈloː
IPA | ˌiːpeːˈʔaː
I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.
SELECT * FROM pronunciation WHERE ipa LIKE '%ha?lo%'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha??lo%'; -- 1 row
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row
I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.
How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.
Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.
EDITED to incorporate fix to valid critisism
Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:
select * from mytable
where hex(ipa) rlike concat('(..)*', hex('needle'), '(..)*'); -- looking for 'needle' in haystack, but maintaining hex-pair alignment.
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.
This works for "normal" columns too, you just don't need it.
p.s. #Kieren's (valid) point addressed using rlike to enforce char pairs
I'm not dead-set on using MySQL
Postgres seems to handle it quite fine:
test=# select 'ˌˈʔ' like '___';
?column?
----------
t
(1 row)
test=# select 'ˌˈʔ' ~ '^.{3}$';
?column?
----------
t
(1 row)
If you go down that road, note that in Postgres' ilike operator matches that of MySQL's like. (In Postgres, like is case-sensitive.)
For the MySQL-specific solution, you mind be able to work around by binding some user-defined function (maybe bind the ICU library?) into MySQL.
You have problems with UTF8? Eliminate them.
How many special characters do you use? Are you using only locase letters, am I right? So, my tip is: Write a function, which converts spec chars to regular chars, e.g. "æ" ->"A" and so on, and add a column to the table which stores that converted value (you have to convert all values first, and upon each insert/update). When searching, you just have to convert search string with the same function, and use it on that field with regexp.
If there're too many kind of special chars, you should convert it to multi-char. 1. Avoid finding "aa" in the "ba ab" sequence use some prefix, like "#ba#ab". 2. Avoid finding "#a" in "#ab" use fixed length tokens, say, 2.