Extracting a substring matched in Regex and bounded by spaces - mysql

I am trying to parse dates that were entered in various formats and contexts, and that may or may not be present in a given record.
I can SELECT candidate rows with:
SELECT * FROM table WHERE column REGEXP '[-|.|/][0-9][0-9][-|.|/]' ;
This will indeed select records that read something like
I was on top of mount Everest (2010-10-10)
i went to see the doctor on 13/12/10 and she told me I was in great shape.
where the matched values are -10- and /12/ for the first and second records respectively.
Now, I want to extract the date from the column. Not merely the -10- or /12/ but the full date fragments 2010-10-10 or 13/12/10, i.e. the matched expression expanded backwards up to a space or a parenthesis, and expanded forward up to a space or a parenthesis.
Sorry if this is obvious - I am not familiar with REGEX.

You will have to find a pattern for the date input. You can use a regex in your WHERE clause, but you will need to isolate the date somehow. Is it always the last part of the column?
Once you have isolated the location, you can do a CASE-style select:
select case
    when right(date,4) between 1900 and 2200 then right(date,10) #mm/dd/yyyy
    when left(date,4) between 1900 and 2200 then concat(left(right(date,5),2), "/", right(date,2))
end as date
That kind of ordeal.
EDIT:
SET @fieldName = "I was ON top of mount Everest (2010-10-22)";
SELECT IF(
    STR_TO_DATE(
        CONCAT(
            RIGHT(SUBSTRING_INDEX(@fieldName,"-",1),4), "-", RIGHT(SUBSTRING_INDEX(@fieldName,"-",2),2), "-", LEFT(SUBSTRING_INDEX(@fieldName,"-",-1),2)
        ), '%Y-%m-%d'
    ) IS NULL,
    "bad date",
    "good date");
But in place of "bad date" and "good date", you keep chaining that style to work through all the variants of dates.
The best solution, though, is to store the date in a separate column in a fixed format as it is entered, if you can.
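The chaining idea (try one date layout, fall back to the next) can be sketched outside MySQL. This is a minimal Python illustration, not MySQL code; the list of formats, and in particular reading 13/12/10 as day/month/two-digit year, are assumptions made for the example.

```python
from datetime import datetime

# Candidate layouts, tried in order. These three are assumptions for the
# example; a real data set would need its own list.
FORMATS = ("%Y-%m-%d", "%d/%m/%y", "%m/%d/%Y")

def parse_date(fragment):
    """Return a date for the first layout that fits, or None if none does."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(fragment, fmt).date()
        except ValueError:
            continue  # this layout does not fit, try the next one
    return None
```

Each failed attempt plays the role of a STR_TO_DATE call returning NULL, and the loop replaces the nested IF chain.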

The proper REGEX (in this case) is [0-9+-]+[-|.|/][0-9][0-9][-|.|/]+[0-9+-]+

Your pattern [0-9+-]+[-./][0-9][0-9][-./]+[0-9+-]+ would match stuff like +-+-.99///.///-++++, is that really what you want?
Consider using
(?:(?P<year>\d{3,4})|(\d{1,2}))(?P<sep>[-./])\d{1,2}(?P=sep)(?(year)\d{1,2}|\d{1,4})
instead. It doesn't allow mixed separators like 1.2-2014, and doesn't allow more than one number to have more than 2 digits like 2010-10-2010.
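The suggested pattern uses named groups and a conditional, a syntax that Python's re module accepts; I am not certain that MySQL's own regex engine does, so the following is only a way to check the pattern's behaviour. The helper names extract_date and is_date are made up for this sketch.

```python
import re

# The pattern from the answer, split over two lines for readability.
DATE_PATTERN = re.compile(
    r"(?:(?P<year>\d{3,4})|(\d{1,2}))(?P<sep>[-./])\d{1,2}(?P=sep)"
    r"(?(year)\d{1,2}|\d{1,4})"
)

def extract_date(text):
    """Return the first date-like fragment found in text, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None

def is_date(fragment):
    """True only if the whole fragment is a date in the pattern's sense."""
    return DATE_PATTERN.fullmatch(fragment) is not None
```

With the two sample records from the question, extract_date returns 2010-10-10 and 13/12/10. The rejection of inputs such as 1.2-2014 applies when the whole fragment has to match (is_date); an unanchored search can still find a shorter date inside a longer run of digits.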

Related

Why isn't MySQL REGEXP filtering out these values?

So I'm trying to find what "special characters" have been used in my customer names. I'm updating this query one character at a time to find them all, but it's still showing all customers with a - despite me trying to exclude that in the query.
Here's the query I'm using:
SELECT * FROM customer WHERE name REGEXP "[^\da-zA-Z\ \.\&\-\(\)\,]+";
This customer (and many others with a dash) are still showing in the query results:
Test-able Software Ltd
What am I missing? Based on that regexp, shouldn't that one be excluded from the query results?
Testing it on https://regex101.com/r/AMOwaj/1 shows there is no match.
Edit - So I want to FIND any which have characters other than the ones in the regex character set. Not exclude any which do have these characters.
Your code checks if the string contains any character that does not belong to the character class, while you want to ensure that none does belong to it.
You can use ^ and $ to check the whole string at once:
SELECT * FROM customer WHERE name REGEXP '^[^\da-zA-Z .&\-(),]+$';
This would probably be simpler expressed with NOT, and without negating the character class:
SELECT * FROM customer WHERE name NOT REGEXP '[\da-zA-Z .&\-(),]';
Note that you don't need to escape all the characters within the character class, except probably for -.
Use [0-9] or [[:digit:]] to match digits irrespective of MySQL version.
Use the hyphen where it can't make part of a range construction.
Fix the expression as
SELECT * FROM customer WHERE name REGEXP "[^0-9a-zA-Z .&(),-]+";
If the entire text should match this pattern, enclose with ^ / $:
SELECT * FROM customer WHERE name REGEXP "^[^0-9a-zA-Z .&(),-]+$";
- implies a range except if it is first. (Well, after the "not" (^).)
So use
"[^-0-9a-zA-Z .&(),]"
I removed the + at the end because you don't really care how many; this way it will stop after finding one.
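Why the dash slips through can be checked outside MySQL. The sketch below assumes that MySQL, in its default SQL mode, strips the backslashes from the string literal before the regex engine sees the pattern, so that \d becomes a plain d and \&\-\( becomes the range &-(. It is written in Python; the names AS_RECEIVED, FIXED and unexpected_characters are made up for the example.

```python
import re

# The pattern as the regex engine presumably receives it once the
# backslashes are gone (assumption): "&-(" is now a range, "d" is a letter.
AS_RECEIVED = re.compile(r"[^da-zA-Z .&-(),]+")

# Hyphen moved to the end of the class and digits spelled out as 0-9.
FIXED = re.compile(r"[^0-9a-zA-Z .&(),-]+")

def unexpected_characters(pattern, name):
    """List the runs of characters in name that fall outside the class."""
    return pattern.findall(name)
```

For Test-able Software Ltd, the first pattern reports the dash as unexpected while the second reports nothing, which matches the behaviour described in the question.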

Use of REGEXP_SUBSTR to get date values from string

I'm looking for the REGEXP_SUBSTR code that gets dates in a format like '06-11-2014 - 05-12-2014' or '01/11/2019 - 30/11/2019' from a string. The first date is the start date and the second date is the end date. It would be extremely helpful to understand how REGEXP_SUBSTR works in this case and also why. I want to get the string with the two dates, but then I want both dates to be in their own column.
A record looks like this:
Medium - nl (06-11-2014 - 05-12-2014) ruimte: Standaard (5.000 MB).
Although text can be shorter or longer the two dates between brackets are always there.
The code below gets the first one, but only if it's with '-'. I want both '-' and '/' variants displayed.
REGEXP_SUBSTR(description, '[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
Thanks a lot for any and all help.
Since you are using MySQL 8+, it means you also have access to the REGEXP_REPLACE function, which is suitable for isolating the portion of the string which contains the two dates. In the CTE below, I isolate the date string, then in a subquery on that CTE, I fish out the two dates in separate columns using SUBSTRING_INDEX.
WITH cte AS (
    SELECT
        text,
        -- backslashes are doubled because the MySQL string literal consumes one level of escaping
        REGEXP_REPLACE(text, '^.*\\(([0-9]{2}-[0-9]{2}-[0-9]{4} - [0-9]{2}-[0-9]{2}-[0-9]{4})\\).*$', '$1') AS dates
    FROM yourTable
)
SELECT
    text,
    SUBSTRING_INDEX(dates, ' - ', 1) AS first_date,
    SUBSTRING_INDEX(dates, ' - ', -1) AS second_date
FROM cte;
Here is an explanation of the regex pattern used:
^                            from the start of the string
.*                           match any content, until hitting
\(                           '(' which is followed by
(                            (capture what follows)
[0-9]{2}-[0-9]{2}-[0-9]{4}   a single date
' - '                        the literal separator: space, hyphen, space
[0-9]{2}-[0-9]{2}-[0-9]{4}   another single date
)                            (stop capture)
\)                           ')'
.*                           match the remainder of the content
$                            end of the string
Note that we include a pattern which matches the entire input, which is a requirement since we want to use a capture group. Also, note that REGEXP_SUBSTR might have been viable here, but it could run the risk that you get false positives, in the event that a date could appear elsewhere besides the terms in parentheses.
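The capture-group idea can be checked outside MySQL as well. The Python sketch below is an illustration rather than a MySQL solution: it captures each date in its own group and, as an extension the question asked for, accepts both - and / inside a date. The names DATE, RANGE_PATTERN and extract_date_range are made up for the example.

```python
import re

# One date: two digits, separator, two digits, separator, four digits.
DATE = r"\d{2}[-/]\d{2}[-/]\d{4}"

# Two dates separated by " - " and enclosed in literal parentheses.
RANGE_PATTERN = re.compile(r"\((" + DATE + r") - (" + DATE + r")\)")

def extract_date_range(description):
    """Return (start_date, end_date) as strings, or None if not found."""
    match = RANGE_PATTERN.search(description)
    return match.groups() if match else None
```

Requiring the surrounding parentheses keeps the match away from dates that might appear elsewhere in the text, which is the false-positive risk mentioned above.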

MySQL Invoice numbers range with count

Firstly I want this to be purely done with MySQL query.
I have a series of Invoice numbers
invoice_number
INV001
INV002
INV003
INV004
INV005
001
002
003
006
007
009
010
INVOICE333
INVOICE334
INVOICE335
INVOICE337
INVOICE338
INVOICE339
001INV
002INV
005INV
009INV
I want to output something like this
from_invoice_no   to_invoice_no   total_invoices
INV001            INV005          5
001               010             7
INVOICE333        INVOICE339      6
001INV            009INV          4
The invoice number pattern cannot be fixed. They can change in future
Please help me to achieve this.
Thanks in advance.
I will first show a general idea how to solve this problem and provide some code which will be ugly, but easily understandable. Then I'll explain what the issues are and how to remedy them.
STEP 1: Deriving the grouping criterion
For the first step, I assume you have the right (privilege) to create an additional column in your table. Let us name it invoice_text. Now, the general idea is to remove all digits from the invoice number so that only the "text pattern" remains. Then we can group by the text pattern.
Assuming that you have already created the column mentioned above, you could do the following:
UPDATE Invoices SET invoice_text = REPLACE(invoice_number, '0', '');
UPDATE Invoices SET invoice_text = REPLACE(invoice_text, '1', '');
UPDATE Invoices SET invoice_text = REPLACE(invoice_text, '2', '');
...
UPDATE Invoices SET invoice_text = REPLACE(invoice_text, '9', '');
After having done that, you will have the pure text pattern without digits in invoice_text and can use that for grouping:
SELECT COUNT(invoice_number) AS total_invoices FROM Invoices
GROUP BY invoice_text;
This is nice, but it is not yet what you wanted. It does not show the first and last invoice number for each group.
STEP 2: Deriving the first and last invoice for each group
For this step, create one more column in your table. Let us name it invoice_digits. As the name implies, it is meant to take only the pure invoice number without the "pattern text".
Assuming you have that column, you could do the following:
UPDATE Invoices SET invoice_digits = REPLACE(invoice_number, 'A', '');
UPDATE Invoices SET invoice_digits = REPLACE(invoice_digits, 'B', '');
UPDATE Invoices SET invoice_digits = REPLACE(invoice_digits, 'C', '');
...
UPDATE Invoices SET invoice_digits = REPLACE(invoice_digits, 'Z', '');
Now, you can use that column to get the minimum and maximum invoice number (without "pattern text"):
SELECT
    MIN(invoice_digits) AS from_invoice_no,
    MAX(invoice_digits) AS to_invoice_no,
    COUNT(invoice_number) AS total_invoices
FROM Invoices
GROUP BY invoice_text;
Problems and how to solve them
1) According to your question, you want to get the minimum and maximum full invoice number text. The solution above will show only the minimum and maximum invoice number text without the text parts, i.e. only the digits.
We could remedy this by doing a further JOIN, but since I can very well imagine that you won't insist on this :-), and since it won't make the general idea more clear, I am leaving this to you. If you are interested, let us know.
2) It might be difficult to decide what a digit (i.e. what the actual invoice number) is. For example, if you have invoice numbers like INV001, INV002, this will be no problem, but what if you have INV001/001, INV001/002, INV002/003 and so on? In this example, my code would yield 001001, 001002, 002003 as actual invoice numbers and use that to decide what the minimum and maximum numbers are.
This might not be what you want to do in that case. The only way around this is that you thoroughly think about what you should consider a digit and what not, and to adapt my code accordingly.
3) My code currently uses string comparisons to get the minimum and maximum invoice numbers. This may yield other results than comparing the values as numbers. If you are wondering what that means: Compare '19' to '9' as string, and compare 19 to 9 as number.
If this is a problem, then use MySQL's CAST to convert the text to a number before feeding it to MAX or MIN. But please be aware that this has its own caveats:
If you have very long invoice numbers with so many digits that they don't fit into MySQL's numeric data types, this method will fail. It will also fail if you have defined a character like / to be digits (due to the issues described in 2)) since MySQL can't convert this into a number.
Instead of converting to numbers, you can also pad the values in invoice_digits with leading zeroes, for example using MySQL's LPAD function. This will avoid the problems described above and sort the numbers as expected, even if they include non-digits like /, but you will have to know the maximum length of the digit string in advance.
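The difference between string and numeric comparison, and the effect of zero padding, can be seen in a few lines. This is a Python illustration of the principle, not MySQL code; zfill stands in for LPAD with '0'.

```python
digit_strings = ["9", "19", "100"]

# Compared as strings, character by character: "9" sorts last.
max_as_string = max(digit_strings)

# Compared by numeric value, similar to applying CAST before MAX.
max_as_number = max(digit_strings, key=int)

# Padded to a common width, string comparison agrees with numeric order.
width = max(len(s) for s in digit_strings)
padded = [s.zfill(width) for s in digit_strings]
max_padded = max(padded)
```

Here max_as_string is "9", while both max_as_number and max_padded are "100".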
4) The code is ugly! Do you really have to remove all possible characters from A to Z one by one by doing UPDATE statements to get the digit string?
Actually, it is even worse. I just have assumed that you only have the "text characters" A to Z in your invoices. But there could be any character Unicode defines: Russian or Chinese ones, special characters, in other words: thousands of different characters.
Unfortunately, AFAIK, MySQL before version 8.0 does not provide a REGEX-REPLACE function (MySQL 8.0 and later have REGEXP_REPLACE built in). On older versions, I don't see any chance to get this problem solved unless you extend MySQL with an appropriate UDF (user defined function). There are some cool guys out there who have recognized the problem and have added such functions to MySQL. Since recommending libraries seems to be discouraged on SO, just google for "mysql regex replace".
When having extended MySQL that way, you can replace the ugly bunch of UPDATE statements which remove the digits / the text from the invoice number by a single one (using a REGEX, you can replace all digits or all non-digits at once).
For the sake of completeness, you could avoid the many UPDATE statements by doing UPDATE ... SET ... = REPLACE(REPLACE(REPLACE(...))) and thus applying all updates with one statement. But this is even more ugly and error prone, so if you are serious about your problem, you'll really have to extend MySQL by a REGEX-REPLACE.
5) The solution will only work if you have the privilege to create new columns in the table.
This is true for the solution as-is. But I have chosen to go that way solely because it makes the general idea clear and understandable. Instead of adding columns to your original table, you could also create a new table where you store the pure text / digits (this table might be a temporary one).
Furthermore, since MySQL supports grouping by computed values, you don't need additional columns / tables at all. You should decide by yourself what is the best way to go.
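The overall idea (derive a text pattern, group by it, then take the lowest and highest number per group) can be sketched in Python to check the logic before writing the SQL. Two things here are assumptions of the sketch and not part of the answer above: the digit run is replaced by a placeholder instead of being deleted, because deleting digits would map both INV001 and 001INV to INV, and every invoice number is assumed to contain at least one digit.

```python
import re
from collections import defaultdict

def invoice_ranges(invoice_numbers):
    """Return (from_invoice_no, to_invoice_no, total_invoices) per pattern."""
    groups = defaultdict(list)
    for number in invoice_numbers:
        # "INV001" -> "INV#", "001INV" -> "#INV" (placeholder keeps position).
        pattern = re.sub(r"\d+", "#", number)
        # Keep only the digits and compare them as a number, not as text.
        digits = int(re.sub(r"\D", "", number))
        groups[pattern].append((digits, number))
    result = []
    for members in groups.values():
        members.sort()
        result.append((members[0][1], members[-1][1], len(members)))
    return result
```

Groups come out in the order in which their pattern first appears in the input, and the full invoice text is reported for the minimum and maximum, which addresses problem 1) above.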

Removing single quotes from comparison in select statement

I have a table where a field can have single quotes, but I need to be able to search by that field without single quotes. For example, if the search query is "Johns favorite", I need to be able to find a row where that field contains "John's favorite". I was looking into regex for it, but that seems to return a 0 or 1 when used in a select statement, if I'm understanding it correctly.
Take a look at:
http://www.artfulsoftware.com/infotree/queries.php#552
This will give you the distance between two strings, i.e. you can check whether the Levenshtein distance is less than 3, which means that fewer than 3 edit operations are needed to make the strings equal.
Try using REPLACE:
SELECT
IF(
REPLACE("John's favorite","'","") = "Johns favorite" ,
"found",
"not found"
)
It's not optimal but it should do the job.
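Applied to a table, the same REPLACE call would go into the WHERE clause. The sketch below runs the idea against an in-memory SQLite database from Python, on the assumption that SQLite's REPLACE behaves like MySQL's for this purpose; the table name items, the column title and the function find_ignoring_quotes are made up for the example.

```python
import sqlite3

def find_ignoring_quotes(titles, search):
    """Return the stored titles that equal search once single quotes are removed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (title TEXT)")
    conn.executemany("INSERT INTO items (title) VALUES (?)", [(t,) for t in titles])
    # '''' is the SQL literal for a single quote character.
    cursor = conn.execute(
        "SELECT title FROM items WHERE REPLACE(title, '''', '') = ?",
        (search,),
    )
    found = [row[0] for row in cursor.fetchall()]
    conn.close()
    return found
```

Because the function is applied to the column, an ordinary index on it is unlikely to help, which is one reason this approach is "not optimal".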

MySQL Regexp won't work

I have the following query:
SELECT `admin_users`.*
FROM `admin_users`
WHERE (avatar REGEXP 'avatars/Name-[0-9]+.jpg+')
ORDER BY `admin_users`.`avatar`
DESC LIMIT 1
It's ok if I have something like:
avatars/Name-5.jpg
avatars/Name-6.jpg
But if I have, avatars/Name-15.jpg, for example, it doesn't return in query.
In other words, It only works for 1 digit, not for more. How can I solve it?
When comparing strings (and that is what avatar is), "avatars/Name-1..." comes before "avatars/Name-5..." simply because the string "1" comes before "5".
It is not practical to order those by an embedded number. This would do what you want, but it is pretty cryptic:
ORDER BY 0 + MID(avatar, 14)
To explain
MID will start at the 14th character of 'avatars/Name-15.jpg' and extract '15.jpg'.
0+ will take that string, convert it to a number and deliver the number 15. (When a string is turned into a number, the first characters are taken as long as it looks like a number. So, 0+'abc' will deliver 0, since there is nothing at the beginning of abc that looks like a number.)
If the part to the left of the number is not exactly 13 characters long in all cases (so that the number starts at the 14th character), the trick will fail. And it may get so complicated as to be 'impossible' in SQL.
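The effect of string ordering versus ordering by the embedded number can be reproduced in Python. This sketch only imitates what 0 + MID(avatar, 14) does (skip a fixed-length prefix, then read the leading digits); the helper names and the prefix_length parameter are made up for the example.

```python
import re

def leading_number(text):
    """Imitate the string-to-number conversion: leading digits, else 0."""
    match = re.match(r"\d+", text)
    return int(match.group()) if match else 0

def latest_avatar(avatars, prefix_length=13):
    """Pick the avatar with the highest number after a fixed-length prefix."""
    return max(avatars, key=lambda name: leading_number(name[prefix_length:]))

avatars = ["avatars/Name-5.jpg", "avatars/Name-6.jpg", "avatars/Name-15.jpg"]
by_string = max(avatars)             # what ORDER BY avatar DESC LIMIT 1 picks
by_number = latest_avatar(avatars)   # what ordering by the number picks
```

Plain string comparison selects avatars/Name-6.jpg, while ordering by the number selects avatars/Name-15.jpg.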