Use of REGEXP_SUBSTR to get date values from string - mysql

I'm looking for the REGEXP_SUBSTR code that gets dates like format '06-11-2014 - 05-12-2014' or format '01/11/2019 - 30/11/2019' from a string. The first date being the startdate and the second date being the enddate. It would be extremely helpful to understand how the REGEXP_SUBSTR works in this case and also why. I want to get the string with the two dates, but then I want both dates to be in their own column.
A record looks like this:
Medium - nl (06-11-2014 - 05-12-2014) ruimte: Standaard (5.000 MB).
Although text can be shorter or longer the two dates between brackets are always there.
The code below gets the first one, but only if it's with '-'. I want both '-' and '/' variants displayed.
REGEXP_SUBSTR(description, '[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
Thanks a lot for any and all help.

Since you are using MySQL 8+, it means you also have access to the REGEXP_REPLACE function, which is suitable for isolating the portion of the string which contains the two dates. In the CTE below, I isolate the date string, then in a subquery on that CTE, I fish out the two dates in separate columns using SUBSTRING_INDEX.
WITH cte AS (
SELECT
text,
REGEXP_REPLACE(text, '^.*\\(([0-9]{2}-[0-9]{2}-[0-9]{4} - [0-9]{2}-[0-9]{2}-[0-9]{4})\\).*$', '$1') AS dates
FROM yourTable
)
SELECT
text,
SUBSTRING_INDEX(dates, ' - ', 1) AS first_date,
SUBSTRING_INDEX(dates, ' - ', -1) AS second_date
FROM cte;
Here is an explanation of the regex pattern used:
^                             from the start of the string
.*                            match any content, until hitting
\(                            '(' which is followed by
(                             (capture what follows)
[0-9]{2}-[0-9]{2}-[0-9]{4}    a single date
 -                            ' - ' (space, hyphen, space)
[0-9]{2}-[0-9]{2}-[0-9]{4}    another single date
)                             (stop capture)
\)                            ')'
.*                            match the remainder of the content
$                             end of the string
Note that we include a pattern which matches the entire input, which is a requirement since we want to use a capture group. Also, note that REGEXP_SUBSTR might have been viable here, but it could run the risk that you get false positives, in the event that a date could appear elsewhere besides the terms in parentheses.
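The pattern logic can also be checked outside the database. Below is a minimal sketch in Python's re module (not MySQL), assuming the two dates always sit inside one pair of parentheses as in the sample record. The function name split_dates is hypothetical, and the [-/] separator class is my own addition to cover the '01/11/2019 - 30/11/2019' variant from the question; it is not part of the answer above.

```python
import re

# One dd-mm-yyyy or dd/mm/yyyy date; the [-/] class is an assumption
# added so that both separator styles from the question are accepted.
DATE = r"[0-9]{2}[-/][0-9]{2}[-/][0-9]{4}"

# Two dates inside parentheses, separated by " - ", each in its own group.
PATTERN = re.compile(r"\((" + DATE + r") - (" + DATE + r")\)")

def split_dates(description):
    """Return (start_date, end_date), or None when no date pair is found."""
    match = PATTERN.search(description)
    return match.groups() if match else None

print(split_dates("Medium - nl (06-11-2014 - 05-12-2014) ruimte: Standaard (5.000 MB)."))
print(split_dates("Medium - nl (01/11/2019 - 30/11/2019) ruimte: Standaard"))
```

The same character class, [0-9]{2}[-/][0-9]{2}[-/][0-9]{4}, should be usable inside the MySQL pattern as well, though I have only exercised it in Python here.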

Related

Mysql: extract a string from field between delimiters (backwards)

I have a Column 'ACCOUNT_NUMBER' from a table 'BankingActivity' which contains data as follow :
example:
ManualBanking-BankDeposit-350-1006590343--INTERNAL_A
or
MyPayCard-MyPayDeposit-620-989228234--TL
I need to extract the number '1006590343' or '989228234'
Initially I executed the following query:
select substr( `BankingActivity`.`ACCOUNT_NUMBER`,(
locate( '--', `BankingActivity`.`ACCOUNT_NUMBER` ) - 9 ),9 ) * 1
from BankingActivity
This works fine as long as the number is no longer than 9 digits. Beyond 9 digits, the fixed length cuts off the leading digits and I cannot get the full number.
How can I look backwards from the delimiter '--' and extract the value between it and the previous '-' delimiter?
I tried with some Regex but I am not familiar enough with it to get a correct result.
Try
SELECT regexp_substr(
regexp_substr(acct, '-\\d+--'), '\\d+')
FROM (
SELECT 'ManualBanking-BankDeposit-350-1006590343--INTERNAL_A' as acct
UNION
SELECT 'MyPayCard-MyPayDeposit-620-989228234--TL'
) accounts;
The inner regexp_substr extracts a substring that begins with a dash, followed by one or more digits, and ends with two dashes, e.g. '-1006590343--'. From this, the outer regexp_substr extracts the first run of consecutive digits, that is '1006590343'.
More detailed information about regular expressions in MySQL can be found in the documentation.
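To see the two steps separately, here is a rough Python equivalent of the nested calls (a sketch with Python's re module, not MySQL itself; the function name account_number is made up for illustration).

```python
import re

def account_number(acct):
    """Mimic the nested regexp_substr calls: isolate '-digits--', then the digits."""
    inner = re.search(r"-\d+--", acct)          # e.g. '-1006590343--'
    if inner is None:
        return None                             # no '-digits--' fragment present
    return re.search(r"\d+", inner.group(0)).group(0)

print(account_number("ManualBanking-BankDeposit-350-1006590343--INTERNAL_A"))
print(account_number("MyPayCard-MyPayDeposit-620-989228234--TL"))
```

Note that '-350-' is skipped by the inner step because it is followed by a single dash, not two.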
If I have understood your question correctly then you can try something like this -
select SUBSTRING_INDEX(SUBSTRING_INDEX('ManualBanking-BankDeposit-350-1006590343--INTERNAL_A', '-' ,-3), '--', 1);
select SUBSTRING_INDEX(SUBSTRING_INDEX('MyPayCard-MyPayDeposit-620-989228234--TL', '-' ,-3), '--', 1);
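If it helps to see what each SUBSTRING_INDEX call returns, the behaviour can be approximated in Python. The substring_index helper below is a hypothetical stand-in I wrote for illustration, assuming the usual semantics (positive count keeps the left part, negative count keeps the right part); it is not guaranteed to match MySQL in every edge case, such as overlapping delimiters.

```python
def substring_index(s, delim, count):
    """Rough Python stand-in for MySQL's SUBSTRING_INDEX (hypothetical helper)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])    # left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])    # right of the count-th delimiter from the end
    return ""                               # count = 0 yields an empty string

acct = "ManualBanking-BankDeposit-350-1006590343--INTERNAL_A"
tail = substring_index(acct, "-", -3)       # step 1: keep the last three '-' pieces
print(tail)
print(substring_index(tail, "--", 1))       # step 2: keep what precedes '--'
```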
This is probably a job for SUBSTRING_INDEX().
Check it out.
SET @s = 'ManualBanking-BankDeposit-350-1006590343--INTERNAL_A';
SELECT SUBSTRING_INDEX(@s, '-', -3);
This splits your string on '-'. It takes everything after the third '-' delimiter from the end, and gives you back 1006590343--INTERNAL_A.
Then we use SUBSTRING_INDEX() again on that.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@s, '-', -3), '-', 1);
Lo and behold, this gets us 1006590343.
But. This is a brittle way to do it. MySQL's string processing isn't easy to program in detailed ways. This solution doesn't take into account things like missing dashes at the end of the string. Garbage in, garbage out. Use a host language like C# / php / nodejs / Java etc to do this kind of string analysis if you want it to be super-robust for real world data.
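As one possible shape for the host-language route, here is a minimal Python sketch that looks backwards from the '--' delimiter and refuses to guess when the input is malformed. The function name extract_account and the rule "digits only" are my assumptions about what counts as a valid account number.

```python
def extract_account(value):
    """Return the digits between the last '-' and the first '--', else None."""
    if not isinstance(value, str):
        return None
    head, sep, _rest = value.partition("--")   # split at the first '--'
    if not sep:
        return None                            # no '--' delimiter at all
    number = head.rsplit("-", 1)[-1]           # piece after the previous '-'
    return number if number.isdigit() else None

print(extract_account("ManualBanking-BankDeposit-350-1006590343--INTERNAL_A"))
print(extract_account("ManualBanking-BankDeposit-350-1006590343"))
```

Returning None for rows without the delimiter makes bad data visible instead of silently producing a wrong substring.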

SQL Query on editing a quantity field

I have a dataset where the values are in different formats, and I want to bring them into a single format. The values are stored as varchar.
For ex.
1st Case: 1.23.45 should be 123.45
2nd Case: 125.45 should be 125.45
The first one, has two decimals. I want to remove the first decimal only(if there are 2) else let the value be as it is.
How do I do this?
I tried using replace(Qty,'.',''), but this removes all of them.
I think this can do (although I am not 100% sure about corner cases)
UPDATE yourTable -- placeholder table name; MySQL syntax assumed
SET Qty = CONCAT(SUBSTRING(Qty, 1, LOCATE('.', Qty) - 1), SUBSTRING(Qty, LOCATE('.', Qty) + 1))
WHERE LENGTH(Qty) - LENGTH(REPLACE(Qty, '.', '')) = 2;
You can use a regular expression to handle this case.
Assuming there are only two decimals in your string the below query should be able to handle the case.
select regexp_replace(value, '^(\\d+)\\.?(\\d+\\.\\d+)$', concat('$1','$2')) as a
Here we are matching a regular expression pattern and capturing the following
digits before first decimal occurrence in group one
digits before and after last decimal occurrence including the last decimal in group two.
Following that we are concatenating the two captured groups.
Note that the first decimal has been made optional using ? character and hence we are able to handle both type of cases.
Even if there are more than two decimal cases, I believe a properly constructed regular expression should be able to handle it.
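The two-group idea described above can be tried quickly in Python's re module (a sketch, not the database function; normalize_qty is a hypothetical name, and \1\2 is Python's replacement syntax for the two captured groups).

```python
import re

def normalize_qty(qty):
    """Drop the first '.' only when the value contains two of them."""
    # Group 1: digits before the first decimal point (the point itself is optional).
    # Group 2: digits around the last decimal point, including that point.
    return re.sub(r"^(\d+)\.?(\d+\.\d+)$", r"\1\2", qty)

print(normalize_qty("1.23.45"))
print(normalize_qty("125.45"))
```

Values that do not fit the pattern at all, such as '12' or '1.5', come back unchanged because no substitution takes place.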

How to split a column in two columns

I have an issue with a table called "movies". I found the date and the movie title are both in the title column. As shown in the picture:
I don't know how to deal with this kind of issue. I tried adapting this code to make it work like MySQL code, but it didn't work.
DataFrame(row.str.split(' ',-1).tolist(),columns = ['title','date'])
How do I split it in two columns (title, date)?
If you are using MySQL 8+, then we can try using REGEXP_REPLACE:
SELECT
REGEXP_REPLACE(title, '^(.*)\\s\\(.*$', '$1') AS title,
REGEXP_REPLACE(title, '^.*\\s\\((\\d+)\\)$', '$1') AS date
FROM yourTable;
Here is a general regex pattern which can match your title strings:
^(.*)\s\((\d+)\)$
Explanation:
^ from the start of the string
(.*)\s match and capture anything, up to the last space
\( match a literal opening parenthesis
(\d+) match and capture the year (any number of digits)
\) match a literal closing parenthesis
$ end of string
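The explanation above can be exercised with a small Python sketch. The sample title 'Toy Story (1995)' is an assumption on my part, since the picture from the question is not available, and split_title is a hypothetical helper name.

```python
import re

# Capture everything up to the last " (", then the digits inside the parentheses.
TITLE_YEAR = re.compile(r"^(.*)\s\((\d+)\)$")

def split_title(raw):
    """Return (title, year); year is None when the string has no '(digits)' suffix."""
    match = TITLE_YEAR.match(raw)
    if match is None:
        return raw, None
    return match.group(1), match.group(2)

print(split_title("Toy Story (1995)"))
print(split_title("Movie (Part 2) (2001)"))
```

Because .* is greedy, a title that itself contains parentheses still splits at the last ' (' before the year.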
I would simply do:
select left(title, length(title) - 7) as title,
replace(right(title, 5) ,')', '') as year
Regular expressions seem like overkill for this logic.
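For comparison, the fixed-offset version translates to plain slicing. This is only a sketch and rests on the assumption that every title ends in exactly ' (YYYY)', a 7-character suffix; the sample title and the name split_fixed are mine.

```python
def split_fixed(title):
    """Split 'Name (YYYY)' by position; assumes a 7-character ' (YYYY)' suffix."""
    name = title[:len(title) - 7]            # like LEFT(title, LENGTH(title) - 7)
    year = title[-5:].replace(")", "")       # like REPLACE(RIGHT(title, 5), ')', '')
    return name, year

print(split_fixed("Toy Story (1995)"))
```

If a row lacks the year or carries trailing whitespace, the offsets shift and the result is silently wrong, which is the trade-off against the regex approach.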
In Hive, you need to use substr() for this:
select substr(title, 1, length(title) - 7) as title,
substr(title, length(title) - 4, 4) as year
After struggling and searching I was able to build this command which works perfectly.
select
translate(substr(title,0,length(title) -6) ,'', '') as title,
translate(substr(title, -5) ,')', '') as date
from movies;
Thanks to the people who answered too!

Extracting a substring matched in Regex and bounded by spaces

Trying to parse dates entered in various ways and contexts, and that may or may not be present in a given record
I can SELECT candidate rows with
SELECT * FROM table WHERE column REGEXP '[-|.|/][0-9][0-9][-|.|/]' ;
This will indeed select records that read something like
I was on top of mount Everest (2010-10-10)
i went to see the doctor on 13/12/10 and she told me I was in great shape.
where the matched values are -10- and /12/ for the first and second records respectively.
Now, I want to extract the date from the column. Not merely the -10- or /12/ but the full date fragments 2010-10-10 or 13/12/10, i.e. the matched expression expanded backwards up to a space or a parenthesis, and expanded forwards up to a space or a parenthesis.
Sorry if this is obvious - I am not familiar with REGEX.
you will have to find a pattern for the date input. you can use regex in your where, but you will need to isolate it somehow. is it always the last part of the col?
now that you isolated the location, you can do a case style select
select case
when right(date,4) between 1900 and 2200 then right(date,10) #mm/dd/yyyy
when left(date,4) between 1900 and 2200 then concat(left(right(date,5),2), "/", right(date,2))
end as date
that kind of ordeal
EDIT:
SET @fieldName = "I was ON top of mount Everest (2010-10-22)";
SELECT IF(
STR_TO_DATE(
CONCAT(
RIGHT(SUBSTRING_INDEX(@fieldName,"-",1),4), "-",RIGHT(SUBSTRING_INDEX(@fieldName,"-",2),2), "-",LEFT(SUBSTRING_INDEX(@fieldName,"-",-1),2)
), '%Y-%m-%d'
) IS NULL ,
"bad date",
"good date");
but now for bad date and good date, you keep chaining that style to loop through all variants of dates...
although the best solution is to make that date a diff col in a special format if you can as it is entered
The proper REGEX (in this case) is [0-9+-]+[-|.|/][0-9][0-9][-|.|/]+[0-9+-]+
Your pattern [0-9+-]+[-./][0-9][0-9][-./]+[0-9+-]+ would match stuff like +-+-.99///.///-++++, is that really what you want?
Consider using
(?:(?P<year>\d{3,4})|(\d{1,2}))(?P<sep>[-./])\d{1,2}(?P=sep)(?(year)\d{1,2}|\d{1,4})
instead. It doesn't allow mixed separators like 1.2-2014, and doesn't allow more than one number to have more than 2 digits like 2010-10-2010.
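The named groups and the conditional in this pattern are Python/PCRE-style syntax, so I would not expect MySQL's REGEXP functions to accept it as written. Here is a minimal sketch of how it behaves in Python's re module. The two digit lookarounds are my own addition: without them, a search inside 2010-10-2010 would still pick up the prefix 2010-10-20. The name find_date is hypothetical.

```python
import re

DATE = re.compile(
    r"(?<!\d)"                               # assumption: no digit directly before
    r"(?:(?P<year>\d{3,4})|(\d{1,2}))"       # leading year, or a 1-2 digit number
    r"(?P<sep>[-./])\d{1,2}(?P=sep)"         # middle number, same separator on both sides
    r"(?(year)\d{1,2}|\d{1,4})"              # short tail if the year came first
    r"(?!\d)"                                # assumption: no digit directly after
)

def find_date(text):
    """Return the first date-like fragment in text, or None."""
    match = DATE.search(text)
    return match.group(0) if match else None

print(find_date("I was on top of mount Everest (2010-10-10)"))
print(find_date("i went to see the doctor on 13/12/10 and she told me I was in great shape."))
print(find_date("1.2-2014"))
```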

Extract Data Between Parenthesis That's Always Different Lengths

I have run into a problem and I cannot seem to find the correct solution anywhere. I'd like to extract data from a column that is never going to be the same length but will always be in parentheses. I've tried different SUBSTR and LOCATE statements to no avail.
Table: FiguresLog
| UpdateDate | Description                                  |
| 2014-01-01 | (10.0.600.1) Various descriptions follow     |
| 2014-01-02 | (192.168.10.100) Various descriptions follow |
I need to be able to extract (create a new table/field) containing the IP Addresses within the parenthesis and as I stated, they will always be a different length.
You can do this with regular expressions. But, if there is only one parenthetical expression in the string, this should work for you:
select substring_index(substring_index(description, ')', 1), '(', -1) as IpAddress
You can do the whole thing using LOCATE and SUBSTR. Because of how SUBSTR takes position and length, the math gets a little funky. Hopefully this example makes it clear:
SELECT
SUBSTR(text, ip_start, ip_len) AS ip_addr
FROM
(
SELECT text,
(LOCATE('(', text) + 1) AS ip_start,
(LOCATE(')', text) - (LOCATE('(', text) + 1)) AS ip_len
FROM test
) temp;
Notice that (LOCATE('(', text) + 1) gets repeated. The + 1 is so we don't include the parenthesis in the substring.
The actual calculation for ip_len is ip_len = end_paren_pos - ip_start but we cannot create and select from ip_start in the same query.
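The position arithmetic can be traced with a small Python sketch. The locate and substr helpers are hypothetical 1-based stand-ins that I wrote to mirror the SQL functions (Python's own indexing is 0-based), so treat them as an approximation.

```python
def locate(sub, s):
    """1-based position of sub in s, 0 when absent (mirrors LOCATE)."""
    return s.find(sub) + 1

def substr(s, pos, length):
    """length characters starting at 1-based pos (mirrors SUBSTR for pos >= 1)."""
    return s[pos - 1:pos - 1 + length]

def extract_ip(text):
    ip_start = locate("(", text) + 1         # first character after '('
    ip_len = locate(")", text) - ip_start    # characters up to, not including, ')'
    return substr(text, ip_start, ip_len)

print(extract_ip("(10.0.600.1) Various descriptions follow"))
print(extract_ip("(192.168.10.100) Various descriptions follow"))
```

For the first row, '(' is at position 1 and ')' at position 12, so ip_start is 2 and ip_len is 10, which covers exactly 10.0.600.1.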
Example in action: http://sqlfiddle.com/#!2/2845e/3