MySQL: extract a string from a field between delimiters (backwards)

I have a column 'ACCOUNT_NUMBER' in a table 'BankingActivity' that contains data such as:
ManualBanking-BankDeposit-350-1006590343--INTERNAL_A
or
MyPayCard-MyPayDeposit-620-989228234--TL
I need to extract the number '1006590343' or '989228234'.
Initially I ran the following query:
select substr( `BankingActivity`.`ACCOUNT_NUMBER`,(
locate( '--', `BankingActivity`.`ACCOUNT_NUMBER` ) - 9 ),9 ) * 1
from BankingActivity
This works fine as long as the number is exactly 9 digits long. With more than 9 digits, the fixed length truncates it and I do not get the full number.
How can I look backwards from the delimiter '--' and extract the value between the '--' delimiter and the previous '-' delimiter?
I tried a regex, but I am not familiar enough with regular expressions to get a correct result.

Try
SELECT regexp_substr(
regexp_substr(acct, '-\\d+--'), '\\d+')
FROM (
SELECT 'ManualBanking-BankDeposit-350-1006590343--INTERNAL_A' as acct
UNION
SELECT 'MyPayCard-MyPayDeposit-620-989228234--TL'
) accounts;
The inner regexp_substr extracts a substring that begins with a dash, continues with one or more digits, and ends with two dashes, e.g. '-1006590343--'. From this, the outer regexp_substr extracts the run of consecutive digits, that is '1006590343'.
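To make the two-step extraction easier to follow, here is a minimal Python sketch that mirrors the nested regexp_substr calls with the standard re module. It is only an illustration of the matching logic, not MySQL code, and the helper name extract_account_number is made up for this example.

```python
import re

def extract_account_number(acct):
    """Mimic the nested regexp_substr calls: first '-<digits>--', then the digits."""
    outer = re.search(r"-\d+--", acct)
    if outer is None:
        # No '-<digits>--' segment found; regexp_substr would return NULL here.
        return None
    inner = re.search(r"\d+", outer.group(0))
    return inner.group(0)

samples = [
    "ManualBanking-BankDeposit-350-1006590343--INTERNAL_A",
    "MyPayCard-MyPayDeposit-620-989228234--TL",
]
for s in samples:
    print(extract_account_number(s))
```

Note that '-350-' is skipped because it is followed by a single dash, so only the segment directly before '--' matches.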
More detailed information about regular expressions in MySQL can be found in the documentation.

If I have understood your question correctly then you can try something like this -
select SUBSTRING_INDEX(SUBSTRING_INDEX('ManualBanking-BankDeposit-350-1006590343--INTERNAL_A', '-' ,-3), '--', 1);
select SUBSTRING_INDEX(SUBSTRING_INDEX('MyPayCard-MyPayDeposit-620-989228234--TL', '-' ,-3), '--', 1);

This is probably a job for SUBSTRING_INDEX().
Check it out.
SET @s = 'ManualBanking-BankDeposit-350-1006590343--INTERNAL_A';
SELECT SUBSTRING_INDEX(@s, '-', -3);
This splits your string on '-'. It takes everything after the third '-' delimiter from the end, and gives you back 1006590343--INTERNAL_A.
Then we use SUBSTRING_INDEX() again on that.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@s, '-', -3), '-', 1);
Lo and behold, this gets us 1006590343.
But this is a brittle way to do it. MySQL's string processing isn't easy to program in detailed ways, and this solution doesn't account for things like missing dashes at the end of the string. Garbage in, garbage out. If you need this kind of string analysis to be robust against real-world data, do it in a host language such as C#, PHP, Node.js, or Java.
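The brittleness is easy to see by tracing the two SUBSTRING_INDEX() calls by hand. The Python sketch below uses a split-based stand-in for SUBSTRING_INDEX; it is an approximation written for this illustration (the function names and the malformed sample are hypothetical), and it may differ from MySQL on edge cases such as overlapping multi-character delimiters.

```python
def substring_index(s, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX (split-based approximation)."""
    if not delim or count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Everything to the right of the |count|-th delimiter from the end.
    return delim.join(parts[count:])

def account_number(s):
    tail = substring_index(s, "-", -3)
    return substring_index(tail, "-", 1)

good = "ManualBanking-BankDeposit-350-1006590343--INTERNAL_A"
print(substring_index(good, "-", -3))  # 1006590343--INTERNAL_A
print(account_number(good))            # 1006590343

# Hypothetical malformed row: the trailing '--SUFFIX' part is missing.
no_suffix = "ManualBanking-BankDeposit-350-1006590343"
print(account_number(no_suffix))       # BankDeposit
```

With the suffix missing, counting three delimiters from the end lands on the wrong field, and the query silently returns 'BankDeposit' instead of failing.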

Related

Use of REGEXP_SUBSTR to get date values from string

I'm looking for the REGEXP_SUBSTR code that extracts dates in the format '06-11-2014 - 05-12-2014' or '01/11/2019 - 30/11/2019' from a string. The first date is the start date and the second is the end date. It would be extremely helpful to understand how REGEXP_SUBSTR works in this case, and why. I want to get the string with the two dates, and then put each date in its own column.
A record looks like this:
Medium - nl (06-11-2014 - 05-12-2014) ruimte: Standaard (5.000 MB).
Although text can be shorter or longer the two dates between brackets are always there.
The code below gets the first one, but only if it's with '-'. I want both '-' and '/' variants displayed.
REGEXP_SUBSTR(description, '[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
Thanks a lot for any and all help.
Since you are using MySQL 8+, it means you also have access to the REGEXP_REPLACE function, which is suitable for isolating the portion of the string which contains the two dates. In the CTE below, I isolate the date string, then in a subquery on that CTE, I fish out the two dates in separate columns using SUBSTRING_INDEX.
WITH cte AS (
SELECT
text,
REGEXP_REPLACE(text, '^.*\\(([0-9]{2}-[0-9]{2}-[0-9]{4} - [0-9]{2}-[0-9]{2}-[0-9]{4})\\).*$', '$1') AS dates
FROM yourTable
)
SELECT
text,
SUBSTRING_INDEX(dates, ' - ', 1) AS first_date,
SUBSTRING_INDEX(dates, ' - ', -1) AS second_date
FROM cte;
Here is an explanation of the regex pattern used:
^                             from the start of the string
.*                            match any content, until hitting
\(                            a literal '(' which is followed by
(                             (start capture)
[0-9]{2}-[0-9]{2}-[0-9]{4}    a single date
' - '                         the literal separator: space, dash, space
[0-9]{2}-[0-9]{2}-[0-9]{4}    another single date
)                             (stop capture)
\)                            a literal ')'
.*                            match the remainder of the content
$                             end of the string
Note that we include a pattern which matches the entire input, which is a requirement since we want to use a capture group. Also, note that REGEXP_SUBSTR might have been viable here, but it could run the risk that you get false positives, in the event that a date could appear elsewhere besides the terms in parentheses.
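The question also asked for the '/' variant. As a rough illustration of how a character class [-/] covers both separators, here is a Python sketch using the re module rather than MySQL. The second sample record is hypothetical, built from the '/' format quoted in the question, and the names DATE, RANGE and extract_dates are made up for this example.

```python
import re

# One date with either '-' or '/' as the separator.
DATE = r"\d{2}[-/]\d{2}[-/]\d{4}"
# Two dates inside parentheses, separated by ' - '; each date is its own group.
RANGE = re.compile(r"\((" + DATE + r") - (" + DATE + r")\)")

def extract_dates(description):
    """Return (start_date, end_date), or None when no parenthesised range is found."""
    m = RANGE.search(description)
    return (m.group(1), m.group(2)) if m else None

print(extract_dates("Medium - nl (06-11-2014 - 05-12-2014) ruimte: Standaard (5.000 MB)."))
# Hypothetical record using the '/' format from the question.
print(extract_dates("Medium - nl (01/11/2019 - 30/11/2019) ruimte: Standaard (5.000 MB)."))
```

One caveat of this sketch: [-/] also accepts mixed separators such as 06-11/2014. Requiring the parentheses keeps '(5.000 MB)' and stray dates elsewhere in the text from matching.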

How to pass multiple delimiters in substring_index

I want to query the string between https:// or http:// and the first delimiter character that comes after it. For example, if the field contains:
https://google.com/en/
https://www.yahoo.com?en/
I want to get:
google.com
www.yahoo.com
My initial query that will capture the / only contains two substring_index as follows:
SELECT substring_index(substring_index(mycol,'/',3),'://',-1)
FROM mytable;
Now I have found that the URLs may contain multiple delimiters. I want my statement to handle all of the possible delimiters, which are (each one is a separate character):
:/?#[]@!$&'()*+,;=
How can I do this in my statement? I tried this solution, but the command failed with a syntax error even though I am sure I followed it. Can anyone help me construct the query correctly so that it captures all the delimiter characters listed above?
I use MySQL Workbench 6.3 on Ubuntu 18.04.
EDIT:
Some corrections made in the first example of URLs.
First, note that https://www.yahoo.com?en/ seems like an unlikely URL, because it has a path separator contained inside the query string. In any case, if you are using MySQL 8+, then consider using its regex functionality. The REGEXP_REPLACE function can be helpful here, using the following pattern:
https?://([A-Za-z_0-9.-]+).*
Sample query:
WITH yourTable AS (
SELECT 'https://www.yahoo.com?en/' AS url UNION ALL
SELECT 'no match'
)
SELECT
REGEXP_REPLACE(url, 'https?://([A-Za-z_0-9.-]+).*', '$1') AS url
FROM yourTable
WHERE url REGEXP 'https?://[^/]+';
The term $1 refers to the first capture group in the regex pattern. An explicit capture group is a subpattern enclosed in parentheses. In this case, here is the capture group (highlighted below):
https?://([A-Za-z_0-9.-]+).*
          ^^^^^^^^^^^^^^^
That is, the capture group is the host portion of the URL, including domain, subdomain, etc.
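For readers who want to experiment with the pattern outside MySQL, the following Python sketch applies the same capture group with the re module. It is only an illustration of what $1 ends up holding; the names HOST and extract_host are made up for this example.

```python
import re

# Same host pattern as in the answer; group 1 is the run of host characters.
HOST = re.compile(r"https?://([A-Za-z_0-9.-]+)")

def extract_host(url):
    """Return the captured host, or None when the URL has no http(s):// prefix."""
    m = HOST.search(url)
    return m.group(1) if m else None

for url in ["https://google.com/en/", "https://www.yahoo.com?en/", "no match"]:
    print(url, "->", extract_host(url))
```

Because the character class stops at the first character that is not a letter, digit, underscore, dot or hyphen, there is no need to list every possible delimiter explicitly.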
In MySQL 8+, this should work:
SELECT regexp_replace(regexp_substr(mycol, '://[a-zA-Z0-9_.]+[/:?]'), '[^a-zA-Z0-9_.]', '')
FROM (SELECT 'https://google.com/en' as mycol union all
SELECT 'https://www.yahoo.com?en'
) x
In older versions, this is much more challenging because there is no way to search for a character class.
One brute force method is:
select (case when substring_index(mycol, '://', -1) like '%/%'
then substring_index(substring_index(mycol, '://', -1), '/', 1)
when substring_index(mycol, '://', -1) like '%?%'
then substring_index(substring_index(mycol, '://', -1), '?', 1)
. . . -- and so on for each character
else substring_index(mycol, '://', -1)
end) as what_you_want
The [a-zA-Z0-9_.] is intended to be something like the valid character class for your domain names.

How to split a column in two columns

I have an issue with a table called "movies". I found that the date and the movie title are both stored in the title column, as shown in the picture:
I don't know how to deal with this kind of issue. I tried adapting the code below to something closer to MySQL, but it didn't work.
DataFrame(row.str.split(' ',-1).tolist(),columns = ['title','date'])
How do I split it in two columns (title, date)?
If you are using MySQL 8+, then we can try using REGEXP_REPLACE:
SELECT
REGEXP_REPLACE(title, '^(.*)\\s\\(.*$', '$1') AS title,
REGEXP_REPLACE(title, '^.*\\s\\((\\d+)\\)$', '$1') AS date
FROM yourTable;
Here is a general regex pattern which can match your title strings:
^(.*)\s\((\d+)\)$
Explanation:
^ from the start of the string
(.*)\s match and capture anything, up to the last space
\( match a literal opening parenthesis
(\d+) match and capture the year (any number of digits)
\) match a literal closing parenthesis
$ end of string
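As a quick way to check the pattern explained above, here is a Python sketch using the re module. The sample titles are hypothetical, since the original picture of the data is not available, and the names TITLE_YEAR and split_title are made up for this example.

```python
import re

# Group 1: everything up to the last space; group 2: digits inside the final parentheses.
TITLE_YEAR = re.compile(r"^(.*)\s\((\d+)\)$")

def split_title(raw):
    """Return (title, year); year is None when the string does not end in '(digits)'."""
    m = TITLE_YEAR.match(raw)
    return (m.group(1), m.group(2)) if m else (raw, None)

# Hypothetical sample titles.
print(split_title("Toy Story (1995)"))
print(split_title("Movie (Part 2) (2001)"))
print(split_title("No Year Here"))
```

Because .* is greedy, a title that itself contains parentheses still splits at the last parenthesised number.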
I would simply do:
select left(title, length(title) - 7) as title,
replace(right(title, 5) ,')', '') as year
Regular expressions seem like overkill for this logic.
In Hive, you need to use substr() for this:
select substr(title, 1, length(title) - 7) as title,
substr(title, length(title) - 4, 4) as year
After struggling and searching I was able to build this command which works perfectly.
select
translate(substr(title,0,length(title) -6) ,'', '') as title,
translate(substr(title, -5) ,')', '') as date
from movies;
Thanks for the people who answered too!

MySQL trim multiple 0 of string?

So I have a column of strings in a format like this: '123.123.123.123.123'.
I need to cut the string to the first two numbers like so: '123.123' that way I can GROUP BY the cut string to get the results needed.
I can do this easily by using SUBSTRING_INDEX(Version, '.', 2) however the problem arises when the second number part has multiple 0's therefore giving me duplicate entries in the query.
e.g. (10.00, 10.0) and 10.404, 10.4040 etc.
Is there a way to trim all unwanted zeros off the end of the string?
Note: I can only use straight MySQL or functions in this case.
EDIT:
I can get the desired result by replacing the first instance of '.0', trim the extra zeros and then replace the '.0' back
REPLACE(TRIM(TRAILING '0' FROM REPLACE(SUBSTRING_INDEX(Version, '.', 2), '.0', '^a')), '^a', '.0')
This probably is not the best option performance wise - therefore I will wait for others before accepting my own.
If you are only working with the first two components, then convert the values to a number, say:
cast(substring_index(version, '.', 2) + 0 as decimal(10, 4))
This will give everything with equal numeric values the same representation.
EDIT:
If you want to remove the trailing zeros from the end of the string, you can use this trick:
replace(rtrim(replace(substring_index(version, '.', 2), '0', ' ')), ' ', '0')
This replaces the zeros with spaces, then uses rtrim() and converts them back to zeroes.
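To see what the replace/rtrim/replace trick does to each value, here is a small Python sketch that mirrors the three steps. It is an illustration only, assumes the version strings contain no spaces, and the function name trim_trailing_zeros is made up for this example.

```python
def trim_trailing_zeros(version):
    """Mirror replace(rtrim(replace(x, '0', ' ')), ' ', '0') on the first two components."""
    # Equivalent of SUBSTRING_INDEX(version, '.', 2).
    first_two = ".".join(version.split(".")[:2])
    # Zeros become spaces, trailing spaces are trimmed, remaining spaces become zeros again.
    return first_two.replace("0", " ").rstrip(" ").replace(" ", "0")

for v in ["10.00.1", "10.0.5", "10.404.7", "10.4040.7", "100.50.2"]:
    print(v, "->", trim_trailing_zeros(v))
```

Values such as 10.00 and 10.0 both collapse to '10.' (with a trailing dot), which is enough for GROUP BY but may need one more cleanup step for display.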

Extract Data Between Parenthesis That's Always Different Lengths

I have run into a problem and I cannot seem to find the correct solution anywhere. I'd like to extract data from a column that is never going to be the same length but will always be in parentheses. I've tried different SUBSTR and LOCATE statements to no avail.
Table: FiguresLog
| UpdateDate | Description                                  |
| 2014-01-01 | (10.0.600.1) Various descriptions follow     |
| 2014-01-02 | (192.168.10.100) Various descriptions follow |
I need to be able to extract (into a new table or field) the IP addresses within the parentheses and, as I stated, they will always be a different length.
You can do this with regular expressions. But, if there is only one parenthetical expression in the string, this should work for you:
select substring_index(substring_index(description, ')', 1), '(', -1) as IpAddress
You can do the whole thing using LOCATE and SUBSTR. Because of how SUBSTR takes position and length, the math gets a little funky. Hopefully this example makes it clear:
SELECT
SUBSTR(text, ip_start, ip_len) AS ip_addr
FROM
(
SELECT text,
(LOCATE('(', text) + 1) AS ip_start,
(LOCATE(')', text) - (LOCATE('(', text) + 1)) AS ip_len
FROM test
) temp;
Notice that (LOCATE('(', text) + 1) gets repeated. The + 1 is there so the opening parenthesis is not included in the substring.
The actual calculation for ip_len is ip_len = end_paren_pos - ip_start, but a column alias such as ip_start cannot be referenced in the same SELECT list that defines it.
Example in action: http://sqlfiddle.com/#!2/2845e/3
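The position arithmetic is easier to check with concrete numbers. The Python sketch below re-creates 1-based LOCATE and SUBSTR as small helper functions (both are simplified stand-ins written for this illustration, not MySQL's full behaviour) and applies the same ip_start and ip_len formulas to the sample rows.

```python
def locate(needle, haystack):
    """1-based position like MySQL's LOCATE; 0 when the needle is not found."""
    return haystack.find(needle) + 1

def substr(text, start, length):
    """Simplified 1-based SUBSTR(text, start, length) for positive start values."""
    if length < 1:
        return ""
    return text[start - 1 : start - 1 + length]

def extract_ip(description):
    ip_start = locate("(", description) + 1       # first character after '('
    ip_len = locate(")", description) - ip_start  # characters up to, not including, ')'
    return substr(description, ip_start, ip_len)

print(extract_ip("(10.0.600.1) Various descriptions follow"))
print(extract_ip("(192.168.10.100) Various descriptions follow"))
```

For the first row, '(' is at position 1 and ')' at position 12, so ip_start is 2 and ip_len is 10, which is exactly the length of '10.0.600.1'.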