How to pass multiple delimiters in substring_index - mysql

I want to extract the string between https:// or http:// and the first delimiter character that comes after it. For example, if the field contains:
https://google.com/en/
https://www.yahoo.com?en/
I want to get:
google.com
www.yahoo.com
My initial query, which handles only the / delimiter, nests two substring_index calls as follows:
SELECT substring_index(substring_index(mycol,'/',3),'://',-1)
FROM mytable;
I have now found that the URLs may contain several different delimiters. I want my statement to handle all of the following delimiter characters (each one is a separate character):
:/?#[]#!$&'()*+,;=
How can I do this in my statement? I tried this solution, but the resulting command could not be executed because of a syntax error, even though I am sure I followed the solution. Can anyone help me construct the query correctly so that it handles all the delimiter characters listed above?
I use MySQL Workbench 6.3 on Ubuntu 18.04.
EDIT:
Some corrections made in the first example of URLs.

First, note that https://www.yahoo.com?en/ seems like an unlikely URL, because it has a path separator contained inside the query string. In any case, if you are using MySQL 8+, then consider using its regex functionality. The REGEXP_REPLACE function can be helpful here, using the following pattern:
https?://([A-Za-z_0-9.-]+).*
Sample query:
WITH yourTable AS (
SELECT 'https://www.yahoo.com?en/' AS url UNION ALL
SELECT 'no match'
)
SELECT
REGEXP_REPLACE(url, 'https?://([A-Za-z_0-9.-]+).*', '$1') AS url
FROM yourTable
WHERE url REGEXP 'https?://[^/]+';
The term $1 refers to the first capture group in the regex pattern. An explicit capture group is denoted by a quantity in parentheses. In this case, here is the capture group (highlighted below):
https?://([A-Za-z_0-9.-]+).*
          ^^^^^^^^^^^^^^^
That is, the capture group is the host portion of the URL: the domain together with any subdomains.
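To check the capture-group behaviour outside MySQL, here is a minimal Python sketch of the same pattern. The name extract_host is hypothetical, and Python's re module only approximates the regex engine used by MySQL 8, so treat this as an illustration rather than an exact equivalent.

```python
import re

# Same pattern as in the MySQL query; group 1 is the host part.
HOST_PATTERN = re.compile(r"https?://([A-Za-z_0-9.-]+).*")


def extract_host(url):
    """Return the text captured by group 1, or None when the URL does not match."""
    match = HOST_PATTERN.match(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    for sample in ("https://www.yahoo.com?en/", "no match"):
        print(sample, "->", extract_host(sample))
```

Returning None for non-matching rows plays the role of the WHERE filter in the SQL version.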

In MySQL 8+, this should work:
SELECT regexp_replace(regexp_substr(mycol, '://[a-zA-Z0-9_.]+[/:?]'), '[^a-zA-Z0-9_.]', '')
FROM (SELECT 'https://google.com/en' as mycol union all
SELECT 'https://www.yahoo.com?en'
) x
In older versions, this is much more challenging because there is no way to search for a character class.
One brute force method is:
select (case when substring_index(mycol, '://', -1) like '%/%'
then substring_index(substring_index(mycol, '://', -1), '/', 1)
when substring_index(mycol, '://', -1) like '%?%'
then substring_index(substring_index(mycol, '://', -1), '?', 1)
. . . -- and so on for each character
else substring_index(mycol, '://', -1)
end) as what_you_want
The character class [a-zA-Z0-9_.] is intended to approximate the set of characters that are valid in your domain names.
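If the extraction can be done in application code instead, the "cut at the first delimiter" rule from the question can be sketched in Python as follows. The delimiter list is copied from the question (with the repeated # dropped); the function name is hypothetical. Because : is in the list, a port number such as :8080 is cut off as well.

```python
import re

# Delimiter characters listed in the question (duplicates removed).
DELIMITERS = ":/?#[]!$&'()*+,;="

# After the scheme, capture everything up to the first delimiter character.
_HOST_RE = re.compile(r"^https?://([^" + re.escape(DELIMITERS) + r"]*)")


def host_before_first_delimiter(url):
    """Return the text between the scheme and the first delimiter, or None."""
    match = _HOST_RE.match(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    for sample in ("https://google.com/en/", "https://www.yahoo.com?en/"):
        print(host_before_first_delimiter(sample))
```

Unlike the CASE expression, which tests the delimiters in a fixed order, the negated character class stops at whichever delimiter occurs first in the string.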

Related

Mysql: extract a string from field between delimiters (backwards)

I have a Column 'ACCOUNT_NUMBER' from a table 'BankingActivity' which contains data as follow :
example:
ManualBanking-BankDeposit-350-1006590343--INTERNAL_A
or
MyPayCard-MyPayDeposit-620-989228234--TL
I need to extract the number '1006590343' or '989228234'
Initially I executed the following query:
select substr( `BankingActivity`.`ACCOUNT_NUMBER`,(
locate( '--', `BankingActivity`.`ACCOUNT_NUMBER` ) - 9 ),9 ) * 1
from BankingActivity
This works fine as long as the number has no more than 9 digits. With more than 9 digits, I cannot get the full number.
How can I look backwards from the '--' delimiter and extract the value between it and the previous '-' delimiter?
I tried some regexes, but I am not familiar enough with them to get a correct result.
Try
SELECT regexp_substr(
regexp_substr(acct, '-\\d+--'), '\\d+')
FROM (
SELECT 'ManualBanking-BankDeposit-350-1006590343--INTERNAL_A' as acct
UNION
SELECT 'MyPayCard-MyPayDeposit-620-989228234--TL'
) accounts;
The inner regexp_substr extracts a substring that begins with a dash followed by one or more digits and ends with two dashes, e.g. '-1006590343--'. From this, the outer regexp_substr extracts the run of consecutive digits, that is '1006590343'.
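The two-step extraction described above can be mirrored in Python to see what each stage returns. This is only a sketch; account_digits is a hypothetical name, and Python's re module stands in for MySQL's regex functions.

```python
import re


def account_digits(account_number):
    """Two-stage extraction: first '-digits--', then the digits alone."""
    # Stage 1: a dash, one or more digits, then two dashes, e.g. "-1006590343--".
    inner = re.search(r"-\d+--", account_number)
    if inner is None:
        return None
    # Stage 2: keep only the run of digits inside that match.
    return re.search(r"\d+", inner.group(0)).group(0)


if __name__ == "__main__":
    print(account_digits("ManualBanking-BankDeposit-350-1006590343--INTERNAL_A"))
    print(account_digits("MyPayCard-MyPayDeposit-620-989228234--TL"))
```

The earlier field 350 is skipped because it is followed by a single dash, not by two.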
More detailed information about regular expressions in MySQL can be found in the documentation.
If I have understood your question correctly then you can try something like this -
select SUBSTRING_INDEX(SUBSTRING_INDEX('ManualBanking-BankDeposit-350-1006590343--INTERNAL_A', '-' ,-3), '--', 1);
select SUBSTRING_INDEX(SUBSTRING_INDEX('MyPayCard-MyPayDeposit-620-989228234--TL', '-' ,-3), '--', 1);
This is probably a job for SUBSTRING_INDEX().
Check it out.
SET @s = 'ManualBanking-BankDeposit-350-1006590343--INTERNAL_A';
SELECT SUBSTRING_INDEX(@s, '-', -3);
This splits your string on '-'. It takes everything after the third '-' delimiter from the end, and gives you back 1006590343--INTERNAL_A.
Then we use SUBSTRING_INDEX() again on that.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@s, '-', -3), '-', 1);
Lo and behold, this gets us 1006590343.
But. This is a brittle way to do it. MySQL's string processing isn't easy to program in detailed ways. This solution doesn't take into account things like missing dashes at the end of the string. Garbage in, garbage out. Use a host language like C# / php / nodejs / Java etc to do this kind of string analysis if you want it to be super-robust for real world data.
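As one possible shape for that host-language approach, here is a small Python sketch that returns None instead of garbage when the expected dashes are missing. The function name and the choice to reject non-numeric fields are assumptions made for illustration, not part of the original answer.

```python
def extract_account_id(account_number):
    """Return the field between the last '-' before '--' and the '--', or None."""
    head, sep, _tail = account_number.partition("--")
    if not sep:
        return None  # no '--' delimiter at all
    _before, dash, candidate = head.rpartition("-")
    if not dash or not candidate.isdigit():
        return None  # no preceding '-' or the field is not purely digits
    return candidate


if __name__ == "__main__":
    print(extract_account_id("ManualBanking-BankDeposit-350-1006590343--INTERNAL_A"))
    print(extract_account_id("malformed value"))
```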

How to split a column in two columns

I have an issue with a table called "movies". I found the date and the movie title are both in the title column. As shown in the picture:
I don't know how to deal with this kind of issue. I tried to adapt the following code into something like MySQL code, but it didn't work.
DataFrame(row.str.split(' ',-1).tolist(),columns = ['title','date'])
How do I split it in two columns (title, date)?
If you are using MySQL 8+, then we can try using REGEXP_REPLACE:
SELECT
REGEXP_REPLACE(title, '^(.*)\\s\\(.*$', '$1') AS title,
REGEXP_REPLACE(title, '^.*\\s\\((\\d+)\\)$', '$1') AS date
FROM yourTable;
Here is a general regex pattern which can match your title strings:
^(.*)\s\((\d+)\)$
Explanation:
^        from the start of the string
(.*)\s   match and capture anything, up to the last space
\(       match a literal opening parenthesis
(\d+)    match and capture the year (one or more digits)
\)       match a literal closing parenthesis
$        end of string
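To see which text each capture group picks up, the pattern ^(.*)\s\((\d+)\)$ can be tried in Python. The sample titles below are hypothetical, since the picture from the question is not available, and split_title is an illustrative name.

```python
import re

# Group 1: the title text; group 2: the digits inside the trailing parentheses.
TITLE_YEAR = re.compile(r"^(.*)\s\((\d+)\)$")


def split_title(raw):
    """Return (title, year) or (raw, None) when there is no trailing '(digits)'."""
    match = TITLE_YEAR.match(raw)
    if match is None:
        return raw, None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for sample in ("Toy Story (1995)", "Movie (Part 2) (2001)", "Untitled"):
        print(split_title(sample))
```

Because (.*) is greedy, only the last parenthesized number is treated as the year, so parentheses inside the title itself are preserved.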
I would simply do:
select left(title, length(title) - 7) as title,
replace(right(title, 5) ,')', '') as year
Regular expressions seem like overkill for this logic.
In Hive, you need to use substr() for this:
select substr(title, 1, length(title) - 7) as title,
substr(title, length(title) - 4, 4) as year
After struggling and searching I was able to build this command which works perfectly.
select
translate(substr(title,0,length(title) -6) ,'', '') as title,
translate(substr(title, -5) ,')', '') as date
from movies;
Thanks to the people who answered too!

Regex Search with delimiters in and Mysql

I'm trying to convert a regex that works fine in PHP to MySQL.
MySQL does not allow negative look-ahead (?!) so I need a solution or a workaround
My DB column data is a string like this:
title:The Book Title¬#¬description:The Book Description¬#¬Price:$10.57
The regex I can use in PHP would be
(^|¬#¬)title:(((?!¬#¬).)*Book((?!¬#¬).)*)
but in MySQL I'm struggling. Does anybody have any advice or suggestions?
Before version 8.0, MySQL has no way to extract a regex match from a column's content in the SELECT clause.
You may use SUBSTRING function to extract your content in this case.
SELECT
SUBSTRING_INDEX(
LEFT( content, LOCATE('¬#¬description', content)-1 ), 'title:', -1) AS title,
SUBSTRING_INDEX(
LEFT( content, LOCATE('¬#¬Price', content)-1 ), 'description:', -1) AS description,
SUBSTRING_INDEX(content, 'Price:', -1) AS price
FROM test_table
SQLFiddle
http://sqlfiddle.com/#!2/04e83/1
The solution was simple once I thought about splitting it like nhahtdh suggested.
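-- `table` is a reserved word, so it is quoted with backticks.
-- On MySQL 8.0.4+ (ICU regex engine), use '\\bBook\\b' instead of the [[:<:]] and [[:>:]] markers.
select
SUBSTRING_INDEX(SUBSTR(`table`.data, LOCATE('title:', `table`.data)+6), '¬#¬', 1) regexp '[[:<:]]Book[[:>:]]' AS hasResult
from
`table`;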
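The "split first, then test" idea that replaces the negative look-ahead can be sketched in Python as follows. The separator and the title: prefix come from the question; the function name and the default word are illustrative assumptions.

```python
import re

FIELD_SEPARATOR = "¬#¬"


def title_has_word(data, word="Book"):
    """Isolate the title field, then test it for a whole-word match."""
    for field in data.split(FIELD_SEPARATOR):
        if field.startswith("title:"):
            title = field[len("title:"):]
            # \b marks a word boundary, like [[:<:]] and [[:>:]] in older MySQL.
            return re.search(r"\b" + re.escape(word) + r"\b", title) is not None
    return False


if __name__ == "__main__":
    row = "title:The Book Title¬#¬description:The Book Description¬#¬Price:$10.57"
    print(title_has_word(row))
```

Once the title field has been isolated, a match can no longer spill over into the description, which is what the look-ahead was guarding against.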

Select distinct records based on regexp pattern

I have a db with domains.
I need to pull the domain suffixes and create a list of them (.com, .net, .org ...).
I've found that regexp patterns may help me. The only thing I can't work out is how to filter those domains on the pattern and on uniqueness in order to get my list.
Here's my query:
$qry="select * from domain where domain_name REGEXP '[[.period.]][a-z]+'";
How should I add the unique criteria to it?
Thank you.
UPDATE:
Here's the working query:
SELECT DISTINCT SUBSTRING_INDEX(domain_name, '.', -1) FROM domains WHERE domain_name REGEXP '[[.period.]][a-z]+'
Before version 8.0, MySQL has no construct to substitute using regular expressions, or to access matching groups. So regular expressions likely won't help you. Perhaps the SUBSTRING_INDEX function is more useful for you, as you can use that to extract the part after the final dot, using
SUBSTRING_INDEX(domain_name, '.', -1) AS tld
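For comparison, the same "text after the final dot, de-duplicated" logic looks like this in Python. The domain names in the usage example are hypothetical, and distinct_suffixes is an illustrative name.

```python
def distinct_suffixes(domain_names):
    """Return the sorted set of suffixes (text after the last dot, with the dot)."""
    suffixes = set()
    for name in domain_names:
        _head, dot, suffix = name.rpartition(".")
        if dot and suffix:  # skip names without a dot or with a trailing dot
            suffixes.add("." + suffix.lower())
    return sorted(suffixes)


if __name__ == "__main__":
    print(distinct_suffixes(["example.com", "example.net", "sub.example.com"]))
```

The set plays the role of DISTINCT, and the dot check stands in for the REGEXP filter in the WHERE clause.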

Going through the position of a specific character using a MySQL SELECT statement

sorry for the title..
My problem is how to get a specific part of a URL using a MySQL SELECT statement. For example, take the URL
http://www.google.com/search?q=lpol&ie=utf-8&oe=utf-8&client=ubuntu&channel=fs
and
http://www.google.com/search?q=query+to+count+specific+character&ie=utf-8&oe=utf-8&client=ubuntu&channel=fs#hl=fil&client=ubuntu&channel=fs&sa=X&ei=J1knUPu9GsiUiAe3xYB4&ved=0CEQQvwUoAQ&q=mysql+query+to+go+through+specific+character+position&spell=1&bav=on.2,or.r_gc.r_pw.r_qf.&fp=c4fd06cd155ee554&biw=1014&bih=424
These two are different URLs, but they both contain google.com. How can I extract google.com so that a MySQL SELECT statement counts these two URLs as one?
The SUBSTRING_INDEX function should work
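SELECT SUBSTRING_INDEX('http://www.mysql.com/abcd/asd/', '/', 3); -- returns 'http://www.mysql.com'
Wrap it in a second call to drop the scheme:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('http://www.mysql.com/abcd/asd/', '/', 3), '/', -1); -- returns 'www.mysql.com'
Use this in combination with the column that you have (column and table are placeholders, quoted because both are reserved words).
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(`column`, '/', 3), '/', -1) FROM `table`;
The count of 3 is correct here, because the two slashes in 'http://' are the first two delimiters.
Good luck.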
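If the counting can happen in application code, a rough Python equivalent uses the standard library's URL parser instead of counting slashes. The shortened URLs in the usage example are illustrative, and count_hosts is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Count URLs per host name; URLs without a scheme end up under None."""
    return Counter(urlsplit(url).hostname for url in urls)


if __name__ == "__main__":
    sample_urls = [
        "http://www.google.com/search?q=lpol&ie=utf-8",
        "http://www.google.com/search?q=mysql#hl=fil",
    ]
    print(count_hosts(sample_urls))
```

Both URLs share the host www.google.com, so they are counted together regardless of their paths, query strings or fragments.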