Capture groups in mysql regexp - mysql

I have a table with a varchar column that represents a path. I want to search for rows whose path follows a pattern like name.name[*], where name can be anything. That is, I am looking for a repeated string anywhere in the path column, where the two copies are separated by a period and the second copy is followed by a square bracket.
This seems to call for Regexp, so through python I have something like https://regex101.com/r/apS20a/4
However, implementing this with MySQL REGEXP is not working. I have been able to translate the shorthand into REGEXP '([A-Za-z_]+).(\1[[0-9]+])', but it seems that MySQL's REGEXP does not support capture groups. Is there a way to accomplish what I am trying to do with MySQL REGEXP? Thank you
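For reference, the behaviour being asked for can be sketched in Python, whose re module supports backreferences. The pattern and the sample paths below are illustrative assumptions, not the contents of the regex101 link:

```python
import re

# A name, a literal period, the same name again (backreference \1),
# then an opening square bracket.
REPEATED_NAME = re.compile(r"([A-Za-z_]+)\.\1\[")

def has_repeated_name(path):
    """Return True if path contains name.name[ for some name."""
    return REPEATED_NAME.search(path) is not None

# Hypothetical sample paths, not taken from the question.
for path in ["root.item.item[0]", "root.item.other[0]"]:
    print(path, has_repeated_name(path))
```

A reference implementation like this is mainly useful for checking whether a SQL workaround returns the same rows.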

I don't think that MySQL supports capture groups. But if the name you care about is the text between the first [ and the last . that precedes it, you can hack your way around it. This is not a general solution, just a specific approach in this case.
You can get the name with:
select substring_index(substring_index(url, '[', 1), '.', -1) as name
And then incorporate this into a pattern match:
select t.*
from (select t.*,
             substring_index(substring_index(url, '[', 1), '.', -1) as name
      from t
     ) t
where url like concat('%', name, '.', name, '[%');
This just uses like instead of regexp, because [ and . have special meanings in regular expressions. Of course, this assumes that name does not contain _ or %, which are wildcards in like patterns.
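To see what the nested substring_index calls extract, here is a minimal Python sketch of the same steps. The helper substring_index is a hypothetical stand-in written for this example, and the plain substring test is case-sensitive and has no wildcards, so it only approximates like:

```python
def substring_index(s, delim, count):
    """Rough Python stand-in for MySQL's SUBSTRING_INDEX (an assumption)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything after the count-th delimiter from the right
    return ""

def matches_like(url):
    # Text before the first '[', then the part after its last '.'.
    name = substring_index(substring_index(url, "[", 1), ".", -1)
    # Approximates: url like concat('%', name, '.', name, '[%')
    return f"{name}.{name}[" in url

print(matches_like("abcde.de[12345.345[ABC"))
print(matches_like("abcdefdef[[[[..123.124["))
```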
EDIT:
Here is a method that actually identifies when this occurs -- and works even if there are multiple patterns.
The idea is to construct the regular expression based on what happens between the . and [ -- and then to apply it. Delightfully self-referential:
select t.*,
       (url regexp regex)
from (select t.*,
             substr(regexp_replace(url, '[^.]*[.]([^\\[]*)\\[[^.]*', '|$1[.]$1\\\\['), 2) as regex
      from (select 'abcde.de[12345.345[ABC' as url union all
            select 'abcdefdef[[[[..123.124['
           ) t
     ) t;
Here is the above in a db<>fiddle.
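The same idea can be sketched in Python: collect every piece of text that sits between a . and the next [, build one alternation from those pieces, and test the string against it. This is an approximation of the query above rather than an exact translation. The function names are made up for the example, and re.escape is an addition that the SQL version does not have:

```python
import re

# Same shape as the pattern passed to regexp_replace: the capture group is
# whatever sits between a '.' and the next '['.
BETWEEN = re.compile(r"[^.]*[.]([^\[]*)\[[^.]*")

def build_regex(url):
    """Build 'name\\.name\\[' alternatives from the url itself."""
    names = [m.group(1) for m in BETWEEN.finditer(url)]
    # re.escape keeps captured text from being read as regex syntax.
    return "|".join(re.escape(n) + r"\." + re.escape(n) + r"\[" for n in names)

def has_repeat(url):
    regex = build_regex(url)
    # An empty regex means no '.name[' was found at all.
    return bool(regex) and re.search(regex, url) is not None

for url in ["abcde.de[12345.345[ABC", "abcdefdef[[[[..123.124["]:
    print(url, has_repeat(url))
```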

Related

How to find variable pattern in MySql with Regex?

I am trying to pull a product code from a long string formatted like a URL. The pattern is always 3 letters followed by 3 or 4 digits (e.g. ???### or ???####). I have tried using REGEXP and LIKE syntax, but my results are off for both, and I am not sure which operators to use.
The first select statement is close to trimming the URL to show just the code, but oftentimes will show a random string of numbers it may find in the URL string.
The second select statement is more rudimentary, but I am unsure which operators to use.
Which would be the quickest solution?
SELECT columnName, SUBSTR(columnName, LOCATE(columnName REGEXP "[^=\-][a-zA-Z]{3}[\d]{3,4}", columnName), LENGTH(columnName) - LOCATE(columnName REGEXP "[^=\-][a-zA-Z]{3}[\d]{3,4}", REVERSE(columnName))) AS extractedData FROM tableName
SELECT columnName FROM tableName WHERE columnName LIKE '%___###%' OR columnName LIKE '%___####%'
-- Will take a substring of this result as well
Example Data:
randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz123&hello_world=us&etc_etc
In this case, the desired string is "xyz123" and the location of said pattern is variable based on each entry.
EDIT
SELECT column, LOCATE(column REGEXP "([a-zA-Z]{3}[0-9]{3,4}$)", column), SUBSTR(column, LOCATE(column REGEXP "([a-zA-Z]{3}[0-9]{3,4}$)", column), LENGTH(column) - LOCATE(column REGEXP "^.*[a-zA-Z]{3}[0-9]{3,4}", REVERSE(column))) AS extractData From mainTable
This expression is still not grabbing the right data, but I feel like it may get me closer.
I suggest using
REGEXP_SUBSTR(column, '(?<=[&?]random_code=[^&#]{0,256}-)[a-zA-Z]{3}[0-9]{3,4}(?![^&#])')
Details:
(?<=[&?]random_code=[^&#]{0,256}-) - immediately on the left, there must be & or ?, random_code=, and then zero to 256 chars other than & and # followed with a - char
[a-zA-Z]{3} - three ASCII letters
[0-9]{3,4} - three to four ASCII digits
(?![^&#]) - that are followed either with &, # or end of string.
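The pieces listed above can also be tried outside MySQL. Python's re module does not accept a variable-length lookbehind, so this sketch swaps the lookbehind for ordinary leading context plus a capture group. The function name extract_code is an assumption made for the example:

```python
import re

PRODUCT_CODE = re.compile(
    r"[&?]random_code=[^&#]{0,256}-"  # context that must precede the code
    r"([a-zA-Z]{3}[0-9]{3,4})"        # three ASCII letters, three or four ASCII digits
    r"(?![^&#])"                      # followed by &, # or end of string
)

def extract_code(url):
    """Return the product code, or None when the pattern does not match."""
    m = PRODUCT_CODE.search(url)
    return m.group(1) if m else None

sample = ("randomwebsite.com/3982356923abcd1ab?random_code="
          "12480712_ABC_DEF_ANOTHER_CODE-xyz123&hello_world=us&etc_etc")
print(extract_code(sample))
```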
See the online demo:
WITH cte AS ( SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz123&hello_world=us&etc_etc' val
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz4567&hello_world=us&etc_etc'
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz89&hello_world=us&etc_etc'
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz00000&hello_world=us&etc_etc'
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-aaaaa11111&hello_world=us&etc_etc')
SELECT REGEXP_SUBSTR(val,'(?<=[&?]random_code=[^&#]{0,256}-)[a-zA-Z]{3}[0-9]{3,4}(?![^&#])') output
FROM cte
I'd make use of capture groups:
(?<=[=\-\\])([a-zA-Z]{3}[\d]{3,4})(?=[&])
I assume that with [^=\-] you wanted to match a string with "-", "\" or "=" in front of it, without including those characters in the result. To do that, use a positive lookbehind, (?<=...).
I also added a lookahead, (?=...), for "&".
If you'd like to fiddle more with regex, I recommend RegExr.
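Because this lookbehind has a fixed width of one character, the pattern can be tried as-is in Python. A small sketch, with find_code as a made-up helper name:

```python
import re

# Lookbehind: preceded by '=', '-' or a backslash.
# Lookahead: followed by '&'. Neither is part of the match itself.
CODE = re.compile(r"(?<=[=\-\\])([a-zA-Z]{3}\d{3,4})(?=&)")

def find_code(url):
    m = CODE.search(url)
    return m.group(1) if m else None

sample = ("randomwebsite.com/3982356923abcd1ab?random_code="
          "12480712_ABC_DEF_ANOTHER_CODE-xyz123&hello_world=us&etc_etc")
print(find_code(sample))
```

Note that the (?=&) lookahead means a code at the very end of the string, with no & after it, is not matched.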

GROUP BY multiple text matches within one column

Given data like:
URL
some_url.com
some_url.com
some_url.co.uk
some_other_url.com
some_other_url.co.uk
some_other_url.co.uk
some_other_url.org
is there a way to construct a query that will result in:
some_url 3
some_other_url 4
Currently I'm either using a standard group by url or I query the aggregations one by one using LIKE
Is there a way to do this in one query? (using mysql currently, but will be moving this data over to postgresql)
Would it be better practice to add a column to reflect this grouping (at insert time)? (this feels redundant but would be best performing I guess)
EDIT:
data can contain www and non-www as well as http, https. Also I'll have to do similar thing on other columns that contain (free) text values.
This is ANSI SQL compliant and should probably work with both MySQL and Postgresql:
select url, count(*)
from
(
select substring(url from 1 for position('.' in url) -1) as url
from tablename
) dt
group by url
Using position() to find the first . character. Do substring() and finally GROUP BY the result.
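As a cross-check of what the query should return, the same grouping can be sketched in Python on the sample rows from the question. The helper name prefix is an assumption, and unlike the SQL, str.partition returns the whole string when there is no period:

```python
from collections import Counter

urls = [
    "some_url.com",
    "some_url.com",
    "some_url.co.uk",
    "some_other_url.com",
    "some_other_url.co.uk",
    "some_other_url.co.uk",
    "some_other_url.org",
]

def prefix(url):
    """Text before the first '.', like substring(url from 1 for position('.' in url) - 1)."""
    return url.partition(".")[0]

counts = Counter(prefix(u) for u in urls)
for name, n in counts.items():
    print(name, n)
```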
Use SUBSTRING_INDEX in MySQL, which returns the substring before a specified number of occurrences of the delimiter.
select count(*) as cnt, SUBSTRING_INDEX(c,'.',1) as val from cte
group by SUBSTRING_INDEX(c,'.',1)
Since the values can have http, https and www, and may include a query string too, you will have to clean all such values first before grouping them. I took the reference from here and modified it to match your requirement.
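-- url itself is left out of the select list: it is not functionally dependent on
-- domain, so selecting it fails when ONLY_FULL_GROUP_BY is enabled.
SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           SUBSTRING_INDEX(
             SUBSTRING_INDEX(
               SUBSTRING_INDEX(
                 SUBSTRING_INDEX(url, '/', 3),
               '://', -1),
             '/', 1),
           '?', 1),
         'www.', -1),
       '.', 1) AS domain,
       COUNT(1)
FROM tblname
GROUP BY domain;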
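The cleaning steps are easier to follow one at a time. Here is a rough Python sketch of the same sequence (drop the scheme, the path, the query string and a leading www., then keep the first label). The function name domain_key and the sample URLs are assumptions made for illustration:

```python
from collections import Counter

def domain_key(url):
    """Reduce a URL-like string to the first label of its host name."""
    host = url.split("://")[-1]   # drop http:// or https:// if present
    host = host.split("/")[0]     # drop the path
    host = host.split("?")[0]     # drop a query string attached to the host
    if host.startswith("www."):
        host = host[4:]           # drop a leading www.
    return host.split(".")[0]     # keep the text before the first '.'

# Hypothetical rows mixing schemes, www and query strings.
rows = [
    "https://www.some_url.com/path?x=1",
    "http://some_url.co.uk",
    "www.some_other_url.org",
    "some_other_url.com?q=1",
]
print(Counter(domain_key(r) for r in rows))
```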
This works in PostgreSQL:
select split_part(url,'.',1) g,count(*)
from url_table
group by g
order by g;
Best regards,
Bjarni

Regex bring just the match MySQL

I am trying to get just the first two characters in a SQL query. I am using the pattern ^\w{2}- but with no success: nothing is returned. I need to get these values:
BA, CE, DF, ES, GO. I don't know how to do that. Some example data is below.
SC&Tipo=FM
SC&Tipo=Web
SC&Tipo=Comunitaria
RS&Tipo=Todas
RS&Tipo=AM
RS&Tipo=FM
RS&Tipo=Web
RS&Tipo=Comunitaria
BA-Salvador&Tipo=12horas
CE-Fortaleza&Tipo=12horas
CE-Interior&Tipo=12horas
DF-Brasilia&Tipo=12horas
ES-Interior&Tipo=12horas
ES-Vitoria&Tipo=12horas
GO-Goiania&Tipo=12horas
MG-ZonaDaMata/LestedeMinas&Tipo=12horas
MG-AltoParanaiba&Tipo=12horas
MG-BeloHorizonte&Tipo=12horas
MG-CentroOestedeMinas&Tipo=12horas
Query: SELECT * FROM tabel WHERE filter REGEXP '^\w{2}-'
EDIT SOLVED:
To solve the query should be:
SELECT SUBSTRING(column, 1, 2) AS column FROM table WHERE column REGEXP '^[[:alnum:]_]{2}-'
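MySQL before 8.0 doesn't support the character classes \w or \d. In any version, a single backslash in a string literal is consumed by the string parser, so '\w' reaches the regex engine as a plain w. Instead of \w you can use [[:alnum:]] (add _ inside the brackets if underscores should match too, since \w includes them). You can find all the supported character classes in the official MySQL documentation.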
So you can use the following solution using REGEXP:
SELECT *
FROM table_name
WHERE filter REGEXP '^[[:alnum:]]{2}-'
You can use the following to get the result with regular expression too, using REGEXP_SUBSTR:
SELECT REGEXP_SUBSTR(filter, '^[[:alnum:]]{2}-')
FROM table_name
WHERE filter REGEXP '^[[:alnum:]]{2}-';
Or another solution using HAVING to filter the result:
SELECT REGEXP_SUBSTR(filter, '^[[:alnum:]]{2}-') AS colResult
FROM table_name
HAVING colResult IS NOT NULL;
On MySQL versions before 8.0, where REGEXP_SUBSTR is not available, you can use LEFT instead:
SELECT LEFT(filter, 3)
FROM table_name
WHERE filter REGEXP '^[[:alnum:]]{2}-';
demo: https://www.db-fiddle.com/f/7mJEmCkEiYhCYK3PcEZTNE/0
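To check which of the sample rows should match, here is a small Python sketch of the same filter. It approximates [[:alnum:]] with ASCII letters and digits, which is an assumption, and returns the two-character code without the hyphen:

```python
import re

# Two alphanumeric characters at the start, followed by a hyphen.
PREFIX = re.compile(r"^[A-Za-z0-9]{2}-")

def state_code(value):
    """Return the first two characters if the row matches, else None."""
    m = PREFIX.match(value)
    return m.group(0)[:2] if m else None

for row in ["SC&Tipo=FM", "BA-Salvador&Tipo=12horas", "CE-Fortaleza&Tipo=12horas"]:
    print(row, state_code(row))
```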
Using SUBSTRING(<column>, 1, 2) should also work,
more or less like below:
SELECT
<column>
, SUBSTRING(<column>, 1, 2)
FROM
<table>
WHERE
SUBSTRING(<column>, 1, 2) IN ('BA' [,<value>..])
Some parts of the SQL code above use BNF (Backus-Naur form) notation.
<..> means replace with what you need.
[, ..] means an optional part that can be repeated any number of times; the comma inside it is part of the SQL syntax.

MySQL - need to find records without a period in them

I've been to the regexp page on the MySQL website and am having trouble getting the query right. I have a list of links and I want to find invalid links that do not contain a period. Here's my code that doesn't work:
select * from `links` where (url REGEXP '[^\\.]')
It's returning all rows in the entire database. I just want it to show me the rows where 'url' doesn't contain a period. Thanks for your help!
SELECT c1 FROM t1 WHERE c1 NOT LIKE '%.%'
Your regexp matches anything that contains a character that isn't a period. So if it contains foo.bar, the regexp matches the f and succeeds. You can do:
WHERE url REGEXP '^[^.]*$'
The anchors and repetition operator make this check that every character is not a period. Or you can do:
WHERE LOCATE('.', url) = 0
BTW, you don't need to escape . when it's inside [] in a regexp.
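The difference between the two patterns is easy to see in a quick Python check (the function names are made up for the example):

```python
import re

def contains_non_period(url):
    # Succeeds as soon as any single character other than '.' is found.
    return re.search(r"[^.]", url) is not None

def has_no_period(url):
    # Anchored: every character from start to end must be a non-period.
    return re.search(r"^[^.]*$", url) is not None

for url in ["foo.bar", "foobar"]:
    print(url, contains_non_period(url), has_no_period(url))
```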
Using regexp seems like overkill here. A simple like operator would do the trick:
SELECT * FROM `links` WHERE url NOT LIKE '%.%';
EDIT:
Having said that, if you really want to negate regexp, just use not regexp:
SELECT * FROM `links` WHERE url NOT REGEXP '[\\.]';

Group by substring

I have a field with text like "/site/index?sid=18&sub=321333&tid=site.net&ukey=1234543254".
How can I group by part of the string (e.g. the 'sid' URL parameter)?
The parameters may appear in a different order (sid at the end of the line, and so on).
Take a look at the MySQL string functions:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
Especially this looks helpful:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substring-index
UPDATE
This is exactly what you asked for:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX("/site/index?sid=18&sub=321333&tid=site.net&ukey=1234543254", 'sid=', -1), '&', 1) AS this_will_be_grouped
and use this_will_be_grouped in the GROUP BY clause of your query
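If the grouping can be done in application code instead, parameter order stops mattering once the query string is parsed. A minimal Python sketch using the standard library, where sid_of and the second sample row are assumptions made for illustration:

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def sid_of(url):
    """Return the value of the sid parameter, or None if it is absent."""
    query = urlsplit(url).query
    values = parse_qs(query).get("sid")
    return values[0] if values else None

rows = [
    "/site/index?sid=18&sub=321333&tid=site.net&ukey=1234543254",
    "/site/index?sub=5&ukey=9&sid=18",  # hypothetical row with sid last
]
print(Counter(sid_of(r) for r in rows))
```

Parsing by parameter name also avoids matching a different parameter whose name merely ends in sid.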