Select distinct records based on regexp pattern - mysql

I have a db with domains.
I need to pull the domains suffixes and create a list of those suffixes. (.com, .net, .org ...)
I've found that regexp patterns may help me. The only thing I can't make is to filter those domains based on the pattern + uniqueness, in order to get my list.
Here's my query:
$qry="select * from domain where domain_name REGEXP '[[.period.]][a-z]+'";
How should I add the unique criteria to it?
Thank you.
UPDATE:
Here's the working query:
SELECT DISTINCT SUBSTRING_INDEX(domain_name, '.', -1) FROM domains WHERE domain_name REGEXP '[[.period.]][a-z]+'

MySQL has no construct to substitute using regular expressions, or to access matching groups. So regular expressions likely won't help you. Perhaps the SUBSTRING_INDEX function is more useful for you, as you can use that to extract the part after the final dot, using
SUBSTRING_INDEX(domain_name, '.', -1) AS tld

Related

Excluding records using regex

I'm trying to get exclude the email_id that has any name and end with either #abcd.in or #abcd.live and only include the email's having mobile numbers, but not sure if this is the correct regex I'm using, can you help?
the statement I'm using to filter is below
(NOT(lower(`table`.`user_email`) like '[a-z].*#Abcd.in$'|'[a-z].*#Abcd.live$')
If you want to do filtering based on a regular expression, you should be using REGEXP or REGEXP_LIKE (both are synonyms). Assuming you just want to exclude the two domains mentioned, you could use:
SELECT *
FROM yourTable
WHERE email NOT REGEXP '[a-z]+#Abcd\.(in|live)$';
Assuming you wanted to enhance the above by also whitelisting certain email patterns, you could make another call to REGEXP.
I'll probably do something like this:
SELECT *
FROM mytable
WHERE SUBSTRING_INDEX(user_email,'#',-1) IN ('Abcd.live','Abcd.in')
AND SUBSTRING_INDEX(user_email,'#',1) REGEXP '[0-9]'
Using SUBSTRING_INDEX() to separate the email name and domain by using # as delimiter. The first condition is simply just filtering the domain with IN so other than the ones being defined, it will be omitted. Then the second condition is using REGEXP to check if numerical values are present in the email name.
Demo fiddle

How to pass multiple delimeters in substring_index

I want to query the string between https:// or http:// and the first delimeter characters that comes after it. For example, if the field contains:
https://google.com/en/
https://www.yahoo.com?en/
I want to get:
google.com
www.yahoo.com
My initial query that will capture the / only contains two substring_index as follows:
SELECT substring_index(substring_index(mycol,'/',3),'://',-1)
FROM mytable;
Now I found that the URLs may contain multiple delimeters. I want my statament to capture multiple delimeters possibilities which are (each one is a separate character):
:/?#[]#!$&'()*+,;=
How to do this in my statement? I tried this solution but the end result the command could not be executed due to syntax error while I am sure I followed the solution. Can anyone help me correctly construct the query to capture all the delimeter characters I listed above?
I use MySQL workbecnh 6.3 on Ubuntu 18.04.
EDIT:
Some corrections made in the first example of URLs.
First, note that https://www.yahoo.com?en/ seems like an unlikely URL, because it has a path separator contained inside the query string. In any case, if you are using MySQL 8+, then consider using its regex functionality. The REGEXP_REPLACE function can be helpful here, using the following pattern:
https?://([A-Za-z_0-9.-]+).*
Sample query:
WITH yourTable AS (
SELECT 'https://www.yahoo.com?en/' AS url UNION ALL
SELECT 'no match'
)
SELECT
REGEXP_REPLACE(url, 'https?://([A-Za-z_0-9.-]+).*', '$1') AS url
FROM yourTable
WHERE url REGEXP 'https?://[^/]+';
Demo
The term $1 refers to the first capture group in the regex pattern. An explicit capture group is denoted by a quantity in parentheses. In this case, here is the capture group (highlighted below):
https?://([A-Za-z_0-9.-]+).*
^^^^^^^^^^^^^^^
That is, the capture group is the first portion of the URL path, including domain, subdomain, etc.
In MySQL 8+, this should work:
SELECT regexp_replace(regexp_substr(mycol, '://[a-zA-Z0-9_.]+[/:?]'), '[^a-zA-Z0-9_.]', '')
FROM (SELECT 'https://google.com/en' as mycol union all
SELECT 'https://www.yahoo.com?en'
) x
In older versions, this is much more challenging because there is no way to search for a string class.
One brute force method is:
select (case when substring_index(mycol, '://', -1) like '%/%'
then substring_index(substring_index(mycol, '://', -1), '/', 1)
when substring_index(mycol, '://', -1) like '%?%'
then substring_index(substring_index(mycol, '://', -1), '?', 1)
. . . -- and so on for each character
else substring_index(mycol, '://', -1)
end) as what_you_want
The [a-zA-Z0-9_.] is intended to be something like the valid character class for your domain names.

Extract email address from mysql field

I have a longtext column "description" in my table that sometimes contains an email address. I need to extract this email address and add to a separate column for each row. Is this possible to do in MySQL?
Yes, you can use mysql's REGEXP (perhaps this is new to version 5 and 8 which may be after this question was posted.)
SELECT *, REGEXP_SUBSTR(`description`, '([a-zA-Z0-9._%+\-]+)#([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})') AS Emails FROM `mytable`;
You can use substring index to capture email addresses...
The first substring index capture the account.
The second substring_index captures the hostname. It is necessary to pick the same email address in case the are multiple atso (#) stored in the column.
select concat( substring_index(substring_index(description,'#',1),' ',-1)
, substring_index(substring_index( description,
substring_index(description,'#',1),-1),
' ',1))
You can't select matched part only from Regular expression matching using pure Mysql. You can use mysql extension (as stated in Return matching pattern, or use a scripting language (ex. PHP).
MySQL does have regular expressions, but regular expressions are not the best way to match email addresses. I'd strongly recommend using your client language.
If you can install the lib_mysqludf_preg MySQL UDF, then you could do:
SET #regex = "/([a-z0-9!#\$%&'\*\+\/=\?\^_`\{\|\}~\-]+(?:\.[a-z0-9!#\$%&'\*\+\/=\?^_`{\|}~\-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|aero|arpa|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|post|pro|tel|travel|xxx))/i";
SELECT
PREG_CAPTURE(#regex, description)
FROM
example
WHERE
PREG_CAPTURE(#regex, description) > '';
to extract the first email address from the description field.
I can't think of another solution, as the REGEXP operator simply returns 1 or 0, and not the location of where the regular expression matched.

MySQL regex only returns a single row

I have been writing a REGEX in MySQL to identify those domains that have a .com TLD. The URLs are usually of the form
http://example.com/
The regex I came up with looks like this:
REGEXP '[[.colon.]][[.slash.]][[.slash.]]([:alnum:]+)[[...]]com[[./.]]'
The reason we match the :// is so that we don't pick up URLs such as http://example.com/error.com/wrong.com
Therefore my query is
SELECT DISTINCT name
FROM table
WHERE name REGEXP '[[.colon.]][[.slash.]][[.slash.]]([:alnum:]+)[[...]]com[[./.]]'"
However, this is returning only a single row when it should really be returning many more (upwards of a thousand). What mistake am I making with the query?
Not sure if that's the problem, but it should be [[:alnum:]], not [:alnum:]
Your current query only matches names that end with .com/ rather than .com followed by anything that starts with a slash. Try the following:
SELECT DISTINCT name
FROM table
WHERE name REGEXP '[[.colon.]][[.slash.]][[.slash.]]([:alnum:]+)[[...]]com([[./.]].*)?'"
It might be clearer to split the URL rather than regexing it
SELECT DISTINCT name FROM table
WHERE SUBSTRING_INDEX((SUBSTRING_INDEX(name,'/',3),'.',-1)='com';

How to select all distinct filename extensions from table of filenames?

I have a table of ~20k filenames. How do I select a list of the distinct extensions? A filename extension can be considered the case insensitive string after the last .
You can use substring_index:
SELECT DISTINCT substring_index(column_containing_file_names,'.',-1) FROM table
-1 means it will start searching for the '.' from the right side.
there is A very cool and powerful capability in MySQL and other databases is the ability to incorporate regular expression syntax when selecting data example
SELECT something FROM table WHERE column REGEXP 'regexp'
see this http://www.tech-recipes.com/rx/484/use-regular-expressions-in-mysql-select-statements/
so you can write pattern to select what you want.
The answer given by #bnvdarklord is right but it would include file names which does not have extensions as well in result set, so if you want only extension patterns use below query.
SELECT DISTINCT substring_index(column_containing_file_names,'.',-1) FROM table where column_containing_file_names like '%.%';