GROUP BY multiple text matches within one column - mysql

Given data like:
URL
some_url.com
some_url.com
some_url.co.uk
some_other_url.com
some_other_url.co.uk
some_other_url.co.uk
some_other_url.org
Is there a way to construct a query that will result in:
some_url 3
some_other_url 4
Currently I'm either using a standard GROUP BY url or querying the aggregations one by one using LIKE.
Is there a way to do this in one query? (I'm using MySQL currently, but will be moving this data over to PostgreSQL.)
Would it be better practice to add a column that reflects this grouping at insert time? (This feels redundant, but I guess it would perform best.)
EDIT:
The data can contain www and non-www as well as http and https. I'll also have to do a similar thing on other columns that contain (free) text values.

This is ANSI SQL compliant and should probably work with both MySQL and Postgresql:
select url, count(*)
from
(
select substring(url from 1 for position('.' in url) -1) as url
from tablename
) dt
group by url
position() finds the first . character, substring() keeps everything before it, and the outer query groups by the result.

Use SUBSTRING_INDEX in MySQL, which returns the part of a string before a specified number of occurrences of a delimiter.
select count(*) as cnt, SUBSTRING_INDEX(c,'.',1) as val from cte
group by SUBSTRING_INDEX(c,'.',1)

Since the values can include http, https and www, and possibly a query string too, you will have to strip all of those before grouping. I took the reference from here and modified it to match your requirement.
SELECT
  SUBSTRING_INDEX(
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        SUBSTRING_INDEX(
          SUBSTRING_INDEX(
            SUBSTRING_INDEX(url, '/', 3),
          '://', -1),
        '/', 1),
      '?', 1),
    'www.', -1),
  '.', 1) AS domain,
  COUNT(1)
FROM tblname
GROUP BY domain;
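To make the nesting easier to follow, here is a minimal Python sketch that mimics the chain of SUBSTRING_INDEX calls step by step. The `substring_index` helper is a rough stand-in written for this illustration (it is not a MySQL API), and the sample URLs are made-up variants of the data in the question.

```python
from collections import Counter


def substring_index(s, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX; assumes a non-empty delimiter."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # A positive count keeps the left-hand pieces, a negative count the right-hand ones.
    return delim.join(parts[:count] if count > 0 else parts[count:])


def domain_key(url):
    step = substring_index(url, "/", 3)        # cut after scheme://host
    step = substring_index(step, "://", -1)    # drop the scheme, if any
    step = substring_index(step, "/", 1)       # drop a path when there was no scheme
    step = substring_index(step, "?", 1)       # drop a query string
    step = substring_index(step, "www.", -1)   # drop a leading www.
    return substring_index(step, ".", 1)       # keep the text before the first dot


# Hypothetical sample data: the question's URLs with schemes, www and paths mixed in.
urls = [
    "some_url.com",
    "http://some_url.com/page?x=1",
    "https://www.some_url.co.uk",
    "some_other_url.com",
    "www.some_other_url.co.uk",
    "http://www.some_other_url.co.uk/",
    "some_other_url.org?ref=a",
]

counts = Counter(domain_key(u) for u in urls)
for name, n in counts.items():
    print(name, n)
```

For these inputs the sketch groups three rows under some_url and four under some_other_url, matching the result asked for in the question.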

This works in PostgreSQL:
select split_part(url,'.',1) g,count(*)
from url_table
group by g
order by g;
Best regards,
Bjarni

Related

Group by variable substring in MySQL

I have a table that contains multiple fields - let's say FieldA, FieldB etc. and finally Location. The Location field has values such as:
http://192.168.1.10/location?n=5
http://192.168.1.10/location?n=8
http://192.168.15.6/location?n=1
http://192.168.0.9/location?n=11
http://192.168.15.6/location?n=5
http://192.168.0.9/location?n=6
http://192.168.1.10/location?n=2
I need to get the unique values of the IP addresses in the Location field. In other words, from the above example data, I should get
http://192.168.1.10
http://192.168.15.6
http://192.168.0.9
Based on this answer, I am using the following SQL - without much luck
SELECT * FROM `table` WHERE FieldA = 'Example' GROUP BY (SELECT SUBSTRING_INDEX("`table`.Location", "/", 3))
The above gives me just a single record. What am I doing wrong?
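Under the default SQL mode, the double-quoted "`table`.Location" in your query is a string literal rather than a column reference, so every row ends up in the same group. Making judicious use of SUBSTRING_INDEX on the column itself: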
SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(Location, '/', 3), '/', -1) AS distinct_ips
FROM yourTable;
For an explanation on how the above logic works, consider the location value http://192.168.1.10/location?n=5. The inner call to SUBSTRING_INDEX returns http://192.168.1.10, which is everything to the left of the third forward slash. Then, the outer call returns everything to the right of the last forward slash, which leaves us with the IP address.
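The two steps described above can be traced with a small Python sketch. The `substring_index` helper is a hypothetical stand-in for the MySQL function, written only for this illustration; the locations are the sample values from the question.

```python
def substring_index(s, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX; assumes a non-empty delimiter."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # A positive count keeps the left-hand pieces, a negative count the right-hand ones.
    return delim.join(parts[:count] if count > 0 else parts[count:])


location = "http://192.168.1.10/location?n=5"
inner = substring_index(location, "/", 3)   # everything left of the third slash
outer = substring_index(inner, "/", -1)     # everything right of the last slash
print(inner)
print(outer)

# Applying both steps to the sample rows and de-duplicating, like SELECT DISTINCT.
locations = [
    "http://192.168.1.10/location?n=5",
    "http://192.168.1.10/location?n=8",
    "http://192.168.15.6/location?n=1",
    "http://192.168.0.9/location?n=11",
    "http://192.168.15.6/location?n=5",
    "http://192.168.0.9/location?n=6",
    "http://192.168.1.10/location?n=2",
]
distinct_ips = {
    substring_index(substring_index(loc, "/", 3), "/", -1) for loc in locations
}
print(sorted(distinct_ips))
```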

Capture groups in mysql regexp

I have a table with a varchar column that represents a path. I want to search for rows that have a path that follow a pattern like name.name[*] where name can be anything. I am looking for repeated strings contained anywhere in the path column that are separated by a period and have a square bracket after them.
This seems to call for Regexp, so through python I have something like https://regex101.com/r/apS20a/4
However, trying to implement this with MySQL Regexp is not working. I have been able to translate the shorthand into REGEXP '([A-Za-z_]+).(\1[[0-9]+])', but it seems that MySql Regex does not support capture groups. Is there a way to accomplish what I am trying to do with mysql regexp? Thank you
I don't think that MySQL supports capture groups. But if you only have one example of .name[ in the string between the first . and the first [, you can hack your way around it. This is not a general solution, just a specific approach in this case.
You can get the name with:
select substring_index(substring_index(url, '[', 1), '.', -1) as name
And then incorporate this into a pattern match:
select t.*
from (select t.*,
substring_index(substring_index(url, '[', 1), '.', -1) as name
from t
) t
where url like concat('%', name, '.', name, '[%');
This uses LIKE instead of REGEXP because [ and . are special characters in regular expressions and would need escaping. Of course, this assumes that name does not contain _ or %, which are LIKE wildcards.
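As a rough illustration of the same idea outside the database, the sketch below extracts the name between the last . before the first [ and then checks for name.name[ with a plain substring test. The helper names and sample paths are invented for this example, and the substring test only approximates LIKE (it is case-sensitive and has no wildcards).

```python
def substring_index(s, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX; assumes a non-empty delimiter."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # A positive count keeps the left-hand pieces, a negative count the right-hand ones.
    return delim.join(parts[:count] if count > 0 else parts[count:])


def has_repeated_name(path):
    # Text before the first '[', then the piece after its last '.'.
    name = substring_index(substring_index(path, "[", 1), ".", -1)
    # Equivalent in spirit to: path LIKE CONCAT('%', name, '.', name, '[%')
    return f"{name}.{name}[" in path


print(has_repeated_name("root.item.item[3]"))    # repeated name before the first '['
print(has_repeated_name("root.item.other[3]"))   # no repetition
print(has_repeated_name("x.y[1].z.z[2]"))        # repetition only after the first '['
```

The last call shows why this is not a general solution: only the name in front of the first [ is ever examined, so a repetition later in the path goes unnoticed.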
EDIT:
Here is a method that actually identifies when this occurs -- and works even if there are multiple patterns.
The idea is to construct the regular expression based on what happens between the . and [ -- and then to apply it. Delightfully self-referential:
select t.*,
(url regexp regex)
from (select t.*,
substr(regexp_replace(url, '[^.]*[.]([^\\[]*)\\[[^.]*', '|$1[.]$1\\\\['), 2) as regex
from (select 'abcde.de[12345.345[ABC' as url union all
select 'abcdefdef[[[[..123.124['
) t
) t;
Here is the above in a db<>fiddle.

How to extract part of a Base64 encoded string in MySQL?

I have a field in my database which is encoded. After using from_base64 on the field it looks like this:
<string>//<string>//<string>/2017//06//21//<string>//file.txt
There may be an undetermined number of strings at the beginning of the path, however, the date (YYYY//MM//DD) will always have two fields to the right (a string followed by file extension).
I want to sort by this YYYY//MM//DD pattern and get a count for all paths with this date.
So basically I want to do this:
select '<YYYY//MM//DD portion of decoded_path>', count(*) from table group by '<YYYY//MM//DD portion of decoded_path>' order by '<YYYY//MM//DD portion of decoded_path>';
Summary
MySQL's SUBSTRING_INDEX is useful here: it looks for the specified delimiter and, when given a negative count, counts occurrences backwards from the end of the string.
Demo
Rextester demo: http://rextester.com/TCJ65469
SQL
SELECT datepart,
       COUNT(*) AS occurrences
FROM
  (SELECT CONCAT(
            LEFT(SUBSTRING_INDEX(txt, '//', -5), INSTR(SUBSTRING_INDEX(txt, '//', -5), '//') - 1),
            '/',
            LEFT(SUBSTRING_INDEX(txt, '//', -4), INSTR(SUBSTRING_INDEX(txt, '//', -4), '//') - 1),
            '/',
            LEFT(SUBSTRING_INDEX(txt, '//', -3), INSTR(SUBSTRING_INDEX(txt, '//', -3), '//') - 1))
          AS datepart
   FROM tbl) subq
GROUP BY datepart
ORDER BY datepart;
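The following Python sketch walks through what the query computes for each row. It assumes every separator is a double slash, uses a hypothetical `substring_index` stand-in for the MySQL function, and the sample paths are invented for illustration.

```python
from collections import Counter


def substring_index(s, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX; assumes a non-empty delimiter."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # A positive count keeps the left-hand pieces, a negative count the right-hand ones.
    return delim.join(parts[:count] if count > 0 else parts[count:])


def date_part(txt):
    pieces = []
    for count in (-5, -4, -3):                 # year, month, day counted from the end
        tail = substring_index(txt, "//", count)
        idx = tail.find("//")
        # Mirrors LEFT(tail, INSTR(tail, '//') - 1): the text before the first '//'.
        # If no '//' is found, this sketch treats the piece as empty.
        pieces.append(tail[:idx] if idx >= 0 else "")
    return "/".join(pieces)


# Hypothetical decoded paths with a varying number of leading segments.
paths = [
    "a//b//c//2017//06//21//d//file.txt",
    "x//2017//06//21//e//other.txt",
    "a//b//2017//06//22//f//third.txt",
]

occurrences = Counter(date_part(p) for p in paths)
for day in sorted(occurrences):
    print(day, occurrences[day])
```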
Assumptions
I have assumed for now that the single slash before the year in the example given in the question was a typo and should have been a double slash. (If it turns out this isn't the case, I'll update my answer.)
A little crazy, but it works on the sample string as written:
select REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE('<string>//<string>//<string>/2017//06//21//<string>//file.txt',"//","-"),"/",-1),"-<",1),"-","/"), count(*) from `chaissilist` group by REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE('<string>//<string>//<string>/2017//06//21//<string>//file.txt',"//","-"),"/",-1),"-<",1),"-","/") order by REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE('<string>//<string>//<string>/2017//06//21//<string>//file.txt',"//","-"),"/",-1),"-<",1),"-","/");

MYSQL: after using SUBSTRING_INDEX my data change/corrupt (rtl language)

After using SUBSTRING_INDEX on a column, the non-English text in the field (which is in Persian) gets changed or corrupted. I also checked the collation and charset, and both are UTF-8.
With English text it works like a charm, but with RTL languages it doesn't. Here is my record before the substring:
select group_id , rows from concat
Here is what I get after substring_index:
select group_id , SUBSTRING_INDEX(rows, ',', 1) as name from concat
It shows "A+3", but it should show "فثس".
Anyone know a solution?
Actually, I figured it out. After using SUBSTRING_INDEX, the field type changes to "MEDIUMBLOB", which causes the problem.
So I converted it back, and now it works:
select group_id , CONVERT(SUBSTRING_INDEX(rows, ',', 1), CHAR(1000)) as name from concat

Group by substring

I have a field with text like "/site/index?sid=18&sub=321333&tid=site.net&ukey=1234543254".
How can I group by part of the string (e.g. the 'sid' URL parameter)?
The parameters may be in a different order (sid at the end of the line, etc.).
Take a look at the MySQL string functions:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
Especially this looks helpful:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substring-index
UPDATE
This is exactly what you asked for:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX("/site/index?sid=18&sub=321333&tid=site.net&ukey=1234543254", 'sid=', -1), '&', 1) AS this_will_be_grouped
Then use this_will_be_grouped in the GROUP BY clause of your query, with your column in place of the string literal.
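Here is a minimal Python sketch of the same two-step extraction, showing that the position of sid in the query string does not matter. The `substring_index` helper is a hypothetical stand-in for the MySQL function, and all rows except the first are invented for illustration.

```python
from collections import Counter


def substring_index(s, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX; assumes a non-empty delimiter."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # A positive count keeps the left-hand pieces, a negative count the right-hand ones.
    return delim.join(parts[:count] if count > 0 else parts[count:])


def sid_of(url):
    after_sid = substring_index(url, "sid=", -1)   # text after the last 'sid='
    return substring_index(after_sid, "&", 1)      # cut at the next parameter


rows = [
    "/site/index?sid=18&sub=321333&tid=site.net&ukey=1234543254",
    "/site/index?sub=5&tid=site.net&sid=18",       # sid at the end of the line
    "/site/index?tid=site.net&sid=7&ukey=1",       # sid in the middle
]

counts = Counter(sid_of(r) for r in rows)
for sid, n in counts.items():
    print(sid, n)
```

One thing to keep in mind with this approach: any parameter whose name merely ends in sid= would match as well, so searching for '?sid=' and '&sid=' separately may be safer if such parameters can occur.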