Split string on token and aggregate on split words - mysql

I have a field that holds file names made up of 'words' separated by an underscore, _, such as this:
`file_name`
MY_NEW_MOVIE.mov
HD_VIDEO_720p.mov
720p_DISNEY_MOVIE.mov
LG_TYLERPERRY_FEATURE_HD_8CH_EN_L9714343_16X9_235_2398_FINAL_FRSUB.srt
And I want to split on _ and get the count of each word after the split, meaning:
`word` `count`
MY 1
NEW 1
MOVIE 2
HD 1
VIDEO 1
720p 2
DISNEY 1
Would it be possible/feasible to do this in SQL? So far I have just gotten the perfunctory "remove the file extension", but not sure how I could split on the token and then count that:
select left(file_name, length(file_name) - length(substring_index(file_name, '.', -1))-1) from asset
Additionally,

The result you want can be achieved with a query derived from this answer, which uses a generated numbers table along with SUBSTRING_INDEX to split out all the words in each file_name. This is then used as a derived table to count the occurrence of each word. Note the numbers table must have sufficient values to cover the maximum number of words in a filename (12 for this sample data).
SELECT word, COUNT(*)
FROM (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
    FROM (
        select 1 n union all
        select 2 union all select 3 union all select 4 union all
        select 5 union all select 6 union all select 7 union all
        select 8 union all select 9 union all select 10 union all
        select 11 union all select 12
    ) numbers
    JOIN asset ON LENGTH(file_name)
                - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w
GROUP BY word
Output (for your sample data):
word COUNT(*)
16X9 1
235 1
2398 1
720p 2
8CH 1
DISNEY 1
EN 1
FEATURE 1
FINAL 1
FRSUB 1
HD 2
L9714343 1
LG 1
MOVIE 2
MY 1
NEW 1
TYLERPERRY 1
VIDEO 1
Demo on dbfiddle
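To see why the join against the numbers table produces one row per word, here is a rough Python model of the query above. It is only a sketch: substring_index is a hypothetical helper that imitates MySQL's SUBSTRING_INDEX for a non-empty delimiter, and the file names are the sample data from the question.

```python
from collections import Counter

def substring_index(s: str, delim: str, count: int) -> str:
    """Rough Python model of MySQL's SUBSTRING_INDEX (assumes a non-empty delim)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of the count-th delimiter from the end
    return ""

file_names = [
    "MY_NEW_MOVIE.mov",
    "HD_VIDEO_720p.mov",
    "720p_DISNEY_MOVIE.mov",
    "LG_TYLERPERRY_FEATURE_HD_8CH_EN_L9714343_16X9_235_2398_FINAL_FRSUB.srt",
]
numbers = range(1, 13)  # the generated numbers table, n = 1..12

words = []
for name in file_names:
    stem = name[:-4]  # LEFT(file_name, LENGTH(file_name) - 4): assumes a 3-letter extension
    underscores = len(name) - len(name.replace("_", ""))
    for n in numbers:
        if underscores >= n - 1:  # the JOIN condition: keep only n up to the word count
            words.append(substring_index(substring_index(stem, "_", n), "_", -1))

word_counts = Counter(words)  # the GROUP BY word / COUNT(*) step
for word, count in sorted(word_counts.items()):
    print(word, count)
```

Without the join condition, any n beyond the last word would repeat the final word, which is why the condition compares n with the number of underscores.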

Assuming the filenames always have exactly three components, SUBSTRING_INDEX can get the job done here:
SELECT word, COUNT(*) AS count
FROM
(
SELECT SUBSTRING_INDEX(file_name, '_', 1) AS word FROM asset
UNION ALL
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '_', 2), '_', -1) FROM asset
UNION ALL
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '_', -1), '.', 1) FROM asset
) t
GROUP BY word;
Demo
Note: This answer was given based on the OP's original sample data, where all filenames had exactly three underscore-separated components. This answer will not work for the updated question.

Related

mysql - How to split comma separated text and create table

how to split comma separated string from one column and turn it into several columns?
this is my table:
SELECT id,lik FROM `tbl_users_posts` WHERE id=1;
id lik
-------------
1 10,11,12,13,14,15
how can i split 'lik' column and get this result?
id lik
-------------
1 10
1 11
1 12
1 13
1 14
1 15
That is, keep the id in the first column and show each piece of the 'lik' column in the second column, one per row.
Unfortunately, MySQL doesn't have a string-split function. One way is to create a temporary table of numbers, as follows, with at least as many rows as the largest number of values in any row:
create temporary table numbers as (
select 1 as n
union select 2 as n
union select 3 as n
union select 4 as n
union select 5 as n
union select 6 as n
union select 7 as n
union select 8 as n
);
Then you can use substring_index to accomplish the desired result
select id,
substring_index( substring_index(lik, ',', n),',', -1) as lik
from tbl_users_posts
join numbers on char_length(lik) - char_length(replace(lik, ',', '')) >= n - 1
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=84bc1b4e60a7feea5af0d0b568bc7bcb
Edit.
Another method, if you have MySQL 8+ and need to split the lik string into thousands of pieces without a loop, is to create a temporary table using a recursive CTE, as follows:
CREATE TEMPORARY TABLE numbers WITH RECURSIVE cte AS
( select 1 as n
union all
select n +1
from cte
limit 1000
)
SELECT * FROM cte;
And then use the same query as above:
select id,
substring_index( substring_index(lik, ',', n),',', -1) as lik
from tbl_users_posts
join numbers on char_length(lik) - char_length(replace(lik, ',', '')) >= n - 1
order by id asc;
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9231202418ce9b17aef8609ad6875fbe
If each lik value is an id that can be found in another table in the database, you can do:
select p.id, t.lik_id
from table_containing_lik t
join tbl_users_posts p on find_in_set(t.lik_id, p.lik)
where p.id=1;
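As a rough illustration of what that FIND_IN_SET join does, here is a small Python sketch. The find_in_set helper is a hypothetical model of the MySQL function (1-based position in a comma-separated list, 0 when absent), and the rows in table_containing_lik are made-up sample data.

```python
def find_in_set(needle, str_list: str) -> int:
    """Rough model of MySQL's FIND_IN_SET: 1-based position in the list, 0 if absent."""
    items = str_list.split(",")
    needle = str(needle)
    return items.index(needle) + 1 if needle in items else 0

# Hypothetical rows standing in for the two tables in the query above.
tbl_users_posts = [{"id": 1, "lik": "10,11,12,13,14,15"}]
table_containing_lik = [{"lik_id": i} for i in (9, 10, 12, 15, 16)]

# join ... on find_in_set(t.lik_id, p.lik) where p.id = 1
rows = [
    (p["id"], t["lik_id"])
    for t in table_containing_lik
    for p in tbl_users_posts
    if find_in_set(t["lik_id"], p["lik"]) > 0 and p["id"] == 1
]
print(rows)
```

Only values that exist in the other table come back, so this approach returns matches rather than a true split of the string.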

Separate field into different rows [duplicate]

I have a table called tableA, and the id field may store several comma-separated values in one record.
When I run the SQL "select id from tableA", I get the result below.
I do not know how to separate one row into several rows, like this:
27
19
7
18
...
Are there any hints for the SQL script? Thank you.
I'm not confident there isn't a nicer way, but maybe something like this is a good starting point:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(tablea.id, ',', numbers.row), ',', -1)
FROM
tablea INNER JOIN
-- This numbers subquery is taken from @Unreason's answer in https://stackoverflow.com/questions/304461/generate-an-integer-sequence-in-mysql
(
SELECT @row := @row + 1 AS row FROM
(select 0 union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) t,
(select 0 union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) t2,
(SELECT @row:=0) n) numbers
on numbers.row <= LENGTH(tablea.id)-LENGTH(REPLACE(tablea.id, ',', ''))+1
Given tablea.id='27,19,8', SUBSTRING_INDEX(SUBSTRING_INDEX(tablea.id, ',', 1), ',', -1) will return the first element (27) and SUBSTRING_INDEX(SUBSTRING_INDEX(tablea.id, ',', 2), ',', -1) will return the second element (19), and so forth.
Therefore I join that with a list of numbers (in this case my list goes up to 100, so I am assuming a single tablea.id field never has more than 100 comma-separated values, although the details could be changed). The join condition uses LENGTH(tablea.id)-LENGTH(REPLACE(tablea.id, ',', ''))+1 which is a count of how many comma-separated values are in the field.
Here's a db fiddle of it working to play around with.

mysql select distinct comma delimited values

I have a MySQL table:
id  cid  c_name    keywords
1   28   Stutgart  BW,Mercedes,Porsche,Auto,Germany
2   34   Roma      Sezar,A.S. Roma
3   28   München   BMW,Oktober Fest,Auto,Germany
I need a query that shows the keywords for cid=28, but with each keyword listed only once, like (BW,Mercedes,Porsche,Auto,Bmw,Oktober Fest,Germany).
I don't want a keyword listed twice. How can I solve this problem?
I have tried DISTINCT but could not get what I want.
Split it before adding it all up with DISTINCT. Of course, it is better to normalize your data (no more than one value per column).
SELECT
GROUP_CONCAT( DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(keywords, ',', n.digit+1), ',', -1)) keyword
FROM
t
INNER JOIN
(SELECT 0 digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6) n
ON LENGTH(REPLACE(keywords, ',' , '')) <= LENGTH(keywords)-n.digit
WHERE cid=28
See it working
If you want dynamic output, you can use the following query to get the distinct comma-delimited values in a single record.
Note: it doesn't matter how many values each comma-delimited row holds; the distinct values are collected from all the rows that match your condition.
$tag_list = DB::select('SELECT
TRIM(TRAILING "," FROM REPLACE(GROUP_CONCAT(DISTINCT keywords, ","),",,",",")) tag_list
FROM
test
WHERE id = 28');
// Read the tag_list alias from the first row returned above, then de-duplicate the values.
$unique_tags = implode(',', array_unique(explode(",", $tag_list[0]->tag_list)));
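The split-then-deduplicate idea can also be sketched outside the database. The following Python is only an illustration under stated assumptions: the rows list is the sample table from the question, and distinct_keywords is a hypothetical helper, not part of any answer above.

```python
rows = [
    {"cid": 28, "keywords": "BW,Mercedes,Porsche,Auto,Germany"},
    {"cid": 34, "keywords": "Sezar,A.S. Roma"},
    {"cid": 28, "keywords": "BMW,Oktober Fest,Auto,Germany"},
]

def distinct_keywords(rows, cid):
    """Split every matching row on commas and keep each keyword once, in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for row in rows:
        if row["cid"] == cid:
            for keyword in row["keywords"].split(","):
                seen.setdefault(keyword.strip(), None)
    return ",".join(seen)

print(distinct_keywords(rows, 28))
```

Note that this comparison is case-sensitive, whereas in MySQL the result of DISTINCT depends on the column's collation.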

calculating the word occurrence in mysql table

I am getting the reviews from different sites and storing into table. For each review I am getting adjective and noun list in separate column.
So for each review there are main 3 values here.
review, adjective_list, rate
Now I want to count the number of times each adjective repeats, and then recommend only those reviews whose adjectives repeat the most and that have a rating of 4-5.
Which is correct way to do this?
My thought about this:
Create a trigger that fires whenever a review is inserted.
The trigger reads the column holding the adjectives, calculates the occurrences (I don't know how) and stores the top adjectives with their occurrence counts.
For a recommendation, select the adjectives with the maximum occurrence and look at the reviews rated 4-5.
I am not sure what the correct way is. Any help is appreciated.
Main table looks like this:
Not tested, but if I understand your requirement correctly you should be able to base a query on something like this to do the job:-
SELECT id, SUBSTRING_INDEX(SUBSTRING_INDEX(adj_noun, ',', aCnt + 1), ',', -1), COUNT(*)
FROM Main_Table
INNER JOIN
(
SELECT Units.i + Tens.i * 10 + Hundreds.i * 100 AS aCnt
FROM
(SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) Units,
(SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) Tens,
(SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) Hundreds
) Integers_Query
ON aCnt <= (LENGTH(adj_noun) - LENGTH(REPLACE(adj_noun, ',', '')))
GROUP BY id, SUBSTRING_INDEX(SUBSTRING_INDEX(adj_noun, ',', aCnt + 1), ',', -1)
This uses a subquery to get a range of numbers (0 to 999) and joins it against your table where the number is less than or equal to the number of times a comma appears in the adj_noun column (i.e., the full length of adj_noun minus its length with all the commas removed). SUBSTRING_INDEX then gets the string up to comma number aCnt + 1, and a second SUBSTRING_INDEX takes the part after the last comma in that string (which excludes the commas from the result).
The COUNT / GROUP BY should get you the number of times each word appears in the resulting list for each item.
Probably fairly inefficient. Only copes with 1000 comma separated words (easily extended, but will be slower).
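For anyone unsure how three ten-row derived tables turn into 0 to 999, here is a small Python sketch of the same arithmetic. It only models the cross join; the variable names are assumptions chosen to mirror the aliases in the query.

```python
from itertools import product

digits = range(10)  # each derived table: SELECT 0 i UNION SELECT 1 ... UNION SELECT 9

# Cross join of Units, Tens and Hundreds, combined the same way as aCnt in the query.
a_cnt = sorted(
    units + tens * 10 + hundreds * 100
    for units, tens, hundreds in product(digits, repeat=3)
)

print(len(a_cnt), a_cnt[0], a_cnt[-1])
```

Each extra ten-row table multiplies the range by ten, which is the "easily extended, but will be slower" trade-off mentioned above.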

Counting word occurrences in a table column

I have a table with a varchar(255) field. I want to get (via a query, function, or SP) the number of occurrences of each word in a group of rows from this table.
If there are 2 rows with these fields:
"I like to eat bananas"
"I don't like to eat like a monkey"
I want to get
word | count()
---------------
like 3
eat 2
to 2
i 2
a 1
Any idea? I am using MySQL 5.2.
@Elad Meidar, I like your question and I found a solution:
SELECT SUM(total_count) as total, value
FROM (
SELECT count(*) AS total_count, REPLACE(REPLACE(REPLACE(x.value,'?',''),'.',''),'!','') as value
FROM (
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.sentence, ' ', n.n), ' ', -1) value
FROM table_name t CROSS JOIN
(
SELECT a.N + b.N * 10 + 1 n
FROM
(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
ORDER BY n
) n
WHERE n.n <= 1 + (LENGTH(t.sentence) - LENGTH(REPLACE(t.sentence, ' ', '')))
ORDER BY value
) AS x
GROUP BY x.value
) AS y
GROUP BY value
Here is the full working fiddle: http://sqlfiddle.com/#!2/17481a/1
First we run a query to extract all the words, as explained here by @peterm (follow his instructions if you want to change the maximum number of words processed). That query becomes a subquery: the middle query counts each raw word and strips accompanying punctuation with REPLACE, and the outer query groups again by the cleaned value and sums the counts, so that variants such as hello and hello! are counted together.
I would recommend not to do this in SQL at all. You're loading DB with something that it isn't best at. Selecting a group of rows and doing frequency calculation on the application side will be easier to implement, will work faster and will be maintained with less issues/headaches.
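A minimal sketch of the application-side approach, assuming the rows have already been fetched into a list of strings. The sentences are the two example rows from the question, and word_counts is a hypothetical helper name.

```python
from collections import Counter

sentences = [
    "I like to eat bananas",
    "I don't like to eat like a monkey",
]

def word_counts(rows):
    """Count case-insensitive word frequencies across all rows."""
    counts = Counter()
    for row in rows:
        for token in row.lower().split():
            word = token.strip("?.!,")  # drop leading/trailing punctuation only
            if word:
                counts[word] += 1
    return counts

counts = word_counts(sentences)
for word, count in counts.most_common():
    print(word, count)
```

Stripping only the surrounding punctuation keeps words such as "don't" intact, which is harder to express with nested REPLACE calls.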
You can try this perverted-a-little way:
SELECT
(LENGTH(field) - LENGTH(REPLACE(field, 'word', ''))) / LENGTH('word') AS `count`
FROM table_name -- placeholder table name: the original query omitted the FROM clause
ORDER BY `count` DESC
This query can be very slow. Also, it looks pretty ugly.
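To make the arithmetic in that query concrete, here is the same calculation as a Python sketch. count_substring is a hypothetical helper, and the inputs are the example rows from the question.

```python
def count_substring(field: str, word: str) -> int:
    # Same arithmetic as (LENGTH(field) - LENGTH(REPLACE(field, 'word', ''))) / LENGTH('word'):
    # the characters removed by the replace, divided by the length of the search string.
    return (len(field) - len(field.replace(word, ""))) // len(word)

print(count_substring("I don't like to eat like a monkey", "like"))  # 2
print(count_substring("I like to eat bananas", "a"))                 # 4
```

The second call shows the main caveat: this counts substring matches, so "a" is also found inside "eat" and "bananas", and it answers the question for one chosen word at a time rather than for every word.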
I think you should do it like indexing, with an additional table.
Whenever you create, update, or delete a row in your original table, you should update your indexing table. That indexing table should have two columns: the word and its number of occurrences.
I think you are trying to do too much with SQL if all the words are in one field of each row. I recommend to do any text processing/counting with your application after you grab the text fields from the db.