SQL compose bi-gram and search if exists in other table - mysql

In SQL, I have a table T1 that contains:
TITLE
age 5 alton john live
show must go on
and a table T2 that contains:
NAME | DESCRIPTION
John Bo | altonjohn for kids
Alton | show age5 mustgo kids
I would like to find bigrams (pairs of consecutive words) in TITLE (T1), check whether at least one bigram exists in DESCRIPTION (T2), and return the TITLE, the DESCRIPTION and the bigram.
Expected output:
TITLE | DESCRIPTION | BIGRAM
age 5 alton john live | altonjohn for kids | altonjohn
age 5 alton john live | show age5 mustgo kids | age5
show must go on | show age5 mustgo kids | mustgo

A slight variation of the previous query should do this easily:
WITH RECURSIVE cte AS (
    SELECT TITLE,
           LENGTH(TITLE)-LENGTH(REPLACE(TITLE,' ','')) AS num_bigrams,
           SUBSTRING_INDEX(
               SUBSTRING_INDEX(TITLE, ' ',
                               LENGTH(TITLE)-LENGTH(REPLACE(TITLE,' ',''))+1
               ), ' ', -2) AS bigram
    FROM t1
    UNION ALL
    SELECT TITLE,
           num_bigrams - 1 AS num_bigrams,
           SUBSTRING_INDEX(SUBSTRING_INDEX(TITLE, ' ', num_bigrams), ' ', -2)
    FROM cte
    WHERE num_bigrams > 1
)
SELECT TITLE, DESCRIPTION, bigram
FROM cte
INNER JOIN t2
        ON t2.DESCRIPTION REGEXP CONCAT('( |^)', cte.bigram, '( |$)')
Differences are:
Using -2 in the SUBSTRING_INDEX function to recover the last two words instead of the last one, in both the base and the recursive step of the recursion (you can generalize this to trigrams and beyond).
Ending the recursion one step earlier, because a bigram takes two words at a time; hence the ending condition is WHERE num_bigrams > 1 instead of WHERE num_bigrams > 0.
Check the demo here.
Note: if you want to remove the middle space from the bigram, as in the expected output, wrap cte.bigram in REPLACE(cte.bigram, ' ', '') both in the join condition and in the select list.
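To check the logic outside the database, here is a minimal Python sketch of the same idea. It is not part of the original answer: `substring_index`, `bigrams` and `matches` are hypothetical helper names, `substring_index` only imitates MySQL's SUBSTRING_INDEX for this use, and the middle space is removed before matching, as the note above suggests.

```python
import re

def substring_index(text, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX (assumes a non-empty delimiter)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def bigrams(title):
    """Walk from the last bigram to the first, like the recursive CTE.
    Unlike the CTE's base step, a single-word title yields no bigram here."""
    n = title.count(" ")  # number of bigrams = number of spaces
    result = []
    while n >= 1:
        first_words = substring_index(title, " ", n + 1)
        result.append(substring_index(first_words, " ", -2))
        n -= 1
    return result

def matches(titles, descriptions):
    """Return (title, description, bigram) rows where the space-free bigram
    appears as a whole word in the description."""
    rows = []
    for title in titles:
        for bigram in bigrams(title):
            joined = bigram.replace(" ", "")
            pattern = r"( |^)" + re.escape(joined) + r"( |$)"
            for desc in descriptions:
                if re.search(pattern, desc):
                    rows.append((title, desc, joined))
    return rows

titles = ["age 5 alton john live", "show must go on"]
descriptions = ["altonjohn for kids", "show age5 mustgo kids"]
for row in matches(titles, descriptions):
    print(row)
```

With the sample data from the question this prints the three rows of the expected output.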

Related

counting comma separated values mysql-postgre

I have a table called "feedback" with one column called "emotions". In that column we can see values of varying content and length, like
emotions
sad, happy
happy, angry, boring
boring
sad, happy, boring, laugh
etc., with different values and different lengths.
So the question is: what query would return the following from the MySQL or Postgres data?
emotion | count
happy | 3
angry | 1
sad | 2
boring | 3
laugh | 1
Based on "SQL: Count of items in comma-separated column in a table", we could try using
SELECT value as [Holiday], COUNT(*) AS [Count]
FROM OhLog
CROSS APPLY STRING_SPLIT([Holidays], ',')
GROUP BY value
but that won't help because it is for SQL Server, not MySQL or Postgres. Does anyone have an idea how to translate this SQL Server query to MySQL?
Thank you so much, I really appreciate it.
Using Postgres:
create table emotions(id integer, emotions varchar);
insert into emotions values (1, 'sad, happy');
insert into emotions values (2, 'happy, angry, boring');
insert into emotions values (3, 'boring');
insert into emotions values (4, 'sad, happy, boring, laugh');
select
emotion, count(*)
from
(select
trim(regexp_split_to_table(emotions, ',')) as emotion
from emotions) as t
group by
emotion;
emotion | count
---------+-------
happy | 3
sad | 2
boring | 3
laugh | 1
angry | 1
regexp_split_to_table (see the String functions documentation) splits the string on ',' and returns the individual elements as rows. Since there are spaces between the ',' and the word, use trim to get rid of them. This generates a 'table' that is used as a sub-query. The outer query groups by the emotion field and counts the rows.
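For comparison, the same split, trim and count steps can be sketched in a few lines of Python. This is only an illustration of the logic, not a replacement for the query; `count_emotions` is a hypothetical name and the sample rows are the ones inserted above.

```python
from collections import Counter

def count_emotions(rows):
    """Split each row on ',', trim the pieces, and count every emotion."""
    counts = Counter()
    for row in rows:
        for item in row.split(","):
            emotion = item.strip()
            if emotion:  # ignore empty pieces such as a trailing comma
                counts[emotion] += 1
    return counts

rows = [
    "sad, happy",
    "happy, angry, boring",
    "boring",
    "sad, happy, boring, laugh",
]
for emotion, count in sorted(count_emotions(rows).items()):
    print(emotion, count)
```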
Try the following using MySQL 8.0:
WITH recursive numbers AS
(
select 1 as n
union all
select n + 1 from numbers where n < 100
)
,
Counts as (
select trim(substring_index(substring_index(emotions, ',', n),',',-1)) as emotions
from Emotions
join numbers
on char_length(emotions) - char_length(replace(emotions, ',', '')) >= n - 1
)
select emotions,count(emotions) as counts from Counts
group by emotions
order by emotions
See a demo from db-fiddle.
The recursive query generates the numbers 1 to 100, on the assumption that no string holds more than 100 sub-strings; change this limit as needed.
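The nested substring_index call is the part that tends to confuse people, so here is a small Python sketch of how it picks the n-th item. The helper `substring_index` is my own rough imitation of the MySQL function, and `nth_item` and `split_with_numbers` are hypothetical names used only for this illustration.

```python
def substring_index(text, delim, count):
    """Rough stand-in for MySQL SUBSTRING_INDEX (assumes a non-empty delimiter)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def nth_item(text, n):
    """Keep the first n items, then keep only the last of those, then trim."""
    return substring_index(substring_index(text, ",", n), ",", -1).strip()

def split_with_numbers(text, max_n=100):
    """Mimic the join against the numbers 1..max_n: n is kept only while
    the string has at least n - 1 commas."""
    commas = text.count(",")
    return [nth_item(text, n) for n in range(1, max_n + 1) if commas >= n - 1]

print(split_with_numbers("happy, angry, boring"))
```

If a string held more items than `max_n`, the extra items would be silently dropped, which is the same limitation the query has.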
This one also uses MySQL 8.0, and the query puts no fixed limit on the number of items per string. (Thanks to Ahmed for the intuition on the recursive clause.)
WITH RECURSIVE cte AS (
SELECT ( LENGTH(REGEXP_REPLACE(emotions, ' ?[A-z]+ ?', ''))+1) AS n, emotions AS subs
FROM feedback
UNION ALL
SELECT n-1 AS n, ( SUBSTRING_INDEX(subs, ', ', n-1) ) AS subs
FROM cte
HAVING n>0
)
SELECT SUBSTRING_INDEX(subs, ', ', -1) AS emotions, COUNT(subs) AS cnt
FROM cte
GROUP BY emotions

MySQL - How can I group music together when the names are similar?

I would like to return a single row when the names of several tracks are the same or similar, as in this case:
[image: music with similar names]
You can see that the names are the same apart from a suffix like " - JP Ver." or similar. I would like to group them in one row, with the first column counting the whole group.
My current query to return these rows is as follows:
select count(id) number, name, sec_to_time(floor(sum(duration) / 1000)) time
from track
where user_id = 'value'
group by name, duration
order by number desc, time desc;
I would like to get a result like this
Thank you for reading and responding! I wish you all a good day!
Try:
SELECT COUNT(name) no,
TRIM(SUBSTRING_INDEX(name, '-', 1)) namee
FROM track
GROUP BY namee
Example: https://onecompiler.com/mysql/3xt3bfev6
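The grouping key in this query is everything before the first '-', trimmed. A minimal Python sketch of that rule, with made-up track names and durations (the question's real data is only shown as an image), could look like this. `base_name` and `group_tracks` are hypothetical names.

```python
from collections import defaultdict

def base_name(name):
    """Everything before the first '-', trimmed, like
    TRIM(SUBSTRING_INDEX(name, '-', 1))."""
    return name.split("-", 1)[0].strip()

def group_tracks(tracks):
    """tracks: iterable of (name, duration_ms).
    Returns {base name: (number of plays, total duration in ms)}."""
    groups = defaultdict(lambda: [0, 0])
    for name, duration_ms in tracks:
        entry = groups[base_name(name)]
        entry[0] += 1
        entry[1] += duration_ms
    return {key: tuple(value) for key, value in groups.items()}

# Hypothetical sample rows, not taken from the question.
tracks = [
    ("Song A", 200000),
    ("Song A - JP Ver.", 210000),
    ("Song B", 180000),
]
print(group_tracks(tracks))
```

One caveat of this rule: a title that itself contains a hyphen (for example "Hi-Fi") is cut at that hyphen too, so it may be merged with unrelated tracks.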
Use GROUP_CONCAT
Here is a proof of concept script. You can add your other columns. I have grouped by the first 4 letters. You will probably want to use more.
CREATE TABLE track (
idd INT,
nam CHAR(50),
tim INT
);
INSERT INTO track VALUES (1,'Abba 1',5);
INSERT INTO track VALUES (2,'Abba 2',6);
INSERT INTO track VALUES (3,'Beta 1',12);
INSERT INTO track VALUES (4,'Beta 4',8);
SELECT
LEFT(nam,4) AS 'Group',
COUNT(idd) AS 'Number',
GROUP_CONCAT(DISTINCT idd ORDER BY idd ASC SEPARATOR ' & ') AS IDs,
GROUP_CONCAT(DISTINCT nam ORDER BY nam ASC SEPARATOR ', ') AS 'track names',
SUM(tim) AS 'total time'
FROM track
GROUP BY LEFT(nam,4);
DROP TABLE track;
Output
Group | Number | IDs | track names | total time
Abba | 2 | 1 & 2 | Abba 1, Abba 2 | 11
Beta | 2 | 3 & 4 | Beta 1, Beta 4 | 20

How to query for a phrase on SQL database of words?

I am using MySQL and I have an SQL database of songs with a table that consists of 8 columns of information on the words of a song. Each row represents a single word from the song's lyrics:
songSerial - the serial number of the song
songName - the song name
word - a single word from the song's lyrics
row_number - the number of the row that the word is found
word_position_in_row - the number of the word in the row alone
house_number - the number of the house the word belongs to
house_row - the number of the row in the house that the word is found in
word_number - the number of the word out of all the songs lyrics
example for a row: { 4 , The Scientist , secrets , 8 , 4 , 2 , 1 , 37 }
Now I want to query all the songs that contain a group of words, for instance all the songs that have the phrase "I Love You" in them. The words must be in that order and must not come from different rows or houses.
Here are scripts in my OneDrive for creating the database table and about 400 rows:
TwoTextScriptFilesAndTheirZip
Can anyone help ?
Thank you
One method is to use joins:
select sw1.*
from songwords sw1 join
songwords sw2
on sw2.songSerial = sw1.songSerial and
sw2.word_number = sw1.word_number + 1 join
songwords sw3
on sw3.songSerial = sw2.songSerial and
sw3.word_number = sw2.word_number + 1
where sw1.word = 'I' and sw2.word = 'love' and sw3.word = 'you';
Or, if you prefer:
where concat_ws(' ', sw1.word, sw2.word, sw3.word) = 'I love you'
This is worse from an optimization perspective (indexes using word do not help performance), but it is clear what the query is doing.
Searches of this type suggest using a full text index. The only caveat is that you will need to remove the stop word list and index all words, regardless of length. ("I" and "you" are typical examples of stop words.)
This is an expensive approach for a large table, but assuming word is not null, we could do something like this:
SET group_concat_max_len = 16777216 ;
SELECT t.songserial
, t.house_number
, t.row_number
FROM mytable t
GROUP
BY t.songserial
, t.house_number
, t.row_number
HAVING CONCAT(' ',GROUP_CONCAT(t.word ORDER BY t.word_position_in_row SEPARATOR ' '),' ')
LIKE CONCAT('% ','I love you',' %')
We would definitely want a suitable index available, e.g.
... ON `mytable` (`songserial`,`house_number`,`row_number`,`word`)
If one of the words in the phrase is infrequent, we might be able to optimize a bit with a search for that infrequent word first, and then get all of the words on the same row ...
SELECT t.songserial
, t.house_number
, t.row_number
FROM ( SELECT r.songserial
, r.house_number
, r.row_number
FROM mytable r
WHERE r.word = 'love'
GROUP
BY r.word
, r.songserial
, r.house_number
, r.row_number
) s
JOIN mytable t
ON t.songserial = s.songserial
AND t.house_number = s.house_number
AND t.row_number = s.row_number
GROUP
BY t.songserial
, t.house_number
, t.row_number
HAVING CONCAT(' ',GROUP_CONCAT(t.word ORDER BY t.word_position_in_row SEPARATOR ' '),' ')
LIKE CONCAT('% ','I love you',' %')
That inline view s would benefit from a covering index with word as the leading column
... ON `mytable` (`word`,`songserial`,`house_number`,`row_number`)
You look for these words and relative search positions: 1 = I, 2 = love, 3 = you. Let's compare them with two song lines:
            And   I     love, love, love  you
real pos:   1     2     3     4     5     6
search pos: -     1     2     2     2     3
diff:       -     1     1     2     3     3
            I     miss  you   and   I     love  you
real pos:   1     2     3     4     5     6     7
search pos: 1     -     3     -     1     2     3
diff:       0     -     0     -     4     4     4
If we look at the position deltas of the first line, we get 1 (twice), 2 (once), and 3 (twice).
For the second line we get deltas 0 (twice), and 4 (thrice).
So for the second song line we find a delta with as many matches as search words, for the first line not. The second line is a match.
And here is the query. I assume we have a temporary table search filled with the search words and relative positions for readability.
select distinct w.songserial, w.songname, w.house_number
from words w
join search s on s.word = w.word
group by
w.songserial, w.songname, w.row_number, w.house_number, w.house_row, -- song line
w.word_position_in_row - s.pos -- delta
having count(*) = (select count(*) from search);
This query is based on:
a song is identified by songserial + songname + house_number
a song line is identified by songserial + songname + row_number + house_number + house_row
This may be wrong; I don't know what house and house number mean in reference to a song. But that'll be easy to adjust.
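The delta idea is easy to try out in plain Python before running it on the table. The sketch below is only an illustration: `line_contains_phrase` is a hypothetical name, each song line is passed as a list of words without punctuation, and the comparison is case-insensitive on the assumption that the column uses a case-insensitive collation.

```python
from collections import Counter

def line_contains_phrase(line_words, phrase_words):
    """Count, per delta (real position - search position), how many search
    words match. The phrase is present if some delta has one match per
    search word."""
    search = [(word.lower(), pos) for pos, word in enumerate(phrase_words, start=1)]
    deltas = Counter()
    for real_pos, word in enumerate(line_words, start=1):
        for search_word, search_pos in search:
            if word.lower() == search_word:
                deltas[real_pos - search_pos] += 1
    return any(count == len(search) for count in deltas.values())

phrase = ["I", "love", "you"]
line1 = ["And", "I", "love", "love", "love", "you"]
line2 = ["I", "miss", "you", "and", "I", "love", "you"]
print(line_contains_phrase(line1, phrase))  # deltas 1, 2 and 3, none reached 3 times
print(line_contains_phrase(line2, phrase))  # delta 4 is reached 3 times
```

The two sample lines are the ones from the walkthrough above: the first is rejected and the second is accepted.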

Dedup rows of a mirrored column using SQL

In MySQL, assuming I have a table with First Name and Last Name,
FName - LName
John - Paul
Paul - John
Alice - Peter
Peter - Alice
As you can see, every row has a duplicate entry with the names reversed.
I would like to select the rows in such a way that only one of the rows is selected for each unique entry (Doesn't matter which one).
My resulting table should be like:
FName - LName
John - Paul
Peter - Alice
There is more than one correct result, but I hope you got the point.
Thanks in advance!
SELECT DISTINCT
least(fName, lName) fName,
greatest (FName, lName) lName
FROM table
This will do it. In each returned row, the name that sorts first in the collation ends up in fName and the other one in lName.
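The same normalise-then-deduplicate idea, sketched in Python for illustration. `dedupe_mirrored` is a hypothetical name, and Python compares strings by code point, which may order names differently from the table's collation.

```python
def dedupe_mirrored(pairs):
    """Put each pair into a canonical order (smaller value first), then let
    a set drop the duplicates, like SELECT DISTINCT LEAST(...), GREATEST(...)."""
    return sorted({(min(a, b), max(a, b)) for a, b in pairs})

pairs = [
    ("John", "Paul"),
    ("Paul", "John"),
    ("Alice", "Peter"),
    ("Peter", "Alice"),
]
print(dedupe_mirrored(pairs))
```

As with the query, the surviving row is not necessarily one of the original rows as stored; it is the pair in sorted order.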
Try the following, assuming there are always 2 duplicates, no more, no less:
This assumes your table has one column with the 2 values separated by a hyphen.
Fiddle: http://sqlfiddle.com/#!2/c04fae/2/0
select
min(col_lr) as de_duplicated
from
(
select
x.col as col_lr,
count(y.col) + count(z.col) as grp
from
tbl x
left join tbl y on x.col < y.col
left join tbl z on concat(right(x.col, length(x.col) - locate(' - ', x.col) - 2), ' - ', substr(x.col, 1, locate(' - ', x.col) - 1)) < z.col
group by
x.col
) x
group by
grp
It establishes a composite rank in ABC order (for both left to right, and right to left) within the table, giving that group value the same for both duplicated rows, at which point you can just select the first of the two.

SQL query - tagging system, delimited string + unique values

I am working for a client that stores item tags in the MySQL DB like so (I know, I know - not ideal):
coats_and_jackets-Woven_Jacket-brand:Hobbs;
coats_and_jackets-Woven_Jacket-color:Black;
coats_and_jackets-Woven_Jacket-style:Boucle;
coats_and_jackets-Woven_Jacket-pattern:Plain;
dresses-Pinafore-brand:COS;
dresses-Pinafore-color:Blue _ Navy;
dresses-Pinafore-style:Wool;
dresses-Pinafore-pattern:Plain;
shoes-Ankle_Boot-brand:Topshop;
shoes-Ankle_Boot-color:Black;
shoes-Ankle_Boot-style:Leather;
shoes-Ankle_Boot-pattern:Plain;
bags-Tote-brand:Mulberry;
bags-Tote-color:Brown _ Tan;
bags-Tote-style:Leather;
bags-Tote-pattern:Plain;
shoes-Ballet_shoes-brand:Chanel;
shoes-Ballet_shoes-color:Black;
shoes-Ballet_shoes-style:Leather;
shoes-Ballet_shoes-pattern:Plain;
accessories-Scarf-brand:Zara;
accessories-Scarf-color:Brown _ Tan;
accessories-Scarf-style:Wool;
accessories-Scarf-pattern:Checked;
Each tag is broken down into 4 parts like so: category-type-brand, category-type-color, category-type-style, category-type-pattern
Not all 4 parts of a tag are required and can be omitted from the DB.
I have been tasked with finding out how many tags an item has, so in this example 6 tags have been used, each with all 4 parts.
The query I have so far counts all the tag parts, in this example 24, but I cannot assume that each tag will have all 4 parts stored. So cannot divide the parts amount by 4 to get the amount of tags.
In this example, the 6 tags used are as follows:
Coats & Jackets (Woven Jacket)
Dresses (Pinafore)
Shoes (Ankle boot)
Bags (Tote)
Shoes (Ballet Shoes)
Accessories (Scarf)
Now I'm not concerned about the category, type or parts (brand, color, style, pattern) - I'm just concerned about fetching the total amount of tags for this item.
Also, the data example above would be stored in a db row that looks like:
+----------+-------------+----------------------------+
| ID | meta_key | meta_value |
+----------+-------------+----------------------------+
| 1 | tags | coats_and_jackets-wove... |
+----------+-------------+----------------------------+
| 2 | item_desc | Fashion editor |
+----------+-------------+----------------------------+
Help structuring this query would be much appreciated.
The tags use hyphen as a separator. Here is a method for finding the number of tags used by a given item:
select it.*, length(it.tags) - length(replace(it.tags, '-', ''))+1
from itemtags it
This replaces the hyphen with an empty string, and measures the difference in lengths.
Assuming I'm understanding your requirement correctly, how about something like this (with CTE used to demonstrate assumed table structure)
WITH CTE1(tag) AS(
select 'coats_and_jackets-Woven_Jacket-brand:Hobbs' union
-- ...
select 'accessories-Scarf-color:Brown _ Tan' union
select 'accessories-Scarf-style:Wool' union
select 'accessories-Scarf-pattern:Checked'
)
, CTE2(tag_prefix) AS(
select LEFT(tag, CHARINDEX('-', tag, CHARINDEX('-', tag) + 1) - 1) from CTE1
)
select tag_prefix, COUNT(*) from CTE2 group by tag_prefix
This will give you results of...
accessories-Scarf 4
bags-Tote 4
coats_and_jackets-Woven_Jacket 4
dresses-Pinafore 4
shoes-Ankle_Boot 4
shoes-Ballet_shoes 4
... which gives you the tag prefix and number of parts used. From there you can count the individual rows or sum the number of parts or whatever else you need...
I've just realised that my solution is completely pointless given that I missed the 'mysql' tag ;) but I'll post it up here anyway. Hopefully it can give you a pointer on how to proceed.
WITH CTE1(ID, meta_key, meta_value) AS(
select 1, 'tags', 'coats_and_jackets-Wo...' union all
select 2, 'item_desc', 'Fashion editor'
)
, TagsCTE AS(
select t.ID, x.Item as tag_and_value
from CTE1 t
cross apply dbo.fn_SplitString(t.meta_value, ';') x
where meta_key = 'tags' and LEN(x.Item) > 0
)
select ID, COUNT(parts_count) from (
select ID, COUNT(*) as parts_count
from TagsCTE
group by ID, LEFT(tag_and_value, CHARINDEX('-', tag_and_value, CHARINDEX('-', tag_and_value) + 1) - 1)
) a group by ID
This gives results of:
1 6
Good luck.
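Since the last two queries are SQL Server only, here is a rough Python sketch of the same counting rule, in case it helps to verify results on the application side. It assumes the tag parts in meta_value are separated by ';' and that the first two hyphen-separated pieces (category and type) identify a tag. `count_tags` is a hypothetical name and the sample string is a shortened version of the data in the question.

```python
def count_tags(meta_value):
    """Count distinct category-type prefixes among the ';'-separated parts."""
    prefixes = set()
    for part in meta_value.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the final ';'
        pieces = part.split("-", 2)  # category, type, "key:value"
        prefixes.add("-".join(pieces[:2]))
    return len(prefixes)

# Shortened sample: some tags have fewer than 4 parts stored.
meta_value = (
    "coats_and_jackets-Woven_Jacket-brand:Hobbs;"
    "coats_and_jackets-Woven_Jacket-color:Black;"
    "dresses-Pinafore-brand:COS;"
    "shoes-Ankle_Boot-style:Leather;"
    "shoes-Ballet_shoes-brand:Chanel;"
)
print(count_tags(meta_value))
```

Because it counts distinct prefixes rather than dividing the number of parts by 4, it does not matter how many parts each tag has.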