Counting most common word in SQL table with exclusions - mysql

I'm trying to count the most common words from a table full of text (strings) in a MySQL database (using MySQL Workbench). I got this code working from reading another post (written by Kickstart).
This code uses a separate table called integers with 10 rows, holding the values 0 to 9, for counting.
Table schema for the main table. I'm mainly interested in the data in the "Text" column.
'Id', 'int(11)', 'NO', 'PRI', '0', ''
'PostId', 'int(11)', 'YES', 'MUL', NULL, ''
'Score', 'int(11)', 'YES', 'MUL', NULL, ''
'Text', 'varchar(4000)', 'YES', '', NULL, ''
'CreationDate', 'varchar(25)', 'YES', '', NULL, ''
'UserId', 'int(11)', 'YES', 'MUL', NULL, ''
'UserDisplayName', 'varchar(255)', 'YES', '', NULL, ''
SQL query:
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM table
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(Text) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
It lists out the top 10 words, but they are full of short words like a, the, you, me... etc.
How can I change it to skip certain words like those?
How can I make it so that say, only words 5 characters and up are counted?
Schema of integers table
'i', 'int(11)', 'NO', 'PRI', NULL, ''
Original post and code taken from this post. I am new and couldn't post anything on it so I had to ask here.
determining most used set of words php mysql
Thank you so much for your help!

You should be able to just add another condition to your WHERE clause:
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM table
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(Text) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != '' AND
LENGTH(Sub2.aWord) >= 5
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
This just checks whether aWord is at least 5 characters long and, if so, includes it in the result set. The LIMIT is applied after the filtering, so you should have what you need.
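The question also asks how to skip specific words. One possible sketch, assuming a hand-written stop-word list is acceptable, adds a NOT IN condition next to the length check. The table name `comments` and the listed words are hypothetical placeholders, not taken from the post. CHAR_LENGTH is used here because LENGTH counts bytes rather than characters, which matters for multi-byte text.

```sql
-- Sketch only: `comments` is a hypothetical name for the main table,
-- and the stop-word list is illustrative.
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (
    -- Pick out the aCnt-th space-separated word of each Text value.
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(Text, ' '), ' ', aCnt), ' ', -1) AS aWord
    FROM comments
    CROSS JOIN (
        -- Generates the numbers 1..1000 from the 0..9 helper table.
        SELECT a.i + b.i * 10 + c.i * 100 + 1 AS aCnt
        FROM integers a, integers b, integers c
    ) Sub1
    WHERE (LENGTH(Text) + 1 - LENGTH(REPLACE(Text, ' ', ''))) >= aCnt
) Sub2
WHERE CHAR_LENGTH(Sub2.aWord) >= 5            -- also filters out empty strings
  AND LOWER(Sub2.aWord) NOT IN ('about', 'there', 'their', 'would', 'which')
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10;
```

If the list grows long, keeping the excluded words in their own table and filtering with NOT EXISTS or a LEFT JOIN may be easier to maintain.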

Related

MySQL: Why would a query run faster with literal conditions compared to variables

Not sure whether the actual query matters, but I have a MySQL stored procedure where I commented out everything except the following query...
INSERT INTO temp_attribution (`attribute_type`, `domain`, `id`, `name`, `score`, `rank`, `partner_match`, `person_match`, `sponsor_match`, `date_match`)
SELECT 'Campaign' AS attribute_type, domain, id, name, score, (@proc_counter := @proc_counter + 1) AS rank,
partner_match, person_match, sponsor_match, date_match
FROM (
SELECT m_c.domain, m_c.campaign_id AS id, m_c.name, m_c.client_id, m_c.sent_date,
proc_sponsors AS invoice_sponsor, bs.sponsor AS campaign_sponsor,
proc_email AS invoice_email, aes_decrypt(m_r.email, in_encrypt_key) as campaign_email,
if (m_c.client_id = proc_client_id COLLATE latin1_general_ci, 'Yes', 'No') AS partner_match,
if (aes_encrypt(proc_email, in_encrypt_key) = m_r.email, 'Exact Email', 'Email Domain') AS person_match,
if (LOCATE(CONVERT(bs.sponsor USING utf8mb4), proc_sponsors) > 0, 'Sponsor',
if (CONVERT(bs.vendor USING utf8mb4) = proc_vendor, 'Vendor', 'No') ) AS sponsor_match,
if (datediff(proc_invoice_date, m_c.sent_date) BETWEEN 0 AND 92, 'Within Three', 'Within Six') AS date_match,
(
if (m_c.client_id = proc_client_id COLLATE latin1_general_ci, 45, 10) + 30 +
if (LOCATE(CONVERT(bs.sponsor USING utf8mb4), proc_sponsors) > 0, 10,
if (CONVERT(bs.vendor USING utf8mb4) = proc_vendor, 5, 0) ) +
if (datediff(proc_invoice_date, m_c.sent_date) BETWEEN 0 AND 92, 15, 5)
) AS score
FROM campaign_table m_c
INNER JOIN recipient_table m_r ON m_c.domain = m_r.domain AND m_c.campaign_id = m_r.campaign_id
LEFT JOIN booking_sponsor bs ON m_c.domain = bs.domain AND m_c.campaign_id = bs.campaign_id
WHERE datediff(proc_invoice_date, m_c.sent_date) BETWEEN 0 AND 185
AND ( aes_encrypt(proc_email, in_encrypt_key) = m_r.email OR m_r.email_domain = proc_email_domain )
) T ORDER BY score DESC, sent_date DESC LIMIT 5;
The fields starting with 'proc_' are variables declared at the beginning of the procedure. Initialising them takes only 0.385 seconds, whereas the entire proc takes 15 seconds.
In a separate query window, I copied the relevant query and replaced the variables starting with 'proc_' with literals to test speed and optimise, like so...
INSERT INTO temp_attribution (`attribute_type`, `domain`, `id`, `name`, `score`, `rank`, `partner_match`, `person_match`, `sponsor_match`, `date_match`)
SELECT 'Campaign' AS attribute_type, domain, id, name, score, (@proc_counter := @proc_counter + 1) AS rank,
partner_match, person_match, sponsor_match, date_match
FROM (
SELECT m_c.domain, m_c.campaign_id AS id, m_c.name, m_c.client_id, m_c.sent_date,
'VENDOR SPONSOR VALUE' AS invoice_sponsor, bs.sponsor AS campaign_sponsor,
'johnsmith@domain.com' AS invoice_email, aes_encrypt('johnsmith@domain.com', 'secret_key') as campaign_email,
if (m_c.client_id = m_c.client_id COLLATE latin1_general_ci, 'Yes', 'No') AS partner_match,
if (aes_encrypt('johnsmith@domain.com', 'secret_key'), 'Exact Email', 'Email Domain') AS person_match,
if (LOCATE(CONVERT(bs.sponsor USING utf8mb4), 'VENDOR SPONSOR VALUE') > 0, 'Sponsor',
if (CONVERT(bs.vendor USING utf8mb4) = 'VENDOR', 'Vendor', 'No') ) AS sponsor_match,
if (datediff('2016-10-14', m_c.sent_date) BETWEEN 0 AND 92, 'Within Three', 'Within Six') AS date_match,
(
if (m_c.client_id = m_c.client_id COLLATE latin1_general_ci, 45, 10) + 30 +
if (LOCATE(CONVERT(bs.sponsor USING utf8mb4), 'VENDOR SPONSOR VALUE') > 0, 10,
if (CONVERT(bs.vendor USING utf8mb4) = 'VENDOR', 5, 0) ) +
if (datediff('2016-10-14', m_c.sent_date) BETWEEN 0 AND 92, 15, 5)
) AS score
FROM campaign_table m_c
INNER JOIN recipient_table m_r ON m_c.domain = m_r.domain AND m_c.campaign_id = m_r.campaign_id
LEFT JOIN booking_sponsor bs ON m_c.domain = bs.domain AND m_c.campaign_id = bs.campaign_id
WHERE datediff('2016-10-14', m_c.sent_date) BETWEEN 0 AND 185
AND ( aes_encrypt('johnsmith@domain.com', 'secret_key') = m_r.email OR m_r.email_domain = 'domain.com' )
) T ORDER BY score DESC, sent_date DESC LIMIT 5;
Now, without any other change, the query runs within two seconds. How is that possible?
Figured it out. Some of the declared variables had a different type from the column they were compared against, so I guess MySQL could not compare them in the most efficient way possible.

Flattening a Table in prep for Json

Maybe it's that I'm tired but this is escaping me.
Let's say that I want to flatten this table:
a_id a_val b_id b_val c_id c_val d_id d_val
1 a 10 b 100 c 1000 f
1 a 20 d 200 g null null
2 e 30 h 300 i null null
2 j 40 k null null null null
3 l null null null null null null
Into this query result:
id mystring
1, (1:a,10:b,100:c,1000:f),(1:a,20:d,200:g)
2, (2:e,30:h,300:i),(2:j,40:k)
3, (3:l)
The table only renders four levels deep (a, b, c, d) so no dynamic sql issue.
Now I'd usually just use GROUP_CONCAT(CONCAT(...)) but that won't work with the NULLs present. Maybe using COALESCE somehow will solve this but... I feel pretty stupid at the moment... and I can't figure it out.
Unfortunately I can't use the MySQL JSON functions on this installation, so I need to construct the data myself. Thanks.
The solution here will probably just be a combination of clever concatenation and IFNULL calls. My shot in the dark:
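SELECT a_id, GROUP_CONCAT(CONCAT('(',
a_id, ':', a_val,
IF(b_id IS NOT NULL, ',', ''), IFNULL(b_id, ''), IF(b_id IS NOT NULL, ':', ''), IFNULL(b_val, ''),
IF(c_id IS NOT NULL, ',', ''), IFNULL(c_id, ''), IF(c_id IS NOT NULL, ':', ''), IFNULL(c_val, ''),
IF(d_id IS NOT NULL, ',', ''), IFNULL(d_id, ''), IF(d_id IS NOT NULL, ':', ''), IFNULL(d_val, ''),
')'
) SEPARATOR ',') AS mystring
FROM table
GROUP BY a_id;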
select a_id as id,
group_concat(concat(
case isnull(a_id) when true then '' else '(' end,
coalesce(a_id, ''),
case isnull(a_id) when true then '' else ':' end,
coalesce(a_val, ''),
case isnull(b_id) when true then '' else ',' end,
coalesce(b_id, ''),
case isnull(b_id) when true then '' else ':' end,
coalesce(b_val, ''),
case isnull(c_id) when true then '' else ',' end,
coalesce(c_id, ''),
case isnull(c_id) when true then '' else ':' end,
coalesce(c_val, ''),
case isnull(d_id) when true then '' else ',' end,
coalesce(d_id, ''),
case isnull(d_id) when true then '' else ':' end,
coalesce(d_val,''),
case isnull(a_id) when true then '' else ')' end
) separator ',')
from table
group by a_id;
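As an alternative to the IFNULL and CASE chains above, here is a shorter sketch that leans on two documented behaviours: CONCAT returns NULL if any argument is NULL, and CONCAT_WS skips NULL arguments. The table name `flat_rows` is a hypothetical stand-in for the real table, and the ORDER BY inside GROUP_CONCAT is an assumption about the desired row order within each group.

```sql
-- Sketch only: `flat_rows` is a hypothetical table name.
SELECT a_id AS id,
       GROUP_CONCAT(
           CONCAT('(',
                  CONCAT_WS(',',
                            CONCAT(a_id, ':', a_val),   -- NULL when either part is NULL
                            CONCAT(b_id, ':', b_val),
                            CONCAT(c_id, ':', c_val),
                            CONCAT(d_id, ':', d_val)),  -- NULL pairs are skipped by CONCAT_WS
                  ')')
           ORDER BY b_id
           SEPARATOR ',') AS mystring
FROM flat_rows
GROUP BY a_id;
```

Because each id/value pair collapses to NULL as a whole, no separate check is needed for the colon or the leading comma.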

Removing commas in varchar to get number and test

I have a serious problem with varchar in mysql.
This is my query :
SELECT ak.`address`, ak.lien1_amount, ak.`comp`, ak.`zestimate`, ak.`counvalue`, IFNULL(COALESCE(ak.comp, ak.zestimate, ak.counvalue), 0) AS cmazestcv, CEIL( CAST(ak.lien1_amount AS DECIMAL(10,5)) ) AS testnum
FROM allinformationk AS ak
LEFT JOIN `home_buyers_alias1` ON ak.`house id` = `home_buyers_alias1`.`house id`
WHERE ak.`is_deleted` = 'no' AND ( (CASE WHEN ak.`sale date 4` IS NOT NULL THEN ak.`sale date 4` WHEN ak.`sale date 3` IS NOT NULL THEN ak.`sale date 3` WHEN ak.`sale date 2` IS NOT NULL THEN ak.`sale date 2` ELSE ak.`sale date` END) IS NOT NULL)
AND ( CAST(IFNULL(COALESCE(ak.comp, ak.zestimate, ak.counvalue), 0) AS DECIMAL(10,5)) > CAST(ak.lien1_amount AS DECIMAL(10,5)) )
AND ak.lien1_amount IS NOT NULL
LIMIT 0, 10;
This is the result of my query :
The fields lien1_amount, comp, zestimate, and counvalue are VARCHAR, which is why I am casting them to DECIMAL in my query. But I still can't test them as numbers: you can see in the testnum column what lien1_amount gives when I try to convert it to decimal.
How can I convert those VARCHAR values with commas to amounts of money and compare them?
I have been stuck on this for two days now.
If those columns are VARCHAR and always hold numeric values, strip the commas with REPLACE and use the result. In a numeric context such as CEIL(), MySQL converts the string implicitly. When both sides of a comparison are strings, however, they are compared as strings, so cast both sides of the WHERE comparison to DECIMAL.
If the columns comp, zestimate, and counvalue also contain commas, apply REPLACE to them too.
SELECT ak.`address`
, CEIL( replace( ifnull( ak.lien1_amount, 0 ), ',', '' ) ) as lien1_amount
, replace( ifnull( ak.`comp`, 0 ), ',', '' ) as `comp`
, replace( ifnull( ak.`zestimate`, 0 ), ',', '' ) as `zestimate`
, replace( ifnull( ak.`counvalue`, 0 ), ',', '' ) as `counvalue`
, replace( ifnull( COALESCE( ak.comp, ak.zestimate, ak.counvalue )
, 0 ), ',', '' ) AS cmazestcv
FROM allinformationk AS ak
LEFT JOIN `home_buyers_alias1`
ON ak.`house id` = `home_buyers_alias1`.`house id`
WHERE ak.`is_deleted` = 'no'
AND ak.lien1_amount IS NOT NULL
AND COALESCE( ak.`sale date 4`
, ak.`sale date 3`
, ak.`sale date 2`
, ak.`sale date` ) IS NOT NULL
AND CAST( replace(
IFNULL( COALESCE( ak.comp, ak.zestimate, ak.counvalue ), 0 ), ',', '' ) AS DECIMAL(15,2) )
> CAST( replace( ak.lien1_amount, ',', '' ) AS DECIMAL(15,2) )
LIMIT 0, 10;
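To see why the commas matter, the following standalone statements can be run in any MySQL session. They use made-up literal values rather than the real columns, and DECIMAL(15,2) is an assumed precision chosen to be wide enough for typical amounts.

```sql
-- Two strings are compared character by character, so this returns 1 (true)
-- even though 9000 is less than 10000:
SELECT '9,000' > '10,000' AS string_compare;

-- CAST stops reading at the first comma, so this yields 150.00 (with a warning):
SELECT CAST('150,000' AS DECIMAL(15,2)) AS cast_with_comma;

-- Stripping the commas first gives the intended number, 150000.00:
SELECT CAST(REPLACE('150,000', ',', '') AS DECIMAL(15,2)) AS cast_clean;

-- With both sides cast, the comparison is numeric and returns 0 (false):
SELECT CAST(REPLACE('9,000', ',', '') AS DECIMAL(15,2))
     > CAST(REPLACE('10,000', ',', '') AS DECIMAL(15,2)) AS numeric_compare;
```

Longer term, storing the amounts in a DECIMAL column would avoid the per-row REPLACE and CAST.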

MySQL group by - union all

I have this query for a report:
select @t := '' as 'Clave', @tf:='Inventario Físico' as 'Descripción', @t:= '' as 'Cantidad', @t:= '' as 'Precio Unitario' union all
select @t:= '', @t:= '', @t:= '', @t:= '' union all
(select cla, des, can, CAST(pl1*can as Decimal(10,2)) from inventario order by cla) union all
select @t:= '', @t:='', @tnde := 'Número de Elementos: ', count(*) from inventario union all
select @t:= '', @t:= '', @tne:= 'Suma total: $', sum(ppu) from inventario;
I need an "order by" for the 3rd query.
select cla, des, can, CAST(pl1*can as Decimal(10,2)) from inventario order by cla
By itself, that line of code works perfectly, but when it sits between the unions, the rows are not ordered. How can I solve this? Thanks.
union all does not guarantee that the data is in the order specified by the subqueries. You need to do an explicit order by to get that result.
This approach adds an ordering column to keep the groups together. The final order by clause first orders by ordering and then by the column used for ordering the third subquery:
(select @t := '' as Clave, @tf:='Inventario Físico' as Descripción,
@t:= '' as "Cantidad", @t:= '' as "Precio Unitario", 0 as ordering
) union all
(select @t:= '', @t:= '', @t:= '', @t:= '', 1 as ordering) union all
(select cla, des, can, CAST(pl1*can as Decimal(10,2)), 2 from inventario) union all
(select @t:= '', @t:='', @tnde := 'Número de Elementos: ', count(*), 3 from inventario) union all
(select @t:= '', @t:= '', @tne:= 'Suma total: $', sum(ppu), 4 from inventario)
order by ordering, clave;
I also changed the single quotes on the column aliases to double quotes. I think it is good practice to only use single quotes for string constants.
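If the extra ordering column should not appear in the report, one possible variation wraps the union in a derived table and selects only the display columns. This sketch also drops the user-variable assignments and uses plain string literals instead, which is a simplification on my part and not part of the answer above. The alias `report` is a hypothetical name.

```sql
-- Sketch only: `report` is a hypothetical alias for the derived table.
SELECT Clave, `Descripción`, Cantidad, `Precio Unitario`
FROM (
    SELECT '' AS Clave, 'Inventario Físico' AS `Descripción`,
           '' AS Cantidad, '' AS `Precio Unitario`, 0 AS ordering
    UNION ALL
    SELECT '', '', '', '', 1
    UNION ALL
    SELECT cla, des, can, CAST(pl1 * can AS DECIMAL(10,2)), 2 FROM inventario
    UNION ALL
    SELECT '', '', 'Número de Elementos: ', COUNT(*), 3 FROM inventario
    UNION ALL
    SELECT '', '', 'Suma total: $', SUM(ppu), 4 FROM inventario
) AS report
-- ordering keeps the five blocks in sequence; Clave sorts the inventory rows.
ORDER BY ordering, Clave;
```

The outer ORDER BY may reference `ordering` even though it is not in the outer select list, so the column is used for sorting without being displayed.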

Multiple IF statements on MYSQL

I'm trying to display some values in my database result. I am using this code but cannot get it to work:
SELECT
item_code,
IF(category_code = 'HERR1', 'NO', 1) OR (category_code = 'COLN5', 'NO', 2) AS category_code,
item_name,
item_quantity
FROM qa_items
EDIT :
I Want to display for example:
If category_code = 'HERR1'
Display = 1
else if category_code = 'COLN5'
Display = 2
End If
If anyone has any idea, would greatly appreciate it
I'd rather use CASE :
SELECT item_code,
CASE category_code
WHEN 'HERR1' THEN 1
WHEN 'COLN5' THEN 2
ELSE 'NO'
END as category_code, item_name, item_quantity
FROM qa_items
But IF will also work : IF(category_code='HERR1',1, IF(category_code='COLN5',2,'NO'))
You need to nest the IF calls:
SELECT item_code, IF(category_code = 'HERR1', 1, IF(category_code = 'COLN5', 2, 'NO')) AS category_code, item_name, item_quantity FROM qa_items
When the first condition is false, the nested IF is evaluated.
Is this what you were after?
SELECT
item_code,
CASE category_code
WHEN 'HERR1' THEN 1
WHEN 'COLN5' THEN 2
ELSE 'NO'
END AS category_code,
item_name,
item_quantity
FROM qa_items
Try the following
SELECT item_code, CASE category_code WHEN 'HERR1' THEN 1 WHEN 'COLN5' THEN 0 ELSE 'NONE' END AS category_code, item_name, item_quantity FROM qa_items
You can try this.
Use IF in select query and update the table you want ;)
create table student(marks int,grade char);
insert into student values(200,null),(120,null),
(130,null);
UPDATE student a
INNER JOIN (select s.marks, IF(s.marks>=200,'A',IF(s.marks>=130,'B','P')) AS Grade from student s) b on a.marks= b.marks
SET a.Grade = b.Grade;
SELECT MOBILE,
CASE (SUBSTRING(mobile, LENGTH(MOBILE), 1)) WHEN ','
THEN SUBSTRING(mobile, 1, LENGTH(MOBILE) - 1)
ELSE SUBSTRING(mobile, 1, LENGTH(MOBILE))
END AS newmobile
FROM (SELECT CONCAT(IFNULL(`mob1`, ''), IF(`mob1` IS NULL, '', ','),
IFNULL(`mob2`, ''), IF(`mob2` IS NULL, '', ','),
IFNULL(`mob3`, ''), IF(`mob3` IS NULL, '', ','),
IFNULL(`mob4`, ''), IF(`mob4` IS NULL, '', ','),
IFNULL(`mob5`, ''), IF(`mob5` IS NULL, '', ','),
IFNULL(`mob6`, ''))
AS mobile
FROM `temp_consignordata`) AS T
SELECT item_code,
-- First if
IF(category_code = 'HERR1', 1,
-- second else IF
IF(category_code = 'COLN5', 2,
-- last else
'NO'))
AS category_code,
item_name,
item_quantity
FROM qa_items;
Explanation
The first IF checks for the value 'HERR1' and, if found, assigns 1.
The second (else) IF checks for the value 'COLN5' and, if found, assigns 2.
The last (else) default case assigns 'NO' to category_code.
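For anyone who wants to try the CASE form end to end, here is a small self-contained sketch. The column types and sample rows are assumptions made up for illustration, and the result column is named `category_display` (a hypothetical alias) to avoid shadowing the real column. Because the branches mix numbers with the string 'NO', MySQL returns the whole column as a string, so the sketch writes the numeric branches as strings explicitly.

```sql
-- Hypothetical sample data; column types are assumptions.
CREATE TABLE qa_items (
    item_code     VARCHAR(10),
    category_code VARCHAR(10),
    item_name     VARCHAR(50),
    item_quantity INT
);

INSERT INTO qa_items VALUES
    ('A1', 'HERR1', 'Hammer',  3),
    ('B2', 'COLN5', 'Cologne', 7),
    ('C3', 'OTHER', 'Misc',    1);

SELECT item_code,
       CASE category_code
           WHEN 'HERR1' THEN '1'
           WHEN 'COLN5' THEN '2'
           ELSE 'NO'
       END AS category_display,
       item_name,
       item_quantity
FROM qa_items;
-- category_display is '1' for A1, '2' for B2, and 'NO' for C3.
```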