Count the frequency of each word - mysql

I've been trawling the internet and realize that MySQL is not the best tool for this, but I'm asking anyway. What query, function or stored procedure has anyone seen or used that will get the frequency of each word across a text column?
Ex.
ID | comment
---|--------------------
1  | I love this burger
2  | I hate this burger
Expected output:
word   | count
-------|-------
burger | 2
I      | 2
this   | 2
love   | 1
hate   | 1

This solution seems to do the job (stolen almost verbatim from this page). It requires an auxiliary table, filled with sequential numbers from 1 to at least the largest number of words expected in a single comment. It is important to check that the auxiliary table is large enough; otherwise the results will be wrong, and no error will be shown.
SELECT
SUBSTRING_INDEX(SUBSTRING_INDEX(maintable.comment, ' ', auxiliary.id), ' ', -1) AS word,
COUNT(*) AS frequency
FROM maintable
JOIN auxiliary ON
LENGTH(comment)>0 AND SUBSTRING_INDEX(SUBSTRING_INDEX(comment, ' ', auxiliary.id), ' ', -1)
<> SUBSTRING_INDEX(SUBSTRING_INDEX(comment, ' ', auxiliary.id-1), ' ', -1)
GROUP BY word
HAVING word <> ' '
ORDER BY frequency DESC;
This approach is as inefficient as one can be, because it cannot use any index.
As an alternative, I would use a statistics table that I would keep up to date with triggers. Perhaps initialise the stats table with the query above.
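To see why the size of the auxiliary table matters, the nested SUBSTRING_INDEX logic can be traced outside the database. This is a minimal Python sketch, not MySQL itself: `substring_index` is a rough stand-in for the MySQL function, `word_frequencies` and `max_words` are hypothetical names, and collation (case-insensitive comparison) is ignored. Note that comparing word n with word n-1, as the ON clause does, also skips a word that is immediately repeated.

```python
from collections import Counter


def substring_index(text: str, delim: str, count: int) -> str:
    """Rough stand-in for MySQL's SUBSTRING_INDEX (non-empty delimiter only)."""
    if count == 0:
        return ""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    return delim.join(parts[count:])       # everything right of it, counted from the end


def word_frequencies(comments, max_words):
    """Mimic the join against an auxiliary table holding 1..max_words."""
    counts = Counter()
    for comment in comments:
        if not comment:
            continue
        for n in range(1, max_words + 1):
            word = substring_index(substring_index(comment, " ", n), " ", -1)
            previous = substring_index(substring_index(comment, " ", n - 1), " ", -1)
            # Same guard as the ON clause: once n runs past the last word,
            # word == previous and no row is produced.
            if word != previous and word.strip():
                counts[word] += 1
    return counts


comments = ["I love this burger", "I hate this burger"]
enough = word_frequencies(comments, max_words=10)
too_small = word_frequencies(comments, max_words=2)
print(enough.most_common())
print(too_small.most_common())  # silently misses "this" and "burger"
```

With only two numbers available, the third and fourth words of each comment are never reached, which is the silent failure described above.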

Something like this should work. Just make sure you don't pass in a 0 length string.
SET @searchString = 'burger';
SELECT
  ID,
  -- subtract first, then divide: characters removed / length of the search string
  (LENGTH(comment) - LENGTH(REPLACE(comment, @searchString, ''))) / LENGTH(@searchString) AS count
FROM MyTable;
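The arithmetic behind this count is: take the number of characters that disappear when the search string is removed, then divide by the length of the search string. A minimal Python sketch of that calculation follows; `count_occurrences` is a hypothetical helper, and it counts characters where MySQL's LENGTH counts bytes, so the two agree only for single-byte text.

```python
def count_occurrences(comment: str, search: str) -> int:
    """Count non-overlapping occurrences of `search` via the length difference."""
    if not search:
        # Mirrors the warning above: an empty search string would divide by zero.
        raise ValueError("search string must not be empty")
    removed = len(comment) - len(comment.replace(search, ""))
    return removed // len(search)


print(count_occurrences("I love this burger", "burger"))
print(count_occurrences("burger after burger", "burger"))
```

The subtraction has to happen before the division; dividing only the second length by the search length gives a meaningless number.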

Related

Is there a way to count the LIKE results per row in MySQL?

I have a MySQL table jobs like this:
ID | title | keywords
1 | UI Designer | HTML, CSS, Photoshop
2 | Web site Designer | PHP
3 | UI/UX Developer | CSS, HTML, JavaScript
and I have a query like this:
SELECT * FROM jobs
WHERE title LIKE '%UX%' OR title LIKE '%UI%' OR title LIKE '%Developer%' OR keywords LIKE '%HTML%' OR keywords LIKE '%CSS%'
I want to sort the results by how closely they match.
For example, the first row (ID 1) contains UI, HTML and CSS, so the number of matching LIKE conditions is 3. By the same calculation it is 0 for the second row and 5 for the third row.
I then want the result ordered by the number of matching LIKE conditions, like this:
Results
ID | title | keywords
3 | UI/UX Developer | CSS, HTML, JavaScript
1 | UI Designer | HTML, CSS, Photoshop
So, is there any way to count the number of matches per row in the query and sort the result as I describe?
You could sum the matching conditions in the ORDER BY clause:
SELECT *
FROM jobs
WHERE title LIKE '%UX%'
OR title LIKE '%UI%'
OR title LIKE '%Developer%'
OR keywords LIKE '%HTML%'
OR keywords LIKE '%CSS%'
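ORDER BY ((title LIKE '%UX%') + (title LIKE '%UI%') + (title LIKE '%Developer%') +
          (keywords LIKE '%HTML%') + (keywords LIKE '%CSS%')) DESC
Each LIKE comparison returns 1 or 0, so adding the results puts the rows with the most matches first. The parentheses are needed because + binds more tightly than LIKE.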
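The scoring idea can be checked against the sample rows from the question. This is a small Python sketch under stated assumptions: substring tests stand in for LIKE '%...%', matching is made case-insensitive to resemble MySQL's default collations, and `match_score` is a hypothetical name.

```python
jobs = [
    (1, "UI Designer", "HTML, CSS, Photoshop"),
    (2, "Web site Designer", "PHP"),
    (3, "UI/UX Developer", "CSS, HTML, JavaScript"),
]
title_terms = ["UX", "UI", "Developer"]
keyword_terms = ["HTML", "CSS"]


def match_score(title: str, keywords: str) -> int:
    """Number of search terms found, one point per satisfied condition."""
    title_hits = sum(term.lower() in title.lower() for term in title_terms)
    keyword_hits = sum(term.lower() in keywords.lower() for term in keyword_terms)
    return title_hits + keyword_hits


scores = {job_id: match_score(title, keywords) for job_id, title, keywords in jobs}
# Keep rows with at least one match (the WHERE), best match first (the ORDER BY).
ranked = sorted((i for i in scores if scores[i] > 0), key=lambda i: scores[i], reverse=True)
print(scores, ranked)
```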
You should not be storing keywords in a string like that. You should have a separate table.
If -- for some reason such as someone else's really, really, really bad design choices -- you have to deal with this data, then take the delimiters into account. In MySQL, I would recommend find_in_set() for this purpose:
SELECT j.*
FROM jobs j
WHERE title LIKE '%UX%' OR
      title LIKE '%UI%' OR
      title LIKE '%Developer%' OR
      FIND_IN_SET('HTML', REPLACE(keywords, ', ', ',')) > 0 OR
      FIND_IN_SET('CSS', REPLACE(keywords, ', ', ',')) > 0
ORDER BY ( (title LIKE '%UX%') +
           (title LIKE '%UI%') +
           (title LIKE '%Developer%') +
           (FIND_IN_SET('HTML', REPLACE(keywords, ', ', ',')) > 0) +
           (FIND_IN_SET('CSS', REPLACE(keywords, ', ', ',')) > 0)
         ) DESC ;
This finds an exact match on the keyword.
You can simplify the WHERE, but not the ORDER BY, to:
WHERE title REGEXP 'UX|UI|Developer' OR
      FIND_IN_SET('HTML', REPLACE(keywords, ', ', ',')) > 0 OR
      FIND_IN_SET('CSS', REPLACE(keywords, ', ', ',')) > 0
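FIND_IN_SET compares whole comma-separated elements, so the list has to use plain commas with no spaces after them, and a partial keyword does not match. The Python sketch below illustrates both points; `find_in_set` is a rough stand-in for the MySQL function (it ignores collation and NULL handling), written only for this illustration.

```python
def find_in_set(needle: str, csv: str) -> int:
    """Return the 1-based position of `needle` in a comma-separated list, or 0."""
    items = csv.split(",")
    return items.index(needle) + 1 if needle in items else 0


keywords = "CSS, HTML, JavaScript"
normalised = keywords.replace(", ", ",")   # "CSS,HTML,JavaScript"

print(find_in_set("HTML", normalised))   # whole element matches
print(find_in_set("HTML", keywords))     # the element is " HTML", so no match
print(find_in_set("Java", normalised))   # partial keyword does not match
print("Java" in keywords)                # a LIKE-style substring test would match
```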

MySQL strip ' on where queries

I have a large table whose records may or may not contain apostrophes (martin's, lay's, martins, lays, and so on).
Currently the client has to type the exact text to search, for example martin's, to find all records that contain "martin's", but that is cumbersome. I need the client to be able to search by either "martins" or "martin's".
This is a simple example:
A mysql table like:
ID | Title
---------------
1 lays
2 lay's
3 some text
4 other text
5 martin's
I need a SQL query that can search by lays or lay's, and both searches need to return a result like:
ID | Title
---------------
1 lays
2 lay's
I've tried many solutions from other posts but I can't get it to work :-(
I'd appreciate any help.
Just remove the single quote:
select t.*
from t
where replace(t.title, '''', '') = 'lays';
To search for titles that contain the word:
select t.*
from t
where replace(t.title, '''', '') LIKE '%lays%';
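For both "lays" and "lay's" to work as search input, the apostrophe has to be stripped from the search term as well as from the stored title. A minimal Python sketch of that comparison, using the sample rows from the question (`strip_apostrophes` and `search` are hypothetical helper names):

```python
rows = [
    (1, "lays"),
    (2, "lay's"),
    (3, "some text"),
    (4, "other text"),
    (5, "martin's"),
]


def strip_apostrophes(text: str) -> str:
    """Remove single quotes, like REPLACE(title, '''', '') in the query."""
    return text.replace("'", "")


def search(term: str):
    """Return the IDs whose title equals the term once apostrophes are ignored."""
    wanted = strip_apostrophes(term)
    return [row_id for row_id, title in rows if strip_apostrophes(title) == wanted]


print(search("lays"))
print(search("lay's"))
```

In SQL the same normalisation could be applied to the search parameter before it is compared.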

SELECT COUNT(*) Performance

Lately I discovered that the most expensive requests on my website are the SELECT COUNT(*) queries.
A simple request can sometimes take more than a second:
SELECT COUNT(*) as count FROM post WHERE category regexp '[[:<:]](17|222)[[:>:]]' AND approve=1 AND date < '2014-01-25 19:08:17';
+-------+
| count |
+-------+
| 3585 |
+-------+
1 row in set (0.49 sec)
I'm not sure what the problem is; I have indexes on category, approve and date.
This is your query:
SELECT COUNT(*) as count
FROM post
WHERE category regexp '[[:<:]](17|222)[[:>:]]' AND approve=1 AND
date < '2014-01-25 19:08:17';
It is not a simple request because the regexp has to run on every row (or every row filtered by the other conditions).
An index on post(approve, date, category) might help. You want one index with the columns listed in that order.
EDIT:
If the values are being stored in a space separated list, you might try this to see if it is faster:
WHERE (concat(' ', category, ' ') like '% 17 %' or concat(' ', category, ' ') like '% 222 %') AND
approve = 1 AND date < '2014-01-25 19:08:17';
It is possible that these expressions are faster than the regular expression.
And, finally, if you really do need to search for "words" in a field, then consider a full text index. I think you might have to tinker with the options in this case so numbers are allowed in the index.
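The CONCAT trick works because padding the list with a space on each side gives every value, including the first and last, a space before and after it. A short Python sketch of the same test, assuming (as the answer does) that category holds space-separated IDs; `has_id` and the sample values are made up for illustration:

```python
def has_id(category: str, wanted: str) -> bool:
    """Same test as: CONCAT(' ', category, ' ') LIKE '% wanted %'."""
    return f" {wanted} " in f" {category} "


categories = ["17 45", "117", "9 222", "2222 17", "5"]
matching = [c for c in categories if has_id(c, "17") or has_id(c, "222")]
print(matching)
```

Values such as 117 or 2222 are not matched, because the padded pattern requires a space directly on both sides of the ID.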

SQL Select into array - column has separator

I have a column in my DB that has the following data (yeah, I know it's wrong to have multiple names separated by some random character):
"John Cusack | Thandie Newton | Chiwetel Ejiofor"
I want to be able to separate these people into an array to use later, or even just to display them as below:
John Cusack
Thandie Newton
Chiwetel Ejiofor
Any ideas, please?
Thanks in advance.
As you say, storing delimited lists in an RDBMS really is not a good idea; however, you may be able to use MySQL's string manipulation functions such as SUBSTRING_INDEX() to obtain your desired results (MySQL doesn't have array types, so I assume you're merely looking to split the data):
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(my_column, '|', 1), '|', -1),
       SUBSTRING_INDEX(SUBSTRING_INDEX(my_column, '|', 2), '|', -1),
       SUBSTRING_INDEX(SUBSTRING_INDEX(my_column, '|', 3), '|', -1)
FROM my_table
Note that one doesn't actually need to invoke SUBSTRING_INDEX() twice for the first and last elements of the list, but I thought it informative to do so in order that the pattern for further elements can be seen more clearly.
If you were so inclined, you could build a stored procedure that loops over the string populating a temporary table with each found element—but this is all so far away from "good practice" that it's almost certainly not worth delving into it any further.
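If the value is fetched into application code first, the split is a one-liner there. The Python sketch below is only an illustration of that route and assumes the column value arrives as a plain string; the variable names are made up.

```python
raw = "John Cusack | Thandie Newton | Chiwetel Ejiofor"

# Split on the pipe and trim the spaces that surround each name.
names = [part.strip() for part in raw.split("|")]

for name in names:
    print(name)
```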
you can try this.
select substring_index(substring_index('a|b|c|h', '|', @r:=@r+1), '|', -1) zxz
from (select @r:=0) x,
(select 'x' xx union select 'v' xx union select 'z' xx union select 'p' xx) z;
The result looks like:
+-----+
| zxz |
+-----+
| a   |
| b   |
| c   |
| h   |
+-----+
Found here: Mysql
and slightly modified.
Remember: the number of rows produced by the UNION selects has to match the number of delimited elements in your string.
Kind regards

Mysql + count all words in a Column

I have 2 columns in a table and I would like to roughly report on the total number of words.
Is it possible to run a MySQL query and find out the total number of words down a column.
It would basically be any text separated by a space or multiple space.
Doesn't need to be 100% accurate as its just a general guide.
Is this possible?
Try something like this:
SELECT COUNT(LENGTH(column) - LENGTH(REPLACE(column, ' ', '')) + 1)
FROM table
This takes the number of characters in your column and subtracts the number of characters that remain after removing all the spaces. That gives the number of spaces in the row, and from that roughly how many words there are (roughly, because a double space will count as two words, but you say an approximate figure is enough, so this should suffice).
Count simply gives you the number of found rows. You need to use SUM instead.
SELECT SUM(LENGTH(column) - LENGTH(REPLACE(column, ' ', '')) + 1) FROM table
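The difference between COUNT and SUM here is easy to see on a couple of rows. A small Python sketch with made-up sample data (`rough_word_count` is a hypothetical helper that applies the same spaces-plus-one formula):

```python
def rough_word_count(text: str) -> int:
    """Spaces in the text plus one, as in LENGTH(col) - LENGTH(REPLACE(col, ' ', '')) + 1."""
    return len(text) - len(text.replace(" ", "")) + 1


rows = ["I love this burger", "I hate  this burger"]  # second row has a double space

per_row = [rough_word_count(r) for r in rows]
print(len(per_row))   # what COUNT(...) reports: the number of rows
print(sum(per_row))   # what SUM(...) reports: the rough total of words
```

The double space in the second row is counted as an extra word, which is the inaccuracy the answers above mention.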
A less rough count:
SELECT LENGTH(column) - LENGTH(REPLACE(column, SPACE(1), ''))
FROM
( SELECT CONCAT(TRIM(column), SPACE(1)) AS column
FROM
( SELECT REPLACE(column, SPACE(2), SPACE(1)) AS column
FROM
( SELECT REPLACE(column, SPACE(3), SPACE(1)) AS column
FROM
( SELECT REPLACE(column, SPACE(5), SPACE(1)) AS column
FROM
( SELECT REPLACE(column, SPACE(9), SPACE(1)) AS column
FROM
( SELECT REPLACE(column, SPACE(17), SPACE(1)) AS column
FROM
( SELECT REPLACE(column, SPACE(33), SPACE(1)) AS column
FROM tableX
) AS x
) AS x
) AS x
) AS x
) AS x
) AS x
) AS x
I stumbled upon this post while looking for an answer myself. I've tested all of the answers here, and the closest one was @fikre's answer. However, I was concerned about data with leading spaces and/or extra spaces between the words (trailing spaces didn't seem to affect fikre's query during my testing). So I was looking for a way to identify any extra spaces between words and remove them. While I found a few answers using advanced functions (which are beyond my skill set), I did find a very simple way to do it.
tl;dr > @fikre's answer is the only one that worked for me, but I made a minor tweak to ensure that I get the most accurate word count.
Query 1 -- This will return a "Word Count" of 5
SELECT SUM(LENGTH(input) - LENGTH(REPLACE(input, ' ', '')) + 1) AS "Word Count" FROM
(SELECT TRIM(REPLACE(REPLACE(REPLACE(input,' ','<>'),'><',''),'<>',' ')) AS input
FROM (SELECT ' too   late  to the     party ' AS input) i) r;
Query 2 -- This will return a "Word Count" of 13
SELECT SUM(LENGTH(input) - LENGTH(REPLACE(input, ' ', '')) + 1) AS "Word Count"
FROM (SELECT ' too   late  to the     party ' AS input) i;
-- breakdown of ' too   late  to the     party '
1 leading space = 1 word count
2 extra spaces after the first space following the word 'too' = 2 word count
1 extra space after the first space following the word 'late' = 1 word count
4 extra spaces after the first space following the word 'the' = 4 word count
Trailing space(s) weren't counted at all.
Total extra spaces > 1+2+1+4 = 8, plus 5 word count = 13
So, basically if the data row contains even a million spaces in between (disclaimer: an assumption. I've only tested 336,896 spaces), Query 1 will still return Word count=5.
Note: The mid part REPLACE(REPLACE(REPLACE(input,' ','<>'),'><',''),'<>',' ') I took from this answer https://stackoverflow.com/a/55476224/10910692
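The three nested REPLACE calls can be followed step by step outside MySQL. The Python sketch below applies the same substitutions with `str.replace`; `collapse_spaces` is a hypothetical name and the spacing of the sample string is an assumption chosen for illustration. Text that already contains `<` or `>` next to each other could be altered by this trick.

```python
def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces to one space, then trim, using the '<>' marker trick."""
    marked = text.replace(" ", "<>")      # every space becomes "<>"
    merged = marked.replace("><", "")     # adjacent markers collapse into a single "<>"
    return merged.replace("<>", " ").strip(" ")


sample = " too   late  to the     party "
cleaned = collapse_spaces(sample)
word_count = len(cleaned) - len(cleaned.replace(" ", "")) + 1
print(repr(cleaned), word_count)
```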