I'm trying to figure out how to determine the most-used words in a MySQL dataset.
I'm not sure how to go about this or whether there's a simpler approach. I've read a couple of posts where some suggest an algorithm.
Example:
From 24,500 records, find out the top 10 used words.
Right, this runs like a dog and is limited to working with a single delimiter, but hopefully will give you an idea.
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM SomeTable
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
This relies on having a table called integers with a single column called i with 10 rows with the values 0 to 9. It copes with up to ~1000 words but can easily be altered to cope with more (but will slow down even more).
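To see why the nested SUBSTRING_INDEX() calls pick out one word per value of aCnt, here is a rough Python sketch of the same logic. The function names are made up for illustration, and substring_index is only an approximation of the MySQL function for a non-empty delimiter.

```python
from typing import List


def substring_index(text: str, delim: str, count: int) -> str:
    """Approximate MySQL SUBSTRING_INDEX for a non-empty delimiter."""
    if count == 0:
        return ""
    parts = text.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: everything to the right, counting from the end.
    return delim.join(parts[count:])


def nth_word(text: str, n: int) -> str:
    """Keep the first n words, then take the last of them."""
    return substring_index(substring_index(text + " ", " ", n), " ", -1)


def split_like_query(text: str) -> List[str]:
    # Same bound as the WHERE clause: number of spaces + 1.
    max_cnt = len(text) + 1 - len(text.replace(" ", ""))
    words = [nth_word(text, n) for n in range(1, max_cnt + 1)]
    # Same as WHERE aWord != '' (drops blanks caused by double spaces).
    return [w for w in words if w != ""]
```

The cross join against the numbers subquery plays the role of the range() loop here: each row is paired with every position up to its word count.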
Why not do it all in PHP? Steps would be
Create a dictionary (word => count)
Read your data in PHP
Split it into words
Add each word to the dictionary (you might want to lowercase and trim them first)
If already in the dictionary, increment its count. If not already in the dictionary, set 1 as its value (count = 1)
Iterate your dictionary elements to find the highest 10 values
I wouldn't do it in SQL mainly because it'd end up more complex.
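The steps above can be sketched as follows. This uses Python rather than PHP purely for illustration; top_words and the sample rows are hypothetical, and stripping surrounding punctuation is an assumption about what "trim" should cover.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(rows: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Return the `limit` most frequent words across all rows."""
    counts: Counter = Counter()  # the word => count dictionary
    for text in rows:
        # Lowercase, then split on any run of whitespace.
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'()")  # trim surrounding punctuation
            if word:
                counts[word] += 1  # a missing key starts at 0
    return counts.most_common(limit)


# Hypothetical rows, as they might come back from a SELECT.
rows = ["Post Testing", "Post Checking", "My First Post", "My first Post Check"]
top_three = top_words(rows, 3)
```

For a table too large to load at once, the rows can be streamed from an unbuffered cursor, since only the dictionary has to stay in memory.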
The general idea would be to figure out how many delimiters (e.g. spaces) are in each field, and run SUBSTRING_INDEX() in a loop for each such field. Populating the results into a temporary table has the added benefit of letting you run this in chunks, in parallel, etc. It shouldn't be too cumbersome to throw some stored procedures together to do this.
SELECT `COLUMNNAME`, COUNT(*) FROM `TABLENAME` GROUP BY `COLUMNNAME`
It's very simple and worked... :) (Note that this counts whole column values, so it only helps when each row holds a single word.)
A small improvement: remove stop words from the list by adding AND Sub2.aWord NOT IN (list of stop words).
SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(txt_msg, ' '), ' ', aCnt), ' ', -1) AS aWord
FROM mensagens
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(txt_msg) + 1 - LENGTH(REPLACE(txt_msg, ' ', ''))) >= aCnt) Sub2
WHERE Sub2.aWord != '' AND Sub2.aWord not in ('a','about','above', .....)
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
I have a varchar(255) column with FULLTEXT index. I need a query to get the most frequent words in the entire column as
Word Frequency
key1 4533
key2 4332
key3 2932
Note 1: I would prefer to skip common words such as prepositions, but it is not critical as I can filter them later. Just mentioned if it can speed up the query.
Note 2: It is a table with over a million rows. It is not a regular query but should be practically fast.
Even a hint about what the query should look like would be a great help.
This is not really something that is easy to do in MySQL. The full text index is not available for querying. One thing you can do is extract words. This is a bit painful. The following assumes that words are separated by a single space and gets the frequencies of the first three words:
select substring_index(substring_index(t.words, ' ', n.n), ' ', -1) as word, count(*)
from t join
(select 1 as n union all select 2 union all select 3
) n
on n.n <= length(t.words) - length(replace(t.words, ' ', '')) + 1
group by substring_index(substring_index(t.words, ' ', n.n), ' ', -1)
order by count(*) desc;
I have a mysql table "post" :
id Post
-----------------------------
1 Post Testing
2 Post Checking
3 My First Post
4 My first Post Check
I need to count the number of distinct words in all the values for the Post column.
Is there any way to get the following results using a single query?
post count
------------------
Post 4
Testing 1
checking 1
My 2
first 2
check 1
Not in an easy way. If you know the maximum number of words, then you can do something like this:
select substring_index(substring_index(p.post, ' ', n.n), ' ', -1) as word,
count(*)
from post p join
(select 1 as n union all select 2 union all select 3 union all select 4
) n
on n.n <= length(p.post) - length(replace(p.post, ' ', '')) + 1
group by word;
Note that this only works if the words are separated by single spaces. If you have a separate dictionary of all possible words, you can also use that (it counts the posts containing each word, not every occurrence), something like:
select d.word, count(p.id)
from dictionary d left join
post p
on concat(' ', p.post, ' ') like concat('% ', d.word, ' %')
group by d.word
You can use a FULLTEXT index.
First add a FULLTEXT index to your column like:
CREATE FULLTEXT INDEX ft_post
ON post(Post);
Then flush the index to disk using optimize table:
SET GLOBAL innodb_optimize_fulltext_only=ON;
OPTIMIZE TABLE post;
SET GLOBAL innodb_optimize_fulltext_only=OFF;
Set the aux table:
SET GLOBAL innodb_ft_aux_table = '{yourDb}/post';
And now you may simply select for word and word counts like:
SELECT word, doc_count FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE;
Is there a way to take a value like "300, 400, 500, 300", check each comma-separated number, and delete any that is duplicated, so that the value becomes "300, 400, 500"?
I could do it in PHP script but I just wonder if it is possible using MySQL.
Create a temp table with unique index, insert values ignoring duplicate errors, select all records from the temp table, delete the table.
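The same idea (a container that silently ignores duplicates) is easy to try outside the database first. A minimal Python sketch, where dedupe_csv is a hypothetical helper name and keeping the first occurrence of each value is an assumption about the desired order:

```python
def dedupe_csv(value: str, sep: str = ",") -> str:
    """Drop repeated items from a delimited string, keeping first occurrences."""
    items = (item.strip() for item in value.split(sep))
    # dict keys are unique and preserve insertion order, much like
    # inserting into a table with a unique index and ignoring duplicates.
    unique = dict.fromkeys(item for item in items if item)
    return (sep + " ").join(unique)


cleaned = dedupe_csv("300, 400, 500, 300")
```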
A quick attempt, but to get the unique values for each row you could use something like this (TRIM() removes the space that follows each comma):
SELECT Id, GROUP_CONCAT(DISTINCT aWord ORDER BY aWord ASC)
FROM (SELECT SomeTable.Id, TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(SomeColumn, ','), ',', aCnt), ',', -1)) AS aWord
FROM SomeTable
CROSS JOIN (
SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt
FROM integers a, integers b, integers c) Sub1
WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ',', ''))) >= aCnt) Sub2
GROUP BY Id
This relies on having a table called integers with a single column called i with 10 rows with the values 0 to 9. It copes with up to ~1000 words but can easily be altered to cope with more
To write the de-duplicated values back, it is probably easiest to use this query as the source of an INSERT ... ON DUPLICATE KEY UPDATE.
How can I order my query by word count? Is it possible?
I have some rows in table, with text fields. I want to order them by word count of these text fields.
The second problem is that I need to select only those rows that have, for example, a minimum of 10 words or a maximum of 20.
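Well, this will not perform very well, since the string calculations need to be performed for every row.
You can count the number of words in a MySQL column for each row like so: SELECT LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1 FROM `table` (provided that words are defined as "whatever is delimited by a single space"). Wrapping the expression in SUM() gives the total for the whole table instead, which is not what you want when ordering or filtering individual rows.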
Now, add this to your query:
SELECT
<fields>
FROM
<table>
WHERE
<condition>
ORDER BY LENGTH(<fieldWithWords>) - LENGTH(REPLACE(<fieldWithWords>, ' ', '')) + 1
Or, add it to the condition:
SELECT
<fields>
FROM
<table>
WHERE
(LENGTH(<fieldWithWords>) - LENGTH(REPLACE(<fieldWithWords>, ' ', '')) + 1) BETWEEN 10 AND 20
ORDER BY <something>
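The per-row word-count expression can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, since SQLite also provides LENGTH() and REPLACE(); the notes table and its contents are invented for the example, and the range 2 to 3 replaces 10 to 20 only to keep the sample data short. Note that the expression is used per row, with no aggregate function around it.

```python
import sqlite3

# Spaces + 1, i.e. the number of single-space-delimited words in `body`.
WORD_COUNT = "LENGTH(body) - LENGTH(REPLACE(body, ' ', '')) + 1"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [("one two three",), ("one",), ("one two",), ("a b c d e",)],
)

# Order rows by their own word count, longest first.
ordered = conn.execute(
    f"SELECT body, {WORD_COUNT} AS words FROM notes ORDER BY words DESC"
).fetchall()

# Keep only rows whose word count falls inside a range.
in_range = conn.execute(
    f"SELECT body FROM notes WHERE {WORD_COUNT} BETWEEN 2 AND 3 ORDER BY id"
).fetchall()
```

Because the expression has to be evaluated for every row, neither query can use an ordinary index; storing the word count in its own column when the row is written would avoid that.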
Maybe something like this:
SELECT Field1, SUM( LENGTH(Field2) - LENGTH(REPLACE(Field2, ' ', ''))+1)
AS cnt
FROM tablename
GROUP BY Field1
ORDER BY cnt
Field2 is the string field in which you'd like to count words.
This is a query I've been puzzling over for quite some time. I've never been able to get it to work quite right, and after about 40 hours of pondering I've gotten to this point.
Setup
For the example issue we have 2 tables, one being...
field_site_id field_sitename field_admins
1 Some Site 1,
2 Other Site 1,2,
And the other is admins like...
field_user_id field_firstname field_lastname
1 Joe Bloggs
2 Barry Wills
Now all this query is designed to do is the following:
List all sites in the database
Using a JOIN and FIND_IN_SET to pull each admin
And GROUP_CONCAT(field_firstname, ' ', field_lastname) with a GROUP BY to build a field with the real user names.
Also allow HAVING to filter on the custom result to narrow the results down further.
All this part works perfectly fine.
What I can't work out is how to sort the results by the GROUP_CONCAT result. I imagine this is because the ORDER BY is applied before the concat function, so the data doesn't exist yet to order by. What would the alternative be?
Code examples:
SELECT *,
GROUP_CONCAT(DISTINCT field_firstname, ' ', field_lastname ORDER BY field_lastname SEPARATOR ', ') AS field_admins_fullname
FROM `table_sites`
LEFT JOIN `table_admins` ON FIND_IN_SET( `table_admins`.`field_user_id`, `table_sites`.`field_site_id` ) > 0
GROUP BY field_site_id
I also tried a query that used a subquery to gather the group_concat result as below...
( SELECT GROUP_CONCAT(field_firstname, ' ', field_lastname ORDER BY field_lastname ASC SEPARATOR ', ') FROM table_admins
WHERE FIND_IN_SET( `table_admins`.`field_user_id`, `table_sites`.`field_admins` ) > 0
) AS field_admins_fullname
Conclusion
Either way, attempting to ORDER BY field_admins_fullname does not produce the correct results. It doesn't error out, but I assume that's because the given ORDER BY value is blank, so it just does whatever it wants.
Any suggestions would be welcome. If this is just not possible, what would be another recommended indexing methodology?
Two things I see wrong:
1st, is the JOIN. It should be using s.field_admins and not field_site_id :
ON FIND_IN_SET( a.field_user_id, s.field_admins ) > 0
2nd, you should use the CONCAT() function (to concatenate fields from the same row) inside the GROUP_CONCAT().
Try this:
SELECT s.field_site_id
, s.field_sitename
, GROUP_CONCAT( CONCAT(a.field_firstname, ' ', a.field_lastname)
ORDER BY a.field_lastname ASC
SEPARATOR ', '
)
AS field_admins_fullname
FROM table_sites s
LEFT JOIN table_admins a
ON FIND_IN_SET( a.field_user_id, s.field_admins ) > 0
GROUP BY s.field_site_id
Friendly advice:
Don't use Do use
------------ --------
table_sites site
table_admins admin
field_site_id site_id
field_sitename sitename
field_admins admins
But what should really be stressed is your setup. Having fields that hold comma-separated values leads to these kinds of horrible queries that use FIND_IN_SET() for joins and GROUP_CONCAT() for showing results. They are horrible to read, difficult to maintain and, most important, very, very slow, as no index can be used.
You should have something like this instead:
Setup suggestion
Table: site
site_id sitename
1 Some Site
2 Other Site
Table: site_admin
site_id admin_id
1 1
2 1
2 2
Table: admin
user_id firstname lastname
1 Joe Bloggs
2 Barry Wills
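A minimal sketch of how the suggested three-table layout could be queried, using Python's built-in sqlite3 module as a stand-in for MySQL. The column types and key constraints are assumptions, since the suggestion only lists column names, and the name list is assembled in Python rather than with GROUP_CONCAT to keep the example portable.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (site_id INTEGER PRIMARY KEY, sitename TEXT);
CREATE TABLE admin (user_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
CREATE TABLE site_admin (
    site_id INTEGER REFERENCES site(site_id),
    admin_id INTEGER REFERENCES admin(user_id),
    PRIMARY KEY (site_id, admin_id)
);
INSERT INTO site VALUES (1, 'Some Site'), (2, 'Other Site');
INSERT INTO admin VALUES (1, 'Joe', 'Bloggs'), (2, 'Barry', 'Wills');
INSERT INTO site_admin VALUES (1, 1), (2, 1), (2, 2);
""")

# Plain equality joins on the junction table, so indexes can be used.
rows = conn.execute("""
    SELECT s.sitename, a.firstname || ' ' || a.lastname
    FROM site s
    LEFT JOIN site_admin sa ON sa.site_id = s.site_id
    LEFT JOIN admin a ON a.user_id = sa.admin_id
    ORDER BY s.site_id, a.lastname
""").fetchall()

# Build one comma-separated admin list per site.
names_by_site = defaultdict(list)
for sitename, fullname in rows:
    names = names_by_site[sitename]  # keeps sites that have no admins
    if fullname is not None:
        names.append(fullname)
admins = {site: ", ".join(names) for site, names in names_by_site.items()}

# Sort the sites by the concatenated admin names.
sorted_sites = sorted(admins.items(), key=lambda item: item[1])
```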
I think you need to repeat the complex CONCAT statement you are selecting within the ORDER BY.
So your order by would be more like...
ORDER BY (GROUP_CONCAT(DISTINCT field_firstname, ' ',
field_lastname ORDER BY field_lastname SEPARATOR ', ')) ASC
I have not tried this, but I had a similar issue that this seemed to solve, although my case was much simpler, without the DISTINCT etc.
Wrong GROUP BY; try this:
GROUP BY field_site_id