Count how many times a word is being used per day - mysql

I have a MySQL table named "content" containing (among others) the fields "_date" and "text", for example:
_date       text
---------------------------------------------------------
2011-02-18  I'm afraid my car won't start tomorrow
2011-02-18  I hope I'm going to pass my exams
2011-02-18  Exams coming up - I'm not afraid :P
2011-02-19  Not a single f was given this day
2011-02-20  I still hope I passed, but I'm afraid I didn't
2011-02-20  On my way to school :)
I'm looking for a query to count the number of times the words "hope" and "afraid" are being used per day. In other words, the output would have to be something like:
_date       word    count
-------------------------
2011-02-18  hope    1
2011-02-18  afraid  2
2011-02-19  hope    0
2011-02-19  afraid  0
2011-02-20  hope    1
2011-02-20  afraid  1
Is there an easy way to do this, or should I just write a different query per term? I now have this, but I don't know what to put instead of "?":
SELECT COUNT(?) FROM content WHERE text LIKE '%hope' GROUP BY _date
Can somebody help me with the correct query for this?

I think the easiest and most readable way is to combine one query per word with UNION:
Select
_date, 'hope' as word,
sum( case when `text` like '%hope%' then 1 else 0 end) as n
from content
group by _date
UNION
Select
_date, 'afraid' as word,
sum( case when `text` like '%afraid%' then 1 else 0 end) as n
from content
group by _date
This approach does not have the best performance. If performance matters, you should group by day in a subquery first; the LIKE '%...%' condition is also a performance killer, because a pattern with a leading wildcard cannot use an index. This solution is fine if you only run the query occasionally in batch mode. Explain your performance requirements for a more accurate solution.
Edited to match the OP's latest requirement.

Your query is almost correct:
SELECT _date, 'hope' AS word, COUNT(*) as count
FROM content WHERE text LIKE '%hope%' GROUP BY _date
Use %hope% to match the word anywhere (not only at the end of the string). COUNT(*) should do what you want.
To get multiple words from a single query, use UNION ALL.
Another approach is to create a sequence of words on the fly and use it as the second table in a join:
SELECT _date, words.word, COUNT(*) as count
FROM (
SELECT 'hope' AS word
UNION
SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word
Note that this counts at most one occurrence of each word per row. So »I hope there is still hope« will only give you 1, not 2.
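If every occurrence should count rather than every matching row, one common workaround is to compare the string length before and after removing the word. The following is a minimal sketch, run through Python's sqlite3 module only so that it is self-contained; the sample rows are made up, and in MySQL the `||` concatenation would be written with CONCAT(). It still matches substrings (so "hopeful" counts), and REPLACE is case-sensitive.

```python
import sqlite3

# In-memory stand-in for the "content" table; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (_date TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO content VALUES (?, ?)",
    [
        ("2011-02-18", "I hope there is still hope"),
        ("2011-02-18", "I hope so"),
        ("2011-02-19", "nothing to see here"),
    ],
)

# occurrences: characters removed by REPLACE divided by the word length.
# matching_rows: what the LIKE-based COUNT would report.
rows = conn.execute(
    """
    SELECT _date,
           SUM((LENGTH(text) - LENGTH(REPLACE(text, :w, ''))) / LENGTH(:w)) AS occurrences,
           SUM(CASE WHEN text LIKE '%' || :w || '%' THEN 1 ELSE 0 END) AS matching_rows
    FROM content
    GROUP BY _date
    ORDER BY _date
    """,
    {"w": "hope"},
).fetchall()

for row in rows:
    print(row)
```

For 2011-02-18 the two columns differ: three occurrences, but only two matching rows.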
To get 0 when there are no matches, join the previous result with the dates again:
SELECT content._date, COALESCE(result.word, 'no match'), COALESCE(result.count, 0)
FROM content
LEFT JOIN (
SELECT _date, words.word, COUNT(*) as count
FROM (
SELECT 'hope' AS word
UNION
SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word ) AS result
ON content._date = result._date
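As far as I can tell, joining back on content._date alone returns one output row per row of content and cannot say which word is missing on a given day. An alternative sketch, assuming the goal is exactly one row per (date, word) pair: cross join the distinct dates with the word list, then LEFT JOIN the matching rows. It is run here through Python's sqlite3 module with the question's sample data; in MySQL the `||` concatenation would be CONCAT().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (_date TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO content VALUES (?, ?)",
    [
        ("2011-02-18", "I'm afraid my car won't start tomorrow"),
        ("2011-02-18", "I hope I'm going to pass my exams"),
        ("2011-02-18", "Exams coming up - I'm not afraid :P"),
        ("2011-02-19", "Not a single f was given this day"),
        ("2011-02-20", "I still hope I passed, but I'm afraid I didn't"),
        ("2011-02-20", "On my way to school :)"),
    ],
)

# Every (date, word) pair exists before the LEFT JOIN, so pairs without
# a matching row survive and COUNT(c._date) reports 0 for them.
rows = conn.execute(
    """
    SELECT d._date, w.word, COUNT(c._date) AS n
    FROM (SELECT DISTINCT _date FROM content) AS d
    CROSS JOIN (SELECT 'hope' AS word UNION ALL SELECT 'afraid') AS w
    LEFT JOIN content AS c
           ON c._date = d._date
          AND c.text LIKE '%' || w.word || '%'
    GROUP BY d._date, w.word
    ORDER BY d._date, w.word
    """
).fetchall()

for row in rows:
    print(row)
```

The counts match the table requested in the question, including the two zero rows for 2011-02-19.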

Assuming you want to count all words and find the most used words (rather than looking for the count of a few specific words), you might want to try something like the following stored procedure (string splitting courtesy of this blog post):
DROP PROCEDURE IF EXISTS wordsUsed;
DELIMITER //
CREATE PROCEDURE wordsUsed ()
BEGIN
    DROP TEMPORARY TABLE IF EXISTS wordTmp;
    CREATE TEMPORARY TABLE wordTmp (word VARCHAR(255));
    SET @wordCt = 0;
    SET @tokenCt = 1;
    contentLoop: LOOP
        SET @stmt = 'INSERT INTO wordTmp SELECT REPLACE(SUBSTRING(SUBSTRING_INDEX(`text`, " ", ?),
            LENGTH(SUBSTRING_INDEX(`text`, " ", ? -1)) + 1),
            " ", "") word
            FROM content
            WHERE LENGTH(SUBSTRING_INDEX(`text`, " ", ?)) != LENGTH(`text`)';
        PREPARE cmd FROM @stmt;
        EXECUTE cmd USING @tokenCt, @tokenCt, @tokenCt;
        SELECT ROW_COUNT() INTO @wordCt;
        DEALLOCATE PREPARE cmd;
        IF (@wordCt = 0) THEN
            LEAVE contentLoop;
        ELSE
            SET @tokenCt = @tokenCt + 1;
        END IF;
    END LOOP;
    SELECT word, count(*) usageCount FROM wordTmp GROUP BY word ORDER BY usageCount DESC;
END //
DELIMITER ;
CALL wordsUsed();
You might want to write another query (or procedure) or add some nested "REPLACE" statements to further remove punctuation from the resulting temp table of words, but this should be a good start.

Related

How can I use an IF or Case function to summarize a GROUP_CONCAT column? AND then apply it to the original data table?

I am quite the novice at MySQL and would appreciate any pointers. The goal here is to automate a categorical field using GROUP_CONCAT, and then summarize certain patterns in the GROUP_CONCAT field in a new_column. Furthermore, is it possible to add the new_column to the original table in one query? Below is what I've tried; it fails with an unknown column "Codes" error, if that helps:
SELECT
`ID`,
`Code`,
GROUP_CONCAT(DISTINCT `Code` ORDER BY `Code` ASC SEPARATOR ", ") AS `Codes`,
IF(`Codes` LIKE '123%', 'Description1',
IF(`Codes` = '123, R321', 'Description2',
"Logic Needed"))
FROM Table1
GROUP BY `ID`
Instead of nested IF statements, I would like to use a CASE statement. The reason is that I already have around 1000 lines of logic written as "If [column] = "?" Then "?" else if" etc., and I feel CASE would be an easier transition for that logic. Maybe something like:
SELECT
`ID`,
`Code`,
GROUP_CONCAT(DISTINCT `Code` ORDER BY `Code` ASC SEPARATOR ", ") AS `Codes`,
CASE
WHEN `Codes` LIKE '123%' THEN 'Description1'
WHEN `Codes` = '123, R321' THEN 'Description2'
ELSE "Logic Needed"
END
FROM Table1
GROUP BY `ID`
Table Example:
ID,Code
1,R321
1,123
2,1234
3,1231
4,123
4,R321
Completed Table:
ID,Codes,New_Column
1,"123, R321",Description2
2,1234,Description1
3,1231,Description1
4,"123, R321",Description2
How then can I add back the summarized data to the original table?
Final Table:
ID,Code,New_Column
1,R321,Description2
1,123,Description2
2,1234,Description1
3,1231,Description1
4,123,Description2
4,R321,Description2
Thanks.
You can't refer to a column alias in the same SELECT list that defines it. You need to do the GROUP_CONCAT() in a subquery; then the main query can refer to Codes to summarize it.
It also doesn't make sense to select Code, since there isn't a single Code value in the group.
SELECT ID, Codes,
CASE
WHEN `Codes` = '123, R321' THEN 'Description2'
WHEN `Codes` LIKE '123%' THEN 'Description1'
ELSE "Logic Needed"
END AS New_Column
FROM (
SELECT
`ID`,
GROUP_CONCAT(DISTINCT `Code` ORDER BY `Code` ASC SEPARATOR ", ") AS `Codes`
FROM Table1
GROUP BY ID
) AS x
As mentioned in a comment, the WHEN clauses are tested in order, so you need to put the more specific cases first. You might want to use FIND_IN_SET() rather than LIKE, since 123% will match 1234, not just "123, something".
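To make the difference between prefix matching and element matching concrete, here is a small Python sketch that emulates what FIND_IN_SET() does (the helper name find_in_set is my own, and NULL handling is left out). One caveat it makes visible: FIND_IN_SET() splits on commas only, so with the ", " separator used in the GROUP_CONCAT above the second element would carry a leading space; a plain "," separator avoids that.

```python
def find_in_set(needle: str, str_list: str) -> int:
    """Emulate MySQL FIND_IN_SET: 1-based position of needle among the
    comma-separated items of str_list, or 0 if it is not present."""
    if not str_list:
        return 0
    items = str_list.split(",")
    return items.index(needle) + 1 if needle in items else 0


# Prefix matching (what LIKE '123%' does) accepts 1234 as well.
print("1234".startswith("123"))           # True
# Element matching only accepts the exact code.
print(find_in_set("123", "1234"))         # 0
print(find_in_set("123", "123,R321"))     # 1
# With a ", " separator the second item is " R321", not "R321".
print(find_in_set("R321", "123, R321"))   # 0
print(find_in_set("R321", "123,R321"))    # 2
```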

Getting first record that satisfies left-most condition in SQL

My main goal is to search a column for a specific value (say, word). If it doesn't exist, I want to find the first value that matches word% or wor% or wo% or w%.
In "English", the query would read like: "look for 'word' and return it if it exists. If not, look for the first word that has the maximum same prefix as 'word'".
I can write
SELECT word FROM words WHERE word = 'word' or word LIKE 'word%' or ... LIMIT 1;
I tried ordering alphabetically, but that won't work ('wo' comes before 'wor'). I also can't order in reverse, or 'wordy' will come before 'word'.
My current idea is just to call the database n times, where n = length(word). But I would like to know if there is any kind of 'short-circuit OR' in SQL -- MySQL/MariaDB, to be precise.
Example
DB has 'w', 'word', 'wording', want to search by 'word' and retrieve 'word' only.
DB has 'z', 'zab', 'zac', 'ze', 'zeb' want to search by 'za' and get 'zab'
It sounds like you are looking for a string distance algorithm. The string distance algorithm tells you how many changes are needed to change the current word into the desired word. The idea is that you give all your words a string distance and sort ascending on the distance. An exact match will have 0, a missing or extra letter will have 1.
Not exactly the answer to your question, but I am hoping it is actually what you were looking for. You may also be interested in word stems which will go nicely with this.
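As an illustration of the idea, here is a minimal Python sketch of one common string distance, the Levenshtein (edit) distance, used to sort the example words from the question. The choice of Levenshtein is my assumption; the answer does not name a specific algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b (classic dynamic-programming formulation)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def closest_by_distance(query, words):
    # Sort ascending on distance, breaking ties alphabetically.
    return sorted(words, key=lambda w: (levenshtein(query, w), w))[0]


print(closest_by_distance("word", ["w", "word", "wording"]))       # word
print(closest_by_distance("za", ["z", "zab", "zac", "ze", "zeb"]))  # z
```

The second call shows the limitation: 'z', 'zab', 'zac' and 'ze' all have distance 1 from 'za', so plain edit distance does not by itself reproduce the longest-prefix rule from the question.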
EDIT
Extending my answer with a solution to your actual query.
Add a function:
CREATE FUNCTION `WORDRANK`(`a` VARCHAR(150), `b` VARCHAR(150)) RETURNS INT
BEGIN
DECLARE rank INT DEFAULT 0;
WHILE rank < LENGTH(a) DO
IF rank = 0 AND b = a THEN RETURN rank;
ELSEIF rank = 0 AND b LIKE CONCAT(a, "%") THEN RETURN rank + 1;
ELSEIF b LIKE CONCAT(LEFT(a, LENGTH(a) - rank), "%") THEN RETURN rank + 2;
END IF;
SET rank = rank + 1;
END WHILE;
RETURN rank + 100;
END
Then create a stored procedure:
CREATE PROCEDURE `getClosestMatch`(IN `q` VARCHAR(150))
BEGIN
SELECT
word
FROM words
WHERE word LIKE CONCAT(LEFT(q, 1),"%")
ORDER BY WORDRANK(q, word), word
LIMIT 1;
END
In order to get the desired results, we need to rank each word based on your desired algorithm, which we defined in the WORDRANK function. The stored procedure is so we have a generic way of executing the query.
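The ranking rule from the question (exact match first, then the longest shared prefix, then alphabetical order) can also be sketched outside the database. This is a plain Python illustration, not a port of WORDRANK; the helper names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest prefix shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def closest_match(query, words):
    """Exact match if present; otherwise the alphabetically first word
    sharing the longest prefix with query. None if nothing shares a prefix."""
    candidates = [w for w in words if common_prefix_len(query, w) > 0]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda w: (w != query, -common_prefix_len(query, w), w),
    )


print(closest_match("word", ["w", "word", "wording"]))        # word
print(closest_match("za", ["z", "zab", "zac", "ze", "zeb"]))  # zab
```

Both calls return the results asked for in the question's examples.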
Assumption: Words are ASCII, minimum length = 1, maximum value < 'ZZ'.
Assumption: VARCHAR input, no trailing spaces.
Assumption: if 'word' is not there, but 'wording' and 'wordy' are there,
you want 'wording' not 'wordy'.
Perhaps not simpler, but with a single SELECT statement ...
set @x = 'w'; /* or whatever word you want to search with */
select * from words
where word <= concat(@x,'zz')
and (word like concat(@x,'%') or @x like concat(word,'%'))
order by length(word) <> length(@x),
case when length(word) = length(@x) then 1 else 0 end asc,
case when length(word) > length(@x) then word else 'zz' end asc,
word desc limit 1;
Try the following query:
SELECT word
FROM words
WHERE word LIKE 'w%'
ORDER BY word;

Query to count the distinct words of all values in a column

I have a MySQL table "post":
id Post
-----------------------------
1 Post Testing
2 Post Checking
3 My First Post
4 My first Post Check
I need to count the number of distinct words in all the values for the Post column.
Is there any way to get the following results using a single query?
post count
------------------
Post 4
Testing 1
checking 1
My 2
first 2
check 1
Not in an easy way. If you know the maximum number of words, then you can do something like this:
select substring_index(substring_index(p.post, ' ', n.n), ' ', -1) as word,
       count(*)
from post p join
     (select 1 as n union all select 2 union all select 3 union all select 4
     ) n
     -- a post with k spaces has k + 1 words, so keep n = 1 .. k + 1
     on length(p.post) - length(replace(p.post, ' ', '')) >= n.n - 1
group by word;
Note that this only works if the words are separated by single spaces. If you have a separate dictionary of all possible words, you can also use that, something like:
select d.word, count(p.id)
from dictionary d left join
     post p
     on concat(' ', p.post, ' ') like concat('% ', d.word, ' %')
group by d.word;
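To see how the nested SUBSTRING_INDEX call in the first query picks out the n-th word, here is a rough Python emulation applied to the question's sample rows. The helper substring_index is my own approximation of the MySQL function for a non-empty delimiter, and lower-casing stands in for a case-insensitive collation.

```python
from collections import Counter


def substring_index(s: str, delim: str, count: int) -> str:
    """Approximate MySQL SUBSTRING_INDEX: everything before the count-th
    delimiter (count > 0) or after the count-th from the right (count < 0)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""


def nth_word(s: str, n: int) -> str:
    # Keep the first n words, then take the last of them.
    return substring_index(substring_index(s, " ", n), " ", -1)


posts = ["Post Testing", "Post Checking", "My First Post", "My first Post Check"]

counts = Counter()
for post in posts:
    n_words = post.count(" ") + 1  # k spaces means k + 1 words
    for n in range(1, n_words + 1):
        counts[nth_word(post, n).lower()] += 1

for word, total in counts.most_common():
    print(word, total)
```

The single-space caveat shows up here too: for a string with a double space, such as "My  Post", nth_word(..., 2) returns an empty string.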
You can use a FULLTEXT index.
First add a FULLTEXT index to your column like:
CREATE FULLTEXT INDEX ft_post
ON post(Post);
Then flush the index to disk using optimize table:
SET GLOBAL innodb_optimize_fulltext_only=ON;
OPTIMIZE TABLE post;
SET GLOBAL innodb_optimize_fulltext_only=OFF;
Set the aux table:
SET GLOBAL innodb_ft_aux_table = '{yourDb}/post';
And now you may simply select for word and word counts like:
SELECT word, doc_count FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE;

Cursor to find word Count

I am trying a very simple cursor to find word count in a table with a like condition. My Cursor is:
declare @Engword varchar(max)
Declare @wcount int
DECLARE word_cursor CURSOR FOR
select distinct engtitle from Table
where a1 = 'EHD'
ORDER BY Engtitle;
OPEN Word_cursor;
FETCH NEXT FROM Word_cursor
INTO @Engword;
WHILE @@FETCH_STATUS = 0
BEGIN
    Select @wcount = COUNT(*) from Table where engtitle like '%@Engword %'
    Insert into WordStatus(Engtitle,Estatus)
    Values(@Engword,@wcount)
    FETCH NEXT FROM Word_cursor
    INTO @Engword;
end
CLOSE Word_cursor;
DEALLOCATE Word_cursor;
GO
I want to insert each word, with the count that comes from the LIKE condition, into a different table, WordStatus. This cursor inserts the words into the new table, but they all have the same count, 0.
Please help!
This is because variables are not expanded inside quoted strings: '%@Engword %' is matched as literal text. I think it should work if you change that line to this:
Select @wcount = COUNT(*) from Table where engtitle like concat('%', @Engword, '%')
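The same effect can be reproduced in a few lines. This is a sketch using Python's sqlite3 module with a made-up titles table, where `||` plays the role of concat(): a parameter name written inside the quotes is just text, while a parameter concatenated with the wildcards is substituted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (engtitle TEXT)")
conn.executemany(
    "INSERT INTO titles VALUES (?)",
    [("Red car",), ("Blue car",), ("Green bike",)],
)

# Parameter name inside the string literal: matched literally, so no rows.
literal_count = conn.execute(
    "SELECT COUNT(*) FROM titles WHERE engtitle LIKE '%:word%'"
).fetchone()[0]

# Parameter concatenated with the wildcards: substituted as expected.
concat_count = conn.execute(
    "SELECT COUNT(*) FROM titles WHERE engtitle LIKE '%' || :word || '%'",
    {"word": "car"},
).fetchone()[0]

print(literal_count, concat_count)  # 0 2
```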
Your query doesn't make sense. You are fetching the entire title from the table in the cursor. Then you are checking whether the title followed by a space is in the table. Happily, your titles are not malformed, so you do not get any match.
If you have a list of words, you can do what you want with a single query, and no cursors:
select w.word, count(t.engtitle) as NumWords
from Words w left join
     Table t
     on t.engtitle like concat('%', w.word, '%')
group by w.word;
If you want words with separators, then do:
select w.word, count(t.engtitle) as NumWords
from Words w left join
     Table t
     on concat(' ', t.engtitle, ' ') like concat('% ', w.word, ' %')
group by w.word;

How do I retrieve results from Mysql in this order?

I have a table with a column, let's call it "query", which is a varchar.
I would like to retrieve the values into a paginated list in such a way that each page will contain 208 results, 8 from each letter in the alphabet.
So on page 1 the first 8 results will begin with "a", the next 8 will begin with "b", and so on until "z" (if there aren't any results for a letter, it just continues to the next letter).
On page 2 of the results it would show the next 8 results beginning with "a", the next with "b", and so on.
Basically instead of sorting by query ASC, which will result in the first page having all words beginning with "a", I would like each page to contain words beginning with each letter of the alphabet.
If you feel I did not explain myself properly (I do!), then please feel free to ask. I have the idea in my head but it's not easy translating it into words!
A naive first approach could be the following (in MySQL, each SELECT has to be parenthesized for its own ORDER BY and LIMIT to apply):
(SELECT * FROM table1 WHERE `query` LIKE 'a%' ORDER BY `query` LIMIT 8)
UNION
(SELECT * FROM table1 WHERE `query` LIKE 'b%' ORDER BY `query` LIMIT 8)
UNION
(SELECT * FROM table1 WHERE `query` LIKE 'c%' ORDER BY `query` LIMIT 8)
....
The second page would need to be done using
(SELECT * FROM table1 WHERE `query` LIKE 'a%' ORDER BY `query` LIMIT 8,8)
UNION
....
Third page:
(SELECT * FROM table1 WHERE `query` LIKE 'a%' ORDER BY `query` LIMIT 16,8)
UNION
....
etc.
I think this is not possible to do in one statement (or it may be possible, but too slow in execution). You may have to look at MySQL prepared statements or change the behavior of the page.
You can group by the first letter using
group by left(query, 1)
Then, in the server-side script, show 8 results per letter from the query result (for example by using page number * 8 as the starting index for each letter, with pages numbered from 0).
I would never use this query on a large dataset, but just for fun I found this solution:
select *,
       ceil(row_num/8) as gr
from (
    select
        *,
        @num := if(@word = substring(word,1,1), @num + 1, 1) as row_num,
        @word := substring(word,1,1) as w
    from words,(select @num:=0,@word:='') as r order by word ) as t
order by gr,word
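The idea behind the query above (number the rows within each first letter, then derive the page from that row number) can be sketched in plain Python, for example in the server-side script. The function name and the sample words are hypothetical, and every word is assumed to be non-empty.

```python
from itertools import groupby


def paginate_by_letter(words, per_letter=8):
    """Return {page_number: [words]} with at most per_letter words
    per first letter on each page."""
    pages = {}
    ordered = sorted(words, key=str.lower)
    for _, group in groupby(ordered, key=lambda w: w[0].lower()):
        for row_num, word in enumerate(group, start=1):
            # Same value as ceil(row_num / per_letter) in the SQL version.
            page = (row_num - 1) // per_letter + 1
            pages.setdefault(page, []).append(word)
    return pages


words = ["apple", "avocado", "apricot", "banana", "blueberry", "cherry"]
pages = paginate_by_letter(words, per_letter=2)
print(pages[1])  # ['apple', 'apricot', 'banana', 'blueberry', 'cherry']
print(pages[2])  # ['avocado']
```

With per_letter=2 the third "a" word spills over to page 2 while "b" and "c" still fit on page 1, which is the page layout described in the question.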