What does the term "Stopword" mean in MySQL?


I'm currently studying MySQL commands and got stuck using MATCH ... AGAINST on a FULLTEXT index. It returns "Empty set" when the search term is a "stopword" (which is "and" in my case).
Here's what I did. The database I'm working on contains a list of books and their authors. I'm trying to select the entries that contain "and" in their title. Here are the rows in my 'classics' table.
+--------------------+------------------------------+
| author             | title                        |
+--------------------+------------------------------+
| Mark Twain         | The Adventures of Tom Sawyer |
| Jane Austen        | Pride and Prejudice          |
| Charles Darwin     | The Origin of Species        |
| Charles Dickens    | The Old Curiosity Shop       |
| William Shakespear | Romeo and Juliet             |
+--------------------+------------------------------+
This is the code I've written
SELECT author, title FROM classics
WHERE MATCH(author, title) AGAINST('and');
Empty set (0.00 sec)
I expected the result to be "Pride and Prejudice" and "Romeo and Juliet" instead of "Empty set (0.00 sec)". I have now realized that "and" is a stopword.
My question is: what does "stopword" mean, and how do I know which word is a stopword? And what should I do if I really want to select the rows that contain "and" in their title?

My question is What does the "stopword" mean ...
A stopword is a word that will be ignored when given as a keyword in a full-text search.
For more information read the Wikipedia page on stopwords.
MySQL uses the term in a way that is consistent with the normal definition.
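To make the definition concrete, here is a minimal Python sketch of what a full-text indexer conceptually does with stopwords. It is a simplified model, not MySQL's actual implementation: the tiny stopword set and the helper names (STOPWORDS, index_terms, search) are assumptions made for illustration.

```python
import re

# Illustrative subset only (an assumption); MySQL's real stopword lists
# are longer and differ between InnoDB and MyISAM.
STOPWORDS = {"a", "about", "and", "of", "the"}

def index_terms(text):
    """Return the set of lowercased words that would be kept in the index."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def search(titles, keyword):
    """Return the titles whose indexed terms contain the keyword."""
    keyword = keyword.lower()
    if keyword in STOPWORDS:
        # A stopword is ignored in the query, so nothing can match it.
        return []
    return [t for t in titles if keyword in index_terms(t)]

TITLES = [
    "The Adventures of Tom Sawyer",
    "Pride and Prejudice",
    "The Origin of Species",
    "The Old Curiosity Shop",
    "Romeo and Juliet",
]

print(search(TITLES, "and"))     # the stopword matches nothing
print(search(TITLES, "origin"))  # an ordinary word matches as expected
```

In this model "and" is dropped both when the titles are indexed and when the query is parsed, which is why the MATCH ... AGAINST('and') query in the question comes back empty.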
... and how do I know which word is a stopword?
For InnoDB tables you can query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.
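For MyISAM search indexes, the stopwords come from a list that is built into the server, or from the file named by the ft_stopword_file system variable if one is configured. It may be possible for an application to read such a file at runtime using ordinary file I/O, but the list apparently can't be accessed via a database query.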
And what should I do if I really want to select the query which contains "and" in its title?
The MySQL documentation explains how to do it; see Section 12.9.4 Full-Text Stopwords. (There is too much detail to copy it here.)
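My reading is that for MyISAM you need to make a configuration change (ft_stopword_file) and restart the database server, while for InnoDB you point innodb_ft_server_stopword_table or innodb_ft_user_stopword_table at a stopword table of your own. In both cases you then need to rebuild the table's full-text index.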
That means that you cannot change the stopwords for each query ... if that is what you are aiming to do. But you could explicitly query for a stopword using LIKE; e.g.
SELECT author, title FROM classics
WHERE title LIKE '% and %';
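Because the pattern starts with a wildcard, that query cannot use an ordinary index and would entail a table scan, so you want to avoid it if possible. Note that it also misses titles that begin or end with "and".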

You can see an example stopword list on dev.mysql.com:
To see the default InnoDB stopword list, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
See more at "The INFORMATION_SCHEMA INNODB_FT_DEFAULT_STOPWORD Table"
The glossary defines stopword as:
In a FULLTEXT index, a word that is considered common or trivial enough that it is omitted from the search index and ignored in search queries.
Different configuration settings control stopword processing for InnoDB and MyISAM tables.
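To force a MyISAM full-text index to include three-letter words, you would need to change ft_min_word_len to 3, restart mysqld, and rebuild the index. For InnoDB the corresponding option is innodb_ft_min_token_size, whose default is already 3. Either way, "and" is still excluded for as long as it remains in the stopword list.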

Maybe you should just use LIKE:
SELECT author, title FROM classics WHERE title LIKE '% and %';

Related

How to calculate "confidence level" of results returned by MySQL FULLTEXT index query?

Suppose you have a FULLTEXT index defined on a column in MySQL database table to allow for natural language searches. If you now run a query using MATCH() and AGAINST(), you can retrieve the "rank" of the search results, as described here:
https://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html
For example:
mysql> SELECT id, body, MATCH (title,body) AGAINST
('Security implications of running MySQL as root'
IN NATURAL LANGUAGE MODE) AS score
FROM articles WHERE MATCH (title,body) AGAINST
('Security implications of running MySQL as root'
IN NATURAL LANGUAGE MODE);
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
The problem is that MATCH() returns a floating-point number with no upper bound. I need to derive a "confidence factor" for each of the resulting rows as a percentage from 0 to 100. For example, a confidence factor of 95% for a particular row would mean that it's very likely exactly what the user is searching for. Conversely, a low confidence factor would be something like 10%.
Note that this is not a matter of selecting the largest score from MATCH() and setting that to 100. The row with the largest score may still not be at all what the user is searching for. So perhaps using MATCH() won't work, but could you please suggest some way to calculate such a "confidence factor"?
Many thanks in advance.

MySQL FULLTEXT query issue

I'm trying to query using MySQL FULLTEXT, but unfortunately it returns empty results even though the table contains the input keyword.
Table: user_skills:
+----+----------------------------------------------+
| id | skills                                       |
+----+----------------------------------------------+
|  1 | Dance Performer,DJ,Entertainer,Event Planner |
|  2 | Animation,Camera Operator,Film Direction     |
|  3 | DJ                                           |
|  4 | Draftsman                                    |
|  5 | Makeup Artist                                |
|  6 | DJ,Music Producer                            |
+----+----------------------------------------------+
Indexes:
Query:
SELECT id,skills
FROM user_skills
WHERE ( MATCH (skills) AGAINST ('+dj' IN BOOLEAN MODE))
When I run the above query, none of the DJ rows are returned, although the table has 3 rows containing the value DJ.

A full text index is the wrong approach for what you are trying to do. But your specific issue is the minimum word length, which by default is either 3 or 4, depending on the storage engine. This is explained in the documentation.
Once you reset the value, you will need to recreate the index.
I suspect you are trying to be clever. You have probably heard the advice "don't store lists of things in delimited strings". But you instead countered "ah, but I can use a full text index". You can, although you will find that more complex queries do not optimize very well.
Just do it right. Create the association table user_skills with one row per user and per skill that the user has. You will find it easier to use in queries, to prevent duplicates, to optimize queries, and so on.
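A minimal sketch of that association-table design follows, written in Python against an in-memory SQLite database purely so that it is self-contained and runnable. The users and skills tables and all column names are hypothetical; the same schema idea carries over to MySQL, though the DDL details would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY);
    CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- One row per (user, skill); the composite key prevents duplicates.
    CREATE TABLE user_skills (
        user_id  INTEGER NOT NULL REFERENCES users(id),
        skill_id INTEGER NOT NULL REFERENCES skills(id),
        PRIMARY KEY (user_id, skill_id)
    );
""")

# The delimited strings from the question, keyed by user id.
raw = {
    1: "Dance Performer,DJ,Entertainer,Event Planner",
    2: "Animation,Camera Operator,Film Direction",
    3: "DJ",
    4: "Draftsman",
    5: "Makeup Artist",
    6: "DJ,Music Producer",
}

for user_id, skill_list in raw.items():
    conn.execute("INSERT INTO users (id) VALUES (?)", (user_id,))
    for name in skill_list.split(","):
        # Reuse the existing skill row if this skill was already seen.
        conn.execute("INSERT OR IGNORE INTO skills (name) VALUES (?)", (name,))
        skill_id = conn.execute(
            "SELECT id FROM skills WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO user_skills (user_id, skill_id) VALUES (?, ?)",
            (user_id, skill_id),
        )

def users_with_skill(name):
    """Return the ids of users who have the given skill, in ascending order."""
    rows = conn.execute(
        """SELECT us.user_id
           FROM user_skills AS us
           JOIN skills AS s ON s.id = us.skill_id
           WHERE s.name = ?
           ORDER BY us.user_id""",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]

print(users_with_skill("DJ"))
```

With this layout the lookup is an ordinary equality comparison on the skill name, so a two-letter skill such as DJ needs no full-text index and is unaffected by the minimum word length.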

Your search term is too short.
From the MySQL documentation:
Some words are ignored in full-text searches:
Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. You can control the cutoff by setting a configuration option before creating the index: innodb_ft_min_token_size configuration option for InnoDB search indexes, or ft_min_word_len for MyISAM.
Boolean full-text searches have these characteristics:
They do not use the 50% threshold.
They do not automatically sort rows in order of decreasing relevance. You can see this from the preceding query result: The row with the highest relevance is the one that contains “MySQL” twice, but it is listed last, not first.
They can work even without a FULLTEXT index, although a search executed in this fashion would be quite slow.
The minimum and maximum word length full-text parameters apply.
https://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html
https://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html

Boolean Full-Text Search Exclude Phrase AB-CD, e.g. -"AB-CD"?

I have a table that is populated with certain values e.g.
| CODE  | NAME        |   NB: THIS IS A VERY BASIC EXAMPLE
| zygnc | oscar alpha |
| ab-cd | delta tiger |
| fsdys | delta bravo |
Using MySQL full-text Boolean search, I would like to search this table for all names containing 'delta' but exclude the first result based on its unique code 'ab-cd'. This code contains a minus sign; that is a requirement, and removing it is not possible.
So the following query 'should' in my mind accommodate for this:
SELECT code, name
FROM items
WHERE MATCH (code, name)
AGAINST ('delta -"ab-cd"' IN BOOLEAN MODE)
However, running this query does not produce the desired result: the result still contains the row with the code that is meant to be excluded, 'ab-cd'.
The collation of these two columns is set to utf8_bin.
The ft_min_word_len value is set to 4.
Could someone suggest a reason for this behavior? I assume that it possibly treats the string as two separate values, e.g. "-ab" and "-cd", and as the ft_min_word_len value is 4, neither of these two strings can produce any result.
I would have thought that enclosing the phrase in "" would mean that the second minus sign is treated as a literal, but it seems that this is not the case. Perhaps it has something to do with the table collation that I am not aware of?
In any case, any suggestions/advice/input/feedback/direction would be greatly appreciated, thank you!!

You need to lower the value of the ft_min_word_len variable in the my.cnf file.
By default the ft_min_word_len value is 4, which is also what you have set. Once you change the variable, you need to restart the server and rebuild the FULLTEXT index.
Here "ab-cd" is treated as two words, "ab" and "cd". Both are shorter than the minimum word length, so they are ignored.
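The effect can be modelled with a short Python sketch. It is a simplified model, not MySQL's parser: it assumes that any character other than a letter, digit, or underscore separates words, and the function name indexable_words is made up for illustration.

```python
import re

def indexable_words(text, min_word_len=4):
    """Split text into words and drop those shorter than min_word_len."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) >= min_word_len]

# With the minimum length of 4 from the question, both halves of
# "ab-cd" are discarded, so the phrase has nothing left to exclude.
print(indexable_words("ab-cd"))                  # []
print(indexable_words("ab-cd", min_word_len=2))  # ['ab', 'cd']
print(indexable_words("delta tiger"))            # ['delta', 'tiger']
```

Even with a lower minimum, the hyphen still splits the code into two words in this model, so the exclusion would be matching a phrase of two words rather than one literal token.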

Search for a value within an input string in a MySQL database

I have a database of job descriptions, and I need to match these descriptions with as many job listings as possible. In my database, I have a primary job title as a key (for example, Aircraft Pilot), and several alternate titles (Jet Pilot, Airliner Captain, etc).
My problem is that with many of the descriptions I have to process, the title includes too much information - a sample title from a listing might be "747 Aircraft Pilot", for example.
While I know I can't get 100% accuracy matching this way, would there be any way for me to match something like "747 Aircraft Pilot" with my description for "Aircraft Pilot" without running a search on each combination of words in the string? Is there an algorithm, for example, that would assign a match percentage between two strings and return all pairs with a certain percentage matching, for example?

You can use the full-text search functions in MySQL. Good tutorials can be found here:
http://devzone.zend.com/article/1304
http://forge.mysql.com/w/images/c/c5/Fulltext.pdf
Once you have added a FULLTEXT index using
ALTER TABLE jobs ADD FULLTEXT(title, body);
you can run a query like this:
mysql> SELECT id, title, MATCH (title,body) AGAINST
-> ('Aircraft Pilot')
-> AS score
-> FROM jobs WHERE MATCH (title,body) AGAINST
-> ('Aircraft Pilot');
+----+------------------------+-----------------+
| id | title                  | score           |
+----+------------------------+-----------------+
|  4 | 747 Aircraft Pilot ... | 1.5055546709332 |
|  6 | Aircraft Captain ...   | 1.31140957288   |
+----+------------------------+-----------------+
2 rows in set (0.00 sec)

MySQL Optimization, "like" vs "="

I have a table with columns like this:
| Country.Number | CountryName |
| US.01          | USA         |
| US.02          | USA         |
I'd like to modify this to:
| Country | Number | CountryName |
| US      | 01     | USA         |
| US      | 02     | USA         |
Regarding optimization, is there a difference in performance between:
select * from mytable where `Country.Number` like "US.%"
and
select * from mytable where country = "US"

The performance difference will most likely be minuscule in this particular case, as MySQL can use an index for a prefix pattern like "US.%". The degradation is mostly felt when searching for something like "%.US" (with the wildcard in front), because MySQL then has to do a table scan without using an index.
EDIT: you can look at it like this:
MySQL internally stores varchar indices like trees, with the first symbol being the root and branching to each next letter. (This is a simplified picture; the actual structure is a B-tree, but the reasoning carries over.)
So when searching for = "US" it looks for U, then goes one step down for S, and then another to make sure that is the end of the value. That's three steps.
Searching for LIKE "US.%" it again looks for U, then S, then ., and then stops searching and returns the results. That's also only three steps, as it does not care whether the value ends there.
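The same idea can be sketched in Python with a sorted list standing in for the index. This is only an analogy for an ordered index, not how MySQL is implemented; the sample values and the helper names (lookup_equal, lookup_prefix, lookup_suffix) are assumptions for illustration.

```python
from bisect import bisect_left

# A sorted list stands in for an ordered index on the column.
INDEX = sorted(["CA.01", "DE.01", "US.01", "US.02", "UY.01"])

def lookup_equal(value):
    """Equality: binary-search to the value, then read the matching entries."""
    i = bisect_left(INDEX, value)
    out = []
    while i < len(INDEX) and INDEX[i] == value:
        out.append(INDEX[i])
        i += 1
    return out

def lookup_prefix(prefix):
    """LIKE 'prefix%': binary-search to the prefix, read until it stops matching."""
    i = bisect_left(INDEX, prefix)
    out = []
    while i < len(INDEX) and INDEX[i].startswith(prefix):
        out.append(INDEX[i])
        i += 1
    return out

def lookup_suffix(suffix):
    """LIKE '%suffix': the ordering does not help, so every entry is examined."""
    return [v for v in INDEX if v.endswith(suffix)]

print(lookup_equal("US.01"))  # ['US.01']
print(lookup_prefix("US."))   # ['US.01', 'US.02']
print(lookup_suffix(".01"))   # ['CA.01', 'DE.01', 'US.01', 'UY.01']
```

Both the equality lookup and the prefix lookup jump straight to the first candidate and stop as soon as the entries no longer match, whereas the leading-wildcard lookup has to visit every entry.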
EDIT2: I'm in no way promoting such database denormalization, I just wanted to attract your attention that this matter may not be as straightforward as it seems at first glance.

The latter query:
select * from mytable where country = "US"
should be much faster, because MySQL does not have to look for wildcard patterns as it does for a LIKE query. It just looks for the value being compared for equality.

If you need to optimize, a simple = is way better than a LIKE.
Why?
With =, either the string is exactly the same and the result is true, or it doesn't match and the result is false.
With LIKE, MySQL must compare the string and test whether it matches the pattern, which takes more time and more operations.
So for the sake of your database, use SELECT * FROM mytable WHERE country = "US".

The second is faster if there is an index on the country column. MySQL has to scan fewer index entries to produce the result.

Not technically an answer to the question.. but...I would understand them to be close enough in speed for it not to (usually) matter - thus using "=" would be better as it displays the intent in a more obvious way.

Why don't you just make country_id a tinyint unsigned and have a unique iso_code varchar(3) column? (Saves you from all the BS.)
