MySQL - Fulltext Index Search Issue

Two rows in my database have the following data:
brand | product | style
=================================================
Doc Martens | Doc Martens 1460 Boots | NULL
NewBalance | New Balance WR1062 SG Width | NULL
Minimum word length is set to 3, and a FULLTEXT index is created across all three columns above.
When I run an IN BOOLEAN MODE search for +doc against the index, I get the first row returned as a result. When I search for +new, I get no results.
Can someone explain why?
Thanks.

It is called FULLTEXT because it indexes whole words. So to search for words that start with "New", you have to put an asterisk at the end of the term and search in boolean mode:
MATCH (`brand`) AGAINST ('new*' IN BOOLEAN MODE)
More detailed here

Related

FullText Search Innodb Fails, MyIsam Returns Results

I've upgraded a table from myisam to innodb but am not having the same performance. The innodb returns a 0 score when there should be some relation. The myisam table returns a match for the same term (I kept a copy of the old table so I can still run the same query).
SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table_myisam
where id = 1;
Returns:
+-------+
| score |
+-------+
| 1 |
+-------+
but:
SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table
where id = 1;
returns:
+-------+
| score |
+-------+
| 0 |
+-------+
I thought the ex might not have been indexed because innodb_ft_min_token_size was set to 3. I lowered that to 1 and optimized the table, but that had no effect. The column contents are 99 characters long, so I presumed the whole column wasn't indexed because of innodb_ft_max_token_size. I increased that to 150 as well and ran the optimize again, but got the same result.
The only difference between these tables is the engine and the character set. This table is using utf8, the myisam table is using latin1.
Has anyone seen this behavior, or have advice for how to resolve it?
UPDATE:
I added ft_stopword_file="" to my my.cnf and ran OPTIMIZE TABLE table again. This time I got
optimize | note | Table does not support optimize, doing recreate + analyze instead
The query worked after this change. Ex is not a stop word though so not sure why it would make a difference.
A new query that fails though is:
SELECT MATCH (Columns) AGAINST ('+Term +Ex +in' IN BOOLEAN MODE) as score FROM Table where id = 1;
+-------+
| score |
+-------+
| 0 |
+-------+
the in causes this to fail but that is the next word in my table.
SELECT MATCH (Columns) AGAINST ('+Term +Ex' IN BOOLEAN MODE) as score FROM Table where id = 1;
+--------------------+
| score |
+--------------------+
| 219.30206298828125 |
+--------------------+
I also tried CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;, then updated my.cnf with innodb_ft_server_stopword_table='db/my_stopwords'. I restarted and ran:
show variables like 'innodb_ft_server_stopword_table';
which brought back:
+---------------------------------+---------------------------+
| Variable_name | Value |
+---------------------------------+---------------------------+
| innodb_ft_server_stopword_table | 'db/my_stopwords'; |
+---------------------------------+---------------------------+
so I thought the in would no longer cause the query to fail, but it still does. I also tried OPTIMIZE TABLE table again and even ALTER TABLE table DROP INDEX ... and ALTER TABLE table ADD FULLTEXT KEY ..., none of which has had an effect.
Second Update
The issue is with the stop words.
$userinput = preg_replace('/\b(a|about|an|are|as|at|be|by|com|de|en|for|from|how|i|in|is|it|la|of|on|or|that|the|this|to|was|what|when|where|who|will|with|und|the|www)\b/', '', $userinput);
resolves the issue, but that doesn't seem like a good solution to me. I'd like a solution that keeps the stop words from breaking this in MySQL.
Stopword table data:
CREATE TABLE `my_stopwords` (
`value` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
and
Name: my_stopwords
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 0
Avg_row_length: 0
Data_length: 16384
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: 2019-04-09 17:39:55
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
There are several differences between MyISAM's FULLTEXT and InnoDB's. I think you were caught by the handling of 'short' words and/or stop words. MyISAM will show rows, but InnoDB will fail to.
What I have done when using FT (and after switching to InnoDB) is to filter the user's input to avoid short words. It takes extra effort but gets me the rows desired. My case is slightly different since the resulting query is something like this. Note that I have added + to require the words, but not on words shorter than 3 (my innodb_ft_min_token_size is 3). These searches were for build a table and build the table:
WHERE match(description) AGAINST('+build* a +table*' IN BOOLEAN MODE)
WHERE match(description) AGAINST('+build* +the* +table*' IN BOOLEAN MODE)
(The trailing * may be redundant; I have not investigated that.)
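The input filtering described above could be sketched in application code roughly as follows. This is a minimal Python illustration, not the answerer's actual code: the function name build_boolean_query and the min_token_size parameter are hypothetical, and splitting on \w+ is an assumed approximation of how the server tokenizes words.

```python
import re

def build_boolean_query(user_input, min_token_size=3):
    """Turn free text into a BOOLEAN MODE search string.

    Words with at least min_token_size characters become required prefix
    terms ("+word*"). Shorter words are passed through without "+", so
    they cannot make the whole search fail.
    """
    terms = []
    for word in re.findall(r"\w+", user_input):
        if len(word) >= min_token_size:
            terms.append("+" + word + "*")
        else:
            terms.append(word)
    return " ".join(terms)

print(build_boolean_query("build a table"))    # +build* a +table*
print(build_boolean_query("build the table"))  # +build* +the* +table*
```

The resulting string would then be passed as a bound parameter to AGAINST (... IN BOOLEAN MODE) rather than concatenated into the SQL text.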
Another approach
Since FT is very efficient at non-short, non-stop words, do the search with two phases, each being optional: To search for "a long word", do
WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
AND d REGEXP '[[:<:]]a[[:>:]]'
The first part whittles down the possible rows rapidly by looking for 'long' and 'word' (as words). The second part makes sure there is a word a in the string, too. The REGEXP is costly but will be applied only to those rows that pass the first test.
To search just for "long word":
WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
To search just for the word "a":
WHERE d REGEXP '[[:<:]]a[[:>:]]'
Caveat: This case will be slow.
Note: My examples allow for the words to be in any order, and in any location in the string. That is, this string will match in all my examples: "She was longing for a word from him."
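A rough Python sketch of how the two optional phases might be assembled from user input is shown below. The helper name two_phase_where, the stopwords parameter, and the %s placeholder style (used by common Python MySQL drivers) are assumptions made for illustration only.

```python
import re

def two_phase_where(user_input, column="d", min_token_size=3, stopwords=()):
    """Build a WHERE clause plus parameters for the two-phase search.

    Phase 1 (FULLTEXT): words the index can find are required with "+".
    Phase 2 (REGEXP): short words and stop words are checked as whole
    words, but only on rows that survived phase 1.
    """
    stop = {s.lower() for s in stopwords}
    indexed, unindexed = [], []
    for word in re.findall(r"\w+", user_input):
        if len(word) < min_token_size or word.lower() in stop:
            unindexed.append(word)
        else:
            indexed.append(word)

    clauses, params = [], []
    if indexed:
        clauses.append(f"MATCH({column}) AGAINST (%s IN BOOLEAN MODE)")
        params.append(" ".join("+" + w for w in indexed))
    for word in unindexed:
        # Words matched by \w+ contain no regex metacharacters.
        # [[:<:]] and [[:>:]] are the word-boundary markers used in the
        # answer; MySQL 8.0 and later use \b instead.
        clauses.append(f"{column} REGEXP %s")
        params.append("[[:<:]]" + word + "[[:>:]]")
    return " AND ".join(clauses), params

print(two_phase_where("a long word"))
```

When every word is short, only the REGEXP clauses remain, which is the slow case noted in the caveat above.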
Here is a step by step procedure which should have reproduced your problem. (This is actually how you should have written your question.) The environment is a freshly installed VM with Debian 9.8 and Percona Server Ver 5.6.43-84.3.
Create an InnoDB table with a fulltext index and some dummy data:
create table test.ft_innodb (
txt text,
fulltext index (txt)
) engine=innodb charset=utf8 collate=utf8_unicode_ci;
insert into test.ft_innodb (txt) values
('Some dummy text'),
('Text with a long and short stop words in it ex');
Execute a test query to verify that it doesn't work yet as we need:
select txt
, match(t.txt) against ('+some' in boolean mode) as score0
, match(t.txt) against ('+with' in boolean mode) as score1
, match(t.txt) against ('+in' in boolean mode) as score2
, match(t.txt) against ('+ex' in boolean mode) as score3
from test.ft_innodb t;
Result (rounded):
txt | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text | 0.0906 | 0 | 0 | 0
Text with a long and short stop words in it ex | 0 | 0 | 0 | 0
As you see, it's not working with stop words ("+with") or with short words ("+ex").
Create an empty InnoDB table for custom stop words:
create table test.my_stopwords (value varchar(30)) engine=innodb;
Edit /etc/mysql/my.cnf and add the following two lines in the [mysqld] block:
[mysqld]
# other settings
innodb_ft_server_stopword_table = "test/my_stopwords"
innodb_ft_min_token_size = 1
Restart MySQL with service mysql restart.
Run the test query again. (The result should be the same.)
Rebuild the fulltext index with
optimize table test.ft_innodb;
It will actually rebuild the entire table including all indexes.
Execute the test query again. Now the result is:
txt | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text | 0.0906 | 0 | 0 | 0
Text with a long and short stop words in it ex | 0 | 0.0906 | 0.0906 | 0.0906
You see it works just fine for me. And it's quite simple to reproduce. (Again - This is how you should have written your question.)
Since your procedure is more chaotic than detailed, it's difficult to say what could have gone wrong for you. For example:
CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
This doesn't tell us in which database you defined that table. Note that I have prefixed all my tables with the corresponding database. Now consider the following: I change my.cnf and set innodb_ft_server_stopword_table = "db/my_stopwords". Note that there is no such table on my server (not even the schema db exists). I restart the MySQL server and check the new setting with
show variables like 'innodb_ft_server_stopword_table';
This returns:
Variable_name | Value
--------------------------------|----------------
innodb_ft_server_stopword_table | db/my_stopwords
And after optimize table test.ft_innodb; the test query returns this:
txt | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text | 0.0906 | 0 | 0 | 0
Text with a long and short stop words in it ex | 0 | 0 | 0 | 0.0906
You see? It's not working with stopwords any more, but it does work with short non-stop words like "+ex". So make sure that the table you defined in innodb_ft_server_stopword_table actually exists.
A common technique in searching is to make an extra column with the 'sanitized' string to search in. Then add the FULLTEXT index to that column instead of the original column.
In your case, removing the stopwords is the main difference. But there may also be punctuation that could (should?) be removed. Sometimes hyphenated words, contractions, part numbers, or model numbers cause trouble. They can be modified to change the punctuation or spacing to make them friendlier to the FT requirements and/or the user's flavor of input. Another thing is to add words to the search-string column that are common misspellings of the words that are in the column.
Sure, this is more work than you would like to have to do. But I think it provides a viable solution.

MySQL FULLTEXT query issue

I'm trying to query using MySQL FULLTEXT, but unfortunately it's returning empty results even though the table contains the input keyword.
Table: user_skills:
+----+----------------------------------------------+
| id | skills |
+----+----------------------------------------------+
| 1 | Dance Performer,DJ,Entertainer,Event Planner |
| 2 | Animation,Camera Operator,Film Direction |
| 3 | DJ |
| 4 | Draftsman |
| 5 | Makeup Artist |
| 6 | DJ,Music Producer |
+----+----------------------------------------------+
Indexes:
Query:
SELECT id,skills
FROM user_skills
WHERE ( MATCH (skills) AGAINST ('+dj' IN BOOLEAN MODE))
When I run the above query, none of the DJ rows are returned, even though the table has 3 rows containing the value dj.
A full text index is the wrong approach for what you are trying to do. But, your specific issue is the minimum word length, which is either 3 or 4 (by default), depending on the storage engine. This is explained in the documentation, specifically here.
Once you reset the value, you will need to recreate the index.
I suspect you are trying to be clever. You have probably heard the advice "don't store lists of things in delimited strings". But you instead countered "ah, but I can use a full text index". You can, although you will find that more complex queries do not optimize very well.
Just do it right. Create the association table user_skills with one row per user and per skill that the user has. You will find it easier to use in queries, to prevent duplicates, to optimize queries, and so on.
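As an illustration of the one-row-per-user-per-skill layout, here is a small Python sketch that splits the delimited strings into pairs that could be loaded into such an association table. The function name split_skills is hypothetical, and it assumes the id column of the sample table identifies the user.

```python
def split_skills(rows):
    """Expand (user_id, "skill1,skill2,...") rows into unique
    (user_id, skill) pairs, one pair per user and skill."""
    pairs = set()
    for user_id, skills in rows:
        for skill in skills.split(","):
            skill = skill.strip()
            if skill:
                pairs.add((user_id, skill))
    return sorted(pairs)

rows = [
    (1, "Dance Performer,DJ,Entertainer,Event Planner"),
    (3, "DJ"),
    (6, "DJ,Music Producer"),
]
pairs = split_skills(rows)

# With one row per skill, finding DJs is a plain equality test.
dj_users = [user_id for user_id, skill in pairs if skill == "DJ"]
print(dj_users)  # [1, 3, 6]
```

In the database, the equivalent lookup would be an ordinary indexed equality comparison on the skill column, with no dependence on the full-text minimum word length.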
Your search term is too short.
As the MySQL documentation says:
Some words are ignored in full-text searches:
Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for
InnoDB search indexes, or four characters for MyISAM. You can control
the cutoff by setting a configuration option before creating the
index: innodb_ft_min_token_size configuration option for InnoDB search
indexes, or ft_min_word_len for MyISAM.
.
Boolean full-text searches have these characteristics:
They do not use the 50% threshold.
They do not automatically sort rows in order of decreasing relevance.
You can see this from the preceding query result: The row with the
highest relevance is the one that contains “MySQL” twice, but it is
listed last, not first.
They can work even without a FULLTEXT index, although a search
executed in this fashion would be quite slow.
The minimum and maximum word length full-text parameters apply.
https://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html
https://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html

How to create the index on MUL key in Mysql?

We need to create an index on the "source path" column, which already has a MUL key. For example, it holds values like /src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph, and we need to search with patterns like '%Sal/2016/Jan%'. The table has almost 10 million records.
Please suggest any idea for performance improvement.
+-------------+----------+------+-----+---------+----------------+
| Field       | Type     | Null | Key | Default | Extra          |
+-------------+----------+------+-----+---------+----------------+
| Id          | int(11)  | NO   | PRI | NULL    | auto_increment |
| Name        | char(35) | NO   |     |         |                |
| Country     | char(3)  | NO   | UNI |         |                |
| source Path | char(20) | YES  | MUL |         |                |
| Population  | int(11)  | NO   |     | 0       |                |
+-------------+----------+------+-----+---------+----------------+
Unfortunately, a search that starts with % cannot use an index (it has not much to do with being in a composite index).
You have some options though:
The values in your path seem to have actual meaning. The ideal solution would be to use the meta-data, e.g. the month, name, whatever "SAL" stands for, and store it in their own columns or an attribute table, and then query for that meta-data instead. This is obviously only possible in very specific cases where you have the required meta-data for every path, so it is probably not an option here.
You can add a "search table" (e.g. (id, subpath)) that contains all subpaths of your source path, e.g.
'/src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
...
'/Sal/2016/Jan/31-01/Joseph'
...
'/31-01/Joseph'
'/Joseph'
so 11 rows in your example. It's now possible to use an index on that, e.g. in
...
where exists
(select * from subpaths s
where s.subpath like '/Sal/2016/Jan%' and s.id = outerquery.id)
This relies on knowing the start of your search term. If Sal in your example %Sal/2016/Jan should actually include word endings, e.g. /NoSal/2016/Jan, you would have to modify your input term to remove the first word, so %Sal/2016/Jan% would require you to search for /2016/Jan% (with an index) and then recheck the resultset afterwards if it also fits %Sal/2016/Jan% (see the fulltext option for an example, it has the same "problem" to only look for the beginning of words).
You will have to maintain the search table, which is usually done in a trigger (update the subpath table when you insert, update or delete values in your original table).
Since this is a new table, you cannot combine it (directly) with another index, to e.g. optimize where country = 'A' and subpath like '/Sal/2016/Jan%' if country = 'A' would already get rid of 99.99% of the rows. You may have to check explain for your query to see if MySQL actually uses the index (because the optimizer can try something different) and then maybe reorganize your query (e.g. use a join or force index).
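To make the "search table" idea concrete, the rows for one path could be generated along these lines. This is a minimal Python sketch; the function name subpaths is hypothetical, and in practice the same logic would live in whatever maintains the table (for example a trigger, as described above).

```python
def subpaths(path):
    """Return every suffix of a slash-separated path that starts at a
    component boundary, longest first."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[i:]) for i in range(len(parts))]

rows = subpaths("/src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph")
print(len(rows))  # 11
print(rows[6])    # /Sal/2016/Jan/31-01/Joseph
```

Because every stored value starts at a component boundary, a pattern such as '/Sal/2016/Jan%' has a fixed prefix and can use an ordinary index on the subpath column.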
You can use a fulltext search. From the userinput, you would have to generate a query like
select * from
(select * from table
where match(`source Path`) against ('+SAL +2016 +Jan' in boolean mode)) subquery
where `source path` like '%Sal/2016/Jan%'
The fulltext search will not care about the order of the words, so you have to recheck the resultset if it actually is the correct path, but the fulltext search will use the (fulltext) index to speed it up. It will only look for the beginning of words, so similar to the "search table" option, if Sal can be the end of the word, you have to remove it from the fulltext search. By default, only words with at least 3 or 4 letters (depending on your engine) will be added to the index, so you have to set the value of either ft_min_word_len or innodb_ft_min_token_size to whatever fits your requirements.
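Generating the against (...) expression and the recheck pattern from the user input might look like the following Python sketch. The name fulltext_prefilter and the min_token_size parameter are hypothetical, and the sketch assumes the input contains no LIKE wildcards of its own that would need escaping.

```python
import re

def fulltext_prefilter(search_term, min_token_size=3):
    """Split a path fragment into required full-text terms and build the
    LIKE pattern used to recheck order and adjacency afterwards."""
    words = re.findall(r"\w+", search_term)
    # Words shorter than the minimum token size are not in the index,
    # so requiring them would return nothing; leave them to the LIKE.
    against = " ".join("+" + w for w in words if len(w) >= min_token_size)
    like = "%" + search_term.strip("%") + "%"
    return against, like

print(fulltext_prefilter("%Sal/2016/Jan%"))  # ('+Sal +2016 +Jan', '%Sal/2016/Jan%')
```

Both values would be bound as parameters: the first to the inner match ... against, the second to the outer like.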
The search table approach is probably the most convenient solution, as it can be used very similar to your current search: you can add the userinput directly in one place (without having to interpret it to create the against (...) expression) and you can also use it easily in other situations (e.g. in something like join table2 on concat(table2.Year,'/',table2.Month,'%') like ...); but you will have to set up the triggers (or however else you maintain the table), which is a little more complicated than just adding a fulltext index.

Is a MySQL primary key already in some sort of default order

I just stumbled upon a few lines of code in a system I just started working with that I don't really get. The system has a large table that saves lots of entities with unique IDs and removes them once they're no longer needed, but it never reuses the IDs. So the table looks like this
------------------------
| id |info1|info2|info3|
------------------------
| 1 | foo1| foo2| foo3|
------------------------
| 17 | bar1| bar2| bar3|
------------------------
| 26 | bam1| bam2| bam3|
------------------------
| 328| baz1| baz2| baz3|
------------------------
etc.
In one place in the codebase there is a while loop whose purpose it is to loop through all entities in the DB and do things to them and right now this is solved like this
int lastId = fetchMaxId()
int id = 0
while (id = fetchNextId(id, lastId)){
doStuffWith(id)
}
where fetchMaxId is straight forward
int fetchMaxId(){
return sqlQuery("SELECT MAX(id) FROM Table")
}
but fetchNextId confuses me. It is implemented as
int fetchNextId(currentId, maxId){
return sqlQuery("
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
")
}
This system has been in production for several years, so it obviously works, but when I searched for an explanation of why it works I only found people saying what I thought I already knew: the order in which a MySQL DB returns results is not easily determined and should not be relied upon, so if you want a particular order, use an ORDER BY clause. But are there times when you can safely ignore the ORDER BY? This code has worked for 12 years and continued to work through several DB updates. Are we just lucky, or am I missing something here? Before I saw this code I would have said that if you called
fetchNextId(1, 328)
you could end up with either 17 or 26 as the answer.
Some clues to why this works may be that the id column is the primary key of the Table in question and it's set to auto increment but I can't find any documentation that would explain why
fetchNextId(1, 328)
should always return 17 when called on the table snippet given above.
The short answer is yes, the primary key has an order, all indexes have an order, and a primary key is simply a unique index.
As you have rightly said, you should not rely on data being returned in the order the data is stored in, the optimiser is free to return it in any order it likes, and this will be dependent on the query plan. I will however attempt to explain why your query has worked for 12 years.
Your clustered index is just your table data, and your clustering key defines the order that it is stored in. The data is stored on the leaf, and the clustering key helps the root (and intermediate nodes) act as pointers to quickly get to the right leaf to retrieve the data. A nonclustered index is a very similar structure, but the lowest level simply contains a pointer to the correct position on the leaf of the clustered index.
In MySQL the primary key and the clustered index are synonymous, so the primary key is ordered, however they are fundamentally two different things. In other DBMS you can define both a primary key and a clustered index, when you do this your primary key becomes a unique nonclustered index with a pointer back to the clustered index.
In its simplest terms, you can imagine a table with an ID column that is the primary key, and another column (A). Your B-Tree structure for your clustered index would be something like:
Root Node
+---+
| 1 |
+---+
Intermediate Nodes
+---+ +---+ +---+
| 1 | | 4 | | 7 |
+---+ +---+ +---+
Leaf
+-----------+ +-----------+ +-----------+
ID -> | 1 | 2 | 3 | | 4 | 5 | 6 | | 7 | 8 | 9 |
A -> | A | B | C | | D | E | F | | G | H | I |
+-----------+ +-----------+ +-----------+
In reality the leaf pages will be much bigger, but this is just a demo. Each page also has a pointer to the next page and the previous page for ease of traversing the tree. So when you do a query like:
SELECT ID, A
FROM T
WHERE ID > 5
LIMIT 1;
you are scanning a unique index so it is very likely this will be a sequential scan. Very likely is not guaranteed though.
MySQL will scan the Root node, if there is a potential match it will move on to the intermediate nodes, if the clause had been something like WHERE ID < 0 then MySQL would know that there were no results without going any further than the root node.
Once it moves on to the intermediate node it can identify that it needs to start on the second page (between 4 and 7) to start searching for an ID > 5. So it will sequentially scan the leaf starting on the second leaf page, having already identified the LIMIT 1 it will stop once it finds a match (in this case 6) and return this data from the leaf. In such a simple example this behaviour appears to be reliable and logical. I have tried to force exceptions by choosing an ID value I know is at the end of a leaf page to see if the leaf will be scanned in the reverse order, but as yet have been unable to produce this behaviour, this does not however mean it won't happen, or that future releases of MySQL won't do this in the scenarios I have tested.
In short, just add an order by, or use MIN(ID) and be done with it. I wouldn't lose too much sleep trying to delve into the inner workings of the query optimiser to see what kind of fragmentation, or data ranges would be required to observe different ordering of the clustered index within the query plan.
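A minimal sketch of the loop with an explicit ORDER BY is shown below in Python. It uses an in-memory SQLite table purely as a stand-in so the example runs on its own; the table name entities and the helper fetch_next_id are assumptions, and the same query shape applies to MySQL.

```python
import sqlite3

def fetch_next_id(conn, current_id, max_id):
    """Return the smallest id in (current_id, max_id], or None.

    ORDER BY makes "next" well defined instead of relying on the order
    in which the engine happens to scan the primary key.
    """
    row = conn.execute(
        "SELECT id FROM entities WHERE id > ? AND id <= ? "
        "ORDER BY id LIMIT 1",
        (current_id, max_id),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, info1 TEXT)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [(328, "baz1"), (1, "foo1"), (26, "bam1"), (17, "bar1")],
)

max_id = conn.execute("SELECT MAX(id) FROM entities").fetchone()[0]
visited = []
current = fetch_next_id(conn, 0, max_id)
while current is not None:
    visited.append(current)  # stand-in for doStuffWith(id)
    current = fetch_next_id(conn, current, max_id)

print(visited)  # [1, 17, 26, 328]
```

Since the ordering column is the primary key, the ORDER BY can be satisfied from the index itself, so it should add little or no cost compared with the original query.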
The answer to your question is yes. If you look at MySQL documentation you will see that whenever a table has a primary key it has an associated index.
When looking at the documentation for indexes you will see that they will mention primary keys as a type of index.
So in case of your particular scenario:
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
The query will stop executing as soon as it has found a value because of the LIMIT 1.
Without the LIMIT 1 it would have returned 17, 26 and 328.
However, with all that said, I don't think you will run into any ordering problems when the primary key is auto-incrementing. But in a scenario where the primary key is, say, a unique employee number instead of an auto-incrementing field, I would not trust the order of the result, because the documentation also notes that MySQL reads sequentially, so the possibility is there that a primary key could fall out of the WHERE clause conditions and be skipped.

Search for a value within an input string in a MySQL database

I have a database of job descriptions, and I need to match these descriptions with as many job listings as possible. In my database, I have a primary job title as a key (for example, Aircraft Pilot), and several alternate titles (Jet Pilot, Airliner Captain, etc).
My problem is that with many of the descriptions I have to process, the title includes too much information - a sample title from a listing might be "747 Aircraft Pilot", for example.
While I know I can't get 100% accuracy matching this way, would there be any way for me to match something like "747 Aircraft Pilot" with my description for "Aircraft Pilot" without running a search on each combination of words in the string? Is there an algorithm, for example, that would assign a match percentage between two strings and return all pairs with a certain percentage matching, for example?
You can use Full-text search function in MySQL. A good tutorial can be found here:
http://devzone.zend.com/article/1304
http://forge.mysql.com/w/images/c/c5/Fulltext.pdf
Once you have added a FULLTEXT index using
ALTER TABLE jobs ADD FULLTEXT(body, title);
you can run a query like this:
mysql> SELECT id, title, MATCH (title,body) AGAINST
-> ('Aircraft Pilot')
-> AS score
-> FROM jobs WHERE MATCH (title,body) AGAINST
-> ('Aircraft Pilot');
+----+------------------------+-----------------+
| id | title                  | score           |
+----+------------------------+-----------------+
|  4 | 747 Aircraft Pilot ... | 1.5055546709332 |
|  6 | Aircraft Captain ...   | 1.31140957288   |
+----+------------------------+-----------------+
2 rows in set (0.00 sec)