Wildcard search in MySQL full-text search - mysql

Let's say we have the following query:
SELECT *
FROM companies
WHERE name LIKE '%nited'
It returns
name
united
How do I write a query using MySQL's full-text search that will provide similar results?

Unfortunately, you cannot do this using a MySQL full-text index. You cannot retrieve '*nited states' directly from the index because the index is organised by the leading characters of each word. However, you can search for 'United Sta*'.
-- a trailing wildcard is the only wildcard form MySQL full-text search supports
WHERE MATCH(column) AGAINST ('United Sta*' IN BOOLEAN MODE)
MySQL's full-text search performs best when searching for whole words in sentences - even that can suck at times. Otherwise, I'd suggest using an external full-text engine like Solr or Sphinx. I think Sphinx allows prefix and suffix wildcards, not sure about the others.
You could go back to MySQL's LIKE clause, but queries like LIKE '%nited states' or LIKE '%nited Stat%' will also perform poorly, because an index on the column can't be used when the first characters are unknown. 'United Sta%' and 'Unit%States' are okay, as the index can be used against the leading run of known characters.
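To make the "leading run of known characters" idea concrete, here is a small Python sketch (not MySQL code). It extracts the literal prefix of a LIKE pattern, which is the part a regular index could be searched on. The helper names are made up for illustration, and escaped wildcards such as \% are ignored for simplicity.

```python
def like_literal_prefix(pattern: str) -> str:
    """Return the characters before the first LIKE wildcard (% or _).

    Simplification: escaped wildcards such as \\% are not handled.
    """
    prefix = []
    for ch in pattern:
        if ch in "%_":
            break
        prefix.append(ch)
    return "".join(prefix)


def has_usable_prefix(pattern: str) -> bool:
    """True when the pattern starts with at least one known character."""
    return like_literal_prefix(pattern) != ""


for p in ["%nited states", "%nited Stat%", "United Sta%", "Unit%States"]:
    print(p, "->", repr(like_literal_prefix(p)), has_usable_prefix(p))
```

The first two patterns have an empty prefix, so there is nothing to look up in the index; the last two start with 'United Sta' and 'Unit' respectively.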
Another quite major caveat with MySQL's full-text indexing is the stop-word list and the minimum word length setting. For example, on a shared hosting environment you will typically be limited to words of 4 characters or more, so searching 'Goo' to get 'Google' would fail. The stop-word list also disallows common words like 'and', 'maybe' and 'outside' - in fact, there are over 500 stop-words altogether! Again, if you are not on shared hosting, these settings are relatively easy to modify, but if you are, then you will get annoyed with some of the default settings.

You can use MySQL's full-text index, but you must configure the parser to be the n-gram parser.
If your data is English (as opposed to Chinese or similar), you ought to also increase the ngram_token_size parameter to the minimum search term length you are willing to have. Otherwise, the search will be unacceptably slow.
You will also want to set innodb_ft_enable_stopword=0, otherwise an idiosyncrasy of how ngram stopword handling works will mean that many useful queries will return no results.
To explain why you must also increase ngram_token_size, you may think of this index as the following schema. MySQL then does a series of joins to find the results which match the search term:
CREATE TABLE fulltext_index
(
docid int(11) NOT NULL,
term char(2) NOT NULL,
PRIMARY KEY (docid, term),
INDEX term_idx (term)
);
The n-gram (2) parser breaks a word such as segments into the pieces se, eg, gm, me, en, nt, ts. For each of these n-grams, there are many results in English, so the index doesn't help much since it ends up iterating over everything anyway. Meanwhile, you can see how Chinese 随机的 would split into a much more useful 随机 and 机的. With n-gram size set to 4, the segments are segm, egme, gmen, ment, ents. These larger segments are much more likely to be unique, so each segment narrows down the search space significantly.
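The splitting described above can be reproduced with a few lines of Python. This is only a model of the idea, not MySQL's actual parser, and the function name ngrams is made up for illustration.

```python
from typing import List


def ngrams(word: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(ngrams("segments", 2))  # bigrams of an English word
print(ngrams("segments", 4))  # the same word with n-gram size 4
print(ngrams("随机的", 2))      # bigrams of a short Chinese phrase
```

An 8-letter word yields seven 2-grams but only five 4-grams, and each 4-gram is far rarer in English text than a 2-gram.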
Disabling stopwords is also necessary because the ngram parser excludes all n-grams that contain any of the stopwords. For example, with an n-gram (4) parser, stopword will be parsed into stop, topw, opwo, pwor, and word:
stop will be excluded because it contains "to"
topw will be excluded because it contains "to"
opwo will be kept
pwor will be excluded because it contains "or"
word will be excluded because it contains "or"
Because these tokens are excluded from the index, a search for MATCH(name) AGAINST('stop' IN BOOLEAN MODE) would not return anything unless the stopwords mechanism is disabled before creating the index.
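Here is a rough Python model of that exclusion rule. It assumes a stopword set containing only "to" and "or" (a tiny subset chosen for illustration, not the full InnoDB default list), and the function names are hypothetical.

```python
from typing import Iterable, List


def ngrams(word: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def indexed_ngrams(word: str, n: int, stopwords: Iterable[str]) -> List[str]:
    """Keep only the n-grams that do not contain any stopword as a substring."""
    stopwords = list(stopwords)
    return [g for g in ngrams(word, n) if not any(s in g for s in stopwords)]


# Assumption: just two stopwords, enough to reproduce the example.
STOPWORDS = {"to", "or"}

print(indexed_ngrams("stopword", 4, STOPWORDS))  # stopwords enabled
print(indexed_ngrams("stopword", 4, set()))      # stopwords disabled
```

With the two stopwords active, only one of the five 4-grams of "stopword" survives, which is why a search for 'stop' finds nothing. With an empty stopword set all five are kept.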
To answer your question:
set ngram_token_size to 3, 4, or whatever your minimum search term length is (it is a startup option, so set it in the server configuration and restart).
set innodb_ft_enable_stopword to 0 or OFF.
create the index with CREATE FULLTEXT INDEX companies_name_idx ON companies (name) WITH PARSER ngram;
run the search with SELECT * FROM companies WHERE MATCH(name) AGAINST('nited' IN BOOLEAN MODE);
This will also return results for nitedA, so you might want to further filter the results from there, if that's required for your application.
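If the extra matches are a problem, one option is to post-filter in application code. A minimal sketch, assuming the matching names have already been fetched into a list (the sample names and the helper filter_suffix are made up for illustration):

```python
from typing import List


def filter_suffix(names: List[str], suffix: str) -> List[str]:
    """Keep only names that end with suffix, ignoring case.

    This mimics LIKE '%suffix' under a case-insensitive collation.
    """
    suffix = suffix.lower()
    return [name for name in names if name.lower().endswith(suffix)]


# Hypothetical rows returned by the n-gram full-text query for 'nited'.
candidates = ["united", "nitedA", "Reunited"]
print(filter_suffix(candidates, "nited"))
```

The same filter could instead be added to the SQL as AND name LIKE '%nited', which is cheap once the full-text match has already narrowed the rows.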

MySQL: Full-Text inflectional forms of words? [duplicate]

Stemming Words in MySQL
For example, the user might search for "testing", "tested" or "tests". All these words are related to each other because the base word "test" is common to all of them.
Is there a way or a function to get such results?
MySQL Full-Text Search
Historically, full-text searches were supported only in the MyISAM engine. Since version 5.6, MySQL also supports full-text searches in the InnoDB storage engine. This has been great news, since it enables developers to benefit from InnoDB’s referential integrity, ability to perform transactions, and row-level locks.
There are basically two approaches to full-text searches in MySQL: natural language and boolean mode. (A third option augments the natural language search with a second expansion query.)
The main difference between the natural and boolean modes is that boolean mode allows certain operators as part of the search. For instance, boolean operators can be used if a word has greater relevance than others in the query or if a specific word should be present in the results, etc. It’s worth noting that in both cases, results can be sorted by the relevance computed by MySQL during the search.
The best fit for our problem was to use InnoDB full-text searches in boolean mode. Why?
We had little time to implement the search function.
At this point, we had no big data to crunch nor a massive load to require something like Elasticsearch or Sphinx.
We used shared hosting that doesn’t support Elasticsearch or Sphinx and the hardware was pretty limited at this stage.
While we wanted word stemming in our search function, it wasn’t a deal breaker: we could implement it (within constraints) by way of some simple PHP coding and data denormalization.
Full-text searches in boolean mode can search words with wildcards (for the word stemming) and sort the results based on relevance.
In the Normalized Vertabelo Model
Let’s see how a simple search would work. We’ll create a sample table first:
CREATE TABLE artists (
    id int(11) NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    bio text NOT NULL,
    CONSTRAINT artists_pk PRIMARY KEY (id)
) ENGINE InnoDB;
CREATE FULLTEXT INDEX artists_idx_1 ON artists (name);
In natural language mode
You can insert some sample data and start testing. (It would be good to add it to your sample dataset.) For instance, we’ll try searching for Michael Jackson:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN NATURAL LANGUAGE MODE)
This query will find records that match the search terms and will sort matching records by relevance; the better the match, the more relevant it is and the higher the result will appear in the list.
In boolean mode
We can perform the same search in boolean mode. If we don’t apply any operators to our query, the only difference will be that results are not sorted by relevance:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN BOOLEAN MODE)
The wildcard operator in boolean mode
Since we want to search stemmed and partial words, we will need the wildcard operator (*). This operator can be used in boolean mode searches, which is why we chose that mode.
So, let’s unleash the power of boolean search and try searching for part of the artist’s name. We’ll use the wildcard operator to match any artist whose name starts with ‘Mich’:
SELECT
*
FROM
artists
WHERE
MATCH (name) AGAINST ('Mich*' IN BOOLEAN MODE)
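The application-side code that builds such a search string is not shown here. As a rough sketch in Python (rather than the PHP mentioned earlier), one way to turn raw user input into a boolean-mode string with a trailing wildcard on every word is shown below. The function name to_prefix_query is hypothetical, and requiring every word with + is a design choice of this sketch, not something the text above prescribes.

```python
import re


def to_prefix_query(user_input: str) -> str:
    """Build a boolean-mode search string such as '+Mich* +Jack*'.

    Only runs of word characters are kept, so operators typed by the user
    (+, -, *, quotes, parentheses) cannot change the meaning of the query.
    """
    words = re.findall(r"\w+", user_input)
    return " ".join(f"+{word}*" for word in words)


print(to_prefix_query("Mich"))
print(to_prefix_query("Mich Jack"))
print(to_prefix_query('test* -"evil"'))
```

The resulting string would then be passed as a bound parameter to MATCH (name) AGAINST (? IN BOOLEAN MODE). Searching for 'test*' this way also matches "testing", "tested" and "tests", which is the crude form of stemming described above.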

Low relevancy results using Match and Natural Language Mode

I'm building a query that is used by an autocomplete function on a website. The field "term" is indexed with the Full Index type. My query should be floating the most relevant results to the top of the list. But there are some examples where the most obvious match is not given enough relevancy.
Here's one example. I have a product term "Just Believe Bird Feeder". It does show up in a search for that exact phrase, but with a lower relevancy than terms that contain one of the search words more than once (e.g. "bird tube bird feeder").
Further, searching on "believe" or "just believe" yields zero results.
What would be my best solution to overcome this?
SELECT
term,
MATCH (term) AGAINST (
'Just Believe Bird Feeder' IN NATURAL LANGUAGE MODE
) AS relevancy
FROM
autocomplete
WHERE
MATCH (term) AGAINST (
'Just Believe Bird Feeder' IN NATURAL LANGUAGE MODE
)
ORDER BY
relevancy DESC
LIMIT 15
Your words believe and just are on the MyISAM stopword list. Words on that list are ignored when indexing (or searching) with the fulltext index, so you cannot find them, and they do not influence the relevance score.
The idea of a stopword list is to exclude words that are so common in English texts that their occurrence bears no relevance. This feature is less useful for searching in short titles, product codes or artificial term lists, though.
You can adjust the ft_stopword_file configuration setting to specify your own stopword list: set it to an empty string to disable stopwords completely, or to the filename of your own stopword list. After changing the setting and restarting the server, you need to rebuild the indexes, e.g. by using REPAIR TABLE tbl_name QUICK.
If you cannot control the server configuration, you could switch your table to InnoDB, which uses a significantly smaller stopword list.
Some additional notes:
The fulltext index uses a minimum word length, by default 4 for MyISAM (ft_min_word_len) and 3 for InnoDB (innodb_ft_min_token_size). You may need to adjust those settings too if you want terms like "8 oz" to have an effect.
The order of terms has no effect on the relevance in a fulltext search.
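To see how stopwords and the minimum word length combine, here is a small Python model (not MySQL itself). It assumes a stopword set containing only the two words named in this answer, splits on whitespace only, and uses a made-up function name.

```python
from typing import List, Set


def effective_terms(query: str, stopwords: Set[str], min_len: int) -> List[str]:
    """Return the query words that would still take part in the search.

    Simplification: words are split on whitespace only; real full-text
    parsers also split on punctuation.
    """
    return [
        word
        for word in query.lower().split()
        if len(word) >= min_len and word not in stopwords
    ]


# Assumption: only the two stopwords mentioned in this answer.
STOPWORDS = {"just", "believe"}

print(effective_terms("Just Believe Bird Feeder", STOPWORDS, 4))
print(effective_terms("just believe", STOPWORDS, 4))
print(effective_terms("8 oz bird feeder", STOPWORDS, 4))
```

The second query has no words left after filtering, which matches the zero results reported in the question. In the first query only "bird" and "feeder" contribute, so a term that repeats "bird" scores higher.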

How to speed up search MySQL? Is fulltext search with special characters possible?

I have strings like the following in my VARCHAR InnoDB table column:
"This is a {{aaaa->bbb->cccc}} and that is a {{dddd}}!"
Now, I'd like to search for e.g. {{xxx->yyy->zzz}}. The brackets are part of the string. Sometimes the search also involves another column, but that one only contains an ordinary id and hence doesn't need to be considered (I guess).
I know I can use LIKE or REGEXP. But these (already tried) ways are too slow. Can I introduce a fulltext index? Or should I add another helping table? Should I replace the special characters {, }, -, > to get words for the fulltext search? Or what else could I do?
The search works with some ten-thousand rows and I assume that I often get about one hundred hits.
This link should give you all the info you need regarding FULLTEXT indexes in MySQL.
MySQL dev site
The section that you will want to pay particular attention to is:
"Full-text searching is performed using MATCH() ... AGAINST syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a string value that is constant during query evaluation. This rules out, for example, a table column because that can differ for each row."
So in short, to answer your question: you should see an improvement in query execution times by implementing a full-text index on wide VARCHAR columns, provided you are using a compatible storage engine (InnoDB or MyISAM).
Also here is an example of how you can query the full text index and also an additional ID field as hinted in your question:
SELECT *
FROM your_table -- placeholder name; TABLE itself is a reserved word
WHERE MATCH (fieldlist) AGAINST ('search text here')
AND (field2 = '1234');

MySQL fulltextsearch: no results

I use this query for a full-text search in my table:
SELECT Titel FROM cmsa WHERE MATCH(Titel) AGAINST ('+"Ort" +"Berlin"' IN BOOLEAN MODE)
but the result is empty.
If I use
SELECT Titel FROM cmsa WHERE Titel LIKE '%Berlin%'
the result would be (without quotes):
"Ort - Berlin"
Why didn't the full-text search find this result? The word "Ort" and the word "Berlin" are both in the field Titel of the entry.
Other full-text searches work great.
Any ideas?
I guess it is because MySQL has a server parameter for the minimum length of a word to be included in a FULLTEXT index (ft_min_word_len for MyISAM). The default value for this parameter is 4, so your first word, Ort, is not included in the index. You should change this system parameter, restart the server and then rebuild all FULLTEXT indexes.
REPAIR TABLE cmsa QUICK;
Change the full text index minimum word length with MySQL
Try without the double quotes, and make sure the table's storage engine is MyISAM:
SELECT
Titel
FROM
cmsa
WHERE
MATCH(Titel) AGAINST ('+Ort +Berlin' IN BOOLEAN MODE)
Some explanation
Boolean Mode Searches
SELECT headline, story FROM news
WHERE MATCH (headline,story)
AGAINST ('+Hurricane -Katrina' IN BOOLEAN MODE);
The above statement would match news stories about hurricanes but not those that mention Hurricane Katrina.
Query Expansion
The Blind Query Expansion (or automatic relevance feedback) feature can be used to expand the results of the search. This often includes much more noise, and makes for a very fuzzy search.
In most cases you would use this operation when the user's query returned just a few results: you run it again WITH QUERY EXPANSION, and it adds words that are commonly found together with the words in the query.
SELECT headline, story FROM news
WHERE MATCH (headline,story)
AGAINST ('Katrina' WITH QUERY EXPANSION);
The above query might return all news stories about hurricanes, not just ones containing Katrina.
A couple of points about full-text searching in MySQL:
Searches are not case sensitive.
Short words are ignored; the default minimum length is 4 characters (for MyISAM). You can change the min and max word length with the variables ft_min_word_len and ft_max_word_len.
Words called stopwords are ignored. You can specify your own stopwords, but default words include the, have, some - see the default stopwords list.
You can disable stopwords by setting the variable ft_stopword_file to an empty string.
Before MySQL 5.6, full-text searching was only supported by the MyISAM storage engine; from 5.6 on, InnoDB supports it as well.
If a word is present in more than 50% of the rows it will have a weight of zero (this applies to natural language mode on MyISAM). This has advantages on large datasets, but can make testing difficult on small ones.
Your query worked perfectly for me. Please check whether your table is MyISAM, because before MySQL 5.6 full-text search works only with the MyISAM engine.