MySQL: Full-Text inflectional forms of words? [duplicate] - mysql

Stemming Words in MySQL
For example, a user might search for "testing", "tested" or "tests". All these words are related to each other because they share the base word "test".
Is there a way to get such result or function?

MySQL Full-Text Search
Historically, full-text searches were supported only by the MyISAM engine. Starting with version 5.6, MySQL also supports full-text searches on InnoDB tables. This is great news, since it enables developers to benefit from InnoDB’s referential integrity, transactions, and row-level locking.
There are basically two approaches to full-text searches in MySQL: natural language and boolean mode. (A third option augments the natural language search with a second expansion query.)
The main difference between the natural language and boolean modes is that boolean mode allows certain operators as part of the search. For instance, boolean operators can be used if a word has greater relevance than others in the query or if a specific word must be present in the results. It’s worth noting that in both cases, results can be sorted by the relevance computed by MySQL during the search.
The best fit for our problem was to use InnoDB full-text searches in boolean mode. Why?
We had little time to implement the search function.
At this point, we had no big data to crunch nor a massive load to require something like Elasticsearch or Sphinx.
We used shared hosting that doesn’t support Elasticsearch or Sphinx and the hardware was pretty limited at this stage.
While we wanted word stemming in our search function, it wasn’t a deal breaker: we could implement it (within constraints) by way of some simple PHP coding and data denormalization.
Full-text searches in boolean mode can search words with wildcards (for the word stemming) and sort the results based on relevance.
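As a rough illustration of the "simple coding" idea in the list above, the sketch below turns a user's search phrase into a boolean-mode string of prefix wildcards. It is written in Python rather than PHP, and the names `crude_stem`, `to_boolean_query` and the suffix list are assumptions made for this example, not part of MySQL or of the original solution. The suffix stripping is deliberately naive and is not a real stemmer.

```python
import re

# Hypothetical suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word, min_len=3):
    """Strip one common English suffix, keeping at least min_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word


def to_boolean_query(user_input):
    """Build a search string of prefix wildcards for AGAINST (... IN BOOLEAN MODE)."""
    # Keep only letters and digits so that boolean operators typed by the
    # user (+, -, *, quotes, parentheses) cannot change the query's meaning.
    words = re.findall(r"[A-Za-z0-9]+", user_input.lower())
    return " ".join(crude_stem(word) + "*" for word in words)


print(to_boolean_query("testing"))       # test*
print(to_boolean_query("tested tests"))  # test* test*
```

The resulting string would then be passed to the query as a bound parameter rather than concatenated into the SQL text. Note that a prefix such as test* also matches unrelated words like "testament", so this only approximates stemming.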
In the Normalized Vertabelo Model
Let’s see how a simple search would work. We’ll create a sample table first:
CREATE TABLE artists (
    id int(11) NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    bio text NOT NULL,
    CONSTRAINT artists_pk PRIMARY KEY (id)
) ENGINE InnoDB;
CREATE FULLTEXT INDEX artists_idx_1 ON artists (name);
In natural language mode
You can insert some sample data and start testing. (It would be good to add it to your sample dataset.) For instance, we’ll try searching for Michael Jackson:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN NATURAL LANGUAGE MODE)
This query will find records that match the search terms and will sort matching records by relevance; the better the match, the more relevant it is and the higher the result will appear in the list.
In boolean mode
We can perform the same search in boolean mode. If we don’t apply any operators to our query, the only difference will be that results are not sorted by relevance:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN BOOLEAN MODE)
The wildcard operator in boolean mode
Since we want to search stemmed and partial words, we will need the wildcard operator (*). This operator can be used in boolean mode searches, which is why we chose that mode.
So, let’s unleash the power of boolean search and try searching for part of the artist’s name. We’ll use the wildcard operator to match any artist whose name starts with ‘Mich’:
SELECT
*
FROM
artists
WHERE
MATCH (name) AGAINST ('Mich*' IN BOOLEAN MODE)

Related

Fulltext match search in natural language mode

I am attempting a full-text search in MySQL. I expect that when I pass in a string, I will receive results ranked by relevance when I use natural language mode (see: mysql - fulltext index - what is natural language mode).
Here is how I created the index: CREATE FULLTEXT INDEX item_name ON list_items(name);
When I use LIKE, I receive results, except I want to order them by relevancy. Hence, the fulltext search.
Here is the query I have using LIKE: SELECT name FROM list_items WHERE name LIKE "%carro%";
Which results in Carrots, Carrots, Carrots etc.
Here is the query I have attempting the MATCH search: SELECT name FROM list_items WHERE MATCH(name) AGAINST('carro' IN NATURAL LANGUAGE MODE); Which returns no results.
I am basing my query on the selected answer on this post: Order SQL by strongest LIKE?
And this page: https://www.w3resource.com/mysql/mysql-full-text-search-functions.php
Even when I run the query without Natural Language Mode or even in Boolean Mode, I don't get any results. What am I missing?
You seem to want to use * as a wildcard. For that you need to use "boolean" mode rather than "natural language". So, this might do what you want:
SELECT name
FROM list_items
WHERE MATCH(name) AGAINST('carro*' IN BOOLEAN MODE)
Boolean mode still computes a relevance score, although it might not be exactly the same as in natural language mode. Rows are not automatically sorted by that score, so add an ORDER BY on the same MATCH ... AGAINST expression if you need ranked output.
Also note that this will get matches such as "carrousel".
I don't think that MySQL supports synonym lists for full-text search, so this is tricky to avoid (although LIKE filtering along with the full-text filtering might suffice).

Wildcard search in MySQL full-text search

Let's say we have the following query:
SELECT *
FROM companies
WHERE name LIKE '%nited'
It returns
name
united
How do I write a query using MySQL's full-text search that will provide similar results?
Unfortunately you cannot do this using a MySQL full-text index. You cannot retrieve '*nited states' directly from the index, because the index is organized by the leading characters of each word. However, you can search 'United Sta*'.
-- the only possible wildcard full-text search in MySQL
WHERE MATCH(column) AGAINST ('United Sta*' IN BOOLEAN MODE)
MySQL's full-text performs best when searching whole words in sentences - even that can suck at times. Otherwise, I'd suggest using an external full-text engine like Solr or Sphinx. I think Sphinx allows prefix and suffix wildcards, not sure about the others.
You could go back to MySQL's LIKE clause, but again, running queries like LIKE '%nited states' or LIKE '%nited Stat%', will also suffer on performance, as it can't use the index on the first few characters. 'United Sta%' and 'Unit%States' are okay as the index can be used against the first bunch of known characters.
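To make the point about leading characters concrete, here is a small Python analogy (not MySQL's actual implementation): a sorted list stands in for a B-tree index. The sample values and the function names `prefix_search` and `suffix_search` are made up for this sketch.

```python
from bisect import bisect_left

# A sorted list stands in for a B-tree index on the column (an analogy only).
index = sorted(["google", "unicorn", "united", "united states", "unity", "zebra"])


def prefix_search(prefix):
    """Like LIKE 'prefix%': jump to the first candidate, then read forward."""
    start = bisect_left(index, prefix)
    matches = []
    for value in index[start:]:
        if not value.startswith(prefix):
            break  # sorted order means no later value can match
        matches.append(value)
    return matches


def suffix_search(suffix):
    """Like LIKE '%suffix': sort order is no help, so every entry is checked."""
    return [value for value in index if value.endswith(suffix)]


print(prefix_search("unit"))   # ['united', 'united states', 'unity']
print(suffix_search("nited"))  # ['united']
```

Both calls return correct results; the difference is that the prefix lookup can stop early while the suffix lookup always scans the whole list, which is why a leading wildcard gets slower as the table grows.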
Another quite major caveat of MySQL's full-text indexing is the stop-word list and minimum word length settings. For example, in a shared hosting environment, you will be limited to words of 4 or more characters, so searching 'Goo' to get 'Google' would fail. The stop-word list also excludes common words like 'and', 'maybe' and 'outside'; in fact, there are 548 stop-words altogether! Again, if you are not on shared hosting, these settings are relatively easy to modify, but if you are, you will get annoyed with some of the default settings.
You can use MySQL's full-text index, but you must configure the parser to be the n-gram parser.
If your data is English (as opposed to Chinese or similar), you ought to also increase the ngram_token_size parameter to the minimum search term length you are willing to have. Otherwise, the search will be unacceptably slow.
You will also want to set innodb_ft_enable_stopword=0, otherwise an idiosyncrasy of how ngram stopword handling works will mean that many useful queries will return no results.
To explain why you must also increase ngram_token_size, you may think of this index as the following schema. MySQL then does a series of joins to find the results which match the search term:
CREATE TABLE fulltext_index
(
docid int(11) NOT NULL,
term char(2) NOT NULL,
PRIMARY KEY (docid, term),
INDEX term_idx (term)
);
The n-gram (2) parser breaks each word in your query into segments like se, eg, gm, me, en, nt, ts. For each of these n-grams, there are many results in English, so the index doesn't help much since it ends up iterating over everything anyway. Meanwhile, you can see how Chinese 随机的 would split into a much more useful 随机 and 机的. With n-gram size set to 4, the segments are segm, egme, gmen, ment, ents. These larger segments are much more likely to be unique, so each segment narrows down the search space significantly.
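The splitting described above can be reproduced with a few lines of Python. This is only a sketch of the tokenization idea, with a helper name (`ngrams`) chosen for this example; it does not model how MySQL stores or joins the tokens.

```python
def ngrams(word, n):
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(ngrams("segments", 2))  # ['se', 'eg', 'gm', 'me', 'en', 'nt', 'ts']
print(ngrams("segments", 4))  # ['segm', 'egme', 'gmen', 'ment', 'ents']
print(ngrams("随机的", 2))     # ['随机', '机的']
```

A larger n produces fewer, more distinctive tokens per word, but words shorter than n produce no tokens at all, which is why the token size should match the shortest search term you want to support.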
Disabling stopwords is also necessary because the ngram parser excludes all n-grams that contain any of the stopwords. For example, with an n-gram (4) parser, stopword will be parsed into stop, topw, opwo, pwor, and word:
stop will be excluded because it contains "to"
topw will be excluded because it contains "to"
opwo will be kept
pwor will be excluded because it contains "or"
word will be excluded because it contains "or"
Because these tokens are excluded from the index, a search for MATCH(name) AGAINST('stop' IN BOOLEAN MODE) would not return anything unless the stopwords mechanism is disabled before creating the index.
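The filtering rule described above can be simulated in Python. The stopword set below is a small illustrative subset chosen for this sketch (an assumption, not the server's full list), and `indexed_ngrams` is a hypothetical helper name.

```python
# Illustrative subset of stopwords (an assumption for this sketch).
STOPWORDS = {"to", "or", "in", "of", "the"}


def ngrams(word, n):
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def indexed_ngrams(word, n, stopwords=STOPWORDS):
    """Keep only the n-grams that contain no stopword as a substring."""
    return [gram for gram in ngrams(word, n)
            if not any(stop in gram for stop in stopwords)]


print(ngrams("stopword", 4))                           # all five n-grams
print(indexed_ngrams("stopword", 4))                   # ['opwo']
print(indexed_ngrams("stopword", 4, stopwords=set()))  # all five n-grams again
```

With the stopword set emptied, which mirrors disabling the stopword mechanism, every n-gram survives and a search for 'stop' has a token to match.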
To answer your question,
set ngram_token_size to 3, 4, or whatever your minimum search term length is.
set innodb_ft_enable_stopword to 0 or OFF.
create the index with CREATE FULLTEXT INDEX companies_name_idx ON companies (name) WITH PARSER ngram;
SELECT * FROM companies WHERE MATCH(name) AGAINST('nited' IN BOOLEAN MODE);
This will also return results for nitedA, so you might want to further filter the results from there, if that's required for your application.

Text search in MySQL - Performance and Alternatives

I have a set of tables in MySQL like this (foreign keys referenced by [table_name]_id):
Articles(id, author_id, title, date, broad_search, ...)
Keywords(id, article_id, keyword (varchar))
Authors(id, name, ...)
Attachments(id, article_id, url, ...)
The table we are concerned about most is 'Keywords' so I am mentioning the indexes only for it:
id - Primary - BTREE
(article_id,keyword) - Unique - BTREE
keyword - BTREE
article_id - BTREE
Each Article has associated list of Keywords. The "broad_search" column in Articles states whether that particular article can be matched broadly with the keywords (broad_search=1) or if it has to be an exact match of the keyword(broad_search=0). I have a SELECT query which pulls a list of articles based on keywords, broad_search parameter and other filter criteria.
$sql = "SELECT *
FROM Keywords k, Attachments at, Articles ar, Authors a (2 more tables)
WHERE
((ar.broad_search=0 AND k.keyword = '$Keyword')
OR (ar.broad_search=1 AND (INSTR('$Keyword', k.keyword)>0 OR k.keyword like '%$Keyword%')))
AND at.article_id = ar.id
AND a.id = ar.author_id
... (more conditions)
LIMIT 20";
An article can be set to either broad match or exact match, and I'm trying to get a list of them based on a keyword.
Exact match is straightforward. But broad match has various cases which will not let me use a simple wild card pattern like '%search_term%'. An example:
Keywords for a broad match article = {books, used books, reading books, popular book}
search term = new books
Now, we cannot use the mysql wildcard string matching as '%new books%' will not match any of the keywords but it needs to be retrieved as the search term contains a substring of the keywords (broad_search=1). So, broad_search is of 2 types: search_term = "cars" in keyword "used cars" and search term = "used cars" in keyword "cars".
If broad_search=0, do an exact match. If broad_search=1, match both cases:
((ar.broad_search=0 AND k.keyword = '$Keyword')
OR (ar.broad_search=1 AND (INSTR('$Keyword', k.keyword)>0 OR k.keyword like '%$Keyword%')))
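For readers who find the condition easier to follow outside SQL, here is the same matching rule as a small Python sketch. The function name `keyword_matches` is made up for this example, and lowercasing both sides is an assumption that mimics a case-insensitive collation.

```python
def keyword_matches(keyword, search_term, broad_search):
    """Mirror the WHERE condition: exact match, or substring in either direction."""
    keyword, search_term = keyword.lower(), search_term.lower()
    if not broad_search:
        return keyword == search_term
    # INSTR('$Keyword', k.keyword) > 0  ->  keyword occurs inside the search term
    # k.keyword LIKE '%$Keyword%'       ->  search term occurs inside the keyword
    return keyword in search_term or search_term in keyword


keywords = ["books", "used books", "reading books", "popular book"]

print([k for k in keywords if keyword_matches(k, "new books", True)])
# ['books']
print([k for k in keywords if keyword_matches(k, "books", True)])
# ['books', 'used books', 'reading books']
print([k for k in keywords if keyword_matches(k, "books", False)])
# ['books']
```

Both broad-match checks are substring tests with no fixed prefix, which is why neither can be served by the BTREE index on keyword.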
The query I wrote perfectly does the job. But the issue is with performance. The keywords table is very large with 100,000+ rows and keeps growing. Also, this is a high load app and kills my server due to the huge number of requests it receives.
I feel this is not the right way to perform a text search. I tried going through the MySQL docs on full-text search, but I did not quite understand its application or whether it fits my search criteria. I was also wondering if Apache Lucene would be a better choice, but I haven't used it before, so I'm not really sure (this query runs in a PHP script).
How should I be implementing this? Is it indexing issue, or is the MySQL INSTR function inefficient, or should I be using a whole different approach?
MySQL isn't a search engine, it's a Relational Database Management System (RDBMS). However, you can use native MySQL tools to emulate full-text searching capabilities, such as setting up a search table as MyISAM and adding a FULLTEXT index to the columns you wish to search. You can read the MySQL docs for more info on how MySQL supports full-text searching.
Even if you get full-text search queries to work the way you want, you will still miss out on a whole host of features that a true search engine (Lucene) supports: features such as facets, spatial searches, result boosting, weighting, etc. I'd suggest you read up on Apache Solr, as it supports all these features and many more. There is even a PHP Solr API which you can use to access a Solr instance.
I'm not saying to abandon MySQL altogether, but use it for its intended purpose: to persistently store data which can be queried and which can be used to populate your search engine indices. Solr even has a built-in Data Import Handler, which will allow you to set a database query to be used when you want to mass import data from your MySQL database.
The learning curve is relatively high, as it is with learning most new technologies, but when you are done you will wonder how you ever got by without using a true Full-Text search engine.

How to perform search on MySQL table for a website

How do I perform a search similar to that of Wikipedia on a MySQL table (or several tables at a time) without crawling the database first? (Search on wikipedia used to show you the relevancy in percentage).
What I'm looking for here is how to determine relevancy of the results and sort them accordingly, especially in case where you pull data from several tables at a time.
What do you use for search function on your websites?
You can use MySQL's full-text search functionality. You need to have a FULLTEXT index on the fields to be searched. For natural language searches, it returns a relevance value which is "a similarity measure between the search string and the text in that row."
If you are searching multiple tables, the relevance value should be comparable across sets of results; you could do a UNION of individual fulltext queries on each of the tables, then sort the results of the union based on the relevance value.
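The same combine-and-sort step can also be done in application code instead of a SQL UNION. The Python sketch below assumes each per-table query already returns its rows ordered by descending relevance; the rows and scores are made-up placeholders, not real query output.

```python
from heapq import merge

# Hypothetical (relevance, table, title) rows, each list already sorted by
# descending relevance, as a per-table full-text query might return them.
articles = [(2.4, "articles", "MySQL full-text search"),
            (0.7, "articles", "Indexing basics")]
comments = [(1.9, "comments", "Re: full-text search"),
            (0.3, "comments", "Thanks!")]

# Merge the pre-sorted lists into one list ordered by descending relevance.
combined = list(merge(articles, comments, key=lambda row: row[0], reverse=True))

for score, table, title in combined:
    print(f"{score:.1f}  {table:<9} {title}")
```

This only gives a meaningful ordering if the scores from the different tables are on a comparable scale, which is the same assumption the UNION approach makes.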