I'm running into an issue where I need to support multiple languages in my fulltext search.
Because of this I've been using the n-gram parser. Unfortunately I'm noticing for non-ideographic languages that the results are pretty much useless.
Since I need this to also function in English/French/Spanish/etc... I'm now hoping to be able to switch between an n-gram and standard fulltext index depending on the context.
Is there a way for me to either improve n-gram results on non-ideographic languages, or a way to choose which fulltext search I'll make (n-gram or standard)?
Thanks!
I've created both a standard and an n-gram index and tried the USE INDEX hint to choose between them, but I've noticed the query simply defaults to the first applied index.
Is there any way to search for specific text in MySQL without using Full Text Search?
I know LIKE is a solution, but a wildcard at the beginning of the pattern prevents index use, so performance is poor on large data sets.
Please specify more details of your use case.
Meanwhile, I have found combining FULLTEXT with LIKE to be beneficial in some use cases. For example, suppose you wanted to search for a bracketed word:
WHERE MATCH(col) AGAINST('+word' IN BOOLEAN MODE)
AND col LIKE '%[word]%'
The MATCH would rapidly find the few rows with "word", then the LIKE would slowly check those few rows. It gives reasonably fast overall speed while checking for some types of non-words.
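A minimal sketch of how that two-stage query could be assembled from application code, in Python. The table name `docs`, the column `col`, and the helper `build_bracketed_search` are hypothetical, and the `%s` placeholders assume a DB-API style MySQL driver.

```python
def build_bracketed_search(word):
    """Return (sql, params) for a FULLTEXT prefilter plus an exact LIKE check.

    Assumes `word` is a single alphanumeric token; `docs` and `col` are
    hypothetical names used only for illustration.
    """
    if not word.isalnum():
        raise ValueError("expected a single alphanumeric word")
    sql = (
        "SELECT * FROM docs "
        "WHERE MATCH(col) AGAINST(%s IN BOOLEAN MODE) "
        "AND col LIKE %s"
    )
    # '+word' makes the term mandatory; the LIKE pattern then checks the brackets.
    params = ("+" + word, "%[" + word + "]%")
    return sql, params
```

Restricting the input to alphanumeric characters keeps LIKE wildcards and boolean-mode operators out of the user-supplied word.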
Acronyms are a pain in my database, especially when doing a search. I haven't decided if I should accept periods during search queries. These are the problems I face when searching:
'IRQ' will not find 'I.R.Q.'
'I.R.Q' will not find 'IRQ'
'IRQ.' or 'IR.Q' will not find 'IRQ' or 'I.R.Q.'
etc...
The same problem goes for ellipses (...) or three series of periods.
I just need to know which direction to take with this issue:
Is it better to remove all periods when inserting the string to the database?
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
My responses for each question:
Is it better to remove all periods when inserting the string to the database?
Yes and no. You want the database to have the original text. If you want, create a separate field that is "cleaned up" to search against. Here, you can remove periods, make everything lowercase, etc.
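As a rough sketch of what such a "cleaned up" value could look like, here is one way to derive it in Python before writing it to the second column. The function name `normalize_for_search` is hypothetical, and the exact cleanup rules (drop periods, collapse whitespace, lowercase) are assumptions to adapt to your data.

```python
import re

def normalize_for_search(text):
    """Lowercase the text and drop periods so 'I.R.Q.' and 'IRQ' compare equal."""
    without_periods = text.replace(".", "")
    # Collapse whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", without_periods).strip().lower()
```

The same function would be applied to the user's search term, so both sides of the comparison go through identical cleanup.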
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
/\.+/
That matches a run of one or more periods in a given spot, so it also matches ellipses. You'll want to integrate it with your search formula.
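If only isolated periods should be removed while runs such as '...' stay, lookarounds are one option. The sketch below does this in application code with Python's `re` module; whether the database's own regex engine supports lookarounds depends on the version, so that part is left as an assumption. The names `SINGLE_PERIOD` and `strip_single_periods` are hypothetical.

```python
import re

# A period that is neither preceded nor followed by another period.
SINGLE_PERIOD = re.compile(r"(?<!\.)\.(?!\.)")

def strip_single_periods(text):
    """Remove isolated periods but leave runs such as '...' untouched."""
    return SINGLE_PERIOD.sub("", text)
```

Note that this also strips ordinary sentence-ending periods, which is usually fine for a search-only copy of the text.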
Note: regex matching in a database isn't known for high performance, since it cannot use an index. Be cautious with this.
Other note: you may want to use FULLTEXT search in MySQL. It also isn't known for high performance on data sets with more than about 1,000 entries. If you have big data and need fulltext search, use Sphinx (available as a MySQL plug-in and as a RAM-based indexing system).
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
Yes, by having the two fields I described in my answer to the first question.
You need to consider the sanctity of your input. If it is not yours to alter then don't alter it. Instead you should have a separate system to allow for text searching, and that can alter the text as it sees fit to be able to handle these types of issues.
Have a read up on Lucene, and specifically Lucene's standard analyzer, to see the types of changes that are commonly carried out to allow successful searching of complex text.
I think you can use MySQL's REGEXP operator to match an acronym (MySQL patterns take no delimiters, and [.] avoids backslash escaping inside the string literal):
SELECT col1, col2...coln FROM yourTable WHERE colWithAcronym REGEXP 'I[.]?R[.]?Q[.]?'
If you use PHP you can build your regexp with this simple loop:
// Builds e.g. "I[.]?R[.]?Q[.]?" from "IRQ"; assumes the acronym contains only letters.
$result = "";
foreach (str_split($yourAcronym) as $char) {
    $result .= $char . "[.]?";
}
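To sanity-check which spellings a pattern of this shape accepts, the same construction can be tried outside the database. This is a Python sketch with a hypothetical helper `acronym_pattern`; it writes the optional period as `[.]?` because that form needs no backslash escaping, and Python's `re` is not MySQL's regex engine, though the two should agree on a pattern this simple.

```python
import re

def acronym_pattern(acronym):
    """Build 'I[.]?R[.]?Q[.]?' from 'IRQ', ignoring periods in the input."""
    letters = [ch for ch in acronym if ch != "."]
    # re.escape keeps any unexpected special character literal.
    return "".join(re.escape(ch) + "[.]?" for ch in letters)
```

Because periods in the input are ignored, 'IRQ', 'I.R.Q' and 'IR.Q' typed into the search field all produce the same pattern.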
The functionality you are searching for is a fulltext search. MySQL 5.5 and earlier support this only for MyISAM tables, not for InnoDB; InnoDB gained FULLTEXT indexes in MySQL 5.6. (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html)
Alternatively you could go for an external framework that provides that functionality. Lucene is a popular open-source one. (lucene.apache.org)
There are two methods:
1. Store the data with the symbols removed and match accordingly.
2. Use a regex, for example:
select * from table where acronym regexp '^[A-Z]+[.]?[A-Z]+[.]?[A-Z]+[.]?$';
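Please note, however, that under a case-sensitive or binary collation this requires the acronym to be stored in uppercase (with the usual case-insensitive collations, REGEXP already ignores case). If you don't want the case to matter there, just change [A-Z] to [A-Za-z].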
I'm trying to decide between two options to achieve a prefix match for names (50 million options) with MySQL. The use case is autocomplete for search results. The dilemma is between building:
An index on the VARCHAR and performing a LIKE 'word%' query
An FTS (full-text search) index and performing a MATCH ... AGAINST('word*' IN BOOLEAN MODE) query
Which is better for such a case? Should I consider additional FTS features for such an auto-suggest autocomplete of names?
FTS and prefix matching are two different things. So the answer depends on what your actual requirement is.
Do you need to return a list of all results that exactly match the condition column LIKE 'word%'? Specifically that the string must start with the word you're looking for.
Full text search does matching based on relevance. It's not always going to give you things that match specific strings. It splits text into words, it has stopwords and a minimum word length, and in natural language mode it can omit results if a word is too common.
I think in this case the best answer is "Full text search doesn't quite do what you think it does." So if you have precise requirements for matching, you should stick to the method that will work.
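One reason LIKE 'word%' pairs well with an ordinary index is that all strings sharing a prefix sit next to each other in sorted order, so the lookup is a seek followed by a short scan. The Python sketch below illustrates that idea on an in-memory sorted list; it is an analogy, not MySQL's actual B-tree implementation, and `prefix_matches` is a hypothetical name.

```python
from bisect import bisect_left

def prefix_matches(sorted_names, prefix, limit=10):
    """Return up to `limit` names starting with `prefix` from a sorted list."""
    # Seek to the first entry that is >= prefix.
    start = bisect_left(sorted_names, prefix)
    results = []
    for name in sorted_names[start:start + limit]:
        # Matches are contiguous, so the first non-match ends the scan.
        if not name.startswith(prefix):
            break
        results.append(name)
    return results
```

For an autocomplete box, the `limit` plays the same role as a LIMIT clause: only the first few suggestions are ever read.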
I am trying to build a product search for a jewelry store. I know that if a term is in over 50% of the entries then it has a weight of zero. So right now if I do a search for "diamond" I get no results because over 50% contain diamond. Is there a way to change that?
Quoting the MySQL documentation, 11.9.6. Fine-Tuning MySQL Full-Text Search:
"If you really need to search for such common words, it would be better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold."
See : 11.9.2. Boolean Full-Text Searches
The other solution seems to be patching MySQL's source code and recompiling, which is probably not something you want to do.
Another approach, mentioned in a comment on the MySQL website, is to use boolean mode only if the natural-language fulltext search gives no results, but keep in mind that the second search won't sort the results in order of relevance.
11.9.1. Natural Language Full-Text Searches
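The fallback approach could be wired up roughly as follows. This is a Python sketch under stated assumptions: the table `products`, the column `description`, and the `run_query(sql, params)` callable are all hypothetical stand-ins for whatever database layer you use.

```python
NATURAL_SQL = "SELECT * FROM products WHERE MATCH(description) AGAINST(%s)"
BOOLEAN_SQL = (
    "SELECT * FROM products "
    "WHERE MATCH(description) AGAINST(%s IN BOOLEAN MODE)"
)

def search_with_fallback(run_query, term):
    """Try a natural-language search first; fall back to boolean mode if empty.

    `run_query(sql, params)` is a hypothetical callable returning a list of rows.
    """
    rows = run_query(NATURAL_SQL, (term,))
    if rows:
        return rows, "natural"
    # Boolean mode ignores the 50% threshold but does not sort by relevance.
    return run_query(BOOLEAN_SQL, (term,)), "boolean"
```

Returning the mode alongside the rows lets the caller know whether the results arrive in relevance order.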