Switching between N-Gram and standard MySQL Fulltext Search

I'm running into an issue where I need to support multiple languages in my fulltext search.
Because of this I've been using the n-gram parser. Unfortunately, I'm noticing that for non-ideographic languages the results are pretty much useless.
Since I need this to also work in English/French/Spanish/etc., I'm now hoping to be able to switch between an n-gram and a standard fulltext index depending on the context.
Is there a way for me to either improve n-gram results for non-ideographic languages, or to choose which kind of fulltext search to run (n-gram or standard)?
Thanks!
I've added both a standard and an n-gram fulltext index and tried the USE INDEX hint to control which index is used, but I've noticed the query simply defaults to whichever index was added first.
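Roughly what I tried, with placeholder table and index names (the n-gram index is declared with WITH PARSER ngram):

-- Two fulltext indexes over the same columns: one standard, one n-gram.
ALTER TABLE articles ADD FULLTEXT INDEX ft_standard (title, body);
ALTER TABLE articles ADD FULLTEXT INDEX ft_ngram (title, body) WITH PARSER ngram;

-- The hinted query; in my tests the hint is ignored and MySQL uses
-- whichever fulltext index was created first.
SELECT * FROM articles USE INDEX (ft_ngram)
WHERE MATCH(title, body) AGAINST('search term');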

Related

Full-Text Indexing in MySql to speed up "contains" searches

I am running into issues because I have very large tables and am doing updates based on WHERE clauses with LIKE '%X' or LIKE '% X %'. I even got to the point of considering "exploding" the Name fields into NameWord1, NameWord2, etc., so I could construct more complex WHERE clauses in which each condition would at least be '=' rather than LIKE '%...'.
I have noticed full-text indexes, however, and it isn't clear to me from the documentation whether they might achieve the same result. I am loath to full-text index the multi-million-row table just to test, and I can't see a test on a small table giving me any real insight into performance gains, so I am posting what I realize is a somewhat generic question here to get feedback on how applicable MySQL full-text indexes are to my issue.
As a general answer, "yes". Your use-case is what full-text indexes are designed for.
Two things to keep in mind as you implement one:
The minimum word length. The default is 3 for InnoDB or 4 for MyISAM (innodb_ft_min_token_size and ft_min_word_len, respectively). If you are looking for a one-character term, then this is clearly too long.
Stop word list. These are values that are not in the index, either because they are uninformative ("nevertheless") or too common ("have").
Also, if your where clause contains other conditions, then the full text index may not be as great an improvement as it otherwise would be.
And, one final note. If the column you are indexing is really a list of codes, then these should be in a junction table. Although full-text indexing could provide a benefit, a proper data structure with appropriate indexes would have even better performance -- and you could maintain relational integrity as well.
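As a rough sketch, with placeholder table and column names, the switch from LIKE to a full-text search looks something like this:

-- Leading-wildcard LIKE cannot use a normal index, so it scans every row.
SELECT id, name FROM customers WHERE name LIKE '% smith %';

-- Build the full-text index once...
ALTER TABLE customers ADD FULLTEXT INDEX ft_name (name);

-- ...then let MATCH ... AGAINST use it instead of the LIKE scan.
SELECT id, name FROM customers WHERE MATCH(name) AGAINST('smith');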

Prevent MySQL fulltext natural language search from ignoring words which appear in at least 50% of rows

When searching MyISAM tables, MySQL fulltext search in natural language mode ignores words which appear in at least 50% of rows. Is there any way to prevent this behavior? I would like to include popular words rather than ignore them.
For example, I have a MySQL table of clothing entries, 90% of which are categorized as t-shirts. I would like customers to be able to search for "t-shirt" without having the "t-shirt" word ignored in the fulltext index.
The only solution I have come up with so far is to use boolean mode search rather than natural language search. However, the scores that boolean mode search returns don't seem to be as diverse or meaningful as scores returned by natural language search.
Side note: as of MySQL 5.6, natural language search on InnoDB tables does not ignore words which appear in at least 50% of rows. However, my web server is still running MySQL 5.5, which doesn't offer fulltext search on InnoDB tables, so the table needs to remain in MyISAM format.
Thanks!
A couple of years later, I found the answer in the MySQL 5.7 documentation:
For MyISAM search indexes, the 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in storage/myisam/ftdefs.h:
#define GWS_IN_USE GWS_PROB
Change that line to this:
#define GWS_IN_USE GWS_FREQ
Then recompile MySQL. There is no need to rebuild the indexes in this case.
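For anyone who cannot recompile, the boolean-mode workaround mentioned in the question looks roughly like this (table and column names are placeholders, and "shirt" stands in for the over-common word, since the default parser splits "t-shirt" on the hyphen). Boolean mode is not subject to the 50% threshold, but its relevance scores are cruder:

-- Natural language mode on MyISAM: a word present in over half the rows
-- is treated like a stopword, so a search for just that word returns nothing.
SELECT id, name, MATCH(name) AGAINST('shirt') AS score
FROM products WHERE MATCH(name) AGAINST('shirt');

-- Boolean mode: the 50% threshold does not apply, so common words still match.
SELECT id, name, MATCH(name) AGAINST('shirt' IN BOOLEAN MODE) AS score
FROM products WHERE MATCH(name) AGAINST('shirt' IN BOOLEAN MODE);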

MySQL: Best way to search in files' content (fulltext search)

I'm currently developing a website which allows users to upload presentations, documents, and e-books (something like Scribd or SlideShare), so I need to be able to search the files' content. I'm currently extracting the text from the files into a txt file.
I am considering two options, as I am using MySQL:
Store the plain text in a separate table and use mysql's fulltext index to search through it.
Use an inverted index to store words and search through them (two new tables: words, and a many-to-many table linking words to the documents table). In this case, what can I do about repeated words so that they add relevance to the results? (A rough sketch follows at the end of this question.)
The text will only be used for searching. The problem with (1) is that the text of an e-book may be huge, so I'm considering limiting it to, for example, 50 KB or less.
Option (2) also has the problem of an e-book containing a huge number of words, which, again, could be limited.
So can you guide me to the best way to index the text so that I can do fast fulltext searches? I need to get the best out of MySQL in this case.
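For illustration, a minimal sketch of option (2) under assumed table names, with an occurrence count so that repeated words add weight to a document's score:

CREATE TABLE words (
  word_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  word    VARCHAR(64) NOT NULL,
  UNIQUE KEY uq_word (word)
);

CREATE TABLE document_words (
  document_id INT UNSIGNED NOT NULL,
  word_id     INT UNSIGNED NOT NULL,
  occurrences INT UNSIGNED NOT NULL DEFAULT 1,  -- how many times the word appears in the document
  PRIMARY KEY (document_id, word_id),
  KEY idx_word (word_id)
);

-- Rank documents matching any searched word, weighting repeated words higher.
SELECT dw.document_id, SUM(dw.occurrences) AS score
FROM words w
JOIN document_words dw ON dw.word_id = w.word_id
WHERE w.word IN ('mysql', 'fulltext')
GROUP BY dw.document_id
ORDER BY score DESC;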
I decided to use Sphinx, as suggested by Rob Di Marco. It turns out to be the fastest (and open-source) full-text search engine out there. I had some trouble compiling SphinxSE and getting it not to crash MySQL, so I now use MariaDB, which includes the plugin.
I chose version 1.10 because of its real-time (RT) indexes, which means there is no need to wait for the indexer to rebuild the entire index when you just add a row. (I know about the main+delta workaround, but this is far easier to configure and use with SphinxQL.)
See also Some questions related to SphinxSE and RT indexes

Doctrine Searchable vs. Mysql Fulltext

I have a CSV file with 3x10^6 rows, mostly title and body; the text file is 3 GB+.
I'm a little scared of the fact that Searchable will almost duplicate the data. Can anyone please advise me on the advantages of Searchable over a MySQL fulltext index on the title and body columns?
After detailed investigation, I found that the search functions are not fully implemented in Doctrine.
So, if you use Searchable, you only get an index table. After that, you have to write your own functions to use that table for complex search queries. Doctrine only provides a basic search function, which can only search for one keyword. (By the way, I am talking about Doctrine 1.2; I am not sure what Doctrine 2.0 offers.)
On the other hand, MySQL Fulltext search has everything you need. You can even use boolean searches easily.

Wildcard search on column(s) in a large table (>10.000.000 rows) in MySQL

Which techniques would you use to implement a search on the contents of a column in a very big table in MySQL? Say, for instance, that you have 10,000,000 emails stored in a table and would like to implement a subject search that lets the user search for one or more words present in the email subject. If the user searched for "christmas santa", you should find emails with subjects like "Santa visits us this christmas" and "christmas, will santa ever show".
My idea is to process all the words in the subjects (stripping all numbers, special characters, commas, etc.) and save each word in an index table with a unique index on the word column. Then I would link that to the email table through a many-to-many relationship table.
Is there a better way to perform wildcard searches on very big tables?
Are there databases that natively support this kind of search?
You could use FULLTEXT indexes if you are using MyISAM as the storage engine. However, MySQL in general is not very good with text search.
A much better option would be to go with a dedicated text indexing solution such as Lucene or Sphinx. Personally I'd recommend Sphinx - it has great integration with PHP and MySQL and is very, very fast (can be used to speed up even ordinary queries - performs very fast grouping and ordering).
Wikipedia has a nice list of different indexing engines.
MySQL's MyISAM tables support a FULLTEXT index, which helps in this kind of search.
But it's not the speediest technology available for this kind of search. And you can't use it on data stored in InnoDB tables.
I've heard some good things about Sphinx Search, but I haven't used it yet.
Here's another blog about Sphinx: http://capttofu.livejournal.com/13037.html
While a MySQL fulltext index is possible, I suspect I would look at using something designed to be a search engine, like Lucene.
This sounds like a full text search, which SQL Server supports.
But your idea is generally sound. You're effectively computing an "index" on your table in advance to speed up searches.
You want to look at the MATCH...AGAINST function.
See, for example: Using MySQL Full-text Searching
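As a minimal sketch, assuming a table named emails with a subject column:

-- One-time full-text index on the subject column (MyISAM, or InnoDB from MySQL 5.6).
ALTER TABLE emails ADD FULLTEXT INDEX ft_subject (subject);

-- Natural language mode matches rows containing either word and ranks
-- rows containing both words higher.
SELECT id, subject FROM emails
WHERE MATCH(subject) AGAINST('christmas santa');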
check "full text search" in MySQL docs (AFAIK, all current DBMS support this)