Best practices for seaching for alternate forms of a word with Lucene

Best practices for seaching for alternate forms of a word with Lucene - language-agnostic

I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
Some examples:
Searches for "car", also add "cars" to the query.
Searches for "carry", also add "carries" and "carried" to the query.
Searches for "small", also add "smaller" and "smallest" to the query.
Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
Drawbacks:
Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
I'm not sure about the performance of searching the thesaurus file.
Generate the alternate forms algorithmically, based on some heuristics.
Some examples:
If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix
If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y"
If the word ends in "y", convert to "ies", "ied", "ier", and "iest"
Try adding "s", "es", "er" and "est" to the word.
Drawbacks:
Generates lots of non-words for most inputs.
Feels like a hack.
Looks like something you'd find on TheDailyWTF.com. :)
Something much more sophisticated?
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").

Consider including the PorterStemFilter in your analysis pipeline. Be sure to perform the same analysis on queries that is used when building the index.
I've also used the Lancaster stemming algorithm with good results. Using the PorterStemFilter as a guide, it is easy to integrate with Lucene.

Word stemming works OK for English, however for languages where word stemming is nearly impossible (like mine) option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene that seems to work very well.

Some of those look like pretty neat ideas. Personally, I would just add some tags to the query (query transformation) to make it fuzzy, or you can use the builtin FuzzyQuery, which uses Levenshtein edit distances, which would help for mispellings.
Using fuzzy search 'query tags', Levenshtein is also used. Consider a search for 'car'. If you change the query to 'car~', it will find 'car' and 'cars' and so on. There are other transformations to the query that should handle almost everything you need.

If you're working in a specialised field (I did this with horticulture) or with a language that does't play nicely with normal stemming methods you could use the query logging to create a manual stemming table.
Just create a word -> stem mapping for all the mismatches you can think of / people are searching for, then when indexing or searching replace any word that occurs in the table with the appropriate stem. Thanks to query caching this is a pretty cheap solution.

Stemming is a pretty standard way to address this issue. I've found that the Porter stemmer is way to aggressive for standard keyword search. It ends up conflating words together that have different meanings. Try the KStemmer algorithm.

Related

Loose searching, e.g. such that "htlm" would find "html5"

I have a huge database with keywords such as html, html5, xhtml, and so on.
The user can search for rooms and as of now it is merely just implemented as
[...] WHERE name LIKE '%keyword%' LIMIT 20;
This is a simple solution to start with, but it is not fault-tolerant. And users make a lot of faults. To enhance this, I would like to introduce a "loose search", meaning that if "html" returns no or only few (less than, say, 10) matches it adds "html" and similar to the list.
The real question now is: How do I do that?
Does this 'loose searching' has a technical term?

This is definitely part of text retrieval and is also called fuzzy matching or approximate string matching. For instance, go to Google, type "MSYQL" and it will recommend "MYSQL" instead.
Here is a typical approach. Start with a list of all valid keywords. Yes, that is the place to begin. In many text applications, this would be called a lexicon.
The look for your search term(s) in the list of valid keywords. If you do not find any, then use something called "Levenshtein distance" (described here) to find the closest matches. Then use these in your search. If you search for "Levenshtein distance mysql", you will find implementations of the algorithm here.
If you have just a few known misspellings, then you can also solve the problem with a thesaurus. This replaces one search term with other terms that might match.

How to integrate "Did you mean" functionality in rails?

How can you implement the "Did you mean: " like Google does in some search queries?
PS: I am using sphinx in my product. Can you suggest how can I implement this. Any guides or suggestions for some other search engines who has this functionality are most welcomed.
I am using rails2.3.8, if that helps
One Solution can be:
Make a dictionary of known "keywords" or "phrases", and in search action if nothing is found then run a secondary query in that dictionary. Update that dictionary whenever a searchable entry is created say, a blog post or username.
query = "supreman"
dictionary = ["superman", "batman", "hanuman" ...] (in DB table)
search(query)
if no results, then
search in dictionary (where "keyword" LIKE query or "phrase" LIKE query) => "superman"
Check in sphinx or solr documentation. They might have a better implementation of this "Like" query which returns a % match.
display -> Did you mean "superman"?
But the point is how to make it efficient?

Have a look at the Damerau-Levenshtein distance algorithm. It calculates the "distance" between two strings and determines how many steps it takes to transform one string into another. The less steps the closer the two strings are.
This article shows the algorithm implemented as a MySQL stored function.
The algorithm is so much better than LIKE or SOUNDEX.
I believe Google uses crowd sourced data rather than an algorithm. ie if a user types in abcd, clicks on the back button and then immediately searches for abd then it establishes a relationship between the two search terms as the user wasn't happy with the results. Once you have a very large community searching then the pattern appears.

You should take a look at the actual theory of how Google implements something like this: How to Write a Spelling Corrector.
Although that article is written in Python, there are links to implementations in other languages at the bottom of the article. Here is a Ruby implementation.

I think you're looking for a string match algorithms.
I remember mislav's gist used to raise errors when initialize was slightly misspelled. That might be a good read.
Also, take a look at some of the articles he suggests:
http://www.catalysoft.com/articles/StrikeAMatch.html
http://www.catalysoft.com/articles/MatchingSimilarStrings.html

Now a days did you mean feature is implemented based on phonetic spell corrector. When we misspell we generally write phonetically similar words. Based on this idea phonetic spell corrector searches its database for the most similar word. Similarity ties are broken using context(for a multi-word query other words also help in deciding the correct word) and popularity of the word. If two words are phonetically very close to the misspelled word than the word which fits the context and is more frequently used in daily life is chosen.

this is working for me:
SELECT * FROM table_name WHERE soundex(field_name) LIKE CONCAT('%', soundex('searching_element'), '%')

Setting Up an Easily Searchable MySQL Database for Word Searches

I have appx. 2TB of text that I want to turn into a searchable database, where I will usually be searching to see if 2-4 word expressions exist in the database (for instance I might do a search to see if the phrase "these are four words", or "three consecutive words" appears anywhere in the text).
These searches will happen very often so it is very important that I setup the database to use as little processing as possible. I'd also want to minimize the overhead as much as possible so I can lower the amount of database servers I'll need.
Does anybody have any suggestions as to how I should setup this database?
For instance I was thinking of doing a linked list that was organized |id|word1|word2| (with all three beings keys) so for the expression "these are four words", I'd first search "these are", then I'd search "are four", check to see if any matches for "these are" are 1 id lower than "are four", and then do the same thing for "four words". But I think there has to be a more efficient way of doing it.
EDIT: The ONLY thing I will be using this database for is doing these 2-4 word exact match searches, and it is meant for internal use. All I want this database to be able to do is let me know if a 2-4 word expression exists somewhere in all of my files of information, and nothing more.

Does anybody have any suggestions as
to how I should setup this database?
Personally, I'd first rule out the possibility of using MySQL's full-text search, and every Open Source, full-text search engine. There's a list of Open Source search engines on Wikipedia. I'd also rule out using Google Custom Search. Heck, I'd even consider a commercial product before I'd try rolling my own.
At the very least, studying their code might give you some ideas about index structure.
If you're thinking of building a linked list in SQL, well, you might want to build a tiny test before you get too far into it. I don't think it will be practical, but I could be wrong.
It takes a lot of work to do full-text search really well. (Think about proximity searches—find "there are" within 3 words of "many ways to fail". ) Reinventing this wheel might not be the best use of your time.

MySQL search tactics

I'm implementing a search on my project, requirements:
keywords could be found at two columns (tags & titles) from different tables
#1 search tags for exact matching
if empty, 2# search tags using MySQL SOUND LIKE feature
if empty, 3# for each word search titles using LIKE='%word%'
intersect titles results
Don't know if you get the idea. Hope so.
Now the problem. I read in a book that querys using LIKE and % at the begging wont use indexes for speed up.
If this is true, my formula will get slow while tags and titles increase.
What would you do in terms of performance?
Thanks,

Assuming MyISAM tables, you might want to look into implementing full-text search as an alternative to the LIKE.

I think in any "linear" type of search formula adding things like keywords/tags to search for will eventually cause it to become alot slower as far as searching goes. Isn't that the main problem with Relational Databases anyways? (correct me if im wrong)

If you want to stick with just your 2 tables and you want to do fuzzy text matching, have a look at the MATCH ... AGAINST. It will do more or less what you want. But there are some caveats.
FULLTEXT indices can only be implemented on MyISAM tables, not INNODB
if your text does follow standard word distribution, your results may not be optimal
If you're willing to go beyond a relational approach, an indexer like Solr might work for you.

How to search for text fragments in a database

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?
Background of the question is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the contents, so a fulltext index (which is based on word boundaries) would not help.
EDIT: [Added to make clear why these first suggestions would not solve the problem:]
This is why MySQL's built in fulltext index will not do the job, and neither will Lucene or Sphinx, all of which were suggested in the answers. I already looked at both those, but as far as I can tell, these are based on indexing words, excluding stop words and doing all sorts of sensible things for a real fulltext search. However this is not suitable, because I might be looking for a search term like "oison" which must match "Roisonic Street" as well as "Poison-Ivy". The key difference here is that the search term is just a fragment of the column content, that need not be delimited by any special characters or white space.
EDIT2: [Added some more background info:]
The requested feature that is to be implemented based on this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the name of the item. Unfortunately the quality of these descriptions is rather low, they come from a legacy system and cannot be changed easily. If for example people were searching for a sledge hammer they would enter "sledge". With a word/token based index this would not find matches that are stored as "sledgehammer", but only those listen "sledge hammer". There are all kinds of weird variances that need to be covered, making a token based approach impractical.
Currently the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index use and requiring lots of resources and time.
Ideally any such tool would create an index that allowed me to get results for suchlike queries very quickly, so that I could implement a spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user picks a result record.
If possible the index should be updatable (without needing a full rebuild), because data might change and should be available for search immediately by other clients.
I would be glad to get recommendations and/or experience reports.
EDIT3: Commercial solution found that "just works"
Even though I got a lot of good answers for this question, I wanted to note here, that in the end we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way, because it might appear like that when I go on and describe what their product can do. Unfortunately their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, no downloads - and I am extremely impressed.
As there is no comprehensive documentation available online, I will try and describe my experiences so far.
What they do is build a custom index file based on database content. They can integrate via ODBC, but from what I am told customers rarely do that. Instead - and this is what we will probably do - you generate a text export (like CSV) from your primary database and feed that to their indexer. This allows you to be completely independent of the actual table structure (or any SQL database at all); in fact we export data joined together from several tables. Indexes can be incrementally updated later on the fly.
Based on that their server (a mere 250kb or so, running as a console app or Windows service) serves listens for queries on a TCP port. The protocol is text based and looks a little "old", but it is simple and works. Basically you just pass on which of the available indexes you want to query and the search terms (fragments), space delimited.
There are three output formats available, HTML/JavaScript array, XML or CSV. Currently I am working on a Java wrapper for the somewhat "dated" wire protocol. But the results are fantastic: I currently have a sample data set of approximately 500.000 records with 8 columns indexed and my test application triggers a search across all 8 columns for the contents of a JTextField on every keystroke while being edited and can update the results display (JTable) in real-time! This happens without going to the MySQL instance the data originally came from. Based on the columns you get back, you can then ask for the "original" record by querying MySQL with the primary key of that row (needs to be included in the QuickFind index, of course).
The index is about 30-40% the size of the text export version of the data. Indexing was mainly bound by disk I/O speed; my 500.000 records took about a minute or two to be processed.
It is hard to describe this as I found it even hard to believe when I saw an in-house product demo. They presented a 10 million row address database and searched for fragments of names, addresses and phone numbers and when hitting the "Search" button, results came back in under a second - all done on a notebook! From what I am told they often integrate with SAP or CRM systems to improve search times when call center agents just understand fragments of the names or addresses of a caller.
So anyway, I probably won't get much better in describing this. If you need something like this, you should definitely go check this out. Google Translate does a reasonably good job translating their website from German to English, so this might be a good start.

This may not be what you want to hear, because I presume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build up fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java so it should give you exactly the interface you need.
If you were a Microsoft shop, the majority of what you're looking for is built into SQL Server, and wildcards can be enabled which will give you the ability to do partial word matches.
In Lucene and Lucene.Net, you can use wildcard matches if you like. However, it's not supported to use wildcards as the first symbol in a search. If you want the ability to use first character wildcards, you'll probably need to implement some sort of trie-based index on your own, since it's an expensive operation in many cases to filter the set of terms down to something reasonable for the kind of index most commonly needed for full text search applications, where suffix stemming is generally more valuable.
You can apparently alter the Query Parser instance in Lucene to override this rule by setting setAllowLeadingWildcard to true.
I'm fairly sure that wildcard-on-both-ends-of-a-word searches are inherently inefficient. Skip lists are sometimes used to improve performance on such searches with plaintext, but I think you're more likely to find an implementation like that in something like grep than a generalized text indexing tool.
There are other solutions for the problem that you describe where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for example. Orthographic and morphological variants can be handled using either by providing a filter that offers suggestions based on some sort of Bayesian mechanism, or by indexing tricks, namely, taking a corpus of frequent variants and stuffing the index with those terms. I've even seen knowledge from structured data stuffed into the full text engine (e.g. adding city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris Hotels" will include a record for the pension-house Caisse des Dépôts.) While not exactly a trivial problem, it's manageable without destroying the advantages of word-based searches.

I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.
Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.
The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)
Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:
<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.
You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>
Then tell it to copy the original field into the fancy one:
<copyField source="title" dest="title_ngrams"/>
Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.
That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!
Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.
Good luck!

If your table is MyISAM, you can use MySQL's full text search capabilites: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
If not, the "industry standard" is http://www.sphinxsearch.com/
Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/
Also, a good presentation that introduces Sphinx and explains architecture+usage
http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation
Update
Having read your clarification to the question -- Sphinx can do substring matches. You need to set "enable-star" and create an infix index with the appropriate min_infix_length (1 will give you all possible substrings, but obviously the higher the set it, the smaller your index will be, and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.

I'd use Apache Solr. The indexing strategy is entirely tunable (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), can incrementally read directly from your database to populate the index (see DataImportHandler in the same wiki), and can be queried from basically any language that speaks HTTP and XML or something like JSON.

what about using tools such as proposed above (lucene etc.) for full text indexing and having LIKE search for cases, where nothing was found? (i.e. run LIKE only after fulltext indexed search returned zero results)

What you're trying to do is unlikely to ever be all that much faster than LIKE '%searchterm%' without a great deal of custom code. The equivalent of LIKE 'searchterm%' ought to be trivial though. You could do what you're asking by building an index of all possible partial words that aren't covered by the trailing wild-card, but this would result in an unbelievably large index size, and it would be unusually slow for updates. Long tokens would result in Bad Things™. May I ask why you need this? Re: Spotlight... You do realize that Spotlight doesn't do this, right? It's token-based just like every other full-text indexer. Usually query expansion is the appropriate method of getting inexact matches if that's your goal.
Edit:
I had a project exactly like this at one point; part-numbers for all kinds of stuff. We finally settled on searchterm* in Xapian, but I believe Lucene also has the equivalent. You won't find a good solution that handles wild-card searches on either side of the token, but a trailing wild-card is usually more than good enough for what you want, and I suspect you'll find that users adapt to your system fairly quickly if they have any control over cleaning up the data. Combine it with query expansion (or even limited token expansion) and you should be pretty well set. Query expansion would convert a query for "sledgehammer" into "sledgehammer* OR (sledge* hammer*)" or something similar. Not every query will work, but people are already pretty well trained to try related queries when something doesn't work, and as long as at least one or two obvious queries come up with the results they expect, you should be OK. Your best bet is still to clean up the data and organize it better. You'd be surprised how easy this ends up being if you version everything and implement an egalitarian edit policy. Maybe let people add keywords to an entry and be sure to index those, but put limits on how many can be set. Too many and you may actually degrade the search results.

Shingle search could do the trick.
http://en.wikipedia.org/wiki/W-shingling
For example, if you use 3-character shingles, you can split "Roisonic" to: "roi", "son", "ic ", and store all three values, associating them with original entry. When searching for "oison", you first will search for "ois", "iso", "son". First you fuzzy-match all entries by shingles (finding the one with "son"), and then you can refine the search by using exact string matching.
Note that 3-character shingle require the fragment in query to be at least 5 characters long, 4-char shingle requires 7-char query and so on.

The exact answer to your question is right here Whether it will perform sufficiently well for the size of your data is another question.

I'm pretty sure Mysql offers a fulltext option, and it's probably also possible to use Lucene.
See here for related comments
Best efficient way to make a fulltext search in MySQL

A "real" full text index using parts of a word would be many times bigger than the source text and while the search may be faster any update or insert processing would be horibly slow.
You only hope is if there is some sort of pattern to the "mistakes' made. You could apply a set of "AI" type rules to the incoming text and produce cannonical form of the text which you could then apply a full text index to. An example for a rule could be to split a word ending in hammer into two words s/(\w?)(hammer)/\1 \2/g or to change "sledg" "sled" and "schledge" to "sledge". You would need to apply the same set of rules to the query text. In the way a product described as "sledgehammer" could be matched by a search for ' sledg hammer'.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008