Efficient Filtering / Searching - mysql

We have a hosted application that manages pages of content. Each page can have a number of customized fields, and some standard fields (timestamp, user name, user email, etc).
With potentially hundreds of different sites using the system -- what is an efficient way to handle filtering/searching? Picture a grid view that you want to narrow down. You can filter on specific fields (userid, date) or you can enter a full-text search.
For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But things like "all pages started by a user whose userid is 10 and matches [some search query]" would suck against the database, so it's suited for a search engine like Lucene.
Basically I'm wondering how other large sites do this sort of thing. Do they utilize a search engine 100% for all types of filtering? Do they mix database queries with a search engine?
If we use only a search engine, there's a problem with the delay time it takes for a new/updated object to appear in the search index. That is, I've read that it's not smart to update the index immediately, and to do it in batches instead. Even if this means every 5 minutes, users will get confused when their recently added page isn't immediately listed when they view a simple page listing (say a search query of "category:5").
We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?
My thought is to offer a simple filtering page which uses MySQL to filter on basic fields. Then offer a separate fulltext search page that would present results similar to Google. Is this the only way?

Solr or grassyknoll both provide slightly more abstract interfaces to Lucene.
That said: Yes. If you are a primarily content driven site, providing fulltext searching over your data, there is something in play beyond LIKE. While MySql's FULLTEXT indexies aren't perfect, it might be an acceptable placeholder in the interim.
Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward, simply add a stored property to the document at index time (this property can be a url, ID, GUID etc.) Then, searching becomes a 2 phase system:
1) Issue query to Lucene indexies (Display simple results like title)
2) Get more detailed information about the object from your relational stores by its key
Since instantiation of Documents is relatively expensive in Lucene, you only want to store fields searched in the Lucene index, as opposed to complete clones of your relational objects.

Don't write-off MySQL so readily!
Implement it using the database e.g. a select with a 'like' in the where-clause or whatever.
Profile it, add indexes if necessary. Roll out a beta, so you get real numbers from user's actual data patterns - not all columns might be equally asked after, etc.
If the performance does suck, then thats when you consider other options. You can consider tuning your SQL, your database, the machine the database is running on, and finally using another technology stack...

In case you want to use MySQL or PostgreSQL, a open source solution that works great with it is Sphinx:
http://www.sphinxsearch.com/
We are having the same problem and considering Sphinx and Lucene as possible solutions.

Related

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields on each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small size of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer to be answered in all aspects. There are very well certain specificas that may make one system superior to another for a special use case, but I want to cover the basics here.
I will deal entirely with Solr as an example for several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There are a list of facts why but they mostly consist of missing recovery options, lack of acid transactions, possible complications etc. If you decide to use solr, you need to populate your index from another source like an SQL table. In fact solr is perfect for storing documents that include data from several tables and relations, that would otherwise requrie complex joins to be constructed.
Solr/Lucene provides mind blowing text-analysis / stemming / full text search scoring / fuzziness functions. Things you just can not do with MySQL. In fact full text search in MySql is limited to MyIsam and scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, score results based on phrase proximity, matching accurazy etc is very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process. Well you can of course index the keys of other documents inside a multivalued field of some document so this way you can actually store 1:n relations and do it both ways to get n:n, but its data overhead. Don't get me wrong, its perfectily fine and efficient for a lot of purposes (for example for some product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors or something). But you reach the end of possibilities with HAS / HAS NOT. You can almonst not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice facetting features and post search analysis. For example: After a very broad search that had 40000 hits you can display that you would only get 3 hits if you refined your search to the combination of having this field this value and that field that value. Stuff that need additional queries in MySQL is done efficiently and convinient.
So let's sum up
The power of Lucene is text searching/analyzing. It is also mind blowingly fast because of the reverse index structure. You can really do a lot of post processing and satisfy other needs. Altough it's document oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and to query. If your application is focused on text searching you should definitely go for Solr/Lucene if you haven't good reasons, like very complex, multi-dmensional range filter queries, to do otherwise.
If you do not have text-search but rather something where you can point and click something but not enter text, good old relational databases are probably a better way to go.
Use Solr if:
You do not want to stress your database.
Get really full text search.
Perform lightning fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magick for full text indexing, which is difficult to achieve with Mysql. A mix of Mysql and Solr can be used: Mysql for CRUD operations and Solr for searches. I have previusly worked with one of India's best real estate online classifieds portal which was using Solr for search ( and was previously using Mysql). The migration reduced the search times manifold.
Solr can be easily integrated with Mysql:
Solr Full Dataimport can be used for importing data from Mysql tables into Solr collections.
Solr Delta import can be scheduled at short frequencies to load latest data from Mysql to Solr collections.

Search Short Fields Using Solr, Etc. or Use Straight-Forward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySql table [word, entity_id], have 'word' indexed, and then query using
select entity_id from search_index where word like '[query_word]%
This obviously requires me to break down each title to its words and add a row for each word.
Use Solr or some similar search engine, which from my reading are more oriented towards full text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future.
Thank you!
Pro's of a Database Only Solution:
Less set up and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (if you periodically sync Solr with your database) or maintenance procedure (if you opt to add/update entries in Solr in real time everywhere you insert them into the database)
Pro's of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure if you would need to create an entry for every word in your database, vs. just '%[query_word]%' search if you are going to create records with each word anyway. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database performance.
I can tell you we use Solr on site and we love the performance and we use it for even very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.

Scalable Full Text Search With Per User Result Ordering

What options exist for creating a scalable, full text search with results that need to be sorted on a per user basis? This is for PHP/MySQL (Symfony/Doctrine as well, if relevant).
In our case, we have a database of workouts that have been performed by users. The workouts that the user has done before should appear at the top of the results. The more frequently they've done the workout, the higher it should appear in search matches. If it helps, you can assume we know the number of times a user has done a workout in advance.
Possible Solutions
Sphinx - Use Sphinx to implement full text search, do all the querying and sorting in MySQL. This seems promising (and there's a Symfony Plugin!) but I don't know much about it.
Lucene - Use Lucene to perform full text search and put the users' completions into the query. As is suggested in this Stack Overflow thread. Alternatively, use Lucene to retrieve the results, then reorder them in PHP. However, both solutions seem clunky and potentially unscalable as a user may have completed hundreds of workouts.
Mysql - No native full text support (InnoDB), so we'd have use LIKE or REGEX, which isn't scalable.
MySQL does have a native FULLTEXT support, though only in MyISAM tables.
For most real-world tasks, Sphinx is the fastest engine. However, it is an external index, so it can only be updated on a timely basis with a cron script.
By using SphinxSE (a pluggable MySQL interface to Sphinx), you can join MySQL tables and Sphinx indexes in one query. Updating, though, will still require an external script.
Since the number of workouts performed seems to change frequently, keeping it in Sphinx would require too much effort on rebuilding the index.
With SphinxSE, you can write a query similar to that:
SELECT *
FROM workouts w
JOIN user_workouts uw
ON uw.workout = w.id
WHERE w.query = 'query query query;filter=user_id,$user_id'
AND uw.user = $user_id
ORDER BY
uw.times_performed DESC
I'm not sure why you're assuming using Lucene would be unscalable. Hundreds of workouts per user is not a lot of data to deal with.
Try using Solr/Lucene for the search backend. It has a JSON/XML interface which will play nicely with your PHP frontend. Store a user's completed workout # in a database table. When a query is issued, take the results from Solr, and you can select from the database table and resort in PHP code. Should be plenty fast and scalable. With Solr, maintaining your index is dirt simple; just issue add/update/delete requests to your Solr server.

Search implementation dilemma: full text vs. plain SQL

I have a MySQL/Rails app that needs search. Here's some info about the data:
Users search within their own data only, so searches are narrowed down by user_id to begin with.
Each user will have up to about five thousand records (they accumulate over time).
I wrote out a typical user's records to a text file. The file size is 2.9 MB.
Search has to cover two columns: title and body. title is a varchar(255) column. body is column type text.
This will be lightly used. If I average a few searches per second that would be surprising.
It's running an a 500 MB CentOS 5 VPS machine.
I don't want relevance ranking or any kind of fuzziness. Searches should be for exact strings and reliably return all records containing the string. Simple date order -- newest to oldest.
I'm using the InnoDB table type.
I'm looking at plain SQL search (through the searchlogic gem) or full text search using Sphinx and the Thinking Sphinx gem.
Sphinx is very fast and Thinking Sphinx is cool, but it adds complexity, a daemon to maintain, cron jobs to maintain the index.
Can I get away with plain SQL search for a small scale app?
I think plain SQL search won't be the good choice. Because of when we fetching text type columns in MySQL the request is always falling to hard drive no matter what cache settings are.
You can use plain SQL search only with very small apps.
I'd prefer Sphinx for that.
I would start out simple -- chances are that plain SQL will work well, and you can always switch to full text search later if the search function proves to be a bottleneck.
I'm developing and maintaining an application with a search function with properties similar to yours, and plain SQL search has worked very well for me so far. I had similar performance concerns when I first implemented the search function a year or two ago, but I haven't seen any performance problems whatsoever yet.
Having used MySQL fulltext search for about 4 years, and just moving now to Sphinx, I'd say that a regular MySQL search using the fulltext boolean (ie exact) syntax will be fine. It's fast and it will do exactly what you want. The amount of data you will be searching at any one time will be small.
The only problem might be ordering the results. MySQL's fulltext search can get slow when you start ordering things by (eg) date, as that requires that you search the entire table, rather than just the first nn results it finds. That was ultimately the reason I moved to Sphinx.
Sphinx is also awesome, so don't be afraid to try it, but it sounds like the additional functionality may not be required in your case.

How to search for text fragments in a database

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?
Background of the question is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the contents, so a fulltext index (which is based on word boundaries) would not help.
EDIT: [Added to make clear why these first suggestions would not solve the problem:]
This is why MySQL's built in fulltext index will not do the job, and neither will Lucene or Sphinx, all of which were suggested in the answers. I already looked at both those, but as far as I can tell, these are based on indexing words, excluding stop words and doing all sorts of sensible things for a real fulltext search. However this is not suitable, because I might be looking for a search term like "oison" which must match "Roisonic Street" as well as "Poison-Ivy". The key difference here is that the search term is just a fragment of the column content, that need not be delimited by any special characters or white space.
EDIT2: [Added some more background info:]
The requested feature that is to be implemented based on this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the name of the item. Unfortunately the quality of these descriptions is rather low, they come from a legacy system and cannot be changed easily. If for example people were searching for a sledge hammer they would enter "sledge". With a word/token based index this would not find matches that are stored as "sledgehammer", but only those listen "sledge hammer". There are all kinds of weird variances that need to be covered, making a token based approach impractical.
Currently the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index use and requiring lots of resources and time.
Ideally any such tool would create an index that allowed me to get results for suchlike queries very quickly, so that I could implement a spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user picks a result record.
If possible the index should be updatable (without needing a full rebuild), because data might change and should be available for search immediately by other clients.
I would be glad to get recommendations and/or experience reports.
EDIT3: Commercial solution found that "just works"
Even though I got a lot of good answers for this question, I wanted to note here, that in the end we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way, because it might appear like that when I go on and describe what their product can do. Unfortunately their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, no downloads - and I am extremely impressed.
As there is no comprehensive documentation available online, I will try and describe my experiences so far.
What they do is build a custom index file based on database content. They can integrate via ODBC, but from what I am told customers rarely do that. Instead - and this is what we will probably do - you generate a text export (like CSV) from your primary database and feed that to their indexer. This allows you to be completely independent of the actual table structure (or any SQL database at all); in fact we export data joined together from several tables. Indexes can be incrementally updated later on the fly.
Based on that their server (a mere 250kb or so, running as a console app or Windows service) serves listens for queries on a TCP port. The protocol is text based and looks a little "old", but it is simple and works. Basically you just pass on which of the available indexes you want to query and the search terms (fragments), space delimited.
There are three output formats available, HTML/JavaScript array, XML or CSV. Currently I am working on a Java wrapper for the somewhat "dated" wire protocol. But the results are fantastic: I currently have a sample data set of approximately 500.000 records with 8 columns indexed and my test application triggers a search across all 8 columns for the contents of a JTextField on every keystroke while being edited and can update the results display (JTable) in real-time! This happens without going to the MySQL instance the data originally came from. Based on the columns you get back, you can then ask for the "original" record by querying MySQL with the primary key of that row (needs to be included in the QuickFind index, of course).
The index is about 30-40% the size of the text export version of the data. Indexing was mainly bound by disk I/O speed; my 500.000 records took about a minute or two to be processed.
It is hard to describe this as I found it even hard to believe when I saw an in-house product demo. They presented a 10 million row address database and searched for fragments of names, addresses and phone numbers and when hitting the "Search" button, results came back in under a second - all done on a notebook! From what I am told they often integrate with SAP or CRM systems to improve search times when call center agents just understand fragments of the names or addresses of a caller.
So anyway, I probably won't get much better in describing this. If you need something like this, you should definitely go check this out. Google Translate does a reasonably good job translating their website from German to English, so this might be a good start.
This may not be what you want to hear, because I presume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build up fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java so it should give you exactly the interface you need.
If you were a Microsoft shop, the majority of what you're looking for is built into SQL Server, and wildcards can be enabled which will give you the ability to do partial word matches.
In Lucene and Lucene.Net, you can use wildcard matches if you like. However, it's not supported to use wildcards as the first symbol in a search. If you want the ability to use first character wildcards, you'll probably need to implement some sort of trie-based index on your own, since it's an expensive operation in many cases to filter the set of terms down to something reasonable for the kind of index most commonly needed for full text search applications, where suffix stemming is generally more valuable.
You can apparently alter the Query Parser instance in Lucene to override this rule by setting setAllowLeadingWildcard to true.
I'm fairly sure that wildcard-on-both-ends-of-a-word searches are inherently inefficient. Skip lists are sometimes used to improve performance on such searches with plaintext, but I think you're more likely to find an implementation like that in something like grep than a generalized text indexing tool.
There are other solutions for the problem that you describe where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for example. Orthographic and morphological variants can be handled using either by providing a filter that offers suggestions based on some sort of Bayesian mechanism, or by indexing tricks, namely, taking a corpus of frequent variants and stuffing the index with those terms. I've even seen knowledge from structured data stuffed into the full text engine (e.g. adding city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris Hotels" will include a record for the pension-house Caisse des Dépôts.) While not exactly a trivial problem, it's manageable without destroying the advantages of word-based searches.
I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.
Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.
The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)
Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:
<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.
You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>
Then tell it to copy the original field into the fancy one:
<copyField source="title" dest="title_ngrams"/>
Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.
That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!
Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.
Good luck!
If your table is MyISAM, you can use MySQL's full text search capabilites: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
If not, the "industry standard" is http://www.sphinxsearch.com/
Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/
Also, a good presentation that introduces Sphinx and explains architecture+usage
http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation
Update
Having read your clarification to the question -- Sphinx can do substring matches. You need to set "enable-star" and create an infix index with the appropriate min_infix_length (1 will give you all possible substrings, but obviously the higher the set it, the smaller your index will be, and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.
I'd use Apache Solr. The indexing strategy is entirely tunable (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), can incrementally read directly from your database to populate the index (see DataImportHandler in the same wiki), and can be queried from basically any language that speaks HTTP and XML or something like JSON.
what about using tools such as proposed above (lucene etc.) for full text indexing and having LIKE search for cases, where nothing was found? (i.e. run LIKE only after fulltext indexed search returned zero results)
What you're trying to do is unlikely to ever be all that much faster than LIKE '%searchterm%' without a great deal of custom code. The equivalent of LIKE 'searchterm%' ought to be trivial though. You could do what you're asking by building an index of all possible partial words that aren't covered by the trailing wild-card, but this would result in an unbelievably large index size, and it would be unusually slow for updates. Long tokens would result in Bad Things™. May I ask why you need this? Re: Spotlight... You do realize that Spotlight doesn't do this, right? It's token-based just like every other full-text indexer. Usually query expansion is the appropriate method of getting inexact matches if that's your goal.
Edit:
I had a project exactly like this at one point; part-numbers for all kinds of stuff. We finally settled on searchterm* in Xapian, but I believe Lucene also has the equivalent. You won't find a good solution that handles wild-card searches on either side of the token, but a trailing wild-card is usually more than good enough for what you want, and I suspect you'll find that users adapt to your system fairly quickly if they have any control over cleaning up the data. Combine it with query expansion (or even limited token expansion) and you should be pretty well set. Query expansion would convert a query for "sledgehammer" into "sledgehammer* OR (sledge* hammer*)" or something similar. Not every query will work, but people are already pretty well trained to try related queries when something doesn't work, and as long as at least one or two obvious queries come up with the results they expect, you should be OK. Your best bet is still to clean up the data and organize it better. You'd be surprised how easy this ends up being if you version everything and implement an egalitarian edit policy. Maybe let people add keywords to an entry and be sure to index those, but put limits on how many can be set. Too many and you may actually degrade the search results.
Shingle search could do the trick.
http://en.wikipedia.org/wiki/W-shingling
For example, if you use 3-character shingles, you can split "Roisonic" to: "roi", "son", "ic ", and store all three values, associating them with original entry. When searching for "oison", you first will search for "ois", "iso", "son". First you fuzzy-match all entries by shingles (finding the one with "son"), and then you can refine the search by using exact string matching.
Note that 3-character shingle require the fragment in query to be at least 5 characters long, 4-char shingle requires 7-char query and so on.
The exact answer to your question is right here Whether it will perform sufficiently well for the size of your data is another question.
I'm pretty sure Mysql offers a fulltext option, and it's probably also possible to use Lucene.
See here for related comments
Best efficient way to make a fulltext search in MySQL
A "real" full text index using parts of a word would be many times bigger than the source text and while the search may be faster any update or insert processing would be horibly slow.
You only hope is if there is some sort of pattern to the "mistakes' made. You could apply a set of "AI" type rules to the incoming text and produce cannonical form of the text which you could then apply a full text index to. An example for a rule could be to split a word ending in hammer into two words s/(\w?)(hammer)/\1 \2/g or to change "sledg" "sled" and "schledge" to "sledge". You would need to apply the same set of rules to the query text. In the way a product described as "sledgehammer" could be matched by a search for ' sledg hammer'.