I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields on each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small size of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer to be answered in all aspects. There are very well certain specificas that may make one system superior to another for a special use case, but I want to cover the basics here.
I will deal entirely with Solr as an example for several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There are a list of facts why but they mostly consist of missing recovery options, lack of acid transactions, possible complications etc. If you decide to use solr, you need to populate your index from another source like an SQL table. In fact solr is perfect for storing documents that include data from several tables and relations, that would otherwise requrie complex joins to be constructed.
Solr/Lucene provides mind blowing text-analysis / stemming / full text search scoring / fuzziness functions. Things you just can not do with MySQL. In fact full text search in MySql is limited to MyIsam and scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, score results based on phrase proximity, matching accurazy etc is very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process. Well you can of course index the keys of other documents inside a multivalued field of some document so this way you can actually store 1:n relations and do it both ways to get n:n, but its data overhead. Don't get me wrong, its perfectily fine and efficient for a lot of purposes (for example for some product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors or something). But you reach the end of possibilities with HAS / HAS NOT. You can almonst not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice facetting features and post search analysis. For example: After a very broad search that had 40000 hits you can display that you would only get 3 hits if you refined your search to the combination of having this field this value and that field that value. Stuff that need additional queries in MySQL is done efficiently and convinient.
So let's sum up
The power of Lucene is text searching/analyzing. It is also mind blowingly fast because of the reverse index structure. You can really do a lot of post processing and satisfy other needs. Altough it's document oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and to query. If your application is focused on text searching you should definitely go for Solr/Lucene if you haven't good reasons, like very complex, multi-dmensional range filter queries, to do otherwise.
If you do not have text-search but rather something where you can point and click something but not enter text, good old relational databases are probably a better way to go.
Use Solr if:
You do not want to stress your database.
Get really full text search.
Perform lightning fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magick for full text indexing, which is difficult to achieve with Mysql. A mix of Mysql and Solr can be used: Mysql for CRUD operations and Solr for searches. I have previusly worked with one of India's best real estate online classifieds portal which was using Solr for search ( and was previously using Mysql). The migration reduced the search times manifold.
Solr can be easily integrated with Mysql:
Solr Full Dataimport can be used for importing data from Mysql tables into Solr collections.
Solr Delta import can be scheduled at short frequencies to load latest data from Mysql to Solr collections.
Related
So I'm in the process of designing a system that is going to store document type of data (i.e. transcribed documents). Immediately, I thought this would be a great opportunity to leverage a NOSQL implementation like MongoDB. However, given that I have zero experience with Mongo, I'm wondering: on each of these docuemnts, I have a number of metadata tags I want to be able to search across: things like date, author, keywords, etc. If I were to use an RDBMS like MySql, I'd probably store these items in a separate table liked by a foreign key and the index the items that were most likely to be searched on. Then I could run queries against that table and only pull back and the full text results for the items that matched (saves on disk read not having to reach through a row that contains a lot of text or BLOB information).
Would something similar be possible with Mongo? I know in Mongo I could simply create 1 document that would have all the metadata AND the actual transcription but is it easy and highly performant to search the various fields in the metadata if the document is stored like that? Is there a best practice when needing to perform searches across various items in a document in Mongo? Or is this type of scenario more suited for an RDBMS rather than a NOSQL implementation?
You can add indexes for individual fields in mongodb documents. Only when the indexes get larger than your memory, performance of index based searches may become a problem.
When you decide if to go with mongodb, keep in mind that there is no join operation. This has to be done by your db layer or above.
If your primary concern is searching, there is an ElasticSearch river for mongodb, so you can utilize ElasticSearch on your dataset.
The NoSQL model, is geared for data storage in long range (OLTP model), yes you can create indexes and perform queries that you want, instead of you having related entities across tables, you have a complete entity that owns all entities dependent on it within herself.
When you have to extract complex reports with many joins in a relational database in a context of millions of data becomes impractical such an act, because you may end up compromising other applications.
For example:
We have the room and student bodies.
Each room have many students, the relational model we would have the following:
SELECT * FROM ROOM R
INNER JOIN
S STUDENT
ON = S.ID R.STUDENTID
Imagine doing that with some 20 tables with thousands of data? His performance will be horrible.
With MongoDB you will do so:
db.sala.find (null)
And will have all their rooms with their students.
MongoDB is a database that performs scanning horizontally.
You can read:
http://openmymind.net/mongodb.pdf
This site also has an interactive tutorial that uses the book's examples. Very nice.
And here you can experience the mongodb online and test your commands.
Search for try mongo db.
Also read about shards with replicaSets. I believe it will open your mind greatly.
You can install Robomongo which is a graphical interface for you to tinker with mongodb.
http://robomongo.org/
we have a eMall application based mainly around a ~500k rows MySQL master table (with detail tables storing non searchable fields and other related tables with shop info etc).
Users can today search based on specific structured product data (e.g. brand, category, price, specific shop etc).
We would also like to support keyword search in combination with the structured data.
We also want to improve the performance of our application and are considering our infrastructure options to achieve both the functional requirement of keyword search and the technical requirement of improved speed:
Lucene, Sphinx etc to index all products?
A NoSQL db (mongo, couch etc) used as an intermediate cache layer in front of MySQL?
A NOSQL db to replace MySQL?
A combination of the above?
In the case of Lucene and Sphinx - how flexible are they in terms of combining structured criteria? Or would we need to first run a text search and then filter the results with a second structured query on mySQL?
Any hints or leasons learned from your own experiences would be more than welcome!
thanks in advance
I suggest you use Solr - It enables keyword search, based on Lucene. You can use facets and filters for your structured product data. 500 K items seem a size Solr can handle rather easily. It can be considered a NoSQL DB, and it is easier to use than pure Lucene. You can go over the relevant considerations in Full Text Search Engine versus DBMS.
I have been using Sphinx for full text searching similar to your requirements (searching based on freetext and structured attributes), with a few GB of data & 5M rows in MySQL. I am very satisfied with the performance and reliability (not even a single downtime).
The advantage in using Sphinx is it is targeted to use with MySQL, so it is really easy to setup. Normally you can have the whole system ready in less than an hour, so why not give it a try?
My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySql table [word, entity_id], have 'word' indexed, and then query using
select entity_id from search_index where word like '[query_word]%
This obviously requires me to break down each title to its words and add a row for each word.
Use Solr or some similar search engine, which from my reading are more oriented towards full text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future.
Thank you!
Pro's of a Database Only Solution:
Less set up and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (if you periodically sync Solr with your database) or maintenance procedure (if you opt to add/update entries in Solr in real time everywhere you insert them into the database)
Pro's of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure if you would need to create an entry for every word in your database, vs. just '%[query_word]%' search if you are going to create records with each word anyway. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database performance.
I can tell you we use Solr on site and we love the performance and we use it for even very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.
I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values than foreign keys to other tables. I'm re-writing the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL parsing etc each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.
Use the right tool for the job.
There are a lot of anti-RDBMSs or BASE systems (Basically Available, Soft State, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability) to choose from here and here.
I've used traditional RDBMSs and though you can store CLOBs/BLOBs, they do
not have built-in indexes customized specifically for searching these objects.
You want to do most of the work (calculating the weighted frequency for
each tuple found) when inserting a document.
You might also want to do some work scoring the usefulness of
each (documentId,searchWord) pair after each search.
That way you can give better and better searches each time.
You also want to store a score or weight for each search and weighted
scores for similarity to other searches.
It's likely that some searches are more common than others and that
the users are not phrasing their search query correctly though they mean
to do a common search.
Inserting a document should also cause some change to the search weight
indexes.
The more I think about it, the more complex the solution becomes.
You have to start with a good design first. The more factors your
design anticipates, the better the outcome.
MapReduce seems like a great way of generating the tuples. If you can get a scala job into a jar file (not sure since I've not used scala before and am a jvm n00b), it'd be a simply matter to send it along and write a bit of a wrapper to run it on the map reduce cluster.
As for storing the tuples after you're done, you also might want to consider a document based database like mongodb if you're just storing tuples.
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using lucene or solr to do what you're doing instead of writing your own?
We have a hosted application that manages pages of content. Each page can have a number of customized fields, and some standard fields (timestamp, user name, user email, etc).
With potentially hundreds of different sites using the system -- what is an efficient way to handle filtering/searching? Picture a grid view that you want to narrow down. You can filter on specific fields (userid, date) or you can enter a full-text search.
For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But things like "all pages started by a user whose userid is 10 and matches [some search query]" would suck against the database, so it's suited for a search engine like Lucene.
Basically I'm wondering how other large sites do this sort of thing. Do they utilize a search engine 100% for all types of filtering? Do they mix database queries with a search engine?
If we use only a search engine, there's a problem with the delay time it takes for a new/updated object to appear in the search index. That is, I've read that it's not smart to update the index immediately, and to do it in batches instead. Even if this means every 5 minutes, users will get confused when their recently added page isn't immediately listed when they view a simple page listing (say a search query of "category:5").
We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?
My thought is to offer a simple filtering page which uses MySQL to filter on basic fields. Then offer a separate fulltext search page that would present results similar to Google. Is this the only way?
Solr or grassyknoll both provide slightly more abstract interfaces to Lucene.
That said: Yes. If you are a primarily content driven site, providing fulltext searching over your data, there is something in play beyond LIKE. While MySql's FULLTEXT indexies aren't perfect, it might be an acceptable placeholder in the interim.
Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward, simply add a stored property to the document at index time (this property can be a url, ID, GUID etc.) Then, searching becomes a 2 phase system:
1) Issue query to Lucene indexies (Display simple results like title)
2) Get more detailed information about the object from your relational stores by its key
Since instantiation of Documents is relatively expensive in Lucene, you only want to store fields searched in the Lucene index, as opposed to complete clones of your relational objects.
Don't write-off MySQL so readily!
Implement it using the database e.g. a select with a 'like' in the where-clause or whatever.
Profile it, add indexes if necessary. Roll out a beta, so you get real numbers from user's actual data patterns - not all columns might be equally asked after, etc.
If the performance does suck, then thats when you consider other options. You can consider tuning your SQL, your database, the machine the database is running on, and finally using another technology stack...
In case you want to use MySQL or PostgreSQL, a open source solution that works great with it is Sphinx:
http://www.sphinxsearch.com/
We are having the same problem and considering Sphinx and Lucene as possible solutions.