When I added search functionality to my first Rails app, I used Sphinx, after reading that using MySQL's built-in full-text search was a bad idea. While Sphinx works well, it's a bit complicated to set up, and I feel there's too much overhead for the simple search functionality my app requires.
Searches aren't performed very often on my site (at most one search every 3-4 seconds), so I'm not too worried about load.
My question: Why exactly is using MySQL's full-text search a bad idea compared to Sphinx/Ferret/Solr/etc.?
MySQL is a relational database, not a search server, so right off we are talking about using something that wasn't built specifically for the task. That being said, MySQL's full-text search works pretty well; it just isn't a good fit if you need to scale.
You don't want your DB server doing more than it has to, as it is usually the bottleneck of the application even without something like full-text search running.
MySQL full-text search requires the MyISAM engine (InnoDB only gained full-text indexes in MySQL 5.6), which is a problem if you care about the consistency of your data.
MyISAM doesn't support transactions or foreign key constraints, which engines like InnoDB do, so you are generally at a disadvantage by starting with MyISAM.
But, YMMV and if your application can survive being subjected to MyISAM's shortcomings, by all means, use it. Just know that it is not a great production engine for MOST tasks (not ALL, but most).
I've looked into Doctrine's built-in search, MySQL MyISAM full-text search, Zend_Lucene, and Sphinx, but all the nuances and implementation details are making it hard for me to sort out, given that I don't have experience with anything other than the MyISAM search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if it were easy enough and time effective enough to implement).
The main thing is the ability to search for strings across multiple database tables and multiple fields, with some basic search criteria (e.g. user.state = 'CA' AND user.active = 1). The size of the database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to grow in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure if that's going to be able to handle what I need. And I don't know that there is any option out there which is going to combine my desire for simplicity & power.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recommend using the Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great) and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in PHP.
I need to implement something like a full-text search on a couple of indexes in a large products table using InnoDB; MyISAM is not an option due to its lack of transactions and foreign key support.
I can think of a couple of ways of doing this: a plugin, a stored procedure, or a search table with keys and copied indexes in MyISAM format.
How have you achieved this in the past? Which is the best way (including any I have not mentioned), and why?
The plugin sounds expensive, the stored procedure sounds slow, and the search table sounds like an admin nightmare.
Love to hear your views.
One possibility is to use an external full-text engine (independent of the database), such as Lucene (if you are using PHP, you might also want to take a look at Zend_Search_Lucene).
There are plenty of those (some free, some costly, some open source, some proprietary), and they generally let you go further than MySQL full-text allows. The engine won't be integrated into the database, which can be good (you can index documents, for instance) or bad (you can't easily join those results with the results of your SQL queries).
The advantage of using such an engine is that full-text search is really their job, and they're supposed to do it well (and be "intelligent" about what users tend to input).
The drawback is that you must run another engine, which means more work, more configuration, and more development, as you can't just have it "plugged in" to your DB.
Have a look at Sphinx Search
Xapian has a nice set of PHP bindings and, like a few others, can run as a separate server listening on a port for search requests, which is great for sharing content between multiple sites.
"Did you mean" functionality is present in Xapian but not in the remote server version (yet).
I'm currently working on a project that is centered around a medical nomenclature known as SNOMED. At the heart of SNOMED are three relational datasets that are 350,000, 1.1 million, and 1.3 million records in length. We want to be able to query this dataset quickly for the data entry portion, where we would like to have some form of auto-completion/suggestion.
It's currently in a MySQL MyISAM DB just for dev purposes, but we want to start playing with some in-memory options. It's currently 30MB + 90MB + 70MB in size including the indexes. The MySQL MEMORY engine and memcached were the obvious ones, so my question is: which of these would you suggest, or is there something better out there?
We're working primarily in Python at the app level, if that makes a difference. Also, we're running on a single small dedicated server, moving to 4GB DDR2 soon.
Edit: Additional Info
We're interested in keeping the suggestions and auto-completion fast. Something that will perform well for these types of queries is desirable. Each term in SNOMED typically has several synonyms, abbreviations, and a preferred name. We will be querying this dataset heavily (90MB in size including index). We're also considering building an inverted index to speed things up and return more relevant results (many of the terms are long, e.g. "Entire coiled artery of decidua basalis (body structure)"). Lucene or some other full-text search may be appropriate.
From your use case, it sounds like you want to do full-text searching; I would suggest Sphinx. It's blazing fast, even on large data sets. You can integrate memcached if you need extra speed.
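To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration under assumptions: the function names, the `min_len` cutoff and the two shorter sample terms are made up, and a real SNOMED load would need smarter tokenizing and ranking.

```python
from collections import defaultdict


def build_prefix_index(terms, min_len=2):
    """Map every word prefix (lowercased) to the ids of the terms containing it."""
    index = defaultdict(set)
    for term_id, text in terms.items():
        # Crude tokenizer: treat parentheses as whitespace, split on spaces.
        words = text.lower().replace("(", " ").replace(")", " ").split()
        for word in words:
            for end in range(min_len, len(word) + 1):
                index[word[:end]].add(term_id)
    return index


def suggest(index, terms, query, limit=10):
    """Return terms that contain a word starting with each typed fragment."""
    fragments = query.lower().split()
    if not fragments:
        return []
    matches = None
    for fragment in fragments:
        ids = index.get(fragment, set())
        matches = ids if matches is None else matches & ids
    # Shorter names first, on the assumption that they are the likelier pick.
    ranked = sorted(matches, key=lambda term_id: (len(terms[term_id]), terms[term_id]))
    return [terms[term_id] for term_id in ranked[:limit]]


# Hypothetical sample data: only the first term comes from the question.
terms = {
    1: "Entire coiled artery of decidua basalis (body structure)",
    2: "Coiled artery",
    3: "Decidua basalis",
}
index = build_prefix_index(terms)
print(suggest(index, terms, "coi art"))
```

Storing every prefix trades memory for lookup speed, so on the full dataset it would be worth measuring the index size before committing to this layout.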
For how to do this with Lucene, please see:
Techniques to make autocomplete on website more responsive
How to do query auto-completion suggestions in Lucene
autocomplete server side implementation
Lucene is the closest thing to an industry-standard full-text search library. It is fast and gives quality results. However, it takes time to master Lucene, as you have to handle many low-level details. An easier way may be to use Solr, a Lucene sub-project that is much easier to set up and can produce JSON output that can be used for autocomplete.
As Todd said, you can also use Sphinx. I have never used it, but I have heard it integrates well with MySQL. I couldn't find out how to implement autocomplete using Sphinx; maybe you should post this as a separate question.
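As a rough sketch of the Solr route from Python: the helper below only builds the URL for a prefix query that asks for JSON back. The host, the core name `terms` and the field name `name` are assumptions for illustration, and older Solr installs use a different path layout.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, prefix, field="name", rows=10):
    """Build a Solr select URL for a prefix (wildcard) query with JSON output."""
    params = {
        "q": f"{field}:{prefix}*",  # trailing * turns this into a prefix query
        "rows": rows,               # cap the number of suggestions returned
        "wt": "json",               # ask Solr for a JSON response
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local instance and core name.
url = solr_select_url("http://localhost:8983", "terms", "art")
print(url)
```

Whatever the user typed should be escaped for Solr's query syntax before it goes into `prefix`; this sketch skips that step.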
How do the full-text search systems of PostgreSQL and MySQL compare? Is either clearly better than the other? In what ways are they different?
PostgreSQL 8.3 has built-in full-text search, which is an integrated version of the "tsearch2" module.
Here is the documentation: http://www.postgresql.org/docs/8.3/static/textsearch.html
And the example from the documentation:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
Where body is a text field. You can index specifically for these types of searches and of course they can become more complex than this simple example. The functionality is very solid and worth diving into as you make your decision.
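On the indexing point, here is a small Python sketch of how the index and the query might fit together. The table and column come from the documentation example; the index name, the 'english' configuration and the helper function are assumptions, and the helper expects an already open DB-API connection (psycopg2, for instance).

```python
# Expression index so the search does not have to re-parse every row.
# The index name is made up for this sketch.
CREATE_INDEX_SQL = (
    "CREATE INDEX pgweb_body_fts_idx "
    "ON pgweb USING gin (to_tsvector('english', body))"
)

# The query must use the same to_tsvector('english', ...) expression as the
# index, otherwise the planner cannot use it.
SEARCH_SQL = (
    "SELECT title FROM pgweb "
    "WHERE to_tsvector('english', body) @@ to_tsquery('english', %s)"
)


def search_titles(conn, term):
    """Run the full-text query on an open DB-API connection and return titles."""
    with conn.cursor() as cur:
        cur.execute(SEARCH_SQL, (term,))  # parameter binding, no string pasting
        return [row[0] for row in cur.fetchall()]
```

Calling `search_titles(conn, "friend")` would then correspond to the documentation query above, with the search word passed as a bound parameter.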
Best of luck.
Update: Starting in MySQL 5.6, InnoDB supports fulltext search
I'm not well versed in PostgreSQL unfortunately, but if you use the FULL TEXT search in MySQL you're immediately tied to MyISAM. If you want to use InnoDB (and if ACID compliance means anything to you, you should be using InnoDB) you're stuck using other solutions.
Two popular alternatives that are often rolled out are Lucene (an Apache project, with a Zend module if you're using PHP) and Sphinx.
If you're using Hibernate as an ORM, I highly recommend Hibernate Search. It's built on top of Lucene, so it's super fast.
Karl
I've had pretty good experience with PostgreSQL/tsearch2, especially since it was rolled into the standard distribution (before version 8.3, I think, it was an optional contrib feature, and upgrading to tsearch2 involved a bit of work).
If I recall correctly you have to set some properties (fuzzy matching, dictionary stuff) before startup, whereas on other databases those things are flexibly exposed through the fulltext syntax itself (I'm thinking of Oracle Text, here, though I know that's not relevant to your question).
I think you can use Sphinx with both MySQL and Postgres.
Here is an article to explain how to use Sphinx with MySQL (you can add it as a plugin)
MySQL full-text search is very slow. It can't handle more than about 1 million records (queries take several tens of seconds).
I have no experience using PostgreSQL full-text search.
I have used Sphinx search. It is very fast and easy to use, but its search functionality is not as powerful. For example, it doesn't support patterns like 'abc?', where '?' stands for any single character.
I also know Lucene. It is powerful, but it is hard to learn.
My Django project is going to be backed by a large database with several hundred thousand entries, and will need to support searching (I'll probably end up using djangosearch or a similar project.)
Which database backend is best suited to my project and why? Can you recommend any good resources for further reading?
For whatever it's worth, the creators of Django recommend PostgreSQL.
If you're not tied to any legacy
system and have the freedom to choose
a database back-end, we recommend
PostgreSQL, which achieves a fine
balance between cost, features, speed
and stability. (The Definitive Guide to Django, p. 15)
As someone who recently switched a project from MySQL to PostgreSQL, I don't regret the switch.
The main difference, from a Django point of view, is more rigorous constraint checking in PostgreSQL, which is a good thing; it's also a bit more tedious to do manual schema changes (aka migrations).
There are probably 6 or so Django database migration applications out there, and at least one doesn't support PostgreSQL. I don't consider this a disadvantage though, because you can use one of the others or do migrations manually (which is what I prefer atm).
Full-text search might be better supported for MySQL. MySQL has built-in full-text search supported from within Django, but it's pretty useless (no word stemming, phrase searching, etc.). I've used django-sphinx as a better option for full-text searching in MySQL.
Full-text searching is built into PostgreSQL 8.3 (earlier versions need the tsearch2 module). Here's a good instructional blog post: Full-text searching in Django with PostgreSQL and tsearch2
large database with several hundred
thousand entries,
This is not a large database; it's a very small one.
I'd choose PostgreSQL because it has a lot more features. Most significant in this case: in PostgreSQL you can use Python as a procedural language.
Go with whichever you're more familiar with. MySQL vs PostgreSQL is an endless war. Both of them are excellent database engines and both are being used by major sites. It really doesn't matter in practice.
All the answers bring interesting information to the table, but some are a little outdated, so here's my two cents.
As of 1.7, migrations are an integral feature of Django, so the documentation now covers the main differences between backends that Django developers might want to know beforehand.
Backend Support
Migrations are supported on all backends that Django ships with, as
well as any third-party backends if they have programmed in support
for schema alteration (done via the SchemaEditor class).
However, some databases are more capable than others when it comes to schema migrations; some of the caveats are covered below.
PostgreSQL
PostgreSQL is the most capable of all the databases here in terms of schema support.
MySQL
MySQL lacks support for transactions around schema alteration operations, meaning that if a migration fails to apply you will have to manually unpick the changes in order to try again (it’s impossible to roll back to an earlier point).
In addition, MySQL will fully rewrite tables for almost every schema operation and generally takes a time proportional to the number of rows in the table to add or remove columns. On slower hardware this can be worse than a minute per million rows - adding a few columns to a table with just a few million rows could lock your site up for over ten minutes.
Finally, MySQL has relatively small limits on name lengths for columns, tables and indexes, as well as a limit on the combined size of all columns an index covers. This means that indexes that are possible on other backends will fail to be created under MySQL.
SQLite
SQLite has very little built-in schema alteration support, and so
Django attempts to emulate it by:
Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
This process generally works well, but it can be slow and occasionally
buggy. It is not recommended that you run and migrate SQLite in a
production environment unless you are very aware of the risks and its
limitations; the support Django ships with is designed to allow
developers to use SQLite on their local machines to develop less
complex Django projects without the need for a full database.
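To illustrate the four emulation steps quoted above, here is a small sketch using Python's built-in sqlite3 module. This is not Django's actual implementation; the table and column names are invented, and the schema change shown (adding a NOT NULL column with a default) is just one example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO article (title) VALUES (?)", [("first",), ("second",)]
)
conn.commit()

# 1. Create a new table with the new schema.
conn.execute(
    "CREATE TABLE article__new ("
    "id INTEGER PRIMARY KEY, title TEXT, views INTEGER NOT NULL DEFAULT 0)"
)
# 2. Copy the data across (the new column picks up its default).
conn.execute("INSERT INTO article__new (id, title) SELECT id, title FROM article")
# 3. Drop the old table.
conn.execute("DROP TABLE article")
# 4. Rename the new table to match the original name.
conn.execute("ALTER TABLE article__new RENAME TO article")
conn.commit()

rows = conn.execute("SELECT id, title, views FROM article ORDER BY id").fetchall()
print(rows)
```

The copy in step 2 touches every row, which is why the quoted docs warn that the process can be slow on bigger tables.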
Even if PostgreSQL looks better, I find it has some performance issues with Django:
PostgreSQL is made to handle "long connections" (connection pooling, persistent connections, etc.)
MySQL is made to handle "short connections" (connect, do your queries, disconnect; it has some performance issues with a lot of open connections)
The problem is that Django does not support connection pooling or persistent connections; it has to connect to and disconnect from the database on each view call.
It will work with PostgreSQL, but connecting to PostgreSQL costs a LOT more than connecting to a MySQL database (on PostgreSQL, each connection has its own process, which is a lot slower than just spawning a new thread in MySQL).
With MySQL you also get some features like the Query Cache that can be really useful in some cases. (But you lose the superb text search of PostgreSQL.)
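Update for readers on newer Django versions: Django 1.6 and later can keep database connections open between requests through the CONN_MAX_AGE setting, which reduces the reconnect cost described above. A minimal settings sketch follows; the database name is a placeholder, and the ENGINE string shown is the one used since Django 1.9 (older releases used django.db.backends.postgresql_psycopg2).

```python
# settings.py (fragment); "mydb" is a placeholder database name.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        # Seconds to keep a connection open for reuse across requests.
        # 0 closes it at the end of each request; None keeps it indefinitely.
        "CONN_MAX_AGE": 60,
    }
}
```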
When a migration fails in django-south, the developers encourage you not to use MySQL:
! The South developers regret this has happened, and would
! like to gently persuade you to consider a slightly
! easier-to-deal-with DBMS (one that supports DDL transactions)
I went down the road of MySQL because I was familiar with it (struggling to find a proper installer for PostgreSQL, plus a quick test of its slow web "workbench" interface, put me off). At the end of the project, a few months after deployment, while looking into backup options, I saw that you have to pay for MySQL's enterprise backup features. Gotcha right at the very end.
With MySQL I had to write some ugly monster raw SQL queries in Django because there is no SELECT DISTINCT per group for retrieving the latest row per group. I was also looking at PostgreSQL's full-text search and wishing I had used PostgreSQL.
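For anyone wondering what the "latest per group" raw SQL tends to look like, here is a sketch of the usual self-join workaround. It runs against an in-memory SQLite database only so that it is self-contained; the table and column names are invented, and the queries in my project were different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reading (
    id INTEGER PRIMARY KEY, sensor TEXT, taken_at TEXT, value REAL
);
INSERT INTO reading (sensor, taken_at, value) VALUES
    ('a', '2015-01-01', 1.0),
    ('a', '2015-01-03', 3.0),
    ('b', '2015-01-02', 2.0),
    ('b', '2015-01-01', 0.5);
""")

# Join each row against the newest timestamp of its group and keep the match.
LATEST_PER_GROUP = """
SELECT r.sensor, r.taken_at, r.value
FROM reading AS r
JOIN (
    SELECT sensor, MAX(taken_at) AS latest
    FROM reading
    GROUP BY sensor
) AS m ON m.sensor = r.sensor AND m.latest = r.taken_at
ORDER BY r.sensor
"""

rows = conn.execute(LATEST_PER_GROUP).fetchall()
print(rows)
```

On PostgreSQL, Django's queryset `distinct()` accepts field names (it maps to DISTINCT ON), so a query like this can usually stay in the ORM.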
I recommend PostgreSQL even if you are familiar with MySQL, but your mileage may vary.
UPDATE: DBeaver is a great equivalent of the MySQL Workbench GUI tool that works very nicely with PostgreSQL (and many others, as it's a universal DB tool).
To add to previous answers :
"Full text search might be better supported for MySQL"
The FULLTEXT index in MySQL is a joke.
It only works with MyISAM tables, so you lose ACID, Transactions, Constraints, Relations, Durability, Concurrency, etc.
An INSERT/UPDATE/DELETE on a largish TEXT column (like a forum post) will rebuild a large part of the index. If it does not fit in the MyISAM key buffer, then heavy IO will occur. I've seen a single forum post insertion trigger 100MB or more of IO ... meanwhile the posts table is exclusively locked!
I did some benchmarking (3 years ago, may be stale...) which showed that on large datasets, basically postgres fulltext is 10-100x faster than mysql, and Xapian 10-100x faster than postgres (but not integrated).
Other reasons to choose PostgreSQL that haven't been mentioned are its extremely smart query optimizer, large choice of join types (merge, hash, etc.), hash aggregation, GiST indexes on arrays, spatial search, etc., which can result in extremely fast plans on very complicated queries.
Will this application be hosted on your own servers or by a hosting company? Make sure that if you are using a hosting company, they support the database of choice.
There is a major licensing difference between the two databases that will affect you if you ever intend to distribute code using the DB. MySQL's client libraries are GPL, while PostgreSQL's are under a BSD-like license, which might be easier to work with.