I've looked into Doctrine's built-in search, MySQL MyISAM full-text search, Zend_Lucene, and Sphinx - but all the nuances and implementation details are making it hard for me to sort out, given that I don't have experience with anything other than the MyISAM search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if they were easy and time-effective enough to implement).
The main thing is the ability to search for strings across multiple database tables and multiple fields, with some basic search criteria (e.g. user.state = 'CA' AND user.active = 1). The size of the database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to grow in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure whether that will handle what I need. And I don't know of any option out there that combines the simplicity and power I'm after.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recommend using the Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great) and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in PHP.
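To illustrate what that looks like on the wire, here is a minimal sketch of a query against Solr's HTTP/JSON API (the same endpoint the PHP client wraps). The host, core name, and field names are assumptions for the example, not part of the original answer.

```python
import requests

# Minimal sketch of a Solr query over the HTTP/JSON API.
# Host, core name ("app"), and field names are assumptions for this example;
# solr-php-client issues the same kind of request under the hood.
SOLR_SELECT = "http://localhost:8983/solr/app/select"

params = {
    "q": "text:annual report",        # full-text query against an indexed field
    "fq": ["state:CA", "active:1"],   # filter queries, e.g. user.state = 'CA' AND user.active = 1
    "hl": "true",                     # hit highlighting
    "hl.fl": "text",
    "rows": 20,
    "wt": "json",                     # ask for a JSON response
}

response = requests.get(SOLR_SELECT, params=params).json()
for doc in response["response"]["docs"]:
    print(doc["id"], doc.get("title"))
```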
I have a data source sitting in a relational database. I managed to index/store everything into Solr and was thrilled with the search performance and the awesome API (search/admin, etc.).
However, people say that if your data is truly structured, a relational database should be fast, provided you index everything. But even if I dump all the data into a relational database like MySQL, what I'd be missing is all the beautiful query API.
I guess my question is:
1. Is it possible to use only a Solr-like query API while keeping the relational database as the backend, without building an index at all?
2. If that is not possible, is there any mature project/product that can build a full-stack query API on top of a relational database?
Document search engines and relational databases serve different usage patterns. If you're using Solr for anything that involves tokenization and analysis chains, replicating that in an RDBMS requires implementing that functionality yourself (or settling for a subset, such as the full-text indices in certain RDBMSes). I detailed some of these differences and features in Should I just query the database or use a proper search engine solution?.
It's usually better to use the RDBMS as the main storage for your data and then push it into the search index as required. This will also let you get new features from the people who care about search and the problem it tries to solve, without having to wait for a niche product to implement it on top of your RDBMS (there are still quite a few new features in each release of Lucene, Elasticsearch and Solr).
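As a rough sketch of that "RDBMS as primary store, push into the index" pattern, something like the following could run after writes or on a schedule. The connection details, table/column names, core name, and the `last_sync` marker are all assumptions for the example; Solr's JSON update handler is used here, but Elasticsearch's bulk API fills the same role.

```python
import pymysql
import requests

# Sketch: read recently changed rows from the relational "source of truth"
# and push them into a Solr core. All names and credentials are placeholders.
SOLR_UPDATE = "http://localhost:8983/solr/app/update?commit=true"
last_sync = "2015-01-01 00:00:00"  # placeholder watermark for incremental syncs

conn = pymysql.connect(host="localhost", user="app", password="secret", database="app")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, title, body FROM documents WHERE updated_at > %s",
        (last_sync,),
    )
    docs = [{"id": row[0], "title": row[1], "body": row[2]} for row in cur.fetchall()]

# Solr's update handler accepts a JSON array of documents.
requests.post(SOLR_UPDATE, json=docs, headers={"Content-Type": "application/json"})
```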
I am planning to develop an application which will have a Model with lots of attributes on it. These attributes will be one of the most important parts of the application, so users will be firing search queries most of the time in order to find the results they are looking for.
My question is: is it OK to rely on MySQL or Postgres for this, or should I start with something like Solr or Elasticsearch from the beginning?
I don't want this application to consume lots of memory while running these searches. That is my first concern, since I will start with a basic server setup with 2 cores and 4 GB of RAM.
Both of them (RDBMS and full-text search engine) are valid technologies; mainly it depends on:
- your access pattern
- the features you want to offer in your search service
For instance, if you want to do full-text search, or you want things like autocompletion, faceting, or stemming, Solr or ES is your friend. On the other hand, if you want to pick up data in real time (and you don't need the things above), I would use an RDBMS.
In general: you've described some of your non-functional requirements, but the decision definitely involves functional requirements, too.
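To make the trade-off concrete, here is a rough sketch (not from the answer above) of the kind of faceted query a search engine answers in one round trip; the Elasticsearch index and field names are invented for the example.

```python
import requests

# Sketch: a full-text match plus facet counts in a single Elasticsearch request.
# Index name ("products") and field names are assumptions for the example.
query = {
    "query": {"match": {"description": "running shoes"}},  # analyzed full-text match
    "aggs": {"by_brand": {"terms": {"field": "brand"}}},    # facet counts per brand
    "size": 10,
}

resp = requests.post("http://localhost:9200/products/_search", json=query).json()
for bucket in resp["aggregations"]["by_brand"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])

# The rough SQL equivalent of the facet alone would be:
#   SELECT brand, COUNT(*) FROM products
#   WHERE MATCH(description) AGAINST ('running shoes') GROUP BY brand;
# but you lose stemming, relevance ranking, and the rest of the analysis chain.
```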
I'm working on a huge eCommerce solution on top of Symfony2, Doctrine2 and MySQL (maybe a cluster, since we will have a lot of people connected and working on our platform). I'm trying to decide whether Sphinx or MySQL would be the better search solution, since some data would need to be duplicated between the MySQL tables and Sphinx. Our main goal is performance, so excellent response times are what we're looking for. I'm not an expert in either, so I need some advice from people here based on their experience, maybe some docs or whatever. Which path did you take on this?
PS: Take into account that the DB will grow really fast, and the platform will be for the entire world.
Sphinx is usually preferred over MySQL for high-volume searches because it's easy to scale. You will have a delay before new data appears in results, while it syncs with MySQL, but even so it's better.
You should also take a good look at the actual queries that will run, and store in Sphinx only the fields that are searchable, along with their ids. Once you get the ids from Sphinx, use a MySQL slave to fetch the rest of the non-searchable data for display.
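A sketch of that two-step lookup, assuming Sphinx's SphinxQL listener on its default port and invented index/table names: query Sphinx for matching ids only, then hydrate those rows from a MySQL replica.

```python
import pymysql

# Sketch of the two-step pattern: ids from Sphinx, full rows from a MySQL replica.
# Index name, table name, hosts, and credentials are assumptions for the example.

# 1) Sphinx speaks the MySQL wire protocol (SphinxQL), typically on port 9306.
sphinx = pymysql.connect(host="127.0.0.1", port=9306, user="")
with sphinx.cursor() as cur:
    cur.execute("SELECT id FROM products_index WHERE MATCH('running shoes') LIMIT 20")
    ids = [row[0] for row in cur.fetchall()]

# 2) Pull the non-searchable display data from a read replica.
replica = pymysql.connect(host="replica.db.local", user="app",
                          password="secret", database="shop")
with replica.cursor() as cur:
    rows = []
    if ids:
        placeholders = ", ".join(["%s"] * len(ids))
        cur.execute(f"SELECT id, name, price FROM products WHERE id IN ({placeholders})", ids)
        rows = cur.fetchall()
```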
Depending on what queries you are using for search, a better solution than Sphinx is Amazon CloudSearch. We had a hard time implementing it, but it was well worth it, in both time and $$$, and it replaced our Sphinx solution.
I am building a MySQL database that will have roughly 10,000 records. Each record will contain a textual document (a few pages of text in most cases). I want to do all sorts of n-gram counting across the entire database. I have algorithms already written in Python that will do what I want against a directory containing a large number of text files, but to do that I would need to extract 10,000 text files from the database, which raises performance issues.
I'm a rookie with MySQL, so I'm not sure if it has any built-in features that do n-gram analysis, or whether there are good plugins out there that would do it. Please note that I need to go up to at least 4-grams (preferably 5-grams) in my analysis, so the simple 2-gram plugins I've seen won't work here. I also need to have the ability to remove the stopwords from the textual documents before doing the n-gram counting.
Any ideas from the community?
Thanks,
Ron
My suggestion would be to use a dedicated full-text search index program like Lucene/Solr, which has much richer and more extensible support for this sort of thing. It will require you to learn a bit to get it set up, but it sounds as if you want to mess around at a level that will be difficult to customize in MySQL.
If you really want to prematurely optimize ;) you could translate your Python into C and then wrap it with a thin MySQL UDF.
But I'd highly recommend just loading your documents one at a time and running your Python scripts on them to populate a MySQL table of n-grams. My hammer for every nail at the moment is Django. Its ORM makes interacting with MySQL tables, and optimizing those interactions, a cinch. I'm using it to do statistics in Python on multimillion-record databases for production sites that have to return gobs of data in less than a second. And any Python ORM will make it easier to switch out your database if you find something better than MySQL, like Postgres. The best part is that there are lots of Python and Django tools to monitor all aspects of your app's performance (Python execution, MySQL load/save, memory/swap). That way you can attack the right problem. It may be that sequential bulk MySQL reads aren't what's slowing you down...
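As a sketch of that approach, the following counts 1- to 5-grams per document (after dropping stopwords) and accumulates them in a counts table. The stopword list, table names, schema, and credentials are placeholders; the counts table is assumed to have a unique key on the gram column.

```python
import re
from collections import Counter

import pymysql

# Sketch: count 1- to 5-grams per document and accumulate them in a MySQL table.
# Stopword list, table names, and schema are placeholders for the example.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def ngrams(text, max_n=5):
    """Return a Counter of 1..max_n-grams with stopwords removed."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

conn = pymysql.connect(host="localhost", user="app", password="secret", database="corpus")
with conn.cursor() as cur:
    cur.execute("SELECT id, body FROM documents")
    for doc_id, body in cur.fetchall():
        rows = list(ngrams(body).items())
        # Assumes ngram_counts(gram VARCHAR ... UNIQUE, freq INT).
        cur.executemany(
            "INSERT INTO ngram_counts (gram, freq) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE freq = freq + VALUES(freq)",
            rows,
        )
conn.commit()
```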
I'm currently working on a project centered around a medical nomenclature known as SNOMED. At the heart of SNOMED are three relational datasets that are 350,000, 1.1 million, and 1.3 million records in length. We want to be able to quickly query this dataset for the data-entry portion, where we would like to have some form of auto-completion/suggestion.
It's currently in a MySQL MyISAM DB just for dev purposes, but we want to start playing with some in-memory options. It's currently 30 MB + 90 MB + 70 MB in size including the indexes. The MEMORY MySQL engine and Memcached are the obvious ones, so my question is: which of these would you suggest, or is there something better out there?
We're working primarily in Python at the app level, if that makes a difference. Also, we're running on a single small dedicated server, moving to 4 GB of DDR2 soon.
Edit: Additional Info
We're interested in keeping suggestions and autocompletion fast, so something that performs well for these types of queries is desirable. Each term in SNOMED typically has several synonyms, abbreviations, and a preferred name. We will be querying this dataset heavily (90 MB in size including the index). We're also considering building an inverted index to speed things up and return more relevant results (many of the terms are long, e.g. "Entire coiled artery of decidua basalis (body structure)"). Lucene or some other full-text search may be appropriate.
From your use case, it sounds like you want to do full-text searching; I would suggest Sphinx. It's blazing fast, even on large data sets. You can integrate Memcached if you need extra speed.
Please see the following for how to do this with Lucene:
- Techniques to make autocomplete on website more responsive
- How to do query auto-completion suggestions in Lucene
- autocomplete server side implementation
Lucene is the closest thing to an industry-standard full-text search library. It is fast and gives quality results. However, it takes time to master Lucene - you have to handle many low-level details. An easier way may be to use Solr, a Lucene sub-project, which is much easier to set up and can give JSON output that can be used for autocomplete.
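For a feel of what the Solr route can look like, here is a rough sketch of a prefix lookup using Solr's terms component. The core name ("snomed"), field name, and host are assumptions, and the /terms request handler must be enabled in solrconfig.xml for this to work; a production autocompleter would more likely use edge n-grams or the suggester.

```python
import requests

# Sketch: prefix-based suggestions from Solr's terms component.
# Core name, field name, and host are assumptions for the example.
params = {
    "terms.fl": "term",       # field holding the concept names/synonyms
    "terms.prefix": "coil",   # what the user has typed so far
    "terms.limit": 10,
    "wt": "json",
}

resp = requests.get("http://localhost:8983/solr/snomed/terms", params=params).json()
# The response lists matching terms with their document frequencies under
# resp["terms"]["term"], which can be returned directly to the autocomplete widget.
print(resp["terms"]["term"])
```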
As Todd said, you can also use Sphinx. I have never used it, but I've heard it integrates well with MySQL. I couldn't find how to implement autocomplete using Sphinx - maybe you should post that as a separate question.