Sphinx search or MySQL for a huge database search engine - mysql

So I'm working on a huge eCommerce solution on top of Symfony2, Doctrine2 and MySQL (maybe a cluster, since we will have a lot of people connected and working on our platform), and I'm trying to decide whether it would be better to use Sphinx search or MySQL as the search solution, since some data would need to be duplicated in MySQL tables and in Sphinx. Our main goal is performance, so excellent response times are what we are looking for. I'm not an expert in either, so I need some advice from people here based on their experience, maybe some docs or whatever. What path did you take on this?
PS: The DB will grow really fast, take that into account, and the platform will be for the entire world.

Sphinx is usually preferred over MySQL for high-volume searches because it's easy to scale. Results will lag slightly behind MySQL, because Sphinx needs time to sync its index with MySQL, but even so it's the better choice.
You should also take a good look at the actual queries that will run, and store in Sphinx only the searchable fields along with their ids. Once you get the ids from Sphinx, use a MySQL slave to fetch the other, non-searchable data needed to list them.
Depending on what queries you are using for search, Amazon CloudSearch can be a better solution than Sphinx. We had a hard time implementing it, but it was well worth it, both in time and $$$, and it replaced our Sphinx solution.
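The two-step lookup described above (the search engine returns ids, MySQL returns everything else) might look roughly like the sketch below. This is only an illustration: `search_ids` is a made-up stand-in for a real Sphinx query, the `products` table and its columns are hypothetical, and sqlite3 replaces the MySQL slave so the example runs anywhere.

```python
import sqlite3

# Stand-in for the search engine. A real setup would query Sphinx here and
# get matching ids back in relevance order. The ids below are made up.
def search_ids(term):
    fake_index = {"laptop": [3, 1]}
    return fake_index.get(term.lower(), [])

def fetch_products(conn, ids):
    """Fetch full rows for the given ids, preserving the search ranking."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, price FROM products WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not guarantee any order, so re-sort by the engine's ranking.
    return [by_id[i] for i in ids if i in by_id]

# sqlite3 stands in for the MySQL slave (assumption for the sake of the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Budget laptop", 399.0), (2, "Desk lamp", 25.0), (3, "Pro laptop", 1499.0)],
)

results = fetch_products(conn, search_ids("laptop"))
for row in results:
    print(row)
```

The one detail worth noticing is the re-sort at the end: the database returns the rows in whatever order it likes, so the ranking from the search engine has to be reapplied in application code.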

Related

Only solr or with Mysql

I want to use Solr for my search index. What confuses me is whether I should put most of the data fields in Solr, or only search for the id and then get the data from MySQL. Which is faster and better? Please help.
I had the same question in 2010 and decided to use Solr as a search index only: get a list of IDs in the first step, then read the data for those IDs from MySQL in the second step.
That works fine in an environment with 20 million docs.
During a rebuild of the whole application in 2014, we decided to additionally store the data in Solr (not only index it), so that we could fetch the whole docs during a search and the MySQL connection was no longer necessary.
We are talking about a web application with at most 1-3 thousand parallel users, and there is absolutely no perceived difference in application speed between the 2010 and 2014 versions.
But there are some benefits if you take the documents from Solr rather than MySQL:
The application code is a bit cleaner.
You only need one connection to get the data.
But the main reason we began to store the documents in Solr is that we needed the highlighting feature. This only works well if you store the docs in Solr and fetch them from Solr too.
Btw: there is no change in search performance whether you store the docs or not.
The disadvantage is, that you have to hold the data twice:
1.) in MySQL as the base dataset and
2.) in Solr for your application.
And: if you have very big documents, Solr is probably not the right tool to serve them.
Putting all the data into Solr will absolutely be faster: you save yourself from making two queries, and you remove the need for a slow piece of code (PHP or whatever) to bridge the gap between the two by pulling the ids out of Solr and then querying MySQL. Alternatively you could put everything into MySQL, which would be of comparable speed. In other words, choose the technology that suits your needs best, but don't mix unnecessarily for performance reasons. A good comparison of when you might want to use Solr vs MySQL can be found here.
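To make the two approaches from this thread concrete, here is a rough sketch of what the Solr requests could look like. It only builds the URLs and does not contact a server. The host, the core name `products` and the field names `title` and `description` are assumptions for illustration. The parameters themselves (`q`, `fl`, `rows`, `wt`, `hl`, `hl.fl`) are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name, adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/products/select"

def ids_only_url(query, rows=20):
    """Index-only approach: ask Solr for ids, load the rest from MySQL."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)

def stored_docs_url(query, highlight_field, rows=20):
    """Stored-fields approach: ask Solr for whole docs plus highlighting."""
    params = {
        "q": query,
        "fl": "*",
        "rows": rows,
        "wt": "json",
        "hl": "true",
        "hl.fl": highlight_field,  # highlighting needs this field to be stored
    }
    return SOLR_SELECT + "?" + urlencode(params)

print(ids_only_url("title:laptop"))
print(stored_docs_url("title:laptop", "description"))
```

With the first URL the application still needs a second round trip to MySQL; with the second one the response already carries everything needed to render the result list.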

MySQL & Memcached for large datasets?

For a customer I am currently investigating improvements to their database structure.
My customer offers holiday rentals on their website.
On their front page they have a search function which sends a query to a MySQL database architecture (Master-Master setup), which answers with all the holiday rentals that the customer is interested in.
Due to the growth of the company and the increasing load on their servers, the search queries are currently taking 10+ seconds, mainly because the queries end with an ORDER BY, which causes MySQL to create a temp table and sort all the data. An average search query can return up to 20k holiday homes.
Of course, one of the things we are doing is investigating the queries, rewriting them and adding indexes where needed. Unfortunately we are unable to get a lot more performance under these circumstances.
That's why we are looking into implementing Memcached on top of MySQL to cache these large datasets in memory for faster retrieval. Unfortunately the datasets that the queries return are quite large, which makes Memcached not that effective at this point. The arrays that MySQL returns are currently about 15k rows with about 60 values per row.
The reason Memcached is interesting is that we want to drastically improve the search function and lower the load on the MySQL platform. This would make it more scalable.
I am wondering if anyone is familiar with (long-term) caching of MySQL data in Memcached and with making it more effective for large datasets?
Thanks a bunch!
Memcache is for storing key-value pairs, not for large sets of data. Will it work? Yes. Of course it will. But with how much data you guys are going to throw at it, you're going to run out of memory very soon, and given how often your search results may change you will end up hitting the database anyway. And remember that just because it's memcache doesn't mean it avoids a network round trip to a (most likely) different machine. Your problem seems to be that you're using MySQL for something it was never designed to do well, which is acting as a search engine. No matter how many things you optimize, all you're doing is raising the ceiling an inch at a time.
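A quick back-of-the-envelope calculation shows why result sets of this size are a poor fit. The row and column counts come from the question; the average of 20 bytes per value is purely an assumption (measure your own data), and the 1 MB figure is memcached's default maximum item size.

```python
ROWS_PER_RESULT = 15_000   # from the question
VALUES_PER_ROW = 60        # from the question
AVG_BYTES_PER_VALUE = 20   # assumption, replace with a measured figure
MEMCACHED_DEFAULT_ITEM_LIMIT = 1024 * 1024  # default max item size (1 MB)

values = ROWS_PER_RESULT * VALUES_PER_ROW
result_bytes = values * AVG_BYTES_PER_VALUE
# Ceiling division: how many default-sized items one result set would span.
items_needed = -(-result_bytes // MEMCACHED_DEFAULT_ITEM_LIMIT)

print(f"{values:,} values, about {result_bytes / 1e6:.0f} MB per cached result set")
print(f"spans at least {items_needed} items at the default item size limit")
```

Under that assumption a single cached search result is many times larger than one default-sized item, and every distinct combination of search filters would need its own copy.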
I could take this post in a "you need to optimize MySQL parameters so that it doesn't have to create those temp tables" direction, but I'm going to assume you've already looked into that and keep going.
My recommendation is that you implement something on top of MySQL to handle the searching. In my own quest for fast searching, these are the solutions I gave the most weight to:
Sphinx: http://sphinxsearch.com
Solr: http://lucene.apache.org/solr
Elasticsearch: http://www.elasticsearch.org
You'll find plenty of resources here on StackOverflow for which of those is better and faster and what not. For our purposes, we picked Elasticsearch for one of our projects and Solr for another.

Solr vs. MySQL performance for autocomplete

In one of our applications, we need to hold some plain tabular data and we need to be able to perform user-side autocompletion on one of the columns.
The initial solution we came up with, was to couple MySQL with Solr to achieve this (MySQL to hold data and Solr to just hold the tokenized column and return ids as result). But something unpleasant happened recently (developers started storing some of the data in Solr, because the MySQL table and the operations done on it are nothing that Solr can not provide) and we thought maybe we could merge them together and eliminate one of the two.
So we had to either: (1) move all the data to Solr, or (2) use MySQL for autocompletion.
(1) sounded terrible, so I gave (2) a shot. I started by loading that single column's data into MySQL, disabled all caches on both MySQL and Solr, wrote a tiny webapp that performs very similar queries [1] on both databases, and fired up a few JMeter scenarios against both in a local, similar environment. The results show a 2.5-3.5x advantage for Solr; however, I think the results may be totally wrong and error-prone.
So, what would you suggest for:
Correctly benchmarking these two systems? I believe you need to provide the JVM with an environment similar to MySQL's.
Designing this system?
Thanks for any leads.
[1] SELECT column FROM table WHERE column LIKE 'USER-INPUT%' on MySQL and column:"USER-INPUT" on Solr.
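For reference, the MySQL side of [1] could be wrapped as in the sketch below. It is only an illustration of the prefix query with bound parameters and escaped wildcards: the `items` table and `name` column are hypothetical, and sqlite3 stands in for MySQL so that the example is runnable (its LIKE is case-insensitive for ASCII by default, which may differ from your MySQL collation).

```python
import sqlite3

def escape_like(text, escape="\\"):
    """Escape LIKE wildcards so user input is matched literally."""
    return (
        text.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )

def autocomplete(conn, user_input, limit=10):
    # Only a trailing wildcard is added, so this stays a pure prefix match.
    pattern = escape_like(user_input) + "%"
    rows = conn.execute(
        # SQLite needs the explicit ESCAPE clause; in MySQL the backslash is
        # already the default LIKE escape character.
        "SELECT name FROM items WHERE name LIKE ? ESCAPE '\\' "
        "ORDER BY name LIMIT ?",
        (pattern, limit),
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.execute("CREATE INDEX items_name_idx ON items (name)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [("Sydney",), ("Sydenham",), ("Perth",), ("100%_pure",)],
)

print(autocomplete(conn, "syd"))
print(autocomplete(conn, "100%"))
```

A pattern with no leading wildcard is the form that lets MySQL use an ordinary index on the column, which is worth keeping in mind when comparing the numbers against Solr.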
I recently moved a website over from getting its data from the database (postgres) to getting all data from Solr. Unbelievable difference in speed. We also have autocomplete for Australian suburbs (about 15K of them) and it finds them in a couple of milliseconds, so the ajax auto-complete (we used jQuery) reacts almost instantly.
All updates are done against the original database, but our site is a mostly-read site. We used triggers to fire events when records were updated and that spawns a reindex into Solr of the record.
The other big speed improvement was pre-caching the data required to render the items, i.e. we denormalize data and pre-calculate lots of stuff at Solr indexing time, so the rendering is easy for the web guys and super fast.
Another advantage is that we can put our site into read-only mode if the database needs to be taken offline for some reason - we just fall back to Solr. At least the site doesn't go fully down.
I would recommend using Solr as much as possible, for both speed and scalability.

Good search solution for Zend Framework + Doctrine + MySQL?

I've looked into Doctrine's built-in search, MySQL MyISAM fulltext search, Zend_Lucene, and Sphinx, but all the nuances and implementation details are making it hard for me to sort out, given that I don't have experience with anything other than the MyISAM search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if it were easy enough and time effective enough to implement).
The main thing is the ability to search for strings across multiple database tables and multiple fields with some basic search criteria (e.g. user.state = CA AND user.active = 1). The size of the database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to grow in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure if that's going to be able to handle what I need. And I don't know that there is any option out there which is going to combine my desire for simplicity & power.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recommend using the Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great) and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in PHP.

Geospatial and full text search for Rails app hosted on Heroku

I'm planning out a Rails app that will be hosted on Heroku and will need both geospatial and full text search capabilities.
I know that Heroku offers add-ons like WebSolr and IndexTank that sound like they can do the job, but I was wondering if this could be done in MySQL and/or PostgreSQL without having to pay for any add-ons?
Depending on the scale of your application, you should be able to use both FULLTEXT and SPATIAL indexes in MySQL with ease. Once your application gets massive, i.e. hundreds of millions of rows with high concurrency and thousands of requests per second, you might need to move to another solution for either FULLTEXT or SPATIAL queries. But I wouldn't recommend optimizing for that early on, since it can be very hard to do properly. For the foreseeable future MySQL should suffice.
You can read about spatial indexes in MySQL here. You can read about fulltext indexes in MySQL here. Finally, I would recommend taking the steps outlined here to make your schema.rb file and rake tasks work with these two index types.
I have only used MySQL for both, but my understanding is that PostgreSQL has a good geo-spatial index solution as well.
If you have a database at Heroku, you can use Postgres's support for Full Text Search: http://www.postgresql.org/docs/8.3/static/textsearch.html. The oldest servers Heroku runs (for shared databases) are on 8.3 and 8.4. The newest are on 9.0.
A blog post noticing this little fact can be seen here: https://tenderlovemaking.com/2009/10/17/full-text-search-on-heroku.html
Apparently, that "texticle" (heh. cute.) addon works...pretty well. It will even create the right indexes for you, as I understand it.
Here's the underlying story: Postgres full text search is pretty fast and fuss-free (although Rails integration may not be great), but it does not offer the bells and whistles of Solr or IndexTank. Make sure you read about how to properly set up GIN and/or GiST indexes, and use the tsvector/tsquery types.
The short version:
Create an (in this case, expression-based) index: CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));. In this case "body" is the field being indexed.
Use the @@ operator: SELECT * FROM ... WHERE to_tsvector('english', pgweb.body) @@ to_tsquery('hello & world') LIMIT 30
The hard part may be mapping things back into application land; the blog post previously cited is trying to do that.
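One small piece of that mapping is turning whatever the user typed into a string that to_tsquery will accept. A minimal sketch, assuming all terms should be AND-ed together as in the 'hello & world' example (the helper name is made up, and the commented-out query assumes a DB-API driver such as psycopg2 and the match operator @@):

```python
import re

def to_tsquery_string(user_input):
    """Turn free text into an AND-ed tsquery string for to_tsquery()."""
    # Keep only word characters, so operators such as & | ! : ( ) typed by
    # the user cannot cause a tsquery syntax error.
    terms = re.findall(r"\w+", user_input.lower())
    return " & ".join(terms)

print(to_tsquery_string("Hello   world!"))

# Hypothetical usage, not executed here:
# cur.execute(
#     "SELECT * FROM pgweb "
#     "WHERE to_tsvector('english', body) @@ to_tsquery('english', %s) "
#     "LIMIT 30",
#     (to_tsquery_string("hello world"),),
# )
```

The helper returns an empty string for input with no usable terms, so the caller should skip the query in that case. PostgreSQL's own plainto_tsquery function covers much the same ground if you would rather do the parsing on the database side.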
The dedicated databases can also be requisitioned with PostGIS, which is a very powerful and fully featured system for indexing and querying geographical data. OpenStreetMap uses the PostgreSQL geometry types (built-in) extensively, and many people combine that with PostGIS to great effect.
Both of these (full text search, PostGIS) take advantage of the extensible data type and indexing infrastructure in Postgres, so you should expect them to perform well for many, many records (spend a little time carefully reviewing the situation if things look busted). You can also take advantage of the fact that these features work in combination with transactions and structured data. For example:
CREATE TABLE products (pk bigserial, price numeric, quantity integer, description text); can just as easily be used with full text search...any text field will do, and it can be in connection with regular attributes (price, quantity in this case).
I'd use Thinking Sphinx, a full text search engine also deployable on Heroku.
It has geo search built-in: http://freelancing-god.github.com/ts/en/geosearching.html
EDIT:
Sphinx is almost ready for Heroku, see here: http://flying-sphinx.com/
IndexTank is now free up to 100k documents on Heroku, we just haven't updated the documentation. This may not be enough for your needs, but I thought I'd let you know just in case.
For full text search via Postgres I recommend pg_search; I am using it myself on Heroku at the moment. I have not used texticle, but from what I can see pg_search has had more development activity lately and it builds upon texticle (it will not add indexes for you, you have to do that yourself).
I cannot find the thread now, but I saw that Heroku offered an option for Postgres geo search, although it was in beta.
My advice, if you are not able to find a Postgres solution, is to host your own instance of Solr (on an EC2 instance) and use the Sunspot Solr gem to integrate it with Rails.
I have implemented my own solution and used WebSolr as well. Basically, what they give you is your own Solr instance, hassle-free. Is it worth the money? In my opinion, no. For integration they use the Sunspot Solr client as well, so the question is just whether you want to pay somebody $20/$40/... to host Solr for you. I know you also get backups, maintenance etc., but call me cheap, I prefer my own instance. Also, WebSolr is locked to the 1.4.x version of Solr.