Uncertain about using Sphinx for these queries - MySQL

How much can an engine like Sphinx help to find a small set of rows when you are comparing multiple columns instead of doing a full-text search?
As an example, we have a database for each client. Many of these have 10 million+ user rows. The issue is that most of these rows represent one-time logins rather than real users (insert 100 different reasons why this is bad, but these were decisions made by people other than me).
A "real" user can be differentiated from a login or other type of user by doing string comparisons on 3 different columns. (Earlier I excluded full text because these are small values - usually in the 10-15 character range, and never over 32.)
Would a tool like Sphinx be a good choice for queries that only want one type of user?

Well, Sphinx might help with that, but only because it provides a way to build a sort of "materialized view".
That is, you can run one big, complicated query - composing your 'comparisons', probably as JOINs - to create an aggregation of all the data (even those single logins). Such a query might take a LONG time to execute. The 'indexer' tool has some functions that make it practical to run lots of small queries and compose them into one big index.
So you are still doing the grunt work in the database and just saving the result in a Sphinx index; Sphinx just helps a bit with composing it all into a single index (or view).
You can then run arbitrary queries on the Sphinx index, e.g. to list just the 'real' users.
In other words, you get 'fast' queries on the 'intermediate' data stored in Sphinx.
But used that way, Sphinx doesn't offer much advantage over building the materialized view in MySQL directly; you could just as well save the intermediate data in MySQL and run your queries there.
You're not really doing anything that CAN'T be done with MySQL, just possibly doing it more 'easily' with Sphinx.
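As a rough sketch of the MySQL-only variant - the table, columns, and comparison values below are assumptions, since the question doesn't show the real schema:
-- Hypothetical summary table acting as a materialized view of "real" users.
-- The three string comparisons stand in for whatever actually distinguishes
-- real users from one-time logins in your schema.
CREATE TABLE real_users AS
SELECT u.*
FROM users u
WHERE u.login_type  = 'standard'   -- assumed column and value
  AND u.origin_code = 'signup'     -- assumed column and value
  AND u.account_tag = 'member';    -- assumed column and value
-- Index whatever you filter on afterwards, then query the small table:
CREATE INDEX idx_real_users_email ON real_users (email);
SELECT * FROM real_users WHERE email = 'someone@example.com';
You would refresh this table on whatever schedule you would otherwise re-run the Sphinx indexer.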

Optimizing of database with multiple JOINs

First, some details about the website and the database structure -
On my website you can learn English words, and for each word you can add a sentence, an association, and an image. In addition, each word has a category, sub-category, group...
My database includes about 20 tables. Any user who registers on my website 'adds' about 4,000 rows to the users table - one per word on the site. I have a serious problem when a user filters words (something like a word 'search', but by character(s), category(s), group(s), etc.): I have 9 JOINs in my SQL query, and it takes about a minute to display results.
The purpose of the JOINs: inside the users table (where each user has 4,000 rows, each row = one word) there are joins in this style:
$this->db->join('users', 'sentences.id = users.sentence_id', 'left');
The same goes for associations, groups, images, binds between words, etc.
The users table includes the ids of sentences, associations, groups... and the JOIN makes the connection.
I don't know what to do; it takes too much time. Maybe the problem is the structure of the database? The multiple joins? Maybe indexing would help? But how, and where? It's sometimes necessary to retrieve all the words, so indexing alone wouldn't help.
I'm using MySQL.
First of all, if you're using that many joins, indexes will not save you (as they will not be used for most of the joins).
There are a few things you can do.
Schema Design
You should probably reconsider your schema design/query if you need 9 joins to achieve what you are doing!
From the looks of it, your tables are very normalized, perhaps in 3rd normal form. In that case, consider denormalizing your tables into a larger one to avoid joins (joins can be more expensive than full table scans!). There is plenty of documentation on this online. There are always costs, however: it increases development complexity and data redundancy. On the other hand, by denormalizing your tables you avoid joins and can make better use of indexes.
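For example, the flattened variant might be built once and refreshed periodically, so the filter query touches a single table. The table and column names below are assumptions based on the question:
-- Hypothetical flattened table combining the word data that is currently
-- spread across many joined tables.
CREATE TABLE user_words_flat AS
SELECT uw.user_id,
       w.id    AS word_id,
       w.word,
       c.name  AS category,
       g.name  AS word_group,
       s.text  AS sentence,
       a.text  AS association
FROM user_words uw
JOIN words w              ON w.id = uw.word_id
LEFT JOIN categories c    ON c.id = w.category_id
LEFT JOIN word_groups g   ON g.id = w.group_id
LEFT JOIN sentences s     ON s.id = uw.sentence_id
LEFT JOIN associations a  ON a.id = uw.association_id;
-- Filtering then becomes a single-table query that an index can serve:
CREATE INDEX idx_flat_filter ON user_words_flat (user_id, category, word_group);
SELECT * FROM user_words_flat
WHERE user_id = 42 AND category = 'Animals' AND word LIKE 'c%';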
Also, I believe MyISAM is the only storage engine in MySQL that supports FULLTEXT indexes. However, it does not support transactions, has table-level locking, and lacks MVCC, so it depends on what you need.
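If you do lean on MyISAM's full-text support, the index is declared per table; a minimal sketch, reusing the hypothetical flattened table from above:
-- MyISAM conversion plus a FULLTEXT index (MyISAM-only at the time of writing).
ALTER TABLE user_words_flat ENGINE = MyISAM;
ALTER TABLE user_words_flat ADD FULLTEXT INDEX ft_word (word);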
Resources
I suggest you have a read of the book High Performance MySQL.
It is a truly awesome book on tuning MySQL databases.
I also suggest reading the official documentation on your chosen storage engine. This matters because each storage engine is VERY DIFFERENT! InnoDB is completely different from MyISAM, which in turn is completely different from PBXT. Each engine has its benefits, and you will have to consider which one fits your situation.
I would draw out the relational schema and work out the number of operations for the queries you are running, and go from there. Most DBMSs attempt to optimise queries implicitly, but not always optimally. You should look into re-ordering the joins so that the most restrictive ones are carried out first. Indexes could help, and again, that would require some analysis to find which attributes you are searching on.
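In MySQL specifically, one cheap way to experiment with join order is the STRAIGHT_JOIN modifier, which forces tables to be joined in the order listed (the tables here are placeholders):
-- Force the most restrictive table to be read first, then compare the
-- EXPLAIN output and timing against the optimiser's own choice.
SELECT STRAIGHT_JOIN w.word, s.text
FROM user_words uw
JOIN words w     ON w.id = uw.word_id
JOIN sentences s ON s.id = uw.sentence_id
WHERE uw.user_id = 42;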
Building databases to deal with natural language is a very challenging subject and there is a lot of research on the subject. Have you looked into Markov chains? Have you taken a step back and thought about the computational complexity of what you are trying to do? If you arrive at the same conclusion of nine joins, then it may be fair to say that the problem is not scalable enough for a real-time application.
As an aside, I believe Google App Engine's data store attempts to index attributes for you, with implicit scalability. If you're running your database on a small web server, then you may see better results deploying it with a more comprehensive DBMS. I would only look into this as a last resort, however.

Search Short Fields Using Solr, Etc. or Use Straight-Forward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySQL table [word, entity_id], have 'word' indexed, and then query using:
select entity_id from search_index where word like '[query_word]%'
This obviously requires me to break each title down into its words and add a row for each word.
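For concreteness, that table-of-words option might look like this (names and column sizes are assumptions; the question states titles are at most 100 characters):
-- One row per (word, entity) pair, with the word column indexed.
CREATE TABLE search_index (
    word      VARCHAR(100) NOT NULL,
    entity_id INT          NOT NULL,
    INDEX idx_word (word)
);
-- A trailing-wildcard LIKE can use the index on word;
-- a leading wildcard ('%...') could not.
SELECT entity_id FROM search_index WHERE word LIKE 'quer%';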
Use Solr or some similar search engine, which from my reading is more oriented towards full-text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future?
Thank you!
Pros of a Database-Only Solution:
Less set up and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (as there is if you periodically sync Solr with your database) or extra maintenance procedure (as there is if you add/update entries in Solr in real time everywhere you insert them into the database)
Pros of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think what you are getting at is having some kind of logic around word search; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure whether you would need to create an entry for every word in your database, versus just running a '%[query_word]%' search against the titles directly. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database performance.
I can tell you we use Solr on site and we love the performance and we use it for even very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.

Scalable Full Text Search With Per User Result Ordering

What options exist for creating a scalable full-text search with results that need to be sorted on a per-user basis? This is for PHP/MySQL (Symfony/Doctrine as well, if relevant).
In our case, we have a database of workouts that have been performed by users. The workouts that the user has done before should appear at the top of the results. The more frequently they've done the workout, the higher it should appear in search matches. If it helps, you can assume we know the number of times a user has done a workout in advance.
Possible Solutions
Sphinx - Use Sphinx to implement full-text search, and do all the querying and sorting in MySQL. This seems promising (and there's a Symfony plugin!), but I don't know much about it.
Lucene - Use Lucene to perform full-text search and put the users' completions into the query, as is suggested in this Stack Overflow thread. Alternatively, use Lucene to retrieve the results and then reorder them in PHP. However, both solutions seem clunky and potentially unscalable, as a user may have completed hundreds of workouts.
MySQL - No native full-text support (we're on InnoDB), so we'd have to use LIKE or REGEXP, which isn't scalable.
MySQL does have native FULLTEXT support, though only in MyISAM tables.
For most real-world tasks, Sphinx is the fastest engine. However, it is an external index, so it can only be updated periodically, e.g. by a cron script.
By using SphinxSE (a pluggable MySQL interface to Sphinx), you can join MySQL tables and Sphinx indexes in one query. Updating, though, will still require an external script.
Since the number of workouts performed seems to change frequently, keeping it in Sphinx would require too much effort spent rebuilding the index.
With SphinxSE, you can write a query similar to this:
SELECT *
FROM workouts w                 -- SphinxSE table mapped to the Sphinx index
JOIN user_workouts uw
  ON uw.workout = w.id
WHERE w.query = 'query query query;filter=user_id,$user_id'  -- full-text query plus attribute filter
  AND uw.user = $user_id
ORDER BY
  uw.times_performed DESC
I'm not sure why you're assuming that using Lucene would be unscalable. Hundreds of workouts per user is not a lot of data to deal with.
Try using Solr/Lucene for the search backend. It has a JSON/XML interface that will play nicely with your PHP frontend. Store each user's completed-workout counts in a database table. When a query is issued, take the results from Solr, then select from the database table and re-sort in PHP code. That should be plenty fast and scalable. With Solr, maintaining your index is dirt simple; just issue add/update/delete requests to your Solr server.
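The database half of that re-sort might look like the following sketch, where the table and column names are assumptions and the IN list holds whatever IDs Solr returned:
-- Hypothetical per-user completion counts keyed by workout.
SELECT w.id, w.name, COALESCE(uw.times_performed, 0) AS times_performed
FROM workouts w
LEFT JOIN user_workouts uw
  ON uw.workout_id = w.id AND uw.user_id = 42
WHERE w.id IN (101, 205, 317)   -- IDs returned by Solr for the query
ORDER BY times_performed DESC;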

Search implementation dilemma: full text vs. plain SQL

I have a MySQL/Rails app that needs search. Here's some info about the data:
Users search within their own data only, so searches are narrowed down by user_id to begin with.
Each user will have up to about five thousand records (they accumulate over time).
I wrote out a typical user's records to a text file. The file size is 2.9 MB.
Search has to cover two columns: title and body. title is a varchar(255) column. body is column type text.
This will be lightly used. If I averaged even a few searches per second, that would be surprising.
It's running on a 500 MB CentOS 5 VPS.
I don't want relevance ranking or any kind of fuzziness. Searches should be for exact strings and reliably return all records containing the string. Simple date order -- newest to oldest.
I'm using the InnoDB table type.
I'm looking at plain SQL search (through the searchlogic gem) or full text search using Sphinx and the Thinking Sphinx gem.
Sphinx is very fast and Thinking Sphinx is cool, but it adds complexity: a daemon to maintain and cron jobs to keep the index up to date.
Can I get away with plain SQL search for a small scale app?
I don't think plain SQL search would be a good choice, because when fetching TEXT-type columns in MySQL, the request always goes to the hard drive, no matter what the cache settings are.
You can get away with plain SQL search only in very small apps.
I'd prefer Sphinx for this.
I would start out simple -- chances are that plain SQL will work well, and you can always switch to full text search later if the search function proves to be a bottleneck.
I'm developing and maintaining an application with a search function with properties similar to yours, and plain SQL search has worked very well for me so far. I had similar performance concerns when I first implemented the search function a year or two ago, but I haven't seen any performance problems whatsoever yet.
Having used MySQL full-text search for about 4 years, and only now moving to Sphinx, I'd say that a regular MySQL search using the full-text boolean (i.e. exact) syntax will be fine. It's fast, and it will do exactly what you want. The amount of data you will be searching at any one time will be small.
The only problem might be ordering the results. MySQL's full-text search can get slow when you start ordering things by (e.g.) date, as that requires searching the entire table rather than just the first nn results it finds. That was ultimately the reason I moved to Sphinx.
Sphinx is also awesome, so don't be afraid to try it, but it sounds like the additional functionality may not be required in your case.
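Under the asker's constraints (exact strings, per-user scoping, newest first), the boolean-mode query might look like this. The table and column names are assumptions, and note that on the MySQL versions discussed here the FULLTEXT index requires a MyISAM table:
-- Assumed schema: records(id, user_id, title, body, created_at)
-- with a FULLTEXT index on (title, body).
SELECT id, title, created_at
FROM records
WHERE user_id = 42
  AND MATCH(title, body) AGAINST ('+"exact phrase"' IN BOOLEAN MODE)
ORDER BY created_at DESC;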

Efficient Filtering / Searching

We have a hosted application that manages pages of content. Each page can have a number of customized fields, and some standard fields (timestamp, user name, user email, etc).
With potentially hundreds of different sites using the system -- what is an efficient way to handle filtering/searching? Picture a grid view that you want to narrow down. You can filter on specific fields (userid, date) or you can enter a full-text search.
For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But things like "all pages started by a user whose userid is 10 and matches [some search query]" would suck against the database, so it's suited for a search engine like Lucene.
Basically I'm wondering how other large sites do this sort of thing. Do they utilize a search engine 100% for all types of filtering? Do they mix database queries with a search engine?
If we use only a search engine, there's the problem of the delay before a new/updated object appears in the search index. That is, I've read that it's not smart to update the index immediately, and that it should instead be done in batches. Even if this means every 5 minutes, users will be confused when their recently added page isn't immediately listed when they view a simple page listing (say, a search query of "category:5").
We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?
My thought is to offer a simple filtering page which uses MySQL to filter on basic fields. Then offer a separate fulltext search page that would present results similar to Google. Is this the only way?
Solr or grassyknoll both provide slightly more abstract interfaces to Lucene.
That said: yes. If you are a primarily content-driven site providing full-text searching over your data, you need something beyond LIKE. While MySQL's FULLTEXT indexes aren't perfect, they might be an acceptable placeholder in the interim.
Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward: simply add a stored property to the document at index time (this property can be a URL, ID, GUID, etc.). Then searching becomes a 2-phase system:
1) Issue the query to the Lucene indexes (and display simple results, like the title)
2) Get more detailed information about the object from your relational store by its key
Since instantiation of Documents is relatively expensive in Lucene, you only want to store the fields you search on in the Lucene index, as opposed to complete clones of your relational objects.
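Phase 2 is then an ordinary keyed lookup. A sketch, with the table name as a placeholder and the IDs standing in for the keys returned by the Lucene query:
-- The stored Lucene property supplies the keys for the IN (...) list.
SELECT id, title, author, created_at
FROM pages
WHERE id IN (17, 42, 93);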
Don't write off MySQL so readily!
Implement it using the database, e.g. a SELECT with a LIKE in the WHERE clause, or whatever.
Profile it, and add indexes if necessary. Roll out a beta so you get real numbers from users' actual data patterns - not all columns may be equally asked after, etc.
If the performance does suck, that's when you consider other options: tuning your SQL, your database, the machine the database is running on, and finally another technology stack...
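The question's own example ("all pages started by a user whose userid is 10 and matches some search query") would start out as plain SQL like this (schema assumed):
-- Assumed schema: pages(id, user_id, title, body, created_at).
-- The leading-wildcard LIKE cannot use an index on the text columns,
-- but an index on user_id still narrows the scan considerably.
SELECT id, title, created_at
FROM pages
WHERE user_id = 10
  AND (title LIKE '%search terms%' OR body LIKE '%search terms%')
ORDER BY created_at DESC;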
If you want to use MySQL or PostgreSQL, an open source solution that works great with them is Sphinx:
http://www.sphinxsearch.com/
We are having the same problem and considering Sphinx and Lucene as possible solutions.