Scalable Full-Text Search With Per-User Result Ordering (MySQL)

What options exist for creating scalable full-text search with results that need to be sorted on a per-user basis? This is for PHP/MySQL (Symfony/Doctrine as well, if relevant).
In our case, we have a database of workouts that have been performed by users. The workouts that the user has done before should appear at the top of the results. The more frequently they've done the workout, the higher it should appear in search matches. If it helps, you can assume we know the number of times a user has done a workout in advance.
Possible Solutions
Sphinx - Use Sphinx to implement full-text search and do all the querying and sorting in MySQL. This seems promising (and there's a Symfony plugin!), but I don't know much about it.
Lucene - Use Lucene to perform full-text search and put the users' completions into the query, as suggested in this Stack Overflow thread. Alternatively, use Lucene to retrieve the results, then reorder them in PHP. However, both solutions seem clunky and potentially unscalable, as a user may have completed hundreds of workouts.
MySQL - No native full-text support (we're on InnoDB), so we'd have to use LIKE or REGEXP, which isn't scalable.

MySQL does have native FULLTEXT support, though only in MyISAM tables.
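For reference, a minimal PDO sketch of what that looks like; the workouts table and its name/description columns are made-up names:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// One-time setup: FULLTEXT indexes require a MyISAM table on older MySQL versions.
$pdo->exec('ALTER TABLE workouts ADD FULLTEXT INDEX ft_workouts (name, description)');

// Natural-language full-text search, ordered by MySQL's relevance score.
// Two placeholders are used so the query also works when PDO's
// prepared-statement emulation is disabled.
$stmt = $pdo->prepare(
    'SELECT id, name, MATCH(name, description) AGAINST(:q1) AS relevance
       FROM workouts
      WHERE MATCH(name, description) AGAINST(:q2)
   ORDER BY relevance DESC'
);
$stmt->execute(['q1' => 'interval training', 'q2' => 'interval training']);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);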
For most real-world tasks, Sphinx is the fastest engine. However, it uses an external index, so it can only be updated periodically, for example with a cron script.
By using SphinxSE (a pluggable MySQL interface to Sphinx), you can join MySQL tables and Sphinx indexes in one query. Updating, though, will still require an external script.
Since the number of workouts performed seems to change frequently, keeping it in Sphinx would require too much effort on rebuilding the index.
With SphinxSE, you can write a query similar to this:
SELECT *
FROM workouts w
JOIN user_workouts uw ON uw.workout = w.id
WHERE w.query = 'query query query;filter=user_id,$user_id'
  AND uw.user = $user_id
ORDER BY uw.times_performed DESC

I'm not sure why you're assuming that using Lucene would be unscalable. Hundreds of workouts per user is not a lot of data to deal with.
Try using Solr/Lucene for the search backend. It has a JSON/XML interface which will play nicely with your PHP frontend. Store each user's completed-workout counts in a database table. When a query is issued, take the results from Solr, select the counts from the database table, and re-sort in PHP code. This should be plenty fast and scalable. With Solr, maintaining your index is dirt simple; just issue add/update/delete requests to your Solr server.
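A rough sketch of that flow in PHP; the Solr core name, field names, and the user_workouts table are assumptions, not anything from the original question:

// 1) Ask Solr for matching workout ids (JSON response).
$solrUrl = 'http://localhost:8983/solr/workouts/select?' . http_build_query([
    'q'    => 'interval training',
    'fl'   => 'id',
    'rows' => 100,
    'wt'   => 'json',
]);
$docs = json_decode(file_get_contents($solrUrl), true)['response']['docs'];
$ids  = array_column($docs, 'id');

// 2) Fetch this user's completion counts for just those workouts (assumes at least one hit).
$pdo    = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$userId = 42; // hypothetical current user
$in     = implode(',', array_fill(0, count($ids), '?'));
$stmt   = $pdo->prepare(
    "SELECT workout_id, times_performed
       FROM user_workouts
      WHERE user_id = ? AND workout_id IN ($in)"
);
$stmt->execute(array_merge([$userId], $ids));
$counts = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // workout_id => times_performed

// 3) Re-sort: the most frequently performed workouts come first.
usort($ids, function ($a, $b) use ($counts) {
    return ($counts[$b] ?? 0) <=> ($counts[$a] ?? 0);
});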

Related

Uncertain on using Sphinx for these queries

How much can an engine like Sphinx help to find a small set of rows when you are comparing multiple columns instead of doing a full-text search?
As an example, we have a database for each client. Many of these have 10 million+ user rows. The issue is that most of these rows represent one-time logins rather than real users (insert 100 different reasons why this is bad, but these are decisions made by people, not me).
A "real" user can be differentiated from a login or other type of user by doing string comparisons on 3 different columns. (Earlier I excluded full text because these are small values, usually in the 10-15 character range and never over 32.)
Would a tool like Sphinx be a good choice for queries that only want 1 type of user?
Well, Sphinx might help with that, but only because it provides a way to do a sort of "materialized view".
That is, you can do a big, complicated query, composing your 'comparisons' (probably as JOINs in a complicated database query) and creating an aggregation of all the data (even those single logins): a query that might take a LONG time to execute. The 'indexer' tool also has some functions that make it easy to run lots of small queries to compose one big index.
You're still doing the grunt work in the database and just saving the result in a Sphinx index. Sphinx just helps a bit with composing it into a single index (or view).
You can then run arbitrary queries on the Sphinx index, e.g. to list just the 'real' users.
That is, you can run 'fast' queries on the 'intermediate' data stored in Sphinx.
But used that way, Sphinx doesn't offer much advantage over doing the materialized view in MySQL directly; i.e. you could save the intermediate data in MySQL and run your queries there.
It's not really doing anything that CAN'T be done with MySQL, just possibly doing it 'easier' with Sphinx.
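For what it's worth, the "materialized view in MySQL directly" route could be as simple as the following sketch; the filtering conditions and column names are invented, since the question doesn't give them:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Run the slow, complicated query once (e.g. from cron) and save the result.
$pdo->exec('DROP TABLE IF EXISTS real_users');
$pdo->exec(
    "CREATE TABLE real_users AS
     SELECT u.*
       FROM users u
      WHERE u.login_type NOT LIKE 'guest%'   -- invented conditions standing in
        AND u.display_name <> ''             -- for the three string comparisons
        AND u.source <> 'one-time'"
);
$pdo->exec('ALTER TABLE real_users ADD INDEX idx_real_users_name (display_name)');

// Later queries hit the small, pre-filtered table and stay fast.
echo $pdo->query('SELECT COUNT(*) FROM real_users')->fetchColumn();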

500,000-user DB is being rather slow

I have a database with the following structure:
username,email,ip,hash,salt
Currently we have around 600,000 users in this database.
Users are complaining that querying this database is rather slow.
In our tests, we found that it takes around 1.15 seconds to retrieve a user record.
This test is based on the following query:
SELECT * FROM users WHERE email = 'test#mail.com'
I'm no expert in database management. I know how to get by when using it like a dictionary, but I have no idea about database optimization.
I was hoping I could get some help. Ideally, we'd be able to query the DB like this in under a second on even 10 million users.
Does anyone have any suggestions on optimizing simple queries like this? I'm open to anything right now, even restructuring the database if there's a more logical way to do it, because right now the rows are simply stored in the order in which users registered.
MySQL has two important facilities for improving performance. For your type of query, 500,000 rows or 10,000,000 rows is just not a big deal. Although other technologies such as NoSQL stores can perform the same lookups, applications such as yours typically rely on the ACID properties of databases. A relational database is probably the right solution.
The first facility, as mentioned elsewhere, is indexes. In your case:
create index idx_users_email on users(email);
An index incurs a very small amount of overhead for insert and delete operations. However, with the index, looking up a row should take well under 0.1 seconds, even with concurrent queries.
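To confirm the index is actually being used, EXPLAIN the query; this is just a quick check, nothing specific to the poster's setup:

$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$plan = $pdo->query("EXPLAIN SELECT * FROM users WHERE email = 'test#mail.com'")
            ->fetchAll(PDO::FETCH_ASSOC);
print_r($plan); // expect key = idx_users_email and a tiny 'rows' estimate, not ~600,000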
Depending on the other queries you are running, other indexes may be appropriate.
The second important capability is partitioning the tables. This is not necessary for a users table. However, it can be quite useful for transactions and other types of data.
You could add an index, as already mentioned in the comments, but one thought presents itself: you are currently retrieving ALL the information for that row. It would be more efficient to target the query to retrieve only the information that is necessary, such as:
SELECT username FROM users WHERE email = 'test#mail.com';
Also, you should investigate PDO and bound parameters for security.
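For example, a minimal PDO version of the same lookup with a bound parameter (connection details are placeholders):

$pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$email = $_POST['email'] ?? ''; // untrusted user input stays out of the SQL string

$stmt = $pdo->prepare('SELECT username FROM users WHERE email = :email');
$stmt->execute(['email' => $email]);
$username = $stmt->fetchColumn(); // false if no matching row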

Optimizing a database with multiple JOINs

First, some details about the website and the database structure:
On my website you can learn English words, and for each word you can add a sentence, an association, and an image; in addition, each word has a category, sub-category, group, and so on.
My database includes about 20 tables. Any user who registers on my website "adds" something like 4,000 rows to the users table, one per word on the site. I have a serious problem when a user filters words (something like a word "search", but by character(s), category(s), group(s), etc.): I have 9 JOINs in my SQL query, and it takes something like 1 minute to display results.
The purpose of the JOINs: inside the users table (where each user has 4,000 rows, one row per word), there are joins in this style:
$this->db->join('users', 'sentences.id = users.sentence_id', 'left');
The same goes for associations, groups, images, links between words, etc.
The users table includes the IDs of sentences, associations, groups, and so on, and the JOINs make the connections.
I don't know what to do; it takes too much time. Maybe the problem is the structure of the database? The multiple joins? Maybe indexing would help? But how and where? Sometimes it's necessary to retrieve all the words, so I'm not sure indexing would help.
I'm using MySQL.
First of all, if you're using that many joins, indexes will not save you (as they will not be used in joins most of the time).
There are a few things you can do.
Schema Design
You probably would want to reconsider your schema design/query if you need 9 joins to achieve what you are doing!
From the looks of it, it seems your tables are very normalized, perhaps in 3rd normal form? In that case, consider denormalizing your tables into a larger one to avoid joins (joins are more expensive than full table scans!). There is plenty of documentation on this online; however, there are always costs, as it increases development complexity and data redundancy. On the other hand, by denormalizing your tables you avoid joins and can make better use of indexes.
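As a rough illustration of that trade-off (all table and column names here are invented), a flattened copy of the joined data turns a many-join filter into a single-table, indexable query:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Build a denormalized table once, paying the join cost up front.
$pdo->exec(
    'CREATE TABLE user_words_flat AS
     SELECT uw.user_id,
            w.word,
            c.name AS category,
            g.name AS word_group,
            s.text AS sentence
       FROM user_words uw
       JOIN words w       ON w.id = uw.word_id
  LEFT JOIN categories c  ON c.id = w.category_id
  LEFT JOIN word_groups g ON g.id = w.group_id
  LEFT JOIN sentences s   ON s.id = uw.sentence_id'
);
$pdo->exec('ALTER TABLE user_words_flat ADD INDEX idx_user_category (user_id, category)');

// Filtering now reads one table instead of joining nine.
$stmt = $pdo->prepare(
    'SELECT word, sentence FROM user_words_flat WHERE user_id = ? AND category = ?'
);
$stmt->execute([1, 'Animals']);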
Also, I believe MyISAM is the only storage engine in MySQL that supports FULLTEXT indexes. However, it does not have transactions, uses table-level locking, and has no MVCC, so it depends on what you need.
Resources
I suggest you have a read of the book High Performance MySQL, a truly awesome book on tuning MySQL databases.
I also suggest reading the official documentation for your chosen storage engine. This matters because each storage engine is VERY DIFFERENT! InnoDB is completely different from MyISAM, which in turn is completely different from PBXT. Each engine has its benefits, and you will have to consider which one fits your situation.
I would draw out the relational schema and work out the number of operations for the queries you are running, and go from there. Most DBMSs attempt to optimise queries implicitly, but not always optimally. You should look into re-ordering the joins so that the most restrictive are carried out first. Indexes could help and, again, would require some analysis to find which attributes you are searching on.
Building databases to deal with natural language is a very challenging subject, and there is a lot of research in the area. Have you looked into Markov chains? Have you taken a step back and thought about the computational complexity of what you are trying to do? If you arrive at the same conclusion of nine joins, then it may be fair to say that the problem is not scalable enough for a real-time application.
As an aside, I believe Google App Engine's data store attempts to index attributes for you, with implicit scalability. If you're running your database on a small web server, then you may see better results deploying it with a more comprehensive DBMS. I would only look into this as a last resort, however.

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3,000-10,000 records with 20-30 fields each, and it is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be covered in all aspects. There are certainly specifics that may make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr, as an example of the several search engines that all function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source, such as an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text analysis, stemming, full-text scoring, and fuzziness functions, things you just cannot do with MySQL. In fact, full-text search in MySQL is limited to MyISAM, and its scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. ranges from very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process them. You can, of course, index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations, and do it both ways to get n:n, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, for a product catalog where you want to store the distributors for products and only search for parts that are available from certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You can almost never do something like "get all products that are available from at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search with 40,000 hits, you can display that you would only get 3 hits if you refined your search to the combination of this field having this value and that field having that value. Things that need additional queries in MySQL are done efficiently and conveniently.
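A single Solr request can return those refinement counts alongside the hits; a sketch in PHP with an assumed core name and field:

$url = 'http://localhost:8983/solr/items/select?' . http_build_query([
    'q'           => 'broad search terms',
    'rows'        => 10,
    'wt'          => 'json',
    'facet'       => 'true',
    'facet.field' => 'category',
]);
$result = json_decode(file_get_contents($url), true);

$total = $result['response']['numFound'];
// Flat list alternating facet value and hit count, e.g. ['strength', 3, 'cardio', 120, ...]
$refinements = $result['facet_counts']['facet_fields']['category'];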
So let's sum up
The power of Lucene is text searching and analysis. It is also mind-blowingly fast because of its inverted index structure. You can do a lot of post-processing and satisfy other needs. Although it is document-oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and query. If your application is focused on text searching, you should definitely go for Solr/Lucene, unless you have good reasons, such as very complex, multi-dimensional range-filter queries, to do otherwise.
If you do not have text search, but rather something where users point and click instead of entering text, good old relational databases are probably the better way to go.
Use Solr if:
You do not want to stress your database.
You want real full-text search.
You want lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full-text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I previously worked with one of India's best online real-estate classifieds portals, which used Solr for search (and had previously used MySQL). The migration reduced search times manifold.
Solr can be easily integrated with MySQL:
Solr's full data import can be used for importing data from MySQL tables into Solr collections.
Solr's delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.

Efficient Filtering / Searching

We have a hosted application that manages pages of content. Each page can have a number of customized fields, and some standard fields (timestamp, user name, user email, etc).
With potentially hundreds of different sites using the system, what is an efficient way to handle filtering/searching? Picture a grid view that you want to narrow down. You can filter on specific fields (user ID, date) or you can enter a full-text search.
For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But things like "all pages started by a user whose userid is 10 and matches [some search query]" would suck against the database, so it's suited for a search engine like Lucene.
Basically I'm wondering how other large sites do this sort of thing. Do they utilize a search engine 100% for all types of filtering? Do they mix database queries with a search engine?
If we use only a search engine, there's a problem with the delay before a new or updated object appears in the search index. That is, I've read that it's not smart to update the index immediately, and that it's better to do it in batches instead. Even if that means every 5 minutes, users will get confused when their recently added page isn't immediately listed when they view a simple page listing (say, a search query of "category:5").
We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?
My thought is to offer a simple filtering page which uses MySQL to filter on basic fields. Then offer a separate fulltext search page that would present results similar to Google. Is this the only way?
Solr and grassyknoll both provide slightly more abstract interfaces to Lucene.
That said: yes. If yours is a primarily content-driven site providing full-text searching over your data, you'll want something beyond LIKE. While MySQL's FULLTEXT indexes aren't perfect, they might be an acceptable placeholder in the interim.
Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward: simply add a stored property to the document at index time (this property can be a URL, ID, GUID, etc.). Then searching becomes a two-phase system:
1) Issue the query against the Lucene index (display simple results, like the title).
2) Get more detailed information about the object from your relational store by its key.
Since instantiation of Documents is relatively expensive in Lucene, you only want to store the fields you search on in the Lucene index, as opposed to complete clones of your relational objects.
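A PHP sketch of that two-phase flow, similar in shape to the earlier workout example; the Solr core, field names, and pages table are assumptions:

// Phase 1: hit the index, asking only for the stored key (and a title for display).
$hits = json_decode(file_get_contents(
    'http://localhost:8983/solr/pages/select?' . http_build_query([
        'q' => 'some search terms', 'fl' => 'id,title', 'wt' => 'json',
    ])
), true)['response']['docs'];

// Phase 2: load the full objects from MySQL by key (assumes at least one hit).
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$ids  = array_column($hits, 'id');
$stmt = $pdo->prepare('SELECT * FROM pages WHERE id IN ('
                      . implode(',', array_fill(0, count($ids), '?')) . ')');
$stmt->execute($ids);
$pages = $stmt->fetchAll(PDO::FETCH_ASSOC);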
Don't write off MySQL so readily!
Implement it using the database, e.g. a SELECT with a LIKE in the WHERE clause, or whatever.
Profile it and add indexes if necessary. Roll out a beta so you get real numbers from users' actual data patterns; not all columns will be queried equally often, etc.
If the performance does suck, that's when you consider other options: tuning your SQL, your database, the machine the database is running on, and finally using another technology stack...
If you want to use MySQL or PostgreSQL, an open-source solution that works great with either is Sphinx:
http://www.sphinxsearch.com/
We are having the same problem and considering Sphinx and Lucene as possible solutions.