Search Box - Selecting Data from Multiple Tables - MySQL

In my application, I want to create a "universal search" box that will allow users to perform a general search on any of the 'informational' data within the database. The system happens to be an account management system, so ideally they'd be able to search for e-mail addresses, usernames, IDs, etc.
I've been searching around the web for a solution but I haven't come to a conclusion yet so I figured I'd ask the question on SO.
What's the best way to perform a 'search' query on the database and return potential results from multiple tables?
My initial thought was to perform a SELECT query on each individual table using a wildcard for each 'searchable' column. Would this be a correct approach?
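For illustration, the approach I have in mind would stitch the per-table queries together with UNION ALL, roughly like this (the table and column names are hypothetical, and $term stands for the user's input):

SELECT 'users' AS source, id, username AS matched_value
FROM users
WHERE username LIKE '%$term%' OR email LIKE '%$term%'
UNION ALL
SELECT 'accounts' AS source, id, account_name AS matched_value
FROM accounts
WHERE account_name LIKE '%$term%';

Note that a leading-wildcard LIKE cannot use an index, so every table would get a full scan.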

I would use a dedicated search engine for this kind of "universal search". For example, Sphinx, a free, open-source SQL full-text search engine.
A SELECT query on each table will perform very poorly once the database grows large enough.

Related

Rails and MySQL - adding columns to a table dynamically based upon form values

I'm working on a legacy app - right now, we allow admins to generate forms with custom fields (they create the field, choose an input type, label, etc).
When the user fills out this custom form, all fields from that form are checked - if that field is not a column on the users table, we add it to the users table as a column.
For example, if an admin added a field called "flight arrival time", we would add a column called "flight_arrival_time" to the users table, and the User model would have an attribute called #user.flight_arrival_time.
What alternatives might there be to this current course of action? Is there a more efficient way of storing these values?
Here are some of the limitations:
We have tens of thousands of users. (I was told that storing these attributes in a different table and joining them would slow the system a LOT. We often have around 20 or so admins querying, importing, updating, and generally using the hell out of our system, which is already pretty slow under load. I wouldn't have the power to say "buy more {X} so we can be faster.")
I assume a join table (called something like user_attributes) would store the user_id, the attribute name, and the attribute value; see the sketch after this list. If each user has an additional 15 attributes, and we have 100,000 users, how much slower will it be?
The storage must be easily queryable. (We use a dynamic search that allows users to choose any column from the User model as a search field and find an inputted value.)
Would your option allow easy queries (for instance, finding all users whose attribute named "Flight Arrival Time" is tomorrow)? Would this also become very slow?
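For concreteness, the join table and the "arrives tomorrow" query I'm imagining would look roughly like this (the names and the VARCHAR value column are just assumptions):

CREATE TABLE user_attributes (
  user_id INT NOT NULL,
  name    VARCHAR(255) NOT NULL,
  value   VARCHAR(255),
  PRIMARY KEY (user_id, name),
  KEY idx_name_value (name, value)
);

-- all users whose "Flight Arrival Time" attribute is tomorrow
SELECT u.*
FROM users u
JOIN user_attributes ua ON ua.user_id = u.id
WHERE ua.name = 'Flight Arrival Time'
  AND DATE(ua.value) = CURDATE() + INTERVAL 1 DAY;

Since every value is stored as a string, the DATE() cast on ua.value defeats the (name, value) index for the date comparison, which is exactly the kind of slowdown I'm worried about.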
I will experiment a bit: generate some of the proposed schemas, create 100,000 users with 20 attributes each, and run some test queries to check execution times. But I'd like some input on where to start.
Thanks for your input.
Not exactly an answer, but I think this kind of app would benefit a lot from a document-oriented / NoSQL database like MongoDB.
Such systems are schema-less by design.
To add my two cents, letting users make dynamic changes to the schema seems a very dangerous option in an RDBMS environment to begin with. You could end up with tables with thousands of mostly empty columns, and Rails would instantiate objects with thousands of methods on them... and what happens when you delete a column?
In the long run, the approach you are following can make your database very slow.
Your database size will grow, since adding columns according to user behavior leads to NULL values in all the other tuples.
You would be better off using a document-oriented database like MongoDB, CouchDB, or Cassandra.
There's a gem for that. https://github.com/Liooo/dynabute
Also, this question is a duplicate of
Rails: dynamic columns/attributes on models?

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit highlighting, faceted search, etc.
The dataset is likely to be between 3,000 and 10,000 records with 20-30 fields each, all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be addressed in all aspects. There are certainly specifics that may make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr, as an example standing in for several search engines that all function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source, such as an SQL table. In fact, Solr is perfect for storing documents that combine data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text analysis, stemming, full-text-search scoring, and fuzziness functions, things you simply cannot do with MySQL. In fact, full-text search in MySQL is limited to MyISAM, and its scoring is trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results by phrase proximity, matching accuracy, etc. ranges from very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations or process them. You can, of course, index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations and do it both ways to get n:m, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, a product catalog where you store the distributors for each product and want to search only parts that are available at certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You can almost never do something like "get all products that are available at at least 3 distributors" (trivial in SQL; see the sketch after this list).
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that had 40,000 hits, you can display that you would get only 3 hits if you refined your search to the combination of this field having this value and that field having that value. Stuff that needs additional queries in MySQL is done efficiently and conveniently.
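To make the relational contrast concrete, here is what the two examples above would look like in plain SQL (hypothetical tables; the first is the "at least 3 distributors" query that Solr cannot express, the second is the extra per-field query you would need in MySQL to imitate a facet count):

-- all products available at at least 3 distributors
SELECT product_id
FROM product_distributors
GROUP BY product_id
HAVING COUNT(DISTINCT distributor_id) >= 3;

-- facet-style counts: one extra GROUP BY query per field
SELECT category, COUNT(*) AS hits
FROM products
WHERE name LIKE '%$term%'
GROUP BY category;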
So let's sum up:
The power of Lucene is text searching/analyzing. It is also mind-blowingly fast because of its inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it's document-oriented and has no "graph querying" like triple stores do with SPARQL, basic n:m relations are possible to store and query. If your application is focused on text searching, you should definitely go for Solr/Lucene, unless you have good reasons, such as very complex multi-dimensional range-filter queries, to do otherwise.
If you do not have text search, but rather an interface where users point and click rather than enter text, good old relational databases are probably the better way to go.
Use Solr if you want to:
Avoid stressing your database.
Get real full-text search.
Serve lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full-text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I previously worked with one of India's best real-estate online classifieds portals, which was using Solr for search (and had previously used MySQL). The migration reduced search times manifold.
Solr can be easily integrated with MySQL:
Solr's full data import can be used for importing data from MySQL tables into Solr collections.
Solr's delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.
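The SQL side of those two imports is typically just a pair of queries keyed on a timestamp column. A minimal sketch, assuming a hypothetical articles table with a last_modified column (the surrounding data-config.xml wiring is omitted):

-- full import: pull everything
SELECT id, title, body FROM articles;

-- delta import: only rows touched since the last run;
-- DataImportHandler substitutes the last index time itself
SELECT id FROM articles
WHERE last_modified > '${dataimporter.last_index_time}';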

Search Short Fields Using Solr, Etc. or Use a Straightforward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySQL table [word, entity_id], have 'word' indexed, and then query using
SELECT entity_id FROM search_index WHERE word LIKE '[query_word]%';
This obviously requires me to break each title down into its words and add a row for each word.
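A sketch of that table, with the index that makes the prefix query fast (all names hypothetical):

CREATE TABLE search_index (
  word      VARCHAR(100) NOT NULL,
  entity_id INT NOT NULL,
  PRIMARY KEY (word, entity_id)
);

-- a prefix LIKE can use the (word, entity_id) index;
-- a leading wildcard ('%word%') could not
SELECT entity_id FROM search_index WHERE word LIKE 'run%';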
Use Solr or some similar search engine, which from my reading is more oriented towards full-text search.
Also, how would this affect me if I wanted to introduce spelling suggestions in the future?
Thank you!
Pros of a Database-Only Solution:
Less setup and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (as there would be if you periodically synced Solr with your database) and no extra maintenance procedure (as there would be if you added/updated entries in Solr in real time everywhere you insert them into the database)
Pros of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell-check-type stuff, Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure you would need to create an entry for every word, versus just running a '%[query_word]%' search, if you are going to create records for each word anyway. It may be simpler to just go with the database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database's performance.
I can tell you we use Solr on our site, and we love its performance; we use it even for very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day, there is no easy answer.

Scalable Full Text Search With Per User Result Ordering

What options exist for creating a scalable, full text search with results that need to be sorted on a per user basis? This is for PHP/MySQL (Symfony/Doctrine as well, if relevant).
In our case, we have a database of workouts that have been performed by users. The workouts that the user has done before should appear at the top of the results. The more frequently they've done the workout, the higher it should appear in search matches. If it helps, you can assume we know the number of times a user has done a workout in advance.
Possible Solutions
Sphinx - Use Sphinx to implement full text search, do all the querying and sorting in MySQL. This seems promising (and there's a Symfony Plugin!) but I don't know much about it.
Lucene - Use Lucene to perform full-text search and put the user's completions into the query, as is suggested in this Stack Overflow thread. Alternatively, use Lucene to retrieve the results, then reorder them in PHP. However, both solutions seem clunky and potentially unscalable, as a user may have completed hundreds of workouts.
MySQL - No native full-text support (InnoDB), so we'd have to use LIKE or REGEXP, which isn't scalable.
MySQL does have native FULLTEXT support, though only for MyISAM tables.
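For reference, a minimal sketch of that support (the table and search terms are hypothetical):

CREATE TABLE workouts (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  name        VARCHAR(255),
  description TEXT,
  FULLTEXT KEY ft_name_desc (name, description)
) ENGINE=MyISAM;

-- MATCH ... AGAINST returns a relevance score you can sort on
SELECT id, name,
       MATCH(name, description) AGAINST ('interval training') AS score
FROM workouts
WHERE MATCH(name, description) AGAINST ('interval training')
ORDER BY score DESC;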
For most real-world tasks, Sphinx is the fastest engine. However, it is an external index, so it can only be updated periodically, e.g. by a cron script.
By using SphinxSE (a pluggable MySQL interface to Sphinx), you can join MySQL tables and Sphinx indexes in one query. Updating, though, will still require an external script.
Since the number of workouts performed seems to change frequently, keeping it in Sphinx would require too much index-rebuilding effort.
With SphinxSE, you can write a query similar to this:
SELECT *
FROM workouts w
JOIN user_workouts uw ON uw.workout = w.id
WHERE w.query = 'query query query;filter=user_id,$user_id'
  AND uw.user = $user_id
ORDER BY uw.times_performed DESC;
I'm not sure why you're assuming using Lucene would be unscalable. Hundreds of workouts per user is not a lot of data to deal with.
Try using Solr/Lucene for the search backend. It has a JSON/XML interface that will play nicely with your PHP frontend. Store each user's completed-workout counts in a database table. When a query is issued, take the results from Solr, select the counts from the database table, and re-sort in PHP code. That should be plenty fast and scalable. With Solr, maintaining your index is dirt simple; just issue add/update/delete requests to your Solr server.

Efficient Filtering / Searching

We have a hosted application that manages pages of content. Each page can have a number of customized fields, and some standard fields (timestamp, user name, user email, etc).
With potentially hundreds of different sites using the system -- what is an efficient way to handle filtering/searching? Picture a grid view that you want to narrow down. You can filter on specific fields (userid, date) or you can enter a full-text search.
For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But things like "all pages started by a user whose userid is 10 and matches [some search query]" would suck against the database, so it's suited for a search engine like Lucene.
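Concretely, the two cases would look something like this (the table and column names are made up; $term is the search input):

-- fast: simple indexed filter
SELECT * FROM pages WHERE started_by = 10;

-- slow: the combined case forces a scan of the text columns
SELECT *
FROM pages
WHERE started_by = 10
  AND (title LIKE '%$term%' OR body LIKE '%$term%');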
Basically I'm wondering how other large sites do this sort of thing. Do they utilize a search engine 100% for all types of filtering? Do they mix database queries with a search engine?
If we use only a search engine, there's a problem with the delay before a new/updated object appears in the search index. That is, I've read that it's not smart to update the index immediately, but to do it in batches instead. Even if this means every 5 minutes, users will get confused when their recently added page isn't immediately listed when they view a simple page listing (say, a search query of "category:5").
We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?
My thought is to offer a simple filtering page which uses MySQL to filter on basic fields. Then offer a separate fulltext search page that would present results similar to Google. Is this the only way?
Solr or grassyknoll both provide slightly more abstract interfaces to Lucene.
That said: yes. If you are a primarily content-driven site providing full-text searching over your data, something beyond LIKE is in play. While MySQL's FULLTEXT indexes aren't perfect, they might be an acceptable placeholder in the interim.
Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward: simply add a stored field to the document at index time (this field can be a URL, ID, GUID, etc.). Then searching becomes a two-phase system:
1) Issue the query to the Lucene indexes (display simple results like the title)
2) Get more detailed information about the object from your relational store by its key
Since instantiation of Documents is relatively expensive in Lucene, you only want to store the fields you search on in the Lucene index, as opposed to complete clones of your relational objects.
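Phase 2 is then an ordinary keyed lookup. A sketch, assuming the stored field is the row's primary key (the IDs shown are just whatever phase 1 returned):

-- e.g. IDs 17, 42, 95 came back from the Lucene query
SELECT id, title, created_at
FROM pages
WHERE id IN (17, 42, 95);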
Don't write off MySQL so readily!
Implement it using the database, e.g. a SELECT with a LIKE in the WHERE clause, or whatever.
Profile it, and add indexes if necessary. Roll out a beta so you get real numbers from users' actual data patterns; not all columns may be equally sought after, etc.
If the performance does suck, then that's when you consider other options: tuning your SQL, your database, the machine the database is running on, and finally moving to another technology stack...
If you want to use MySQL or PostgreSQL, an open-source solution that works great with them is Sphinx:
http://www.sphinxsearch.com/
We are having the same problem and are considering Sphinx and Lucene as possible solutions.