Possible alternative to LIKE clause in a specific situation - MySQL

I have a web service which returns results to a jQuery autocomplete.
For the query I must use four full-blown LIKE clauses:
LIKE('%SOME_TERM%')
The reason is that users need to be able to get results from substrings as well as whole words. I have also tried full-text indexes, with their many options, in both natural language and boolean mode, but they just do not work as well and their results leave a lot to be desired in this case.
The database is highly optimized with indexes, and even with the LIKE clauses a query with multiple joins, where one of the tables has 200,000 rows, takes ~0.2-0.3 seconds on the first run. Once it's cached by the server the time taken is obviously minuscule.
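For illustration only, a rough sketch of the kind of query described, with four leading-wildcard LIKE clauses feeding the autocomplete. The table and column names here are invented, not taken from the actual schema; the leading % is what prevents an ordinary index from being used for these predicates.

SELECT p.id, p.name
FROM products p
JOIN brands b ON b.id = p.brand_id
JOIN categories c ON c.id = p.category_id
WHERE p.name LIKE CONCAT('%', ?, '%')          -- ? = the autocomplete term
   OR p.description LIKE CONCAT('%', ?, '%')
   OR b.name LIKE CONCAT('%', ?, '%')
   OR c.name LIKE CONCAT('%', ?, '%')
LIMIT 10;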
I was wondering if there is anything else that would be worth trying here. I had looked at some standalone search providers, but I'm a little tight time-wise on this project (nearly done and ready to launch), so I can't afford any large setups or large-scale refactoring, time- and funding-wise.
It's possible that this is as good as it gets, but there's no harm in letting SO have its say is my attitude.

I think Apache Solr is the way to go for you, for full-text searches.
http://lucene.apache.org/solr/
But like you said, if you don't have much time left and you are sure your queries are performing at their best, I don't see much you can do.

Related

Why is MongoDB's skip() so slow and bad to use, while MySQL's LIMIT is so fast?

Mongo's skip() is fine for small collections, but once you have a few thousand records it starts to become slow.
I would honestly like to know why a decent LIMIT-like function in a NoSQL database is so hard to do.
The only "solution" I keep reading about is using range-based queries. But for a lot of cases that is not a viable solution.
MySQL's LIMIT has similar challenges with large offsets. A few examples from StackOverflow:
How can I speed up a MySQL query with a large offset in the LIMIT clause?
Why does MYSQL higher LIMIT offset slow the query down?
The underlying problem with the efficiency of limit queries with large skip/offset values is that in order for the database query to get to the point where it can start returning records, a large number of index entries have to be skipped. For a database with active writes (inserts/updates/deletes) this can also lead to some interesting paging side effects if you don't isolate the query or cache the results.
Range-based pagination is a more efficient approach that can take advantage of indexes, but as you noted it may not suit all use cases. If you frequently need to skip thousands of records and find this too expensive you may want to adjust your data model, add some caching, or add usage limitations in your user interface.
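A minimal sketch of the difference, using an invented articles table (the column you seek on depends on the sort order you need):

-- Offset pagination: the server still walks past the first 100,000
-- index entries before it can return anything.
SELECT id, title FROM articles ORDER BY id LIMIT 10 OFFSET 100000;

-- Range-based (keyset) pagination: remember the last id seen on the
-- previous page and seek straight to it through the index.
SELECT id, title FROM articles WHERE id > 100010 ORDER BY id LIMIT 10;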
For example, when paging through Google search results for a generic term there may be hundreds of millions of results. If you keep clicking through pages you'll eventually discover there is a UI limit of 1,000 results (which may equate to 50-100 pages depending on how many results are filtered). Another popular UI approach is Continuous Scrolling which avoids some of the common pagination issues (though creates other usability challenges).
I was facing the same issue with skip(). Everything worked fine while my database had fewer documents, but performance started to suffer once my document count went past a million (10 lakh). I had to get rid of skip(), but fetching a specific count of documents was then a challenge for me. The blog below helped me out.
https://scalegrid.io/blog/fast-paging-with-mongodb/

RDBMS for extremely large data sets - what are people using?

I have to perform some serious data mining on very large data sets stored in a MySQL DB. However, queries that require a bit more than a basic SELECT * FROM X WHERE ... tend to become rather inefficient, since they return results on the order of 10e6 rows or more, especially when a JOIN on one or more tables is introduced - think of joining two or more tables containing several tens of millions of rows (after filtering the data), which is something that happens on pretty much every query. More often than not we'd like to run aggregate functions on these (SUM, AVG, COUNT, etc.), but this is impossible because MySQL simply chokes.
I should note that a lot of effort has been put into optimizing the current performance - all tables are indexed properly, queries are tuned, the hardware is top-notch, the storage engine was configured, and so on. However, each query still takes very long - to the point of "let's run it before we go home and hope for the best when we come to work tomorrow." Not good.
This has to be a solvable problem - many large companies perform very data- and computation-intensive mining and handle it well (without writing their own storage engines - Google aside). I'm willing to accept a time penalty to get the job done, but on the order of hours, not days. My question is: what do people use to counter problems like this? I've heard of storage engines geared to this type of problem (Greenplum, etc.), but I wanted to hear how this problem is typically approached. Our current data store is obviously relational and should probably remain so, but any thoughts or suggestions are welcome. Thanks.
I suggest PostgreSQL, which I've been working with quite successfully on tables with ~0.5B rows that required some complex join operations. Oracle should be good for that too, but I don't have much experience with it.
It should be noted that switching RDBMSs isn't a magic solution. If you want to scale to those sizes there's a LOT of hard work to be done in optimizing your queries, optimizing the database structure and indexes, fine-tuning the database configuration, using the right hardware for your usage, replication, and using materialized views (which are extremely powerful when used correctly; see here and here - it's Postgres-specific, but applies to other RDBMSs too)... and at some point, you just have to throw more money at the problem.
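As a hedged sketch of the materialized-view idea in PostgreSQL (the sales table and its columns are made up for the example):

-- Run the heavy aggregation once and store the result.
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, product_id, SUM(amount) AS total_amount, COUNT(*) AS orders
FROM sales
GROUP BY sale_date, product_id;

-- Index the stored result like any other table.
CREATE INDEX ON daily_sales (sale_date, product_id);

-- Refresh on a schedule instead of re-running the query per report.
REFRESH MATERIALIZED VIEW daily_sales;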
We have used MS SQL Server to run analytics on financial data with tens of millions of rows and more, using complex JOINs and aggregation. A few things that we have done, beyond what you have mentioned, are:
We chunk the calculation into a lot of temporary tables instead of using sub-queries, then apply proper keys, indexing and so on to these tables via code. Queries built on sub-queries just fail for us (a sketch of this pattern follows the list below).
In the temporary tables, we often apply a clustered index that makes sense for us. Note that these temporary tables hold filtered results, so building the index on the fly is not expensive compared to using a sub-query in place of the temporary table. Note that I am speaking from our experience and this might not apply to all cases.
As we do a lot of aggregation as well, we did a lot of indexing on the grouping columns.
We do a lot of query planning using SQL Query Analyzer, which shows us the execution plan. Based on the plan, we revise the query and change the indexes.
We provide hints to SQL Server that we think could help the execution, such as the choice of JOIN algorithm (Hash, Merge or Nested Loops).
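A rough T-SQL sketch of the temporary-table pattern described above; the trades/accounts tables, the columns and the hint choice are invented for illustration and may not match the original setup:

-- Materialize the filtered subset instead of repeating it as a sub-query.
SELECT t.account_id, t.trade_date, t.amount
INTO #filtered_trades
FROM trades t
WHERE t.trade_date >= '2011-01-01';

-- Index the intermediate result before the heavy join/aggregation.
CREATE CLUSTERED INDEX ix_filtered_trades ON #filtered_trades (account_id, trade_date);

-- Aggregate from the temp table, with an explicit join-algorithm hint.
SELECT f.account_id, SUM(f.amount) AS total_amount
FROM #filtered_trades f
INNER HASH JOIN accounts a ON a.id = f.account_id
GROUP BY f.account_id;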

Implementing MySQL search results pagination

This ought to be a fairly common problem. I could not find a solution in the Stack Overflow questions database. I am not sure I did the right search.
I run a MySQL, CGI/Perl site. Maybe 1,000 hits a day. A user can search the site's database. The WHERE clause can become quite lengthy. I display 10 items per page and give the user links to go to the next and previous pages. Currently, I do a new search every time the user clicks on the 'prev' or 'next' page link. I use
LIMIT num-rows-to-fetch OFFSET num-rows-to-skip
along with the query statement. However, response time is far too high for subsequent searches. This can only get worse as I add more users. I am trying to see if I can implement this in a better way.
If you can give me some pointers, I would appreciate it very much.
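For reference, a hedged sketch of one commonly suggested variant of this query shape - a "deferred join" that pages over a narrow indexed id list first and fetches the wide rows only for the current page. The table, columns and filter are invented stand-ins:

SELECT i.*
FROM items i
JOIN (
    SELECT id
    FROM items
    WHERE category = 'books' AND price < 20   -- stand-in for the lengthy WHERE clause
    ORDER BY created_at DESC
    LIMIT 10 OFFSET 500                       -- num-rows-to-fetch / num-rows-to-skip
) AS page ON page.id = i.id
ORDER BY i.created_at DESC;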
If you don't mind using JavaScript, you should check out DataTables. This way you send all rows to the client, and pagination is done on the client side.
If that is not an option, then you could try using mod_perl or CGI::Session to save the query result between page requests, so you will not need to query MySQL again and again.
You could try analyzing the query to find out which part causes the most trouble for the database. If there are joins, you could create indexes that include both the key fields and the possible filter fields used in the query. But that could, depending on the database layout and size, create a lot of indexes. And if the query can change significantly, you may still be at a loss if you did not create an index for a particular case. In any case, I would first analyze the query and try to find out what makes it so slow. If there is one major cause of the bad performance, you could try throwing indexes at it.
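For example (a sketch only, with invented names), a composite index covering both the filter column and the sort column lets MySQL resolve the WHERE and the ORDER BY from a single index:

CREATE INDEX idx_items_category_created ON items (category, created_at);

EXPLAIN
SELECT id, title
FROM items
WHERE category = 'books'
ORDER BY created_at DESC
LIMIT 10 OFFSET 20;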
Another way would be to cache the search results in memory and do pagination from there, avoiding the database round trip altogether. Depending on your database size you might want to limit the search results to a reasonable number.
I have never used this module, so I cannot vouch for its performance, but have a look at DBI::ResultPager

Is it better to return one big query or a few smaller ones?

I'm using MySQL to store video game data. I have tables for titles, platforms, tags, badges, reviews, developers, publishers, etc...
When someone is viewing a game, is it best to have one query that returns all the data associated with that game, or is it better to use several queries? Intuitively, since we have reviews, it seems pointless to include them in the same query, since they'll need to be paginated. But there are other situations where I'm unsure whether to break the query down or use two queries...
I'm a bit worried about performance, since I'm now joining the following tables to games: developers, publishers, metatags, badges, titles, genres, subgenres, classifications... To grab game badges (from games_badges; many-to-many to the games table, and many-to-many to the badges table) I can either do another join or run a separate query... and I'm unsure what is best...
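As a rough sketch of the two options for the badges case (games, badges and games_badges come from the question; the column names are assumed):

-- Option 1: one query; the game columns repeat once per badge row.
SELECT g.id, g.title, b.name AS badge
FROM games g
LEFT JOIN games_badges gb ON gb.game_id = g.id
LEFT JOIN badges b ON b.id = gb.badge_id
WHERE g.id = 42;

-- Option 2: two simpler queries, one for the game and one for its badges.
SELECT id, title FROM games WHERE id = 42;

SELECT b.name
FROM games_badges gb
JOIN badges b ON b.id = gb.badge_id
WHERE gb.game_id = 42;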
It is significantly faster to use one query than to use multiple queries, because the startup of a query and the calculation of the query plan are themselves costly, and running multiple queries in a row slows the server down a little more each time. Obviously you should only fetch the data that you actually need, but fewer queries is always better.
So if you are going to show 20 games on a page, you can speed up the query (still using only one query) with a LIMIT clause and only run it again later when the user gets to the next page. That, or you can just make them wait for the full query to complete and have all of the data there at once. One big wait or several little waits.
tl;dr use as few queries as possible.
There is no panacea.
Always try to fetch only the data you need.
There is no general answer to whether one big query or several small queries is better. Each case is unique; to answer this question you should profile your application and examine the queries' EXPLAIN output.
This is generally a processing problem.
If making one query would imply retrieving thousands of entries, call several queries to have MySQL do the processing (sums, etc.).
If making multiple queries involves making tens or hundreds of them, then call a single query.
Obviously you're always facing both of these, since neither is a go-to option if you're asking the question, so the choices really are:
Pick the one you can take the hit on
Cache or mitigate it as much as you can so that you take a hit very rarely
Try to insert preprocessed data in the database to help you process the current data
Do the processing as part of a cron and have the application only retrieve the data
Take a few steps back and explore other possible approaches that don't require the processing

How do I use EXPLAIN to *predict* performance of a MySQL query?

I'm helping maintain a program that's essentially a friendly read-only front-end for a big and complicated MySQL database -- the program builds ad-hoc SELECT queries from users' input, sends the queries to the DB, gets the results, post-processes them, and displays them nicely back to the user.
I'd like to add some form of reasonable/heuristic prediction for the constructed query's expected performance -- sometimes users inadvertently make queries that are inevitably going to take a very long time (because they'll return huge result sets, or because they're "going against the grain" of the way the DB is indexed) and I'd like to be able to display to the user some "somewhat reliable" information/guess about how long the query is going to take. It doesn't have to be perfect, as long as it doesn't get so badly and frequently out of whack with reality as to cause a "cry wolf" effect where users learn to disregard it;-) Based on this info, a user might decide to go get a coffee (if the estimate is 5-10 minutes), go for lunch (if it's 30-60 minutes), kill the query and try something else instead (maybe tighter limits on the info they're requesting), etc, etc.
I'm not very familiar with MySQL's EXPLAIN statement -- I see a lot of information around on how to use it to optimize a query or a DB's schema, indexing, etc, but not much on how to use it for my more limited purpose -- simply make a prediction, taking the DB as a given (of course if the predictions are reliable enough I may eventually switch to using them also to choose between alternate forms a query could take, but, that's for the future: for now, I'd be plenty happy just to show the performance guesstimates to the users for the above-mentioned purposes).
Any pointers...?
EXPLAIN won't give you any indication of how long a query will take.
At best you could use it to guess which of two queries might be faster, but unless one of them is obviously badly written then even that is going to be very hard.
You should also be aware that if you're using sub-queries, even running EXPLAIN can be slow (almost as slow as the query itself in some cases).
As far as I'm aware, MySQL doesn't provide any way to estimate the time a query will take to run. Could you log the time each query takes to run, then build an estimate based on the history of past similar queries?
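A minimal sketch of that logging idea, assuming the generated SQL can be normalized and hashed (the table and column names are invented):

CREATE TABLE query_log (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    query_hash CHAR(40) NOT NULL,             -- SHA1 of the normalized SQL text
    rows_examined BIGINT,                     -- optional: taken from EXPLAIN
    duration_ms INT NOT NULL,
    run_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    KEY (query_hash)
);

-- Estimate a new query from the history of similar past ones.
SELECT AVG(duration_ms) AS avg_ms, MAX(duration_ms) AS worst_ms
FROM query_log
WHERE query_hash = SHA1(@normalized_sql);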
I think if you want to have a chance of building something reasonably reliable out of this, what you should do is build a statistical model out of table sizes and broken-down EXPLAIN result components correlated with query processing times. Trying to build a query execution time predictor based on thinking about the contents of an EXPLAIN is just going to spend way too long giving embarrassingly poor results before it gets refined to vague usefulness.
MySQL's EXPLAIN output has a column called key. If there is something in this column, it is a very good sign: it means the query will use an index.
Queries that use indices are generally safe, since they were likely thought out by the database designer when (s)he designed the database.
However
There is another column called Extra. This column sometimes contains the text Using filesort.
This is very, very bad. It means MySQL cannot produce the sort order from an index and has to sort the result set itself, spilling the data to disk if it does not fit in the sort buffer.
Conclusion
Instead of trying to predict the time a query will take, simply look at these two indicators. If a query is using filesort, deny the user. And depending on how strict you want to be, if the query is not using any keys, you should also deny it.
Read more about the result set of the MySQL EXPLAIN statement
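As a hedged illustration of the check being suggested (the orders table is invented):

EXPLAIN
SELECT customer_id, SUM(total) AS spent
FROM orders
WHERE order_date >= '2010-01-01'
GROUP BY customer_id
ORDER BY spent DESC;

-- In the output: an empty key column means no index is used (full scan),
-- and "Using filesort" in the Extra column means the sort cannot be
-- satisfied by an index.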