Improve performance on MySQL fulltext search query - mysql

I have the following MySQL query:
SELECT p.*, MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') AS score
FROM posts p
WHERE p.post_id <> 23
AND MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') > 0
ORDER BY score DESC LIMIT 1
With 108,000 rows, it takes ~200ms. With 265,000 rows, it takes ~500ms.
Under performance testing (~80 concurrent users) it shows ~18 sec average latency.
Is there any way to improve the performance of this query?
EXPLAIN OUTPUT:
UPDATED
We have added a new mirror MyISAM table with only post_id and description, and synchronized it with the posts table via triggers. Now the fulltext search on this new MyISAM table takes ~400 ms (under the same performance load where InnoDB shows ~18 sec, which is a huge performance boost). It looks like MyISAM is much quicker for fulltext search in MySQL than InnoDB. Could you please explain why?
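For reference, the mirror-table setup described above might look roughly like this (a sketch; the table and trigger names are made up, not the actual schema):
CREATE TABLE posts_search (
    post_id INT NOT NULL PRIMARY KEY,
    description TEXT,
    FULLTEXT KEY ft_description (description)
) ENGINE=MyISAM;

DELIMITER //
-- keep the MyISAM copy in sync with the InnoDB posts table
CREATE TRIGGER posts_search_ins AFTER INSERT ON posts FOR EACH ROW
    INSERT INTO posts_search (post_id, description) VALUES (NEW.post_id, NEW.description);
//
CREATE TRIGGER posts_search_upd AFTER UPDATE ON posts FOR EACH ROW
    UPDATE posts_search SET description = NEW.description WHERE post_id = NEW.post_id;
//
CREATE TRIGGER posts_search_del AFTER DELETE ON posts FOR EACH ROW
    DELETE FROM posts_search WHERE post_id = OLD.post_id;
//
DELIMITER ;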
MySQL profiler results:
Tested on AWS RDS db.t2.small instance
Original InnoDB posts table:
MyISAM mirror table with post_id, description only:

Here are a few tips on what to look for in order to maximise the speed of such queries with InnoDB:
Avoid redundant sorting. Since InnoDB already sorts the result according to ranking, the MySQL query processing layer does not need to sort again to get the top matching results.
Avoid row-by-row fetching to get the matching count. InnoDB provides all the matching records; all those not in the result list should have a ranking of 0 and do not need to be retrieved. InnoDB also keeps a count of the total matching records on hand, so there is no need to recount.
Covered index scan. InnoDB results always contain the matching records' document IDs and their ranking, so if only the document ID and ranking are needed, there is no need to go to the user table to fetch the record itself.
Narrow the search result early to reduce user table access. If the user wants the top N matching records, we do not need to fetch all matching records from the user table. We should be able to first select the top N matching document IDs, and then fetch only the corresponding records for those IDs.
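For example, the original query could be rewritten to apply the last point: narrow down to the top-scoring post_id first, and only then fetch the full row (a sketch based on the query in the question):
SELECT p.*, t.score
FROM (
    -- fetch only the winning post_id and its ranking from the fulltext index
    SELECT post_id,
           MATCH (description) AGAINST ('random text that you can use in sample web pages or typography samples') AS score
    FROM posts
    WHERE MATCH (description) AGAINST ('random text that you can use in sample web pages or typography samples') > 0
      AND post_id <> 23
    ORDER BY score DESC
    LIMIT 1
) AS t
JOIN posts p ON p.post_id = t.post_id;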
I don't think you can get much faster looking only at the query itself; maybe try removing the ORDER BY part to avoid unnecessary sorting. To dig deeper into this, try profiling the query using MySQL's built-in profiler.
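If you want to try the profiler route, it can be enabled per session like this (a sketch; SHOW PROFILE is deprecated in newer MySQL versions in favour of the Performance Schema, but it still works):
SET profiling = 1;
-- ... run the slow query from the question here ...
SHOW PROFILES;                 -- lists the queries run in this session with their ids
SHOW PROFILE FOR QUERY 1;      -- per-stage timings for query id 1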
Other than that, you might look into the configuration of your MySQL server. Have a look at this chapter of the MySQL manual; it contains some good information on how to tune the fulltext index to your needs.
If you've already maximized the capabilities of your MySQL server configuration, then consider looking at the hardware itself - sometimes even a low-cost solution like moving the tables to another, faster hard drive can work wonders.

My best guess for the performance hit is the number of rows being returned by the query. To test this, simply remove the order by score and see if that improves the performance.
If it does not, then the issue is the full text index. If it does, then the issue is the order by. If so, the problem becomes a bit more difficult. Some ideas:
Determine a hardware solution to speed up the sorts (getting the intermediate files to be in memory).
Modifying the query so it returns fewer values. This might involve changing the stop-word list, changing the query to boolean mode (see the sketch after this list), or other ideas.
Finding another way of pre-filtering the results.
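Here is a sketch of the boolean-mode idea: requiring every search word with the + operator usually shrinks the match set, while the relevance value can still be used for ordering (the word list is shortened here just for illustration):
SELECT p.*, MATCH (p.description) AGAINST ('+random +text +sample' IN BOOLEAN MODE) AS score
FROM posts p
WHERE p.post_id <> 23
  AND MATCH (p.description) AGAINST ('+random +text +sample' IN BOOLEAN MODE)
ORDER BY score DESC
LIMIT 1;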

The issue here is WHERE p.post_id <> 23.
Design your system in such a way that non-indexed columns, like post_id here, do not need to be added to the WHERE clause.
Basically, MySQL will run the search on the fulltext-indexed column first and then filter on post_id. Hence, if the fulltext search returns a lot of matches, the response time will not be as expected.

Related

Optimised way to store large key value kind of data

I am working on a database that has a table user with columns user_id and user_service_id. My application needs to fetch all the users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this:
select user_id from user where user_service_id = 2;
Since the cardinality of the user_service_id column is very low (around 3-4 distinct values) and the table has around 10M entries, the query will end up scanning almost the entire table.
I was wondering what the recommendation is for such use cases. Also, would it make more sense to move the data to another NoSQL datastore, as this doesn't seem to be an efficient use case for MySQL or any SQL datastore? I tried to search for this but couldn't find any recommendations. Can someone please help or provide the necessary references?
Thanks in advance.
That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
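Assuming the table really is named user as in the question, adding it would look like this (the index name is made up):
ALTER TABLE user ADD INDEX idx_service_user (user_service_id, user_id);
Because the index contains both columns, the query can be answered from the index alone, without touching the table rows.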
But what will you do with the millions of rows that you get? Sounds like it will choke the client, whether it comes fast or slow.
See my Index Cookbook
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)

Will record order change between two identical queries in MySQL without ORDER BY

The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Of course, assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
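A sketch of paginating by the primary key, assuming a table my_table with an integer primary key id (keep the last id of each page and start the next page from it):
-- first page
SELECT * FROM my_table ORDER BY id LIMIT 100;

-- next page: continue after the last id returned by the previous page
SELECT * FROM my_table WHERE id > :last_seen_id ORDER BY id LIMIT 100;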
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal, it's hard to code, and it requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was shown on two different pages because the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both orderings complied with the requested condition.
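One common fix, sketched here with made-up names (items table, type sort column, id primary key), is to add a unique column as the final sort key so the ordering is deterministic across pages:
-- the primary key breaks ties between rows with the same type,
-- so a row can no longer drift between page 1 and page 2
SELECT * FROM items ORDER BY type, id LIMIT 10;
SELECT * FROM items ORDER BY type, id LIMIT 10, 10;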

Does a SQL query (SELECT) continue or stop reading data from the table when it finds the value?

Greetings,
My question: does a SQL query (SELECT) continue or stop reading data (records) from the table when it finds the value that I was looking for?
Reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
Link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: in general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a LIMIT 1 clause to make sure that your result set has only one row. But then again, if you have an ORDER BY clause, the database engine cannot stop after the first hit; there might be later ones that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
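For the example query above, a composite index on the searched columns would let the engine avoid the full table scan (a sketch; the index name is made up):
ALTER TABLE mytable ADD INDEX idx_a_b (a, b);

-- now the engine can locate the matching rows via the index
select a,b,c from mytable where a=3 and b=5 limit 1;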
How the RDBMS reads data from disk is something you cannot know, should not care about, and must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns while running the same query several times a couple of minutes apart.
Yes, it scans the entire table, unless you use something like
select * from user where id=100 limit 1
This of course will still scan all rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and the search will be optimized.
I'm sorry... I thought the table.
I will change the question and explain it in the following image:
I understand that in CASE 1 all columns must be read with each iteration.
My question is: is it the same in CASE 2, or are columns that are not selected in the query excluded from reading in each iteration?
Also, are both queries the same from a performance perspective?
To clarify:
CASE 1: the first SELECT prints all data.
CASE 2: the second SELECT prints only the columns first_name and last_name.
In CASE 2, does the MySQL server read only the columns first_name and last_name, or does it read the entire row to get that data (first_name, last_name)?
I am interested in how the server reads a table row in CASE 1 and CASE 2.

What does OPTIMIZE TABLE do, or how do I properly optimize PK order on the disk

I was under the impression OPTIMIZE TABLE fixes fragmentation. So, if before I would do
select * from t -- (no order by, no nothing)
I would get the order of the records on the disk.
While after doing the optimize and running this query again, the result would be ordered by the PK.
I just tried it on a table of mine, and nothing changed, I still get arbitrary order of records.
I am storing all my tables in one file. I am using InnoDB. MySQL 5.5
Am I missing something, should I have defined the PK somehow else?
Without an order by statement you are never guaranteed order.
Your assumption of
if before I would do select * from t (no order by, no nothing) I would
get the order of the records on the disk
is wrong.
How the Database decides to retrieve records and display them on the screen (or whatever you're viewing them through) is totally up to the internal implementation of the database. In the past this might have been disk order but the only way to know is to check if the Database (in your case MYSQL) mentions anything about it in their documentation.
I doubt they would, though, because then people would come to depend on this ordering and the database developers couldn't improve their record retrieval algorithms without breaking existing applications.
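If you want primary-key order, the only reliable way is to ask for it explicitly (a sketch; id stands in for whatever your primary key column actually is):
select * from t order by id
With InnoDB the data is clustered on the primary key anyway, so this ORDER BY is usually very cheap.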
Edit:
As for optimizing the table, try using an index that reflects the query results you're looking for.
Edit 2:
Another thought is that the situation you just described is a classic caching issue. Because the database already has the result set stored away somewhere in the original odd ordering, your optimization won't show a reordering until the cached data set is no longer cached. How you flush caches is a bit beyond my knowledge.

MySQL: Optimizing Searches with LIKE or FULLTEXT

I am building a forum and I am looking for the proper way to build a search feature that finds users by their name or by the title of their posts. What I have come up with is this:
SELECT users.id, users.user_name, users.user_picture
FROM users, subject1, subject2
WHERE users.id = subject1.user_id
AND users.id = subject2.user_id
AND (users.user_name LIKE '%{$keywords}%'
OR subject1.title1 LIKE '%{$keywords}%'
OR subject2.title2 LIKE '%{$keywords}%')
ORDER BY users.user_name ASC
LIMIT 10
OFFSET {$offset}
The LIMIT and the OFFSET is for pagination. My question is, would doing a LIKE search through multiple tables greatly slow down performance when the number of rows reach a significant amount?
I have a few alternatives:
One, perhaps I can rewrite that query to have the LIKE searches done inside a subquery that only returns indexed user_ids. Then, I would find the remaining user information based on that. Would that increase performance by much?
Second, I suppose I can have the $keywords string appear before the first wildcard, as in LIKE '{$keywords}%'. This way, I can index the user_name, title1, and title2 columns. However, since I will be trading accuracy for speed here, how much of a difference in performance would this make? Will it be worth sacrificing this much accuracy to index these columns?
Third, perhaps I can give users 3 search fields to choose from, and have each search through only one table. Would this increase performance by much?
Lastly, should I consider using a FULLTEXT search instead of LIKE? What are the performance differences between the two? Also, my tables are using the InnoDB storage engine, and I am not able to use the FULLTEXT index unless I switch to MyISAM. Will there be any major differences in switching to MyISAM?
Pagination is another performance issue I am worried about, because in order to do pagination, I would need to find the total number of results the query returns. At the moment, I am basically doing the query I just mentioned TWICE because the first time it is used only to COUNT the results.
There are two things in your query that will prevent MySQL from using indexes. Firstly, your patterns start with a wildcard %, and MySQL can't use indexes to search for patterns that start with a wildcard. Secondly, you have OR in your WHERE clause; you need to rewrite your query using UNION to avoid the OR, which also prevents MySQL from using indexes. Without an index, MySQL needs to do a full table scan every time, and the time needed for that will increase linearly as the number of rows in your table grows. So yes, as you put it, it "would greatly slow down performance when the number of rows reach a significant amount", and I'd say your only real scalable option is to use FULLTEXT search.
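Combining both suggestions, a sketch of the query rewritten with UNION and FULLTEXT (this assumes FULLTEXT indexes exist on users.user_name, subject1.title1 and subject2.title2, which on your MySQL version means those tables would have to be MyISAM):
SELECT u.id, u.user_name, u.user_picture
FROM users u
WHERE MATCH (u.user_name) AGAINST ('{$keywords}')
UNION
SELECT u.id, u.user_name, u.user_picture
FROM users u JOIN subject1 s1 ON u.id = s1.user_id
WHERE MATCH (s1.title1) AGAINST ('{$keywords}')
UNION
SELECT u.id, u.user_name, u.user_picture
FROM users u JOIN subject2 s2 ON u.id = s2.user_id
WHERE MATCH (s2.title2) AGAINST ('{$keywords}')
ORDER BY user_name ASC
LIMIT 10 OFFSET {$offset}
The UNION also removes duplicates, so a user matched both by name and by a post title appears only once in the result.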
Most of your questions are explained here: http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/like-performance-tuning
InnoDB/fulltext indexing is announced for MySQL 5.6, but that will probably not help you right now.
How about starting with EXPLAIN <select-statement>? http://dev.mysql.com/doc/refman/5.6/en/explain.html
Switching to MyISAM should work seamlessly. The only downside is that MyISAM locks the whole table on inserts/updates, which can slow down tables with many more inserts than selects. Basically, a rule of thumb in my opinion is to use MyISAM when you don't need foreign keys and the table has far more selects than inserts, and use InnoDB when the table has far more inserts/updates than selects (e.g. for a statistics table).
In your case I guess switching to MyISAM is the better choice as a fulltext index is way more powerful and faster.
It also gives you the possibility to use certain query modifiers, like excluding words ("cat -dog") or similar. But keep in mind that it's no longer possible to look for words ending with a given string, as with a LIKE search ("*bar"); "foo*" will work though.
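For instance, assuming a FULLTEXT index on subject1.title1, the modifiers mentioned above would be used like this (a sketch):
-- titles that contain "cat" but not "dog", plus any word starting with "foo"
SELECT user_id, title1
FROM subject1
WHERE MATCH (title1) AGAINST ('+cat -dog foo*' IN BOOLEAN MODE);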