MySQL: Optimizing Searches with LIKE or FULLTEXT - mysql

I am building a forum and I am looking for the proper way to build a search feature that finds users by their name or by the title of their posts. What I have come up with is this:
SELECT users.id, users.user_name, users.user_picture
FROM users, subject1, subject2
WHERE users.id = subject1.user_id
AND users.id = subject2.user_id
AND (users.user_name LIKE '%{$keywords}%'
OR subject1.title1 LIKE '%{$keywords}%'
OR subject2.title2 LIKE '%{$keywords}%')
ORDER BY users.user_name ASC
LIMIT 10
OFFSET {$offset}
The LIMIT and OFFSET are for pagination. My question is: would doing a LIKE search across multiple tables greatly slow down performance once the number of rows becomes significant?
I have a few alternatives:
One, perhaps I can rewrite that query to have the LIKE searches done inside a subquery that only returns indexed user_ids. Then, I would find the remaining user information based on that. Would that increase performance by much?
Second, I suppose I can have the $keyword string appear before the first wildcard as in LIKE {$keyword}%. This way, I can index the user_name, title1, and title2 columns. However, since I will be trading accuracy for speed here, how much of a difference in performance would this make? Will it be worth sacrificing this much accuracy to index these columns?
Third, perhaps I can give users 3 search fields to choose from, and have each search through only one table. Would this increase performance by much?
Lastly, should I consider using a FULLTEXT search instead of LIKE? What are the performance differences between the two? Also, my tables are using the InnoDB storage engine, and I am not able to use the FULLTEXT index unless I switch to MyISAM. Will there be any major differences in switching to MyISAM?
Pagination is another performance issue I am worried about, because in order to do pagination, I would need to find the total number of results the query returns. At the moment, I am basically doing the query I just mentioned TWICE because the first time it is used only to COUNT the results.

Two things in your query will prevent MySQL from using indexes. First, your patterns start with the wildcard %, and MySQL can't use an index to search for a pattern that starts with a wildcard. Second, you have OR conditions across different tables in your WHERE clause, which also tends to stop MySQL from using indexes; you would need to rewrite the query with UNION to avoid the OR. Without an index, MySQL has to do a full table scan every time, and the time that takes grows linearly with the number of rows in the table. So yes, as you put it, it "would greatly slow down performance when the number of rows reach a significant amount". I'd say your only really scalable option is FULLTEXT search.
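As a rough sketch of the UNION rewrite: it assumes the patterns can be anchored at the start ('foo%' stands in for the escaped keyword), that the three columns are VARCHARs short enough to index in full, and that single-column indexes exist on them (the index names below are made up). Unlike the original three-way join, this version also returns users who have rows in only one of the subject tables.

```sql
-- Hypothetical indexes; names are illustrative only.
CREATE INDEX idx_users_user_name ON users (user_name);
CREATE INDEX idx_subject1_title1 ON subject1 (title1);
CREATE INDEX idx_subject2_title2 ON subject2 (title2);

-- Each branch filters on one indexed column, so the optimizer can
-- consider an index range scan per branch. UNION (without ALL)
-- removes duplicate users that match in more than one branch.
(SELECT u.id, u.user_name, u.user_picture
   FROM users AS u
  WHERE u.user_name LIKE 'foo%')
UNION
(SELECT u.id, u.user_name, u.user_picture
   FROM subject1 AS s1
   JOIN users AS u ON u.id = s1.user_id
  WHERE s1.title1 LIKE 'foo%')
UNION
(SELECT u.id, u.user_name, u.user_picture
   FROM subject2 AS s2
   JOIN users AS u ON u.id = s2.user_id
  WHERE s2.title2 LIKE 'foo%')
ORDER BY user_name ASC
LIMIT 10 OFFSET 0;
```

The rewrite only pays off together with prefix patterns; with '%foo%' every branch would still have to scan its table.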

Most of your questions are explained here: http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/like-performance-tuning
InnoDB/fulltext indexing is announced for MySQL 5.6, but that will probably not help you right now.

How about starting with EXPLAIN <select-statement>? http://dev.mysql.com/doc/refman/5.6/en/explain.html

Switching to MyISAM should work seamlessly. The only downside is that MyISAM locks the whole table on inserts and updates, which can slow down tables that get many more writes than selects. My rule of thumb is to use MyISAM when you don't need foreign keys and the table has far more selects than inserts, and InnoDB when the table has far more inserts/updates than selects (e.g. a statistics table).
In your case I guess switching to MyISAM is the better choice, as a fulltext index is far more powerful and faster.
It also lets you use query modifiers such as excluding words ("cat -dog"). But keep in mind that you can no longer search for words ending with a given string, as you can with a LIKE search ("*bar"). "foo*" will work, though.
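A minimal sketch of those modifiers, using the title1 column from the question (the index name is made up). On MySQL 5.6 and later the same statements also work on InnoDB tables, so the engine switch is only needed on older versions.

```sql
-- Hypothetical index name; title1 is the column from the question.
ALTER TABLE subject1 ADD FULLTEXT INDEX ft_subject1_title1 (title1);

-- Rows whose title contains "mysql" but not "oracle".
SELECT user_id, title1
  FROM subject1
 WHERE MATCH(title1) AGAINST('+mysql -oracle' IN BOOLEAN MODE);

-- Prefix search: words starting with "optim" (optimize, optimizer, ...).
SELECT user_id, title1
  FROM subject1
 WHERE MATCH(title1) AGAINST('optim*' IN BOOLEAN MODE);
```

Words shorter than the minimum token length (ft_min_word_len for MyISAM, innodb_ft_min_token_size for InnoDB) are not indexed, so very short terms such as "cat" may not match under the default settings.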

Related

Improve Mysql Select Query Performance [duplicate]

I've been using indexes on my MySQL databases for a while now but never properly learnt about them. Generally I put an index on any fields that I will be searching or selecting using a WHERE clause but sometimes it doesn't seem so black and white.
What are the best practices for MySQL indexes?
Example situations/dilemmas:
If a table has six columns and all of them are searchable, should I index all of them or none of them?
What are the negative performance impacts of indexing?
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it?
You should definitely spend some time reading up on indexing, there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option. You scan the table checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is "M". For instance, you can perform a binary search! Whereas the table scan requires you to look at N rows (where N is the number of rows), the binary search only requires that you look at about log2(N) index entries, in the very worst case. Wow, that's sure a lot easier!
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.
Hopefully that covers your first two questions (as others have answered -- you need to find the right balance).
Your third scenario is a little more complicated. If you're using LIKE, indexing engines will typically help with your read speed up to the first "%". In other words, if you're SELECTing WHERE column LIKE 'foo%bar%', the database will use the index to find all the rows where column starts with "foo", and then need to scan that intermediate rowset to find the subset that contains "bar". SELECT ... WHERE column LIKE '%bar%' can't use the index. I hope you can see why.
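To see the difference on your own server, something like the following could be run. The table and index are hypothetical and exist only for illustration.

```sql
-- Hypothetical table and index, for illustration only.
CREATE TABLE articles (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    KEY idx_articles_title (title)
);

-- Literal prefix "foo": the index can narrow the rows to a range,
-- and the rest of the pattern ("%bar%") is checked on those rows.
EXPLAIN SELECT id FROM articles WHERE title LIKE 'foo%bar%';

-- Leading wildcard: there is no literal prefix, so no range can be
-- derived from the index ordering.
EXPLAIN SELECT id FROM articles WHERE title LIKE '%bar%';
```

In the EXPLAIN output, compare the type column: a range access for the first query versus a full scan (ALL or index) for the second is what you would typically expect, though the optimizer's choice depends on the data in the table.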
Finally, you need to start thinking about indexes on more than one column. The concept is the same, and behaves similarly to the LIKE stuff: essentially, if you have an index on (a,b,c), the engine will continue using the index from left to right as best it can. So a search on column a might use the (a,b,c) index, as would one on (a,b). However, the engine could not use that index to narrow the search if you were searching WHERE b=5 AND c=1, and would typically fall back to a full table scan.
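A small sketch of the left-to-right rule, using a hypothetical table t whose column names mirror the (a,b,c) example:

```sql
-- Hypothetical table with a composite index on (a, b, c).
CREATE TABLE t (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    a       INT NOT NULL,
    b       INT NOT NULL,
    c       INT NOT NULL,
    payload VARCHAR(100),
    KEY idx_t_a_b_c (a, b, c)
);

-- Can use the index: the conditions cover a leftmost prefix.
EXPLAIN SELECT * FROM t WHERE a = 1;
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 5;

-- Cannot seek into idx_t_a_b_c in the usual case: column a,
-- the first column of the index, is not constrained.
EXPLAIN SELECT * FROM t WHERE b = 5 AND c = 1;
```

If queries on (b, c) alone are common, a second index starting with b would be the usual answer, at the cost of extra write overhead.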
Hopefully this helps shed a little light, but I must reiterate that you're best off spending a few hours digging around for good articles that explain these things in depth. It's also a good idea to read your particular database server's documentation. The way indices are implemented and used by query planners can vary pretty widely.
Check out presentations like More Mastering the Art of Indexing.
Update 12/2012: I have posted a new presentation of mine: How to Design Indexes, Really. I presented this in October 2012 at ZendCon in Santa Clara, and in December 2012 at Percona Live London.
Designing the best indexes is a process that has to match the queries you run in your app.
It's hard to recommend any general-purpose rules about which columns are best to index, or whether you should index all columns, no columns, which indexes should span multiple columns, etc. It depends on the queries you need to run.
Yes, there is some overhead so you shouldn't create indexes needlessly. But you should create the indexes that give benefit to the queries you need to run quickly. The overhead of an index is usually far outweighed by its benefit.
For a column that is VARCHAR(2500), you probably want to use a FULLTEXT index or a prefix index:
CREATE INDEX i ON SomeTable(longVarchar(100));
Note that a conventional index can't help if you're searching for words that may be in the middle of that long varchar. For that, use a fulltext index.
I won't repeat some of the good advice in other answers, but will add:
Compound Indices
You can create compound indices - an index that includes multiple columns. MySQL can use these from left to right. So if you have:
Table A
Id
Name
Category
Age
Description
if you have a compound index that includes Name/Category/Age in that order, these WHERE clauses would use the index:
WHERE Name='Eric' and Category='A'
WHERE Name='Eric' and Category='A' and Age > 18
but
WHERE Category='A' and Age > 18
would not use that index because everything has to be used from left to right.
Explain
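Use EXPLAIN / EXPLAIN EXTENDED to understand which indices are available to MySQL and which one it actually selects. MySQL will generally use only ONE index per table in a query (index merge is the exception). Note that EXPLAIN EXTENDED was deprecated in MySQL 5.7 and removed in 8.0, where plain EXPLAIN shows the same information.
EXPLAIN SELECT * FROM SomeTable WHERE Something='ABC'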
Slow Query Log
Turn on the slow query log to see which queries are running slow.
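For example, the log can be switched on at runtime roughly like this (the 1-second threshold is an arbitrary example value, and the statements need an account with sufficient privileges):

```sql
-- Enable the slow query log at runtime. The setting is lost on
-- restart unless it is also put in the server configuration.
SET GLOBAL slow_query_log = 'ON';

-- Log statements that take longer than 1 second. The new global
-- value applies to connections opened after this statement.
SET GLOBAL long_query_time = 1;

-- Show where the log file is written.
SHOW VARIABLES LIKE 'slow_query_log_file';
```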
Wide Columns
If you have a wide column where MOST of the distinction happens in the first several characters, you can use only the first N characters in your index. Example: We have a ReferenceNumber column defined as varchar(255) but 97% of the cases, the reference number is 10 characters or less. I changed the index to only look at the first 10 characters and improved performance quite a bit.
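A sketch of how such a prefix index might be checked and created. Only the ReferenceNumber column comes from the description above; the Orders table and the index name are hypothetical.

```sql
-- How much distinction do the first 10 characters retain? A prefix
-- ratio close to the full-column ratio suggests the prefix is long enough.
SELECT COUNT(DISTINCT LEFT(ReferenceNumber, 10)) / COUNT(*) AS prefix_selectivity,
       COUNT(DISTINCT ReferenceNumber) / COUNT(*)           AS full_selectivity
  FROM Orders;

-- Index only the first 10 characters of the column.
CREATE INDEX idx_orders_refnum_10 ON Orders (ReferenceNumber(10));
```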
If a table has six columns and all of them are searchable, should I index all of them or none of them?
Are you searching on a field by field basis or are some searches using multiple fields?
Which fields are most being searched on?
What are the field types? (Index works better on INTs than on VARCHARs for example)
Have you tried using EXPLAIN on the queries that are being run?
What are the negative performance impacts of indexing
UPDATEs and INSERTs will be slower. There are also the extra storage space requirements, but that's usually unimportant these days.
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it
No, unless it's UNIQUE (which means it's already indexed) or you only search for exact matches on that field (not using LIKE or MySQL's fulltext search).
Generally I put an index on any fields that I will be searching or selecting using a WHERE clause
I'd normally index the fields that are queried the most, and prefer INTs/BOOLEANs/ENUMs over fields that are VARCHARs. Don't forget that you often need an index on combined fields rather than on an individual field. Use EXPLAIN, and check the slow log.
Load Data Efficiently: Indexes speed up retrievals but slow down inserts and deletes, as well as updates of values in indexed columns. That is, indexes slow down most operations that involve writing. This occurs because writing a row requires writing not only the data row, it requires changes to any indexes as well. The more indexes a table has, the more changes need to be made, and the greater the average performance degradation. Most tables receive many reads and few writes, but for a table with a high percentage of writes, the cost of index updating might be significant.
Avoid Indexes: If you don’t need a particular index to help queries perform better, don’t create it.
Disk Space: An index takes up disk space, and multiple indexes take up correspondingly more space. This might cause you to reach a table size limit more quickly than if there are no indexes. Avoid indexes wherever possible.
Takeaway: Don't over index
In general, indices help speed up database searches, at the cost of extra disk space and slower INSERT / UPDATE / DELETE queries. Use EXPLAIN and read the results to find out when MySQL uses your indices.
If a table has six columns and all of them are searchable, should I index all of them or none of them?
Indexing all six columns isn't always the best practice.
(a) Are you going to use any of those columns when searching for specific information?
(b) What is the selectivity of those columns (how many distinct values are there stored, in comparison to the total amount of records on the table)?
MySQL uses a cost-based optimizer, which tries to find the "cheapest" path when performing a query. And fields with low selectivity aren't good candidates.
What are the negative performance impacts of indexing?
Already answered: extra disk space, lower performance during insert - update - delete.
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it?
Try the FULLTEXT Index.
1/2) Indexes speed up certain select operations but they slow down other operations like insert, update and deletes. It can be a fine balance.
3) Use a full-text index, or perhaps Sphinx.

Improve performance on MySQL fulltext search query

I have a following MySQL query:
SELECT p.*, MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') AS score
FROM posts p
WHERE p.post_id <> 23
AND MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') > 0
ORDER BY score DESC LIMIT 1
With 108,000 rows, it takes ~200ms. With 265,000 rows, it takes ~500ms.
Under performance testing (~80 concurrent users) it shows ~18 s average latency.
Is there any way to improve the performance of this query?
EXPLAIN output: [image not preserved]
UPDATED
We have added a mirror MyISAM table containing post_id and description, and keep it synchronized with the posts table via triggers. Fulltext search on this new MyISAM table now takes ~400 ms under the same load where InnoDB shows ~18 s, which is a huge performance boost. It looks like MyISAM is much quicker than InnoDB for fulltext search in MySQL. Could you please explain why?
MySQL profiler results:
Tested on AWS RDS db.t2.small instance
Original InnoDB posts table: [image not preserved]
MyISAM mirror table with post_id, description only: [image not preserved]
Here are a few tips on what to look for in order to maximise the speed of such queries with InnoDB:
Avoid redundant sorting. Since InnoDB has already sorted the result according to ranking, the MySQL query processing layer does not need to sort to get the top matching results.
Avoid row-by-row fetching to get the matching count. InnoDB provides all the matching records. All those not in the result list have a ranking of 0 and need not be retrieved. InnoDB also has a count of total matching records on hand, so there is no need to recount.
Covered index scan. InnoDB results always contain the matching records' Document ID and their ranking. So if only the Document ID and ranking are needed, there is no need to go to the user table to fetch the record itself.
Narrow the search result early and reduce user table access. If the user wants the top N matching records, we do not need to fetch all matching records from the user table. We should be able to first select the top N matching Doc IDs, and then fetch only the corresponding records for those Doc IDs.
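Applied to the query from the question, the last tip could be sketched as follows. The search string is shortened for readability, the alias best is made up, and whether this is actually faster depends on the MySQL version and its optimizer.

```sql
-- Step 1 (derived table): pick only the best-matching post_id and its score.
-- Step 2 (outer join): fetch the full row for that single id.
SELECT p.*, best.score
  FROM (
        SELECT post_id,
               MATCH(description) AGAINST('random text') AS score
          FROM posts
         WHERE post_id <> 23
           AND MATCH(description) AGAINST('random text')
         ORDER BY score DESC
         LIMIT 1
       ) AS best
  JOIN posts AS p ON p.post_id = best.post_id;
```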
I don't think you can get much faster by looking only at the query itself; maybe try removing the ORDER BY part to avoid unnecessary sorting. To dig deeper, profile the query using MySQL's built-in profiler.
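A quick profiling session might look like this. It uses the session profiler, which is deprecated in newer MySQL versions in favour of the Performance Schema but is still a fast way to see stage timings; the search string is shortened.

```sql
-- Turn on profiling for the current session.
SET profiling = 1;

SELECT p.post_id, MATCH(p.description) AGAINST('random text') AS score
  FROM posts p
 WHERE p.post_id <> 23
   AND MATCH(p.description) AGAINST('random text') > 0
 ORDER BY score DESC
 LIMIT 1;

-- List the profiled statements with their Query_ID ...
SHOW PROFILES;

-- ... then break one down by stage. Replace 1 with the Query_ID
-- that SHOW PROFILES reported for the statement above.
SHOW PROFILE FOR QUERY 1;
```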
Other than that, you might look into the configuration of your MySQL server. Have a look at this chapter of the MySQL manual; it contains some good information on how to tune the fulltext index to your needs.
If you've already maximized the capabilities of your MySQL server configuration, then consider looking at the hardware itself. Sometimes even a low-cost solution such as moving the tables to another, faster hard drive can work wonders.
My best guess for the performance hit is the number of rows being returned by the query. To test this, simply remove the order by score and see if that improves the performance.
If it does not, then the issue is the full text index. If it does, then the issue is the order by. If so, the problem becomes a bit more difficult. Some ideas:
Determine a hardware solution to speed up the sorts (getting the intermediate files to be in memory).
Modifying the query so it returns fewer values. This might involve changing the stop-word list, changing the query to boolean mode, or other ideas.
Finding another way of pre-filtering the results.
The issue here is WHERE p.post_id <> 23
Design your system in such a way so that non-indexed columns — like post_id — need not be added to the WHERE clause.
Basically MySQL will search for the full-text indexed column and then filter the post_id. Hence, if there are a lot of matches returned by the full text search, the response time will not be as expected.

MySQL search FTS vs Multiple Queries

Working on a project where schema is something like this:
id , key, value
The key and value columns are varchar, and the table is InnoDB.
A user can search on the basis of key/value pairs. What's the best way to query this in MySQL? The options I can think of are:
For each key => value, form a query and perform an inner join to get the ids matching all criteria.
Or, in the background, populate a MyISAM table (id, info) with a full-text index on info, and run a single query using LIKE '%key:value%key2:value2%'. The benefit of this is that later on, if the website becomes popular and the table has a hundred thousand rows, I can easily port the code to Lucene, but for now MySQL.
The pattern you're talking about is called relational division.
Option #1 (the self-join) is a much faster solution if you have the right indexes.
I compared the performance for a couple of solutions to relational division in my presentation
SQL Query Patterns, Optimized. The self-join solution worked in 0.005 seconds even against a table with millions of rows.
Option #2 with fulltext isn't correct anyway as you've written it, because you wouldn't use LIKE with fulltext search. You'd use MATCH(info) AGAINST('...' IN BOOLEAN MODE). I'm not sure you can use patterns in key:value format anyway. MySQL FTS prefers to match words.
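For reference, the self-join of option #1 might look like the sketch below. The table name attributes, the index name, and the sample keys and values are hypothetical; the columns follow the schema in the question and are assumed to be short enough to index in full.

```sql
-- KEY is a reserved word in MySQL, so the column needs backticks.
CREATE INDEX idx_attributes_key_value_id ON attributes (`key`, `value`, id);

-- Find ids that have BOTH color=red AND size=large:
-- one additional self-join per additional key/value pair.
SELECT a1.id
  FROM attributes AS a1
  JOIN attributes AS a2 ON a2.id = a1.id
 WHERE a1.`key` = 'color' AND a1.`value` = 'red'
   AND a2.`key` = 'size'  AND a2.`value` = 'large';
```

With the composite index, each side of the join can be resolved from the index alone, which is what keeps this pattern fast for AND-only searches.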
@Bill Karwin
If you're going to do this for 1 condition, it will be super fast with this EAV-like schema, but if you do it for many (esp. with mixed ANDs and ORs) it will probably fall apart. The best you can hope for is some sort of super fast index merge, and that's elusive. You're going to get a temporary table in most DBMSes if you do anything fancy. I think I remember reading you're no fan of EAV, though, and maybe I'm misunderstanding you.
As I recall, a DBMS is also free to do multiple scans and then handle this with a disposable bitmap index. But fulltext indexes keep the document lists sorted and do a low-cost merge across all criteria with a FTS planner that starts strategically with the rarer keywords. That's all they do to execute "word1 & word2" all day. They're optimized for this sort of thing.
So if you have lots of simple facts, a FTS index is one decent way to do it I think. Am I missing something? You just need to change the facts to something indexable like COLORID_3, then search for "COLORID_3 & SOMETHINGELSEID_5."
If the queries involve no merging or sorting, I suspect it will be pretty much a wash. Nothing here but us BTREEs ...

Is it OK to index all the fields in this mysql query?

I have this mysql query and I am not sure what are the implications of indexing all the fields in the query . I mean is it OK to index all the fields in the CASE statement, Join Statement and Where Statement? Are there any performance implications of indexing fields?
SELECT roots.id as root_id, root_words.*,
CASE
WHEN root_words.title LIKE '%text%' THEN 1
WHEN root_words.unsigned_title LIKE '%normalised_text%' THEN 2
WHEN unsigned_source LIKE '%normalised_text%' THEN 3
WHEN roots.root LIKE '%text%' THEN 4
END as priorities
FROM roots INNER JOIN root_words ON roots.id=root_words.root_id
WHERE (root_words.unsigned_title LIKE '%normalised_text%') OR (root_words.title LIKE '%text%')
OR (unsigned_source LIKE '%normalised_text%') OR (roots.root LIKE '%text%') ORDER BY priorities
Also, How can I further improve the speed of the query above?
Thanks!
You index columns in tables, not queries.
None of the search criteria you've specified will be able to make use of indexes (since the search terms begin with a wild card).
You should make sure that the id column is indexed, to speed the JOIN. (Presumably, it's already indexed as a PRIMARY KEY in one table and a FOREIGN KEY in the other).
To speed up this query you will need to use full text search. Adding indexes will not speed up this particular query and will cost you time on INSERTs, UPDATEs, and DELETEs.
Caveat: Indexes speed up retrieval time but cause inserts and updates to run slower.
To answer the implications of indexing every field, there is a performance hit when using indexes whenever the data that is indexed is modified, either through inserts, updates, or deletes. This is because SQL needs to maintain the index. It's a balance between how often the data is read versus how often it is modified.
In this specific query, the only index that could possibly help would be in your JOIN clause, on the fields roots.id and root_words.root_id.
None of the checks in your WHERE clause could be indexed, because of the leading '%'. This causes SQL to scan every row in these tables for a matching value.
If you are able to remove the leading '%', you would then benefit from indexes on these fields... if not, you should look into implementing full-text search; but be warned, this isn't trivial.
Indexing won't help when used in conjunction with LIKE '%something%'.
It's like looking for words in a dictionary that have ae in them somewhere. The dictionary (or Index in this case) is organised based on the first letter of the word, then the second letter, etc. It has no mechanism to put all the words with ae in them close together. You still end up reading the whole dictionary from beginning to end.
Indexing the fields used in the CASE clause will likely not help you. Indexing helps by making it easy to find records in a table. The CASE clause is about processing the records you have found, not finding them in the first place.
Optimisers can also struggle with optimising multiple unrelated OR conditions such as yours. The optimiser is trying to narrow down the amount of effort to complete your query, but that's hard to do when unrelated conditions could make a record acceptable.
All in all, your query would benefit from indexes on root_words(root_id) and/or roots(id), but not much else.
If you were to index additional fields though, the two main costs are:
- Increased write time (insert, update or delete) due to additional indexes to write to
- Increased space taken up on the disk

Does this field need an index?

I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
Then you cannot put an index on count, you need to put an index on the group by field instead.
If you have a field named count, then putting an index on it may speed up this query, or it may make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the WHERE clause. If it's more than roughly 30% (a rule of thumb, not a fixed limit; the optimizer decides from its cost estimates), MySQL, like most SQL databases, will usually not use the index.
It will just do a full table scan instead (i.e. read all rows).
This is because using an index means reading the index first and then looking up the actual rows in the table data (with MyISAM these are literally two separate files).
If you select a large percentage of rows, the extra index reads are not worth it, and just reading all the rows in order will be faster.
If only a few rows pass the test, using an index will speed up this query a lot.
Know your data
Using explain select will tell you what indexes MySQL has available and which one it picked and (kind of/sort of in a complicated kind of way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
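For the example above, that check could look like this (table1 and count come from the example; the index name is made up):

```sql
-- COUNT is also a function name, so the column is quoted with backticks.
CREATE INDEX idx_table1_count ON table1 (`count`);

-- Check the "key" and "rows" columns of the output to see whether the
-- optimizer picked idx_table1_count and how many rows it expects to read.
EXPLAIN SELECT id, `count`
  FROM table1
 WHERE `count` > 10;
```

If the table is InnoDB and id is its primary key, the secondary index already contains id, so this particular query can be answered from the index alone and the optimizer may prefer it even when many rows match.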
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the tables are MyISAM and you have a lot of people posting in a short amount of time you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to setup a simulation that resembles production and run before and after.
I think it's unlikely that write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough - if more than for example 10% of your users have count > 10 the fastest query plan might be to not use the index and just scan the entire table.