Big-O of MySQL Fuzzy Search

What is the Big-O of MySQL Fuzzy Search? Does it vary by index type, if so, what performs the best?
e.g. SELECT * FROM foo WHERE field1 LIKE '%ello Wo%';
I'm unsure of the underlying data structure and what kind of magic it possesses. Something like a trie (https://en.wikipedia.org/wiki/Trie) would be nice for searches that are fuzzy only at the end, e.g. LIKE 'Hello Wo%'.
I'm guessing the Big-O is O(n) but wish to confirm. There may even be differences between fuzzy searches too, e.g. %ello Wo% vs. Hello W% vs. %lo World vs. %ell%o%W%or%
Are there different ways to index that give better performance? If yes, for particular cases, can you please share?

With a leading wildcard
MySQL will
1. Scan all the rows in the table (not an index). This is called a "table scan". (This assumes no other filtering is going on.)
2. For each row, scan the column in question for the LIKE pattern.
3. Deliver the rows not filtered out.
Most of the time is spent in Step 1, which is O(N) where N is the number of rows. Much less time is spent in Steps 2 and 3.
Without a leading wildcard
MySQL will
1. Use an index on that column, if you have one, to limit the number of rows to search. If you have an index on the column and are saying WHERE col LIKE 'Hello W%', it will find all the rows in the index starting with Hello W. They will be consecutive in the index, making this step faster.
2. For each of those, reach into the data for the row and do whatever else is required.
There are a number of variables (caching, number of rows, randomness of rows, etc.) that determine whether #1 is more or less costly than #2. But this is likely to be much faster than the leading-wildcard case: O(n), where n is the number of rows starting with 'Hello W', plus an O(log N) descent of the index to find the first of them.
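The two cases above can be sketched in Python. This is only a model, not how MySQL is implemented: a sorted list stands in for the B-tree index, and the sample rows and function names are made up for illustration.

```python
import bisect

# Hypothetical column values; the row id is the position in this list.
rows = ["Hello World", "Jello Wobble", "Hello Wombat", "Goodbye", "Hello There"]

# A sorted list of (value, row_id) pairs stands in for an index on the column.
index = sorted((value, row_id) for row_id, value in enumerate(rows))

def like_prefix(prefix):
    """LIKE 'prefix%': binary-search to the first match, then read consecutive entries."""
    start = bisect.bisect_left(index, (prefix,))
    matches = []
    for value, row_id in index[start:]:
        if not value.startswith(prefix):
            break  # entries are sorted, so no later entry can match
        matches.append(row_id)
    return matches

def like_contains(fragment):
    """LIKE '%fragment%': sort order is no help, so every row is examined."""
    return [row_id for row_id, value in enumerate(rows) if fragment in value]

print(like_prefix("Hello W"))
print(like_contains("ello Wo"))
```

The prefix search stops as soon as it leaves the run of matching entries, while the substring search has to touch all rows no matter how few match.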

Related

Improve Mysql Select Query Performance [duplicate]

I've been using indexes on my MySQL databases for a while now but never properly learnt about them. Generally I put an index on any fields that I will be searching or selecting using a WHERE clause but sometimes it doesn't seem so black and white.
What are the best practices for MySQL indexes?
Example situations/dilemmas:
If a table has six columns and all of them are searchable, should I index all of them or none of them?
What are the negative performance impacts of indexing?
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it?

You should definitely spend some time reading up on indexing, there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option. You scan the table checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is "M". For instance, you can perform a binary search! Whereas the table scan requires you to look at N rows (where N is the number of rows), the binary search only requires that you look at about log2(N) index entries, in the very worst case. Wow, that's sure a lot easier!
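As a rough illustration of that spreadsheet analogy, here is a small Python sketch. The table contents and the function names are invented for the example, and a real database index is a B-tree rather than a flat sorted list.

```python
# Column-3 values of a hypothetical 1,000-row table; row number = list position.
table = [chr(ord("A") + (i * 7) % 26) for i in range(1000)]

# The index: distinct values in alphabetical order, each with its row numbers.
postings = {}
for row_number, value in enumerate(table):
    postings.setdefault(value, []).append(row_number)
index = sorted(postings.items())

def table_scan(target):
    """Check every row; returns (matching rows, rows examined)."""
    hits = [n for n, value in enumerate(table) if value == target]
    return hits, len(table)

def index_lookup(target):
    """Binary search over index entries; returns (matching rows, entries examined)."""
    lo, hi, probes = 0, len(index) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        value, row_numbers = index[mid]
        if value == target:
            return row_numbers, probes
        if value < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return [], probes

scan_hits, rows_examined = table_scan("M")
index_hits, entries_examined = index_lookup("M")
print(rows_examined, entries_examined)
```

Both functions return the same row numbers; the difference is only in how many entries each has to look at to find them.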
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.
Hopefully that covers your first two questions (as others have answered -- you need to find the right balance).
Your third scenario is a little more complicated. If you're using LIKE, indexing engines will typically help with your read speed up to the first "%". In other words, if you're SELECTing WHERE column LIKE 'foo%bar%', the database will use the index to find all the rows where column starts with "foo", and then need to scan that intermediate rowset to find the subset that contains "bar". SELECT ... WHERE column LIKE '%bar%' can't use the index. I hope you can see why.
Finally, you need to start thinking about indexes on more than one column. The concept is the same, and behaves similarly to the LIKE stuff -- essentially, if you have an index on (a,b,c), the engine will continue using the index from left to right as best it can. So a search on column a might use the (a,b,c) index, as would one on (a,b). However, the engine would need to do a full table scan if you were searching WHERE b=5 AND c=1.
Hopefully this helps shed a little light, but I must reiterate that you're best off spending a few hours digging around for good articles that explain these things in depth. It's also a good idea to read your particular database server's documentation. The way indices are implemented and used by query planners can vary pretty widely.

Check out presentations like More Mastering the Art of Indexing.
Update 12/2012: I have posted a new presentation of mine: How to Design Indexes, Really. I presented this in October 2012 at ZendCon in Santa Clara, and in December 2012 at Percona Live London.
Designing the best indexes is a process that has to match the queries you run in your app.
It's hard to recommend any general-purpose rules about which columns are best to index, or whether you should index all columns, no columns, which indexes should span multiple columns, etc. It depends on the queries you need to run.
Yes, there is some overhead so you shouldn't create indexes needlessly. But you should create the indexes that give benefit to the queries you need to run quickly. The overhead of an index is usually far outweighed by its benefit.
For a column that is VARCHAR(2500), you probably want to use a FULLTEXT index or a prefix index:
CREATE INDEX i ON SomeTable(longVarchar(100));
Note that a conventional index can't help if you're searching for words that may be in the middle of that long varchar. For that, use a fulltext index.

I won't repeat some of the good advice in other answers, but will add:
Compound Indices
You can create compound indices - an index that includes multiple columns. MySQL can use these from left to right. So if you have:
Table A
  Id
  Name
  Category
  Age
  Description
and a compound index that includes Name/Category/Age in that order, these WHERE clauses would use the index:
WHERE Name='Eric' and Category='A'
WHERE Name='Eric' and Category='A' and Age > 18
but
WHERE Category='A' and Age > 18
would not use that index, because the leading column (Name) is missing and the index has to be used from left to right.
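The left-to-right rule can be written down as a few lines of Python. This is a simplified sketch of the rule as described here, not MySQL's optimizer: the function name is hypothetical, and it ignores things a real server may do, such as scanning the whole index or newer skip-scan style optimizations.

```python
def usable_prefix(index_columns, equality_columns, range_columns=()):
    """Return the leading index columns a WHERE clause can seek on.

    Simplified rule: walk the index left to right, keep columns compared
    with '=', allow one final range column, and stop at the first gap.
    """
    used = []
    for column in index_columns:
        if column in equality_columns:
            used.append(column)
        elif column in range_columns:
            used.append(column)
            break  # nothing after a range column can narrow the seek
        else:
            break  # column missing from WHERE: the rest cannot be used
    return used

index_columns = ("Name", "Category", "Age")

# WHERE Name='Eric' and Category='A'
print(usable_prefix(index_columns, {"Name", "Category"}))
# WHERE Name='Eric' and Category='A' and Age > 18
print(usable_prefix(index_columns, {"Name", "Category"}, {"Age"}))
# WHERE Category='A' and Age > 18
print(usable_prefix(index_columns, {"Category"}, {"Age"}))
```

An empty result for the last clause corresponds to "would not use that index": with Name missing, there is no leading column to seek on.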
Explain
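Use EXPLAIN to understand what indices are available to MySQL and which one it actually selects (EXPLAIN EXTENDED on older servers; it was deprecated in MySQL 5.7 and removed in 8.0). MySQL will usually use only ONE index per table in a query; the index merge optimization is the main exception.
EXPLAIN SELECT * FROM SomeTable WHERE Something='ABC';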
Slow Query Log
Turn on the slow query log to see which queries are running slow.
Wide Columns
If you have a wide column where MOST of the distinction happens in the first several characters, you can use only the first N characters in your index. Example: We have a ReferenceNumber column defined as varchar(255) but 97% of the cases, the reference number is 10 characters or less. I changed the index to only look at the first 10 characters and improved performance quite a bit.
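One way to pick N is to measure how many distinct values survive when only the first N characters are kept. A minimal Python sketch, assuming made-up reference numbers and a hypothetical helper name:

```python
def prefix_selectivity(values, length):
    """Share of distinct full values that the first `length` characters still tell apart."""
    distinct_full = len(set(values))
    distinct_prefix = len({value[:length] for value in values})
    return distinct_prefix / distinct_full

# Hypothetical reference numbers: 10 significant characters plus a long shared tail.
reference_numbers = [f"REF{i:07d}-archived-copy" for i in range(100)]

for length in (3, 6, 10):
    print(length, prefix_selectivity(reference_numbers, length))
```

A result close to 1.0 means the prefix distinguishes almost as well as the full column, so a shorter index loses little. The same ratio can be computed in MySQL itself by comparing COUNT(DISTINCT LEFT(col, N)) with COUNT(DISTINCT col).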

If a table has six columns and all of them are searchable, should I index all of them or none of them
Are you searching on a field-by-field basis, or are some searches using multiple fields?
Which fields are searched on most?
What are the field types? (An index works better on INTs than on VARCHARs, for example.)
Have you tried using EXPLAIN on the queries that are being run?
What are the negative performance impacts of indexing
UPDATEs and INSERTs will be slower. There are also the extra storage space requirements, but that's usually unimportant these days.
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it
No, unless it's UNIQUE (which means it's already indexed) or you only search for exact matches on that field (not using LIKE or MySQL's fulltext search).
Generally I put an index on any fields that I will be searching or selecting using a WHERE clause
I'd normally index the fields that are the most queried, and then INTs/BOOLEANs/ENUMs rather than fields that are VARCHARs. Don't forget, often you need to create an index on combined fields, rather than an index on an individual field. Use EXPLAIN, and check the slow log.

Load Data Efficiently: Indexes speed up retrievals but slow down inserts and deletes, as well as updates of values in indexed columns. That is, indexes slow down most operations that involve writing. This occurs because writing a row requires writing not only the data row, it requires changes to any indexes as well. The more indexes a table has, the more changes need to be made, and the greater the average performance degradation. Most tables receive many reads and few writes, but for a table with a high percentage of writes, the cost of index updating might be significant.
Avoid Indexes: If you don’t need a particular index to help queries perform better, don’t create it.
Disk Space: An index takes up disk space, and multiple indexes take up correspondingly more space. This might cause you to reach a table size limit more quickly than if there are no indexes. Avoid indexes wherever possible.
Takeaway: Don't over index

In general, indices help speed up database searches, at the cost of extra disk space and slower INSERT / UPDATE / DELETE queries. Use EXPLAIN and read the results to find out when MySQL uses your indices.
If a table has six columns and all of them are searchable, should I index all of them or none of them?
Indexing all six columns isn't always the best practice.
(a) Are you going to use any of those columns when searching for specific information?
(b) What is the selectivity of those columns (how many distinct values are stored, compared to the total number of records in the table)?
MySQL uses a cost-based optimizer, which tries to find the "cheapest" path when performing a query. Fields with low selectivity aren't good candidates.
What are the negative performance impacts of indexing?
Already answered: extra disk space, lower performance during insert - update - delete.
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it?
Try the FULLTEXT Index.

1/2) Indexes speed up certain select operations, but they slow down other operations like inserts, updates and deletes. It can be a fine balance.
3) Use a full-text index, or perhaps Sphinx.

mysql not using index?

I have a table with columns like word, A_, E_, U_, and so on. The X_ columns are TINYINTs holding how many times the specific letter occurs in the word (to later help optimize a wildcard search query).
There are 252k rows in total. If I search with WHERE u_ > 0 I get 60k rows. But if I run EXPLAIN on that select, it says there are 225k rows to go through and no possible index. Why? The column was added as an index. Why doesn't it say there are 60k rows to go through and that the possible key is U_?
Listing the indexes on the table (it is also strange that the others are grouped under the A_ index)
In comparison, if I run a query with WHERE id > 250000 I get 2983 results, and if I run EXPLAIN on that select it says there are 2982 rows and that the key to be used is PRIMARY.
Btw, if I group by U_ I get this: (but it probably doesn't matter much because I already said the query returns 60k results)
EDIT:
If I create a column U (varchar(1)) and run the update U = 'U' where U_ > 0, then the select WHERE U = 'U' also returns 60k rows (obviously), but if I run EXPLAIN I get this:
Still not so good (120k rows, not 60k), but at least better than the 225k rows in the previous case. This solution is a bit more piggy than the first one, but maybe a bit more efficient.

My experience is that MySQL chooses to do a tablescan, even if there is an index on the column you're searching, if your query would select more than approximately 25% of the rows in the table.
The reason for this is that using a secondary index in InnoDB is a bit more work than using a primary index:
1. Look up value in secondary index, like your index on u_.
2. Read index entry, and find corresponding primary key value(s) of rows where that value in u_ is stored.
3. Look up row(s) by primary key.
It's actually at least double the work to look up by secondary key. This isn't a problem if you ultimately match a small minority of rows of the table, and there are definitely cases where a secondary index is really important for your query. So don't be reluctant to use secondary indexes.
But if your query matches too many rows, and that becomes a big portion of the table, then it would be less work to just scan the table start-to-finish.
By analogy, why doesn't the index at the back of a book contain the word "the"? Because the entry would naturally list every single page in the book, and it would be a waste for you to refer to the index and then use it to guide you to each page in the main part of the book. You would have been better off just reading the book.
MySQL does not have any officially documented threshold for choosing a tablescan over an indexed search. The 25% figure is only my experience (actually sometimes it seems closer to 21%, but I don't know the code well enough to understand exactly how the threshold is calculated).
I've seen cases where the proportion of rows matched was very close to whatever threshold is in the implementation, and the behavior of the optimizer can actually flip-flop from one query to the next, resulting in highly variable performance.
If this case applies to you, you can use an index hint to make MySQL's optimizer pretend that a tablescan is prohibitively expensive, and it should prefer an index to a tablescan. This is done with the FORCE INDEX hint.
SELECT * FROM words FORCE INDEX(U_) WHERE U_ > 0
I still try to use index hints conservatively. They aren't necessary except in rare cases, and using an index hint means your query must include the index name. This makes it hard to change indexes without breaking your application code.

You're asking about the backend query optimizer. In particular you're asking: "how does it choose an access path? Why index here but tablescan there?"
Let's think about that optimizer. What is it optimizing? Elapsed time, in expectation. It has a model for how long sequential reads and random reads take, and for query selectivity, that is, expected number of rows returned by a query. From several alternative access paths it chooses the one that appears to require the least elapsed time.
Your id > 250000 query had a few things going for it:
good selectivity, so less than 1% of rows will appear in the result set
id is the Primary Key, so all columns are immediately available upon navigating to the right place in the btree
This caused the optimizer to compute an expected elapsed time for the indexed access path much smaller than expected time for tablescan.
On the other hand, your u_ > 0 query has very poor selectivity, dragging nearly a quarter of the rows into the result set. Additionally, the index is not a covering index for your * demand of copying all column values into the result set. So the optimizer predicts it will have to read a quarter of the index blocks, and then essentially all of the data row blocks that they point to. So compared to tablescan, we'd have to read more blocks from disk, and they would be random reads instead of sequential reads. Both of those argue against using the index, so tablescan was selected because it was cheapest. Also, remember that often multiple rows will fit within a single disk block, or within a single read request. We would call it a pessimizer if it always chose the indexed access path, even in cases where indexed disk I/O would take longer.
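The reasoning above can be turned into a toy cost comparison. To be clear, this is not MySQL's cost model: the function name, the rows-per-block figure and the relative cost of random versus sequential reads are all assumptions picked for illustration.

```python
import math

def choose_access_path(total_rows, matching_rows, clustered=False,
                       rows_per_block=100, sequential_cost=1.0, random_cost=4.0):
    """Compare a full table scan with an index range read under a toy I/O model.

    Every constant here is an assumption chosen for illustration.
    """
    scan_cost = math.ceil(total_rows / rows_per_block) * sequential_cost
    if clustered:
        # Primary-key range: matching rows sit next to each other in the table.
        index_cost = math.ceil(matching_rows / rows_per_block) * sequential_cost
    else:
        # Non-covering secondary index: assume one random block read per matching row.
        index_cost = matching_rows * random_cost
    if index_cost < scan_cost:
        return "index", index_cost
    return "table scan", scan_cost

# u_ > 0: about 60k of 252k rows through a secondary index
print(choose_access_path(252_000, 60_000))
# id > 250000: 2,983 rows through the primary key
print(choose_access_path(252_000, 2_983, clustered=True))
```

Even with these crude numbers the model reproduces the behaviour in the question: the poorly selective secondary-index query loses to a sequential scan, while the primary-key range wins easily.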
summary advice
Use an index on a single column when your queries have good selectivity, returning much less than 1% of a relation's rows. Use a covering index when your queries have poor selectivity and you're willing to make a space vs. time tradeoff.

Does the order of conditions make a performance difference in MySQL?

Suppose I have a MySQL query like this, where the table PEOPLE has about 2 million rows:
SELECT * FROM `PEOPLE` WHERE `SEX`=1 AND `AGE`=28;
The first condition will match 1 million rows, and the second condition may match 20,000 rows. On a local website, most developers said that changing the order of the conditions gives better performance. They also said that it will take 2 million + 1 million + *10,000* I/O operations if the order is changed, while the original query above will take 2 million + 20,000 + *10,000*. That sounds like it makes sense.
As we all know, MySQL has an internal query optimizer for this kind of work. Does the order need particular attention for optimal performance? I am totally confused.
PS: I noticed that some similar questions have been asked already, but they are two or three years old, so it seems better to ask again.
Thanks to everyone who noticed this question. Here is an explanation of why I am asking again:
Before asking this question, I ran EXPLAIN a couple of times, and the result was that the order doesn't matter. But an interviewer told me the order makes a performance difference, so I want to make sure I am not missing something.

You should first understand a fundamental thing: in theory, a relational database does not have indices.
A purely theoretical relational database engine would indeed scan all records, check the criterion on the sex and age columns and only return the relevant rows.
However, indices are a common layer added by SQL database engines to filter rows faster. In this case, you should have indices for both of these columns.
What is more, these same database engines perform analysis on these indices (if any) to determine the best possible course of action to retrieve the relevant rows faster. In particular, one criterion in index metadata is cardinality: for a given value of the indexed column, how many rows match on average? The higher the number of rows, the lower the cardinality. Therefore, the higher the cardinality the better.
Therefore, an SQL engine's query optimizer will almost certainly choose to cut through the result set by looking up the age index first, and only then the index on sex. And it may even choose not to use the index on sex at all if it determines that it can be faster by just checking the sex column value for each row resulting from the first filter. Which is likely here, since the cardinality of the sex column is ridiculously low.
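A small Python sketch of that cardinality argument, using the row counts from the question. The function is hypothetical and assumes values are evenly distributed, which real optimizer statistics do not require; it only shows why the textual order of the conditions is irrelevant.

```python
def pick_driving_condition(conditions, total_rows, cardinality):
    """Choose the equality condition expected to match the fewest rows.

    cardinality maps a column name to the number of distinct values in its
    index; rows per value is estimated as total_rows / cardinality.
    """
    def expected_rows(condition):
        column, _value = condition
        return total_rows / cardinality[column]
    return min(conditions, key=expected_rows)

total_rows = 2_000_000
cardinality = {"SEX": 2, "AGE": 100}  # roughly 1,000,000 and 20,000 rows per value

as_written = [("SEX", 1), ("AGE", 28)]
swapped = [("AGE", 28), ("SEX", 1)]

print(pick_driving_condition(as_written, total_rows, cardinality))
print(pick_driving_condition(swapped, total_rows, cardinality))
```

Both calls pick the AGE condition, because the choice depends on the estimated row counts and not on where the condition appears in the WHERE clause.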
Have a look here for an introduction to the relational model.

Why does greater-than versus equals make a difference in MySQL SELECT?

I have a large MyISAM table. It's approaching 1 million rows. It's basically a list of items and some information about them.
There are two indices:
primary: the item ID
date (date) and col (int).
I run two queries:
SELECT * FROM table WHERE date = '2011-02-01' AND col < 5 LIMIT 10
SELECT * FROM table WHERE date < '2011-02-01' AND col < 5 LIMIT 10
The first one finishes in ~0.0005 seconds and the second in ~0.05 seconds. That is 100X difference. Is it wrong for me to expect both of these to run at roughly the same speed? I must not be understanding the indices very well. How can I speed up the second query?

Regardless of MySQL, it boils down to basic algorithm theory.
Greater-than and less-than operations on a large set are slower than identity operations.
With a large data set, an ideal data structure for determining less-than or greater-than is a self-balancing tree (binary or n-ary).
On a self-balancing tree, finding the boundary of a less/greater range is O(log n) in the worst case; returning the k matching entries then costs a further O(k).
The ideal data structure for identity lookup is a hashtable. The performance of hashtables is generally O(1), aka constant time. A hashtable, however, is not good for greater/less.
Generally a well-balanced tree performs only slightly worse than a hashtable (which is how Haskell gets away with using a tree where other languages use a hashtable).
Thus, regardless of what MySQL does, it's no surprise that <,> is slower than =.
Old Answer below:
Because the first one is like a hashtable lookup since it's '=' (particularly if your index is a hashtable), it will be faster than the second one, which might work better with a tree-like index.
Since MySQL allows you to configure the index format you can try changing that, but I'm rather sure the first will always run faster than the second.

I'm assuming you have an index on the date column.
The first query uses the index, the second query probably does a linear scan (at least over part of the data). A direct fetch is always faster than a linear scan.

MySQL stores its indexes by default in a BTREE. No hashing in general.
The short answer for the performance difference is that the < form evaluates more nodes than the = form.
The index that you've got on there (date, col) stores the values roughly like a phone book:
2011-01-01, col=1, row_ptr
2011-01-01, col=2, row_ptr
2011-01-01, col=3, row_ptr
etc...
2011-02-01, col=1, row_ptr
2011-02-01, col=2, row_ptr
2011-02-01, col=3, row_ptr
etc...
2011-02-02, col=1, row_ptr
2011-02-02, col=2, row_ptr
etc...
...in ascending sorted tree nodes of size B: (2011-01-01, col=1) < (2011-01-01, col=2) < (2011-01-02, col=1).
Your question is essentially asking the difference between:
1. Find all phone numbers with last name 'Smith' and first name starting with 'A'.
2. Find all phone numbers that come before 'Smith' and have first name starting with 'A'.
It should be obvious why #1 is so much faster than #2.
There are also considerations of memory/disk transfer efficiency and heap allocations (= does WAY fewer transfers than <) that account for a not-insignificant amount of time but depend largely on the distribution of the data and the specific location of the 2011-02-01, col=min(col) key record.
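The phone-book picture can be modelled with a sorted list of (date, col) pairs. This is a sketch under made-up data (60 days, col values 0 to 19 on each day) and is not a description of MyISAM internals; it only counts how many index entries each query shape has to look at.

```python
import bisect
from datetime import date, timedelta

# Hypothetical index on (date, col): 60 days, col values 0..19 on each day.
start = date(2011, 1, 1)
index = sorted((start + timedelta(days=d), c) for d in range(60) for c in range(20))

def equal_date_query(day, col_limit, limit=10):
    """date = day AND col < col_limit: one seek, then only matching entries are read."""
    lo = bisect.bisect_left(index, (day,))
    hi = bisect.bisect_left(index, (day, col_limit))
    entries = index[lo:hi][:limit]
    return entries, len(entries)

def earlier_date_query(day, col_limit, limit=10):
    """date < day AND col < col_limit: the date range is contiguous, but col must be
    checked entry by entry because matches are interleaved with non-matches."""
    hi = bisect.bisect_left(index, (day,))
    matches, examined = [], 0
    for entry in index[:hi]:
        examined += 1
        if entry[1] < col_limit:
            matches.append(entry)
            if len(matches) == limit:
                break
    return matches, examined

print(equal_date_query(date(2011, 2, 1), 5)[1])
print(earlier_date_query(date(2011, 2, 1), 5)[1])
```

With equality on date, the matching entries form one contiguous run. With a range on date, the entries with a small col are scattered through the range, so entries that do not match have to be read and discarded along the way.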

The first one performs a seek over the data, whereas the second one goes for a scan. Scans are generally costlier than seeks, hence the time difference.
It's like this: a scan means running through all the pages of the book, whereas a seek is jumping directly to a page number.
Hope this helps.
