How does mysql indexes turn random I/O into sequential I/O - mysql

the High Performance MySQL says that one of the benefit of index is "Indexes turn random I/O into sequential I/O", so I don't understand how does mysql indexes turn random I/O into sequential I/O.

I think they mean that when you have such indexes column in the table you can sort your stuff very easy as the indexes are a sequence

Think of a telephone book. Suppose you want to look up people with last name starting with "S". You can find all of them very close together.
Sorting is also helped by using an index. If you search for all the people with last names starting with "S" and you want them in order by last name, that's no problem, the query can read them in the same order they are listed in the phone book, and they're all stored together. So the fetching is done by reading sequentially.
But if you want them sorted by something else, like their first name, there has to be some additional step of sorting the fetched rows before returning them in the order you wanted. The dataset you want to be sorted might be larger than can fit in memory, so it must use temporary disk space to do the sorting. There are ways to do this pretty efficiently, but it could require multiple passes, or building a sorted list as it fetches the rows from the database.
Basically, having the rows stored in pre-sorted order makes it efficient to search for values that are stored together, and efficient to retrieve them in the order of the index.

Related

Will records order change between two identical query in mysql without order by

The problem is I need to do pagination.I want to use order by and limit.But my colleague told me mysql will return records in the same order,and since this job doesn't care in which order the records are shown,so we don't need order by.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quick as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order than they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal, it's hard to code and requieres constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator in the assumption that SQL data sets are ordered sets because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the same because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.

Improve Mysql Select Query Performance [duplicate]

I've been using indexes on my MySQL databases for a while now but never properly learnt about them. Generally I put an index on any fields that I will be searching or selecting using a WHERE clause but sometimes it doesn't seem so black and white.
What are the best practices for MySQL indexes?
Example situations/dilemmas:
If a table has six columns and all of them are searchable, should I index all of them or none of them?
What are the negative performance impacts of indexing?
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it?
You should definitely spend some time reading up on indexing, there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option. You scan the table checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is "M". For instance, you can perform a binary search! Whereas the table scan requires you to look N rows (where N is the number of rows), the binary search only requires that you look at log-n index entries, in the very worst case. Wow, that's sure a lot easier!
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.
Hopefully that covers your first two questions (as others have answered -- you need to find the right balance).
Your third scenario is a little more complicated. If you're using LIKE, indexing engines will typically help with your read speed up to the first "%". In other words, if you're SELECTing WHERE column LIKE 'foo%bar%', the database will use the index to find all the rows where column starts with "foo", and then need to scan that intermediate rowset to find the subset that contains "bar". SELECT ... WHERE column LIKE '%bar%' can't use the index. I hope you can see why.
Finally, you need to start thinking about indexes on more than one column. The concept is the same, and behaves similarly to the LIKE stuff -- essentially, if you have an index on (a,b,c), the engine will continue using the index from left to right as best it can. So a search on column a might use the (a,b,c) index, as would one on (a,b). However, the engine would need to do a full table scan if you were searching WHERE b=5 AND c=1)
Hopefully this helps shed a little light, but I must reiterate that you're best off spending a few hours digging around for good articles that explain these things in depth. It's also a good idea to read your particular database server's documentation. The way indices are implemented and used by query planners can vary pretty widely.
Check out presentations like More Mastering the Art of Indexing.
Update 12/2012: I have posted a new presentation of mine: How to Design Indexes, Really. I presented this in October 2012 at ZendCon in Santa Clara, and in December 2012 at Percona Live London.
Designing the best indexes is a process that has to match the queries you run in your app.
It's hard to recommend any general-purpose rules about which columns are best to index, or whether you should index all columns, no columns, which indexes should span multiple columns, etc. It depends on the queries you need to run.
Yes, there is some overhead so you shouldn't create indexes needlessly. But you should create the indexes that give benefit to the queries you need to run quickly. The overhead of an index is usually far outweighed by its benefit.
For a column that is VARCHAR(2500), you probably want to use a FULLTEXT index or a prefix index:
CREATE INDEX i ON SomeTable(longVarchar(100));
Note that a conventional index can't help if you're searching for words that may be in the middle of that long varchar. For that, use a fulltext index.
I won't repeat some of the good advice in other answers, but will add:
Compound Indices
You can create compound indices - an index that includes multiple columns. MySQL can use these from left to right. So if you have:
Table A
Id
Name
Category
Age
Description
if you have a compound index that includes Name/Category/Age in that order, these WHERE clauses would use the index:
WHERE Name='Eric' and Category='A'
WHERE Name='Eric' and Category='A' and Age > 18
but
WHERE Category='A' and Age > 18
would not use that index because everything has to be used from left to right.
Explain
Use Explain / Explain Extended to understand what indices are available to MySQL and which one it actually selects. MySQL will only use ONE key per query.
EXPLAIN EXTENDED SELECT * from Table WHERE Something='ABC'
Slow Query Log
Turn on the slow query log to see which queries are running slow.
Wide Columns
If you have a wide column where MOST of the distinction happens in the first several characters, you can use only the first N characters in your index. Example: We have a ReferenceNumber column defined as varchar(255) but 97% of the cases, the reference number is 10 characters or less. I changed the index to only look at the first 10 characters and improved performance quite a bit.
If a table has six columns and all of them are searchable, should i index all of them or none of them
Are you searching on a field by field basis or are some searches using multiple fields?
Which fields are most being searched on?
What are the field types? (Index works better on INTs than on VARCHARs for example)
Have you tried using EXPLAIN on the queries that are being run?
What are the negetive performance impacts of indexing
UPDATEs and INSERTs will be slower. There's also the extra storage space requirments, but that's usual unimportant these days.
If i have a VARCHAR 2500 column which is searchable from parts of my site, should i index it
No, unless it's UNIQUE (which means it's already indexed) or you only search for exact matches on that field (not using LIKE or mySQL's fulltext search).
Generally I put an index on any fields that i will be searching or selecting using a WHERE clause
I'd normally index the fields that are the most queried, and then INTs/BOOLEANs/ENUMs rather that fields that are VARCHARS. Don't forget, often you need to create an index on combined fields, rather than an index on an individual field. Use EXPLAIN, and check the slow log.
Load Data Efficiently: Indexes speed up retrievals but slow down inserts and deletes, as well as updates of values in indexed columns. That is, indexes slow down most operations that involve writing. This occurs because writing a row requires writing not only the data row, it requires changes to any indexes as well. The more indexes a table has, the more changes need to be made, and the greater the average performance degradation. Most tables receive many reads and few writes, but for a table with a high percentage of writes, the cost of index updating might be significant.
Avoid Indexes: If you don’t need a particular index to help queries perform better, don’t create it.
Disk Space: An index takes up disk space, and multiple indexes take up correspondingly more space. This might cause you to reach a table size limit more quickly than if there are no indexes. Avoid indexes wherever possible.
Takeaway: Don't over index
In general, indices help speedup database search, having the disadvantage of using extra disk space and slowing INSERT / UPDATE / DELETE queries. Use EXPLAIN and read the results to find out when MySQL uses your indices.
If a table has six columns and all of them are searchable, should i index all of them or none of them?
Indexing all six columns isn't always the best practice.
(a) Are you going to use any of those columns when searching for specific information?
(b) What is the selectivity of those columns (how many distinct values are there stored, in comparison to the total amount of records on the table)?
MySQL uses a cost-based optimizer, which tries to find the "cheapest" path when performing a query. And fields with low selectivity aren't good candidates.
What are the negetive performance impacts of indexing?
Already answered: extra disk space, lower performance during insert - update - delete.
If i have a VARCHAR 2500 column which is searchable from parts of my site, should i index it?
Try the FULLTEXT Index.
1/2) Indexes speed up certain select operations but they slow down other operations like insert, update and deletes. It can be a fine balance.
3) use a full text index or perhaps sphinx

In a relational database, should all columns that will be ordered in a query have an index?

I'm accessing the database (Predominately MS SQL Server, Postgre) through ORM and defining attributes (like whether the field/column should have an index) via code.
I'm thinking that if a column will be ordered via ORDER BY, it should have an index, otherwise full table scan will be required every time (e.g. if you want to get top 5 records ordered by date).
As I'm defining these indexes in code (on Entity Framework POCO entities, as .NET attributes), I can access these metadata at runtime. When displaying the data in a grid, I'm planning to make only those columns sortable (by clicking on column header) that have an index attribute. Is my thinking correct, or maybe there exist some reasonable conditions where sorting can be desirable on non-indexed column, or vice-versa (indexed column sorting would not make much sense?..)
In short, is it good to assume that only those columns should be sortable in UI, that have corresponding index applied at the database level?
Or, to phrase more generic question: Should columns that will be ordered always have some sort of index?
Whether you need an index depends on how often you query the ordered sequence compared to how often you make changes that could influence the ordered sequence.
Every time you make changes that influence the ordered sequence your database has to reorder the ordered index. So if you will considerably make more changes than queries then the index will be ordered more often than the result of the ordering will be used.
Furthermore it depends on who is willing to wait for the result: the one who makes changes that requires a re-index, or the one who does the queries.
I wouldn't be surprised if the index is ordered by a separate process after the change has been made. If the query is done while the ordering is not finished, the database will need to first finish enough of the ordering before the query can return.
On the other hand, if a new change is made while the ordering that was needed because of an earlier change was not finished, the database probably will not finish the previous ordering, but start ordering the new situation.
So I guess it is not mandatory to have an ordered index for every query. To order every possible column-combination will be too much work, but if quite often a certain ordering is requested by a process that is waiting for the results, it might be wise to create the ordered index.
order by doesn't mandate index on a column but if isn't indexed then it will end up doing a file sort than index sort and thus it's always preferred to have those column indexed if you are intended to use them in WHERE / JOIN ON / HAVING / ORDER BY.
You can generate the query execution plan and see the differences between the versions (indexed over non-indexed)
Kudos to #Harald Coppoolse for a thorough answer - there's something else which you should know about sorting on the DB, and that it is preferred to be done at the app level. See item number 2 in the following list: https://www.brentozar.com/archive/2013/02/7-things-developers-should-know-about-sql-server/

MySQL Improving speed of order by statements

I've got a table in a MySQL db with about 25000 records. Each record has about 200 fields, many of which are TEXT. There's nothing I can do about the structure - this is a migration from an old flat-file db which has 16 years of records, and many fields are "note" type free-text entries.
Users can be viewing any number of fields, and order by any single field, and any number of qualifiers. There's a big slowdown in the sort, which is generally taking several seconds, sometimes as much as 7-10 seconds.
an example statement might look like this:
select a, b, c from table where b=1 and c=2 or a=0 order by a desc limit 25
There's never a star-select, and there's always a limit, so I don't think the statement itself can really be optimized much.
I'm aware that indexes can help speed this up, but since there's no way of knowing what fields are going to be sorted on, i'd have to index all 200 columns - what I've read about this doesn't seem to be consistent. I understand there'd be a slowdown when inserting or updating records, but assuming that's acceptable, is it advisable to add an index to each column?
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
Also, the primary identifier is a crazy pattern they came up with in the nineties. This is the PK and so should be indexed by virtue of being the PK (right?). The records are (and have been) submitted to the state, and to their clients, and I can't change the format. This column needs to sort based on the logic that's in place, which involves a stored procedure with string concatenation and substring matching. This particular sort is especially slow, and doesn't seem to cache, even though this one field is indexed, so I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
TYIA.
I'd have to index all 200 columns
That's not really a good idea. Because of the way MySQL uses indexes most of them would probably never be used while still generating quite a large overhead. (see chapter 7.3 in link below for details). What you could do however, is to try to identify which columns appear most often in WHERE clause, and index those.
In the long run however, you will probably need to find a way, to rework your data structure into something more manageable, because as it is now, it has the smell of 'spreadsheet turned into database' which is not a nice smell.
I've read about sort_buffer_size but it seems like everything I read
conflicts with the last thing I read - is it advisable to increase
this value, or any of the other similar values (read_buffer_size,
etc)?
In general he answer is yes. However the actual details depend on your hardware, OS and what storage engine you use. See chapter 7.11 (especially 7.11.4 in link below)
Also, the primary identifier is a crazy pattern they came up with in
the nineties.[...] I wonder if there's anything I can do to speed up
the sorting on this particular field (which is the default order by).
Perhaps you could add a primarySortOrder column to your table, into which you could store numeric values that would map the PK order (precaluclated from the store procedure you're using).
Ant the link you've been waiting for: Chapter 7 from MySQL manual: Optimization
Add an index to all the columns that have a large number of distinct values, say 100 or even 1000 or more. Tune this number as you go.

How does mysql order by implemented internally?

How does Mysql order by implemented internally? would ordering by multiple columns involve scanning the data set multiple times once for each column specified in the order by constraint?
Here's the description:
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
Unless you have out-of-row columns (BLOB or TEXT) or your SELECT list is too large, this algorithm is used:
Read the rows that match the WHERE clause.
For each row, record a tuple of values consisting of the sort key value and row position, and also the columns required for the query.
Sort the tuples by sort key value
Retrieve the rows in sorted order, but read the required columns directly from the sorted tuples rather than by accessing the table a second time.
Ordering by multiple columns does not require scanning the dataset twice, since all data required for sorting can be fetched in a single read.
Note that MySQL can avoid the order completely and just read the values in order, if you have an index with the leftmost part matching your ORDER BY condition.
MySQL is canny. Its sorting algorithm depends on a couple of factors -
Available Indexes
Expected size of result
MySQL version
MySQL has two methods to produce sorted/ordered streams of data.
1. Smart use of Indexes
Firstly MySQL optimiser analyses the query and figures out if it can just take advantage of sorted indexes available. If yes, it naturally returns records in index order. (The exception is NDB engine, which needs to perform a merge sort once it gets data from all storage nodes)
Hands down to the MySQL optimiser, who smartly figures out if the index access method is cheaper than other access methods.
Really interesting thing to see here
The index may also be used even if ORDER BY doesn’t match the index exactly, as long as other columns in ORDER BY are constants
Sometimes, the optimizer probably may not use Index if it finds indexes expensive as compared to scanning through the table.
2. Filesort Algorithm
If Indexes can not be used to satisfy an ORDER BY clause, MySQL utilises filesort algorithm. This is a really interesting algorithm. In a nutshell, It works like
It scans through the table and finds the rows which matches the WHERE condition
It maintains a buffer and stores a couple of values (sort key value, row pointer and columns required in the query) from each row in it. The size of this chunk is system variable sort_buffer_size.
When, buffer is full, it runs a quick sort on it based on the sort key and stores it safely to the temp file on disk and remembers a pointer to it
It will repeat the same step on chunks of data until there are no more rows left
Now, it has a couple of chunks which are sorted
Finally, it applies merge sort on all sorted chunks and puts it in one result file
In the end, it will fetch the rows from the sorted result file
If the expected result fits in one chunk, the data never hits disk, but remains in RAM.
For a detailed Info - https://www.pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql