How is MySQL ORDER BY implemented internally? - mysql

How is MySQL ORDER BY implemented internally? Would ordering by multiple columns involve scanning the data set multiple times, once for each column specified in the ORDER BY clause?

Here's the description:
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
Unless you have out-of-row columns (BLOB or TEXT) or your SELECT list is too large, this algorithm is used:
Read the rows that match the WHERE clause.
For each row, record a tuple of values consisting of the sort key value and row position, and also the columns required for the query.
Sort the tuples by sort key value.
Retrieve the rows in sorted order, but read the required columns directly from the sorted tuples rather than by accessing the table a second time.
Ordering by multiple columns does not require scanning the dataset twice, since all data required for sorting can be fetched in a single read.
Note that MySQL can avoid the sort completely and just read the values in order, if you have an index whose leftmost part matches your ORDER BY clause.
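As a rough sketch of both cases (the table t and its columns are hypothetical; the exact plan depends on the optimizer's cost estimates):

-- Composite index whose leftmost columns match the ORDER BY
CREATE TABLE t (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  a  INT NOT NULL,
  b  INT NOT NULL,
  c  VARCHAR(50),
  KEY idx_a_b (a, b)
);

-- Rows can be read straight from idx_a_b in index order:
-- EXPLAIN typically shows no "Using filesort" in the Extra column.
EXPLAIN SELECT a, b FROM t ORDER BY a, b;

-- No matching index prefix: one scan gathers the (sort key, row data) tuples,
-- then a single sort orders them; Extra shows "Using filesort".
EXPLAIN SELECT * FROM t ORDER BY c, b;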

MySQL is canny. Its sorting strategy depends on a few factors -
Available Indexes
Expected size of result
MySQL version
MySQL has two methods to produce sorted/ordered streams of data.
1. Smart use of Indexes
First, the MySQL optimiser analyses the query and figures out if it can simply take advantage of the sorted indexes available. If yes, it naturally returns records in index order. (The exception is the NDB engine, which needs to perform a merge sort once it gets data from all storage nodes.)
Hats off to the MySQL optimiser, which smartly figures out whether the index access method is cheaper than other access methods.
A really interesting thing to note here:
The index may also be used even if the ORDER BY does not match the index exactly, as long as all unused parts of the index and all extra ORDER BY columns are constants in the WHERE clause.
Sometimes, the optimizer may decide not to use an index if it estimates index access to be more expensive than scanning the table.
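To illustrate the quoted behaviour, here is a minimal sketch with a hypothetical stores table and a composite (country, city) index; whether the index is actually chosen depends on the optimizer's cost estimates:

CREATE TABLE stores (
  id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  country CHAR(2)      NOT NULL,
  city    VARCHAR(100) NOT NULL,
  KEY idx_country_city (country, city)
);

-- ORDER BY city alone does not match the index, but because country is held
-- constant in the WHERE clause, rows can still be read in idx_country_city
-- order and no filesort is needed.
EXPLAIN SELECT * FROM stores WHERE country = 'US' ORDER BY city;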
2. Filesort Algorithm
If indexes cannot be used to satisfy an ORDER BY clause, MySQL uses the filesort algorithm. This is a really interesting algorithm. In a nutshell, it works like this:
It scans through the table and finds the rows that match the WHERE condition.
It maintains a buffer and stores a tuple of values (sort key value, row pointer and the columns required by the query) from each row in it. The size of this buffer is controlled by the system variable sort_buffer_size.
When the buffer is full, it runs a quicksort on it based on the sort key, writes the sorted chunk to a temporary file on disk, and remembers a pointer to that file.
It repeats the same steps on further chunks of data until there are no more rows left.
Now it has a number of chunks, each of which is sorted.
Finally, it applies a merge sort to all the sorted chunks and puts the result in one result file.
In the end, it fetches the rows from the sorted result file.
If the expected result fits in one chunk, the data never hits disk but stays in RAM.
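As a rough way to observe this on a live server (the table and column names below are placeholders, not from the question), the relevant knob and counters can be inspected like this:

-- How big each in-memory sort chunk may grow before spilling to disk
SHOW VARIABLES LIKE 'sort_buffer_size';

-- "Using filesort" in the Extra column of EXPLAIN means this algorithm is in play
EXPLAIN SELECT * FROM some_table ORDER BY some_unindexed_col;

-- After actually running the query: a non-zero Sort_merge_passes means the
-- buffer overflowed and sorted chunks were merged from temporary files on disk
SHOW SESSION STATUS LIKE 'Sort_merge_passes';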
For more detailed info: https://www.pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql

Related

In a relational database, should all columns that will be ordered in a query have an index?

I'm accessing the database (predominantly MS SQL Server and PostgreSQL) through an ORM and defining attributes (like whether a field/column should have an index) via code.
I'm thinking that if a column will be ordered via ORDER BY, it should have an index; otherwise a full table scan will be required every time (e.g. if you want to get the top 5 records ordered by date).
As I'm defining these indexes in code (on Entity Framework POCO entities, as .NET attributes), I can access this metadata at runtime. When displaying the data in a grid, I'm planning to make only those columns sortable (by clicking on the column header) that have an index attribute. Is my thinking correct, or are there reasonable cases where sorting is desirable on a non-indexed column, or vice versa (where sorting on an indexed column would not make much sense)?
In short, is it reasonable to assume that only those columns that have a corresponding index at the database level should be sortable in the UI?
Or, to phrase the question more generically: should columns that will be ordered always have some sort of index?
Whether you need an index depends on how often you query the ordered sequence compared to how often you make changes that could influence the ordered sequence.
Every time you make changes that influence the ordered sequence, your database has to reorder the index. So if you make considerably more changes than queries, the index will be reordered more often than the result of the ordering is actually used.
Furthermore, it depends on who is willing to wait for the result: the one who makes the changes that require re-indexing, or the one who runs the queries.
I wouldn't be surprised if the index is reordered by a separate process after the change has been made. If a query is run while the reordering is not finished, the database will first need to finish enough of the reordering before the query can return.
On the other hand, if a new change is made while the reordering required by an earlier change is not yet finished, the database will probably not finish the previous reordering but start reordering for the new situation.
So I guess it is not mandatory to have an ordered index for every query. Maintaining an ordered index for every possible column combination would be too much work, but if a certain ordering is requested quite often by a process that waits for the results, it might be wise to create that index.
ORDER BY doesn't mandate an index on a column, but if the column isn't indexed the query will end up doing a filesort rather than an index sort, so it's always preferable to have those columns indexed if you intend to use them in WHERE / JOIN ... ON / HAVING / ORDER BY.
You can generate the query execution plan and see the difference between the two versions (indexed versus non-indexed).
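As a sketch of that comparison for the "top 5 records ordered by date" case mentioned in the question (table and column names are made up):

CREATE TABLE articles (
  id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title        VARCHAR(200) NOT NULL,
  published_at DATETIME     NOT NULL
);

-- Without an index on published_at: full scan plus a filesort of the whole table.
EXPLAIN SELECT * FROM articles ORDER BY published_at DESC LIMIT 5;

ALTER TABLE articles ADD INDEX idx_published_at (published_at);

-- With the index: the engine can walk the index backwards and stop after five
-- rows; Extra typically no longer shows "Using filesort".
EXPLAIN SELECT * FROM articles ORDER BY published_at DESC LIMIT 5;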
Kudos to @Harald Coppoolse for a thorough answer - there's something else you should know about sorting on the DB side: it is often preferable to do it at the app level. See item number 2 in the following list: https://www.brentozar.com/archive/2013/02/7-things-developers-should-know-about-sql-server/

How do MySQL indexes turn random I/O into sequential I/O

The book High Performance MySQL says that one of the benefits of an index is that "indexes turn random I/O into sequential I/O", but I don't understand how MySQL indexes turn random I/O into sequential I/O.
I think they mean that when you have such an indexed column in the table, sorting becomes very easy because the index entries are kept in sequence.
Think of a telephone book. Suppose you want to look up people with last name starting with "S". You can find all of them very close together.
Sorting is also helped by using an index. If you search for all the people with last names starting with "S" and you want them in order by last name, that's no problem: the query can read them in the same order they are listed in the phone book, and they're all stored together. So the fetching is done by reading sequentially.
But if you want them sorted by something else, like their first name, there has to be some additional step of sorting the fetched rows before returning them in the order you wanted. The dataset you want to be sorted might be larger than can fit in memory, so it must use temporary disk space to do the sorting. There are ways to do this pretty efficiently, but it could require multiple passes, or building a sorted list as it fetches the rows from the database.
Basically, having the rows stored in pre-sorted order makes it efficient to search for values that are stored together, and efficient to retrieve them in the order of the index.
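A small sketch of the phone-book analogy in SQL (the phone_book table is hypothetical):

CREATE TABLE phone_book (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  last_name  VARCHAR(50) NOT NULL,
  first_name VARCHAR(50) NOT NULL,
  phone      VARCHAR(20),
  KEY idx_last_name (last_name)
);

-- The matching index entries sit next to each other, so they are read
-- sequentially and already come back ordered by last_name (no extra sort).
SELECT * FROM phone_book
WHERE last_name LIKE 'S%'
ORDER BY last_name;

-- Same rows, but ordered by a column the index does not keep in order:
-- the server must sort the fetched rows in an extra step (filesort).
SELECT * FROM phone_book
WHERE last_name LIKE 'S%'
ORDER BY first_name;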

MySQL performance with index clarification

Say I have a mysql table with an index on the column 'name':
I do this query:
select * from name_table where name = 'John';
Say there are 5 results that are returned from a table with 100 rows.
Say I now insert 1 million new rows, none of which have the name John, so there are still only 5 Johns in the table. Will the SELECT statement be as fast as before, or will inserting all these rows have an impact on the read speed of the indexed table?
Indexes have their own "tables", and when the MySQL engine determines that the lookup references an indexed column, the lookup happens on this table. It isn't really a table per se, but the gist checks out.
That said, it will be nanoseconds slower, but not something you should concern yourself with.
More importantly, concern yourself with indexing pertinent data and with column order, as these have MUCH more of an impact on database performance.
To learn more about what is happening behind the scenes, run EXPLAIN on the query:
EXPLAIN select * from name_table where name = 'John';
Note: In addition to the column ordering listed in the link, it is a good (nay, great) idea to place variable-length columns (VARCHAR) after their fixed-length counterparts (CHAR), because during a lookup the engine either has to look at the row, read the column lengths, and then skip forward to the value (mind you, this only applies to non-indexed columns), or it can read the table declaration and know it always has to look at the column at offset X. It is more complicated behind the scenes, but if you can shift all fixed-length columns to the front, you will thank yourself. Basically (see the sketch after this list):
Indexed columns.
Everything Fixed-Length in order according to the link.
Everything Variable-Length in order according to the link.
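A sketch of the column layout this answer recommends (column names are hypothetical, and whether this pays off depends on the storage engine and row format):

CREATE TABLE customer (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- indexed column first
  country    CHAR(2)      NOT NULL,                    -- fixed-length
  created_at DATETIME     NOT NULL,                    -- fixed-length
  name       VARCHAR(100) NOT NULL,                    -- variable-length
  notes      TEXT                                      -- variable-length, last
);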
Yes, it will be just as fast.
(In addition to the excellent points made in Mike's answer...) there's an important point we should make regarding indexes (B-tree indexes in particular):
The entries in the index are stored "in order".
The index is also organized in a way that allows the database to very quickly identify the blocks in the index that contain the entries it's looking for (or the block that would contain entries, if no matching entries are there.)
What this means is that the database doesn't need to look at every entry in the index. Given a predicate like the one in your question:
WHERE name = 'John'
with an index with a leading column of name, the database can eliminate vast swaths of blocks that don't need to be checked.
Blocks near the beginning of the index contain entries 'Adrian' through 'Anna'; a little later in the index a block contains entries for 'Caleb' through 'Carl'; further along in the index, 'James' through 'Jane'; and so on.
Because of the way the index is organized, the database effectively "knows" that the entries we're looking for cannot be in any of those blocks (the index is in order, so there is no way the value 'John' could appear in the blocks we mentioned), and none of those blocks needs to be checked. The database figures out, in just a very small number of operations, that 98% of the blocks in the index can be eliminated from consideration.
High cardinality = good performance
The take away from this is that indexes are most effective on columns that have high cardinality. That is, there are a large number of distinct values in the column, and those values are unique or nearly unique.
This should clear up the answer to the question you were asking. You can add gazillions of rows to the table. If only five of those rows have a value of 'John' in the name column, then when you do a query
WHERE name = 'John'
it will be just as fast. The database will be able to locate the entries you're looking for nearly as fast as it could when the table had only a thousand rows.
(As the index grows larger it does add "levels", which must be traversed down to the leaf nodes... so it gets ever so slightly slower because of a few more operations. Where performance really starts to bog down is when the InnoDB buffer pool is too small and we have to wait for the (glacially slow, in comparison) disk I/O operations to fetch blocks into memory.)
Low cardinality = poor performance
Indexes on columns with low cardinality are much less effective. For example, a column that has two possible values, with an even distribution of values across the rows in the table (about half of the rows have one value, and the other half have the other value.) In this case, the database can't eliminate 98% of the blocks, or 90% of the blocks. The database has to slog through half the blocks in the index, and then (usually) perform a lookup to the pages in the underlying table to get the other values for the row.
But with gazillions of rows with a column gender, with two values 'M' and 'F', an index with gender as a leading column will not be effective in satisfying a query
WHERE gender = 'M'
... because we're effectively telling the database to retrieve half the rows in the table, and those rows are likely to be evenly distributed across the table. Since nearly every page in the table will contain at least one row we need, the database will opt for a full table scan (looking at every row in every block of the table) to locate the rows, rather than using the index.
So, in terms of performance for looking up rows in the table using an index... the size of the table isn't really an issue. The real issue is the cardinality of the values in the index, and how many distinct values we're looking for, and how many rows need to be returned.
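To make the two cases concrete, here is a sketch using the name_table from the question, extended with the gender column discussed above (the schema details are assumptions):

CREATE TABLE name_table (
  id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name   VARCHAR(64) NOT NULL,
  gender CHAR(1)     NOT NULL,
  KEY idx_name (name),
  KEY idx_gender (gender)
);

-- High cardinality: EXPLAIN typically shows type=ref on idx_name with a tiny
-- "rows" estimate, no matter how large the table grows.
EXPLAIN SELECT * FROM name_table WHERE name = 'John';

-- Low cardinality: the optimizer will usually ignore idx_gender and pick a
-- full table scan (type=ALL), since roughly half the rows match anyway.
EXPLAIN SELECT * FROM name_table WHERE gender = 'M';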

Is there a benefit to indexing a MySQL column if every row always has a different value?

The question is about columns like a timestamp, where a different value is stored in every row.
I've already searched through Stack Overflow and read about indexes, but I don't understand the benefit if no value equals another. The index cardinality will simply equal the number of rows, so what is the benefit?
This kind of column would actually be an excellent candidate for an index, preferably a unique one.
Tables are unsorted sets of data, so without any knowledge about the table, the database will have to go over the entire table sequentially to find the rows you're looking for (O(n) complexity, where n is the number of rows).
An index is, essentially, a tree that stores values in sorted order, which allows the database to find the rows you're looking for intelligently (O(log n)). In addition, making the index unique tells the database there can be only one row per timestamp value, so once a single row is retrieved the database can stop searching for more.
The performance benefit for such an index, assuming you search for rows according to timestamps, should be significant.
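A minimal sketch of such a table and the kinds of queries that benefit, assuming lookups and range scans on the timestamp (the names are hypothetical):

CREATE TABLE events (
  id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  created_at TIMESTAMP NOT NULL,
  payload    VARCHAR(255),
  UNIQUE KEY uk_created_at (created_at)
);

-- Point lookup: a B-tree descent instead of a full table scan, and the UNIQUE
-- constraint lets the engine stop after the first match.
SELECT * FROM events WHERE created_at = '2024-01-15 12:00:00';

-- Range scan: the index returns rows already in timestamp order, so the
-- ORDER BY needs no extra sort.
SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'
ORDER BY created_at;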
An index is a map between key values and retrieval pointers. The DBMS uses an index during a query if a strategy that uses the index appears to be optimal.
If the index never gets used, then it is useless.
Indexes can speed up lookups based on a single key value, or based on a range of key values (depending on the index type), or by allowing index-only retrieval in cases where only the key is needed for the query. Speed-ups can be as small as two to one or as large as a hundred to one, depending on the size of the table and various other factors.
If your timestamp field is never used in the WHERE clause or the ON clause of a query, the chances are you are better off with no index. The art of choosing indexes well goes a lot deeper than this, but this is a start.

Are database tables sorted before or after being retrieved?

I am creating a high-scores table stored in a database for my game, but I was wondering what the best practices are for storing such tables.
Should the table be re-sorted each time a new score is added, or should the data returned by a query just be sorted at retrieval time?
It seems like it would be easier on the server to just sort when retrieving data instead of updating the table each time.
Rows in a relational database such as MySQL, Oracle, PostgreSQL, etc. are not maintained in any order. In the theory of relational databases result sets are returned in no specified order unless the query contains an ORDER BY clause. Any ordering is (must be) applied each time the data is retrieved.
Implementations may, in some cases, store the data in some order, but they are not required to do so. In fact, if you run the exact same query twice on the same data there is no guarantee that the data will be returned in the same sequence.
In other words, you cannot impose a storage order on your data, you impose order only on result sets at the time the query is executed.
I recommend sorting the data in your MySQL query. As you said, it is easier to sort only when needed rather than every time a record is added.
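A sketch of what that looks like in practice, with a hypothetical high_scores table; an index on score keeps the retrieval-time sort cheap:

CREATE TABLE high_scores (
  id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  player_name VARCHAR(50) NOT NULL,
  score       INT         NOT NULL,
  KEY idx_score (score)
);

-- Inserts never reorder the table; they just add a row (and an index entry).
INSERT INTO high_scores (player_name, score) VALUES ('alice', 4200);

-- The top-10 query applies the ordering at retrieval time; with idx_score the
-- engine can walk the index backwards and stop after ten rows.
SELECT player_name, score
FROM high_scores
ORDER BY score DESC
LIMIT 10;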
Data in tables is unsorted. The actual physical order of rows in a relational table is undefined. However, some databases will order rows on disk according to a clustered index.
If your tables contain only a few thousand rows, the two approaches don't differ much in performance. However, if your tables grow to more than about 10,000 rows, a clustered index can help.
(For a reference on clustered indexes, see http://www.karafilis.net/sql-indexing-part2/.)