Performance cost for using primary key in order by - mysql

Let's say I have a table with a three column primary key. If I select all from that table without any order by clause they are, to my understanding, ordered by these columns. The first one, and within that by the second column and within that by the third.
Is there any additional cost, or perhaps a performance gain, from explicitly adding an ORDER BY clause with the three columns in the order they appear in the primary key?

If I select all from that table without any order by clause they are, to my understanding, ordered by these columns
Your understanding is incorrect. Without an ORDER BY, the database engine may output the results in any order it chooses. In fact, if you were to add a clustered index on another field, it would likely change the order of the result.
Is there any additional cost, or perhaps performance gain
There will never be a performance gain by ordering (unless you're doing something that affects branch prediction). There will be no cost if the database was already going to return the rows in that order. However, there may be a correctness issue if you don't specify the order.
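One way to check the "no cost if the engine was going to read in that order anyway" point is to look at the query plan. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL (a WITHOUT ROWID table is stored in primary-key order, loosely like an InnoDB table), and the table t with its columns is hypothetical. On MySQL itself you would run EXPLAIN on the query and look for "Using filesort" instead.

```python
import sqlite3


def needs_sort(order_by: str) -> bool:
    """Return True if the plan for this ORDER BY contains an explicit sort step."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table; WITHOUT ROWID stores the rows in primary-key order.
    con.execute(
        "CREATE TABLE t (a INT, b INT, c INT, payload TEXT, "
        "PRIMARY KEY (a, b, c)) WITHOUT ROWID"
    )
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM t ORDER BY {order_by}"
    ).fetchall()
    con.close()
    # The last field of each plan row is a human-readable description;
    # SQLite reports a separate sort as a "TEMP B-TREE" step.
    return any("TEMP B-TREE" in row[-1] for row in rows)


if __name__ == "__main__":
    print("ORDER BY a, b, c needs a sort:", needs_sort("a, b, c"))
    print("ORDER BY c, b, a needs a sort:", needs_sort("c, b, a"))
```

Even when the plan shows no extra sort, the ORDER BY is still what guarantees the order; the plan only tells you the guarantee is free.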

Related

MySQL index key on table with more columns

In my script, I have a lot of SQL WHERE clauses, e.g.:
SELECT * FROM cars WHERE active=1 AND model='A3';
SELECT * FROM cars WHERE active=1 AND year=2017;
SELECT * FROM cars WHERE active=1 AND brand='BMW';
I am using different SQL clauses on same table because I need different data.
I would like to set index key on table cars, but I am not sure how to do it. Should I set separate keys for each column (active, model, year, brand) or should I set keys for groups (active,model and active,year and active,brand)?
WHERE a=1 AND y='m'
is best handled by INDEX(a,y) in either order. The optimal set of indexes is several pairs like that. However, I do not recommend having more than a few indexes. Try to limit it to queries that users actually make.
INDEX(a,b,c,d):
WHERE a=1 AND b=22 -- Index is useful
WHERE a=1 AND d=44 -- Index is less useful
Only the "left column(s)" of an index are used. Hence the second case uses a, but stops there because b is not in the WHERE.
You might be tempted to also have (active, year, model). That combination works well for active AND year, active AND year AND model, but not active AND model (but no year).
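The leftmost-prefix rule can be seen in a query plan. This is a hedged sketch, not MySQL output: it uses Python's built-in sqlite3 as a stand-in (its B-tree indexes follow the same prefix rule), and the table t, the payload column and the index name idx_abcd are made up for illustration. On MySQL, the key_len column of EXPLAIN gives the comparable information.

```python
import sqlite3


def access_path(where: str) -> str:
    """Return the plan text describing how table t is accessed for this WHERE."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table mirroring INDEX(a,b,c,d) from the text above.
    con.execute("CREATE TABLE t (a INT, b INT, c INT, d INT, payload TEXT)")
    con.execute("CREATE INDEX idx_abcd ON t (a, b, c, d)")
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE {where}"
    ).fetchall()
    con.close()
    # The parenthesised part of the description lists the index columns
    # that are actually used to narrow the search.
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print(access_path("a = 1 AND b = 22"))  # both a and b narrow the search
    print(access_path("a = 1 AND d = 44"))  # only a narrows the search
```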
More on creating indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Since model implies a make, there is little use to put both of those in the same composite index.
year is not very selective, and users might want a range of years. These make it difficult to get an effective index on year.
How many rows will you have? If it is millions, we need to work harder to avoid performance problems. I'm leaning toward this, but only because the lookup would be more compact.
We use single-column indexes when we want to query on just one column, as in your case, and multi-column (group) indexes when we have multiple conditions in the same WHERE clause.
Go for single indexing.
For a more detailed explanation, refer to this article: https://www.sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

How do you order the indexing columns in MySQL if you are using order by in your query?

I am reading an article about how Pinterest shards their MySQL database: https://medium.com/@Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f
And here they have an example of a table:
CREATE TABLE board_has_pins (
board_id INT,
pin_id INT,
sequence INT,
INDEX(board_id, pin_id, sequence)
) ENGINE=InnoDB;
And they are showing how they query from that table:
SELECT pin_id FROM board_has_pins
WHERE board_id=241294561224164665 ORDER BY sequence
LIMIT 50 OFFSET 150
What I don't understand here is the ordering of the index. Would it not make more sense if the index was like this since they are ordering by sequence and filtering by board_id?
INDEX(board_id, sequence, pin_id)
Am I missing something here or have I misunderstood how indexing works?
You are correct. The better index for this query is:
INDEX(board_id, sequence, pin_id)
The columns should be in this order:
1) Column(s) involved in equality comparisons. If there are multiple columns, their order does not matter.
2) Column(s) involved in the ORDER BY clause, in the same order they appear in the ORDER BY.
3) Other columns used to fetch values, like pin_id.
Once the equality conditions find the subset of matching rows, they are all tied with respect to their order, because naturally they all have the same value for the column of the equality condition (board_id in this case).
The tie is resolved by the order of the next column in the index. If (and only if) the next column is the one used in the ORDER BY clause, then the rows can be read in index order, with no further work needed to sort them.
I don't know the explanation for the Pinterest blog post you linked to. I guess it's a mistake, because the index is not optimal for the query they showed.
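The difference between the two column orders can be checked in a query plan. This is a rough sketch with assumptions spelled out: it runs against Python's built-in sqlite3 rather than MySQL, so the index is created with a separate CREATE INDEX statement (named idx here, a made-up name) and ENGINE=InnoDB is dropped. On MySQL you would compare the Extra column of EXPLAIN and look for "Using filesort".

```python
import sqlite3

# The query from the question, unchanged.
QUERY = (
    "SELECT pin_id FROM board_has_pins "
    "WHERE board_id=241294561224164665 ORDER BY sequence "
    "LIMIT 50 OFFSET 150"
)


def needs_sort(index_cols: str) -> bool:
    """Return True if QUERY needs a separate sort step with this index."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE board_has_pins (board_id INT, pin_id INT, sequence INT)"
    )
    con.execute(f"CREATE INDEX idx ON board_has_pins ({index_cols})")
    rows = con.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    con.close()
    # SQLite reports an explicit sort as a "TEMP B-TREE" step in the plan.
    return any("TEMP B-TREE" in row[-1] for row in rows)


if __name__ == "__main__":
    for cols in ("board_id, pin_id, sequence", "board_id, sequence, pin_id"):
        print(f"INDEX({cols}) needs a sort:", needs_sort(cols))
```

With the ORDER BY column directly after the equality column, the rows for one board can be read straight out of the index in sequence order, which also lets LIMIT stop early.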

How do I create one MySQL index for 2 SQL queries?

SELECT * FROM messages_messages WHERE (from_user_id=? AND to_user_id=?) OR (from_user_id=? AND to_user_id=?) ORDER BY created_at DESC
I have another query, which is this:
SELECT COUNT(*) FROM messages_messages WHERE from_user_id=? AND to_user_id=? AND read_at IS NULL
I want to index both of these queries, but I don't want to create 2 separate indexes.
Right now, I'm using 2 indexes:
[from_user_id, to_user_id, created_at]
[from_user_id, to_user_id, read_at]
I was wondering if I could do this with one index instead of 2?
These are the only 2 queries I have for this table.
The docs explain fairly completely how MySQL uses indices. In particular, its optimizer can use any left prefix of a multi-column index. Therefore, you could drop either of your two existing indices, and the other would be eligible for use in both queries, though it would be more selective / useful for one than for the other.
In principle, it could be more beneficial to keep your first index, provided that the created_at column was indexed in descending order. In practice, MySQL versions before 8.0 let you specify a sort order for each index column but implement only ascending order. Therefore, on those versions, having created_at in your index probably doesn't help very much.
No, you need both indexes for these two queries if you want to optimize fully.
Once you reach the column used for either sorting or range comparison (IS [NOT] NULL counts as a range predicate for this purpose), you don't get any benefit from putting more columns in the index. In other words, your index can have:
Some columns that are used in equality predicates
One column that is used either in a range predicate, or to avoid a filesort -- but not both.
Extra columns used in neither searching nor sorting, but only for the sake of a covering index.
So you cannot make a four-column index that serves both queries.
The only way you can reduce this to one index, as @JohnBollinger says, is to make an index that optimizes for one query, and uses a subset of the index for the second query. But that won't work as well.
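To see what each index gives up, you can compare plans for both queries against each index on its own. This is a sketch under several assumptions: it uses Python's built-in sqlite3 instead of MySQL, the id and body columns are invented, the first query is simplified to a single (from, to) pair without the OR, and literal user ids replace the placeholders.

```python
import sqlite3

# Simplified versions of the two queries from the question.
THREAD = (
    "SELECT * FROM messages_messages "
    "WHERE from_user_id = 1 AND to_user_id = 2 ORDER BY created_at DESC"
)
UNREAD = (
    "SELECT COUNT(*) FROM messages_messages "
    "WHERE from_user_id = 1 AND to_user_id = 2 AND read_at IS NULL"
)
BY_CREATED = "from_user_id, to_user_id, created_at"
BY_READ = "from_user_id, to_user_id, read_at"


def plan(index_cols: str, sql: str) -> str:
    """Return the plan text for sql when only the given index exists."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE messages_messages (id INTEGER PRIMARY KEY, "
        "from_user_id INT, to_user_id INT, body TEXT, "
        "created_at TEXT, read_at TEXT)"
    )
    con.execute(f"CREATE INDEX idx ON messages_messages ({index_cols})")
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    con.close()
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    for cols in (BY_CREATED, BY_READ):
        print(f"index ({cols})")
        print("  thread:", plan(cols, THREAD))
        print("  unread:", plan(cols, UNREAD))
```

In the output, a "TEMP B-TREE" step means the thread query had to sort, and the columns listed in parentheses show whether read_at took part in the index lookup for the unread count.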

Performance - MySQL default sorting order on inserting records

The default ordering of record IDs in MySQL is ASC (i.e. rows that I insert go to the bottom of the table), but we'll be using only the latest information from the table (i.e. the rows at the bottom).
Will there be any performance improvement if we change the default ordering to DESC (i.e. new records go to the top of the table), so that frequently used information is queried from the top of the table?
I think it would be the opposite.
I'm basing this comment on how I understand indexes to work in SQL Server; I'll try to revise it later if I get a chance to read up more on how they work in MySQL.
There could be a slight performance advantage to insert your rows in the same order as your index is sorted, versus inserting them in the opposite order.
If you insert in the same order, and your next row to insert is always greater in sort order than existing rows then you will always find the next available empty spot (when one exists) in your last page of rows data.
If you do the opposite, always have your next insert row lesser in sort order than existing rows then you will probably always have a collision in your first page of rows data and the engine will do a tad bit more work to shift the position of rows if the page has room for it.
As for your order by clause in the select statement:
1) There's nothing in the SQL standard about indexes, and nothing guarantees your result set ordering except the ORDER BY clause. Normally, queries in SQL Server that use just one index will return results in the order of the index. But if the isolation level changes to "read uncommitted" (chaos?), then it will more likely return rows in the order it finds them in memory or on disk, which is not necessarily the order you want.
2) If the order by in your select statement is based on the exact same column criteria as the index, then your database server should perform the same with either the index order, or the opposite of the index order. This is pretty straightforward except perhaps if you have a multi-column index with mixed ASC-DESC declarations for different columns. You get away with equal performance with order by equal to index order and with order by equal to inverse index order where the inverse index order is determined by substituting the ASC and DESC declarations (explicit and implicit) in the index declaration with DESC and ASC in the order by clause.
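Point 2 can be illustrated with a query plan: an index can be walked backwards, so the exact inverse of its order is as cheap as its forward order, while a mixed direction is not. This is a sketch only, using Python's built-in sqlite3 as a stand-in engine; the table t, its columns and the index name idx_ab are hypothetical.

```python
import sqlite3


def needs_sort(order_by: str) -> bool:
    """Return True if this ORDER BY needs a sort despite the (a ASC, b ASC) index."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INT, b INT, payload TEXT)")
    con.execute("CREATE INDEX idx_ab ON t (a ASC, b ASC)")
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT a, b FROM t ORDER BY {order_by}"
    ).fetchall()
    con.close()
    # Any full or partial sort shows up as a "TEMP B-TREE" step.
    return any("TEMP B-TREE" in row[-1] for row in rows)


if __name__ == "__main__":
    for order in ("a ASC, b ASC", "a DESC, b DESC", "a ASC, b DESC"):
        print(f"ORDER BY {order} needs a sort:", needs_sort(order))
```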
Any performance change would be on querying the records, not inserting one.
For queries, I doubt this will have much effect, as database lookups by key usually have similar speeds.
It also depends on your data so I would run some tests.

(Why) Can't MySQL use index in such cases?

1 - PRIMARY used in a secondary index, e.g. secondary index on (PRIMARY,column1)
2 - I'm aware mysql cannot continue using the rest of an index as soon as one part was used for a range scan, however: IN (...,...,...) is not considered a range, is it? Yes, it is a range, but I've read on mysqlperformanceblog.com that IN behaves differently than BETWEEN according to the use of index.
Could anyone confirm those two points? Or tell me why this is not possible? Or how it could be possible?
UPDATE:
Links:
http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/
http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/comment-page-1/#comment-952521
UPDATE 2: example of nested SELECT:
SELECT * FROM user_d1 uo
WHERE EXISTS (
SELECT 1 FROM `user_d1` ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
AND ui.id=uo.id
)
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
So, the outer SELECT uses timestamp_lastonline for sorting, the inner either PK to connect with the outer or birthdate for filtering.
What other options rather than this query are there if MySQL cannot use index on a range scan and for sorting?
The column(s) of the primary key can certainly be used in a secondary index, but it's not often worthwhile. The primary key guarantees uniqueness, so any columns listed after it cannot be used for range lookups. The only time it will help is when a query can use the index alone.
As for your nested select, the extra complication should not beat the simplest query:
SELECT * FROM user_d1 uo
WHERE uo.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
MySQL will choose between a birthdate index or a timestamp_lastonline index based on which it feels will have the best chance of scanning fewer rows. In either case, the column should be the first one in the index. The birthdate index will also carry a sorting penalty, but might be worthwhile if a large number of recent users will have birth dates outside of that range.
If you wish to control the order, or potentially improve performance, a (timestamp_lastonline, birthdate) or (birthdate, timestamp_lastonline) index might help. If it doesn't, and you really need to select based on the birthdate first, then you should select from the inner query instead of filtering on it:
SELECT * FROM (
SELECT * FROM user_d1 ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
) as uo
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
Even then, MySQL's optimizer might choose to rewrite your query if it finds a timestamp_lastonline index but no birthdate index.
And yes, IN (..., ..., ...) behaves differently than BETWEEN. Only the latter can effectively use a range scan over an index; the former would look up each item individually.
2. IN will obviously differ from BETWEEN. If you have an index on that column, BETWEEN needs to find the starting point and then it's all done. With IN, it will look up each value in the index one by one, so it performs as many lookups as there are values, compared to BETWEEN's single lookup.
Yes, @Andrius_Naruševičius is right: the IN statement is merely shorthand for EQUALS OR EQUALS OR EQUALS and has no inherent order whatsoever, whereas BETWEEN is a comparison operator with an implicit greater-than or less-than, and therefore absolutely loves indexes.
I honestly have no idea what you are talking about, but it does seem you are asking a good question; I just have no notion what it is :-). Are you saying that a primary key cannot contain a second index? Because it absolutely can. The primary key never needs to be indexed explicitly because it is ALWAYS indexed automatically, so if you are getting an error or warning (I assume you are?) about supplementary indices, then it's not the second or third index causing it; it's the PRIMARY KEY not needing one, and your mentioning it is probably the error. Having said that, I have no idea what question you asked; this is my answer to my best guess at your actual question.