Does the order of columns in a query matter? - mysql

When selecting columns from a MySQL table, is performance affected by the order that you select the columns as compared to their order in the table (not considering indexes that may cover the columns)?
For example, you have a table with rows uid, name, bday, and you have the following query.
SELECT uid, name, bday FROM table
Does MySQL see the following query any differently and thus cause any sort of performance hit?
SELECT uid, bday, name FROM table

The order doesn't matter, actually, so you are free to order them however you'd like.
edit: I guess a bit more background is helpful: As far as I know, the process of optimizing any query happens prior to determining exactly what subset of the row data is being pulled. So the query optimizer breaks it down into first what table to look at, joins to perform, indexes to use, aggregates to apply, etc., and then retrieves that dataset. The column ordering happens between the data pull and the formation of the result set, so the data actually "arrives" as ordered by the database, and is then reordered as it is returned to your application.

In practice, I suspect it might.
With a decent query optimiser: it shouldn't.
You can only tell for your cases by measuring. And the measurements will likely change as the distribution of data changes in the database.
with regards
Wazzy

The order of the attributes selected is negligible. The underlying storage engines surely order their attribute locations, but you would not necessarily have a way to know the specific ordering (renames, alter tables, row vs. column stores) in most cases may be independent from the table description which is just meta data anyway. The order of presentation into the result set would be insignificant in terms of any measurable overhead.

Related

Will records order change between two identical query in mysql without order by

The problem is I need to do pagination.I want to use order by and limit.But my colleague told me mysql will return records in the same order,and since this job doesn't care in which order the records are shown,so we don't need order by.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quick as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order than they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal, it's hard to code and requieres constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator in the assumption that SQL data sets are ordered sets because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the same because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.

Indeterminate sort ordering - MySQL vs Postgres

I have the following query
SELECT *
FROM table_1
INNER JOIN table_2 ON table_1.orders = table_2.orders
ORDER BY table_2.purchasetime;
The above query result is indeterminate i.e it can change with different queries when the purchase time is of same value as per the MySQL manual itself.To overcome this we give sort ordering on a unique column and combine it with the regular sort ordering.
The customer does not want to see different results with different page refreshes so we have put in the above fix specifically for MySQL which is unnecessary and needs extra compound indexes for both asc and desc.
I am not sure whether the same is applicable for postgres.So far I have not been able to reproduce the scenario.I would appreciate if someone could answer this for postgres or point me in the right direction.
Edit 1 : The sort column is indexed.So assuming the disk data has no ordering, but in the case of index (btree data structure) a constant ordering might be possible with postgres ?
No, it will not be different in PostgreSQL (or, in fact, in any other relational database that I know of).
See http://www.postgresql.org/docs/9.4/static/queries-order.html :
After a query has produced an output table (after the select list has been processed) it can optionally be sorted. If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.
Even if by accident you manage to find a PostgreSQL version and index that will guarantee the order in all the test you run, please don't rely on it. Any database upgrade, data change or a change in the Maya calendar or the phase of the moon can suddenly upset your sorting order. And debugging it then is a true and terrible pain in the neck.
Your concern seems to be that order by table_2.purchasetime is indeterminate when there are multiple rows with the same value.
To fix this -- in any database or really any computer language -- you need a stable sort. You can turn any sort into a stable sort by adding a unique key. So, adding a unique column (typically an id of some sort) fixes this in both MySQL and Postgres (and any other database).
I should note that instability in sorts can be a very subtle problem, one that only shows up under certain circumstances. So, you could run the same query many times and it is fine. Then you insert or delete a record (perhaps even one not chosen by the query) and the order changes.

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays user a set of data, a part of it actually. It also allows user to order it by a custom field. So in the end it all comes down to query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine, as long as the size of table is relatively small (used only locally in our department). But now we have to scale this application. And let's assume, the table has about a million of records (we expect that to happen soon). What will happen with ordering? Do I understand correctly, that in order to do this query, MySQL will have to sort a million records each time and give a part of it? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and don't let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order, I believe the indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:
....
The key used to
fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like mysql will have a problem with such a query? So, what do I do - don't use an order part at all?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field
The latter you can improve by adding an index on name
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
Keep in mind though that there is no such a thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower, usually the effect is negligible but only testing will show
the index will require extra space in de database, think of it as an additional (hidden) special table sitting next to your actual data. The index will only hold the fields required + the PK of the originating table, which usually is a lot less data then the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK fields from the index first and then go look for the other fields in the actual table by means of the PK. This probably is still (a lot) faster than when not having the index, but keep this in mind when doing something like SELECT * FROM ... : do you really need all the fields?
In the example you use active and name but from the text I get that these might be 'dynamic' in which case you'd have to foresee all kinds of combinations. From a practical point this might not be feasible as each index will come with the downsides of above and each time you add an index you'll add supra to that list again (cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
Explain your query, and check, whether it goes for filesort,
If Order By doesnt get any index or if MYSQL optimizer prefers to avoid the existing index(es) for sorting, it goes with filesort.
Now, If you're getting filesort, then you should preferably either avoid ORDER BY or you should create appropriate index(es).
if the data is small enough, it does operations in Memory else it goes on the disk.
so you may try and change the variable < sort_buffer_size > as well.
there are always tradeoffs, one way to improve the preformance of order query is to set the buffersize and then the run the order by query which improvises the performance of the query
set sort_buffer_size=100000;
<>
If this size is further increased then the performance will start decreasing

What drive the natural result order for an unordered MySQL request

How does mysql return lines when there is no ORDER BY in the request?
What drives the natural order?
There can obviously be many different queries but let's say a simple
select column from table where date < NOW()
There is no natural predictable order when you don't specify one.
Be very careful with this. For all SQL there is no defined implied order. Never count on this. Even if you see a specific behavior at a point in time, that could change in a future release or even with the adding of an index. If you are expecting an order and counting on it, the specify it explicitly.
Problem is that "natural order" of results is often affected completely or partly by the access plan the DB engine uses. For instance, if you do a group by FieldA there is a good chance (not a guarantee) that the results will come back in FieldA sequence. If you do a very simple select chances are the results will be in the sequence they are stored in the database, which may or may not be the order of the IDs or the primary key. IF you don't specify the order it is giving the DB engine the option to do whatever is most convenient for it at the time based on how it got the results. So really does become unpredictable and open to change.
Wish I could explain better, but trying to convey the real randomness of the process form an observer viewpoint.
If the query is using an index, it will prefer the ordering of that index. Group by forces an ordering. This is why combining group by and order can have a performance penalty.
In your case, if you have an index on date, it will probably order by that, hard to say how it handles tie breaks though. For more information, as usual explain the query.
Of course there's a caveat to ordering on the index used as well. If the index is on an autoincremented field and the data was added with prespecified ids, you may find it prefers the order the data was added in.

Composite Primary and Cardinality

I have some questions on Composite Primary Keys and the cardinality of the columns. I searched the web, but did not find any definitive answer, so I am trying again. The questions are:
Context: Large (50M - 500M rows) OLAP Prep tables, not NOSQL, not Columnar. MySQL and DB2
1) Does the order of keys in a PK matter?
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Thanks in advance.
You've got "MySQL and DB2". This answer is for DB2, MySQL has none of this.
Yes, of course that is logical, but the optimiser takes much more than just that into account.
Generally, the order of the columns in the WHERE clause (join) do not (and should not) matter.
However, there are two items related to the order of predicates which may be the reason for your question.
What does matter, is the order of the columns in the index, against which the WHERE clause is processed. Yes, there it is best to specify the columns in the order of highest cardinality to lowest. That allows the optimiser to target a smaller range of rows.
And along those lines do not bother implementing indices for single-column, low cardinality columns (there are useless). If the index is correct, then it will be used more often.
.
The order of tables being joined (not columns in the join) matters very much, it is probably the most important consideration. In fact Join Transitive Closure is automatic, and the optimiser evaluates all possible join orders, and chooses what it thinks is the best, based on Statistics (which is why UPDATE STATS is so important).
Regardless of the no of rows in the tables, if you are joining 100 rows from table_A on a bad index with 1,000,000 rows in table_B on a good index, you want the order A:B, not B:A. If you are getting less than the max IOPS, you may want to do something about it.
The correct sequence of steps is, no surprise:
check that the index is correct as per (1). Do not just add another index, correct the ones you have.
check that update stats is being executed regularly
always try the default operation of the optimiser first. Set stats on and measure I/Os. Use representative sets of values (that the user will use in production).
check the shoowplan, to ensure that the code is correct. Of course that will also identify the join order chosen.
if the performance is not good enough, and you believe that the the join order chosen by the optimiser for those sets of values is sub-optimal, SET JTC OFF (syntax depends on your version of DB2), then specify the order that you want in the WHERE clause. Measure I/Os. Use representative sets
form an opinion. Choose whichever is better performance overall. Never tune for single queries.
1) Does the order of keys in a PK matter?
Yes, it changes the order of the record for the index that is used to police the PRIMARY KEY.
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
For select queries, this totally depends on the queries you are going to use. If you are searching for all three columns at once, the order is not important; if you are searching for two or one columns, they should be leading in the index.
For inserts, it is better to make the leading column match the order in which the records are inserted.
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Again, this depends on the WHERE clause.