Is it safe to assume when reading a MySQL table line by line from an application that the table will always read from top to bottom, one after the other in perfect sequential order.
E.G. If a table is ordered by a unique ID and I read it in via C++ one line at a time. Is it safe to assume I will get each line in exact unique ID order every time.
My gut feeling is that it is not a safe assumption but I have no technical reasoning for that.
My testing has always shown that it does provide table rows in order but it makes me nervous relying on it. As a result I write programs such that they do not depend on this assumption which makes them a little more complicated and a little less efficient.
Thanks
C
If you use the following query you should have no issues,
SELECT columns
FROM tables
WHERE predicates
ORDER BY column ASC/DESC;
If you are using the standard C++ MySQL connector, then according to the reference manual, "The preview version does buffer all result sets on the client to support cursors".
It has also generally been my experience that your result sets are buffered, and therefore do not change when the underlying table changes.
You are right. It is not safe to assume when reading a MySQL table line by line from an application that the table will always be read from top to bottom, one after the other in perfect sequential order. It is not safe to assume that if a table is ordered by a unique ID and I read it in (via C++ or otherwise) one line at a time, you will get each line in exact unique ID order every time.
There is no guarantee for that, on any RDBMS. No one should rely on that assumption.
Rows have no (Read: should not have) intrinsic or default order in relational tables. Tables (relations) are, by definition, unordered sets or rows.
What gives this impression is that most systems, when asked to return a result for a query like:
SELECT columns
FROM table
they retrieve all the rows from the disk, reading the whole file. So, they return the rows (usually) in the order they were stored in the file or in the order of the clustered key (e.g. in InnoDB tables in MySQL). So, they return the result with the same order every time.
If there are multiple tables in the FROM clause or if there are WHERE conditions, it's a whole different situation, as not the whole tables are read, different indexes may be used, so the system may not read the tables files but just the index files. Or read a small part of the tables.
It's also a different story if you have partitioned tables or distributed databases.
Conclusion is that you should have an ORDER BY part in your queries, if you want to guarantee the same order every time.
Related
The problem is I need to do pagination.I want to use order by and limit.But my colleague told me mysql will return records in the same order,and since this job doesn't care in which order the records are shown,so we don't need order by.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quick as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order than they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal, it's hard to code and requieres constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator in the assumption that SQL data sets are ordered sets because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the same because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.
I'm accessing the database (Predominately MS SQL Server, Postgre) through ORM and defining attributes (like whether the field/column should have an index) via code.
I'm thinking that if a column will be ordered via ORDER BY, it should have an index, otherwise full table scan will be required every time (e.g. if you want to get top 5 records ordered by date).
As I'm defining these indexes in code (on Entity Framework POCO entities, as .NET attributes), I can access these metadata at runtime. When displaying the data in a grid, I'm planning to make only those columns sortable (by clicking on column header) that have an index attribute. Is my thinking correct, or maybe there exist some reasonable conditions where sorting can be desirable on non-indexed column, or vice-versa (indexed column sorting would not make much sense?..)
In short, is it good to assume that only those columns should be sortable in UI, that have corresponding index applied at the database level?
Or, to phrase more generic question: Should columns that will be ordered always have some sort of index?
Whether you need an index depends on how often you query the ordered sequence compared to how often you make changes that could influence the ordered sequence.
Every time you make changes that influence the ordered sequence your database has to reorder the ordered index. So if you will considerably make more changes than queries then the index will be ordered more often than the result of the ordering will be used.
Furthermore it depends on who is willing to wait for the result: the one who makes changes that requires a re-index, or the one who does the queries.
I wouldn't be surprised if the index is ordered by a separate process after the change has been made. If the query is done while the ordering is not finished, the database will need to first finish enough of the ordering before the query can return.
On the other hand, if a new change is made while the ordering that was needed because of an earlier change was not finished, the database probably will not finish the previous ordering, but start ordering the new situation.
So I guess it is not mandatory to have an ordered index for every query. To order every possible column-combination will be too much work, but if quite often a certain ordering is requested by a process that is waiting for the results, it might be wise to create the ordered index.
order by doesn't mandate index on a column but if isn't indexed then it will end up doing a file sort than index sort and thus it's always preferred to have those column indexed if you are intended to use them in WHERE / JOIN ON / HAVING / ORDER BY.
You can generate the query execution plan and see the differences between the versions (indexed over non-indexed)
Kudos to #Harald Coppoolse for a thorough answer - there's something else which you should know about sorting on the DB, and that it is preferred to be done at the app level. See item number 2 in the following list: https://www.brentozar.com/archive/2013/02/7-things-developers-should-know-about-sql-server/
I have the following query
SELECT *
FROM table_1
INNER JOIN table_2 ON table_1.orders = table_2.orders
ORDER BY table_2.purchasetime;
The above query result is indeterminate i.e it can change with different queries when the purchase time is of same value as per the MySQL manual itself.To overcome this we give sort ordering on a unique column and combine it with the regular sort ordering.
The customer does not want to see different results with different page refreshes so we have put in the above fix specifically for MySQL which is unnecessary and needs extra compound indexes for both asc and desc.
I am not sure whether the same is applicable for postgres.So far I have not been able to reproduce the scenario.I would appreciate if someone could answer this for postgres or point me in the right direction.
Edit 1 : The sort column is indexed.So assuming the disk data has no ordering, but in the case of index (btree data structure) a constant ordering might be possible with postgres ?
No, it will not be different in PostgreSQL (or, in fact, in any other relational database that I know of).
See http://www.postgresql.org/docs/9.4/static/queries-order.html :
After a query has produced an output table (after the select list has been processed) it can optionally be sorted. If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.
Even if by accident you manage to find a PostgreSQL version and index that will guarantee the order in all the test you run, please don't rely on it. Any database upgrade, data change or a change in the Maya calendar or the phase of the moon can suddenly upset your sorting order. And debugging it then is a true and terrible pain in the neck.
Your concern seems to be that order by table_2.purchasetime is indeterminate when there are multiple rows with the same value.
To fix this -- in any database or really any computer language -- you need a stable sort. You can turn any sort into a stable sort by adding a unique key. So, adding a unique column (typically an id of some sort) fixes this in both MySQL and Postgres (and any other database).
I should note that instability in sorts can be a very subtle problem, one that only shows up under certain circumstances. So, you could run the same query many times and it is fine. Then you insert or delete a record (perhaps even one not chosen by the query) and the order changes.
Greeting,
My question; Whether or no sql query (SELECT) continues or stops reading data (records) from table when find the value that I was looking for?
referance: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
link for referance: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: In general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one outcome. But then again, if you have a sort by clause, the database engine cannot stop after the first hit, there might be next ones that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, a block can contain records that are not needed by the query at hand. If all the columns needed by the query is available in an index, the RDBMS won't even read the data file, it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns while running the same query several times on a couple of minutes apart.
Yes it scans the entire file. Unless you put something like
select * from user where id=100 limit 1
This of course will still search entire rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and searching would be optimized
I'm sorry... I thought the table.
I will change question and I will explain it in the following image;
I understand that in CASE 1 all columns must be read with each iteration.
My question is: If it's the same in the CASE 2 or columns that are not selected in the query are excluded from reading in each iteration.
Also, are the both queries are the some in performance perspective?
Clarify:
CASE: 1 In first CASE select print all data
CASE: 2 In second CASE select print columns first_name and last_name
Whether in CASE 2 mysql server (SQL query) reads only columns first_name, last_name or read the entire table to get that data(rows)=(first_name, last_name)?
An interest of me how the server reads table row in CASE 1 and CASE 2?
This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 9 years ago.
Basically what's the difference in terms of security and speed in these 2 queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * increase the benchmark on my query and perform slower than static rows?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there is alot of unneeded columns in your query, for example, when you join several tables.
Ofc, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is not a single, simple answer, and your literal question cannot fully be answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually talking about a specific query against a specific table, but rather about selecting columns explicitly or using the *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed.
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key, for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort to read in all of the row data... and this, some cases, is a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same think you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.