Is selecting fewer columns speeding up my query? - mysql

I have seen several questions comparing select * to select by all columns explicitly, but what about fewer columns selected vs more.
In other words, is:
SELECT id,firstname,lastname,lastlogin,email,phone
More than negligibly faster than:
SELECT id,firstname,lastlogin
I realize there will be small differences for more data being transferred through the system and to the application, but this is a total data/load difference, not a cost of the query (larger data in the cells would have the same effect anyway I believe) - I'm only trying to optimize my query, as I will have to load ALL the data at some point anyway...
When my admin user logs in, I'm going to load the entire user database into a cache, but I can either query only critical data upfront to shave some execution time, or just get everything - if it works out roughly the same. I know more rows equals longer query execution - but what about more selected values in my query?

Under most circumstances, the only difference is going to be slightly larger data for these fields and the additional time to fetch them.
There are two things to consider:
If the additional fields are very big, then this could be a big difference in performance.
If there is an index that covers the columns you actually want, then the index can be used for the query. This could speed the query in the database.
In general, though, the advice is to return the columns you want to the application. If there is complex processing, you should consider doing that in the database rather than the application.

Related

Is it faster to only query specific columns?

I've heard that it is faster to select colums manually ("col1, col2, col3, etc") instead of querying them all with "*".
But what if I don't even want to query all columns of a table? Would it be faster to query, for Example, only "col1, col2" insteaf of "col1, col2, col3, col4"?
From my understanding SQL has to search through all of the columns anyway, and just the return-result changes. I'd like to know if I can achieve a gain in performance by only choosing the right columns.
(I'm doing this anyway, but a backend API of one of my applications returns more often than not all columns, so I'm thinking about letting the user manually select the columns he want)
In general, reducing the number of columns in the select is a minor optimization. It means that less data is being returned from the database server to the application calling the server. Less data is usually faster.
Under most circumstances, this a minor improvement. There are some cases where the improvement can be more important:
If a covering index is available for the query, so the index satisfies the query without having to access data pages.
If some fields are very long, so records occupy multiple pages.
If the volume of data being retrieved is a small fraction (think < 10%) of the overall data in each record.
Listing the columns individually is a good idea, because it protects code from changes in underlying schema. For instance, if the name of a column is changed, then a query that lists columns explicitly will break with an easy-to-understand error. This is better than a query that runs and produces erroneous results.
You should try not to use select *.
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
I got this from this answer.
I believe this topic has already been covered here:
select * vs select column
I believe it covers your concerns as well. Please take a look.
All the column labels and values occupy some space. Sending them to the issuer of the request instead of a subset of the columns means sending more data. More data is sent slower.
If you have columns, like
id, username, password, email, bio, url
and you want to get only the username and password, then
select username, password ...
is quicker than
select * ...
because id, email, bio and url are sent as well for the latter, which makes the response larger. But the main problem with select * is different. It might be the source of inconsistencies if, for some reason the order of the columns changed. Also, it might retrieve data you do not want to retrieve. It is always better to have a whitelist with the columns you actually want to retrieve.

Database Optimisation through denormalization and smaller rows

Does tables with many columns take more time than the tables with less columns during SELECT or UPDATE query? (row count is same and I will update/select same number of columns in both cases)
example: I have a database to store user details and to store their last active time-stamp. In my website, I only need to show active users and their names.
Say, one table named userinfo has the following columns: (id,f_name,l_name,email,mobile,verified_status). Is it a good idea to store last active time also in the same table? Or its better to make a separate table(say, user_active) to store the last activity timestamp?
The reason I am asking, If I make two tables, userinfo table will only be accessed during new signups(to INSERT new user row) and I will use user_active table (table with less columns) to UPADATE timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication as user_active table columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question. The structure of your second table is not correct. You would not (normally) include fname in two tables -- that is data redundancy and causes all sort of other problems. There is a legitimate question whether you should store a table of all activity and use that table for the display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL utilizes a binary search so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).

Does SELECT * really take more time than selecting only the needed columns?

Will it make a discernible difference in the time a website page loads? On average, my tables have 10 columns, if I just need 3 of those columns, should I just call those in the query to make it faster?
Will it make a discernable difference. Probably not under most circumstances. Here are some cases where it would possibly make a big difference:
The 7 unneeded columns are really, really big.
You are returning lots and lots of rows.
You have a big table, are getting many rows, and an index is available on the 3 columns but not the 10.
But, there are other reasons not to use *:
It will replace the columns based on the order of the columns in the database at the time the query is compiled. This can cause problems if the structure of the table changes.
If a column name changes or is removed, your query would work and subsequent code might break. If you explicitly list the columns, then the query will break, making the problem easier to spot.
Typing three column names shouldn't be a big deal. Explicitly listing the columns makes the code more informative.
Let's say you had a table with 1000 columns, and you only needed 3.
What do you think would run faster and why?
This: SELECT * FROM table_name;
or this:SELECT col1, col2, col3, FROM table_name;
When you are using * you are now holding that entire selection (big or small) in memory. The bigger the selection...the more memory its going to use/need.
So even though your table isn't necessarily big, I would still only select the data that you actually need. You might not even notice a difference in speed, but it will definately be faster.
Yes if you only need a handful of columns, only select those. Below are some reasons:
THE MOST OBVIOUS: Extra data needs to be sent back making for larger packets to transmit (or pipe via local socket). This will increase overall latency. This might not seem like much for 1 or 2 rows, but wait until you've got 100 or 1000 rows... 7 extra columns of data will significanly affect overall transit latency expecially if you end up having the result set having to be broken into more TCP packets for transmit. This might not be such an issue if you're hitting a localhost socket, but move your DB to a server across a network, to another datacenter, etc... and the impact will be plain as day!
With the MySQL query cache enabled, storing unneeded data in result sets will increase your over cache space needs--larger query caches can suffer performance hits.
A HUGE HIT CAN BE: If you need only columns that are part of a covering index, doing a select * will require follow up point lookups for the remaining data fields in the main table rather than just use the data from the index table.
Yes you should.
Using named columns in select is a best practice working with database for multiple reasons.
Only the needed data travel from the database to the application server reducing cpu, memory and disk usage.
It helps detecting coding errors and structure changes.
There are only a few cases when using select * is a good idea, in all the other queries do yourself a favour and use the column names.
Yes definitely. * will get replaced with all the column names. After that only the execution starts. For example if there are 3 columns a, b, c in a table.. select a, b, c directly starts execution where as select * starts transforming the query into select a, b, c after that only the execution stats.
The short of it is yes, if you are returning more data it will take longer. This may be a very very very tiny amount of time but yes it will take longer. As stated above select * can be dangerous in a production situation where you may not be the one designing/implementing the database. If you assume that columns are returned in a particular order or the database structure is of a particular type and then the DBA goes in and makes some kind of a change and does not inform you, you may have an issue with your code.
The difference is very minimal, but there is a slight difference, I think it really depends on several factors for which is faster.
1) How many columns are in the table?
2) How many columns do you actually need to grab?
3) How many records are you grabbing?
In your case, based on what you said of having 10 columns and only needing 3 of those columns, I doubt it'll make a difference if you use 'Select *' or not, unless perhaps you're grabbing tens of thousands of records. But in more extreme cases with a lot more columns involved I have found 'Select *' to be slightly faster, but that might not be true in all cases.
I once did some speed tests in a SQLite table with over 150 columns, where I needed to grab only about 40 of the columns, and I needed all 20,000+ records. The speed differences were very minimal (we're talking 20 to 40 milliseconds difference), but it was actually faster to grab the data from All the columns with a 'SELECT ALL *', rather than going 'Select All Field1, Field2, etc'.
I assume the more records and columns in your table, the greater the speed difference this example will net you. But if you only needed 3 columns in a gigantic table I'd guess that just grabbing those 3 columns would be faster.
Bottom line though, if you really care about the minimal speed differences between 'Select *' and 'Select field1, field2, etc', then do some speed tests.

What's wrong with using a wildcard in your mySQL query? [duplicate]

This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 9 years ago.
Basically what's the difference in terms of security and speed in these 2 queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * increase the benchmark on my query and perform slower than static rows?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there is alot of unneeded columns in your query, for example, when you join several tables.
Ofc, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is not a single, simple answer, and your literal question cannot fully be answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually talking about a specific query against a specific table, but rather about selecting columns explicitly or using the *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed.
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key, for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort to read in all of the row data... and this, some cases, is a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same think you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.

mysql performance comparison

Which is more efficient,and by how much?
type 1:
insert into table_name(column1,column2..) select column1,column2 ... from another_table where
columnX in (value_list)
type 2:
insert into table_name(column1,column2..) values (column1_0,column2_0..),(column1_1,column2_1..)
The first edition looks short,and the second may become extremely long,when value_list contains,say 500 or even more values.
But I've no idea about whose performance will be better,though feels the first should be more efficient,intuitively.
The first is cleaner, especially if your columns are already in mysql (which I'm assuming you are saying?). You would save some time in network overhead sending data, and parsing time, and have to worry less about hitting whatever query size limit your client has.
However, in general, I would expect the performance to be similar as the number rows grows larger, especially on a well-indexed table. Most of the time for inserts w/ large queries is spent doing things like building indexes (see here), and both those queries, absent turning indexes off, would have to do that.
I agree with Todd, the first query is cleaner and will be faster to send to the MySQL server and faster to compile. And it's probably true that as the number of inserted records increases, the speed differential will drop.
But the first form has substantial other benefits to consider:
It's far easier to maintain: you only have to add or modify a field every now and then.
You avoid the expense of querying another_table and processing the results to concatenate the second query (a hidden cost of that approach).
If you need to run this update more than once, the first query can be cached in the MySQL server along with its compiled form and query plan. This makes subsequent invocations of the query run a bit faster.