This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 8 years ago.
Is it okay to always use SELECT * even if you only need one column when retrieving data from MySQL? Does it affect the speed of the query or the speed of the system? Thanks.
No, it is not always okay.
But it is also not always a problem.
In order of performance impact:
If you only select a subset of columns, it can positively affect the access path. Maybe those columns can be read from an index without touching the table at all.
Beyond that, there is also raw network I/O. Sending three columns uses a lot less bandwidth than sending three hundred (especially for many rows).
Beyond that, there is also the memory required by your client application to process the result set.
I believe the columns in the select are the least time/CPU intensive piece of the query. Limiting the number of rows, either by "WHERE" clauses or explicitly using "LIMIT" is where time is saved.
In my personal experience you should prefer named columns over SELECT * whenever possible.
But, performance is not the key reason.
Code that uses SELECT * is usually harder to read and debug as it is not explicit what the intent of the query is.
Code that uses SELECT * can break when the database structure is changed (referring to columns by index rather than by name is almost always the wrong way to write your code).
Finally, retrieving bigger datasets does affect speed, bandwidth and memory consumption, and that's never advisable if it can easily be avoided.
As far as performance is concerned, JOINs and row-count are more likely to slow query performance than the difference in selected columns, but inefficiencies have a habit of compounding later on in projects. ie. You may have no performance issues with a test-bed application but when things scale, or data is accessible only over restricted bandwidth of a network that's when you'll be pleased you wrote explicit SELECTs to start with.
Note that if you're just writing a one-off query to check some data I wouldn't worry, but if you're writing a query for a codebase that might be executed often, it pays to write good queries and, when necessary consider Stored Procedures.
I have database which contains huge number of tables, stored procedure. So,
how can i get specific objects like table, stored procedure in a single query for specific database.
SELECT
[schema] = s.name,
[object] = o.name,
o.type_desc
FROM sys.objects AS o
INNER JOIN sys.schemas AS s
ON o.[schema_id] = s.[schema_id]
WHERE o.[type] IN ('P','U');
Some other answers you'll find on this or other sites might suggest some or all of the following:
sysobjects - stay away, this is a backward compatibility view that has been deprecated, and shouldn't be used in any version > SQL Server 2000. See a thorough but not exhaustive replacement map here.
built-in functions like OBJECT_NAME(), SCHEMA_NAME() and OBJECT_SCHEMA_NAME() - I've recommended these myself over the years, until I realized they are blocking functions and don't observe the transaction's isolation semantics. So if you want to grab this information under read uncommitted while there are underlying changes happening, you can't, and you'll have to wait. Which may be what you want to do, but not always.
INFORMATION_SCHEMA - these views are there to satisfy the standards, but aren't complete, are warned to be inaccurate, and aren't updated to reflect new features (I blogged about several specific problems here). So for very basic information (or when you need to write cross-platform metadata code), they may be ok, but in almost all cases I suggest just always using a method you can trust instead of picking and choosing.
Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?
I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?
There are really three major reasons:
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
But it's not all bad for SELECT *. I use it liberally for these use cases:
Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying column names are. This gets to be a bigger "plus" the longer the column names get.
When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:
SELECT COUNT(*) FROM table;
in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.
Same goes with this type of query:
SELECT a.ID FROM TableA a
WHERE EXISTS (
SELECT *
FROM TableB b
WHERE b.ID = a.B_ID);
in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * signifies. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)
The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.
Performance
The * shorthand can be slower because:
Not all the fields are indexed, forcing a full table scan - less efficient
What you save to send SELECT * over the wire risks a full table scan
Returning more data than is needed
Returning trailing columns using variable length data type can result in search overhead
Maintenance
When using SELECT *:
Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable, minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
If code depends on column order, SELECT * will hide an error waiting to happen if a table had its column order changed.
Even if you need every column at the time the query is written, that might not be the case in the future
the usage complicates profiling
Design
SELECT * is an anti-pattern:
The purpose of the query is less obvious; the columns used by the application is opaque
It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.
When Should "SELECT *" Be Used?
It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.
Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.
Even if you wanted to select every column now, you might not want to select every column after someone adds one or more new columns. If you write the query with SELECT * you are taking the risk that at some point someone might add a column of text which makes your query run more slowly even though you don't actually need that column.
Wouldn't it mean less code to change if you added a new column you wanted?
The chances are that if you actually want to use the new column then you will have to make quite a lot other changes to your code anyway. You're only saving , new_column - just a few characters of typing.
If you really want every column, I haven't seen a performance difference between select (*) and naming the columns. The driver to name the columns might be simply to be explicit about what columns you expect to see in your code.
Often though, you don't want every column and the select(*) can result in unnecessary work for the database server and unnecessary information having to be passed over the network. It's unlikely to cause a noticeable problem unless the system is heavily utilised or the network connectivity is slow.
If you name the columns in a SELECT statement, they will be returned in the order specified, and may thus safely be referenced by numerical index. If you use "SELECT *", you may end up receiving the columns in arbitrary sequence, and thus can only safely use the columns by name. Unless you know in advance what you'll be wanting to do with any new column that gets added to the database, the most probable correct action is to ignore it. If you're going to be ignoring any new columns that get added to the database, there is no benefit whatsoever to retrieving them.
In a lot of situations, SELECT * will cause errors at run time in your application, rather than at design time. It hides the knowledge of column changes, or bad references in your applications.
Think of it as reducing the coupling between the app and the database.
To summarize the 'code smell' aspect:
SELECT * creates a dynamic dependency between the app and the schema. Restricting its use is one way of making the dependency more defined, otherwise a change to the database has a greater likelihood of crashing your application.
If you add fields to the table, they will automatically be included in all your queries where you use select *. This may seem convenient, but it will make your application slower as you are fetching more data than you need, and it will actually crash your application at some point.
There is a limit for how much data you can fetch in each row of a result. If you add fields to your tables so that a result ends up being over that limit, you get an error message when you try to run the query.
This is the kind of errors that are hard to find. You make a change in one place, and it blows up in some other place that doesn't actually use the new data at all. It may even be a less frequently used query so that it takes a while before someone uses it, which makes it even harder to connect the error to the change.
If you specify which fields you want in the result, you are safe from this kind of overhead overflow.
I don't think that there can really be a blanket rule for this. In many cases, I have avoided SELECT *, but I have also worked with data frameworks where SELECT * was very beneficial.
As with all things, there are benefits and costs. I think that part of the benefit vs. cost equation is just how much control you have over the datastructures. In cases where the SELECT * worked well, the data structures were tightly controlled (it was retail software), so there wasn't much risk that someone was going to sneek a huge BLOB field into a table.
Reference taken from this article.
Never go with "SELECT *",
I have found only one reason to use "SELECT *"
If you have special requirements and created dynamic environment when add or delete column automatically handle by application code. In this special case you don’t require to change application and database code and this will automatically affect on production environment. In this case you can use “SELECT *”.
Generally you have to fit the results of your SELECT * ... into data structures of various types. Without specifying which order the results are arriving in, it can be tricky to line everything up properly (and more obscure fields are much easier to miss).
This way you can add fields to your tables (even in the middle of them) for various reasons without breaking sql access code all over the application.
Using SELECT * when you only need a couple of columns means a lot more data transferred than you need. This adds processing on the database, and increase latency on getting the data to the client. Add on to this that it will use more memory when loaded, in some cases significantly more, such as large BLOB files, it's mostly about efficiency.
In addition to this, however, it's easier to see when looking at the query what columns are being loaded, without having to look up what's in the table.
Yes, if you do add an extra column, it would be faster, but in most cases, you'd want/need to change your code using the query to accept the new columns anyways, and there's the potential that getting ones you don't want/expect can cause issues. For example, if you grab all the columns, then rely on the order in a loop to assign variables, then adding one in, or if the column orders change (seen it happen when restoring from a backup) it can throw everything off.
This is also the same sort of reasoning why if you're doing an INSERT you should always specify the columns.
Selecting with column name raises the probability that database engine can access the data from indexes rather than querying the table data.
SELECT * exposes your system to unexpected performance and functionality changes in the case when your database schema changes because you are going to get any new columns added to the table, even though, your code is not prepared to use or present that new data.
There is also more pragmatic reason: money. When you use cloud database and you have to pay for data processed there is no explanation to read data that you will immediately discard.
For example: BigQuery:
Query pricing
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed.
and Control projection - Avoid SELECT *:
Best practice: Control projection - Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
Understand your requirements prior to designing the schema (if possible).
Learn about the data,
1)indexing
2)type of storage used,
3)vendor engine or features; ie...caching, in-memory capabilities
4)datatypes
5)size of table
6)frequency of query
7)related workloads if the resource is shared
8)Test
A) Requirements will vary. If the hardware can not support the expected workload, you should re-evaluate how to provide the requirements in the workload. Regarding the addition column to the table. If the database supports views, you can create an indexed(?) view of the specific data with the specific named columns (vs. select '*'). Periodically review your data and schema to ensure you never run into the "Garbage-in" -> "Garbage-out" syndrome.
Assuming there is no other solution; you can take the following into account. There are always multiple solutions to a problem.
1) Indexing: The select * will execute a tablescan. Depending on various factors, this may involve a disk seek and/or contention with other queries. If the table is multi-purpose, ensure all queries are performant and execute below you're target times. If there is a large amount of data, and your network or other resource isn't tuned; you need to take this into account. The database is a shared environment.
2) type of storage. Ie: if you're using SSD's, disk, or memory. I/O times and the load on the system/cpu will vary.
3) Can the DBA tune the database/tables for higher performance? Assumming for whatever reason, the teams have decided the select '*' is the best solution to the problem; can the DB or table be loaded into memory. (Or other method...maybe the response was designed to respond with a 2-3 second delay? --- while an advertisement plays to earn the company revenue...)
4) Start at the baseline. Understand your data types, and how results will be presented. Smaller datatypes, number of fields reduces the amount of data returned in the result set. This leaves resources available for other system needs. The system resources are usually have a limit; 'always' work below these limits to ensure stability, and predictable behaviour.
5) size of table/data. select '*' is common with tiny tables. They typically fit in memory, and response times are quick. Again....review your requirements. Plan for feature creep; always plan for the current and possible future needs.
6) Frequency of query / queries. Be aware of other workloads on the system. If this query fires off every second, and the table is tiny. The result set can be designed to stay in cache/memory. However, if the query is a frequent batch process with Gigabytes/Terabytes of data...you may be better off to dedicate additional resources to ensure other workloads aren't affected.
7) Related workloads. Understand how the resources are used. Is the network/system/database/table/application dedicated, or shared? Who are the stakeholders? Is this for production, development, or QA? Is this a temporary "quick fix". Have you tested the scenario? You'll be surprised how many problems can exist on current hardware today. (Yes, performance is fast...but the design/performance is still degraded.) Does the system need to performance 10K queries per second vs. 5-10 queries per second. Is the database server dedicated, or do other applications, monitoring execute on the shared resource. Some applications/languages; O/S's will consume 100% of the memory causing various symptoms/problems.
8) Test: Test out your theories, and understand as much as you can about. Your select '*' issue may be a big deal, or it may be something you don't even need to worry about.
There's an important distinction here that I think most answers are missing.
SELECT * isn't an issue. Returning the results of SELECT * is the issue.
An OK example, in my opinion:
WITH data_from_several_tables AS (
SELECT * FROM table1_2020
UNION ALL
SELECT * FROM table1_2021
...
)
SELECT id, name, ...
FROM data_from_several_tables
WHERE ...
GROUP BY ...
...
This avoids all the "problems" of using SELECT * mentioned in most answers:
Reading more data than expected? Optimisers in modern databases will be aware that you don't actually need all columns
Column ordering of the source tables affects output? We still select and
return data explicitly.
Consumers can't see what columns they receive from the SQL? The columns you're acting on are explicit in code.
Indexes may not be used? Again, modern optimisers should handle this the same as if we didn't SELECT *
There's a readability/refactorability win here - no need to duplicate long lists of columns or other common query clauses such as filters. I'd be surprised if there are any differences in the query plan when using SELECT * like this compared with SELECT <columns> (in the vast majority of cases - obviously always profile running code if it's critical).
If I want to union data from multiple tables located on different drives, will SQL pull the data in parallel? Are there any related setting or hints I should know about?
The UNION should run in parallel, at least since SQL Server 2005.
It doesn't make a difference if the tables are located on different drives or the same drive. In the modern world, disk can be virtual, or have multiple read heads. The distinction between one drive and more than one drive is less and less relevant.
If you have MAXDOP set to 1, then there will only be one thread.
Do note that UNION is going to be much slower than UNION ALL.
Brandon . . . let me respond here. You seem to be thinking in terms of older style architectures. These definitely still exist. However, modern disks have multiple read heads and multiple platters. Often, the issue with returning data involves the bandwidth at the controller level, and not the speed of the read. You also have multiple levels of caching and read-ahead (sometimes at both the file system and database levels). You are often better off letting the data base engines manage this complexity.
For instance, the machine that I'm working on right now is really a virtual machine. The disk I use is a partition on an EMC box. The processors are some set of processors in a big box.
My understanding of multi-threading in SQL Server is that we should leave it to the query optimiser - queries will be run in parallel when optimal.
You can limit the number of threads by using the MAXDOP hint (see What is the purpose for using OPTION(MAXDOP 1) in SQL Server?).
The default behaviour is to run in parallel when possible and optimal.
I wouldn't count on data being returned in a specific order solely by the order of your union'ed queries.
For me, when I have to do something like that I always wrap that entire query as a sub select only to handle sorting. like the following
Select pk_id, value from (
select pk_id, value from table1
union
select pk_id, value from table2
) order by PK_id, value
That way your never surprised by what you get back.
I'm storing an object in a database described by a lot of integer attributes. The real object is a little bit more complex, but for now let's assume that I'm storing cars in my database. Each car has a lot of integer attributes to describe the car (ie. maximum speed, wheelbase, maximum power etc.) and these are searchable by the user. The user defines a preferred range for each of the objects and since there are a lot of attributes there most likely won't be any car matching all the attribute ranges. Therefore the query has to return a number of cars sorted by the best match.
At the moment I implemented this in MySQL using the following query:
SELECT *, SQRT( POW((a < min_a)*(min_a - a) + (a > max_a)*(a - max_a), 2) +
POW((b < min_b)*(min_b - b) + (b > max_b)*(b - max_b), 2) +
... ) AS match
WHERE a < (min_a - max_allowable_deviation) AND a > (max_a + max_allowable_deviation) AND ...
ORDER BY match ASC
where a and b are attributes of the object and min_a, max_a, min_b and max_b are user defined values. Basically the match is the square root of the sum of the squared differences between the desired range and the real value of the attribute. A value of 0 meaning a perfect match.
The table contains a couple of million records and the WHERE clausule is only introduced to limit the number of records the calculation is performed on. An index is placed on all of the queryable records and the query takes like 500ms. I'd like to improve this number and I'm looking into ways to improve this query.
Furthermore I am wondering whether there would be a different database better suited to perform this job. Moreover I'd very much like to change to a NoSQL database, because of its more flexible data scheme options. I've been looking into MongoDB, but couldn't find a way to solve this problem efficiently (fast).
Is there any database better suited for this job than MySQL?
Take a look at R-trees. (The pages on specific variants go in to a lot more detail and present pseudo code). These data structures allow you to query by a bounding rectangle, which is what your problem of searching by ranges on each attribute is.
Consider your cars as points in n-dimensional space, where n is the number of attributes that describe your car. Then given a n ranges, each describing an attribute, the problem is the find all the points contained in that n-dimensional hyperrectangle. R-trees support this query efficiently. MySQL implements R-trees for their spatial data types, but MySQL only supports two-dimensional space, which is insufficient for you. I'm not aware of any common databases that support n-dimensional R-trees off the shelf, but you can take some database with good support for user-defined tree data structures and implement R-trees yourself on top of that. For example, you can define a structure for an R-tree node in MongoDB, with child pointers. You will then implement the R-tree algorithms in your own code while letting MongoDB take care of storing the data.
Also, there's this C++ header file implementing of an R-tree, but currently it's only an in-memory structure. Though if your data set is only a few million rows, it seems feasible to just load this memory structure upon startup and update it whenever new cars are added (which I assume is infrequent).
Text search engines, such as Lucene, meet your requirements very well. They allow you to "boost" hits depending on how they were matched, eg you can define engine size to be considered a "better match" than wheel base. Using lucene is really easy and above all, it's SUPER FAST. Way faster than mysql.
Mysql offer a plugin to provide text-based searching, but I prefer to use it separately, that way it's easily scalable (being read-only, you can have multiple lucene engines), and easily manageable.
Also check out Solr, which sits on top of lucene and allows you to store, retrieve and search for simple java object (Lists, arrays etc).
Likely, your indexes aren't helping much, and I can't think of another database technology that's going to be significantly better. A few things to try with MySQL....
I'd try putting a copy of the data in a memory table. At least the table scans will be in memory....
http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html
If that doesn't work for you or help much, you could also try a User Defined Function to optimize the calculation of the matching. Basically, this means executing the range testing in a C library you provide:
http://dev.mysql.com/doc/refman/5.0/en/adding-functions.html