SELECT TableName.Col1 VS SELECT Col1 - mysql

This might be a weird question, but I didn't know how to research it. When doing the following query:
SELECT Foo.col1, Foo.col2, Foo.col3
FROM Foo
INNER JOIN Bar ON Foo.ID = Bar.BID
I tend to use TableName.Column instead of just col1, col2, col3
Is there any performance difference? Is it faster to specify Table name for each column?
My guess would be that yes, it is faster, since it would take some time to look up the column name and disambiguate it.
If anyone knows a link where I could read up on this, I would be grateful. I did not even know how to title this question better, since I'm not sure how to search for it.

First of all: This should not matter. The time to look up the columns is such a minuscule fraction of the total processing time of a typical query that this is probably the wrong spot to look for additional performance.
Second: Tablename.Colname is faster than Colname only, as it eliminates the need to search the referenced tables (and table-like structures such as views and subqueries) for a fitting column. Again: the difference is within the statistical noise.
Third: Using Tablename.Colname is a good idea, but for other reasons: If you use Colname only, and one of the tables in your query gets a new column with the same name, you end up with the oh-so-well-known "ambiguous column name" error. Typical candidates for such columns are "comment", "lastchanged", and friends. If you qualify your column references, this maintainability problem simply disappears - your query will work as always, ignoring the new fields.
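To illustrate the third point with the tables from the question (a minimal sketch; the comment column is hypothetical): suppose Bar later gains a comment column that Foo already has.
SELECT comment
FROM Foo
INNER JOIN Bar ON Foo.ID = Bar.BID;        -- now fails with an "ambiguous column" error
SELECT Foo.comment
FROM Foo
INNER JOIN Bar ON Foo.ID = Bar.BID;        -- still works, the reference stays unambiguous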

If it's faster, the difference is surely negligible, like a few microseconds per query. All the data about the tables mentioned in the query has to be loaded into memory, so it doesn't save any disk access. It's done during query parsing, not during data processing. Even if you run the query thousands of times, it might not make up for the time spent typing those extra characters, and certainly not the time we've spent discussing it. :)
But it makes the queries longer, so there's slightly more time spent in communications. If you're sending the query over a network, that will probably negate any time saved during parsing. You can reduce this by using short table aliases, though:
SELECT t.col1, t.col2
FROM ReallyLongTableName t
As a general rule, when worrying about database performance you only need to concern yourself with aspects whose time depends on the number of rows in the tables. Anything that's the same regardless of the amount of data will fall into the noise, unless you're dealing with extremely tiny tables (in which case, why are you bothering with a database -- use a flat file).

Related

Is it faster to only query specific columns?

I've heard that it is faster to select columns manually ("col1, col2, col3, etc") instead of querying them all with "*".
But what if I don't even want to query all columns of a table? Would it be faster to query, for example, only "col1, col2" instead of "col1, col2, col3, col4"?
From my understanding, SQL has to read through all of the columns anyway, and only the returned result changes. I'd like to know if I can achieve a gain in performance by only choosing the right columns.
(I'm doing this anyway, but a backend API of one of my applications more often than not returns all columns, so I'm thinking about letting the user manually select the columns they want.)
In general, reducing the number of columns in the select is a minor optimization. It means that less data is being returned from the database server to the application calling the server. Less data is usually faster.
Under most circumstances, this is a minor improvement. There are some cases where the improvement can be more important:
If a covering index is available for the query, so that the index satisfies the query without having to access data pages (see the sketch after this list).
If some fields are very long, so records occupy multiple pages.
If the volume of data being retrieved is a small fraction (think < 10%) of the overall data in each record.
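As a rough sketch of the covering-index case above (the table and index names are made up): when every selected column is contained in one index, MySQL can answer the query from the index alone, and EXPLAIN reports "Using index" in the Extra column.
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);
EXPLAIN SELECT customer_id, order_date FROM orders WHERE customer_id = 42;   -- Extra: Using index
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;                         -- has to read the full rows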
Listing the columns individually is a good idea, because it protects code from changes in underlying schema. For instance, if the name of a column is changed, then a query that lists columns explicitly will break with an easy-to-understand error. This is better than a query that runs and produces erroneous results.
You should try not to use select *.
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions of SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
I got this from this answer.
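To make the binding problem concrete (hypothetical tables, both with an ID column):
SELECT *
FROM posts
JOIN users ON users.ID = posts.user_id;         -- the result set now contains two columns named ID
SELECT posts.ID AS post_id, users.ID AS author_id, posts.title
FROM posts
JOIN users ON users.ID = posts.user_id;         -- every column has a single, unambiguous name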
I believe this topic has already been covered here:
select * vs select column
I believe it covers your concerns as well. Please take a look.
All the column labels and values occupy some space. Sending them to the issuer of the request instead of a subset of the columns means sending more data, and sending more data is slower.
If you have columns, like
id, username, password, email, bio, url
and you want to get only the username and password, then
select username, password ...
is quicker than
select * ...
because id, email, bio and url are sent as well for the latter, which makes the response larger. But the main problem with select * is different. It might be the source of inconsistencies if, for some reason, the order of the columns changes. Also, it might retrieve data you do not want to retrieve. It is always better to have a whitelist of the columns you actually want to retrieve.

Does SELECT * really take more time than selecting only the needed columns?

Will it make a discernible difference in the time a website page loads? On average, my tables have 10 columns, if I just need 3 of those columns, should I just call those in the query to make it faster?
Will it make a discernible difference? Probably not under most circumstances. Here are some cases where it would possibly make a big difference:
The 7 unneeded columns are really, really big (see the sketch after this list).
You are returning lots and lots of rows.
You have a big table, are getting many rows, and an index is available on the 3 columns but not the 10.
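A quick sketch of the first case (a hypothetical table with one very large column): selecting only the small columns avoids dragging the big one over the wire for every row.
SELECT id, title, created_at FROM articles WHERE author_id = 7;   -- small columns only
SELECT * FROM articles WHERE author_id = 7;                       -- also pulls the MEDIUMTEXT body for every row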
But, there are other reasons not to use *:
It will replace the columns based on the order of the columns in the database at the time the query is compiled. This can cause problems if the structure of the table changes.
If a column name changes or is removed, your query would still work, but subsequent code might break. If you explicitly list the columns, then the query itself will break, making the problem easier to spot.
Typing three column names shouldn't be a big deal. Explicitly listing the columns makes the code more informative.
Let's say you had a table with 1000 columns, and you only needed 3.
What do you think would run faster and why?
This: SELECT * FROM table_name;
or this: SELECT col1, col2, col3 FROM table_name;
When you are using *, you are now holding that entire selection (big or small) in memory. The bigger the selection, the more memory it's going to use/need.
So even though your table isn't necessarily big, I would still only select the data that you actually need. You might not even notice a difference in speed, but it will definitely be faster.
Yes if you only need a handful of columns, only select those. Below are some reasons:
THE MOST OBVIOUS: Extra data needs to be sent back, making for larger packets to transmit (or pipe via local socket). This will increase overall latency. This might not seem like much for 1 or 2 rows, but wait until you've got 100 or 1000 rows... 7 extra columns of data will significantly affect overall transit latency, especially if the result set has to be broken into more TCP packets for transmission. This might not be such an issue if you're hitting a localhost socket, but move your DB to a server across a network, to another datacenter, etc... and the impact will be plain as day!
With the MySQL query cache enabled, storing unneeded data in result sets will increase your overall cache space needs--larger query caches can suffer performance hits.
A HUGE HIT CAN BE: If you need only columns that are part of a covering index, doing a select * will require follow-up point lookups for the remaining data fields in the main table, rather than just using the data from the index.
Yes you should.
Using named columns in a select is a best practice when working with databases, for multiple reasons.
Only the needed data travels from the database to the application server, reducing cpu, memory and disk usage.
It helps to detect coding errors and structure changes.
There are only a few cases where using select * is a good idea; in all other queries, do yourself a favour and use the column names.
Yes, definitely. * will get replaced with all the column names, and only after that does execution start. For example, if there are 3 columns a, b, c in a table, select a, b, c starts executing directly, whereas select * is first rewritten into select a, b, c, and only after that does execution start.
The short of it is yes, if you are returning more data it will take longer. This may be a very very very tiny amount of time but yes it will take longer. As stated above select * can be dangerous in a production situation where you may not be the one designing/implementing the database. If you assume that columns are returned in a particular order or the database structure is of a particular type and then the DBA goes in and makes some kind of a change and does not inform you, you may have an issue with your code.
The difference is very minimal, but there is a slight difference; I think which of the two is faster really depends on several factors.
1) How many columns are in the table?
2) How many columns do you actually need to grab?
3) How many records are you grabbing?
In your case, based on what you said of having 10 columns and only needing 3 of those columns, I doubt it'll make a difference if you use 'Select *' or not, unless perhaps you're grabbing tens of thousands of records. But in more extreme cases with a lot more columns involved I have found 'Select *' to be slightly faster, but that might not be true in all cases.
I once did some speed tests in a SQLite table with over 150 columns, where I needed to grab only about 40 of the columns, and I needed all 20,000+ records. The speed differences were very minimal (we're talking 20 to 40 milliseconds difference), but it was actually faster to grab the data from All the columns with a 'SELECT ALL *', rather than going 'Select All Field1, Field2, etc'.
I assume the more records and columns in your table, the greater the speed difference this example will net you. But if you only needed 3 columns in a gigantic table I'd guess that just grabbing those 3 columns would be faster.
Bottom line though, if you really care about the minimal speed differences between 'Select *' and 'Select field1, field2, etc', then do some speed tests.

MYSQL IN vs <> performance

I have a table where I have a status field which can have values like 1,2,3,4,5. I need to select all the rows from the table with status != 1. I have the following 2 options:
NOTE that the table has an INDEX on the status field.
SELECT ... FROM my_tbl WHERE status <> 1;
or
SELECT ... FROM my_tbl WHERE status IN(2,3,4,5);
Which of the above is a better choice? (my_tbl is expected to grow very big).
You can run your own tests to find out, because it will vary depending on the underlying tables.
More than that, please don't worry about "fastest" without having first done some sort of measurement that it matters.
Rather than worrying about fastest, think about which way is clearest.
In databases especially, think about which way is going to protect you from data errors.
It doesn't matter how fast your program is if it's buggy or gives incorrect answers.
How many rows have the value "1"? If less than ~20%, you will get a table scan regardless of how you formulate the WHERE (IN, <>, BETWEEN). That's assuming you have INDEX(status).
But indexing ENUMs, flags, and other things with poor cardinality is rarely useful.
An IN clause with 50K items causes memory problems (or at least used to), but not performance problems. They are sorted, and a binary search is used.
Rule of Thumb: The cost of evaluation of expressions (IN, <>, functions, etc) is mostly irrelevant in performance. The main cost is fetching the rows, especially if they need to be fetched from disk.
An INDEX may assist in minimizing the number of rows fetched.
You can use BENCHMARK() to test it yourself.
http://sqlfiddle.com/#!2/d41d8/29606/2
The first one is faster, which makes sense since it only has to compare 1 number instead of 4.
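A minimal sketch of the BENCHMARK() approach (it only times evaluation of a scalar expression, not actual row fetching, so treat it as a rough comparison of the two predicates):
SELECT BENCHMARK(10000000, 3 <> 1);           -- evaluate the inequality ten million times
SELECT BENCHMARK(10000000, 3 IN (2,3,4,5));   -- evaluate the IN list ten million times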

What's wrong with using a wildcard in your mySQL query? [duplicate]

This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 9 years ago.
Basically what's the difference in terms of security and speed in these 2 queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * make my query perform slower than listing the columns explicitly?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there are a lot of unneeded columns in your query, for example, when you join several tables.
Of course, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is not a single, simple answer, and your literal question cannot fully be answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually talking about a specific query against a specific table, but rather about selecting columns explicitly or using the *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed.
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key, for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort to read in all of the row data... and this, in some cases, is a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
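A sketch of that check against the table from the question, assuming id is the primary key and there is a secondary index on name (the index is hypothetical): the column-list version can be covered by that index, while the * version cannot.
EXPLAIN SELECT * FROM `myTable` WHERE `name` = 'foo';
EXPLAIN SELECT `id`, `name` FROM `myTable` WHERE `name` = 'foo';   -- may show "Using index" where the query above does not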
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same thing you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." and a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not a pro at MySQL. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 million rows is still a relatively small size for a table. (And therefore faster performance is typically possible.)
As you found out, the leading wildcard is particularly inefficient, and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, they systematically have a negative effect on the performance of "CUD" operations (i.e. the SQL CREATE/INSERT, UPDATE, DELETE verbs, i.e. the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERT queries get slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database into a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
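A rough sketch of that split, with a hypothetical acts_incoming input table: collect into a nearly index-free table, then periodically move the rows into the indexed acts table used for reporting (in practice you would bound this by an id range or wrap it in a transaction).
INSERT INTO acts (ip, url, note, account_id)
SELECT ip, url, note, account_id FROM acts_incoming;    -- copy into the indexed reporting table
DELETE FROM acts_incoming;                              -- keep the input table small and fast for INSERTs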
Another concern/question is about the width of a row in the acts table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least of these two keys, in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for one...), but the general process for doing so is to make a list of the most common use cases and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with the MySQL command EXPLAIN.
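For example, the composite index suggested above would look something like this (a sketch to be validated with EXPLAIN against your real workload):
CREATE INDEX idx_acts_ip_note ON acts (ip, note);
EXPLAIN SELECT COUNT(DISTINCT ip)
FROM acts
JOIN users ON users.ip = acts.ip
WHERE acts.note = 'some_marketing_page';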
Normalization OR denormalization (or indeed a combination of both!) is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
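A sketch of that rewrite (same tables as in the question; compare the plans with EXPLAIN before committing to it):
SELECT COUNT(DISTINCT a.ip)
FROM (SELECT DISTINCT ip
      FROM acts
      WHERE note = 'some_marketing_page') AS a
JOIN users ON users.ip = a.ip;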
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
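As a sketch of the third option using MySQL's built-in full-text search (requires MyISAM, or InnoDB on MySQL 5.6+; note that full-text matching is word-based rather than an arbitrary substring match, and the search term here is hypothetical):
ALTER TABLE acts ADD FULLTEXT INDEX ft_acts_url (url);
SELECT COUNT(DISTINCT ip)
FROM acts
WHERE MATCH(url) AGAINST ('+signup' IN BOOLEAN MODE);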
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.
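A sketch of that last idea (the note_hash column is hypothetical; CRC32 is one cheap integer hash, and repeating the equality check on note guards against hash collisions):
ALTER TABLE acts ADD COLUMN note_hash INT UNSIGNED, ADD INDEX idx_acts_note_hash (note_hash);
UPDATE acts SET note_hash = CRC32(note);                    -- populate once; keep it maintained on insert
SELECT COUNT(DISTINCT ip)
FROM acts
WHERE note_hash = CRC32('some_marketing_page')
  AND note = 'some_marketing_page';                         -- re-check the real value in case of collisions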
Try running EXPLAIN on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.