Is it faster to only query specific columns? - mysql

I've heard that it is faster to select columns manually ("col1, col2, col3, etc") instead of querying them all with "*".
But what if I don't even want to query all columns of a table? Would it be faster to query, for example, only "col1, col2" instead of "col1, col2, col3, col4"?
From my understanding SQL has to search through all of the columns anyway, and just the return-result changes. I'd like to know if I can achieve a gain in performance by only choosing the right columns.
(I'm doing this anyway, but a backend API of one of my applications returns all columns more often than not, so I'm thinking about letting the user manually select the columns they want.)

In general, reducing the number of columns in the select is a minor optimization. It means that less data is being returned from the database server to the application calling the server. Less data is usually faster.
Under most circumstances, this is a minor improvement. There are some cases where the improvement can be more important:
If a covering index is available for the query, so the index satisfies the query without having to access data pages.
If some fields are very long, so records occupy multiple pages.
If the volume of data being retrieved is a small fraction (think < 10%) of the overall data in each record.
Listing the columns individually is a good idea, because it protects code from changes in underlying schema. For instance, if the name of a column is changed, then a query that lists columns explicitly will break with an easy-to-understand error. This is better than a query that runs and produces erroneous results.

You should try not to use select *.
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions of SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
I got this from this answer.
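To make the binding problem above concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely so it can run without a server (an assumption of mine, not something the answer used); the users and orders tables are hypothetical. Both tables have an id column, and a consumer that keys rows by column name silently loses one of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'alice');
    INSERT INTO orders VALUES (100, 1, 9.5);
""")

# SELECT * across a join returns two columns that are both called "id".
cur = conn.execute("SELECT * FROM users JOIN orders ON orders.user_id = users.id")
star_names = [d[0] for d in cur.description]
star_row = cur.fetchone()

# A consumer that builds a dict per row keeps only the last "id" it sees.
star_dict = dict(zip(star_names, star_row))

# Listing and aliasing the columns removes the ambiguity.
cur = conn.execute("""
    SELECT users.id AS user_id, users.name, orders.id AS order_id, orders.total
    FROM users JOIN orders ON orders.user_id = users.id
""")
explicit_names = [d[0] for d in cur.description]
explicit_dict = dict(zip(explicit_names, cur.fetchone()))

print(star_names)
print(star_dict)
print(explicit_dict)
```

How a real client library reacts to duplicate names varies: some overwrite, some raise, some rename. The explicit, aliased column list sidesteps all of those behaviours.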

I believe this topic has already been covered here:
select * vs select column
I believe it covers your concerns as well. Please take a look.

All the column labels and values occupy some space. Sending all of them to the issuer of the request instead of a subset of the columns means sending more data, and more data takes longer to send.
If you have columns, like
id, username, password, email, bio, url
and you want to get only the username and password, then
select username, password ...
is quicker than
select * ...
because id, email, bio and url are sent as well for the latter, which makes the response larger. But the main problem with select * is different. It might be the source of inconsistencies if, for some reason, the order of the columns changed. Also, it might retrieve data you do not want to retrieve. It is always better to have a whitelist with the columns you actually want to retrieve.
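If an API is going to let callers pick columns, as the question suggests, the whitelist idea could look roughly like the sketch below. ALLOWED_COLUMNS, build_select and the users table name are hypothetical names made up for illustration; the column list mirrors the example above with password deliberately left out.

```python
# Hypothetical whitelist for a "users" table; "password" is deliberately excluded.
ALLOWED_COLUMNS = ("id", "username", "email", "bio", "url")


def build_select(requested, table="users"):
    """Return a SELECT statement that contains only whitelisted columns."""
    if not requested:
        raise ValueError("at least one column is required")
    unknown = [c for c in requested if c not in ALLOWED_COLUMNS]
    if unknown:
        raise ValueError(f"columns not allowed: {unknown}")
    # Every name has been checked against the whitelist, so interpolating
    # it into the statement does not pass raw user input to the database.
    cols = ", ".join(f"`{c}`" for c in requested)
    return f"SELECT {cols} FROM `{table}`"


query = build_select(["username", "email"])
print(query)
```

Values in a WHERE clause should still go through bound parameters; the whitelist only covers identifiers, which cannot be bound.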

Related

Is selecting fewer columns speeding up my query?

I have seen several questions comparing select * to select by all columns explicitly, but what about fewer columns selected vs more.
In other words, is:
SELECT id,firstname,lastname,lastlogin,email,phone
More than negligibly slower than:
SELECT id,firstname,lastlogin
I realize there will be small differences for more data being transferred through the system and to the application, but this is a total data/load difference, not a cost of the query (larger data in the cells would have the same effect anyway I believe) - I'm only trying to optimize my query, as I will have to load ALL the data at some point anyway...
When my admin user logs in, I'm going to load the entire user database into a cache, but I can either query only critical data upfront to shave some execution time, or just get everything - if it works out roughly the same. I know more rows equals longer query execution - but what about more selected values in my query?
Under most circumstances, the only difference is going to be slightly larger data for these fields and the additional time to fetch them.
There are two things to consider:
If the additional fields are very big, then this could be a big difference in performance.
If there is an index that covers the columns you actually want, then the index can be used for the query. This could speed the query in the database.
In general, though, the advice is to return the columns you want to the application. If there is complex processing, you should consider doing that in the database rather than the application.

Is it possible to find out which columns are (not) explicitly queried in MySQL/MariaDB?

We have a very large, very old table with a few hundred columns. Some of the columns are historical and aren't used in any writing client. They are mostly empty (except in very old records). I want to clean up the DB and get rid of old, unused columns in certain tables.
The problem is all the third-party clients that access this DB (reading only). I can't expect all providers to update their clients. As long as they are querying for SELECT * ..., it doesn't matter. But I expect them to query explicitly (SELECT colA, colB, ...). Removing colA from the table would result in errors on the client side, obviously.
Now I would like to know which columns are explicitly used by any query statements, so I can remove the unused ones. I guess I could use the query log, analyze it and find explicitly used columns, but:
We receive a few million queries per hour.
Some clients access our DB maybe once a week if at all, some every single second.
That means the query log would have to run for months in a production environment and I don't know if that could/would have any negative impact on the servers or overall performance.
Is there any other, more solid solution? Are my concerns regarding the query log exaggerated? I was hoping that MariaDB/MySQL are storing statistical data somewhere, showing the usage of columns, but I couldn't find anything I need.
There is no log of what columns are mentioned in queries.
The "general log" copies every query to a file. This can be a serious disk hog (space and speed), especially with "millions of queries per hour". But it would at least be an attempt at the answer...
The general log can, I think, be summarized via pt-query-digest.
Another possibility is to use tcpdump together with pt-query-digest to grab all the queries.
The advantage of the digest is that it consolidates multiple 'similar' queries into a single entry. You would still have to manually (or programmatically) sift through the output.
Most columns are not worth removing, even if they are never used. I would suggest focusing on the 10% that are the bulkiest. There might be a way to use tcpdump | egrep to look for just those column names. Refine that a few times, and you might discover some prime candidate(s) for removal.
Unfortunately, if the clients do SELECT *, then "all" columns are being used.
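Sifting through captured queries programmatically could be sketched like this. It is only crude text matching over query strings you have already collected (from the general log, a digest, or tcpdump output); count_column_mentions and the sample queries are hypothetical, and a name that also appears as a table name or inside a string literal would be counted too.

```python
import re
from collections import Counter


def count_column_mentions(queries, candidate_columns):
    """Count how many queries mention each candidate column by name.

    Also returns the number of queries that use SELECT * (or SELECT t.*),
    since those implicitly touch every column.
    """
    patterns = {
        c: re.compile(rf"\b{re.escape(c)}\b", re.IGNORECASE)
        for c in candidate_columns
    }
    star = re.compile(r"\bSELECT\s+(\w+\.)?\*", re.IGNORECASE)
    counts = Counter({c: 0 for c in candidate_columns})
    star_queries = 0
    for q in queries:
        if star.search(q):
            star_queries += 1
        for col, pat in patterns.items():
            if pat.search(q):
                counts[col] += 1
    return counts, star_queries


sample = [
    "SELECT colA, colB FROM big_table WHERE id = 1",
    "SELECT * FROM big_table",
    "SELECT colA FROM big_table ORDER BY colA",
    "SELECT COUNT(*) FROM big_table",
]
counts, star_queries = count_column_mentions(sample, ["colA", "colB", "legacy_col"])
print(dict(counts), star_queries)
```

A column with zero mentions over a long enough capture window is a candidate for removal, not proof that it is unused, because a client that runs once a week may simply not have shown up yet.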
The KEY_COLUMN_USAGE table in information_schema describes which key columns have constraints; the result also includes the table name. If the query below returns an empty result, no foreign key references the column, so you can drop it without worrying about FKs. (This says nothing about whether client queries still use the column.)
SELECT *
FROM
information_schema.KEY_COLUMN_USAGE
WHERE
REFERENCED_COLUMN_NAME = 'your_column_name';

What's wrong with using a wildcard in your mySQL query? [duplicate]

Basically what's the difference in terms of security and speed in these 2 queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * make my query benchmark slower than listing the columns explicitly?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there are a lot of unneeded columns in your query, for example, when you join several tables.
Of course, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is not a single, simple answer, and your literal question cannot fully be answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually talking about a specific query against a specific table, but rather about selecting columns explicitly or using the *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed.
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key, for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort to read in all of the row data... and this, in some cases, is a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
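The plan comparison described above can be tried out in miniature. This sketch uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in, only because it runs without a server; in MySQL you would run EXPLAIN and look for "Using index" in the Extra column instead. The users table and idx_username_email index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT,
        email TEXT,
        bio TEXT
    );
    CREATE INDEX idx_username_email ON users (username, email);
""")


def plan(sql):
    """Return the human-readable plan text for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last field of each plan row is the description.
    return " | ".join(r[-1] for r in rows)


# Every selected column lives in the index, so the table itself is never read.
narrow_plan = plan("SELECT username, email FROM users WHERE username = 'bob'")

# "bio" is not in the index, so each match needs an extra lookup in the table.
wide_plan = plan("SELECT * FROM users WHERE username = 'bob'")

print(narrow_plan)
print(wide_plan)
```

The first plan reports a covering index and the second does not, which is the same kind of difference the answer says to look for between the explicit column list and the star.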
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same thing you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." and a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
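A quick way to see the difference between the two forms of COUNT, sketched with Python's built-in sqlite3 module for convenience (the table t is hypothetical; MySQL treats NULLs the same way here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO t (email) VALUES ('a@example.com'), (NULL), ('c@example.com');
""")

# COUNT(*) counts rows; COUNT(email) counts only rows where email is not NULL.
total_rows, non_null_emails = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM t"
).fetchone()

print(total_rows, non_null_emails)
```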
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose above 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 million rows is still a relatively small size for a table. (And therefore faster performance is typically possible.)
As you found out, front-ended wildcards are particularly inefficient, and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, they systematically have a negative effect on the performance of "CUD" operations (i.e. the SQL CREATE/INSERT, UPDATE, DELETE verbs, the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERT queries get slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
Another concern/question is about the width of a row in the acts table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
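The collect-then-import idea could be sketched as follows. For the sake of a runnable example both tables live in one in-memory SQLite database via Python's sqlite3 module; acts_incoming, acts_archive and flush_incoming are hypothetical names, and in practice the two tables might sit in different databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small, index-free table that only takes INSERTs.
    CREATE TABLE acts_incoming (id INTEGER PRIMARY KEY, ip TEXT, url TEXT, note TEXT);
    -- Larger table that carries the indexes needed for analysis.
    CREATE TABLE acts_archive (id INTEGER PRIMARY KEY, ip TEXT, url TEXT, note TEXT);
    CREATE INDEX idx_archive_note_ip ON acts_archive (note, ip);
    INSERT INTO acts_incoming (ip, url, note) VALUES
        ('1.1.1.1', '/a', 'some_marketing_page'),
        ('2.2.2.2', '/b', 'other_page');
""")


def flush_incoming(conn):
    """Move everything collected so far into the archive; return rows moved."""
    with conn:  # commit on success, roll back on error
        max_id = conn.execute("SELECT MAX(id) FROM acts_incoming").fetchone()[0]
        if max_id is None:
            return 0
        # Bounding by max_id means rows inserted during the move are left
        # in place for the next run instead of being deleted unseen.
        conn.execute(
            "INSERT INTO acts_archive (ip, url, note) "
            "SELECT ip, url, note FROM acts_incoming WHERE id <= ?",
            (max_id,),
        )
        cur = conn.execute("DELETE FROM acts_incoming WHERE id <= ?", (max_id,))
        return cur.rowcount


moved = flush_incoming(conn)
remaining = conn.execute("SELECT COUNT(*) FROM acts_incoming").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM acts_archive").fetchone()[0]
print(moved, remaining, archived)
```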
Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least with these two keys in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for one...), but the general process for doing so is to make the list of the most common use cases and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with the MySQL EXPLAIN command.
Normalization OR denormalization (or indeed a combination of both!) is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.note, if an index on note doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash of note and searching for that.
Try running EXPLAIN on your query and look to see if there are any table scans.
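The integer-hash idea might look something like this. CRC32 is my own choice for the sketch (MySQL has a built-in CRC32() function, and Python's zlib.crc32 implements the same standard checksum); the note_hash column and index name are hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
import sqlite3
import zlib


def note_hash(note):
    """32-bit CRC of the note text, used as a compact integer lookup key."""
    return zlib.crc32(note.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acts (ip TEXT, url TEXT, note TEXT, note_hash INTEGER);
    CREATE INDEX idx_acts_note_hash ON acts (note_hash);
""")

rows = [
    ("1.1.1.1", "/a", "some_marketing_page"),
    ("1.1.1.1", "/b", "some_marketing_page"),
    ("2.2.2.2", "/c", "some_marketing_page"),
    ("3.3.3.3", "/d", "other_page"),
]
# The hash is computed once, at insert time.
conn.executemany(
    "INSERT INTO acts (ip, url, note, note_hash) VALUES (?, ?, ?, ?)",
    [(ip, url, note, note_hash(note)) for ip, url, note in rows],
)

target = "some_marketing_page"
# The integer index narrows the search; comparing note as well guards
# against two different notes that happen to share a hash.
distinct_ips = conn.execute(
    "SELECT COUNT(DISTINCT ip) FROM acts WHERE note_hash = ? AND note = ?",
    (note_hash(target), target),
).fetchone()[0]

print(distinct_ips)
```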
Should this be a LEFT JOIN?
Maybe this site can help.

How do I search part of a column?

I have a mysql table containing 40 million records that is being populated by a process over which I have no control. Data is added only once every month. This table needs to be search-able by the Name column. But the name column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
sql_query = SELECT Id, OwnersName,
substring_index(substring_index(OwnersName,' ',2),' ',-1) as firstname,
substring_index(OwnersName,' ',1) as lastname
FROM table1
How do I use sphinx search to search by firstname and/or lastname? I would like to be able to search for 'Smith' in only the first name?
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), inserting new columns called OwnersFirstName and OwnersLastName along with an update/insert trigger which extracts the relevant information from OwnersName and populates the new columns appropriately.
This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".
Most problems people have with databases concern the speed of their queries. Wasting a bit of disk space to gain a large amount of performance improvement is usually okay.
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database. Then query yours. That may involve a fair bit of data transfer every month however so the first option is the better one, if allowed.
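Whichever route you take (trigger or periodic copy), the extraction step itself is small. A rough sketch of it, assuming the 'Last First Middle' format from the question really does hold for every row; split_owner_name is a hypothetical helper, and names with multi-word surnames, prefixes or suffixes will be split wrongly:

```python
def split_owner_name(full_name):
    """Split a 'Last First Middle' string into its parts.

    Missing parts come back as empty strings; anything after the
    second word is treated as the middle name(s).
    """
    parts = full_name.split()
    last = parts[0] if parts else ""
    first = parts[1] if len(parts) > 1 else ""
    middle = " ".join(parts[2:])
    return {"last": last, "first": first, "middle": middle}


print(split_owner_name("Smith John Quincy"))
print(split_owner_name("Smith"))
```

Running this once per row when the monthly data arrives means the cost is paid at load time rather than on every search.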
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: @firstname Smith.
You could use substring to get the parts of the field that you want to search in, but that will slow down the process. The query can not use any kind of index to do the comparison, so it has to touch each record in the table.
The best would be not to store several values in the same field, but to put the name components in three separate fields. When you store more than one value in a field, there are almost always problems accessing the data. I see this over and over in different forums...
This is an intractable problem because full names can contain prefixes, suffixes, middle names or no middle names, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability.