Curious to hear this from folks with more DBA insight, but what performance implications does an application face when it runs a query like:
select * from some_large_table;
You have to do a full table scan since no index is being hit, and I believe in O notation terms we're talking O(N), where N is the size of the table. Is this typically considered suboptimal behavior? What if you really do need everything from the table at certain times? Yes, we have tools such as pagination etc., but I'm talking strictly from a database perspective here. Is this type of behavior normally frowned upon?
What happens when you don't specify columns is that the DB engine has to query the master table data (the catalog) for the column list. That lookup is really fast, but it does cause a minor performance hit. As long as you're not doing a sloppy SELECT * with a JOIN statement or nested queries, you should be fine. Just note the small cost of letting the DB engine run a query to find the columns.
The MySQL server opens a server-side cursor to read that table. The client may fetch none, some, or all of the records, and performance for the client depends only on the number of records it actually fetches. The query can even be faster on the server side than a query with conditions, since the latter also involves index reads. Only if the client fetches all records is it equivalent to a full table scan.
Selecting more columns than you need (SELECT *) is always bad. Don't do more than you have to.
If you're selecting from the whole table, it doesn't matter if you have an index.
Another issue you're going to run into is how you want to lock the table. If this is a busy application, you might not want to skip locking entirely, because of the inconsistent data that might be returned; but if you lock too aggressively, it could slow the query further. O(n) is considered acceptable in general computer-science terms. In databases, however, we measure cost in time and in the number of reads and writes. A full scan of a large table is a huge number of reads and will probably take a long time to execute, so in practice it's unacceptable.
Is it okay to always use SELECT * even if you only need one column when retrieving data from MySQL? Does it affect the speed of the query or the speed of the system? Thanks.
No, it is not always okay.
But it is also not always a problem.
In order of performance impact:
If you only select a subset of columns, it can positively affect the access path. Maybe those columns can be read from an index without touching the table at all (see the covering-index sketch after this list).
Beyond that, there is also raw network I/O. Sending three columns uses a lot less bandwidth than sending three hundred (especially for many rows).
Beyond that, there is also the memory required by your client application to process the result set.
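As an illustration of the covering-index point above, here is a minimal sketch; the table, columns and index are hypothetical, not taken from the question:

CREATE TABLE orders (
  id          INT PRIMARY KEY,
  customer_id INT,
  created_at  DATETIME,
  notes       TEXT,
  KEY idx_customer_created (customer_id, created_at)
);

-- Can be answered from idx_customer_created alone; EXPLAIN shows "Using index".
SELECT customer_id, created_at FROM orders WHERE customer_id = 42;

-- Has to go back to the table for every matching row to fetch the remaining columns.
SELECT * FROM orders WHERE customer_id = 42;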
I believe the columns in the SELECT are the least time/CPU-intensive piece of the query. Limiting the number of rows, either with WHERE clauses or explicitly with LIMIT, is where time is saved.
In my personal experience you should prefer named columns over SELECT * whenever possible.
But, performance is not the key reason.
Code that uses SELECT * is usually harder to read and debug as it is not explicit what the intent of the query is.
Code that uses SELECT * can break when the database structure is changed (referring to columns by index rather than by name is almost always the wrong way to write your code).
Finally, retrieving bigger datasets does affect speed, bandwidth and memory consumption, and that's never advisable if it can easily be avoided.
As far as performance is concerned, JOINs and row count are more likely to slow query performance than the difference in selected columns, but inefficiencies have a habit of compounding later on in projects. You may have no performance issues with a test-bed application, but when things scale, or when data is accessible only over the restricted bandwidth of a network, that's when you'll be pleased you wrote explicit SELECTs to start with.
Note that if you're just writing a one-off query to check some data I wouldn't worry, but if you're writing a query for a codebase that might be executed often, it pays to write good queries and, when necessary, to consider stored procedures.
What is the optimal solution: using an INNER JOIN, or multiple queries?
something like this:
SELECT * FROM brands
INNER JOIN cars
ON brands.id = cars.brand_id
or like this:
SELECT * FROM brands
... (loop over each row of the brands result) ...
SELECT * FROM cars WHERE brand_id = [row(brands.id)]
Generally speaking, one query is better, but there are some caveats to that. For example, older versions of SQL Server saw a large decrease in performance if you did more than seven joins. The answer will really depend on the database engine, version, query, schema, fields, etc., so we can't say for sure which is better. Always look into minimizing the number of queries when possible, without going overboard and creating result sets that are unwieldy or impossible to maintain.
This is a very subjective question but remember that each time you call the database there's significant overhead.
Almost without exception the optimum is to issue as few commands and pull out all the data you'll need. However for practical reasons this clearly may not be possible.
Generally speaking, if a database is well maintained, one query is quicker than two. If it's not, you need to look at your data/indices and determine why.
A final point: you're hinting in your second example that you'd load the brands and then issue a command to get all the cars for each brand. This is without a doubt your worst option, as it doesn't issue 2 commands - it issues N+1, where N is the number of brands you have... 100 brands is 101 DB hits!
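To make the N+1 pattern concrete, here is a rough sketch using the table names from the question (everything beyond id and brand_id is assumed):

-- One round trip: everything comes back in a single result set.
SELECT b.*, c.*
FROM brands b
INNER JOIN cars c ON c.brand_id = b.id;

-- N+1 round trips: one query for the brands, then one more per brand.
SELECT id FROM brands;                    -- 1 query
SELECT * FROM cars WHERE brand_id = 1;    -- plus one of these...
SELECT * FROM cars WHERE brand_id = 2;    -- ...for every brand id returned
-- and so on, N more times.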
Your two queries are not exactly the same.
The first returns all fields from brands and cars in each row. The second returns two different result sets that need to be combined.
In general, it is better to do as many operations in the database as possible. The database is more efficient for processing large amounts of data. And, it generally reduces the amount of data being brought back to the client.
That said, there are a few circumstances where more data is being returned in a single query than with multiple queries. For instance in your example, if you have one brand record with 100 columns and 10,000 car records with three columns, then the two-query method is probably faster. You are only bringing back the columns from brands table once rather than 10,000 times.
These examples where multiple queries is better are few and far between. In general, it is better to do the processing in the database. If performance needs to be improved, then in a few rare cases, you might be able to break up queries and improve performance.
In general, use the first query. Why? Because query execution time is not just the time of the query itself; it also includes several overheads, such as:
Creating connection overhead
Network data sending overhead
Closing (handling) connection overhead
Depending on the situation, some of these overheads may or may not be present. For example, if you're using a persistent connection, you won't pay the connection overhead; but in the common case that's not true, so it applies. And the overhead of creating/maintaining/closing a connection is a very significant part. Imagine that this overhead is only 1% of the total query time (in a real situation it will be much more), and that you have, say, 1,000,000 rows. The first query pays that overhead only once, while the second approach pays it 1,000,000 times, which at 1% per query works out to 1,000,000/100 = 10,000 times the cost of an entire query. Just think about how slow that will be.
Besides, the INNER JOIN will also be done using a key, if one exists, so in terms of the query itself its speed will be nearly the same as the second option's. So I highly recommend using the INNER JOIN option.
Breaking a complex query into simple queries may be useful in very specific cases. For example, consider an IN subquery. If you write WHERE id IN (subquery), where (subquery) is some SQL, MySQL treats it as an = ANY subquery and will not use a key for it, even if the subquery produces a narrow list of ids. And yes, splitting it into two queries can make sense, because WHERE IN (static list) works differently: MySQL will use a range index scan for it (strange, but true, because for an IN (static list) the IN is treated as a comparison operator rather than as an = ANY subquery qualifier). This part isn't directly about your case, but it shows that cases do exist where splitting processing away from the DBMS is useful for performance.
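A rough sketch of that split, with made-up table and column names:

-- Older MySQL versions may treat this as "= ANY (subquery)" and skip the
-- index on big_table.ref_id:
SELECT * FROM big_table
WHERE ref_id IN (SELECT id FROM small_table WHERE flag = 1);

-- Split it: first fetch the id list...
SELECT id FROM small_table WHERE flag = 1;   -- suppose it returns 3, 7, 42

-- ...then feed it back as a static list, which can use a range scan
-- on the ref_id index:
SELECT * FROM big_table WHERE ref_id IN (3, 7, 42);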
One query is better, because up to about 90% of the expense of executing a query is in the overheads:
communication traffic to/from database
syntax checking
authority checking
access plan calculation by optimizer
logging
locking (even read-only requires a lock)
lots of other stuff too
Do all that just once for one query, or do it all n times for n queries, but get the same data.
I realize this is a sort of meta-programming question, but I'm assuming there are enough experienced people here to give a decent answer.
I was just building a query again, to retrieve some data from a table.
SELECT pl.field1, pl.field2
FROM table pl
LEFT JOIN table2 dp on pl.field1 = dp.field1
WHERE dp.field1 IS NULL
Executing this query took ages (1800+ seconds).
After I got sick of waiting, and made the effort to EXPLAIN the query, it turned out that a full table scan was done.
I created an index on dp.field1 and the query was almost instant thereafter; creating that index took less than a second.
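For reference, the fix amounts to something like the following, reusing the placeholder names from the question:

-- table2/field1 are the placeholder names from the query above.
CREATE INDEX idx_table2_field1 ON table2 (field1);

-- Re-check the plan; the LEFT JOIN ... IS NULL anti-join can now use the index.
EXPLAIN
SELECT pl.field1, pl.field2
FROM table pl
LEFT JOIN table2 dp ON pl.field1 = dp.field1
WHERE dp.field1 IS NULL;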
Judging from the EXPLAIN, this wasn't too difficult to determine. Why can't, or won't, MySQL do this automatically? Spending just a second to create that index will make the query instant, so MySQL could theoretically create a temporary index, use it to do the query and then remove it again, which would still be orders of magnitude faster than the alternative.
I'm expecting the usual answers of 'to make sure you design a good schema' or 'mysql just does what you tell it to do', but I'm wondering if there might be a technical reason why this is a bad idea.
For columns with low cardinality it is not a good idea to use a B-tree index. B-tree indexes degenerate at low cardinality and can in fact increase query time compared to a full table scan.
So always creating a B-tree index is not a good idea. At the very least, cardinality would have to be considered too, and probably several other things as well.
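A hypothetical illustration of the low-cardinality case (the table and column are made up):

-- An index on a yes/no flag matches roughly half the table, so the optimizer
-- may well ignore it and scan anyway; maintaining it still costs something.
CREATE INDEX idx_orders_is_archived ON orders (is_archived);
SELECT * FROM orders WHERE is_archived = 0;   -- likely a full scan regardless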
Quite simply - because the idea doesn't really scale using the current design of RDBMS engines.
It's okay for a single user, but databases are designed to support many concurrent users. Having each user's query also run a speculative optimization step ("can I speed up this query by creating an index?") and then actually create that index, which in some circumstances is a very expensive operation, would become slow at any degree of scale. Making the index "single use" would waste both computation time and disk space, while keeping lots of permanent indices would in turn slow down the query optimizer, which has to investigate many indices for a given query. It would also slow down data modification operations.
Admittedly, on modern hardware, these concerns are a lot less significant - basic design of RDBMS engines dates back to the days when disk space was expensive, CPUs were several orders of magnitude slower, and memory was an unimaginable luxury.
I'm only speaking for MySQL because there may be a database system out there that automatically modifies your database design.
The simple answer is, MySQL simply does what you tell it to do.
MySQL cannot predict the future. Only you can. You know much more about your data than MySQL does. MySQL keeps some statistics, but it's guessing the best way to execute your query on very sparse information (that is sometimes outdated) before it actually tries to do so. Once it starts executing, it doesn't change its plan, no matter how wrong the guess was.
The methods that it uses to guess are all very well documented. It's our job to provide the indexes that will provide the most benefit, and even, at times, hint that it should use those indexes.
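For instance, MySQL supports index hints; a hypothetical example (the table and index names are made up):

-- Ask the optimizer to prefer a specific index (USE INDEX), or insist on it (FORCE INDEX).
SELECT *
FROM orders FORCE INDEX (idx_customer_created)
WHERE customer_id = 42;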
If you tell MySQL to perform a query that requires a table scan, it assumes you know that it's going to do a table scan, because it told you in its documentation that it would. It simply obeys.
Database systems that don't allow the DBA to make decisions don't scale well. There are always tradeoffs to be made, and you're the one to make them. MySQL is a hammer, not a carpenter.
Sorry for all the text; the most important stuff is in the last 3 paragraphs. :D
Recently we had a MySQL problem on one of our client servers. Out of the blue, the CPU usage of the mysql process started skyrocketing. This led us to finding and optimizing bad queries, and here is the problem.
I was thinking that optimization means speeding up queries (the total time needed for a query to execute). But after I optimized several queries with that goal, my colleague started complaining that some queries read too many rows, even all rows from a table (as shown by EXPLAIN).
After rewriting a query I noticed that if I want the query to read fewer rows, its speed suffers; if the query is tuned for speed, more rows are read.
And that didn't make sense to me: fewer rows read, but longer execution time.
That made me wonder what should be done. Of course it would be perfect to have a fast query that reads the fewest rows, but since that doesn't seem possible for me, I'm searching for some answers. Which approach should I take: speed, or fewer rows read? What are the pros and cons when a query is fast but reads more rows, versus when fewer rows are read but speed suffers? What happens on the server in each case?
After googling, all I could find was articles and discussions about how to improve speed, but none of them covered the different cases I mentioned above.
I'm looking forward to seeing even personal choices of course with some reasoning.
Links which could direct me right way are welcome too.
I think your problem depends on how you are limiting the number of rows read. If you read fewer rows by adding more WHERE clauses that MySQL needs to evaluate, then yes, performance will take a hit.
I would look at indexing some of the columns that make your search more complex. Simple data types are faster to look up than complex ones. Check whether you are searching against indexed columns.
Without more data, I can give you some hints:
Be sure your tables are properly indexed. Create the appropriate indexes for each of your tables. Also drop the indexes that are not needed.
Decide the best approach for each query. For example, if you use group by only to deduplicate rows, you are wasting resources; it is better to use select distinct (on an indexed field).
"Divide and conquer". Can you split your process in two, three or more intermediate steps? If the answer is "yes", then: Can you create temporary tables for some of these steps? I've split proceses using temp tables, and they are very useful to speed things up.
The count of rows read reported by EXPLAIN is an estimate anyway -- don't take it as a literal value. Notice that if you run EXPLAIN on the same query multiple times, the number of rows read changes each time. This estimate can even be totally inaccurate, as there have been bugs in EXPLAIN from time to time.
Another way to measure query performance is SHOW SESSION STATUS LIKE 'Handler%' as you test the query. This will tell you accurate counts of how many times the SQL layer made requests for individual rows to the storage engine layer. For examples, see my presentation, SQL Query Patterns, Optimized.
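A sketch of that measurement, with a hypothetical query under test:

FLUSH STATUS;   -- reset the session's status counters (may require the RELOAD privilege)
SELECT * FROM orders WHERE customer_id = 42;   -- the query under test (made up)
SHOW SESSION STATUS LIKE 'Handler%';   -- Handler_read_* shows actual row requests to the storage engine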
There's also the issue of whether the rows requested were already in the buffer pool (I'm assuming you use InnoDB), or whether the query had to read them from disk, incurring I/O operations. A small number of rows read from disk can be orders of magnitude slower than a large number of rows read from RAM. This doesn't necessarily account for your case, but it points out that such a scenario can occur, and "rows read" doesn't tell you whether the query caused I/O or not. There might even be multiple I/O operations for a single row, because of InnoDB's multi-versioning.
Insight into the difference between logical row requests and physical I/O reads is harder to get. In Percona Server, enhancements to the slow query log include the count of InnoDB I/O operations per query.
I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating them into a couple of tables. What I'm asking is: is there any pattern or design technique for handling this kind of problem (a huge number of records)? Are MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem stated above is just an example case; in reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some of these queries because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast with MySQL InnoDB, and it's not going to be fast in Oracle or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs).
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
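For example, a trigger-maintained count might look roughly like this; the table names are hypothetical:

CREATE TABLE row_counts (
  table_name VARCHAR(64) PRIMARY KEY,
  row_count  BIGINT NOT NULL
);

-- Seed the count once.
INSERT INTO row_counts SELECT 'big_table', COUNT(*) FROM big_table;

-- Keep it up to date on every insert and delete.
CREATE TRIGGER big_table_count_ins AFTER INSERT ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'big_table';

CREATE TRIGGER big_table_count_del AFTER DELETE ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'big_table';

-- Now the "count" is a single-row primary-key lookup.
SELECT row_count FROM row_counts WHERE table_name = 'big_table';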
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow COUNT(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL, and COUNT(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the table's total row count. They maintain this because it's useful to the optimiser in some cases, but a side effect is that COUNT(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in this Stack Overflow posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data-warehousing point of view, but many of the differences also matter for transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of an answer, and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indices on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in WHERE clauses if at all possible (see the sketch after this list).
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, mysql does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high-price and quality; Postgres (aka PostgreSql) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
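To illustrate the point about functions in WHERE clauses, a hypothetical example (the table and its index are made up):

-- Wrapping the column in a function prevents an index on created_at from being used:
SELECT * FROM orders WHERE YEAR(created_at) = 2023;

-- The equivalent range predicate can use the index:
SELECT * FROM orders
WHERE created_at >= '2023-01-01' AND created_at < '2024-01-01';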
I'm going to second @Mark Baker, and say that you need to build indices on your tables.
For queries other than the one you posted, you should also be aware that using constructs such as IN() is faster than a series of OR conditions. There are lots of little steps you can take to speed up individual queries.
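For example (borrowing the cars table from earlier in this thread purely for illustration):

-- These are equivalent, but IN() is typically at least as fast and is easier to read:
SELECT * FROM cars WHERE brand_id IN (1, 2, 3);
SELECT * FROM cars WHERE brand_id = 1 OR brand_id = 2 OR brand_id = 3;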
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance-tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE-clause fields), and avoid cursors (although I think this is less of an issue in Oracle than in SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance-tuning book and read it. Here is a link to one for MySQL:
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716