Which is more efficient, and by how much?
type 1:
INSERT INTO table_name (column1, column2, ...)
SELECT column1, column2, ...
FROM another_table
WHERE columnX IN (value_list)
type 2:
INSERT INTO table_name (column1, column2, ...)
VALUES (column1_0, column2_0, ...), (column1_1, column2_1, ...)
The first version looks short, and the second may become extremely long when value_list contains, say, 500 or more values.
But I have no idea which will perform better, though intuitively the first feels like it should be more efficient.
The first is cleaner, especially if your source data is already in MySQL (which I'm assuming is what you're saying?). You would save some network overhead sending the data, some parsing time, and you would have to worry less about hitting whatever statement-size limit your client has.
However, in general, I would expect the performance to be similar as the number of rows grows larger, especially on a well-indexed table. Most of the time for large inserts is spent doing things like maintaining indexes (see here), and both of those queries, absent turning indexes off, would have to do that.
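If you want to check that statement-size limit, or experiment with turning off index maintenance during a bulk load, here is a hedged starting point (max_allowed_packet is a standard MySQL variable; DISABLE KEYS only affects non-unique indexes and only on MyISAM tables; table_name is a placeholder):
SHOW VARIABLES LIKE 'max_allowed_packet';  -- upper bound on the size of a single statement/packet
ALTER TABLE table_name DISABLE KEYS;       -- MyISAM only: defer non-unique index maintenance
-- ... bulk insert here ...
ALTER TABLE table_name ENABLE KEYS;        -- rebuild the deferred indexes in one pass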
I agree with Todd: the first query is cleaner, will be faster to send to the MySQL server, and will be faster to compile. And it's probably true that as the number of inserted records increases, the speed differential will drop.
But the first form has substantial other benefits to consider:
It's far easier to maintain: you only have to add or modify a field every now and then.
You avoid the expense of querying another_table and processing the results to concatenate the second query (a hidden cost of that approach).
If you need to run this update more than once, the first query can be cached in the MySQL server along with its compiled form and query plan. This makes subsequent invocations of the query run a bit faster.
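If you do run it more than once, here is a hedged sketch of that reuse with a server-side prepared statement (all table and column names are placeholders; the prepared statement mainly saves the parse step):
PREPARE ins FROM
  'INSERT INTO table_name (column1, column2)
   SELECT column1, column2 FROM another_table WHERE columnX IN (1, 2, 3)';
EXECUTE ins;              -- subsequent EXECUTEs skip re-parsing the statement text
DEALLOCATE PREPARE ins;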
For example, there is a table named paper, and I execute SQL like:
SELECT paper.user_id, paper.name, paper.score FROM paper WHERE user_id IN (201, 205, 209, ...)
I observed that when this statement is executed, the index is only used when the number of values in the IN list is below a certain number, and that number is dynamic.
For example, when the total number of rows in the table is 4,000 and the index cardinality is 3,939, the IN list must have fewer than 790 values for MySQL to use an index query.
(Looking at MySQL EXPLAIN: if < 790, type = range; if > 790, type = ALL.)
When the total number of rows in the table is 1,300,000 and the cardinality is 1,199,166, the IN list must have fewer than about 8,500 values for MySQL to use an index query.
The result of this experiment is very strange to me.
I imagined that if I implemented this IN query, I would first find the minimum and maximum values in the list, then find the pages where those values are located, and exclude the pages before the minimum and after the maximum. That would surely be faster than performing a full table scan.
Then, my test data can be summarized as follows:
Data in the table: 1 to 1,300,000
Values in the IN list: 900,000 to 920,000
My question is: in a table with 1,300,000 rows, why does MySQL decide that once the IN list has more than about 8,500 values, it should not use an index query?
MySQL version: 5.7.20
In fact, this magic number is 8,452. When the total number of rows in my table is 600,000 it is 8,452, and when the total number of rows is 1,300,000 it is still 8,452. Here are my test observations:
When the IN list has 8,452 values, the query takes only 0.099 s.
The execution plan shows a range query (type = range).
If I increase the IN list from 8,452 to 8,453 values, the query takes 5.066 s, even though the added element is a duplicate.
The execution plan now shows type = ALL (a full table scan).
This is really strange. It means that if I first run the query with 8,452 IN values and then run a second query for the rest, the total time is much less than directly running the query with 8,453 IN values.
Can anyone debug the MySQL source code to see what happens in this process?
Thanks very much.
Great question and nice find!
The query planner/optimizer has to decide whether it is going to seek the pages it needs to read, or start reading many more and scan for the ones it needs. The seek strategy is more memory- and especially CPU-intensive, while the scan is usually significantly more expensive in terms of I/O.
The bigger a table, the less attractive the seek strategy becomes. For a large table, a bigger part of the nonclustered index used for the seek has to come from disk, memory pressure rises, and the potential for sequential reads shrinks the longer the seek takes. Therefore the rows-to-results ratio up to which a seek is considered drops as the table size rises.
If this is a problem, there are a few things you could try to tune. But when it is a problem for you in production, it might also be the right time to consider a server upgrade, optimize the queries and software involved, or simply adjust expectations.
'Harden' or (re)enforce the query plans you prefer
Tweak the engine (when this is a problem that affects most tables, server/database settings can perhaps be optimized)
Optimize nonclustered indexes
Provide query hints (see the sketch after this list)
Alter tables and datatypes
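For instance, a hedged sketch of the query-hint option against the question's paper table (the index name idx_user_id is an assumption):
SELECT user_id, name, score
FROM paper FORCE INDEX (idx_user_id)
WHERE user_id IN (900000, 900001 /* , ... */);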
It is usually folly to do a query in 2 steps. That framework seems to be fetching ids in one step, then fetching the real stuff in a second step.
If the two queries are combined into a single one (with a JOIN), the Optimizer is mostly forced to do the random lookups.
"Range" is perhaps always the "type" for IN lookups. Don't read anything into it. Whether IN looks at min and max to try to minimize disk hits -- I would expect this to be a 'recent' optimization. (I have not seen it in the Changelogs.)
Are those UUIDs with the dashes removed? They do not scale well to huge tables.
"Cardinality" is just an estimate. ANALYZE TABLE forces the recomputation of such stats. See if that changes the boundary, etc.
The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order each time, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Assume, of course, that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is the clustered index in MySQL, so scanning it is usually the most efficient way to retrieve the rows.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), and that you'll therefore get the same query plan, that is not guaranteed.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
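A minimal keyset-pagination sketch of that last point, assuming id is the primary key and :last_seen_id is the largest id from the previous page (names are placeholders):
SELECT *
FROM my_table
WHERE id > :last_seen_id
ORDER BY id
LIMIT 100;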
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indexes, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries; see the sketch after this list).
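As a small illustration of that last point (table and column names are hypothetical), an aggregate query's result order has nothing to do with insertion order:
SELECT status, COUNT(*) AS cnt
FROM my_table
GROUP BY status;
-- MySQL may build a temporary table or sort internally; without an ORDER BY,
-- the order of the groups is not something you can rely on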
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP 2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was shown on two different pages: the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.
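A hedged sketch of the usual fix for that anecdote: add a unique column as a tiebreaker so the sort is deterministic (table and column names are hypothetical):
SELECT * FROM items ORDER BY type, id LIMIT 10;
SELECT * FROM items ORDER BY type, id LIMIT 10 OFFSET 10;
-- with the unique id as a tiebreaker, no row can qualify for both pages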
I have seen several questions comparing SELECT * to selecting all columns explicitly, but what about selecting fewer columns vs. more?
In other words, is:
SELECT id,firstname,lastname,lastlogin,email,phone
More than negligibly slower than:
SELECT id,firstname,lastlogin
I realize there will be small differences from more data being transferred through the system and to the application, but that is a total data/load difference, not a cost of the query itself (larger data in the cells would have the same effect anyway, I believe). I'm only trying to optimize my query, as I will have to load ALL the data at some point anyway...
When my admin user logs in, I'm going to load the entire user database into a cache, but I can either query only critical data upfront to shave some execution time, or just get everything - if it works out roughly the same. I know more rows equals longer query execution - but what about more selected values in my query?
Under most circumstances, the only difference is going to be slightly larger data for these fields and the additional time to fetch them.
There are two things to consider:
If the additional fields are very big, then this could be a big difference in performance.
If there is an index that covers the columns you actually want, then the index can be used for the query. This could speed the query up in the database (see the sketch below).
In general, though, the advice is to return the columns you want to the application. If there is complex processing, you should consider doing that in the database rather than the application.
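A sketch of the covering-index point above (table and index names are hypothetical, and whether it pays off depends on your schema and data sizes):
CREATE INDEX idx_login_cache ON users (id, firstname, lastlogin);
EXPLAIN SELECT id, firstname, lastlogin FROM users;
-- may show type: index, Extra: Using index -- served from the index alone
EXPLAIN SELECT id, firstname, lastname, lastlogin, email, phone FROM users;
-- has to read the full rows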
This might be a weird question, but I didn't know how to research it. When doing the following query:
SELECT Foo.col1, Foo.col2, Foo.col3
FROM Foo
INNER JOIN Bar ON Foo.ID = Bar.BID
I tend to use TableName.Column instead of just col1, col2, col3
Is there any performance difference? Is it faster to specify Table name for each column?
My guess would be that yes, it is faster, since it would take some time to look up the column name and disambiguate it.
If anyone knows a link where I could read up on this, I would be grateful. I did not even know how to title this question better, since I'm not sure how to search for it.
First of all: this should not matter. The time to look up the columns is such a minuscule fraction of the total processing time of a typical query that this is probably the wrong spot to look for additional performance.
Second: Tablename.Colname is faster than Colname alone, as it eliminates the need to search the referenced tables (and table-like structures such as views and subqueries) for a fitting column. Again: the difference is within the statistical noise.
Third: using Tablename.Colname is a good idea, but for other reasons. If you use Colname alone, and one of the tables in your query gets a new column with the same name, you end up with the oh-so-well-known "ambiguous column name" error. Typical candidates for such columns are "comment", "lastchanged", and friends. If you qualify your column references, this maintainability problem simply disappears: your query will work as always, ignoring the new fields.
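A hedged sketch of that maintainability trap (table and column names are hypothetical):
SELECT comment
FROM posts JOIN users ON posts.user_id = users.id;
-- works until users also gets a comment column, then fails with
-- ERROR 1052: Column 'comment' in field list is ambiguous
SELECT posts.comment
FROM posts JOIN users ON posts.user_id = users.id;
-- keeps working regardless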
If it's faster, the difference is surely negligible, like a few microseconds per query. All the data about the tables mentioned in the query has to be loaded into memory, so it doesn't save any disk access. It's done during query parsing, not during data processing. Even if you run the query thousands of times, it might not make up for the time spent typing those extra characters, and certainly not the time we've spent discussing it. :)
But it makes the queries longer, so there's slightly more time spent in communications. If you're sending the query over a network, that will probably negate any time saved during parsing. You can reduce this by using short table aliases, though:
SELECT t.col1, t.col2
FROM ReallyLongTableName t
As a general rule, when worrying about database performance you only need to concern yourself with aspects whose time depends on the number of rows in the tables. Anything that's the same regardless of the amount of data will fall into the noise, unless you're dealing with extremely tiny tables (in which case, why are you bothering with a database -- use a flat file).
Which will take more execution time, an insert operation or a select operation, if both are single queries affecting only one row?
For example:
INSERT INTO example VALUES ('id', 'name', 'email')
or
SELECT * FROM example WHERE id = 'id';
For all benchmarking, you need to keep in mind:
It's not usually a simple matter; there are a large number of factors that can affect the figures.
Some of these factors are the number of indexes, historical patterns of read/write, database tuning, disk layout, fragmentation and so forth.
It rarely matters for a small thing like a one-row operation, provided you have an intelligent setup (correct indexes and so on).
The best way to tell, for a given setup, is to test.
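A minimal sketch of such a test, using the question's example table (the values are placeholders; the mysql command-line client reports the elapsed time after each statement, so run each many times and average):
INSERT INTO example VALUES ('id42', 'some name', 'someone@example.com');
SELECT * FROM example WHERE id = 'id42';
-- for the SELECT side, EXPLAIN also shows whether the lookup can use an index on id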
In any case, this sort of question usually arises when you want to choose the faster of two functionally identical options.
In this case, there is zero crossover in functionality so I'm not sure what you will gain with this answer. If you want to insert information, use insert. If you want to extract it, use select.
It's not like you can use select (no matter how fast it may be) to insert data into your database (well, other than as part of insert into ... select ...).