I have a scenario where I need to check 10,000 different specific names against a table with about 60,000 name records. Assuming caching is not relevant, generally speaking, which is better for performance:
(1) Break the work up into mini-queries, say 200 names per query?
or
(2) Write one humongous SQL statement with 10,000 "OR" clauses?
You missed out number 3: Do it another way entirely:
I would write the list to a separate table/temp table or something, then filter using a join/exists or whatever.
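A minimal sketch of that approach, assuming MySQL and hypothetical people / names_to_check tables:

CREATE TEMPORARY TABLE names_to_check (
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (name)
);

-- load the 10,000 names in bulk (multi-row INSERTs or LOAD DATA), then:
INSERT INTO names_to_check (name) VALUES ('Alice'), ('Bob') /* , ... */;

-- filter with a join instead of a giant predicate:
SELECT p.*
FROM people p
JOIN names_to_check n ON n.name = p.name;

The optimizer can then pick a join strategy based on the table sizes, instead of parsing a 10,000-term predicate.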
A first observation: most RDBMSs limit the length of the query string, which you might exceed with so many ORs.
A workaround would be to write a stored procedure and process the names in a loop.
That aside, since the data would be accessed more times in case (1) than in case (2), the latter is preferable.
Or #4 - Use an IN() query in batches. About 1000 usually works pretty well:
SELECT * FROM table WHERE name IN ('str1', 'str2', 'str3', ...)
It's not perfect, but there's no temporary table involved, and MySQL is pretty good about optimizing IN().
Related
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list are non-unique, however (and I'd guess that in the example a "TypeId" is a foreign key), then I'm more interested in the distribution. I wonder whether the optimiser will check the stats for each value in the list. Say it checks the first value and finds it's in 20% of the rows (of a table large enough to matter). It'll probably table scan. But will the same query plan be used for the other two values, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching: beware a subquery in the IN clause. It's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
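To illustrate the restatement (hypothetical EmployeeType table and IsContractor column, not from the question above):

-- subquery in the IN clause:
SELECT e.EmployeeId
FROM Employee e
WHERE e.EmployeeTypeId IN (SELECT t.EmployeeTypeId
                           FROM EmployeeType t
                           WHERE t.IsContractor = 1);

-- the same requirement restated as a join:
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.IsContractor = 1;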
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try to work out the best way to deal with it?
Whether an index is used doesn't vary so much with the type of query as with the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if the query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, it may also be faster to do a table scan.
Another variable that might not be so obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again in PostgreSQL, the ILIKE operator is one of them).
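A quick PostgreSQL illustration of the datatype point (hypothetical table; the exact plan depends on version and statistics):

CREATE TABLE measurements (qty integer);
CREATE INDEX idx_measurements_qty ON measurements (qty);

EXPLAIN SELECT * FROM measurements WHERE qty = 5;    -- can use the index
EXPLAIN SELECT * FROM measurements WHERE qty = 5.0;  -- the numeric literal forces a cast, so the index is likely skipped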
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
This (MySQL's hint syntax; the SQL Server equivalent is the WITH (INDEX(...)) form shown above) restricts the optimizer to the index you created. It works for me. Give it a try.
I have a view (say 'v') that is the combination of 10 tables, using several joins and complex calculations. That view has around 10,000 rows.
I then select one row from it with WHERE id = 23456.
Another possible way would be a larger query that cuts the dataset down to 1% before the complex calculations start.
Question: Are SQL views optimized in some form?
MySQL views are just syntactic sugar; there is no special optimization. Think of a view as being textually merged into the query and then optimized. That is, you could get the same optimizations (or not) by manually writing the equivalent SELECT.
If you would like to discuss the particular query further, please provide SHOW CREATE TABLE/VIEW and EXPLAIN SELECT .... It may be that you are missing a useful 'composite' index.
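To illustrate the "textually merged" point, a minimal sketch with hypothetical tables a and b:

CREATE VIEW v AS
    SELECT a.id, a.x, b.y
    FROM a
    JOIN b ON b.a_id = a.id;

-- this query...
SELECT * FROM v WHERE id = 23456;

-- ...is typically planned the same as the hand-merged equivalent:
SELECT a.id, a.x, b.y
FROM a
JOIN b ON b.a_id = a.id
WHERE a.id = 23456;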
I have a stored procedure in which a temporary table is created.
There are 16 different SELECT statements used to insert data into the temp table, each joining 4 tables at a time.
A new requirement is to apply a few more WHERE conditions based on some input parameters.
My question is, I have two choices now:
apply the conditions in the WHERE clause of each SELECT statement while inserting data into the temporary table, or
apply no conditions while inserting the data, and at the end delete the unneeded data from the temp table.
The second approach seems simpler, but I'm worried about performance, since unnecessary data would initially be inserted, whereas the first approach applies multiple filters every time.
Can anyone advise which approach should be used?
Basically, among filtering, insertion, and deletion, which takes more time?
All tables have thousands of rows in them.
It's hard to answer without the exact details, but generally speaking, the first approach sounds better.
The second approach means you'll be doing (potentially, depends on the exact conditions) twice the I/O - once to copy the data into the temp table, and again to delete it. If your dataset is large, this will be considerable.
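A minimal T-SQL sketch of the first approach (hypothetical tables, columns, and @RegionId parameter; only one of the 16 inserts shown):

-- the optional filter is pushed into the WHERE clause of the insert:
INSERT INTO #results (OrderId, Amount)
SELECT o.OrderId, o.Amount
FROM Orders o
JOIN Customers c ON c.CustomerId = o.CustomerId
WHERE (@RegionId IS NULL OR c.RegionId = @RegionId);

Only the rows you actually need ever hit the temp table, so there is nothing to delete afterwards.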
Your first approach is better.
If the selects are taking time, you should optimize them so that they use the indexes.
Secondly, filling a large temp table and then selecting only a handful of records from it doesn't solve your problem.
I've Googled this question and can't seem to find a consistent opinion, or many opinions that are based on solid data. I simply would like to know if using the wildcard in a SQL SELECT statement incurs additional overhead than calling each item out individually. I have compared the execution plans of both in several different test queries, and it seems that the estimates always read the same. Is it possible that some overhead is incurred elsewhere, or are they truly handled identically?
What I am referring to specifically:
SELECT *
vs.
SELECT item1, item2, etc.
SELECT * FROM...
and
SELECT every, column, list, ... FROM...
will perform the same because both are an unoptimised scan
The difference is:
the extra lookup in sys.columns to resolve *
the contract/signature change when the table schema changes
inability to create a covering index. In fact, no tuning options at all, really
views have to be refreshed if they are not schema-bound
cannot index or schema-bind a view that uses *
...and other stuff
Other SO questions on the same subject...
What is the reason not to use select * ?
Is there a difference betweeen Select * and Select list each col
SQL Query Question - Select * from view or Select col1,col2…from view
“select * from table” vs “select colA,colB,etc from table” interesting behaviour in SqlServer2005
Do you mean select * from ... instead of select col1, col2, col3 from ...?
I think it's always better to name the columns and retrieve the minimal amount of information, because
your code will work independently of the physical order of the columns in the db. The column order should not impact your application, but it will if you use *. This can be dangerous in the case of a db migration, etc.
if you name the columns, the DBMS can further optimize the execution. For instance, if there is an index that contains all the data you are interested in, the table will not be accessed at all (see the sketch below).
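A minimal sketch of that covering-index case (hypothetical orders table):

CREATE INDEX idx_orders_customer_total ON orders (customer_id, total);

SELECT customer_id, total FROM orders WHERE customer_id = 42;  -- can be answered from the index alone
SELECT * FROM orders WHERE customer_id = 42;                   -- has to fetch the full rows as well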
If you mean something else with "wildcard", just ignore my answer...
EDIT: If you are talking about the asterisk wild card as in Select * From ... then see other responses...
If you are talking about wildcards in predicate clauses, or other query expressions using Like operator, (_ , % ) as described below, then:
This has to do with whether using the wildcard affects whether the SQL is SARGable or not. SARGable (Search-ARGument-able) means whether or not the query's search or sort arguments can be used as entry parameters to an existing index. If you prepend the wildcard to the beginning of an argument
Where Name Like '%ing'
Then there is no way to traverse an index on the name field to find the nodes that end in 'ing'.
If, on the other hand, you append the wildcard to the end,
Where Name like 'Donald%'
then the optimizer can still use an index on the name column, and the query remains SARGable.
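Putting the two cases side by side (hypothetical People table and index):

CREATE INDEX IX_People_Name ON People (Name);

SELECT * FROM People WHERE Name LIKE 'Donald%';  -- SARGable: index seek on the 'Donald' prefix
SELECT * FROM People WHERE Name LIKE '%ing';     -- not SARGable: the leading wildcard forces a scan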
If what you call the SQL wildcard is *, it does not imply a performance overhead by itself. However, if the table is later extended, you could find yourself retrieving fields you don't need.
In general, not being specific about the fields you select or insert is a bad habit.
Consider
insert into mytable values(1,2)
What happens if the table is extended to three fields?
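A minimal fix, assuming the two original columns are named a and b:

-- naming the columns keeps the insert valid after a third column is added:
INSERT INTO mytable (a, b) VALUES (1, 2);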
It may not be more work from an execution plan standpoint. But if you're fetching columns you don't actually need, that's additional network bandwidth being used between the database and your application. Also if you're using a high-level client API that performs some work on the returned data (for example, Perl's selectall_hashref) then those extra columns will impose performance cost on the client side. How much? Depends.
Based on this question here, Selecting NOT NULL columns from a table, one of the posters said
you shouldn't use SELECT * in production.
My Question: Is it true that we shouldn't use Select * in a mysql query on a production server? If yes, why shouldn't we use select all?
Most people do advise against using SELECT * in production, because it tends to break things. There are a few exceptions though.
SELECT * fetches all columns, while most of the time you don't need them all. This causes the SQL server to send more columns than needed, which is wasteful and makes the system slower.
With SELECT *, when you later add a column, the old query will also select this new column, while typically it will not need it. Naming the columns explicitly prevents this.
Most people who write SELECT * queries also tend to grab the rows and use column order to get the columns - which WILL break your code once columns are injected between existing columns.
Explicitly naming the columns also guarantees they are always in the same order, while SELECT * might behave differently when the table column order is modified.
But there are exceptions, for example statements like these:
INSERT INTO table_history
SELECT * FROM table
A query like that takes rows from table and inserts them into table_history. If you want this query to keep working when new columns are added to table AND to table_history, SELECT * is the way to go.
Remember that your database server isn't necessarily on the same machine as the program querying the database. The database server could be on a network with limited bandwidth; it could even be halfway across the world.
If you really do need every column, then by all means do SELECT * FROM table.
If you only need certain columns, though, it would waste bandwidth to ask for all columns using SELECT * FROM table only to throw half the columns away.
Other potential reasons it might be good to specify which exact columns you want:
The database structure may change. If your program assumes certain column names, then it may fail if the column names change, for example. Explicitly naming the columns you want to retrieve will make the program fail immediately if your assumptions about the column names are violated.
As @Konerak mentioned, naming the columns you want also ensures that the order of the columns in your result is the same, even if the table schema changes (i.e. inserting one column in between two others). This is important if you're depending on FirstName being the 2nd element of a result.
(Note: a more robust and self-documenting way of dealing with this is to ask for your database results as a list of key-value pairs, like a PHP associative array, Perl hash or Python dict. That way you never need to use a number to index into the result (name = result[2]) - instead you can use the column name: name = result["FirstName"].)
Using SELECT * is very inefficient, especially for tables that have a lot of columns. You should only select the columns you need.
Besides this, using column names makes the query easier to read and maintain.