What's the fastest way to check that an entry exists in a database? - mysql

I'm looking for the fastest way to check that an entry exists...
All my life, I've done it with something like this...
SELECT COUNT(`id`) FROM `table_name`
Some people don't use COUNT(id), but COUNT(*). Is that faster?
What about LIMIT 1?
P.S. With id I meant primary key, of course.
Thanks in advance!

In most situations, COUNT(*) is faster than COUNT(id) in MySQL. This is because of how grouping queries with COUNT() are executed; it may be optimized in future releases so that both versions run the same. But if you only want to find out whether at least one row exists, you can use EXISTS.
simple:
( SELECT COUNT(id) FROM table_name ) > 0
a bit faster:
( SELECT COUNT(*) FROM table_name ) > 0
much faster:
EXISTS (SELECT * FROM table_name)

If you aren't worried about accuracy, EXPLAIN SELECT COUNT(field) FROM table is incredibly fast, because it returns the optimizer's row estimate instead of running the query.
http://www.mysqlperformanceblog.com/2007/04/10/count-vs-countcol/
This link explains the difference between COUNT(*) and COUNT(field). When in doubt, use COUNT(*).
As for checking that a table is not empty...
SELECT EXISTS(SELECT 1 FROM table)
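The difference between counting and probing is easy to try out. The following is a minimal sketch, not a benchmark: it uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption, since plans and timings differ between engines), and the table name entries is hypothetical. It only shows that EXISTS answers the yes/no question directly, returning 1 or 0 instead of a row count.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
# "entries" is a hypothetical table; ids are assigned as 1, 2, 3.
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO entries (name) VALUES (?)", [("a",), ("b",), ("c",)])

# COUNT(*) has to count every matching row before it can answer.
count_all = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]

# EXISTS only needs one row, so the engine is free to stop at the first match.
any_row = conn.execute("SELECT EXISTS(SELECT 1 FROM entries)").fetchone()[0]

# Checking for one specific entry by primary key.
probe = "SELECT EXISTS(SELECT 1 FROM entries WHERE id = ?)"
has_id_2 = conn.execute(probe, (2,)).fetchone()[0]
has_id_99 = conn.execute(probe, (99,)).fetchone()[0]

print(count_all, any_row, has_id_2, has_id_99)
```

On MySQL itself, running EXPLAIN on both forms is the more reliable way to see which one the optimizer handles better for a given table.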

Related

SQL engine execution plan of HAVING vs (subquery) WHERE

When executed, is there any difference between the following two SQL queries:
SELECT name, count(*) FROM mytable GROUP BY name HAVING count(*) > 1
And:
SELECT * from (SELECT name, count(*) cnt FROM mytable GROUP BY name) x where cnt > 1
In other words, is HAVING more of a "convenience" clause that saves you from writing a subselect, or does the query engine perform fundamentally differently when a HAVING clause is used compared with the second approach? Currently in MySQL:
Create table:
CREATE TABLE `mytable` (
`name` varchar(20) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
In almost any other database, the two would be equivalent. For conciseness, HAVING is usually a better choice.
At least historically, MySQL materialized subqueries. So, this query:
SELECT *
FROM (SELECT name, count(*) as cnt
FROM mytable
GROUP BY name
) x
WHERE cnt > 1;
suggests that it is going to write out the derived table, and then re-scan it for the final WHERE. However, this makes little difference to performance because the GROUP BY is already reading and writing the data.
So, these queries are probably quite similar in performance on MySQL. And, they would have the same execution plan on almost any other database. The HAVING clause results in the simpler query.
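To check that the two forms really return the same rows, a small experiment helps. This is a rough sketch using Python's sqlite3 module in place of MySQL (an assumption; it says nothing about MySQL's execution plan), with made-up sample names.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (name TEXT NOT NULL DEFAULT '')")
conn.executemany(
    "INSERT INTO mytable (name) VALUES (?)",
    [("ann",), ("ann",), ("bob",), ("cid",), ("cid",), ("cid",)],
)

# Form 1: filter the groups with HAVING.
having_rows = conn.execute(
    "SELECT name, COUNT(*) FROM mytable "
    "GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
).fetchall()

# Form 2: aggregate in a derived table, then filter with WHERE.
derived_rows = conn.execute(
    "SELECT * FROM (SELECT name, COUNT(*) AS cnt FROM mytable GROUP BY name) x "
    "WHERE cnt > 1 ORDER BY name"
).fetchall()

print(having_rows)
print(derived_rows)
```

Both queries keep only the names that occur more than once. To compare what MySQL actually does with each form, run EXPLAIN on both statements there.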

SQL UNION ALL to eliminate duplicates

I found this sample interview question and answer posted on Toptal, reproduced here. But I don't really understand the code. How can a UNION ALL turn into a UNION (distinct) like that? Also, why is this code faster?
QUESTION
Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this?
You can avoid duplicates using UNION ALL and still run much faster than UNION DISTINCT (which is actually the same as UNION) by running a query like this:
ANSWER
SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X
The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.
But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:
SELECT * FROM mytable WHERE a=X OR b=Y
This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching the other term requires a table-scan.
Hence the trick of using UNION to separate into two queries for one term each. Each subquery can use the best index for each search term. Then combine the results using UNION.
But the two subsets may overlap, because some rows where b=Y may also have a=X in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else see some rows twice in the final result.
SELECT * FROM mytable WHERE a=X
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y
UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as SELECT DISTINCT does.
It also feels like even more "wasted" work if the two subsets of rows you are unioning have many rows in common, since that is a lot of rows to eliminate.
But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct. That is, if you guarantee there is no overlap. If you can rely on that, then it would always be a no-op to eliminate duplicates, and therefore the query can skip that step, and therefore skip the costly sorting.
If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.
SELECT * FROM mytable WHERE a=X
UNION ALL
SELECT * FROM mytable WHERE b=Y AND a!=X
These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X then there can be no row that is in both sets.
The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.
So the query achieves an optimized search for two OR terms, without producing duplicates, and requiring no UNION DISTINCT operation.
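The equivalence can be checked on a tiny data set. The sketch below is only an illustration: it runs on SQLite through Python's sqlite3 module rather than on MySQL (an assumption), the id column and sample rows are invented, and X and Y are arbitrary values. It compares the OR query, the UNION version, a plain UNION ALL, and the guarded UNION ALL.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); the id column and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 2), (3, 3, 2), (4, 3, 3)],
)
X, Y = 1, 2  # arbitrary search values


def ids(sql, params):
    """Run a query and return the sorted list of ids it produced."""
    return sorted(row[0] for row in conn.execute(sql, params))


or_ids = ids("SELECT * FROM mytable WHERE a = ? OR b = ?", (X, Y))
union_ids = ids(
    "SELECT * FROM mytable WHERE a = ? UNION SELECT * FROM mytable WHERE b = ?",
    (X, Y),
)
# Without the guard, row 2 (a = X and b = Y) is returned twice.
naive_ids = ids(
    "SELECT * FROM mytable WHERE a = ? UNION ALL SELECT * FROM mytable WHERE b = ?",
    (X, Y),
)
# With the guard, the two branches cannot overlap.
guarded_ids = ids(
    "SELECT * FROM mytable WHERE a = ? "
    "UNION ALL SELECT * FROM mytable WHERE b = ? AND a != ?",
    (X, Y, X),
)

print(or_ids, union_ids, naive_ids, guarded_ids)
```

The sketch declares column a as NOT NULL on purpose. If a were nullable, a != X would also filter out rows where a is NULL, so the guard would have to handle that case explicitly.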
The simplest way is like this, especially if you have many columns:
SELECT *
INTO table2
FROM table1
UNION
SELECT *
FROM table1
ORDER BY column1
I guess this is right (Oracle):
select distinct * from (
select * from test_a
union all
select * from test_b
);
The question is only correct if the table has a unique identifier (a primary key). Otherwise each SELECT can return many identical rows.
To understand why it can be faster, let's look at how a database executes UNION ALL and UNION.
The first simply concatenates the results of two independent queries. These queries can be processed in parallel and their rows passed to the client as they arrive.
The second is concatenation plus duplicate removal. To deduplicate the records from two queries, the database needs to hold all of them in memory or, if memory is not enough, store them in a temporary table and then select the unique ones. This is where performance can degrade. Databases are pretty smart and deduplication algorithms are well developed, but for large result sets it can still be a problem.
UNION ALL with an additional WHERE condition can be faster if an index is used for the filtering.
So that is where the performance magic comes from.
I guess it will work
select col1 From (
select row_number() over (partition by col1 order by col1) as b, col1
from (
select col1 From u1
union all
select col1 From u2 ) a
) x
where x.b =1
This will also do the same trick:
select * from (
select * from table1
union all
select * from table2
) a group by
columns
having count(*) >= 1
or
select * from table1
union all
select * from table2 b
where not exists (select 1 from table1 a where a.col1 = b.col1)

Optimize mysql query with subquery in values list

I am trying to optimize queries to my database. I have the following query:
select date, (
select count(user_id)
from myTable
where logdate = date
) as value
from myTable;
As far as I can see, the second value is computed efficiently. However, is there any common practice to optimize this kind of query in MySQL?
I believe you can avoid writing a subquery and perform the same query using aggregation, which may run faster:
SELECT date, COUNT(user_id) AS numRecords
FROM myTable
GROUP BY date;
Here is a reference on aggregate functions.
You do not have to put group functions in a separate select. Just do
select date, count(user_id) from myTable group by date;
There is no hard and fast rule. In this query, it was a matter of one select being more efficient than two. But here are some tips for beginners on optimizing queries.
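One difference between the two forms is worth seeing on data: the correlated subquery produces one output row per row of the table, while GROUP BY produces one row per date. The sketch below is a hedged illustration on SQLite via Python's sqlite3 module (an assumption, not MySQL); the date column is called logdate here and the sample rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); column names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (logdate TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO myTable (logdate, user_id) VALUES (?, ?)",
    [
        ("2024-01-01", 1),
        ("2024-01-01", 2),
        ("2024-01-02", 1),
        ("2024-01-03", 3),
        ("2024-01-03", 4),
    ],
)

# Correlated subquery: evaluated for every row of the outer table.
correlated = conn.execute(
    "SELECT t.logdate, "
    "(SELECT COUNT(i.user_id) FROM myTable i WHERE i.logdate = t.logdate) AS value "
    "FROM myTable t"
).fetchall()

# Aggregation: one pass, one output row per date.
grouped = conn.execute(
    "SELECT logdate, COUNT(user_id) AS numRecords FROM myTable "
    "GROUP BY logdate ORDER BY logdate"
).fetchall()

print(len(correlated), len(grouped))
print(grouped)
```

The counts per date agree, but the first form repeats each date once per source row, so it would need DISTINCT to match the grouped result exactly.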

Mysql upper limit for count(*)

I've got a query:
select count(*) from `table` where `something`>123
If the table has a few million records, the query runs really slowly even though there's an index on column something. However, what I'm actually interested in is the value of:
min(100000, count(*))
So is there any way to prevent MySQL from counting rows when it already found 100k? I've found something like:
select count(*) from (select 1 from `table` where `something`>123 limit 100000) as `asd`
It's much faster than count(*) if the table has a few million matching entries, but count(*) runs much faster when there are fewer than 100000 matches.
Is there any way to do it faster?
I don't have the points to comment, so I am posting this as an answer...
Have you tried using EXPLAIN to see if your index on something is actually being used? It sounds like this query is doing a Table Scan. Ideally, you will want to see something like "Extra: Using where; Using index".
Out of curiosity, is something a nullable field?
As an aside, perhaps the query optimizer would do better with the following:
select count(*) as cnt
from `table`
where `something` > 123
having count(*) > 100000
Might help to make better use of value range limitation.
select count(*) - (select count(*) from t where something <= 123) as cnt
from t
Another option might be to maintain the count with a trigger on updates.
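The LIMIT-in-a-subquery approach from the question can be wrapped in a small helper to confirm that it computes min(cap, count(*)). This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module rather than MySQL, the table t and the helper names capped_count and full_count are hypothetical, and it demonstrates correctness only, not speed.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); table t and its contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (something INTEGER)")
conn.execute("CREATE INDEX idx_something ON t (something)")
conn.executemany("INSERT INTO t (something) VALUES (?)", [(i,) for i in range(1000)])


def capped_count(threshold, cap):
    """Count rows with something > threshold, but never scan past `cap` matches."""
    sql = (
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM t WHERE something > ? LIMIT ?) AS capped"
    )
    return conn.execute(sql, (threshold, cap)).fetchone()[0]


def full_count(threshold):
    """Plain COUNT(*) for comparison."""
    sql = "SELECT COUNT(*) FROM t WHERE something > ?"
    return conn.execute(sql, (threshold,)).fetchone()[0]


many = capped_count(123, 100)  # 876 rows match, so the cap applies
few = capped_count(950, 100)   # only 49 rows match, so the cap does not apply
print(many, few)
```

Whether the capped form is faster than a plain count depends on how many rows match, as the question already observes, so it is worth timing both on real data.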

What's faster, SELECT DISTINCT or GROUP BY in MySQL?

If I have a table
CREATE TABLE users (
id int(10) unsigned NOT NULL auto_increment,
name varchar(255) NOT NULL,
profession varchar(255) NOT NULL,
employer varchar(255) NOT NULL,
PRIMARY KEY (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
If you have an index on profession, these two are synonyms.
If you don't, then use DISTINCT.
In MySQL versions before 8.0, GROUP BY also sorts the results. There you can even do:
SELECT u.profession FROM users u GROUP BY u.profession DESC
and get your professions sorted in DESC order.
DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sorts the distinct results afterwards.
So
SELECT DISTINCT u.profession FROM users u
is faster, if you don't have an index on profession.
All of the answers above are correct, for the case of DISTINCT on a single column vs GROUP BY on a single column.
Every db engine has its own implementation and optimizations, and if you care about the very little difference (in most cases) then you have to test against specific server AND specific version! As implementations may change...
BUT, if you select more than one column in the query, then the DISTINCT is essentially different! Because in this case it will compare ALL columns of all rows, instead of just one column.
So if you have something like:
-- This will NOT return unique by [id], but unique by (id,name)
SELECT DISTINCT id, name FROM some_query_with_joins
-- This will select unique by [id].
SELECT id, name FROM some_query_with_joins GROUP BY id
It is a common mistake to think that DISTINCT keyword distinguishes rows by the first column you specified, but the DISTINCT is a general keyword in this manner.
So be careful not to take the answers above as correct for all cases. You might get confused and get the wrong results when all you wanted was to optimize!
Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!
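The multi-column difference is easiest to see with rows that share an id but differ in name. The following is a small sketch, assuming SQLite (through Python's sqlite3 module) instead of MySQL and an invented table called people. It uses MIN(name) in the grouped query so that the result is deterministic and the statement is also accepted by engines that reject non-aggregated columns in a GROUP BY query.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); the people table is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(1, "ann"), (1, "anna"), (2, "bob"), (2, "bob")],
)

# DISTINCT compares the whole (id, name) pair, so id 1 appears twice.
distinct_rows = conn.execute(
    "SELECT DISTINCT id, name FROM people ORDER BY id, name"
).fetchall()

# GROUP BY id collapses to one row per id; MIN(name) picks one name per group.
grouped_rows = conn.execute(
    "SELECT id, MIN(name) FROM people GROUP BY id ORDER BY id"
).fetchall()

print(distinct_rows)
print(grouped_rows)
```

Only the exact duplicate (2, 'bob') is removed by DISTINCT, whereas GROUP BY id returns a single row for each id.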
Well, DISTINCT can be slower than GROUP BY on some occasions in Postgres (I don't know about other databases).
tested example:
postgres=# select count(*) from (select distinct i from g) a;
count
10001
(1 row)
Time: 1563,109 ms
postgres=# select count(*) from (select i from g group by i) a;
count
10001
(1 row)
Time: 594,481 ms
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I
so be careful ... :)
GROUP BY is more expensive than DISTINCT, since GROUP BY sorts the result while DISTINCT avoids it. But if you want GROUP BY to yield the same result as DISTINCT, add ORDER BY NULL.
SELECT DISTINCT u.profession FROM users u
is equal to
SELECT u.profession FROM users u GROUP BY u.profession order by null
It seems that the queries are not exactly the same. At least for MySQL.
Compare:
describe select distinct productname from northwind.products
describe select productname from northwind.products group by productname
The second query gives additionally "Using filesort" in Extra.
In MySQL, "Group By" uses an extra step: filesort. I realized that DISTINCT is faster than GROUP BY, and that was a surprise.
After heavy testing we came to the conclusion that GROUP BY is faster
SELECT sql_no_cache
opnamegroep_intern
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern
635 total, 0.0944 seconds
Showing records 0 - 29 (635 total, query took 0.0484 sec)
SELECT sql_no_cache
distinct (opnamegroep_intern)
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13)
635 total, 0.2117 seconds ( almost 100% slower )
Showing records 0 - 29 (635 total, query took 0.3468 sec)
(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)
Here is a simple approach which will print the elapsed time for each of the two queries.
DECLARE @t1 DATETIME;
DECLARE @t2 DATETIME;
SET @t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
SET @t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
OR try SET STATISTICS TIME (Transact-SQL)
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
SELECT DISTINCT will always be the same as, or faster than, a GROUP BY. On some systems (e.g. Oracle), GROUP BY might be optimized to be the same as DISTINCT for most queries. On others (such as SQL Server), DISTINCT can be considerably faster.
This is not a rule.
For each query, try DISTINCT and GROUP BY separately, compare the time each takes to complete, and use the faster one.
In my project I sometimes use GROUP BY and sometimes DISTINCT.
If you don't have to do any group functions (sum, average etc. in case you want to add numeric data to the table), use SELECT DISTINCT. I suspect it's faster, but I have nothing to show for it.
In any case, if you're worried about speed, create an index on the column.
If the problem allows it, try EXISTS, since it is optimized to stop as soon as a result is found (and does not buffer any response). So, if you are just trying to normalize data for a WHERE clause like this:
SELECT * FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- to keep same cardinality
A faster response would be:
SELECT * FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible but when available you will see a faster response.
In MySQL I have found that GROUP BY will treat NULL as distinct, while DISTINCT does not.
I took the exact same DISTINCT query, removed the DISTINCT, and added the selected fields as the GROUP BY, and I got many more rows due to one of the fields being NULL.
So I tend to believe that there is more to DISTINCT in MySQL.