I want to remove duplicates based on the combination of listings.product_id and listings.channel_listing_id
This simple query returns 400.000 rows (the id's of the rows I want to keep):
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
While this variation returns 1.600.000 rows, which are all records on the table, not only is_verified = 0:
SELECT *
FROM (
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
) AS keepem
I'd expect them to return the same amount of rows.
What's the reason for this? How can I avoid it (in order to use the subselect in the where condition of the DELETE statement)?
EDIT: I found that doing a SELECT DISTINCT in the outer SELECT "fixes" it (it returns 400.000 records as it should). I'm still not sure if I should trust this subquery, for there is no DISTINCT in the DELETE statement.
EDIT 2: Seems to be just a bug in the way phpMyAdmin reports the total count of the rows.
Your query as it stands is ambiguous. Suppose you have two listings with the same product_id and channel_id. Then what id is supposed to be returned? The first, the second? Or both, ignoring the GROUP request?
What if there is more than one id with different product and channel ids?
Try removing the ambiguity by selecting MAX(id) AS id and adding DISTINCT.
Are there any foreign keys to worry about? If not, you could pour the original table into a copy, empty the original and copy back in it the non-duplicates only. Messier, but you only do SELECTs or DELETEs guaranteed to succeed, and you also get to keep a backup.
Assign aliases in order to avoid field reference ambiguity:
SELECT
keepem.*
FROM
(
SELECT
innerStat.id
FROM
`listings` AS innerStat
WHERE
innerStat.is_verified = 0
GROUP BY
innerStat.product_id,
innerStat.channel_listing_id
) AS keepem
I hope you can help me with that topic.
I have one table, the relevant fields are VARCHAR id, VARCHAR name and date
3DF0001AB TESTING_1 2017-04-04
3DF0002ZG TESTING_2 2017-04-03
3DF0003ER TESTING_1 2017-04-01
3DF0004XY TESTING_1 2017-03-26
3DF0005UO TESTING_3 2017-03-25
The goal is to retrieve two entries for every name (>500), sorted by date. As I can just use database queries I tried following approach. Get one id for every name, UNION the result with the same query, but excluding the ids from the first set.
First step was to get one entry for every name. Result as expected, one id for every name.
SELECT id FROM table GROUP BY name;
Second step; using the above statement in the WHERE clause to receive results, that are not in the first result:
SELECT id FROM table WHERE id NOT IN (SELECT id FROM table GROUP BY name)
But here the result is empty, then I tried to invert the WHERE by using WHERE id IN instead of NOT IN. Expected result was that the same ids would show up when just using the subquery, result was all ids from the table. So I assume that the subquery delivers a wrong result, because when I copy the ids manually -> id IN ("3DF0001AB", ...) it works.
So maybe someone can explain the behavior and/or help to find a solution for the original problem.
This is a really bad practice:
SELECT id
FROM table
GROUP BY name;
Although MySQL allows this construct, the returned id is from an indeterminate row. You can even get different rows when you run the same query at different times.
A better approach is to use an aggregation function:
SELECT MAX(id)
FROM table
GROUP BY name;
Your real problem, though, is slightly different. When you use NOT IN, no rows are returned if any value in the IN list is NULL. That is how NOT IN is defined.
I would recommend using NOT EXISTS or LEFT JOIN instead, because their behavior is more intuitive:
SELECT t.id
FROM table t LEFT JOIN
(SELECT MAX(id) as id
FROM table t2
GROUP BY name
) tt
ON t.id = tt.id
WHERE tt.id IS NULL;
I have a table called links in which I am having 50000 + records where three of the fields are having indexes in the order link_id ,org_id ,data_id.
Problem here is when I am using group by sql query it is taking more time to load.
The Query is like this
SELECT DISTINCT `links`.*
FROM `links`
WHERE `links`.`org_id` = 2 AND (link !="")
GROUP BY link
The table is having nearly 20+ columns
Is there any solution to speed up the query to access faster.
Build an index on (org_id, link)
org_id to optimize your WHERE clause, and link for your group by (and also part of where).
By having the LINK_ID in the first position is probably what is holding your query back.
create index LinksByOrgAndLink on Links ( Org_ID, Link );
MySQL Create Index syntax
the problem is in your DISTINCT.*
the GROUP BY is already doing the work of DISTINCT , so are doing distinct two times , one of SELECT DISTINCT and other for GROUP BY
try this
SELECT *
FROM `links`
WHERE org_id` = 2 AND (link !="")
GROUP BY link
I guess adding a index to your 'link' column would improve the result.
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
Only select the columns that you need.
Why is there a distinct for links.*?
Do you have some rows exactly doubled in your table?
On the other hand, changing the value "" to NULL could be improve your select statement, but iam not sure about this.
select field1,count(*) from table where $condition group by field1
select field2,count(*) from table where $condition group by field2
Basically that's what I'm doing the job now,is there a way to optimize the performance so that MySQL doesn't need to search two times to group by for the where clause?
If the table is large and $condition is 'rare', it might help to create a temporary table in memory. This way, you group it twice but filter it only once.
CREATE TEMPORARY TABLE temp ENGINE=MEMORY
select field1,field2 from table where $condition;
select field1,count(*) from temp group by field1;
select field1,count(*) from temp group by field2;
No magic bullet here... the key upon which the aggregation takes place is distinct, so SQL needs to iterated over two different lists. To make this a bit more evident, ask yourself how you would like the data returned: all field1 first, then all field 2, or intertwined, or possibly "pivoted" (but how?..)
To avoid an extra "trip" to the server we could get these two results set returned together, or we could even group the two, using UNION ALL (and being careful to add a prefix column to to know what is what), but this latter solution would end up taxing the server a bit more if anything.
If I have a table
CREATE TABLE users (
id int(10) unsigned NOT NULL auto_increment,
name varchar(255) NOT NULL,
profession varchar(255) NOT NULL,
employer varchar(255) NOT NULL,
PRIMARY KEY (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
If you have an index on profession, these two are synonyms.
If you don't, then use DISTINCT.
GROUP BY in MySQL sorts results. You can even do:
SELECT u.profession FROM users u GROUP BY u.profession DESC
and get your professions sorted in DESC order.
DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sortes the distinct results afterwards.
So
SELECT DISTINCT u.profession FROM users u
is faster, if you don't have an index on profession.
All of the answers above are correct, for the case of DISTINCT on a single column vs GROUP BY on a single column.
Every db engine has its own implementation and optimizations, and if you care about the very little difference (in most cases) then you have to test against specific server AND specific version! As implementations may change...
BUT, if you select more than one column in the query, then the DISTINCT is essentially different! Because in this case it will compare ALL columns of all rows, instead of just one column.
So if you have something like:
// This will NOT return unique by [id], but unique by (id,name)
SELECT DISTINCT id, name FROM some_query_with_joins
// This will select unique by [id].
SELECT id, name FROM some_query_with_joins GROUP BY id
It is a common mistake to think that DISTINCT keyword distinguishes rows by the first column you specified, but the DISTINCT is a general keyword in this manner.
So people you have to be careful not to take the answers above as correct for all cases... You might get confused and get the wrong results while all you wanted was to optimize!
Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!
well distinct can be slower than group by on some occasions in postgres (dont know about other dbs).
tested example:
postgres=# select count(*) from (select distinct i from g) a;
count
10001
(1 row)
Time: 1563,109 ms
postgres=# select count(*) from (select i from g group by i) a;
count
10001
(1 row)
Time: 594,481 ms
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I
so be careful ... :)
Group by is expensive than Distinct since Group by does a sort on the result while distinct avoids it. But if you want to make group by yield the same result as distinct give order by null ..
SELECT DISTINCT u.profession FROM users u
is equal to
SELECT u.profession FROM users u GROUP BY u.profession order by null
It seems that the queries are not exactly the same. At least for MySQL.
Compare:
describe select distinct productname from northwind.products
describe select productname from northwind.products group by productname
The second query gives additionally "Using filesort" in Extra.
In MySQL, "Group By" uses an extra step: filesort. I realize DISTINCT is faster than GROUP BY, and that was a surprise.
After heavy testing we came to the conclusion that GROUP BY is faster
SELECT sql_no_cache
opnamegroep_intern
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern
635 totaal 0.0944 seconds
Weergave van records 0 - 29 ( 635 totaal, query duurde 0.0484 sec)
SELECT sql_no_cache
distinct (opnamegroep_intern)
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13)
635 totaal 0.2117 seconds ( almost 100% slower )
Weergave van records 0 - 29 ( 635 totaal, query duurde 0.3468 sec)
(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)
Here is a simple approach which will print the 2 different elapsed time for each query.
DECLARE #t1 DATETIME;
DECLARE #t2 DATETIME;
SET #t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SET #t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, #t1, #t2) AS varchar);
SET #t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET #t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, #t1, #t2) AS varchar);
OR try SET STATISTICS TIME (Transact-SQL)
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
SELECT DISTINCT will always be the same, or faster, than a GROUP BY. On some systems (i.e. Oracle), it might be optimized to be the same as DISTINCT for most queries. On others (such as SQL Server), it can be considerably faster.
This is not a rule
For each query .... try separately distinct and then group by ... compare the time to complete each query and use the faster ....
In my project sometime I use group by and others distinct
If you don't have to do any group functions (sum, average etc in case you want to add numeric data to the table), use SELECT DISTINCT. I suspect it's faster, but i have nothing to show for it.
In any case, if you're worried about speed, create an index on the column.
If the problem allows it, try with EXISTS, since it's optimized to end as soon as a result is found (And don't buffer any response), so, if you are just trying to normalize data for a WHERE clause like this
SELECT FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- to keep same cardinality
A faster response would be:
SELECT FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible but when available you will see a faster response.
in mySQL i have found that GROUP BY will treat NULL as distinct, while DISTINCT does not.
Took the exact same DISTINCT query, removed the DISTINCT, and added the selected fields as the GROUP BY, and i got many more rows due to one of the fields being NULL.
So.. I tend to believe that there is more to the DISTINCT in mySQL.