How does SELECT DISTINCT work in MySQL? - mysql

I have a table with multiple rows that contain the same data. I used SELECT DISTINCT to get unique rows and it works fine. But when I use ORDER BY with SELECT DISTINCT, it gives me unsorted data.
Can anyone tell me how DISTINCT works?
On what criteria does it select the row?

From your comment earlier, the query you are trying to run is
SELECT DISTINCT id FROM table WHERE id2 = 12312 ORDER BY time DESC
As I expected, here is your problem: your SELECT column and your ORDER BY column are different. Your output rows are ordered by time, but that order is not necessarily preserved in the id column. Here is an example.
id | id2 | time
-------------------
1 | 12312 | 34
2 | 12312 | 12
3 | 12312 | 48
If you run
SELECT * FROM table WHERE id2 = 12312 ORDER BY time DESC
you will get the following result
id | id2 | time
-------------------
3 | 12312 | 48
1 | 12312 | 34
2 | 12312 | 12
Now if you select only the id column from this, you will get
id
--
3
1
2
This is why your results are not sorted.
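One way to make such a query well defined is to state which time should represent each id, for example its most recent one. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table name t is made up, and the fourth row (a second row for id 2) is a hypothetical addition so that DISTINCT actually has something to collapse.

```python
import sqlite3

def ids_by_latest_time(rows, id2):
    """Return each id once, ordered by its most recent time (newest first)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table name "t"; "time" is quoted to keep it unambiguous.
    conn.execute('CREATE TABLE t (id INTEGER, id2 INTEGER, "time" INTEGER)')
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = (
        "SELECT id FROM t WHERE id2 = ? "
        "GROUP BY id "                  # one output row per id
        'ORDER BY MAX("time") DESC'     # sort key is now unambiguous per id
    )
    result = [row[0] for row in conn.execute(query, (id2,))]
    conn.close()
    return result

# The three rows from the example plus one hypothetical duplicate of id 2.
rows = [(1, 12312, 34), (2, 12312, 12), (3, 12312, 48), (2, 12312, 50)]
print(ids_by_latest_time(rows, 12312))  # [2, 3, 1]
```

GROUP BY id with ORDER BY MAX(time) should also be accepted by MySQL, but check it against your own schema and SQL mode before relying on it.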

When you specify SELECT DISTINCT it will give you all the rows, eliminating duplicates from the result set. By "duplicates" I mean rows where all fields have the same values. For example, say you have a table that looks like:
id | num
--------------
1 | 1
2 | 3
3 | 3
SELECT DISTINCT * would return all rows above, whereas SELECT DISTINCT num would return two rows:
num
-----
1
3
Note that which actual row it selects (e.g. whether it's row 2 or row 3) is irrelevant, as the result would be indistinguishable.
Finally, DISTINCT should not affect how ORDER BY works.
Reference: MySQL SELECT statement
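The difference between SELECT DISTINCT * and SELECT DISTINCT num can be checked with a small script. This is a sketch that uses Python's built-in sqlite3 module in place of MySQL; the table name t is an assumption, and ORDER BY is added only to make the output order predictable.

```python
import sqlite3

def distinct_demo():
    """Compare DISTINCT over all columns with DISTINCT over a single column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, num INTEGER)")  # hypothetical table name
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1), (2, 3), (3, 3)])
    # Every (id, num) pair is unique, so nothing is removed here.
    all_columns = conn.execute("SELECT DISTINCT * FROM t ORDER BY id").fetchall()
    # Only num is compared, so the two rows with num = 3 collapse into one.
    num_only = conn.execute("SELECT DISTINCT num FROM t ORDER BY num").fetchall()
    conn.close()
    return all_columns, num_only

all_columns, num_only = distinct_demo()
print(all_columns)  # [(1, 1), (2, 3), (3, 3)]
print(num_only)     # [(1,), (3,)]
```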

The behaviour you describe happens when you ORDER BY an expression that is not present in the SELECT clause. The SQL standard does not allow such a query but MySQL is less strict and allows it.
Let's try an example:
SELECT DISTINCT column1, column2
FROM table1
WHERE ...
ORDER BY column3
Let's say the content of the table table1 is:
id | column1 | column2 | column3
----+---------+---------+---------
1 | A | B | 1
2 | A | B | 5
3 | X | Y | 3
Without the ORDER BY clause, the above query returns the following two records (without ORDER BY the order is not guaranteed):
column1 | column2
---------+---------
A | B
X | Y
But with ORDER BY column3 the order is also not guaranteed.
The DISTINCT clause operates on the values of the expressions present in the SELECT clause. If row #1 is processed first then (A, B) is placed in the result set and it is associated with row #1. Then, when row #2 is processed, the values of the SELECT expressions produce the record (A, B) that is already in the result set. Because of DISTINCT it is dropped. Row #3 produces (X, Y) that is also put in the result set. Then, the ORDER BY column3 clause makes the records be sorted in the result set as (A, B), (X, Y).
But if row #2 is processed before row #1 then, following the same logic described in the previous paragraph, (A, B) is associated with column3 = 5 and the records in the result set are sorted as (X, Y), (A, B).
There is no rule imposed on the database engine about the order in which it processes the rows when it runs a query. The database is free to process the rows in whatever order it considers best for performance.
Your query is invalid SQL and the fact that it can return different results using the same input data proves it.
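The two possible outcomes described above correspond to two explicit choices of sort key per (column1, column2) group: the smallest column3 or the largest. The sketch below makes that choice explicit with GROUP BY and an aggregate. It runs on Python's built-in sqlite3 module rather than MySQL, and the helper name ordered_pairs is made up for the illustration.

```python
import sqlite3

def ordered_pairs(aggregate):
    """Return the distinct (column1, column2) pairs ordered by MIN or MAX of column3."""
    if aggregate not in ("MIN", "MAX"):
        raise ValueError("aggregate must be 'MIN' or 'MAX'")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (id INTEGER, column1 TEXT, column2 TEXT, column3 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?, ?, ?)",
        [(1, "A", "B", 1), (2, "A", "B", 5), (3, "X", "Y", 3)],
    )
    # GROUP BY removes the duplicates; the aggregate picks one sort value per group.
    query = (
        "SELECT column1, column2 FROM table1 "
        "GROUP BY column1, column2 "
        f"ORDER BY {aggregate}(column3)"
    )
    result = conn.execute(query).fetchall()
    conn.close()
    return result

print(ordered_pairs("MIN"))  # [('A', 'B'), ('X', 'Y')]
print(ordered_pairs("MAX"))  # [('X', 'Y'), ('A', 'B')]
```

Either version is deterministic, which the DISTINCT ... ORDER BY column3 form is not.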

Related

Explaining a SELECT statement script in detail for MySql

For one of the questions in my computing coursework, I was asked to explain the following SQL script in detail:
SELECT exam_board, COUNT(*)
FROM subjects
GROUP BY exam_board;
Below is what I have written in response to that question. I was just wondering if I forgot to include something, or if I stated something incorrectly. Any feedback at all would be greatly appreciated!
The script begins with a SELECT statement. A SELECT statement retrieves records from one or more tables or databases (the data that is returned is then stored in a result table, which is called a result-set). ‘COUNT(*)’ is a function which returns the number of rows (all of them, as there is an asterisk) that match a specified criterion, and it gives the total number of records fetched in a query. Therefore ‘SELECT exam_board, COUNT(*) FROM subjects’ means that the script will return all exam boards from the ‘exam_board’ column in the ‘subjects’ table with their count (of how many subjects are of that exam board). Finally, the last line is ‘GROUP BY exam_board;’. The ‘GROUP BY’ clause is often used in SELECT statements to collect data from a number of records. Its purpose is to group the results in one or more columns. In this case it was grouped by ‘exam_board’, meaning that the result of the query will be grouped into a column of the exam boards.
You forgot the effect of GROUP BY is to reduce the result set to one row per distinct value in the grouping column (exam_board in this query).
So there might be 10,000 rows in the subjects table, but only four distinct values for exam_board. Using GROUP BY means you will only have four rows in the result set, exactly one row for each exam_board.
Then the COUNT(*) will be the count of rows that were "collapsed" for each respective group.
I request that you do not copy & paste my answer, but write your own answer in your own words. My writing style is pretty different from yours, so if you copy & paste, it'll be obvious to your teacher that you lifted this.
Actually, this is not the best answer.
SELECT can return not only data from tables, but the result of any function; for example, SELECT VERSION() returns the version of the server software.
The asterisk in COUNT(*) means "count rows". You can put any expression that is never NULL there instead, even COUNT(VERSION()), and the result will be the same. COUNT(column), however, skips rows where that column is NULL, so it can return a smaller number.
‘SELECT exam_board, COUNT(*) FROM subjects’ (without GROUP BY) will return a single row with two columns: the value of the 'exam_board' column from one row of the table (typically the first) and the total number of rows in the table 'subjects'.
Content of the table:
mysql> select exam_board from subjects;
+------------+
| exam_board |
+------------+
| 2 |
| 2 |
| 3 |
| 3 |
| 3 |
+------------+
5 rows in set (0.00 sec)
Mixing column values with an aggregate function that returns a single value, like SUM(), MIN(), MAX() etc., without a GROUP BY clause:
mysql> select exam_board, count(*) from subjects;
+------------+----------+
| exam_board | count(*) |
+------------+----------+
| 2 | 5 |
+------------+----------+
1 row in set (0.00 sec)
Only with the GROUP BY clause do we get the desired result: the count of records for each value of the exam_board field.
mysql> select exam_board, count(*) from subjects group by exam_board;
+------------+----------+
| exam_board | count(*) |
+------------+----------+
| 2 | 2 |
| 3 | 3 |
+------------+----------+
2 rows in set (0.00 sec)
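The grouped query can be reproduced outside the mysql client. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL and the same five exam_board values as the table above; the helper names and the extra None value in the second call are assumptions added to show how COUNT(*) and COUNT(column) treat NULL.

```python
import sqlite3

def _load(values):
    """Create an in-memory subjects table holding the given exam_board values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subjects (exam_board INTEGER)")
    conn.executemany("INSERT INTO subjects VALUES (?)", [(v,) for v in values])
    return conn

def grouped_counts(values):
    """One row per distinct exam_board with the number of rows in that group."""
    conn = _load(values)
    rows = conn.execute(
        "SELECT exam_board, COUNT(*) FROM subjects "
        "GROUP BY exam_board ORDER BY exam_board"
    ).fetchall()
    conn.close()
    return rows

def star_vs_column(values):
    """COUNT(*) counts rows; COUNT(exam_board) counts non-NULL values only."""
    conn = _load(values)
    row = conn.execute("SELECT COUNT(*), COUNT(exam_board) FROM subjects").fetchone()
    conn.close()
    return row

print(grouped_counts([2, 2, 3, 3, 3]))        # [(2, 2), (3, 3)]
print(star_vs_column([2, 2, 3, 3, 3, None]))  # (6, 5)
```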

query over 2 columns where value never appears in either column more than once

Looking for a query that takes the following table ProductList
id| column_1 | column_2 | Sum
================================
1 | Product-A | Product-B | 67
2 | Product-A | Product-C | 55
3 | Product-A | Product-D | 23
4 | Product-B | Product-C | 95
5 | Product-C | Product-D | 110
and returns the first record Product-A_Product-B and then skips all records that contain Product-A or Product-B in either column and returns Product-C_Product-D.
I only want to return the row if everything in the row is appearing for the first time.
Assuming that the product names don't contain a comma (,), you could use a comma-delimited session variable to store the already selected products and check for every row whether one of the columns is already contained in that variable:
select column_1, column_2
from (
    select l.*,
        case when find_in_set(l.column_1, @products) or find_in_set(l.column_2, @products)
            then 1
            else (@products := concat(@products, ',', l.column_1, ',', l.column_2)) = ''
        end as skip
    from ProductList l
    cross join (select @products := '') init
    order by l.id
) t
where skip = 0;
Demo: http://rextester.com/NDVBW87988
But you should know the risks:
ORDER BY in a subquery is not really valid and usually doesn't make sense. The engine may skip it or move it to the outer query.
If you read and write the same session variable in one statement, the execution order is not defined. So the query might not work for all (future) versions.
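Given those risks, it may be safer to fetch the rows ordered by id and do the filtering in application code. The following is a minimal sketch of the same greedy rule in plain Python; the function name and the tuple layout (id, column_1, column_2) are assumptions made for the example.

```python
def first_time_pairs(rows):
    """Keep a row only if neither of its products appeared in an earlier kept row.

    rows is an iterable of (id, column_1, column_2) tuples.
    """
    seen = set()   # products of the rows kept so far
    kept = []
    for _id, left, right in sorted(rows):  # process in id order
        if left in seen or right in seen:
            continue                       # skipped rows do not mark their products
        kept.append((left, right))
        seen.update((left, right))
    return kept

product_list = [
    (1, "Product-A", "Product-B"),
    (2, "Product-A", "Product-C"),
    (3, "Product-A", "Product-D"),
    (4, "Product-B", "Product-C"),
    (5, "Product-C", "Product-D"),
]
print(first_time_pairs(product_list))
# [('Product-A', 'Product-B'), ('Product-C', 'Product-D')]
```

As in the SQL version, a skipped row does not add its products to the list, which is why row 5 is still returned.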

Return the query when count of a query is greater than a number?

I want to return all rows that have a certain value in a column, but only when that value occurs in more than 5 rows. For example, if column M contains the number 1 and there are more than 5 rows in which M is 1, then the query should return all rows where M is 1.
select *
from tab
where M = 1
group by id -- ID is the primary key of the table
having count(M) > 5;
EDIT: Here is my table:
id | M | price
--------+-------------+-------
1 | | 100
2 | 1 | 50
3 | 1 | 30
4 | 2 | 20
5 | 2 | 10
6 | 3 | 20
7 | 1 | 1
8 | 1 | 1
9 | 1 | 1
10 | 1 | 1
11 | 1 | 1
Originally I just want to insert into a trigger so that if the number of M = 1's is greater than 5, then I want to create an exception. The query I asked for would be inserted into the trigger. END EDIT.
But my result is always empty. Can anyone help me out? Thanks!
Try this :
select *
from tab
where M in (select M from tab where M = 1 group by M having count(id) > 5);
SQL Fiddle Demo
please try
select *,count(M) from table where M=1 group by id having count(M)>5
Since you group on your PK (which seems a futile exercise), you are counting per ID, and that count will indeed always be 1.
As I explain after this code, this query is NOT good, it is NOT the answer, and I also explain WHY. Please do not expect this query to run correctly!
select *
from tab
where M = 1
group by M
having count(*) > 5;
Like this, you group on what you are counting, which makes a lot more sense. At the same time, this will have unexpected behaviour, as you are selecting all kinds of columns that are not in the GROUP BY or in any aggregate. I know MySQL is lenient on that, but I don't even want to know what it will produce.
Try indeed a subquery along these lines:
select *
from tab
where M in
(SELECT M
from tab
group by M
having count(*) > 5)
I've built a SQLFiddle demo (I used 'Test' as the table name out of habit) accomplishing this (I don't have MySQL at hand right now to test it).
-- Made up a structure for testing
CREATE TABLE Test (
id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
M int
);
SELECT id, M FROM Test
WHERE M IN (
    SELECT M
    FROM Test
    WHERE M = 1
    GROUP BY M
    HAVING COUNT(M) > 5
)
The sub-query is a common "find the duplicates" kind of query, with the added condition of a specific value for the column M, and it also requires that the value occurs more than 5 times.
It will spit out a series of values of M which you can use to query the table against, ending up with the rows you need.
You shouldn't use SELECT *; it's bad practice in general. Don't retrieve data you aren't actually using, and if you are using it, take the little time needed to type in a list of fields. You'll likely see faster querying, and the code will be far more readable.
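To see why grouping by the primary key returns nothing while grouping by M works, both queries can be run against the table from the question. This is a sketch on Python's built-in sqlite3 module rather than MySQL; the empty M in row 1 is assumed to be NULL, and the helper name run is made up.

```python
import sqlite3

# Rows copied from the question's table; None stands for the empty M in row 1.
ROWS = [
    (1, None, 100), (2, 1, 50), (3, 1, 30), (4, 2, 20), (5, 2, 10), (6, 3, 20),
    (7, 1, 1), (8, 1, 1), (9, 1, 1), (10, 1, 1), (11, 1, 1),
]

def run(query):
    """Load the sample rows into a fresh table and return the ids the query selects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tab (id INTEGER PRIMARY KEY, M INTEGER, price INTEGER)")
    conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", ROWS)
    ids = [row[0] for row in conn.execute(query)]
    conn.close()
    return ids

# Grouping by the primary key: every group holds one row, so COUNT(M) is never > 5.
PER_ID = "SELECT id FROM tab WHERE M = 1 GROUP BY id HAVING COUNT(M) > 5"

# Grouping by M in a subquery: the count is taken per value of M.
PER_VALUE = (
    "SELECT id FROM tab WHERE M IN ("
    "SELECT M FROM tab WHERE M = 1 GROUP BY M HAVING COUNT(*) > 5"
    ") ORDER BY id"
)

print(run(PER_ID))     # []
print(run(PER_VALUE))  # [2, 3, 7, 8, 9, 10, 11]
```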

Delete all rows except first N from a table having single column

I need a single query. Delete all rows from the table except the top N rows. The table has only one column. Like,
|friends_name|
==============
| Arunji |
| Roshit |
| Misbahu |
| etc... |
This column may contain repeated names as well.
Contains repeated names
Only one column.
If you can order your records by friends_name, and if there are no duplicates, you could use this:
DELETE FROM names
WHERE
friends_name NOT IN (
SELECT * FROM (
SELECT friends_name
FROM names
ORDER BY friends_name
LIMIT 10) s
)
Please see fiddle here.
Or you can use this:
DELETE FROM names ORDER BY friends_name DESC
LIMIT total_records-10
where total_records is (SELECT COUNT(*) FROM names). You have to compute that number in application code first, though; you can't put a count in the LIMIT clause of your query.
If you don't have an id field, I suppose you can use alphabetical order.
MYSQL
DELETE FROM friends
WHERE friends_name
NOT IN (
SELECT * FROM (
SELECT friends_name
FROM friends
ORDER BY friends_name ASC
LIMIT 10) r
)
This deletes all rows except the first 10 (in alphabetical order).
I just wanted to follow up on this relatively old question because the existing answers don't capture the requirement and/or are incorrect. The question states the names can be repeated, but only the top N must be preserved. Other answers will delete incorrect rows and/or incorrect number of them.
For example, if we have this table:
|friends_name|
==============
| Arunji |
| Roshit |
| Misbahu |
| Misbahu |
| Roshit |
| Misbahu |
| Rohan |
And we want to delete all but top 3 rows (N = 3), the expected result would be:
|friends_name|
==============
| Arunji |
| Roshit |
| Misbahu |
The DELETE statement from the currently selected answer will result in:
|friends_name|
==============
| Arunji |
| Misbahu |
| Misbahu |
| Misbahu |
See this sqlfiddle. The reason for this is that it first sorts names alphabetically, then takes top 3, then deletes all that don't equal that. But since they are sorted by name they may not be the top 3 we want, and there's no guarantee that we'll end up with only 3.
In the absence of unique indexes and other fields to determine what "top N" means, we go by the order returned by the database. We could be tempted to do something like this (substitute 99999 with a sufficiently large number):
DELETE FROM names LIMIT 99999 OFFSET 3
But according to MySQL docs, while the DELETE supports the LIMIT clause, it does not support OFFSET. So, doing this in a single query, as requested, does not seem to be possible; we must perform the steps manually.
Solution 1 - temporary table to hold top 3
CREATE TEMPORARY TABLE temp_names LIKE names;
INSERT INTO temp_names SELECT * FROM names LIMIT 3;
DELETE FROM names;
INSERT INTO names SELECT * FROM temp_names;
Here's the sqlfiddle for reference.
Solution 2 - new table with rename
CREATE TABLE new_names LIKE names;
INSERT INTO new_names SELECT * FROM names LIMIT 3;
RENAME TABLE names TO old_names, new_names TO names;
DROP TABLE old_names;
Here's the sqlfiddle for this one.
In either case, we end up with top 3 rows in our original table:
|friends_name|
==============
| Arunji |
| Roshit |
| Misbahu |
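Both behaviours can be reproduced with a short script. The sketch below uses Python's built-in sqlite3 module instead of MySQL, so it is only an approximation: it relies on SQLite's implicit rowid to pin down what "first N" means, which a MySQL table does not have, and the function names are made up for the illustration.

```python
import sqlite3

NAMES = ["Arunji", "Roshit", "Misbahu", "Misbahu", "Roshit", "Misbahu", "Rohan"]

def _fresh_table():
    """Create the single-column names table with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (friends_name TEXT)")
    conn.executemany("INSERT INTO names VALUES (?)", [(n,) for n in NAMES])
    return conn

def _remaining(conn):
    rows = [r[0] for r in conn.execute("SELECT friends_name FROM names ORDER BY rowid")]
    conn.close()
    return rows

def keep_by_name_filter(n):
    """The NOT IN approach: keeps every row whose name is among the first n sorted names."""
    conn = _fresh_table()
    conn.execute(
        "DELETE FROM names WHERE friends_name NOT IN ("
        "SELECT friends_name FROM names ORDER BY friends_name LIMIT ?)",
        (n,),
    )
    return _remaining(conn)

def keep_via_temp_table(n):
    """Solution 1: copy the first n rows aside, empty the table, copy them back."""
    conn = _fresh_table()
    conn.execute("CREATE TEMPORARY TABLE temp_names (friends_name TEXT)")
    # ORDER BY rowid is SQLite-specific; it makes "first n" mean insertion order.
    conn.execute(
        "INSERT INTO temp_names SELECT friends_name FROM names ORDER BY rowid LIMIT ?",
        (n,),
    )
    conn.execute("DELETE FROM names")
    conn.execute("INSERT INTO names SELECT friends_name FROM temp_names ORDER BY rowid")
    return _remaining(conn)

print(keep_by_name_filter(3))  # ['Arunji', 'Misbahu', 'Misbahu', 'Misbahu']
print(keep_via_temp_table(3))  # ['Arunji', 'Roshit', 'Misbahu']
```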

Row number for query results grouped by a column

I have a table that has the following columns:
id | fk_id | rcv_date
There may be multiple records with a common fk_id, which represents a foreign key id in a related table.
I need to create a query that will assign a row number to each record, grouped by fk_id, sorted by rcv_date.
I originally began with the following query, which works quite well for sorting and assigning row numbers:
SELECT @row:=@row +1 AS ordinality, c.fk_id, rcv_date
FROM (SELECT @row:=0) r, mytable c
ORDER BY rcv_date
However -- the row count and sorting is done across the entire dataset. I need the counting to be within a common fk_id. For example, the following sample data would return (the first column represents the row count/ordinality):
1 | 5 | 2011-10-01
2 | 5 | 2011-10-14
3 | 5 | 2011-11-02
4 | 5 | 2011-12-17
1 | 8 | 2011-09-03
2 | 8 | 2011-11-12
1 | 9 | 2011-10-08
2 | 9 | 2011-10-10
3 | 9 | 2011-11-19
The middle column represents the fk_id. As you can see, the sorting and row count is within the fk_id "grouping."
UPDATE
I have a query that seems to be working, but would like some input as to whether it can be improved:
SELECT IF(@last = c.fk_id, @row:=@row +1, @row:=1) AS ordinality, @last:=c.fk_id, c.fk_id, rcv_date
FROM (SELECT @row:=0) r, (SELECT @last:=0) l, mytable c
ORDER BY c.fk_id, rcv_date
So what this does is order by fk_id and then rcv_date -- which essentially handles my grouping. Then I use a second variable to compare the fk_id in the previous record with the current record: if it's the same, we increment the row; if different, we reset to 1.
My tests with real data appear to be working. I suspect it's a pretty inefficient query though, so if anyone has ideas for improving it, or sees possible flaws, I would love to hear them.
This should be pretty straightforward.
SELECT (CASE WHEN @fk <> fk_id THEN @row:=1 ELSE @row:=@row + 1 END) AS ordinality,
       @fk:=fk_id, rcv_date
FROM (SELECT @row:=0) AS r,
     (SELECT @fk:=0) AS f,
     (SELECT fk_id, rcv_date FROM files ORDER BY fk_id, rcv_date) AS t
I ordered by fk_id first to ensure all rows with the same foreign key come together (they may not be stored together in the table), then I applied your preferred ordering, i.e. by rcv_date. The query checks for a change in fk_id: if there is one, the row number variable is set to 1; otherwise the variable is incremented. This is handled in the CASE expression. Notice that @fk:=fk_id is done after the CASE check, otherwise it would affect the row number.
Edit: Just noticed your own solution which happened to be the same as I ended up with. Kudos! :)
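Because both queries depend on the order in which MySQL evaluates the user variables, it can help to see the same logic as an explicit loop. The following is a minimal sketch in plain Python, not a replacement for the query; the function name and the (fk_id, rcv_date) tuple layout are assumptions, and the dates are ISO strings so that plain sorting puts them in chronological order.

```python
def number_rows(records):
    """Assign a 1-based row number per fk_id, ordered by rcv_date.

    records is an iterable of (fk_id, rcv_date) tuples.
    """
    numbered = []
    last_fk = None   # plays the role of the "previous fk_id" variable
    row = 0          # plays the role of the row counter variable
    for fk_id, rcv_date in sorted(records):     # ORDER BY fk_id, rcv_date
        row = row + 1 if fk_id == last_fk else 1  # reset when the group changes
        last_fk = fk_id
        numbered.append((row, fk_id, rcv_date))
    return numbered

sample = [
    (9, "2011-10-10"), (5, "2011-12-17"), (8, "2011-11-12"), (5, "2011-10-01"),
    (9, "2011-11-19"), (5, "2011-11-02"), (8, "2011-09-03"), (9, "2011-10-08"),
    (5, "2011-10-14"),
]
for line in number_rows(sample):
    print(line)
```

On MySQL 8.0 or later, the ROW_NUMBER() window function with PARTITION BY fk_id ORDER BY rcv_date expresses the same numbering without user variables.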