Distinct - which items are taken? The first or the last occurrence? - mysql

If I use the following query:
SELECT DISTINCT comment FROM table;
And I have, for example, the following data (the IDs are just there to show the order):
ID | comment
-------------
1 | comment1
2 | comment1
3 | comment2
4 | comment1
What I could get back is one of the following three results:
Result 1:
1 | comment1
3 | comment2
Result 2:
3 | comment2
4 | comment1
Result 3:
order is unpredictable
Question 1:
Is the result independent of the platform? Can I make sure that I always get a predictable result?
Question 2:
I want to select all distinct comments and get only the NEWEST of each, meaning I always want to get result 2. Is it possible to achieve that? Maybe ordering by the key would affect the result?

Your query doesn't request the ID column, only the comment column:
SELECT DISTINCT comment FROM table;
In the result, the ID is not included, so the row each value comes from is irrelevant.
comment1
comment2
As for how it will sort them, I think it depends on index order. I'll do a test to confirm:
mysql> create table t (id int primary key, comment varchar(100));
mysql> insert into t values
-> (1, 'comment2'),
-> (2, 'comment1'),
-> (3, 'comment2'),
-> (4, 'comment1');
The default order is that of the primary key:
mysql> select distinct comment from t;
+----------+
| comment |
+----------+
| comment2 |
| comment1 |
+----------+
Whereas if we have an index on the requested column, it returns the values in index order:
mysql> create index i on t(comment);
mysql> select distinct comment from t;
+----------+
| comment |
+----------+
| comment1 |
| comment2 |
+----------+
I'm assuming the InnoDB storage engine, because everyone should be using InnoDB. ;-)
Your last question indicates that what you really want is not a DISTINCT query at all but a greatest-n-per-group query. This type of question is very common, and it has been asked and answered hundreds of times on StackOverflow under the greatest-n-per-group tag, where you can read the many solutions.

You can experiment and see which of the unique rows is returned, and you can experiment and see which order they're returned in, but that will only show you how things turn out with your experimental table, today, under the current database engine version. Bottom line:
If you SELECT DISTINCT comment, the id is immaterial because it's not in your SELECT list.
If you don't ORDER BY, the database will determine the order.
If you want the most recent distinct comment with its ID, this will work every time (full disclosure: this replaces an earlier answer that works but was over-thinking the problem):
SELECT comment, MAX(id)
FROM myTable
GROUP BY comment
ORDER BY 2 DESC;
Note that the ORDER BY 2 DESC assumes that the higher the ID, the more recent the comment.
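As a minimal sketch of the GROUP BY / MAX(id) approach above, the following runs the same query against the sample data from the question. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (the query is plain standard SQL), and the alias newest_id is introduced here only for readability.

```python
import sqlite3

# SQLite is only a stand-in for MySQL here; the query is standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, comment TEXT)")
conn.executemany(
    "INSERT INTO myTable (id, comment) VALUES (?, ?)",
    [(1, "comment1"), (2, "comment1"), (3, "comment2"), (4, "comment1")],
)

# One row per distinct comment, carrying the highest id seen for it.
# newest_id is an illustrative alias, not a name from the original answer.
newest = conn.execute(
    """
    SELECT comment, MAX(id) AS newest_id
    FROM myTable
    GROUP BY comment
    ORDER BY newest_id DESC
    """
).fetchall()

print(newest)  # [('comment1', 4), ('comment2', 3)]
```

This corresponds to "result 2" in the question: comment1 is reported with id 4 and comment2 with id 3.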

If you select a single distinct column, the other will not be returned.
select distinct column from table
is the same result as
select column from table group by column
In both cases, the sort order of column is unpredictable: it depends on the execution plan, which may vary with larger amounts of data, different table structures, and different database versions.
To mimic your result, one would have to do:
select id, column from table group by column
which is an illegal grouping. If your SQL mode permits it to run, the ID returned for each group is indeterminate.
If you mean select distinct * from table, then all distinct rows will be returned, which in your case is the whole table.

Related

SQL query to retrieve data from table

How to retrieve odd rows from the table?
In the Base table always Cr_id is duplicated 2 times.
Base table
I want a SELECT statement that retrieves only those c_id =1 where Cr_id is always first as shown in the output table.
Output table
Just look at the base table and the output table and it should be clear what I want. Thanks.
Testing for the minimum date should be enough:
drop table if exists t;
create table t(c_id int,cr_id int,dt date);
insert into t values
(1,56,'2020-12-17'),(56,56,'2020-12-17'),
(1,8,'2020-12-17'),(56,8,'2020-12-17'),
(123,78,'2020-12-17'),(1,78,'2020-12-18');
select c_id,cr_id,dt
from t
where c_id = 1 and
dt = (select min(dt) from t t1 where t1.cr_id = t.cr_id);
+------+-------+------------+
| c_id | cr_id | dt |
+------+-------+------------+
| 1 | 56 | 2020-12-17 |
| 1 | 8 | 2020-12-17 |
+------+-------+------------+
2 rows in set (0.002 sec)
What you're looking for could be "partition by", at least if you're working on MSSQL.
(In the future, please include more background; SQL dialects differ.)
https://codingsight.com/grouping-data-using-the-over-and-partition-by-functions/
I have an old query lying around that can put a sorting index on data that lacks one, although the underlying reason is 99.9% sure to be a bad data design.
Typically I use this query to remove bad data, but you can rewrite it as a join instead, so that you can identify the rows you need.
I'm not writing out that rewritten query here, in order to make a point: bad data design means more work when reading the data afterwards, which seems to be the real root cause here.
DELETE t
FROM
(
SELECT ROW_NUMBER () OVER (PARTITION BY column_1 ,column_2, column_3 ORDER BY column_1,column_2 ,column_3 ) AS Seq
FROM Table
)t
WHERE Seq > 1
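The "identify instead of delete" idea can be sketched as a SELECT over the same ROW_NUMBER() numbering. This is only an illustration under assumptions: the table name items, its id column, and the sample rows are hypothetical, and SQLite (3.25 or newer, for window functions) stands in for the real server. Ordering by id inside each partition is a choice made here so that the row numbered 1 is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: id is a surrogate key, the other columns define a duplicate.
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, column_1 TEXT, column_2 TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "x")],
)

# Number the rows inside each group of identical (column_1, column_2) values.
# Everything with seq > 1 is a surplus copy of the row numbered 1.
duplicates = conn.execute(
    """
    SELECT id, column_1, column_2
    FROM (
        SELECT id, column_1, column_2,
               ROW_NUMBER() OVER (
                   PARTITION BY column_1, column_2
                   ORDER BY id
               ) AS seq
        FROM items
    ) AS numbered
    WHERE seq > 1
    ORDER BY id
    """
).fetchall()

print(duplicates)  # [(2, 'a', 'x'), (4, 'a', 'x')]
```

Window functions are also available in MariaDB 10.2+ and MySQL 8.0+, so the inner SELECT should carry over with little change; the DELETE-from-derived-table form shown above is SQL Server syntax.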

Remove (purge) duplicate/multiplicate records from MariaDB

Briefly: the database is imported from a foreign source, so I cannot prevent duplicates; I can only prune and clean the database.
The foreign db changes daily, so I want to automate the pruning process.
It resides on:
MariaDB v10.4.6, managed predominantly through the phpMyAdmin GUI v4.9.0.1 (both pretty much up to date as of this writing).
This is a radio browsing database.
It has multiple columns, but only a few are important to me:
StationID (a unique entry number; because of this primary key the db does not consider new entries duplicates, since every row is unique)
There are no row numbers.
Name, url, home-page, country, etc.
I want to remove entries with duplicated urls, based on the following:
a duplicated url may have a country attached, but some country values are NULL (= empty)
so I want to remove all duplicates except the one containing a country name, if there is such an entry; if there is none, keep just one row for that url, regardless of name (names are multilingual, so some duplicated urls also have various names, which I do not care about).
StationID (unique number, but not consecutive; this is also the primary db key)
Name (variable, least important)
url (variable, but I do want to remove the duplicates)
country (variable, frequently NULL/empty; I want to eliminate the rows with empty entries as much as possible)
One row per url has to stay by any means (not to be deleted)
I have tried a multitude of queries; some work for SELECT but not for DELETE, and some hang my machine when executed. Here are some queries I tried (remember I use MariaDB, not Oracle or MS SQL):
SELECT * from `radio`.`Station`
WHERE (`radio`.`Station`.`Url`, `radio`.`Station`.`Name`) IN (
SELECT `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
HAVING COUNT(*) > 1)
This one should show all entries (not only one per group), but this query hangs my machine.
This query gets me as close as possible:
SELECT *
FROM `radio`.`Station`
WHERE `radio`.`Station`.`StationID` NOT IN (
SELECT MAX(`radio`.`Station`.`StationID`)
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`,`radio`.`Station`.`Name`,`radio`.`Station`.`Country`)
However this query lists more entries:
SELECT *, COUNT(`radio`.`Station`.`Url`) FROM `radio`.`Station` GROUP BY `radio`.`Station`.`Name`,`radio`.`Station`.`Url` HAVING (COUNT(`radio`.`Station`.`Url`) > 1);
But all of these queries group the rows and display only one row per group.
I also tried UNION and INNER JOIN, but failed.
I tried WITH cte AS..., but phpMyAdmin does NOT like this query, and the MariaDB CLI did not like it either.
I also tried something of this kind, published on an Oracle blog, which did not work, and I really had no clue what was what in it:
select *
from (
select f.*,
count(*) over (
partition by `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
) ct
from `radio`.`Station` f
)
where ct > 1
I did not know what f.* was, and the query did not like ct.
Given
drop table if exists radio;
create table radio
(stationid int,name varchar(3),country varchar(3),url varchar(3));
insert into radio values
(1,'aaa','uk','a/b'),
(2,'bbb','can','a/b'),
(3,'bbb',null,'a/b'),
(4,'bbb',null,'b/b'),
(5,'bbb',null,'b/b');
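You could give the null countries a unique value (using coalesce); fortunately stationid is unique. Because digits sort before letters in the usual collations, max() then prefers any real country name over a substituted stationid, so: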
select t.stationid,t.name,t.country,t.url
from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry= coalesce(t.country,t.stationid);
Yields
+-----------+------+---------+------+
| stationid | name | country | url |
+-----------+------+---------+------+
| 1 | aaa | uk | a/b |
| 5 | bbb | NULL | b/b |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
Translated to a delete
delete t from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry <> coalesce(t.country,t.stationid);
MariaDB [sandbox]> select * from radio;
+-----------+------+---------+------+
| stationid | name | country | url |
+-----------+------+---------+------+
| 1 | aaa | uk | a/b |
| 5 | bbb | NULL | b/b |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
Fix 2 problems at once:
Dup rows already in table
Dup rows can still be put in table
Do this for each table:
CREATE TABLE new LIKE real;
ALTER TABLE new ADD UNIQUE(x,y); -- will prevent future dups
INSERT IGNORE INTO new -- IGNORE dups
SELECT * FROM real;
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
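The copy-and-swap recipe above can be tried end to end with a small sketch. Several things here are assumptions rather than part of the answer: SQLite stands in for MariaDB, so INSERT OR IGNORE replaces INSERT IGNORE and the unique constraint is declared at table creation; the table names station, station_new, and station_old and the sample rows are hypothetical; and the ORDER BY on the copy is added so that the surviving row is the one with the lowest stationid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (stationid INTEGER, name TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO station VALUES (?, ?, ?)",
    [(1, "aaa", "a/b"), (2, "bbb", "a/b"), (3, "bbb", "b/b"), (4, "ccc", "b/b")],
)

# New table with the same columns plus a unique constraint on url.
conn.execute(
    "CREATE TABLE station_new "
    "(stationid INTEGER, name TEXT, url TEXT, UNIQUE (url))"
)
# Copy the rows; any row whose url is already present is silently skipped.
conn.execute(
    "INSERT OR IGNORE INTO station_new SELECT * FROM station ORDER BY stationid"
)
# Swap the tables and drop the old one.
conn.execute("ALTER TABLE station RENAME TO station_old")
conn.execute("ALTER TABLE station_new RENAME TO station")
conn.execute("DROP TABLE station_old")

kept = conn.execute(
    "SELECT stationid, url FROM station ORDER BY stationid"
).fetchall()
print(kept)  # [(1, 'a/b'), (3, 'b/b')]

# The constraint keeps rejecting duplicates on later imports.
conn.execute("INSERT OR IGNORE INTO station VALUES (9, 'zzz', 'a/b')")
count_after = conn.execute("SELECT COUNT(*) FROM station").fetchone()[0]
print(count_after)  # 2
```

Note that this keeps whichever duplicate is inserted first, so a preference such as "keep the row that has a country" would have to be expressed in the ORDER BY of the copying SELECT.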

Explaining a SELECT statement script in detail for MySql

For one of the questions in my computing coursework, I was asked to explain the following SQL script in detail:
SELECT exam_board, COUNT(*)
FROM subjects
GROUP BY exam_board;
Below is what I have written in response to that question. I was just wondering if I forgot to include something, or if I incorrectly stated something. Any feedback at all would be greatly appreciated!
The script begins with a SELECT statement. A SELECT statement retrieves records from one or more tables or databases (, the data that is returned is then stored inside a result table, which is called a result-set). ‘COUNT(*)’ is a function which returns (all (, as there is an asterisk)) the number of rows which match a specified criteria and it gives a total number of records fetched in a query. Therefore ‘SELECT exam_board, COUNT(*) FROM subjects’ means that the script will return all exam boards from the ‘exam_board’ column in the ‘subjects’ table with their count (of how many subjects are of that exam board). Finally the last line is ‘GROUP BY exam_board;’ the ‘GROUP BY’ clause is often used in SELECT statements to collect data from a number of records. Its purpose is to group the results in one or more columns. In this case it was grouped by ‘exam_board’, meaning that the result of the query will be grouped into a column of the exam boards.
You forgot the effect of GROUP BY is to reduce the result set to one row per distinct value in the grouping column (exam_board in this query).
So there might be 10,000 rows in the subjects table, but only four distinct values for exam_board. Using GROUP BY means you will only have four rows in the result set, exactly one row for each exam_board.
Then the COUNT(*) will be the count of rows that were "collapsed" for each respective group.
I request that you do not copy & paste my answer, but write your own answer in your own words. My writing style is pretty different from yours, so if you copy & paste, it'll be obvious to your teacher that you lifted this.
Actually, this is not the best answer.
SELECT can return not only data from tables but the result of any function; for example, SELECT VERSION() returns the version of the server software.
The asterisk as a parameter for COUNT(*) matters less than you suggest. You can put any column or function there that is never NULL, even COUNT(VERSION()), and the result will be the same; COUNT(expr) differs from COUNT(*) only in that it skips rows where expr is NULL.
‘SELECT exam_board, COUNT(*) FROM subjects’ without GROUP BY will return a single row with two columns: the value of the 'exam_board' column from one row of the table (typically the first) and the total number of rows in table 'subjects'. This only runs when the SQL mode allows it; with ONLY_FULL_GROUP_BY enabled, MySQL rejects the query.
Content of the table:
mysql> select exam_board from subjects;
+------------+
| exam_board |
+------------+
| 2 |
| 2 |
| 3 |
| 3 |
| 3 |
+------------+
5 rows in set (0.00 sec)
Mixing column values with an aggregate function that returns a single value, like SUM(), MIN(), MAX() etc., without GROUP BY:
mysql> select exam_board, count(*) from subjects;
+------------+----------+
| exam_board | count(*) |
+------------+----------+
| 2 | 5 |
+------------+----------+
1 row in set (0.00 sec)
And only with the grouping operator do we get the desired result: the count of records for each value of the exam_board field.
mysql> select exam_board, count(*) from subjects group by exam_board;
+------------+----------+
| exam_board | count(*) |
+------------+----------+
| 2 | 2 |
| 3 | 3 |
+------------+----------+
2 rows in set (0.00 sec)
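One caveat to the remark about the COUNT(*) asterisk is worth checking by hand: COUNT(expr) skips rows where expr is NULL, while COUNT(*) counts every row in the group. The sketch below assumes SQLite as a stand-in for MySQL, and the name column and its values are hypothetical additions to the subjects table used only to have something nullable to count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# name is a hypothetical nullable column added for this illustration.
conn.execute("CREATE TABLE subjects (name TEXT, exam_board INTEGER)")
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    [("maths", 2), ("physics", 2), ("art", 3), ("music", 3), (None, 3)],
)

# COUNT(*) counts rows in each group; COUNT(name) counts non-NULL names only.
rows = conn.execute(
    """
    SELECT exam_board, COUNT(*) AS all_rows, COUNT(name) AS named_rows
    FROM subjects
    GROUP BY exam_board
    ORDER BY exam_board
    """
).fetchall()

print(rows)  # [(2, 2, 2), (3, 3, 2)]
```

For exam board 3 the two counts differ because one of its three rows has a NULL name.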

MySQL 5.7 return all columns of table based on distinct column

I just upgraded to MySQL 5.7 and unfortunately for me, some of the functionality of GROUP BY is gone. I wanted to select all movies from my movies table as long as the movies.id (of type int) is not a duplicate. My previous query in MySQL 5.6 was:
SELECT *
FROM movies
WHERE movies.title LIKE '%example%'
GROUP BY movies.id
If I had two movies with the same id, it would only display one movie, instead of that movie and its duplicates.
When I upgraded to MySQL 5.7, the GROUP BY gave me errors and I was instead told to use ORDER BY. However this query:
SELECT *
FROM movies
WHERE movies.title LIKE '%example%'
ORDER BY movies.id
Does return duplicate movies. So, is there a way to filter this out, and only return a row if it isn't a duplicate?
Edit: For example if this is my movies table:
movies
==================
| id | title |
==================
| 1 | example |
------------------
| 2 | example |
------------------
| 1 | example |
------------------
Here is the output of each query:
Previous query result (with MySQL 5.6)
=======
1 | example
2 | example
New query result (with MySQL 5.7 and ORDER BY)
=======
1 | example
1 | example
2 | example
I want the final result to contain no duplicates (so the result should look like the first query result).
Edit 2: I understand I was sort of abusing the way MySQL handled GROUP BY. Unfortunately, I do not have much experience with MySQL and got that answer from StackOverflow. I would just like to return all columns in my table that do not contain duplicate ids.
I believe it would be easy to use the DISTINCT keyword:
SELECT distinct movies.*
FROM movies
WHERE movies.title = 'example'
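A small sketch of what DISTINCT does and does not do for this table, assuming SQLite as a stand-in for MySQL and using the sample rows from the question. DISTINCT compares whole result rows, so it removes exact duplicates but keeps two rows that share an id and differ in another column; the extra row inserted below is hypothetical and only there to show that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(1, "example"), (2, "example"), (1, "example")],
)

query = """
    SELECT DISTINCT id, title
    FROM movies
    WHERE title LIKE '%example%'
    ORDER BY id, title
"""

# The two identical (1, 'example') rows collapse into one.
exact_duplicates_removed = conn.execute(query).fetchall()
print(exact_duplicates_removed)  # [(1, 'example'), (2, 'example')]

# Hypothetical extra row: same id as an existing row, different title.
conn.execute("INSERT INTO movies VALUES (2, 'another example')")
same_id_different_title = conn.execute(query).fetchall()
print(same_id_different_title)
# [(1, 'example'), (2, 'another example'), (2, 'example')]
```

If the requirement is strictly one row per id, a greatest-n-per-group style query would be needed instead of DISTINCT.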

How to compare two columns to find unmatched records in MySQL

I have a MySQL table with 2 columns, and each column has thousands of records.
For example, 15000 email addresses in column1 and 15005 email addresses in column2.
How do I find those 5 records out of the 15005 that have no match in column1?
I want a MySQL query that compares both columns and returns only the 5 unmatched records.
Thanks
Not sure if I got it right... but would it be something like?
select column2 from table
where column2 not in (select column1 from table)
Richard, it's highly unusual to find matching/missing rows from one column in a table compared against another column in the same table.
You can think of a table as being a collection of facts, with each row being one fact. Converting values into predicates is how we understand the data. The value "12" in one table may mean "there exists a day on which 12 widgets were made," or "12 people bought widgets on Jan. 1," or "on Jan. 12, no widgets were sold," but whatever the table's corresponding predicate is, "12" should represent a fact.
It's common to want to find the difference between two tables: "what facts are in B that aren't in A?" But in a table with two columns, each row should conceptually be a fact about that pair of values. Perhaps the predicate for the row (12, 13) might be "on Jan. 12, we sold 13 widgets." But in that case I doubt you'd be asking for this information.
So, if (12,13) is really two of the same predicate -- "someone in district 12 bought widgets, and also, someone in district 13 bought widgets" -- in the long run life will be easier if those are one column, not two. And if it's two different predicates, it would make more sense for them to be in two tables. SQL's flexible and can handle these situations, but you may run into more problems later. If you're interested in more about this subject, searching on "normalization" will find you way more than you want to know :)
Anyway, I think the query you're looking for uses a LEFT JOIN to compare the table against itself. I added the values 1-15000 to col1 and 1-15005 to col2 in this table:
CREATE TABLE `foo` (
`col1` int(11) DEFAULT NULL,
`col2` int(11) DEFAULT NULL,
KEY `idx_col1` (`col1`),
KEY `idx_col2` (`col2`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
mysql> select count(distinct col1), count(distinct col2) from foo;
+----------------------+----------------------+
| count(distinct col1) | count(distinct col2) |
+----------------------+----------------------+
| 15000 | 15005 |
+----------------------+----------------------+
1 row in set (0.01 sec)
By giving the same table two names, I can compare its two columns against each other, and find the col2 values that have no corresponding col1 values -- in those cases, f1.col1 will be NULL:
mysql> select f2.col2
from foo as f2 left join foo as f1 on (f2.col2=f1.col1)
where f1.col1 is null;
+-------+
| col2 |
+-------+
| 15001 |
| 15002 |
| 15003 |
| 15004 |
| 15005 |
+-------+
5 rows in set (0.03 sec)
Regarding Mosty's solution yesterday, I'm not sure it's correct. I try not to use subqueries, so I'm a little out of my depth here. But it doesn't seem to work for at least my attempt to replicate your data set:
mysql> select col2 from foo where col2 not in
(select col1 from foo);
Empty set (0.02 sec)
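It works if I exclude the 5 NULLs from the subquery, which suggests to me that "NOT IN (NULL)" doesn't work the way one might think. The reason is that comparing a value with NULL yields NULL (unknown) rather than true, so x NOT IN (..., NULL) can never evaluate to true and every row is filtered out: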
mysql> select col2 from foo where col2 not in
(select col1 from foo where col1 is not null);
+-------+
| col2 |
+-------+
| 15001 |
| 15002 |
| 15003 |
| 15004 |
| 15005 |
+-------+
5 rows in set (0.02 sec)
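The NULL behaviour of NOT IN can be reproduced on a tiny table. This is a sketch under assumptions: SQLite stands in for MySQL (both follow the standard three-valued logic here), the three sample rows are made up, and the NOT EXISTS variant is an alternative not taken from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (col1 INTEGER, col2 INTEGER)")
# col2 holds 1..3, col1 holds 1..2 plus one NULL, mirroring the setup above.
conn.executemany("INSERT INTO foo VALUES (?, ?)", [(1, 1), (2, 2), (None, 3)])

# 3 NOT IN (1, 2, NULL) evaluates to NULL, not true, so no row qualifies.
not_in = conn.execute(
    "SELECT col2 FROM foo WHERE col2 NOT IN (SELECT col1 FROM foo)"
).fetchall()
print(not_in)  # []

# Filtering the NULLs out of the subquery restores the expected answer.
not_in_filtered = conn.execute(
    """
    SELECT col2 FROM foo
    WHERE col2 NOT IN (SELECT col1 FROM foo WHERE col1 IS NOT NULL)
    """
).fetchall()
print(not_in_filtered)  # [(3,)]

# NOT EXISTS is unaffected by the NULL: it only asks whether a match exists.
not_exists = conn.execute(
    """
    SELECT f2.col2
    FROM foo AS f2
    WHERE NOT EXISTS (SELECT 1 FROM foo AS f1 WHERE f1.col1 = f2.col2)
    ORDER BY f2.col2
    """
).fetchall()
print(not_exists)  # [(3,)]
```

Like the LEFT JOIN ... IS NULL form, NOT EXISTS gives the unmatched values without needing an explicit IS NOT NULL filter.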
The main reason I avoid subqueries in MySQL is that they have unpredictable performance characteristics, or at least, complex enough that I can't predict them. For more information, see the "O(MxN)" comment in http://dev.mysql.com/doc/refman/5.5/en/subquery-restrictions.html and the advice on the short webpage http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html .