How to find duplicates pairs in MySQL - mysql

I have a MySQL table like this:
| id1 | id2 |
| 34567 | 75879 | <---- pair1
| 13245 | 46753 |
| 75879 | 34567 | <---- pair2
| 06898 | 00013 |
with 37 000 entries.
What is the SQL Request or how can i identify duplicates pairs (like pair1 and pair2)?
Thanks

if you want to identify the duplicates and count them at the same time, you could use:
SELECT if(id1 < id2, id1, id2), if (id1 < id2, id2, id1), count(*)
FROM your_table
GROUP BY 1,2
HAVING count(*) > 1
This does not perform a join, which might be faster in the end.

If you join the table with it self you can filter out the ones you need.
SELECT *
FROM your_table yt1,
your_table yt2
WHERE (yt1.id1 = yt2.id2 AND yt1.id2 = yt1.id1)
OR (yt1.id1 = yt2.id1 AND yt1.id2 = yt2.id2)

The original post is 1000 years old, but here's another form:
SELECT CONCAT(d1, '/' d2) AS pair, count(*) AS total
FROM your_table
GROUP BY pair HAVING total > 1
ORDER BY total DESC;
May or may not perform as well as the other suggested answers.

Related

Select all records where last n characters in column are not unique

I have bit strange requirement in mysql.
I should select all records from table where last 6 characters are not unique.
for example if I have table:
I should select row 1 and 3 since last 6 letters of this values are not unique.
Do you have any idea how to implement this?
Thank you for help.
I uses a JOIN against a subquery where I count the occurences of each unique combo of n (2 in my example) last chars
SELECT t.*
FROM t
JOIN (SELECT RIGHT(value, 2) r, COUNT(RIGHT(value, 2)) rc
FROM t
GROUP BY r) c ON c.r = RIGHT(value, 2) AND c.rc > 1
Something like that should work:
SELECT `mytable`.*
FROM (SELECT RIGHT(`value`, 6) AS `ending` FROM `mytable` GROUP BY `ending` HAVING COUNT(*) > 1) `grouped`
INNER JOIN `mytable` ON `grouped`.`ending` = RIGHT(`value`, 6)
but it is not fast. This requires a full table scan. Maybe you should rethink your problem.
EDITED: I had a wrong understanding of the question previously and I don't really want to change anything from my initial answer. But if my previous answer is not acceptable in some environment and it might mislead people, I have to correct it anyhow.
SELECT GROUP_CONCAT(id),RIGHT(VALUE,6)
FROM table1
GROUP BY RIGHT(VALUE,6) HAVING COUNT(RIGHT(VALUE,6)) > 1;
Since this question already have good answers, I made my query in a slightly different way. And I've tested with sql_mode=ONLY_FULL_GROUP_BY. ;)
This is what you need: a subquery to get the duplicated right(value,6) and the main query yo get the rows according that condition.
SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
SELECT RIGHT(`value`,6)
FROM t
GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1);
UPDATE
This is the solution to avoid the mysql error in the case you have sql_mode=only_full_group_by
SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
SELECT DISTINCT right_value FROM (
SELECT RIGHT(`value`,6) AS right_value,
COUNT(*) AS TOT
FROM t
GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1) t2
)
Fiddle here
Might be a fast code, as there is no counting involved.
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/0
select *
from tbl outr
where not exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Output:
| id | value |
| --- | --------------- |
| 2 | aaaaaaaaaaaaaa |
| 4 | aaaaaaaaaaaaaaB |
| 5 | Hello |
The logic is to test other rows that is not equal to the same id of the outer row. If those other rows has same right 6 characters as the outer row, then don't show that outer row.
UPDATE
I misunderstood the OP's intent. It's the reversed. Anyway, just reverse the logic. Use EXISTS instead of NOT EXISTS
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/3
select *
from tbl outr
where exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Output:
| id | value |
| --- | ----------- |
| 1 | abcdePuzzle |
| 3 | abcPuzzle |
UPDATE
Tested the query. The performance of my answer (correlated EXISTS approach) is not optimal. Just keeping my answer, so others will know what approach to avoid :)
GhostGambler's answer is faster than correlated EXISTS approach. For 5 million rows, his answer takes 2.762 seconds only:
explain analyze
SELECT
tbl.*
FROM
(
SELECT
RIGHT(value, 6) AS ending
FROM
tbl
GROUP BY
ending
HAVING
COUNT(*) > 1
) grouped
JOIN tbl ON grouped.ending = RIGHT(value, 6)
My answer (correlated EXISTS) takes 4.08 seconds:
explain analyze
select *
from tbl outr
where exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Straightforward query is the fastest, no join, just plain IN query. 2.722 seconds. It has practically the same performance as JOIN approach since they have the same execution plan. This is kiks73's answer. I just don't know why he made his second answer unnecessarily complicated.
So it's just a matter of taste, or choosing which code is more readable select from in vs select from join
explain analyze
SELECT *
FROM tbl
where right(value, 6) in
(
SELECT
RIGHT(value, 6) AS ending
FROM
tbl
GROUP BY
ending
HAVING
COUNT(*) > 1
)
Result:
Test data used:
CREATE TABLE tbl (
id INTEGER primary key,
value VARCHAR(20)
);
INSERT INTO tbl
(id, value)
VALUES
('1', 'abcdePuzzle'),
('2', 'aaaaaaaaaaaaaa'),
('3', 'abcPuzzle'),
('4', 'aaaaaaaaaaaaaaB'),
('5', 'Hello');
insert into tbl(id, value)
select x.y, 'Puzzle'
from generate_series(6, 5000000) as x(y);
create index ix_tbl__right on tbl(right(value, 6));
Performances without the index, and with index on tbl(right(value, 6)):
JOIN approach:
Without index: 3.805 seconds
With index: 2.762 seconds
IN approach:
Without index: 3.719 seconds
With index: 2.722 seconds
Just a bit neater code (if using MySQL 8.0). Can't guarantee the performance though
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/1
select x.*
from
(
select
*,
count(*) over(partition by right(value, 6)) as unique_count
from tbl
) as x
where x.unique_count = 1
Output:
| id | value | unique_count |
| --- | --------------- | ------------ |
| 2 | aaaaaaaaaaaaaa | 1 |
| 4 | aaaaaaaaaaaaaaB | 1 |
| 5 | Hello | 1 |
UPDATE
I misunderstood OP's intent. It's the reversed. Just change the count:
select x.*
from
(
select
*,
count(*) over(partition by right(value, 6)) as unique_count
from tbl
) as x
where x.unique_count > 1
Output:
| id | value | unique_count |
| --- | ----------- | ------------ |
| 1 | abcdePuzzle | 2 |
| 3 | abcPuzzle | 2 |

SQL - select rows that have the same value in two columns

The solution to the topic is evading me.
I have a table looking like (beyond other fields that have nothing to do with my question):
NAME,CARDNUMBER,MEMBERTYPE
Now, I want a view that shows rows where the cardnumber AND membertype is identical. Both of these fields are integers. Name is VARCHAR. Name is not unique, and duplicate cardnumber, membertype should show for the same name, as well.
I.e. if the following was the table:
JOHN | 324 | 2
PETER | 642 | 1
MARK | 324 | 2
DIANNA | 753 | 2
SPIDERMAN | 642 | 1
JAMIE FOXX | 235 | 6
I would want:
JOHN | 324 | 2
MARK | 324 | 2
PETER | 642 | 1
SPIDERMAN | 642 | 1
this could just be sorted by cardnumber to make it useful to humans.
What's the most efficient way of doing this?
What's the most efficient way of doing this?
I believe a JOIN will be more efficient than EXISTS
SELECT t1.* FROM myTable t1
JOIN (
SELECT cardnumber, membertype
FROM myTable
GROUP BY cardnumber, membertype
HAVING COUNT(*) > 1
) t2 ON t1.cardnumber = t2.cardnumber AND t1.membertype = t2.membertype
Query plan: http://www.sqlfiddle.com/#!2/0abe3/1
You can use exists for this:
select *
from yourtable y
where exists (
select 1
from yourtable y2
where y.name <> y2.name
and y.cardnumber = y2.cardnumber
and y.membertype = y2.membertype)
SQL Fiddle Demo
Since you mentioned names can be duplicated, and that a duplicate name still means is a different person and should show up in the result set, we need to use a GROUP BY HAVING COUNT(*) > 1 in order to truly detect dupes. Then join this back to the main table to get your full result list.
Also since from your comments, it sounds like you are wrapping this into a view, you'll need to separate out the subquery.
CREATE VIEW DUP_CARDS
AS
SELECT CARDNUMBER, MEMBERTYPE
FROM mytable t2
GROUP BY CARDNUMBER, MEMBERTYPE
HAVING COUNT(*) > 1
CREATE VIEW DUP_ROWS
AS
SELECT t1.*
FROM mytable AS t1
INNER JOIN DUP_CARDS AS DUP
ON (T1.CARDNUMBER = DUP.CARDNUMBER AND T1.MEMBERTYPE = DUP.MEMBERTYPE )
SQL Fiddle Example
If you just need to know the valuepairs of the 3 fields that are not unique then you could simply do:
SELECT concat(NAME, "|", CARDNUMBER, "|", MEMBERTYPE) AS myIdentifier,
COUNT(*) AS count
FROM myTable
GROUP BY myIdentifier
HAVING count > 1
This will give you all the different pairs of NAME, CARDNUMBER and MEMBERTYPE that are used more than once with a count (how many times they are duplicated). This doesnt give you back the entries, you would have to do that in a second step.

mysql small count issue on same table

Please find db structure as following...
| id | account_number | referred_by |
+----+-----------------+--------------+
| 1 | ac203003 | ac203005 |
+----+-----------------+--------------+
| 2 | ac203004 | ac203005 |
+----+-----------------+--------------+
| 3 | ac203005 | ac203004 |
+----+-----------------+--------------+
I want to achieve following results...
id, account_number, total_referred
1, ac203005, 2
2, ac203003m 0
3, ac203004, 1
And i am using following query...
SELECT id, account_number,
(SELECT count(*) FROM `member_tbl` WHERE referred_by = account_number) AS total_referred
FROM `member_tbl`
GROUP BY id, account_number
but its not giving expected results, please help. thanks.
You need to use table aliases to do this correctly:
SELECT id, account_number,
(SELECT count(*)
FROM `member_tbl` t2
WHERE t2.referred_by = t1.account_number
) AS total_referred
FROM `member_tbl` t1;
Your original query had referred_by = account_number. Without aliases, these would come from the same row -- and the value would be 0.
Also, I removed the outer group by. It doesn't seem necessary, unless you want to remove duplicates.
One idea is to join the table on itself. This way you can avoid the subquery. There might be performance gains with this approach.
select b.id, b.account_number, count(a.referred_by)
from member_tbl a inner join member_tbl b
on a.referred_by=b.account_number
group by (a.referred_by);
SQL fiddle: http://sqlfiddle.com/#!2/b1393/2
Another test, with more data: http://sqlfiddle.com/#!2/8d216/1
select t1.account_number, count(t2.referred_by)
from (select account_number from member_tbl) t1
left join member_tbl t2 on
t1.account_number = t2.referred_by
group by t1.account_number;
Fiddle for your data
Fiddle with more data

Using ORDER BY and GROUP BY together

My table looks like this (and I'm using MySQL):
m_id | v_id | timestamp
------------------------
6 | 1 | 1333635317
34 | 1 | 1333635323
34 | 1 | 1333635336
6 | 1 | 1333635343
6 | 1 | 1333635349
My target is to take each m_id one time, and order by the highest timestamp.
The result should be:
m_id | v_id | timestamp
------------------------
6 | 1 | 1333635349
34 | 1 | 1333635336
And i wrote this query:
SELECT * FROM table GROUP BY m_id ORDER BY timestamp DESC
But, the results are:
m_id | v_id | timestamp
------------------------
34 | 1 | 1333635323
6 | 1 | 1333635317
I think it causes because it first does GROUP_BY and then ORDER the results.
Any ideas? Thank you.
One way to do this that correctly uses group by:
select l.*
from table l
inner join (
select
m_id, max(timestamp) as latest
from table
group by m_id
) r
on l.timestamp = r.latest and l.m_id = r.m_id
order by timestamp desc
How this works:
selects the latest timestamp for each distinct m_id in the subquery
only selects rows from table that match a row from the subquery (this operation -- where a join is performed, but no columns are selected from the second table, it's just used as a filter -- is known as a "semijoin" in case you were curious)
orders the rows
If you really don't care about which timestamp you'll get and your v_id is always the same for a given m_i you can do the following:
select m_id, v_id, max(timestamp) from table
group by m_id, v_id
order by max(timestamp) desc
Now, if the v_id changes for a given m_id then you should do the following
select t1.* from table t1
left join table t2 on t1.m_id = t2.m_id and t1.timestamp < t2.timestamp
where t2.timestamp is null
order by t1.timestamp desc
Here is the simplest solution
select m_id,v_id,max(timestamp) from table group by m_id;
Group by m_id but get max of timestamp for each m_id.
You can try this
SELECT tbl.* FROM (SELECT * FROM table ORDER BY timestamp DESC) as tbl
GROUP BY tbl.m_id
SQL>
SELECT interview.qtrcode QTR, interview.companyname "Company Name", interview.division Division
FROM interview
JOIN jobsdev.employer
ON (interview.companyname = employer.companyname AND employer.zipcode like '100%')
GROUP BY interview.qtrcode, interview.companyname, interview.division
ORDER BY interview.qtrcode;
I felt confused when I tried to understand the question and answers at first. I spent some time reading and I would like to make a summary.
The OP's example is a little bit misleading.
At first I didn't understand why the accepted answer is the accepted answer.. I thought that the OP's request could be simply fulfilled with
select m_id, v_id, max(timestamp) as max_time from table
group by m_id, v_id
order by max_time desc
Then I took a second look at the accepted answer. And I found that actually the OP wants to express that, for a sample table like:
m_id | v_id | timestamp
------------------------
6 | 1 | 11
34 | 2 | 12
34 | 3 | 13
6 | 4 | 14
6 | 5 | 15
he wants to select all columns based only on (group by)m_id and (order by)timestamp.
Then the above sql won't work. If you still don't get it, imagine you have more columns than m_id | v_id | timestamp, e.g m_id | v_id | timestamp| columnA | columnB |column C| .... With group by, you can only select those "group by" columns and aggreate functions in the result.
By far, you should have understood the accepted answer.
What's more, check row_number function introduced in MySQL 8.0:
https://www.mysqltutorial.org/mysql-window-functions/mysql-row_number-function/
Finding top N rows of every group
It does the simlar thing as the accepted answer.
Some answers are wrong. My MySQL gives me error.
select m_id,v_id,max(timestamp) from table group by m_id;
#abinash sahoo
SELECT m_id,v_id,MAX(TIMESTAMP) AS TIME
FROM table_name
GROUP BY m_id
#Vikas Garhwal
Error message:
[42000][1055] Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'testdb.test_table.v_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
Why make it so complicated? This worked.
SELECT m_id,v_id,MAX(TIMESTAMP) AS TIME
FROM table_name
GROUP BY m_id
Just you need to desc with asc. Write the query like below. It will return the values in ascending order.
SELECT * FROM table GROUP BY m_id ORDER BY m_id asc;

MySQL - Exclude rows from Select based on duplication of two columns

I am attempting to narrow results of an existing complex query based on conditional matches on multiple columns within the returned data set. I'll attempt to simplify the data as much as possible here.
Assume that the following table structure represents the data that my existing complex query has already selected (here ordered by date):
+----+-----------+------+------------+
| id | remote_id | type | date |
+----+-----------+------+------------+
| 1 | 1 | A | 2011-01-01 |
| 3 | 1 | A | 2011-01-07 |
| 5 | 1 | B | 2011-01-07 |
| 4 | 1 | A | 2011-05-01 |
+----+-----------+------+------------+
I need to select from that data set based on the following criteria:
If the pairing of remote_id and type is unique to the set, return the row always
If the pairing of remote_id and type is not unique to the set, take the following action:
Of the sets of rows for which the pairing of remote_id and type are not unique, return only the single row for which date is greatest and still less than or equal to now.
So, if today is 2011-01-10, I'd like the data set returned to be:
+----+-----------+------+------------+
| id | remote_id | type | date |
+----+-----------+------+------------+
| 3 | 1 | A | 2011-01-07 |
| 5 | 1 | B | 2011-01-07 |
+----+-----------+------+------------+
For some reason I'm having no luck wrapping my head around this one. I suspect the answer lies in good application of group by, but I just can't grasp it. Any help is greatly appreciated!
/* Rows with exactly one date - always return regardless of when date occurs */
SELECT id, remote_id, type, date
FROM YourTable
GROUP BY remote_id, type
HAVING COUNT(*) = 1
UNION
/* Rows with more than one date - Return Max date <= NOW */
SELECT yt.id, yt.remote_id, yt.type, yt.date
FROM YourTable yt
INNER JOIN (SELECT remote_id, type, max(date) as maxdate
FROM YourTable
WHERE date <= DATE(NOW())
GROUP BY remote_id, type
HAVING COUNT(*) > 1) sq
ON yt.remote_id = sq.remote_id
AND yt.type = sq.type
AND yt.date = sq.maxdate
The group by clause groups all rows that have identical values of one or more columns together and returns one row in the result set for them. If you use aggregate functions (min, max, sum, avg etc.) that will be applied for each "group".
SELECT id, remote_id, type, max(date)
FROM blah
GROUP BY remote_id, date;
I'm not whore where today's date comes in, but assumed that was part of the complex query that you didn't describe and I assume isn't directly relevant to your question here.
Try this:
SELECT a.*
FROM table a INNER JOIN
(
select remote_id, type, MAX(date) date, COUNT(1) cnt from table
group by remote_id, type
) b
WHERE a.remote_id = b.remote_id,
AND a.type = b.type
AND a.date = b.date
AND ( (b.cnt = 1) OR (b.cnt>1 AND b.date <= DATE(NOW())))
Try this
select id, remote_id, type, MAX(date) from table
group by remote_id, type
Hey Carson! You could try using the "distinct" keyword on those two fields, and in a union you can use Count() along with group by and some operators to pull non-unique (greatest and less-than today) records!