Detect duplicates - duplicates

My question regards detecting duplicates. Say I have the following data:
clear all
input str2 pos str10 name
A Joe
A Joe
B Frank
C Mike
C Ted
D Mike
D Mike
E Bill
F Bill
end
If I want to detect all the duplicate names, I would simply type:
duplicates tag name, gen(flag)
This gives me:
pos name flag
A Joe 1
A Joe 1
B Frank 0
C Mike 2
C Ted 0
D Mike 2
D Mike 2
E Bill 1
F Bill 1
That is great - it indicates that Joe, Mike, and Bill are duplicates.
But suppose I want to ignore names that are duplicated only within a single pos. In other words, I do not want to flag Joe as a duplicate, because Joe appears only within pos A. I only want to flag Mike and Bill. (While Mike is duplicated within D, he also appears in C, so he appears in more than one pos.)
In other words, I want:
pos name flag
A Joe 0
A Joe 0
B Frank 0
C Mike 1
C Ted 0
D Mike 1
D Mike 1
E Bill 1
F Bill 1
Note that here Mike takes a flag of only 1 instead of 2, because I am treating Mike in D as appearing once rather than twice. A solution that produces 2 instead of 1 for Mike would also be fine.
Is there a way to do this?

This is no longer a duplicates problem in the specific sense of the duplicates command. (Disclaimer: I originally wrote it.)
You just want to know if a given name occurs in different groups. That problem is reviewed in various places, such as here.
One way to proceed is to tag each distinct combination of name and pos just once, and then total those tags within each name. A name whose total (npos below) is greater than 1 occurs in more than one pos.
clear
input str1 pos str5 name flag
A Joe 1
A Joe 1
B Frank 0
C Mike 2
C Ted 0
D Mike 2
D Mike 2
E Bill 1
F Bill 1
end
egen tag = tag(name pos)
egen npos = total(tag), by(name)
list , sepby(pos)
     +---------------------------------+
     | pos    name   flag   tag   npos |
     |---------------------------------|
  1. |   A     Joe      1     1      1 |
  2. |   A     Joe      1     0      1 |
     |---------------------------------|
  3. |   B   Frank      0     1      1 |
     |---------------------------------|
  4. |   C    Mike      2     1      2 |
  5. |   C     Ted      0     1      1 |
     |---------------------------------|
  6. |   D    Mike      2     1      2 |
  7. |   D    Mike      2     0      2 |
     |---------------------------------|
  8. |   E    Bill      1     1      2 |
     |---------------------------------|
  9. |   F    Bill      1     1      2 |
     +---------------------------------+
Some may like to see a solution without egen:
bysort name pos: gen tag = _n == 1
by name: gen npos = sum(tag)
by name: replace npos = npos[_N]
This could be rewritten using just one new variable:
bysort name pos: gen npos = _n == 1
by name: replace npos = sum(npos)
by name: replace npos = npos[_N]
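For readers who want to check the logic outside Stata, here is a minimal Python sketch of the same tag-then-count idea, assuming the data fit in memory as (pos, name) pairs. The function name count_pos_per_name and the derived 0/1 flag are illustrative assumptions, not part of the answer above.

```python
from collections import defaultdict

# Example data from the question, as (pos, name) pairs.
rows = [
    ("A", "Joe"), ("A", "Joe"), ("B", "Frank"),
    ("C", "Mike"), ("C", "Ted"), ("D", "Mike"),
    ("D", "Mike"), ("E", "Bill"), ("F", "Bill"),
]

def count_pos_per_name(rows):
    """Return {name: number of distinct pos values the name occurs in}."""
    pos_seen = defaultdict(set)
    for pos, name in rows:
        # A set keeps each (name, pos) combination once, like egen's tag().
        pos_seen[name].add(pos)
    return {name: len(positions) for name, positions in pos_seen.items()}

npos = count_pos_per_name(rows)

# Hypothetical flag: 1 if the name occurs in more than one pos, else 0.
flags = [(pos, name, int(npos[name] > 1)) for pos, name in rows]

for row in flags:
    print(*row)
```

Using a set per name plays the role of the tag variable: repeated rows within the same pos collapse to one entry, so Joe counts once and Mike counts twice.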

Related

The best way to Find entries missing after inner-join?

Let's suppose that we have the following 3 tables:
Animal
id name
1 dog
2 cat
3 crow
Actions
id name
1 run
2 walk
3 jump
4 fly
5 puppy_eyes
6 swim
Animal_Actions
id Animal_id action_id
1 1 1
2 1 2
3 1 3
4 1 5
5 2 1
6 2 2
7 2 3
8 3 2
9 3 4
I would like to find all the missing animal actions for certain animals.
For example, if we input dog and cat (ids 1 and 2), we should get the following (Animal_id, action_id) pairs: (1,4), (1,6), (2,4), (2,5), (2,6),
and if we input crow (3), we should get (3,1), (3,3), (3,5), (3,6).
Currently I'm doing an inner join between the Animal and Animal_Actions tables on animal ID, loading the result into a set in my code, checking whether every possible animal/action combination is present in that set, and collecting the missing ones. I'm not sure this is the most efficient process. Is there a better way to do this when the data is at a large scale? Thanks in advance.

If you'll be filtering on a small number of Animal records, one approach is to do a CROSS JOIN with the Actions table. That will give you every possible action for each Animal record. Then do a LEFT OUTER JOIN to Animal_Actions and keep only the rows with no match; those are the missing combinations.
For example, using cat = 2 and dog = 1:
SELECT ani.id   AS Animal_Id
     , ani.Name AS Animal_Name
     , act.id   AS Action_Id
     , act.Name AS Action_Name
FROM   Animal ani
       CROSS JOIN Actions act
       LEFT JOIN Animal_Actions aa ON ani.id = aa.Animal_id
                                  AND aa.Action_Id = act.id
WHERE  ani.id IN (1,2)
AND    aa.id IS NULL
ORDER BY ani.Name, act.Name
;
Results:
Animal_Id | Animal_Name | Action_Id | Action_Name
--------: | :---------- | --------: | :----------
2 | cat | 4 | fly
2 | cat | 5 | puppy_eyes
2 | cat | 6 | swim
1 | dog | 4 | fly
1 | dog | 6 | swim
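As a way to try the CROSS JOIN plus LEFT JOIN idea end to end, here is a minimal sketch using Python's built-in sqlite3 module with the sample rows from the question. It assumes SQLite rather than the asker's (unstated) database, and the helper names build_db and missing_actions are hypothetical.

```python
import sqlite3

def build_db():
    """Create an in-memory SQLite database holding the question's sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Animal (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Actions (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Animal_Actions (
            id INTEGER PRIMARY KEY, Animal_id INTEGER, action_id INTEGER);
        INSERT INTO Animal VALUES (1,'dog'),(2,'cat'),(3,'crow');
        INSERT INTO Actions VALUES
            (1,'run'),(2,'walk'),(3,'jump'),(4,'fly'),(5,'puppy_eyes'),(6,'swim');
        INSERT INTO Animal_Actions VALUES
            (1,1,1),(2,1,2),(3,1,3),(4,1,5),(5,2,1),(6,2,2),(7,2,3),(8,3,2),(9,3,4);
    """)
    return conn

def missing_actions(conn, animal_ids):
    """Return (animal_id, action_id) pairs that have no row in Animal_Actions."""
    placeholders = ",".join("?" * len(animal_ids))
    sql = f"""
        SELECT ani.id, act.id
        FROM Animal ani
        CROSS JOIN Actions act
        LEFT JOIN Animal_Actions aa
               ON aa.Animal_id = ani.id AND aa.action_id = act.id
        WHERE ani.id IN ({placeholders})
          AND aa.id IS NULL          -- no match means the pair is missing
        ORDER BY ani.id, act.id
    """
    return conn.execute(sql, list(animal_ids)).fetchall()

conn = build_db()
print(missing_actions(conn, [1, 2]))
print(missing_actions(conn, [3]))
```

Keeping the filtering in the database avoids pulling every existing pair into application memory, which is the part of the original approach that is likely to hurt at scale.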

Query first column with the same data on the second column

ID | Name
---|-----
1 | John
2 | John
3 | Mike
4 | James
5 | Doe
I have this table. I want to query it so that I get this:
12 John
3 Mike
4 James
5 Doe
I've tried collecting the values into an array, but the result is only 12345 Doe. Can anybody please give me an idea?

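Try this. GROUP_CONCAT joins values with a comma by default, so John comes back as 1,2; add SEPARATOR '' inside the call if you want 12: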
SELECT GROUP_CONCAT(ID), NAME
FROM DB.TABLE
GROUP BY NAME
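If the grouping has to happen in application code instead, the same result can be sketched in a few lines of Python. This is only an illustration under the assumption that the rows arrive as (ID, Name) tuples; concat_ids_by_name is a hypothetical helper, not a library function.

```python
def concat_ids_by_name(rows, sep=""):
    """Group IDs by name and join them, keeping first-seen name order."""
    grouped = {}
    for id_, name in rows:
        # One list of IDs per name; a fresh list is created on first sight.
        grouped.setdefault(name, []).append(str(id_))
    return [(sep.join(ids), name) for name, ids in grouped.items()]

rows = [(1, "John"), (2, "John"), (3, "Mike"), (4, "James"), (5, "Doe")]
result = concat_ids_by_name(rows)

for ids, name in result:
    print(ids, name)
```

The result of 12345 Doe described in the question is what you get when every ID is appended to a single accumulator instead of one accumulator per name.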

mysql Update table after querying from it

I have a table filled with first and last names. I have two other columns that I am trying to update. These two columns hold the number of other people who have the same first name (samef) and the same last name (samel). For example,
first last samef samel
John Smith 1 2
John Adams 1 1
Mary Kate 0 0
Kate Adams 2 1
Kate Smith 2 2
Kate Smith 2 2
Alice Mirth 0 0
So far I can only come up with the two queries below, but of course they are not correct. They return the total count for each name when I need the total count minus 1. Plus, the results come back as two separate result sets.
I was wondering if I should use a stored procedure with variables to store the counts for samef and samel, and then insert them into the names table, but I don't know the correct syntax for this.
SELECT first, last,
       ( SELECT COUNT(*) FROM names WHERE first = table1.first) AS samef
FROM names AS table1;

SELECT first, last,
       ( SELECT COUNT(*) FROM names WHERE last = table2.last) AS samel
FROM names AS table2;
I am new to MySQL, so please provide explanations.

Just like Strawberry mentioned, do not store information that can be derived. Databases are great at storing data optimally. SQL is great at extracting table and derived/calculated data. Try this:
select `first`, `last`,
       (select count(*) - 1 from test where `first` = t.`first`) as samef,
       (select count(*) - 1 from test where `last` = t.`last`) as samel
from test t;
Example: http://sqlfiddle.com/#!9/9c673f/1
Result:
| first | last | samef | samel |
|-------|-------|-------|-------|
| john | smith | 1 | 2 |
| john | adams | 1 | 1 |
| mary | kate | 0 | 0 |
| kate | adams | 2 | 1 |
| kate | smith | 2 | 2 |
| kate | smith | 2 | 2 |
| alice | mirth | 0 | 0 |
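To make the count-minus-one logic concrete, here is a small Python sketch that derives both columns from the sample names in the question. It assumes the rows are available as (first, last) tuples; same_name_counts is an illustrative name only.

```python
from collections import Counter

names = [
    ("John", "Smith"), ("John", "Adams"), ("Mary", "Kate"),
    ("Kate", "Adams"), ("Kate", "Smith"), ("Kate", "Smith"),
    ("Alice", "Mirth"),
]

def same_name_counts(names):
    """Return (first, last, samef, samel) rows, derived rather than stored."""
    first_counts = Counter(first for first, _ in names)
    last_counts = Counter(last for _, last in names)
    # Subtract 1 so that a person is not counted as sharing a name with themselves.
    return [
        (first, last, first_counts[first] - 1, last_counts[last] - 1)
        for first, last in names
    ]

derived = same_name_counts(names)
for row in derived:
    print(*row)
```

As in the SQL version, nothing is written back to the table: the counts are recomputed from the names each time, so they cannot drift out of date.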

MYSQL - ORDER BY clause

I have some records like this:
id name sequence
------------------------
1 steve 3
2 lee 2
3 lisa 1
4 john 0
5 smith 0
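I want to display the records in the following order, with the nonzero sequence values first in ascending order and the zero-sequence rows last: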
id name
------------
1 lisa
2 lee
3 steve
4 john
5 smith
When I use an ORDER BY clause, the records are displayed like this:
name
----
john
smith
lisa
lee
steve
Query
SELECT name from tbl1 where 1 ORDER BY sequence ASC

SELECT name
FROM tbl1
ORDER BY sequence = 0,
         sequence ASC
or
SELECT name
FROM tbl1
ORDER BY case when sequence <> 0 then 1 else 2 end,
         sequence ASC
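Both queries sort on two keys: first whether sequence is zero, then sequence itself. A minimal Python sketch of that two-key sort, assuming the rows are (id, name, sequence) tuples and using the hypothetical helper order_zero_last:

```python
records = [
    (1, "steve", 3),
    (2, "lee", 2),
    (3, "lisa", 1),
    (4, "john", 0),
    (5, "smith", 0),
]

def order_zero_last(records):
    """Sort by sequence ascending, but push rows with sequence 0 to the end."""
    # False sorts before True, so nonzero sequences come first;
    # the second key orders them ascending.
    return sorted(records, key=lambda r: (r[2] == 0, r[2]))

ordered_names = [name for _, name, _ in order_zero_last(records)]
print(ordered_names)
```

Because Python's sort is stable, rows that tie on both keys (john and smith here) keep their original relative order; SQL gives no such guarantee unless another ORDER BY term is added.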

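You can use a query with an IF condition in the ORDER BY clause. Note that this IF mixes a string (name) with a number (sequence), so MySQL compares the results as strings: it works for the single-digit values here, but a sequence of 10 would likely sort before 2.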
SELECT
name
from tbl1
ORDER BY IF(sequence = 0,name,sequence) ASC
Output
| NAME |
|-------|
| lisa |
| lee |
| steve |
| john |
| smith |

SQL find the most recurring element based on different table relations

I have a question regarding how to do an SQL query. I wrote a sample database that I am using here, I am trying to keep things simple for all of you who wish to help.
Officer
Oid | Dname | Rank
----|-------|-----
1 | John | Jr
2 | Jack | Sr
3 | Jay | Jr
4 | Jim | Jr
5 | Jules | Sr

Permit
Oid | Type | Model
----|------|------
1 | D1 | Ford
1 | D2 | Ford
2 | D1 | Ford
3 | D1 | Ford
5 | D1 | Ford
1 | D3 | Ford
2 | D2 | Ford
4 | D1 | Ford
1 | D5 | Ford

Vehicle
Vid | Type | Model
----|------|------
1 | D1 | Ford
2 | D2 | Ford
3 | D3 | Ford
4 | D4 | Ford
5 | D5 | Ford

Dispatch
Did | Oid | Location
----|-----|---------
1 | 1 | Hill
2 | 2 | Beach
3 | 3 | Post
4 | 1 | Beach
5 | 2 | Hill
6 | 4 | Post
7 | 5 | Hill
8 | 5 | Beach
9 | 2 | Post
The relation between the tables are:
Officer - lists the officer by OID(officer ID)/Name/Rank where Sr is highest, Jr is lowest.
Permit - Officers are required to have a permit depending on the vehicle they will be using, Oid for Officer ID, Type for the vehicle and Model.
Vehicle - Vid for vehicle ID, Type and Model
Dispatch - Did for Dispatch ID, keeps track of which officer (Oid) was dispatched to which location (Location)
Question: I need to know a couple of things from here.
First, how do I find which officers are permitted to drive all vehicle types?
Second, how do I find which officers have been dispatched to every location that appears in Dispatch?
Writing these two queries has been a nightmare for me. I have tried joining different tables, but I still cannot get the most recurring element from either (I don't know how!). Any assistance will be much appreciated!

First question:
select Oid, count(*) type_count
from Permit
group by Oid
having type_count = (select count(distinct Type, Model) from Vehicle)
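Second (counting distinct locations, since an officer may be dispatched to the same place more than once):
select Oid, count(distinct Location) location_count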
from Dispatch
group by Oid
having location_count = (select count(distinct Location) from Dispatch)
See a pattern?
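The pattern (compare each officer's distinct count with the overall distinct count) can be tried against the sample data with Python's built-in sqlite3 module. This is a sketch under assumptions: it targets SQLite, not MySQL, and because SQLite does not accept a multi-column COUNT(DISTINCT ...), the permit query joins Permit to Vehicle and counts distinct Vid instead. The function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Permit (Oid INTEGER, Type TEXT, Model TEXT);
    CREATE TABLE Vehicle (Vid INTEGER PRIMARY KEY, Type TEXT, Model TEXT);
    CREATE TABLE Dispatch (Did INTEGER PRIMARY KEY, Oid INTEGER, Location TEXT);
    INSERT INTO Permit VALUES
        (1,'D1','Ford'),(1,'D2','Ford'),(2,'D1','Ford'),(3,'D1','Ford'),
        (5,'D1','Ford'),(1,'D3','Ford'),(2,'D2','Ford'),(4,'D1','Ford'),
        (1,'D5','Ford');
    INSERT INTO Vehicle VALUES
        (1,'D1','Ford'),(2,'D2','Ford'),(3,'D3','Ford'),(4,'D4','Ford'),
        (5,'D5','Ford');
    INSERT INTO Dispatch VALUES
        (1,1,'Hill'),(2,2,'Beach'),(3,3,'Post'),(4,1,'Beach'),(5,2,'Hill'),
        (6,4,'Post'),(7,5,'Hill'),(8,5,'Beach'),(9,2,'Post');
""")

def officers_at_all_locations(conn):
    """Officers whose distinct dispatch locations cover every location used."""
    sql = """
        SELECT Oid FROM Dispatch
        GROUP BY Oid
        HAVING COUNT(DISTINCT Location) =
               (SELECT COUNT(DISTINCT Location) FROM Dispatch)
        ORDER BY Oid
    """
    return [oid for (oid,) in conn.execute(sql)]

def officers_with_all_vehicles(conn):
    """Officers holding a permit that matches every row in Vehicle."""
    sql = """
        SELECT p.Oid
        FROM Permit p
        JOIN Vehicle v ON v.Type = p.Type AND v.Model = p.Model
        GROUP BY p.Oid
        HAVING COUNT(DISTINCT v.Vid) = (SELECT COUNT(*) FROM Vehicle)
        ORDER BY p.Oid
    """
    return [oid for (oid,) in conn.execute(sql)]

print(officers_at_all_locations(conn))
print(officers_with_all_vehicles(conn))
```

With the sample rows, only officer 2 has been sent to Hill, Beach and Post, and no officer holds a D4 permit, so the second list comes back empty.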
