Finding MySQL near-duplicates across two columns using wildcards


I have a table with id, first_name and last_name columns. I'd like to get a listing of rows where last_name and the first character of first_name are duplicated. I am groping my way around and have a sense that there is a COUNT('WHERE') in there, but can't quite get to it.
In essence, I'm looking for possible duplicates. So, from this subset:
+------+-----------+-----------+-------------+------------+
| id | firstName | lastName | dateOfBirth | createdOn |
+------+-----------+-----------+-------------+------------+
| 143 | Susie | Wong | 2015-12-01 | 2016-07-11 |
| 1268 | Dale | Armstrong | 2017-01-01 | 2017-01-04 |
| 1435 | Olive | Armstrong | 1941-03-11 | 2017-03-08 |
| 2013 | Timotini | Attilio | 1932-01-01 | 2017-08-21 |
| 2014 | Olinda | Attilio | 1938-01-01 | 2017-08-21 |
| 3076 | Sue | Armstrong | 1951-06-01 | 2018-06-22 |
| 3079 | Susan | Armstrong | 1951-09-15 | 2018-06-22 |
+------+-----------+-----------+-------------+------------+
I would like a query that returns only 3076 and 3079 (Sue and Susan Armstrong) based on looking for a matching last name and a matching first initial, like so:
+------+-----------+-----------+-------------+------------+
| id | firstName | lastName | dateOfBirth | createdOn |
+------+-----------+-----------+-------------+------------+
| 3076 | Sue | Armstrong | 1951-06-01 | 2018-06-22 |
| 3079 | Susan | Armstrong | 1951-09-15 | 2018-06-22 |
+------+-----------+-----------+-------------+------------+

Here's one option using exists and left:
select *
from yourtable y
where exists (
    select 1
    from yourtable y2
    where y.id != y2.id
      and y.lastname = y2.lastname
      and left(y.firstname, 1) = left(y2.firstname, 1)
)
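To try the EXISTS approach without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module and the sample rows from the question. It makes two assumptions that are not in the original: the table is called yourtable as in the answer, and substr(x, 1, 1) stands in for LEFT(x, 1), which SQLite does not provide.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (143, "Susie", "Wong"),
        (1268, "Dale", "Armstrong"),
        (1435, "Olive", "Armstrong"),
        (2013, "Timotini", "Attilio"),
        (2014, "Olinda", "Attilio"),
        (3076, "Sue", "Armstrong"),
        (3079, "Susan", "Armstrong"),
    ],
)

# A row qualifies when a different row shares its last name and first initial.
# substr(x, 1, 1) is the SQLite stand-in for MySQL's LEFT(x, 1).
near_duplicates = conn.execute(
    """
    SELECT y.id, y.firstName, y.lastName
    FROM yourtable y
    WHERE EXISTS (
        SELECT 1
        FROM yourtable y2
        WHERE y.id != y2.id
          AND y.lastName = y2.lastName
          AND substr(y.firstName, 1, 1) = substr(y2.firstName, 1, 1)
    )
    ORDER BY y.id
    """
).fetchall()

print(near_duplicates)
```

On this data only the two rows for Sue and Susan Armstrong come back, which matches the output the question asks for.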

To find duplicated last_name values:
SELECT last_name, COUNT(*) AS c
FROM yourtable
GROUP BY last_name
HAVING c > 1;
To also match on the first character of first_name, add LEFT(first_name, 1) to the GROUP BY.
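Grouping on the last name together with the first initial could look roughly like the sketch below. It runs against SQLite through Python's sqlite3 module, so substr(x, 1, 1) replaces MySQL's LEFT(x, 1); the table name yourtable is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (143, "Susie", "Wong"),
        (1268, "Dale", "Armstrong"),
        (1435, "Olive", "Armstrong"),
        (2013, "Timotini", "Attilio"),
        (2014, "Olinda", "Attilio"),
        (3076, "Sue", "Armstrong"),
        (3079, "Susan", "Armstrong"),
    ],
)

# One output row per (last name, first initial) pair that occurs more than once.
duplicate_groups = conn.execute(
    """
    SELECT lastName, substr(firstName, 1, 1) AS initial, COUNT(*) AS c
    FROM yourtable
    GROUP BY lastName, substr(firstName, 1, 1)
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicate_groups)
```

This returns one summary row per duplicated group rather than the individual rows, so it tells you which combinations are suspicious; joining back to the table (or using the EXISTS form) is still needed to list the ids.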

Related

MySQL returns bad result

I have a question about a SELECT ... FROM ... WHERE statement that returns an unexpected result.
Here is my table called friends:
+----------+-----------+------------+--------+--------+-------+
| lastname | firstname | callprefix | phone | region | zip |
+----------+-----------+------------+--------+--------+-------+
| Lužný | Bob | 602 | 111222 | OL | 79821 |
| Matyáš | Bob | 773 | 123456 | BR | NULL |
| Strouhal | Fido | 300 | 343434 | ZL | 76701 |
| Přikryl | Tom | 581 | 010101 | PL | 72000 |
| Černý | Franta | 777 | 000999 | OL | 79801 |
| Zavadil | Olda | 911 | 111311 | OL | 79604 |
| Berka | Standa | 604 | 111234 | ZL | 72801 |
| Vlcik | BbB | 736 | 555444 | KV | 35210 |
+----------+-----------+------------+--------+--------+-------+
And here is my query.
SELECT * FROM friends WHERE region <= 'z';
I would expect that the rows with region ZL should be present, but they are not. Can you please tell me why?
Result is:
+----------+-----------+------------+--------+--------+-------+
| lastname | firstname | callprefix | phone | region | zip |
+----------+-----------+------------+--------+--------+-------+
| Lužný | Bob | 602 | 111222 | OL | 79821 |
| Matyáš | Bob | 773 | 123456 | BR | NULL |
| Přikryl | Tom | 581 | 010101 | PL | 72000 |
| Černý | Franta | 777 | 000999 | OL | 79801 |
| Zavadil | Olda | 911 | 111311 | OL | 79604 |
| Vlcik | BbB | 736 | 555444 | KV | 35210 |
+----------+-----------+------------+--------+--------+-------+
When I try this query:
SELECT * FROM friends WHERE region >= 'z';
the result contains both rows with region = 'ZL'
????
Thank you!

Because "ZL" is greater than "Z". "Z" is just one character, so the condition only returns values that sort before "Z" or are exactly "Z". What are you trying to achieve with this query?

Can you please tell me why?
If you added a record where region is Z and sorted the rows alphabetically by region, would you expect ZL to come before or after Z? It would come after, so it does not meet your criteria.
If you want to only consider the first character, then add that to your criteria:
SELECT * FROM friends WHERE LEFT(region,1) <= 'Z';
I would also make Z explicitly a capital letter in case your database settings make it a case-sensitive search.

Have you tried this?
SELECT * FROM friends WHERE region <= 'zl';
From the computer's perspective, 'z' < 'zl'.
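The difference between comparing the whole string and comparing only its first character can be checked with a small sketch. It uses Python's sqlite3 module, and two things in it are assumptions rather than part of the original: COLLATE NOCASE is used to imitate a case-insensitive MySQL collation, and substr(region, 1, 1) replaces LEFT(region, 1). Only the lastname and region columns are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (lastname TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)",
    [
        ("Lužný", "OL"),
        ("Matyáš", "BR"),
        ("Strouhal", "ZL"),
        ("Přikryl", "PL"),
        ("Černý", "OL"),
        ("Zavadil", "OL"),
        ("Berka", "ZL"),
        ("Vlcik", "KV"),
    ],
)

# Whole-string comparison: 'ZL' sorts after 'z', so the ZL rows drop out.
whole_string = [
    row[0]
    for row in conn.execute(
        "SELECT lastname FROM friends WHERE region <= 'z' COLLATE NOCASE"
    )
]

# First-character comparison: 'Z' equals 'z' case-insensitively, so ZL rows stay.
first_char = [
    row[0]
    for row in conn.execute(
        "SELECT lastname FROM friends "
        "WHERE substr(region, 1, 1) <= 'z' COLLATE NOCASE"
    )
]

print(whole_string)
print(first_char)
```

The first list leaves out Strouhal and Berka (the two ZL rows), while the second contains all eight names.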

How select remaining unspecified columns

I want to give a selected column an alias that is already the name of another column.
I am also looking for a way to select the remaining unspecified columns of the tables.
Note:
The query could have more joins in the future.
eg
Person
+-----------+----------+---------+
| firstname | lastname | pers_id |
+-----------+----------+---------+
| Joe | Soap | 1 |
| Bobby | Pin | 2 |
| Janet | Jackson | 3 |
+-----------+----------+---------+
Category
+----------+-------------------+--------+
| type | description | cat_id |
+----------+-------------------+--------+
| customer | people who pay us | 1 |
| employee | people we pay | 2 |
| director | people who direct | 3 |
+----------+-------------------+--------+
Person_Cat
+---------+--------+
| pers_id | cat_id |
+---------+--------+
| 3 | 1 |
| 2 | 2 |
| 1 | 3 |
+---------+--------+
Query
SELECT *,
       CONCAT(p.firstname, ' ', p.lastname) AS full_name,
       c.cat_id AS category_id,
       p.pers_id AS cat_id
FROM Person AS p
JOIN Person_Cat AS pc ON (p.pers_id = pc.pers_id)
JOIN Category AS c ON (pc.cat_id = c.cat_id)
OUTPUT
(Apologies for the length but the table after is more important)
+-----------+----------+---------+---------+--------+----------+-------------------+--------+---------------+-------------+--------+
| p | p | p | pc | pc | c | c | c | Select | Select | Select |
+-----------+----------+---------+---------+--------+----------+-------------------+--------+---------------+-------------+--------+
| firstname | lastname | pers_id | pers_id | cat_id | type | description | cat_id | full_name | category_id | cat_id |
+-----------+----------+---------+---------+--------+----------+-------------------+--------+---------------+-------------+--------+
| Janet | Jackson | 3 | 3 | 1 | customer | people who pay us | 1 | Janet Jackson | 1 | 3 |
| Bobby | Pin | 2 | 2 | 2 | employee | people we pay | 2 | Bobby Pin | 2 | 2 |
| Joe | Soap | 1 | 1 | 3 | director | people who direct | 3 | Joe Soap | 3 | 1 |
+-----------+----------+---------+---------+--------+----------+-------------------+--------+---------------+-------------+--------+
The header row above the column names shows which table alias each column comes from.
Column summary -
firstname, lastname, pers_id, pers_id, cat_id, type,
description, cat_id, full_name ,category_id, cat_id
Wanted output
+-----------+----------+---------+--------+----------+-------------------+---------------+-------------+--------+
| p | p | pc | pc | c | c | Select | Select | Select |
+-----------+----------+---------+--------+----------+-------------------+---------------+-------------+--------+
| firstname | lastname | pers_id | cat_id | type | description | full_name | category_id | cat_id |
+-----------+----------+---------+--------+----------+-------------------+---------------+-------------+--------+
| Janet | Jackson | 3 | 1 | customer | people who pay us | Janet Jackson | 1 | 3 |
| Bobby | Pin | 2 | 2 | employee | people we pay | Bobby Pin | 2 | 2 |
| Joe | Soap | 1 | 3 | director | people who direct | Joe Soap | 3 | 1 |
+-----------+----------+---------+--------+----------+-------------------+---------------+-------------+--------+
Column summary -
firstname, lastname, pers_id, cat_id, type,
description, full_name ,category_id, cat_id
Notice:
The p.pers_id and the c.cat_id columns are not present. I would like to think this is because they were selected directly and unmodified, unlike firstname and lastname, which were only used inside CONCAT.

The short answer is that there is no such concept as SELECT [remaining columns] at this time (2015-06-17). If you want to use SELECT * but drop the redundant columns,
then you will need to explicitly remove (ignore) those redundant columns when rendering your view.
You will have to explicitly configure the logic for which columns to ignore, which is pretty much the same thing as explicitly listing the columns you are interested in, so you get back to the argument against selecting all columns that I made in the comments above.
Unless your table schema is changing all the time, there really isn't a reason for this.
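For comparison, listing the columns explicitly produces the wanted output directly. The sketch below is one way to write it, run through Python's sqlite3 module; as an assumption for portability it uses the || operator instead of MySQL's CONCAT(), and the choice of which pers_id and cat_id to keep (the ones from pc) follows the "Wanted output" table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (firstname TEXT, lastname TEXT, pers_id INTEGER);
    CREATE TABLE Category (type TEXT, description TEXT, cat_id INTEGER);
    CREATE TABLE Person_Cat (pers_id INTEGER, cat_id INTEGER);
    INSERT INTO Person VALUES
        ('Joe', 'Soap', 1), ('Bobby', 'Pin', 2), ('Janet', 'Jackson', 3);
    INSERT INTO Category VALUES
        ('customer', 'people who pay us', 1),
        ('employee', 'people we pay', 2),
        ('director', 'people who direct', 3);
    INSERT INTO Person_Cat VALUES (3, 1), (2, 2), (1, 3);
    """
)

# Every output column is named once; nothing is pulled in through SELECT *.
cursor = conn.execute(
    """
    SELECT p.firstname,
           p.lastname,
           pc.pers_id,
           pc.cat_id,
           c.type,
           c.description,
           p.firstname || ' ' || p.lastname AS full_name,  -- CONCAT() in MySQL
           c.cat_id AS category_id,
           p.pers_id AS cat_id
    FROM Person AS p
    JOIN Person_Cat AS pc ON p.pers_id = pc.pers_id
    JOIN Category AS c ON pc.cat_id = c.cat_id
    ORDER BY p.pers_id DESC
    """
)
rows = cursor.fetchall()

for row in rows:
    print(row)
```

The result has the nine columns of the wanted output. Adding another join later means adding its columns to the list by hand, which is the trade-off described above.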

Combining data in sql

I have a query that gives me this data:
| id | job | firstName | lastName |
+----+------------+-----------+----------+
| 1 | Programmer | NULL | NULL |
| 2 | NULL | Tom | Tucker |
But I need the table to look like this:
| id | job | firstName | lastName |
+----+------------+-----------+----------+
| 1 | Programmer | Tom | Tucker |
I need for it to display like this, not change the data in the database.

Use aggregate functions, which ignore NULL values. Try this:
select min(id) as id, max(job) as job, max(firstName) as firstName, max(lastName) as lastName
from yourtable
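Here is a small runnable check of that idea, sketched with Python's sqlite3 module. The table name yourtable comes from the answer; the column names follow the question. It relies on MIN() and MAX() skipping NULLs, which holds in both SQLite and MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (id INTEGER, job TEXT, firstName TEXT, lastName TEXT)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?, ?)",
    [
        (1, "Programmer", None, None),  # None is stored as NULL
        (2, None, "Tom", "Tucker"),
    ],
)

# Without GROUP BY the aggregates collapse the whole table into one row,
# and each MAX() picks the only non-NULL value in its column.
combined = conn.execute(
    """
    SELECT MIN(id) AS id,
           MAX(job) AS job,
           MAX(firstName) AS firstName,
           MAX(lastName) AS lastName
    FROM yourtable
    """
).fetchone()

print(combined)
```

This only works cleanly when each column has a single non-NULL value across the rows being merged; with several candidates per column, MAX() would mix values from different rows.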

Remove duplicates SQL while ignoring key and selecting max of specified column

I have the following sample data:
| key_id | name | name_id | data_id |
+--------+-------+---------+---------+
| 1 | jim | 23 | 098 |
| 2 | joe | 24 | 098 |
| 3 | john | 25 | 098 |
| 4 | jack | 26 | 098 |
| 5 | jim | 23 | 091 |
| 6 | jim | 23 | 090 |
I have tried this query:
INSERT INTO temp_table
SELECT
DISTINCT #key_id,
name,
name_id,
#data_id FROM table1,
I am trying to dedupe a table by all fields in a row.
My desired output:
| key_id | name | name_id | data_id |
+--------+-------+---------+---------+
| 1 | jim | 23 | 098 |
| 2 | joe | 24 | 098 |
| 3 | john | 25 | 098 |
| 4 | jack | 26 | 098 |
What I'm actually getting:
| key_id | name | name_id | data_id |
+--------+-------+---------+----------+
| 1 | jim | 23 | NULL |
| 2 | joe | 24 | NULL |
| 3 | john | 25 | NULL |
| 4 | jack | 26 | NULL |
I am able to dedupe the table, but I am setting the 'data_Id' value to NULL by attempting to override the field with '#'
Is there any way to select distinct rows on all other fields while keeping a value for 'data_id'? I will take the highest (MAX) data_id if possible.

If you only want one row returned for a specific value (in this case, name), one option you have is to group by that value. This seems like a good approach because you also said you wanted the largest data_id for each name, so I would suggest grouping and using the MAX() aggregate function like this:
SELECT name, name_id, MAX(data_id) AS data_id
FROM myTable
GROUP BY name, name_id;
The only thing you should be aware of is the possibility that a name occurs multiple times under different name_ids. If that is possible in your table, you could group by the name_id too, which is what I did.
Since you stated you're not interested in the key_id but only the name, I just excluded it from the query altogether to get this:
| name | name_id | data_id |
+-------+---------+---------+
| jim | 23 | 098 |
| joe | 24 | 098 |
| john | 25 | 098 |
| jack | 26 | 098 |
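The grouping query can be tried locally with the sketch below, which loads the sample rows into SQLite through Python's sqlite3 module. Storing data_id as TEXT is an assumption made here so the leading zeros from the question survive; the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (key_id INTEGER, name TEXT, name_id INTEGER, data_id TEXT)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        (1, "jim", 23, "098"),
        (2, "joe", 24, "098"),
        (3, "john", 25, "098"),
        (4, "jack", 26, "098"),
        (5, "jim", 23, "091"),
        (6, "jim", 23, "090"),
    ],
)

# One row per (name, name_id); MAX() keeps the largest data_id in each group.
deduped = conn.execute(
    """
    SELECT name, name_id, MAX(data_id) AS data_id
    FROM myTable
    GROUP BY name, name_id
    ORDER BY name_id
    """
).fetchall()

print(deduped)
```

The three jim rows collapse into one row carrying data_id 098, and the other names pass through unchanged.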

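RENAME TABLE myTable TO Old_myTable,
myTable2 TO myTable;
INSERT INTO myTable
SELECT *
FROM Old_myTable
GROUP BY name, name_id;
This dedupes the table on the values I care about while keeping its structure and ignoring the data_id column. Note that SELECT * with GROUP BY only runs when the ONLY_FULL_GROUP_BY SQL mode is disabled, and the data_id kept for each group is then arbitrary.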

What does group by do exactly ?

Using an example I found elsewhere, I'm trying to understand what GROUP BY does exactly.
Given this employee table:
+-------+----------+--------+------------+
| Empid | Empname | Salary | DOB |
+-------+----------+--------+------------+
| 1 | Habib | 2014 | 2004-12-02 |
| 2 | Karan | 4021 | 2003-04-11 |
| 3 | Samia | 22 | 2008-02-23 |
| 4 | Hui Ling | 25 | 2008-10-15 |
| 5 | Yumie | 29 | 1999-01-26 |
+-------+----------+--------+------------+
After executing
mysql> select * from employee group by empname;
we get:
+-------+----------+--------+------------+
| Empid | Empname | Salary | DOB |
+-------+----------+--------+------------+
| 1 | Habib | 2014 | 2004-12-02 |
| 4 | Hui Ling | 25 | 2008-10-15 |
| 2 | Karan | 4021 | 2003-04-11 |
| 3 | Samia | 22 | 2008-02-23 |
| 5 | Yumie | 29 | 1999-01-26 |
+-------+----------+--------+------------+
So, does that mean that GROUP BY just sorts a table by key?
Thanks

GROUP BY enables summaries. Specifically, it controls the use of summary functions like COUNT(), SUM(), AVG(), MIN(), MAX() etc. There isn't much to summarize in your example.
But, suppose you had a Deptname column. Then you could issue this query and get the average salary by Deptname.
SELECT AVG(Salary) AS Average,
       Deptname
FROM Employee
GROUP BY Deptname
ORDER BY Deptname
If you want your result set put in a certain order, use ORDER BY.
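To see GROUP BY produce an actual summary, the sketch below adds the Deptname column suggested above to the employee data and runs the average-salary query in SQLite via Python's sqlite3 module. The department assignments ('Engineering' and 'Sales') are invented purely for illustration and are not in the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Empid INTEGER, Empname TEXT, Salary INTEGER, Deptname TEXT)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        # Deptname values are hypothetical; the other columns are from the question.
        (1, "Habib", 2014, "Engineering"),
        (2, "Karan", 4021, "Engineering"),
        (3, "Samia", 22, "Sales"),
        (4, "Hui Ling", 25, "Sales"),
        (5, "Yumie", 29, "Sales"),
    ],
)

# Five input rows become one output row per department.
averages = conn.execute(
    """
    SELECT AVG(Salary) AS Average,
           Deptname
    FROM Employee
    GROUP BY Deptname
    ORDER BY Deptname
    """
).fetchall()

for average, deptname in averages:
    print(deptname, average)
```

Here five rows collapse into two, one per department, which is the summarizing behaviour that grouping by a unique column like Empname cannot show.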
