I have a database like this:
users
id name email phone
1 bill bill#fakeemail.com
2 bill bill#fakeemail.com 123456789
3 susan susan#fakeemail.com
4 john john#fakeemail.com 123456789
5 john john#fakeemail.com 987654321
I want to merge records considered duplicates based on the email field.
Trying to figure out how to use the following considerations.
Merge based on duplicate email
If one row has a null value use the row that has the most data.
If 2 rows are duplicates but other fields are different then use the one
with the highest id number (see the john#fakeemail.com row for an example.)
Here is a query I tried:
DELETE FROM users WHERE users.id NOT IN
(SELECT grouped.id FROM (SELECT DISTINCT ON (email) * FROM users) AS grouped)
Getting a syntax error.
I'm trying to get the database to transform to this, I can't figure out the correct query:
users
id name email phone
2 bill bill#fakeemail.com 123456789
3 susan susan#fakeemail.com
5 john john#fakeemail.com 987654321
Here is one option using DELETE with a NOT IN subquery:
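DELETE
FROM users
WHERE id NOT IN (SELECT id
FROM (
SELECT CASE WHEN COUNT(*) = 1
THEN MAX(id)
-- fall back to MAX(id) when every phone in the group is NULL;
-- otherwise a NULL in the list makes NOT IN match no rows at all
ELSE COALESCE(MAX(CASE WHEN phone IS NOT NULL THEN id END), MAX(id)) END AS id
FROM users
GROUP BY email) t);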
The logic of this delete is as follows:
Emails where there is only one record are not deleted
For emails with two or more records, we delete everything except for the record having the highest id value, where the phone is also defined.
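To check the delete logic above against the sample rows from the question, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The in-memory database and the TEXT column types are assumptions made for the sketch. SQLite does not need the extra derived table, so the subquery is written directly.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, name, email, phone) VALUES (?, ?, ?, ?)",
    [
        (1, "bill", "bill#fakeemail.com", None),
        (2, "bill", "bill#fakeemail.com", "123456789"),
        (3, "susan", "susan#fakeemail.com", None),
        (4, "john", "john#fakeemail.com", "123456789"),
        (5, "john", "john#fakeemail.com", "987654321"),
    ],
)

# Keep, per email, the highest id that has a phone; if no row in the group
# has a phone, keep the highest id of the group instead.
conn.execute(
    """
    DELETE FROM users
    WHERE id NOT IN (
        SELECT COALESCE(MAX(CASE WHEN phone IS NOT NULL THEN id END), MAX(id))
        FROM users
        GROUP BY email
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
print(remaining_ids)
```

With the question's data this leaves rows 2, 3 and 5, which matches the table the question asks for.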
Here's a solution that will give you the latest data for each field for each user in the result table, thus meeting your second criterion as well as the first and third. It will work for as many duplicates as you have, subject to the group_concat_max_len condition on GROUP_CONCAT. It uses GROUP_CONCAT to prepare a list of all values of a field for each user, sorted so that the most recent value is first. SUBSTRING_INDEX is then used to extract the first value in that list, which is the most recent. This solution uses a CREATE TABLE ... SELECT command to make a new users table, then DROPs the old one and renames the new table to users.
CREATE TABLE users
(`id` int, `name` varchar(5), `email` varchar(19), `phone` int)
;
INSERT INTO users
(`id`, `name`, `email`, `phone`)
VALUES
(1, 'bill', 'bill#fakeemail.com', 123456789),
(2, 'bill', 'bill#fakeemail.com', NULL),
(3, 'susan', 'susan#fakeemail.com', NULL),
(4, 'john', 'john#fakeemail.com', 123456789),
(5, 'john', 'john#fakeemail.com', 987654321)
;
CREATE TABLE newusers AS
SELECT id
, SUBSTRING_INDEX(names, ',', 1) AS name
, email
, SUBSTRING_INDEX(phones, ',', 1) AS phone
FROM (SELECT id
, GROUP_CONCAT(name ORDER BY id DESC) AS names
, email
, GROUP_CONCAT(phone ORDER BY id DESC) AS phones
FROM users
GROUP BY email) u;
DROP TABLE users;
RENAME TABLE newusers TO users;
SELECT * FROM users
Output:
id name email phone
1 bill bill#fakeemail.com 123456789
4 john john#fakeemail.com 987654321
3 susan susan#fakeemail.com (null)
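The idea of taking the most recent non-null value per field can also be sketched without GROUP_CONCAT. The example below is an assumption-laden illustration rather than the answer's method: it runs on Python's sqlite3 module and swaps GROUP_CONCAT/SUBSTRING_INDEX for correlated subqueries with ORDER BY id DESC LIMIT 1. It also takes MAX(id) per email, which is a choice made here for determinism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT, phone INTEGER)")
conn.executemany(
    "INSERT INTO users (id, name, email, phone) VALUES (?, ?, ?, ?)",
    [
        (1, "bill", "bill#fakeemail.com", 123456789),
        (2, "bill", "bill#fakeemail.com", None),
        (3, "susan", "susan#fakeemail.com", None),
        (4, "john", "john#fakeemail.com", 123456789),
        (5, "john", "john#fakeemail.com", 987654321),
    ],
)

# For each email: highest id, plus the latest non-null name and phone.
merged_rows = conn.execute(
    """
    SELECT MAX(u.id) AS id,
           (SELECT n.name FROM users n
             WHERE n.email = u.email AND n.name IS NOT NULL
             ORDER BY n.id DESC LIMIT 1) AS name,
           u.email,
           (SELECT p.phone FROM users p
             WHERE p.email = u.email AND p.phone IS NOT NULL
             ORDER BY p.id DESC LIMIT 1) AS phone
    FROM users u
    GROUP BY u.email
    ORDER BY id
    """
).fetchall()

for row in merged_rows:
    print(row)
```

Here bill keeps the phone from row 1 even though row 2 has the higher id, because row 2's phone is NULL.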
I have 2 tables and I am combining them using the query below:
Select distinct EmailAddress,CUSTOMER_ID,Send_Date,Unique_key,sub_category
from table1
UNION ALL
Select distinct EmailAddress,CUSTOMER_ID,Send_Date,Unique_key,sub_category
from table2
I am using Unique_key as the primary key. It is the concatenation of send date + customer id. Sometimes both tables can contain the same key, and in such cases I want the query above to return only one row.
Table 1
EmailAddress CUSTOMER_ID Send_Date Unique_key sub_category
a#gmail.com 1001 07-08-2021 70820211001 chair
Table 2
EmailAddress CUSTOMER_ID Send_Date Unique_key sub_category
a#gmail.com 1001 07-08-2021 7082021100 book
Expected result:
EmailAddress CUSTOMER_ID Send_Date Unique_key sub_category
a#gmail.com 1001 07-08-2021 70820211001 chair
Only one record should appear in the final table, and the additional rows should be skipped. I don't want to change anything in the unique key format. Is there any workaround for this?
You need something like:
Select distinct EmailAddress,CUSTOMER_ID,Send_Date,sub_category
from table_1
UNION
SELECT EmailAddress,CUSTOMER_ID,Send_Date,sub_category FROM table_2
WHERE NOT EXISTS ( SELECT NULL
FROM table_1
WHERE table_1.EmailAddress = table_2.EmailAddress ) ;
The SELECT below returns an empty set for your sample data, because the WHERE NOT EXISTS condition keeps only the rows of table_2 that have no matching EmailAddress in table_1.
SELECT EmailAddress,CUSTOMER_ID,Send_Date,sub_category
FROM table_2
WHERE NOT EXISTS ( SELECT NULL
FROM table_1
WHERE table_1.EmailAddress = table_2.EmailAddress
) ;
Demo: https://www.db-fiddle.com/f/pB6b5xrgPKCivFWcpQHsyE/24
Try with your data and let me know.
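A small sketch of the UNION plus NOT EXISTS pattern, run through Python's sqlite3 module as a stand-in for MySQL. The second table_2 row (b#gmail.com) is a hypothetical addition, included only to show that non-matching rows survive. As in the answer, rows are matched on EmailAddress alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("table_1", "table_2"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(EmailAddress TEXT, CUSTOMER_ID INTEGER, Send_Date TEXT, sub_category TEXT)"
    )

conn.execute("INSERT INTO table_1 VALUES ('a#gmail.com', 1001, '07-08-2021', 'chair')")
conn.executemany(
    "INSERT INTO table_2 VALUES (?, ?, ?, ?)",
    [
        ("a#gmail.com", 1001, "07-08-2021", "book"),  # duplicate email, skipped
        ("b#gmail.com", 1002, "07-08-2021", "lamp"),  # hypothetical extra row, kept
    ],
)

# All rows of table_1, plus only those rows of table_2 whose email
# does not already appear in table_1.
combined_rows = conn.execute(
    """
    SELECT EmailAddress, CUSTOMER_ID, Send_Date, sub_category
    FROM table_1
    UNION
    SELECT EmailAddress, CUSTOMER_ID, Send_Date, sub_category
    FROM table_2
    WHERE NOT EXISTS (SELECT 1
                      FROM table_1
                      WHERE table_1.EmailAddress = table_2.EmailAddress)
    ORDER BY EmailAddress
    """
).fetchall()

for row in combined_rows:
    print(row)
```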
I have a question about fetching records from a combination of 2 tables, where the result should be sorted with an ORDER BY clause on a column reached through a foreign key.
I have two tables named users and orders.
Table: users
id name
1 John
2 Doe
Table: orders
id user_id for
1 2 cake
2 1 shake
2 2 milk
In table:orders, user_id is foreign key representing id in table:users.
Question:
I want to extract the records from table:orders but the ORDER should be based on name of users.
Desired Results:
user_id for
2 cake
2 milk
1 shake
Note: Here the rows with user_id 2 are shown before the row with user_id 1. This is because the name Doe sorts before the name John in the ORDER BY.
What I have done right now:
I have no experience with MySQL joins, and searching the internet did not show me how to achieve this. I have written a query, but it does not return the records in the order I want, and I don't know how to change it.
SELECT * FROM orders ORDER BY user_id
It fetches the records ordered by user_id, not by name.
You are right: joining both tables is the simplest way to achieve that, and you can show the names as well, since the tables are joined anyway.
CREATE TABLE orders (
`id` INTEGER,
`user_id` INTEGER,
`for` VARCHAR(5)
);
INSERT INTO orders
(`id`, `user_id`, `for`)
VALUES
('1', '2', 'cake'),
('2', '1', 'shake'),
('2', '2', 'milk');
CREATE TABLE users (
`id` INTEGER,
`name` VARCHAR(4)
);
INSERT INTO users
(`id`, `name`)
VALUES
('1', 'John'),
('2', 'Doe');
SELECT o.`user_id`, o.`for` FROM orders o INNER JOIN users u ON u.id = o.user_id ORDER BY u.name
user_id | for
------: | :----
2 | cake
2 | milk
1 | shake
You can get your desired result by joining the orders table and the users table with the query below.
SELECT o.user_id, o.`for` FROM orders o, users u WHERE o.user_id = u.id ORDER BY u.name;
The WHERE condition matches each row in orders with the row in users whose id equals its user_id. ORDER BY u.name then sorts the rows by the name column of users in ascending order. Only the user_id and for columns of orders appear in the final result.
The columns are qualified with table aliases because both tables have an id column, so an unqualified id would be ambiguous. for is a reserved word in MySQL and has to be quoted with backticks.
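The join can be tried out quickly with Python's sqlite3 module, used here as a stand-in for MySQL. In SQLite the reserved column name is quoted with double quotes instead of backticks. The secondary sort on the for column is an addition in this sketch, so that rows of the same user come back in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute('CREATE TABLE orders (id INTEGER, user_id INTEGER, "for" TEXT)')
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "John"), (2, "Doe")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 2, "cake"), (2, 1, "shake"), (2, 2, "milk")],
)

# Join each order to its user and sort by the user's name.
ordered_rows = conn.execute(
    """
    SELECT o.user_id, o."for"
    FROM orders o
    INNER JOIN users u ON u.id = o.user_id
    ORDER BY u.name, o."for"
    """
).fetchall()

print(ordered_rows)
```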
my table:
drop table if exists new_table;
create table if not exists new_table(
obj_type int(4),
user_id varchar(30),
payer_id varchar(30)
);
insert into new_table (obj_type, user_id, payer_id) values
(1, 'user1', 'payer1'),
(1, 'user2', 'payer1'),
(2, 'user3', 'payer1'),
(1, 'user1', 'payer2'),
(1, 'user2', 'payer2'),
(2, 'user3', 'payer2'),
(3, 'user1', 'payer3'),
(3, 'user2', 'payer3');
I am trying to select all the payer id's whose obj_type is only one value and not any other values. In other words, even though each payer has multiple users, I only want the payers who are only using one obj_type.
I have tried using a query like this:
select * from new_table
where obj_type = 1
group by payer_id;
But this returns rows whose payers also have other users with other obj_types. I am trying to get a result that looks like:
obj | user | payer
----|-------|--------
3 | user1 | payer3
3 | user2 | payer3
Thanks in advance.
That is actually easy:
SELECT payer_id
FROM new_table
GROUP BY payer_id
HAVING COUNT(DISTINCT obj_type) = 1
HAVING filters rows just like WHERE, but it does so after the aggregation, so it can test aggregate values such as COUNT or SUM.
The difference is best explained by an example:
SELECT dept_id, SUM(salary)
FROM employees
WHERE salary > 100000
GROUP BY dept_id
This will give you, per department, the sum of the salaries of the people who each earn more than 100000.
SELECT dept_id, SUM(salary)
FROM employees
GROUP BY dept_id
HAVING SUM(salary) > 100000
The second query will give you the departments where all employees together earn more than 100000 even if no single employee earns that much.
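The difference between the two queries shows up clearly on a tiny data set. The sketch below uses Python's sqlite3 module and a hypothetical employees table with made-up salaries: department 1 has nobody above 100000 but exceeds it in total, department 2 has one person above 100000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, 60000), (1, 70000), (2, 120000), (2, 50000)],  # hypothetical data
)

# WHERE filters individual rows before they are grouped.
where_rows = conn.execute(
    """
    SELECT dept_id, SUM(salary)
    FROM employees
    WHERE salary > 100000
    GROUP BY dept_id
    ORDER BY dept_id
    """
).fetchall()

# HAVING filters whole groups after the aggregate has been computed.
having_rows = conn.execute(
    """
    SELECT dept_id, SUM(salary)
    FROM employees
    GROUP BY dept_id
    HAVING SUM(salary) > 100000
    ORDER BY dept_id
    """
).fetchall()

print(where_rows)
print(having_rows)
```

The WHERE version only sees the single 120000 salary, while the HAVING version reports both departments with their full totals.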
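If you want to return all rows without grouping them you can use analytic (window) functions, which need MySQL 8.0 or later. MySQL does not accept DISTINCT inside a window aggregate, so this version compares the minimum and maximum obj_type per payer instead; they are equal only when the payer uses a single obj_type:
SELECT * FROM (
SELECT obj_type, user_id,
payer_id,
MIN(obj_type) OVER (PARTITION BY payer_id) AS min_obj_type,
MAX(obj_type) OVER (PARTITION BY payer_id) AS max_obj_type
FROM new_table) t
WHERE min_obj_type = max_obj_type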
Or you can use IN with the grouped query from the start of this answer:
SELECT *
FROM new_table
WHERE payer_id IN (SELECT payer_id
FROM new_table
GROUP BY payer_id
HAVING COUNT(DISTINCT obj_type) = 1)
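To confirm that the IN variant returns the rows the question asks for, here is a short sketch on the question's sample data, again using Python's sqlite3 module as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_table (obj_type INTEGER, user_id TEXT, payer_id TEXT)")
conn.executemany(
    "INSERT INTO new_table (obj_type, user_id, payer_id) VALUES (?, ?, ?)",
    [
        (1, "user1", "payer1"), (1, "user2", "payer1"), (2, "user3", "payer1"),
        (1, "user1", "payer2"), (1, "user2", "payer2"), (2, "user3", "payer2"),
        (3, "user1", "payer3"), (3, "user2", "payer3"),
    ],
)

# Inner query: payers that use exactly one distinct obj_type.
# Outer query: every row belonging to those payers.
single_type_rows = conn.execute(
    """
    SELECT obj_type, user_id, payer_id
    FROM new_table
    WHERE payer_id IN (SELECT payer_id
                       FROM new_table
                       GROUP BY payer_id
                       HAVING COUNT(DISTINCT obj_type) = 1)
    ORDER BY user_id
    """
).fetchall()

print(single_type_rows)
```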
Hopefully I can describe this simply; if not, I'll try to make a table for it. Say I have a table that tracks all visits to my store by customer name. I log their Name and Purchase Amount (if any). I want to get a list of visitors who never buy anything. So if I have
VisitorName PurchaseAmount
Bob 10
Bob NULL
Mary NULL
Mary NULL
I want a query that returns Mary since all of her records have NULL in the PurchaseAmount
Create table/insert data
CREATE TABLE visits
(`VisitorName` VARCHAR(4), `PurchaseAmount` VARCHAR(4))
;
INSERT INTO visits
(`VisitorName`, `PurchaseAmount`)
VALUES
('Bob', '10'),
('Bob', NULL),
('Mary', NULL),
('Mary', NULL)
;
Query
Just GROUP BY VisitorName, with a HAVING clause that checks whether all of the visitor's records are NULL:
SELECT
visits.VisitorName
FROM
visits
GROUP BY
visits.VisitorName
HAVING
SUM(CASE
WHEN visits.PurchaseAmount IS NULL
THEN 1
END
) = COUNT(*)
Result
VisitorName
-------------
Mary
You could use NOT IN with a subselect that returns the VisitorName values that have a non-null PurchaseAmount:
select distinct visitorName from my_table
where visitorName not in ( select VisitorName
from my_table where PurchaseAmount is not null)
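Both approaches can be compared side by side on the sample visits. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; it uses the visits table name from the first answer for both queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (VisitorName TEXT, PurchaseAmount TEXT)")
conn.executemany(
    "INSERT INTO visits (VisitorName, PurchaseAmount) VALUES (?, ?)",
    [("Bob", "10"), ("Bob", None), ("Mary", None), ("Mary", None)],
)

# Approach 1: a visitor qualifies when the number of NULL purchases
# equals the total number of visits.
having_names = [
    row[0]
    for row in conn.execute(
        """
        SELECT VisitorName
        FROM visits
        GROUP BY VisitorName
        HAVING SUM(CASE WHEN PurchaseAmount IS NULL THEN 1 END) = COUNT(*)
        """
    )
]

# Approach 2: exclude every visitor that has at least one non-null purchase.
not_in_names = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT VisitorName
        FROM visits
        WHERE VisitorName NOT IN (SELECT VisitorName
                                  FROM visits
                                  WHERE PurchaseAmount IS NOT NULL)
        """
    )
]

print(having_names, not_in_names)
```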
Is it possible to select distinct company names from the customer table while also displaying the related IDs?
At the moment I'm using
SELECT company,id, COUNT(*) as count FROM customers GROUP BY company HAVING COUNT(*) > 1;
which returns
MyDuplicateCompany1 64 2
MyDuplicateCompany2 20 3
MyDuplicateCompany6 175 2
but what I'm after is all the duplicate ID's for each.
so
CompanyName, TimesDuplicated, DuplicateId1, DuplicateId2, DuplicateId3
or a row for each so
MyDuplicateCompany1, DuplicateId1, TimesDuplicated
MyDuplicateCompany1, DuplicateId2, TimesDuplicated
MyDuplicateCompany2, DuplicateId1, TimesDuplicated
MyDuplicateCompany2, DuplicateId2, TimesDuplicated
MyDuplicateCompany2, DuplicateId3, TimesDuplicated
is this possible?
Not sure if this would be acceptable, but MySQL has a function, GROUP_CONCAT(field), that combines the values from multiple rows of a group into one string, so each company can be shown once together with all the values of the specified column (ID in this case).
SELECT company
, COUNT(*) as count
, group_concat(ID) as DupCompanyIDs
FROM customers
GROUP BY company
HAVING COUNT(*) > 1;
The SQL Fiddle shows similar results, with duplicate companies listed in one field.
If you need it in multiple columns or multiple rows, you could wrap the above as an inline view and inner join it back to customers on the name to list the duplicates and times duplicated.
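The "one row per duplicate id" variant described above could look roughly like the sketch below. It runs on Python's sqlite3 module as a stand-in for MySQL, and the sample companies are made up for the illustration; the alias names d and cnt are also assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER NOT NULL, company TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO customers (id, company) VALUES (?, ?)",
    [
        (1, "MyDuplicateCompany1"), (2, "MyDuplicateCompany1"),
        (3, "MyDuplicateCompany2"), (4, "MyDuplicateCompany2"),
        (5, "MyDuplicateCompany2"), (6, "MyUniqueCompany"),
    ],
)

# Inline view d: companies that occur more than once, with their counts.
# Joining it back to customers lists every id of each duplicated company.
duplicate_rows = conn.execute(
    """
    SELECT c.company, c.id, d.cnt
    FROM customers c
    INNER JOIN (SELECT company, COUNT(*) AS cnt
                FROM customers
                GROUP BY company
                HAVING COUNT(*) > 1) d ON d.company = c.company
    ORDER BY c.company, c.id
    """
).fetchall()

for row in duplicate_rows:
    print(row)
```

The company that appears only once is dropped by the HAVING clause in the inline view, so it never reaches the join.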
You can use GROUP_CONCAT(id) to concatenate the ids, separated by commas. Your query should be:
SELECT company, GROUP_CONCAT(id) as ids, COUNT(id) as cant FROM customers GROUP BY company HAVING cant > 1
You can test the query with this
CREATE TABLE IF NOT EXISTS `customers` (
`id` int(11) NOT NULL,
`company` varchar(50) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `customers` (`id`, `company`) VALUES
(1, 'MyDuplicateCompany1'),
(2, 'MyDuplicateCompany1'),
(3, 'MyDuplicateCompany1'),
(4, 'MyDuplicateCompany2'),
(5, 'MyDuplicateCompany2'),
(6, 'MyDuplicateCompany3'),
(7, 'MyDuplicateCompany3'),
(8, 'MyDuplicateCompany3'),
(9, 'MyDuplicateCompany3'),
(10, 'MyDuplicateCompany4');
Read more at:
http://monksealsoftware.com/mysql-group_concat-and-postgres-array_agg/
You are not looking for companies with more than 1 entry (GROUP BY company), but for duplicate company IDs (GROUP BY company, id):
SELECT company, id, COUNT(*)
FROM customers
GROUP BY company, id
HAVING COUNT(*) > 1;
This should give exactly what you're looking for without GROUP_CONCAT()
SELECT
company, id,
( SELECT COUNT(*) from customers AS b
WHERE a.company = b.company
) AS cnt
FROM customers AS a
GROUP BY company, id
HAVING cnt > 1
;
Note: GROUP_CONCAT does the same thing, just all in one row per company.