MySQL - Merge rows in table based on multiple criteria

I'd like to merge rows based on multiple criteria, essentially removing duplicates where I get to define what "duplicate" means. Here is an example table:
╔═════╦═══════╦═════╦═══════╗
║ id* ║ name ║ age ║ grade ║
╠═════╬═══════╬═════╬═══════╣
║ 1 ║ John ║ 11 ║ 5 ║
║ 2 ║ John ║ 11 ║ 5 ║
║ 3 ║ John ║ 11 ║ 6 ║
║ 4 ║ Sam ║ 14 ║ 7 ║
║ 5 ║ Sam ║ 14 ║ 7 ║
╚═════╩═══════╩═════╩═══════╝
In my example, let's say I want to merge on name and age but ignore grade. The result should be:
╔═════╦═══════╦═════╦═══════╗
║ id* ║ name ║ age ║ grade ║
╠═════╬═══════╬═════╬═══════╣
║ 1 ║ John ║ 11 ║ 5 ║
║ 3 ║ John ║ 11 ║ 6 ║
║ 4 ║ Sam ║ 14 ║ 7 ║
╚═════╩═══════╩═════╩═══════╝
I don't particularly care if the id column is updated to be incremental, but I suppose that would be nice.
Can I do this in MySQL?

My suggestion, based on my above comment.
CREATE TABLE tempTable AS
SELECT DISTINCT name, age, grade
FROM theTable;
This ignores the IDs and writes only the distinct rows into a new table (MySQL has no SELECT ... INTO for tables, so CREATE TABLE ... AS SELECT does the job).
Then you can either drop the old table and rename the new one, or truncate the old one and load this back in.
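The copy-and-swap idea can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs without a server); the table names follow the answer, and the helper name copy_distinct is hypothetical. Note that the id column is dropped along the way.

```python
import sqlite3

# Sample rows from the question: (id, name, age, grade)
SAMPLE = [
    (1, "John", 11, 5),
    (2, "John", 11, 5),
    (3, "John", 11, 6),
    (4, "Sam", 14, 7),
    (5, "Sam", 14, 7),
]


def copy_distinct(rows):
    """Rebuild theTable from its distinct (name, age, grade) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE theTable ("
        "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, grade INTEGER)"
    )
    conn.executemany("INSERT INTO theTable VALUES (?, ?, ?, ?)", rows)
    # Copy the distinct rows, then swap the new table in for the old one.
    conn.executescript(
        """
        CREATE TABLE tempTable AS
            SELECT DISTINCT name, age, grade FROM theTable;
        DROP TABLE theTable;
        ALTER TABLE tempTable RENAME TO theTable;
        """
    )
    result = conn.execute(
        "SELECT name, age, grade FROM theTable ORDER BY name, age, grade"
    ).fetchall()
    conn.close()
    return result


print(copy_distinct(SAMPLE))
```

In MySQL the swap step would more likely be written with RENAME TABLE, and the new table would need its own auto-increment id column if ids are wanted back.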

You could just delete the duplicates in place like this:
delete test
from test
inner join (
select name, age, grade, min(id) as minid, count(*)
from test
group by name, age, grade
having count(*) > 1
) main on test.id = main.minid;
Example: http://sqlfiddle.com/#!9/f1a38/1
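The join above removes one row per duplicated group, namely the one with the lowest id. A variant that keeps the lowest id and removes every other copy, however many there are, might look like the sketch below. It runs against SQLite through Python's sqlite3 module purely for illustration; the helper name delete_duplicates is hypothetical. MySQL typically rejects a subquery that reads the table being deleted from, so there the inner SELECT would need to be wrapped in a derived table.

```python
import sqlite3

# Sample rows from the question: (id, name, age, grade)
SAMPLE = [
    (1, "John", 11, 5),
    (2, "John", 11, 5),
    (3, "John", 11, 6),
    (4, "Sam", 14, 7),
    (5, "Sam", 14, 7),
]


def delete_duplicates(rows):
    """Keep only the lowest-id row of each (name, age, grade) group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test ("
        "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, grade INTEGER)"
    )
    conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
    # Every id that is not the minimum of its group is a duplicate.
    conn.execute(
        """
        DELETE FROM test
        WHERE id NOT IN (
            SELECT MIN(id) FROM test GROUP BY name, age, grade
        )
        """
    )
    remaining = conn.execute(
        "SELECT id, name, age, grade FROM test ORDER BY id"
    ).fetchall()
    conn.close()
    return remaining


print(delete_duplicates(SAMPLE))
```

On the sample data this leaves ids 1, 3 and 4, which matches the result table in the question.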

Related

MySQL Query Grouped Conditional Count

Here is the mySQL table data:
╔════╦══════╦══════════╦══════════════╗
║ ID ║ USER ║ DATE ║ NUMDOWNLOADS ║
╠════╬══════╬══════════╬══════════════╣
║ 1 ║ John ║ xx-xx-xx ║ 1 ║
║ 2 ║ Mary ║ xx-xx-xx ║ 3 ║
║ 3 ║ John ║ xx-xx-xx ║ 5 ║
║ 4 ║ Mary ║ xx-xx-xx ║ 2 ║
║ 5 ║ Mary ║ xx-xx-xx ║ 6 ║
║ 6 ║ John ║ xx-xx-xx ║ 7 ║
║ 7 ║ John ║ xx-xx-xx ║ 1 ║
║ 8 ║ Mary ║ xx-xx-xx ║ 8 ║
║ 9 ║ Mary ║ xx-xx-xx ║ 9 ║
╚════╩══════╩══════════╩══════════════╝
What I want to accomplish is to group the data by USER, and display the total NUMDOWNLOADS per USER where NUMDOWNLOADS is > X. For example, if X=5:
John: 1 (since 1 NUMDOWNLOADS > 5, and others count collectively as 1)
Mary: 3 (since 3 NUMDOWNLOADS > 5, and others count collectively as 1)
So, (1) output per user, and (2) output total, which in this case would be 4. Clear as mud :) Ideas on statement to use?
Here is your query; try it:
SELECT USER, COUNT(NUMDOWNLOADS)
FROM table_name
WHERE NUMDOWNLOADS > 5
GROUP BY USER
SELECT USER, COUNT(NUMDOWNLOADS)
FROM downloads
WHERE NUMDOWNLOADS > 5
GROUP BY USER
Follow the link below for a running demo:
SQLFiddle
I think you just want to count records where NUMDOWNLOADS > 5:
select USER, count(*)
from myTable
where NUMDOWNLOADS > 5
group by USER
The WHERE filter is performed before any grouping is done, so first this query filters out any rows that do not match NUMDOWNLOADS > 5, then it groups by USER and counts.
Alternatively if there is something about your actual query that requires you to use a conditional sum, you can do so as well:
select USER, sum(case when NUMDOWNLOADS > 5 then 1 else 0 end)
from myTable
group by USER
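The difference between the two forms shows up for users who have no row above the threshold: the WHERE version drops them entirely, while the conditional sum reports them with 0. A minimal sketch, run on SQLite via Python's sqlite3 as a stand-in for MySQL; the table name downloads is taken from one of the answers, and the user "Ann" is a hypothetical extra row added only to show the difference.

```python
import sqlite3

# (id, user, numdownloads) from the question, plus one hypothetical user.
ROWS = [
    (1, "John", 1), (2, "Mary", 3), (3, "John", 5),
    (4, "Mary", 2), (5, "Mary", 6), (6, "John", 7),
    (7, "John", 1), (8, "Mary", 8), (9, "Mary", 9),
    (10, "Ann", 2),  # hypothetical: never exceeds the threshold
]


def count_downloads(threshold):
    """Return (where_filtered, conditional_sum) counts per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE downloads ("
        "id INTEGER PRIMARY KEY, user TEXT, numdownloads INTEGER)"
    )
    conn.executemany("INSERT INTO downloads VALUES (?, ?, ?)", ROWS)
    # Rows are filtered first, then grouped: users without a match vanish.
    filtered = conn.execute(
        "SELECT user, COUNT(*) FROM downloads "
        "WHERE numdownloads > ? GROUP BY user ORDER BY user",
        (threshold,),
    ).fetchall()
    # All rows are grouped; each row contributes 1 or 0 to the sum.
    conditional = conn.execute(
        "SELECT user, SUM(CASE WHEN numdownloads > ? THEN 1 ELSE 0 END) "
        "FROM downloads GROUP BY user ORDER BY user",
        (threshold,),
    ).fetchall()
    conn.close()
    return filtered, conditional


filtered, conditional = count_downloads(5)
print(filtered)
print(conditional)
print(sum(n for _, n in filtered))  # grand total across users
```

With X = 5 the per-user counts are John 1 and Mary 3, and the grand total is 4, as in the question.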

De-duplicating many-to-many relationships in MySQL lookup table

I've inherited a database that includes a lookup table to find other patents that are related to a given patent.
So it looks like
╔════╦═══════════╦════════════╗
║ id ║ patent_id ║ related_id ║
╠════╬═══════════╬════════════╣
║ 1 ║ 1 ║ 2 ║
║ 2 ║ 1 ║ 3 ║
║ 3 ║ 2 ║ 1 ║
║ 4 ║ 2 ║ 3 ║
║ 5 ║ 3 ║ 2 ║
╚════╩═══════════╩════════════╝
And I want to filter out the reciprocal relationships. 1->2 and 2->1 are the same for my purposes so I only want 1->2.
I don't need to make the edit in the table, I just need a query that returns a list of the unique relationships, and while I'm sure it's simple I've been banging my head against the keyboard for far too long.
Here is a clever query which you can try using. The general strategy is to identify the unwanted duplicate records and then subtract them away from the entire set.
SELECT t.id, t.patent_id, t.related_id
FROM t LEFT JOIN
(
SELECT t1.patent_id AS t1_patent_id, t1.related_id AS t1_related_id
FROM t t1 LEFT JOIN t t2
ON t1.related_id = t2.patent_id
WHERE t1.patent_id = t2.related_id AND t1.patent_id > t1.related_id
) t3
ON t.patent_id = t3.t1_patent_id AND t.related_id = t3.t1_related_id
WHERE t3.t1_patent_id IS NULL
Here is the inner temporary table generated by this query. You can convince yourself that by applying the logic in the WHERE clause you will select the correct records. Non-duplicate records are characterized by t1.patent_id != t2.related_id, and all these records are retained. In the case of duplicates (t1.patent_id = t2.related_id), the record chosen from each pair of duplicates is the one where patent_id < related_id, as you requested in your question.
╔════╦══════════════╦═══════════════╦══════════════╦═══════════════╗
║ id ║ t1.patent_id ║ t1.related_id ║ t2.patent_id ║ t2.related_id ║
╠════╬══════════════╬═══════════════╬══════════════╬═══════════════╣
║ 1 ║ 1 ║ 2 ║ 2 ║ 1 ║ * duplicate
║ 1 ║ 1 ║ 2 ║ 2 ║ 3 ║
║ 2 ║ 1 ║ 3 ║ 3 ║ 2 ║
║ 3 ║ 2 ║ 1 ║ 1 ║ 2 ║ * duplicate
║ 3 ║ 2 ║ 1 ║ 1 ║ 3 ║
║ 4 ║ 2 ║ 3 ║ 3 ║ 2 ║ * duplicate
║ 5 ║ 3 ║ 2 ║ 2 ║ 1 ║
║ 5 ║ 3 ║ 2 ║ 2 ║ 3 ║ * duplicate
╚════╩══════════════╩═══════════════╩══════════════╩═══════════════╝
Click the link below for a running example of this query.
SQLFiddle
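Since the fiddle link did not survive, the anti-join can be checked against the sample rows with a short sketch. It runs the query from this answer unchanged on SQLite through Python's sqlite3 (a stand-in for MySQL); the table name t follows the answer and the helper name unique_relationships is hypothetical.

```python
import sqlite3

# (id, patent_id, related_id) from the question
PAIRS = [(1, 1, 2), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2)]

QUERY = """
SELECT t.id, t.patent_id, t.related_id
FROM t LEFT JOIN
(
    -- rows whose mirror image exists and that are the "larger first" half
    SELECT t1.patent_id AS t1_patent_id, t1.related_id AS t1_related_id
    FROM t t1 LEFT JOIN t t2
        ON t1.related_id = t2.patent_id
    WHERE t1.patent_id = t2.related_id AND t1.patent_id > t1.related_id
) t3
    ON t.patent_id = t3.t1_patent_id AND t.related_id = t3.t1_related_id
WHERE t3.t1_patent_id IS NULL  -- keep only rows with no match in t3
ORDER BY t.id
"""


def unique_relationships(rows):
    """Return the rows of t with reciprocal duplicates filtered out."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t ("
        "id INTEGER PRIMARY KEY, patent_id INTEGER, related_id INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(unique_relationships(PAIRS))
```

The rows removed are 2->1 and 3->2, because 1->2 and 2->3 are present; 1->3 stays because it has no mirror image at all.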
Try something like
select distinct * from
(select patent_id, related_id from TABLENAME
union
select related_id, patent_id from TABLENAME
);
Okay you're right the above won't work. Try
select patent_id, related_id from TABLENAME p1
where p1.patent_id < p1.related_id
or not exists
(select 1 from TABLENAME p2
where p2.patent_id = p1.related_id and p2.related_id = p1.patent_id)

Groupwise-max on count() with nested subqueries

My head is turning to mush when trying to get this nesting around my head.
So basically I got 2 tables:
Brokers, which is my "user" table:
╔══════════╦════════════╗
║ ID ║ EMAIL ║
╠══════════╬════════════╣
║ 1 ║ 1#email.co ║
║ 2 ║ 2#email.co ║
║ 3 ║ 3#email.co ║
╚══════════╩════════════╝
Houses, which holds the houses that the users have added. Currently user and house are connected by the email column (I know, it makes more sense to do this with an ID):
╔══════════╦════════════╦════════════╗
║ ID ║ TYPE ║ EMAIL ║
╠══════════╬════════════╬════════════╣
║ 1 ║ 1 ║ 1#email.co ║
║ 2 ║ 3 ║ 1#email.co ║
║ 3 ║ 2 ║ 1#email.co ║
║ 4 ║ 3 ║ 1#email.co ║
║ 5 ║ 3 ║ 1#email.co ║
║ 6 ║ 2 ║ 1#email.co ║
║ 7 ║ 3 ║ 1#email.co ║
║ 8 ║ 1 ║ 2#email.co ║
║ 9 ║ 1 ║ 2#email.co ║
║ 10 ║ 2 ║ 2#email.co ║
║ 11 ║ 2 ║ 2#email.co ║
║ 12 ║ 3 ║ 2#email.co ║
║ 13 ║ 3 ║ 3#email.co ║
║ 14 ║ 2 ║ 3#email.co ║
║ 15 ║ 3 ║ 3#email.co ║
║ 16 ║ 1 ║ 3#email.co ║
║ 17 ║ 3 ║ 3#email.co ║
║ 18 ║ 2 ║ 3#email.co ║
║ 19 ║ 2 ║ 3#email.co ║
║ 20 ║ 3 ║ 3#email.co ║
╚══════════╩════════════╩════════════╝
Now what I want to do, is that I want to select all brokers that have type 3 as the highest, most popular kind of house added. So for example if house type 3 represents "Apartments", I want to find the brokers that sell apartments as their number one most popular type.
My current query is:
SELECT b.id, b.email, h.email, h.type, h.total
FROM brokers b
INNER JOIN (
SELECT COUNT( * ) AS total, email, type
FROM house
GROUP BY email, type
ORDER BY total DESC
)h ON b.email = h.email
AND h.type = "3"
ORDER BY b.id DESC
Now this only selects the total amount of houses that that broker has for type 3. It does not only select the brokers where type 3 is their most popular type.
Now to do that, I need to use what is called "Groupwise Max". But I can not use max() on a count(*) like:
MAX(COUNT(*)) as max_value
So I guess that what I need to do is to nest my query further with additional subqueries to first count, and then select the max value.
I've been trying to get it right for a while now and I just can't get my head around it. Anyone can help?
EDIT:
Expected Output:
Based on the table above, Broker 1#email.co got:
1 House with Type 1.
2 Houses with Type 2.
4 Houses with Type 3.
Broker 2#email.co got:
2 houses with Type 1
2 houses with Type 2
1 house with Type 3.
Broker 3#email.co got:
1 house with Type 1.
3 houses with Type 2.
4 houses with Type 3.
Since both 1#email.co and 3#email.co sell house type 3 most commonly, they should be included in the output. 2#email.co does not sell type 3 as their most popular type, so they should not be included in the result.
So output:
╔══════════╦════════════╦════════════╗
║ ID ║ EMAIL ║ Total ║
╠══════════╬════════════╬════════════╣
║ 1 ║ 1#email.co ║ 4 ║
║ 3 ║ 3#email.co ║ 4 ║
╚══════════╩════════════╩════════════╝
Posting answer without executing, hope this works!
Select a.ID,a.Email,b.Cnt from Brokers as a
inner join (
Select Email,count(ID) as Cnt from Houses where Type =
(Select max(Type) from Houses)
group by Email
) as b on a.Email = b.Email
I can't understand why you need Count(). According to your question ("select all brokers that have type 3"), it doesn't make sense, or do I misunderstand something?
EDIT:
I have done it in SQL Server with a temporary table and a variable.
If you can convert it to MySQL syntax, your problem will be solved:
SELECT COUNT(*) as total, Email, [Type]
into #tbl3
from house
group by Email, Type
declare @a int
set @a = (select MAX(total) from #tbl3)
SELECT b.id, b.email, h.email, h.type, h.total
FROM brokers b
inner join
(
select * from #tbl3
where total=@a
) h
on h.Email=b.Email and h.Type=3
EDIT: This is the MySQL syntax, which should do the job.
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS (
SELECT COUNT(*) as total, Email, Type
from house
group by Email, Type
);
set @a = (select MAX(total) from table2);
SELECT b.id, b.email, h.email, h.type, h.total
FROM brokers b
inner join
(
select * from table2
where total=@a
) h
on h.Email=b.Email and h.Type=3
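The query above compares against a single global maximum, which fits the sample data because both qualifying brokers happen to have 4 houses. A per-broker maximum, which is what "groupwise max" asks for, could be sketched as follows: count per (email, type), take the maximum count per email, then keep type 3 only where its count equals that maximum. The sketch runs on SQLite through Python's sqlite3 as a stand-in for MySQL; the table names brokers and house follow the question's query, and the helper name brokers_with_top_type is hypothetical.

```python
import sqlite3

BROKERS = [(1, "1#email.co"), (2, "2#email.co"), (3, "3#email.co")]

# House types per broker, in id order, as listed in the question.
TYPES = {
    "1#email.co": [1, 3, 2, 3, 3, 2, 3],
    "2#email.co": [1, 1, 2, 2, 3],
    "3#email.co": [3, 2, 3, 1, 3, 2, 2, 3],
}

QUERY = """
SELECT b.id, b.email, c.total
FROM brokers b
JOIN (
    -- step 1: houses per broker and type
    SELECT email, type, COUNT(*) AS total
    FROM house GROUP BY email, type
) c ON c.email = b.email
JOIN (
    -- step 2: the highest of those counts for each broker
    SELECT email, MAX(total) AS max_total
    FROM (
        SELECT email, type, COUNT(*) AS total
        FROM house GROUP BY email, type
    ) x
    GROUP BY email
) m ON m.email = c.email AND m.max_total = c.total
WHERE c.type = ?
ORDER BY b.id
"""


def brokers_with_top_type(house_type):
    """Brokers whose most frequent house type is house_type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE brokers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute(
        "CREATE TABLE house (id INTEGER PRIMARY KEY, type INTEGER, email TEXT)"
    )
    conn.executemany("INSERT INTO brokers VALUES (?, ?)", BROKERS)
    houses = [(t, email) for email, types in TYPES.items() for t in types]
    conn.executemany("INSERT INTO house (type, email) VALUES (?, ?)", houses)
    result = conn.execute(QUERY, (house_type,)).fetchall()
    conn.close()
    return result


print(brokers_with_top_type(3))
```

A broker whose type 3 count ties with another type for first place is included by this sketch; whether that is desired depends on how ties should be treated.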

Update a table by inserting a count of foreign key from another table

I have two tables:
╔════════════════╗ ╔════════════════╗
║ ITEM ║ ║ ITEM_TRACK ║
╠════════════════╣ ╠════════════════╣
║ ID ║ ║ ID ║
║ GUID ║ ║ ITEM_GUID ║
║ COUNT1 ║ ║ CONTEXT ║
║ ENDDATE ║ ║ ║
╚════════════════╝ ╚════════════════╝
╔═════╦══════╦════════╗ ╔═════╦═══════════╦══════════╗
║ ID ║ GUID ║ COUNT1 ║ ║ ID ║ ITEM_GUID ║ CONTEXT ║
╠═════╬══════╬════════╣ ╠═════╬═══════════╬══════════╣
║ 1 ║ aaa ║ ║ ║ 1 ║ abc ║ ITEM ║
║ 2 ║ bbb ║ ║ ║ 2 ║ aaa ║ PAGE ║
║ 3 ║ ccc ║ ║ ║ 3 ║ bbb ║ ITEM ║
║ 4 ║ abc ║ ║ ║ 4 ║ ccc ║ ITEM ║
╚═════╩══════╩════════╝ ║ 5 ║ abc ║ ITEM ║
║ 6 ║ aaa ║ ITEM ║
║ 7 ║ abc ║ ITEM ║
║ 8 ║ ccc ║ PAGE ║
╚═════╩═══════════╩══════════╝
What I'm trying to do is fill in the COUNT1 column in ITEM with the count of the number of times ITEM_GUID appears in ITEM_TRACK for all ITEM.GUIDs where ENDDATE is still in the future. I need to do this once an hour for all GUIDS in ITEM.
I can get the counts I need easily
SELECT ITEM_GUID, COUNT(*) from ITEM_TRACK GROUP BY ITEM_GUID;
What I don't know how to do is, how do I merge this with an INSERT INTO statement to automatically update all the items in the items table with the count based on their ENDDATE?
UPDATE:
I have a working solution based on Aquillo's answer:
UPDATE ITEM a
SET COUNT1 = (SELECT COUNT(*) AS total FROM ITEM_TRACK b WHERE b.item_guid=a.guid);
Is there any other way to do this without a subquery?
You can insert from a select like this:
INSERT INTO myTable (foreignKey, countColumn)
SELECT ITEM_GUID, COUNT(*) from ITEM_TRACK GROUP BY ITEM_GUID;
In case you want to update, try something like this:
UPDATE from SELECT using SQL Server
If you use INSERT INTO you'll put additional rows in your ITEM table, not update the existing ones. If this is what you meant then that's great, but if you want to update the existing ones, you'll need to use update. You do this by joining the table you want to update with the table you want to update from. However, in your case you want to update from an aggregation and so you need to create a table with the aggregated values. Try this:
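UPDATE ITEM SET Count1 = temp.total
FROM Item
INNER JOIN (
SELECT ITEM_GUID, COUNT(*) AS total
FROM ITEM_TRACK
GROUP BY ITEM_GUID) AS temp
ON Item.GUID = temp.ITEM_GUID
WHERE ENDDATE > NOW()
I've tried this on SQL Server (using GETDATE() instead of NOW()) to double check and it worked. MySQL does not accept UPDATE ... FROM, so there the join has to follow UPDATE directly, with SET after the join: UPDATE ITEM INNER JOIN (...) AS temp ON ITEM.GUID = temp.ITEM_GUID SET ITEM.COUNT1 = temp.total WHERE ITEM.ENDDATE > NOW().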
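For a quick check of the hourly refresh, here is a minimal sketch of the correlated-subquery form from the question's update, with the ENDDATE condition added. It runs on SQLite through Python's sqlite3 as a stand-in for MySQL. The ENDDATE values are hypothetical (the question does not list any), the current time is passed in as a parameter instead of NOW() so the result is reproducible, and the helper name refresh_counts is made up for the example.

```python
import sqlite3

# (id, guid, enddate); the dates are hypothetical, "bbb" is already expired.
ITEMS = [
    (1, "aaa", "2999-01-01"),
    (2, "bbb", "2000-01-01"),
    (3, "ccc", "2999-01-01"),
    (4, "abc", "2999-01-01"),
]

# (id, item_guid, context) from the question
TRACKS = [
    (1, "abc", "ITEM"), (2, "aaa", "PAGE"), (3, "bbb", "ITEM"),
    (4, "ccc", "ITEM"), (5, "abc", "ITEM"), (6, "aaa", "ITEM"),
    (7, "abc", "ITEM"), (8, "ccc", "PAGE"),
]


def refresh_counts(now):
    """Fill count1 for every item whose enddate is after `now`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item ("
        "id INTEGER PRIMARY KEY, guid TEXT, count1 INTEGER, enddate TEXT)"
    )
    conn.execute(
        "CREATE TABLE item_track ("
        "id INTEGER PRIMARY KEY, item_guid TEXT, context TEXT)"
    )
    conn.executemany(
        "INSERT INTO item (id, guid, enddate) VALUES (?, ?, ?)", ITEMS
    )
    conn.executemany("INSERT INTO item_track VALUES (?, ?, ?)", TRACKS)
    # ISO-formatted date strings compare correctly as text.
    conn.execute(
        """
        UPDATE item
        SET count1 = (
            SELECT COUNT(*) FROM item_track
            WHERE item_track.item_guid = item.guid
        )
        WHERE enddate > ?
        """,
        (now,),
    )
    result = conn.execute("SELECT guid, count1 FROM item ORDER BY id").fetchall()
    conn.close()
    return result


print(refresh_counts("2024-01-01"))
```

Expired items keep whatever count1 they had before (NULL here), and an item with no tracking rows gets 0 from the subquery, whereas the inner-join form would leave it untouched.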

How to merge two different field values into one row?

I need to clean some data by merging two similar but slightly different dimension field values into one new row that adds together the two metric values, keeping the uid and date intact.
Current setup looks like this:
╔═════╦═════════════╦══════╦═══════════╦═══════════╗
║ id ║ date ║ uid ║ source ║ pageviews ║
╠═════╬═════════════╬══════╬═══════════╬═══════════╣
║ 1 ║ 2013-12-11 ║ 111 ║ source1 ║ 14 ║
║ 3 ║ 2013-12-11 ║ 111 ║ source1a ║ 1 ║
║ 11 ║ 2013-12-11 ║ 222 ║ source1 ║ 3 ║
║ 19 ║ 2013-12-11 ║ 222 ║ source1a ║ 11 ║
╚═════╩═════════════╩══════╩═══════════╩═══════════╝
I'd like to consider source1 and source1a to be equal and merge the two, to get this:
╔═════╦═════════════╦══════╦══════════╦═══════════╗
║ id ║ date ║ uid ║ source ║ pageviews ║
╠═════╬═════════════╬══════╬══════════╬═══════════╣
║ 1 ║ 2013-12-11 ║ 111 ║ source1 ║ 15 ║
║ 2 ║ 2013-12-11 ║ 222 ║ source1 ║ 14 ║
╚═════╩═════════════╩══════╩══════════╩═══════════╝
The id is not important; I had planned to re-increment the id in the new table that results.
This is what I tried, but it didn't merge the two records – I am getting matching values but still separate rows:
SELECT date, uid, (SELECT CASE
WHEN source = 'source1a' THEN 'source1'
ELSE source
END) AS 'source', pageviews
FROM trafficSourceMedium
GROUP BY date, source, userid
An aggregation query should do what you want:
select `date`, uid,
(case when source = 'source1a' then 'source1' else source end) as source,
sum(pageviews) as pageviews
from trafficSourceMedium
group by `date`, uid,
(case when source = 'source1a' then 'source1' else source end);
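The key point is that the CASE expression appears in GROUP BY as well, so both spellings fall into the same group before the sum is taken. A small sketch that runs the aggregation on the sample rows, using SQLite through Python's sqlite3 as a stand-in for MySQL; the table name follows the question and the helper name merge_sources is hypothetical.

```python
import sqlite3

# (id, date, uid, source, pageviews) from the question
ROWS = [
    (1, "2013-12-11", 111, "source1", 14),
    (3, "2013-12-11", 111, "source1a", 1),
    (11, "2013-12-11", 222, "source1", 3),
    (19, "2013-12-11", 222, "source1a", 11),
]

QUERY = """
SELECT date, uid,
       CASE WHEN source = 'source1a' THEN 'source1' ELSE source END AS source,
       SUM(pageviews) AS pageviews
FROM trafficSourceMedium
GROUP BY date, uid,
         CASE WHEN source = 'source1a' THEN 'source1' ELSE source END
ORDER BY uid
"""


def merge_sources(rows):
    """Sum pageviews per (date, uid, normalized source)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trafficSourceMedium ("
        "id INTEGER PRIMARY KEY, date TEXT, uid INTEGER, "
        "source TEXT, pageviews INTEGER)"
    )
    conn.executemany(
        "INSERT INTO trafficSourceMedium VALUES (?, ?, ?, ?, ?)", rows
    )
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(merge_sources(ROWS))
```

The two output rows carry 15 and 14 pageviews, matching the target table in the question; fresh ids can be assigned when the result is inserted into the new table.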