Removing duplicates with aggregated values that cancel out - duplicates

I am trying to remove rows with duplicate serial numbers, but only when the bill amounts for that serial number cancel each other out. For example, suppose there are five rows with the serial number "abc-321" and five different bill amounts ($500, $250, $-250, $30, $-30). The four rows to remove would be $250, $-250, $30, and $-30, because those amounts sum to 0.
I was only able to come up with a query that identifies the duplicate serial numbers, but I'm not sure how to aggregate the amounts so that they cancel out.
SELECT a.Serial, a.BillAmt, a.Date, a.Code
FROM TableA a
WHERE (a.Serial) in
(SELECT Serial
FROM TableA a
GROUP BY Serial
HAVING COUNT(Serial)>1)
GROUP BY a.Serial, a.BillAmt, a.Date, a.Code
ORDER BY a.Serial ASC;
Sample Output:
+--------+---------+----------+------+
|Serial | BillAmt | Date | Code |
+--------+---------+----------+------+
|abc-112 | $240 | 20200720 | MPO |
|abc-112 | -$400 | 20200527 | CPP |
|acc-130 | $300 | 20200515 | CPP |
|acc-130 | $300 | 20200420 | DUB |
|acc-130 | -$300 | 20200515 | CPP |
|bcc-111 | $500 | 20200701 | MPO |
|bcc-111 | -$500 | 20200701 | MPO |
|caa-321 | $700 | 20200805 | DUB |
|caa-321 | $700 | 20200805 | MPO |
+--------+---------+----------+------+
Desired Results:
+--------+---------+----------+------+
|Serial | BillAmt | Date | Code |
+--------+---------+----------+------+
|abc-112 | $240 | 20200720 | MPO |
|abc-112 | -$400 | 20200527 | CPP |
|acc-130 | $300 | 20200420 | DUB |
|caa-321 | $700 | 20200805 | DUB |
|caa-321 | $700 | 20200805 | MPO |
+--------+---------+----------+------+

You are trying to compare the data to itself. You can do this with a JOIN from TableA to TableA.
Here's what worked for me, with some test data. I used an OUTER JOIN (specifically a LEFT JOIN), with one of the JOIN conditions being that the BillAmt values cancel. Then in the WHERE, I look for rows where the JOIN found no match, which is where the right-hand side is NULL.
CREATE TABLE TableA (
Serial VARCHAR(10),
BillAmt DOUBLE,
`Date` DATE,
Code INT
);
INSERT INTO TableA (Serial, BillAmt, Date, Code) VALUES
('abc-321',500,'2021-06-01',1),
('abc-321',250,'2021-06-02',1),
('abc-321',-250,'2021-06-03',1),
('abc-321',50,'2021-06-04',1),
('abc-321',-50,'2021-06-05',1),
('abc-321',123,'2021-06-06',1);
SELECT a.Serial, a.BillAmt, a.Date, a.Code
FROM TableA AS a
LEFT JOIN TableA AS b
ON a.Serial = b.Serial
AND a.BillAmt = (b.BillAmt * -1)
WHERE b.Serial IS NULL;
It gives this output:
+---------+---------+------------+------+
| Serial | BillAmt | Date | Code |
+---------+---------+------------+------+
| abc-321 | 500 | 2021-06-01 | 1 |
| abc-321 | 123 | 2021-06-06 | 1 |
+---------+---------+------------+------+
2 rows in set (0.00 sec)
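One thing worth checking before relying on this anti-join: it drops every row that has any opposite amount under the same Serial, so it does not pair rows off one-to-one. A minimal sketch, assuming an in-memory SQLite database (via Python's sqlite3) in place of MySQL and amounts stored as plain integers, runs the same LEFT JOIN on the question's sample rows. For acc-130 ($300, $300, -$300) all three rows disappear, while the desired output keeps one $300 row.

```python
import sqlite3

# Sample rows from the question; amounts are stored as integers (an assumption).
ROWS = [
    ("abc-112", 240, "20200720", "MPO"),
    ("abc-112", -400, "20200527", "CPP"),
    ("acc-130", 300, "20200515", "CPP"),
    ("acc-130", 300, "20200420", "DUB"),
    ("acc-130", -300, "20200515", "CPP"),
    ("bcc-111", 500, "20200701", "MPO"),
    ("bcc-111", -500, "20200701", "MPO"),
    ("caa-321", 700, "20200805", "DUB"),
    ("caa-321", 700, "20200805", "MPO"),
]

# Same anti-join as in the answer, with an ORDER BY added for stable output.
ANTI_JOIN = """
SELECT a.Serial, a.BillAmt, a.Date, a.Code
FROM TableA AS a
LEFT JOIN TableA AS b
  ON a.Serial = b.Serial
 AND a.BillAmt = -b.BillAmt
WHERE b.Serial IS NULL
ORDER BY a.Serial, a.Date, a.Code
"""


def run_anti_join(rows):
    """Load rows into a throwaway SQLite table and return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE TableA (Serial TEXT, BillAmt INTEGER, Date TEXT, Code TEXT)"
    )
    conn.executemany("INSERT INTO TableA VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(ANTI_JOIN).fetchall()
    conn.close()
    return result


kept = run_anti_join(ROWS)
for row in kept:
    print(row)
```

If every positive amount has exactly one matching negative, the anti-join is enough. With unequal counts, a per-row pairing is needed.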

This solution gives you the result without a self-join, in a single pass over the table.
We count the positive rows per Serial and per absolute value of BillAmt (ignoring the sign).
We do the same for the negative rows.
We calculate a row number over the same grouping, per sign.
We keep a row only if BillAmt is positive and its row number <= positives - negatives, or BillAmt is negative and its row number <= negatives - positives.
SELECT a.Serial, a.BillAmt, a.Date, a.Code
FROM (
SELECT a.*,
positives = COUNT(CASE WHEN a.BillAmt > 0 THEN 1 END) OVER (PARTITION BY a.Serial, ABS(a.BillAmt)),
negatives = COUNT(CASE WHEN a.BillAmt < 0 THEN 1 END) OVER (PARTITION BY a.Serial, ABS(a.BillAmt)),
rn = ROW_NUMBER() OVER (PARTITION BY a.Serial, ABS(a.BillAmt), SIGN(a.BillAmt) ORDER BY a.Date)
FROM TableA a
) a
WHERE (a.BillAmt > 0 AND rn <= positives - negatives)
OR (a.BillAmt < 0 AND rn <= negatives - positives)
ORDER BY a.Serial ASC;
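To make the pairing rule easier to follow, here is the same counting logic written as a plain Python sketch rather than SQL. The function name drop_cancelled and the tuple layout are made up for illustration; the rows are the question's sample data, with amounts as integers.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (serial, amount, date, code).
ROWS = [
    ("abc-112", 240, "20200720", "MPO"),
    ("abc-112", -400, "20200527", "CPP"),
    ("acc-130", 300, "20200515", "CPP"),
    ("acc-130", 300, "20200420", "DUB"),
    ("acc-130", -300, "20200515", "CPP"),
    ("bcc-111", 500, "20200701", "MPO"),
    ("bcc-111", -500, "20200701", "MPO"),
    ("caa-321", 700, "20200805", "DUB"),
    ("caa-321", 700, "20200805", "MPO"),
]


def drop_cancelled(rows):
    """Keep only rows that are not paired off with an opposite amount."""
    # Count positives and negatives per (serial, |amount|).
    positives = Counter()
    negatives = Counter()
    for serial, amount, _date, _code in rows:
        key = (serial, abs(amount))
        if amount > 0:
            positives[key] += 1
        elif amount < 0:
            negatives[key] += 1

    # Row number per (serial, |amount|, sign), ordered by date.
    row_number = defaultdict(int)
    kept = []
    for row in sorted(rows, key=lambda r: (r[0], r[2])):
        serial, amount, _date, _code = row
        if amount == 0:
            continue  # the SQL filter also drops zero amounts
        key = (serial, abs(amount))
        row_number[(key, amount > 0)] += 1
        rn = row_number[(key, amount > 0)]
        if amount > 0:
            surplus = positives[key] - negatives[key]
        else:
            surplus = negatives[key] - positives[key]
        # Only the first `surplus` rows of the majority sign survive.
        if rn <= surplus:
            kept.append(row)
    return kept


result = drop_cancelled(ROWS)
for row in result:
    print(row)
```

On the sample data this keeps the earliest acc-130 $300 row (20200420, DUB), which agrees with the desired results in the question.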

Related

WHERE-clause for last non-null value

I'm making a community database for a school project and have run into an issue. I am attempting to integrate a log system that retrieves the latest non-null value from a column in a table called logs and presents that information on a different page. My current code (without any attempt at filtering by rank) looks as follows:
SELECT m.MemberID, m.MemberName, o.OfficeID, o.OfficeDesignation, p.PositionAbbreviation, l.LogRank
FROM logs l
INNER JOIN (SELECT l.LogMember, MAX(l.LogDate) AS maxLogDate FROM logs l GROUP BY l.LogMember) l2
ON (l.LogMember = l2.LogMember AND l.LogDate = l2.maxLogDate)
INNER JOIN members m
ON (l.LogMember = m.MemberID)
INNER JOIN offices o
ON (l.LogOffice = o.OfficeID)
INNER JOIN positions p
ON (l.LogPosition = p.PositionID)
GROUP BY m.MemberID;
The above query returns the latest entry for each member in the logs table, but I can't figure out how to fall back to the latest non-null value for l.LogRank when that latest entry's l.LogRank is NULL.
I have struggled with a variety of approaches to this problem over the past week to no avail. Any help/pointers would be appreciated.
EDIT: Sample data seen below:
+-------+-----------+---------+-----------+-------------+-----------+
| LogID | LogMember | LogRank | LogOffice | LogPosition | LogDate |
+-------+-----------+---------+-----------+-------------+-----------+
| 1 | 1 | 1 | 7 | 5 | TIMESTAMP |
+-------+-----------+---------+-----------+-------------+-----------+
| 2 | 1 | 1 | | 1 | TIMESTAMP |
+-------+-----------+---------+-----------+-------------+-----------+
| 3 | 1 | | 1 | | TIMESTAMP |
+-------+-----------+---------+-----------+-------------+-----------+
The various INT values reference IDs in the relevant other tables.
Desired Output:
+-------+-----------+---------+-------------------+-----------------+-----------+
| LogID | LogMember | LogRank | OfficeDesignation | PositionAbbrev. | LogDate |
+-------+-----------+---------+-------------------+-----------------+-----------+
| 1 | 1 | 1 | C-6 | CO | TIMESTAMP |
+-------+-----------+---------+-------------------+-----------------+-----------+
So basically I want to retrieve the MemberID, MemberName, OfficeID, OfficeDesignation, PositionAbbreviation, and LogRank from a variety of tables and get the latest non-null record for each column.
INNER JOIN (
SELECT l.LogMember, l.LogRank FROM logs l
WHERE l.LogDate=(
SELECT LogDate FROM logs
WHERE LogMember=l.LogMember
AND LogRank IS NOT NULL
ORDER BY LogDate DESC
LIMIT 1
)
) l3
ON (l.LogMember = l3.LogMember)
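For reference, the same "latest non-null value" idea can be applied to each column separately with one correlated subquery per column. This is only a sketch under assumptions: it uses an in-memory SQLite database through Python's sqlite3 rather than MySQL, the dates are invented because the question only shows TIMESTAMP placeholders, and the joins to members, offices, and positions are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (
    LogID INTEGER PRIMARY KEY,
    LogMember INTEGER,
    LogRank INTEGER,
    LogOffice INTEGER,
    LogPosition INTEGER,
    LogDate TEXT
);
-- Dates are hypothetical; the question does not give real timestamps.
INSERT INTO logs VALUES
    (1, 1, 1,    7,    5,    '2020-01-01 10:00:00'),
    (2, 1, 1,    NULL, 1,    '2020-01-02 10:00:00'),
    (3, 1, NULL, 1,    NULL, '2020-01-03 10:00:00');
""")

# One correlated subquery per column: newest row where that column is not NULL.
LATEST_NON_NULL = """
SELECT m.LogMember,
       (SELECT l.LogRank FROM logs l
         WHERE l.LogMember = m.LogMember AND l.LogRank IS NOT NULL
         ORDER BY l.LogDate DESC LIMIT 1) AS LogRank,
       (SELECT l.LogOffice FROM logs l
         WHERE l.LogMember = m.LogMember AND l.LogOffice IS NOT NULL
         ORDER BY l.LogDate DESC LIMIT 1) AS LogOffice,
       (SELECT l.LogPosition FROM logs l
         WHERE l.LogMember = m.LogMember AND l.LogPosition IS NOT NULL
         ORDER BY l.LogDate DESC LIMIT 1) AS LogPosition
FROM (SELECT DISTINCT LogMember FROM logs) m
ORDER BY m.LogMember
"""

latest = conn.execute(LATEST_NON_NULL).fetchall()
print(latest)
```

With the sample rows, LogRank and LogPosition come from log 2 and LogOffice comes from log 3, since log 3 has NULL in the other two columns.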

Bring all data from a table with joins with where clause that may not exist in the other table

I'm having a hard time setting up a query(select). Database is not my specialty, so I'm turning to the experts. Let me show what I need.
----companies----      ---company_server---       ----servers-----      -----------------print------------------
| id | name     |      | company | server |       | id | name    |      | id | page | copy | date      | server |
|----|----------|      |---------|--------|       |----|---------|      |----|------|------|-----------|--------|
| 1  | Company1 | 1--N | 1       | 1      | N*--1 | 1  | Server1 | 1--N | 1  | 2    | 3    | 2020-1-11 | 1      |
| 2  | Company2 |      | 2       | 1      |       | 2  | Server2 |      | 2  | 1    | 6    | 2020-1-12 | 3      |
| 3  | Company3 |      | 3       | 2      |       | 3  | Server3 |      | 3  | 4    | 5    | 2020-1-13 | 4      |
                       | 3       | 3      |       | 4  | Server4 |      | 4  | 5    | 3    | 2020-1-15 | 2      |
                                                                        | 5  | 3    | 4    | 2020-1-15 | 4      |
                                                                        | 6  | 1    | 2    | 2020-1-16 | 3      |
                                                                        | 7  | 2    | 2    | 2020-1-16 | 4      |
What I need?
Example where date between CAST(2020-1-12 AS DATE) AND CAST(2020-1-15 AS DATE) group by servers.id
| companies | server | sum | percent
------------------------------------------------------------------------------------
| company1,company2 | server1 | sum(page*copy) = 0 or null | 0 or NULL
| company3 | server2 | sum(page*copy) = 15 | 28.30
| company3 | server3 | sum(page*copy) = 6 | 11.32
| NULL | server4 | sum(page*copy) = 32 | 60.38
A few notes:
I need this query for MySQL.
Every company is linked to at least one server.
I need the result grouped by server, so the companies linked to each server must be concatenated with commas.
If no company has been registered for a server yet, NULL should be shown.
The sum(page * copy) must be shown as zero or NULL (I don't care which) when there was no printing in the date range.
The percentage should be calculated over the date range entered, not over all records in the database.
The date field is stored as a MySQL DATE.
Experts, I thank you in advance for your help. I currently solve this problem with at least 3 queries to the database, but I am convinced that I could do it with just one.
I've added a fiddle. Sorry, I'm still learning how to use this.
https://www.db-fiddle.com/f/dXej7QCPe9iDopfYd1SfVh/2
Below is the query that more or less represents how far I got. Notice that 'server4' disappeared along the way, because print has no rows for it in the period searched. I also have the total for the period, but I cannot calculate the percentage.
I'm stuck.
select
*
from
(select
sum(p.copy * p.page) as sum1,
s.name as s_name,
s.id as s_id
from
print p
join servers s on s.id = p.server
where p.date between cast('2020-1-12' as date) and cast('2020-1-15' as date)
group by s.id) as t1
join company_server cs on cs.server = t1.s_id
right join companies c on c.id = cs.company
cross join(
select
sum(p1.copy * p1.page) sum2
from
print p1
where p1.date between cast('2020-1-12' as date) and cast('2020-1-15' as date)
) as c;
I wrote this query before you added the fiddle, so my column names may not match yours exactly. Anyway, this is my solution; I hope it helps.
select group_concat(c.name separator ',') as name_company,
ss.name,
tmp.sum_print as sum,
(tmp.sum_print / tmp2.total) * 100 as percentage
from companies c
inner join company_server cs on c.id = cs.company
-- keep every server, even one with no linked company
right join servers ss on ss.id = cs.server
left join
(
-- pages printed per server within the date range
select server, sum(page*copy) as sum_print from print
where date between CAST('2020-1-12' AS DATE) AND CAST('2020-1-15' AS DATE)
group by server
) tmp on tmp.server = ss.id
cross join
-- grand total for the same date range, used for the percentage
(select sum(page*copy) as total from print where date between CAST('2020-1-12' AS DATE) AND CAST('2020-1-15' AS DATE)) tmp2
group by ss.id, ss.name, tmp.sum_print, tmp2.total
Group by server and concatenate the company names with commas using GROUP_CONCAT.
You can refer to this image for the different JOIN types.
https://i.stack.imgur.com/6cioZ.png
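As a cross-check of the expected numbers in the question, here is a sketch that starts from servers and LEFT JOINs the per-server sum, so Server4 stays in the result. Assumptions: it runs on an in-memory SQLite database via Python's sqlite3 instead of MySQL, dates are written as zero-padded ISO strings so that BETWEEN compares correctly on text, and the parameter names date_from and date_to are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE servers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE company_server (company INTEGER, server INTEGER);
CREATE TABLE print (id INTEGER PRIMARY KEY, page INTEGER, copy INTEGER,
                    date TEXT, server INTEGER);
INSERT INTO companies VALUES (1,'Company1'), (2,'Company2'), (3,'Company3');
INSERT INTO servers VALUES (1,'Server1'), (2,'Server2'), (3,'Server3'), (4,'Server4');
INSERT INTO company_server VALUES (1,1), (2,1), (3,2), (3,3);
INSERT INTO print VALUES
    (1, 2, 3, '2020-01-11', 1), (2, 1, 6, '2020-01-12', 3),
    (3, 4, 5, '2020-01-13', 4), (4, 5, 3, '2020-01-15', 2),
    (5, 3, 4, '2020-01-15', 4), (6, 1, 2, '2020-01-16', 3),
    (7, 2, 2, '2020-01-16', 4);
""")

REPORT = """
SELECT s.name,
       -- company list per server; NULL when no company is linked
       (SELECT group_concat(c.name, ',')
          FROM company_server cs
          JOIN companies c ON c.id = cs.company
         WHERE cs.server = s.id) AS companies,
       t.sum_print,
       ROUND(100.0 * t.sum_print / tot.total, 2) AS percent
FROM servers s
LEFT JOIN (SELECT server, SUM(page * copy) AS sum_print
             FROM print
            WHERE date BETWEEN :date_from AND :date_to
            GROUP BY server) t ON t.server = s.id
CROSS JOIN (SELECT SUM(page * copy) AS total
              FROM print
             WHERE date BETWEEN :date_from AND :date_to) tot
ORDER BY s.id
"""

report = conn.execute(
    REPORT, {"date_from": "2020-01-12", "date_to": "2020-01-15"}
).fetchall()
for row in report:
    print(row)
```

Computing the company list in a correlated subquery avoids multiplying the print sums when a server has several companies. In MySQL the equivalent call would be GROUP_CONCAT(c.name SEPARATOR ',').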

MySQL - Return Latest Date and Total Sum from two rows in a column for multiple entries

For every ID_Number, there is a bill_date and then two types of bills that happen. I want to return the latest date (max date) for each ID number and then add together the two types of bill amounts. So, based on the table below, it should return:
| 1 | 201604 | 10.00 | |
| 2 | 201701 | 28.00 | |
tbl_charges
+-----------+-----------+-----------+--------+
| ID_Number | Bill_Date | Bill_Type | Amount |
+-----------+-----------+-----------+--------+
| 1 | 201601 | A | 5.00 |
| 1 | 201601 | B | 7.00 |
| 1 | 201604 | A | 4.00 |
| 1 | 201604 | B | 6.00 |
| 2 | 201701 | A | 15.00 |
| 2 | 201701 | B | 13.00 |
+-----------+-----------+-----------+--------+
Then, if possible, I want to be able to do this in a join in another query, using ID_Number as the column for the join. Would that change the query here?
Note: I am initially only wanting to run the query for about 200 distinct ID_Numbers out of about 10 million. I will be adding an 'IN' clause for those IDs. When I do the join for the final product, I will need to know how to get those latest dates out of all the other join possibilities. (ie, how do I get ID_Number 1 to join with 201604 and not 201601?)
I would use NOT EXISTS and GROUP BY
select t1.id_number, max(t1.bill_date), sum(t1.amount)
from tbl_charges t1
where not exists (
select 1
from tbl_charges t2
where t1.id_number = t2.id_number and
t1.bill_date < t2.bill_date
)
group by t1.id_number
The NOT EXISTS filters out the irrelevant rows (any row that has a later bill_date for the same id_number), and the GROUP BY does the sum.
I would be inclined to filter in the where:
select id_number, sum(c.amount)
from tbl_charges c
where c.bill_date = (select max(c2.bill_date)
from tbl_charges c2
where c2.id_number = c.id_number and c2.bill_type = c.bill_type
)
group by id_number;
Or, another fun way is to use in with tuples:
select id_number, sum(c.amount)
from tbl_charges c
where (c.id_number, c.bill_type, c.bill_date) in
(select c2.id_number, c2.bill_type, max(c2.bill_date)
from tbl_charges c2
group by c2.id_number, c2.bill_type
)
group by id_number;
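To check these approaches against the sample table, here is a small sketch that filters each ID_Number to its latest Bill_Date and then sums the amounts. It assumes an in-memory SQLite database through Python's sqlite3 instead of MySQL; the IN list stands in for the roughly 200 IDs mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_charges (
    ID_Number INTEGER,
    Bill_Date INTEGER,
    Bill_Type TEXT,
    Amount REAL
);
INSERT INTO tbl_charges VALUES
    (1, 201601, 'A', 5.00),
    (1, 201601, 'B', 7.00),
    (1, 201604, 'A', 4.00),
    (1, 201604, 'B', 6.00),
    (2, 201701, 'A', 15.00),
    (2, 201701, 'B', 13.00);
""")

# Keep only rows on each ID's latest bill date, then sum both bill types.
LATEST_TOTALS = """
SELECT t1.ID_Number, t1.Bill_Date, SUM(t1.Amount) AS total
FROM tbl_charges t1
WHERE t1.ID_Number IN (1, 2)
  AND t1.Bill_Date = (SELECT MAX(t2.Bill_Date)
                        FROM tbl_charges t2
                       WHERE t2.ID_Number = t1.ID_Number)
GROUP BY t1.ID_Number, t1.Bill_Date
ORDER BY t1.ID_Number
"""

totals = conn.execute(LATEST_TOTALS).fetchall()
print(totals)
```

Wrapped in parentheses with an alias, the same SELECT can serve as a derived table and be joined to another query on ID_Number.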

Performant way to self-join and filter by revised rows

I'm trying to select all rows in this table, with the constraint that revisions are selected instead of the original rows. So, if a row has a revision, that revision is selected instead of that row; if there are multiple revisions, the one with the highest revision number is preferred.
I think an example table, output, and query will explain this better:
Table:
+----+-------+-------------+-----------------+-------------+
| id | value | original_id | revision_number | is_revision |
+----+-------+-------------+-----------------+-------------+
| 1 | abcd | null | null | 0 |
| 2 | zxcv | null | null | 0 |
| 3 | qwert | null | null | 0 |
| 4 | abd | 1 | 1 | 1 |
| 5 | abcde | 1 | 2 | 1 |
| 6 | zxcvb | 2 | 1 | 1 |
| 7 | poiu | null | null | 0 |
+----+-------+-------------+-----------------+-------------+
Desired Output:
+----+-------+-------------+-----------------+
| id | value | original_id | revision_number |
+----+-------+-------------+-----------------+
| 3 | qwert | null | null |
| 5 | abcde | 1 | 2 |
| 6 | zxcvb | 2 | 1 |
| 7 | poiu | null | null |
+----+-------+-------------+-----------------+
View Called revisions_max:
SELECT
responses.original_id AS original_id,
MAX(responses.revision_number) AS revision
FROM
responses
WHERE
original_id IS NOT NULL
GROUP BY responses.original_id
My Current Query:
SELECT
responses.*
FROM
responses
WHERE
id NOT IN (
SELECT
original_id
FROM
revisions_max
)
AND
is_revision = 0
UNION
SELECT
responses.*
FROM
responses
INNER JOIN revisions_max ON revisions_max.original_id = responses.original_id
AND revisions_max.revision_number = responses.revision_number
This query works, but takes 0.06 seconds to run on a table of only 2000 rows. This table will quickly grow to tens or hundreds of thousands of rows. The query under the UNION is what takes most of the time.
What can I do to improve this query's performance?
How about using coalesce()?
SELECT COALESCE(y.id, x.id) AS id,
COALESCE(y.value, x.value) AS value,
COALESCE(y.original_id, x.original_id) AS original_id,
COALESCE(y.revision_number, x.revision_number) AS revision_number
FROM responses x
LEFT JOIN (SELECT r1.*
FROM responses r1
INNER JOIN (SELECT responses.original_id AS
original_id,
Max(responses.revision_number) AS
revision
FROM responses
WHERE original_id IS NOT NULL
GROUP BY responses.original_id) rev
ON r1.original_id = rev.original_id
AND r1.revision_number = rev.revision) y
ON x.id = y.original_id
WHERE y.id IS NOT NULL
OR x.original_id IS NULL;
The approach I would take with any other DBMS is to use NOT EXISTS:
SELECT r1.*
FROM Responses AS r1
WHERE NOT EXISTS
( SELECT 1
FROM Responses AS r2
WHERE r2.original_id = COALESCE(r1.original_id, r1.id)
AND r2.revision_number > COALESCE(r1.revision_number, 0)
);
This removes any rows where a higher revision number exists for the same id (or original_id if it is populated). However, in MySQL, LEFT JOIN/IS NULL will perform better than NOT EXISTS (see note 1). As such I would rewrite the above as:
SELECT r1.*
FROM Responses AS r1
LEFT JOIN Responses AS r2
ON r2.original_id = COALESCE(r1.original_id, r1.id)
AND r2.revision_number > COALESCE(r1.revision_number, 0)
WHERE r2.id IS NULL;
I realise that you have said that you don't want to use LEFT JOIN and check for nulls, but I don't see that there is a better solution.
1. At least this was the case historically; I don't actively use MySQL, so I don't keep up to date with developments in the optimiser.
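Here is a quick way to confirm that the LEFT JOIN/IS NULL form returns the desired rows from the question. It is a sketch that assumes an in-memory SQLite database via Python's sqlite3 rather than MySQL. The index name idx_orig_rev is made up, and whether such an index helps on a real MySQL table is an assumption to verify with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE responses (
    id INTEGER PRIMARY KEY,
    value TEXT,
    original_id INTEGER,
    revision_number INTEGER,
    is_revision INTEGER
);
INSERT INTO responses VALUES
    (1, 'abcd',  NULL, NULL, 0),
    (2, 'zxcv',  NULL, NULL, 0),
    (3, 'qwert', NULL, NULL, 0),
    (4, 'abd',   1,    1,    1),
    (5, 'abcde', 1,    2,    1),
    (6, 'zxcvb', 2,    1,    1),
    (7, 'poiu',  NULL, NULL, 0);
-- Hypothetical composite index covering both join conditions.
CREATE INDEX idx_orig_rev ON responses (original_id, revision_number);
""")

# Keep a row only when no higher revision exists for the same original.
LATEST_REVISIONS = """
SELECT r1.id, r1.value, r1.original_id, r1.revision_number
FROM responses AS r1
LEFT JOIN responses AS r2
  ON r2.original_id = COALESCE(r1.original_id, r1.id)
 AND r2.revision_number > COALESCE(r1.revision_number, 0)
WHERE r2.id IS NULL
ORDER BY r1.id
"""

current = conn.execute(LATEST_REVISIONS).fetchall()
for row in current:
    print(row)
```

Rows 1 and 2 are dropped because revisions exist for them, and row 4 is dropped because row 5 has a higher revision number for the same original.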

Get the balance of my users in the same table

Help please, I have a table like this:
| ID | userId | amount | type |
-------------------------------------
| 1 | 10 | 10 | expense |
| 2 | 10 | 22 | income |
| 3 | 3 | 25 | expense |
| 4 | 3 | 40 | expense |
| 5 | 3 | 63 | income |
I'm looking for a way to retrieve the balance of each user with a single query.
The hard part is that the amounts have to be subtracted for expenses and added for incomes.
This would be the result table:
| userId | balance |
--------------------
| 10 | 12 |
| 3 | -2 |
You need to get the totals of income and expense separately using subqueries, then join them so you can subtract the expense total from the income total.
SELECT a.UserID,
(b.totalIncome - a.totalExpense) `balance`
FROM
(
SELECT userID, SUM(amount) totalExpense
FROM myTable
WHERE type = 'expense'
GROUP BY userID
) a INNER JOIN
(
SELECT userID, SUM(amount) totalIncome
FROM myTable
WHERE type = 'income'
GROUP BY userID
) b on a.userID = b.userid
This is easiest to do with a single group by:
select user_id,
sum(case when type = 'income' then amount else - amount end) as balance
from t
group by user_id
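The two approaches can be compared side by side on the question's table. This sketch assumes an in-memory SQLite database via Python's sqlite3 and the table name myTable. User 7, who has only an income row, is a hypothetical addition to show a difference: the INNER JOIN of two subqueries returns no row for a user who lacks either type, while the single GROUP BY still reports that user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myTable (ID INTEGER PRIMARY KEY, userId INTEGER,
                      amount INTEGER, type TEXT);
INSERT INTO myTable VALUES
    (1, 10, 10, 'expense'),
    (2, 10, 22, 'income'),
    (3, 3,  25, 'expense'),
    (4, 3,  40, 'expense'),
    (5, 3,  63, 'income'),
    (6, 7,  5,  'income');  -- hypothetical user with no expenses
""")

# Conditional aggregation: incomes count as positive, expenses as negative.
SINGLE_GROUP_BY = """
SELECT userId,
       SUM(CASE WHEN type = 'income' THEN amount ELSE -amount END) AS balance
FROM myTable
GROUP BY userId
ORDER BY userId
"""

# Two subqueries joined together, as in the first answer.
JOINED_SUBQUERIES = """
SELECT a.userId, b.totalIncome - a.totalExpense AS balance
FROM (SELECT userId, SUM(amount) AS totalExpense
        FROM myTable WHERE type = 'expense' GROUP BY userId) a
INNER JOIN (SELECT userId, SUM(amount) AS totalIncome
              FROM myTable WHERE type = 'income' GROUP BY userId) b
        ON a.userId = b.userId
ORDER BY a.userId
"""

balances = conn.execute(SINGLE_GROUP_BY).fetchall()
joined = conn.execute(JOINED_SUBQUERIES).fetchall()
print(balances)
print(joined)
```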
You could have 2 sub-queries, each grouped by id: one sums the incomes, the other the expenses. Then you could join these together, so that each row had an id, the sum of the expenses and the sum of the income(s), from which you can easily compute the balance.