SQL Capture duplicate records across two DIFFERENT columns - mysql

I am writing an exception-catching page using MySQL for catching duplicate billing entries in the following scenario.
Items details are entered in a table which has the following two columns (among others).
ItemCode VARCHAR(50), BillEntryDate DATE
It often happens that the same item's bill is entered multiple times, but over a period of a few days. For example:
"Football","2019-01-02"
"Basketball","2019-01-02"
...
...
"Football","2019-01-05"
"Rugby","2019-01-05"
...
"Handball","2019-01-05"
"Rugby","2019-01-07"
"Rugby","2019-01-10"
In the above example, the item Football is billed twice: first on 2 Jan and again on 5 Jan. Similarly, the item Rugby is billed thrice, on 5, 7, and 10 Jan.
I am looking to write simple SQL which can pick up each item [say, using a distinct(ItemCode) clause] and then display all the records that are duplicated over a period of 30 days.
In the above case, the expected output should be the following 5 records:
"Football","2019-01-02"
"Football","2019-01-05"
"Rugby","2019-01-05"
"Rugby","2019-01-07"
"Rugby","2019-01-10"
I am trying to run the following SQL:
select * from tablen a, tablen b where a.ItemCode = b.ItemCode and a.BillEntryDate = b.BillEntryDate + 30;
However, this seems to be highly inefficient as it is running for long without displaying any records.
Is there any possibility for getting a less complex and faster method?
I did explore existing questions (like How do I find duplicates across multiple columns?), but they catch duplicates where BOTH columns have the same value. My requirement is the same value in one column, with the second column varying over a month-long date range.

You can use:
select t.*
from tablen t
where exists (select 1
              from tablen t2
              where t2.ItemCode = t.ItemCode and
                    t2.BillEntryDate <> t.BillEntryDate and
                    t2.BillEntryDate >= t.BillEntryDate - interval 30 day and
                    t2.BillEntryDate <= t.BillEntryDate + interval 30 day
             );
This will pick up every row of each duplicate set, not just one of the pair.
For performance, you want an index on (ItemCode, BillEntryDate).
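If the table doesn't have one yet, it could be created along these lines (the index name is illustrative):
create index idx_item_date on tablen (ItemCode, BillEntryDate);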

With EXISTS:
select ItemCode, BillEntryDate
from tablename t
where exists (
    select 1
    from tablename t2
    where t2.ItemCode = t.ItemCode
      and abs(datediff(t2.BillEntryDate, t.BillEntryDate)) between 1 and 30
);
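For anyone who wants to check either query, here is a minimal test script built from the question's sample data (assuming the table is called tablename):
create table tablename (ItemCode varchar(50), BillEntryDate date);
insert into tablename values
('Football',   '2019-01-02'),
('Basketball', '2019-01-02'),
('Football',   '2019-01-05'),
('Rugby',      '2019-01-05'),
('Handball',   '2019-01-05'),
('Rugby',      '2019-01-07'),
('Rugby',      '2019-01-10');
Both EXISTS queries above should return the five Football and Rugby rows listed as the expected output.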

Related

Looking for a low footprint solution to GROUP rows using HAVING to filter

Here is a table
id date name
1 180101 josh
2 180101 peter
3 180101 julia
4 180102 robert
5 180103 patrick
6 180104 josh
7 180104 adam
I need to get all the names that have the same dates as 'josh'. How can I achieve it without grouping the whole table together? I need to keep it efficient (this is not my real table, I just simplified my problem here; I have hundreds of thousands of records, and 99% of the rows have different dates, so rows groupable by date are kind of rare).
So basically what I want is: if 'josh' is the target, I need to get 'josh, peter, julia, adam' (actually the first 10 distinct names sharing a date with josh).
SELECT
    COUNT(date) as datecount,
    GROUP_CONCAT(DISTINCT name) as names
FROM
    table
GROUP BY
    date
HAVING
    datecount > 1
    -- && name IN ('josh') would work nicely for me, but I'm getting an error because 'name' is not in the GROUP BY
LIMIT 10
Any idea? As I mentioned, it needs to be fast, and most of the rows have unique dates.
Join the table with itself on date:
select distinct t1.name
from tbl t1
join tbl t2 using (date)
where t2.name = 'josh'
For the best performance you would have indexes on (name) and (date, name).
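If only the first 10 names are wanted, a limit can go straight on that query, and the indexes could be created like this (index names are illustrative):
select distinct t1.name
from tbl t1
join tbl t2 using (date)
where t2.name = 'josh'
limit 10;

create index idx_name on tbl (name);
create index idx_date_name on tbl (date, name);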

Assistance with complex MySQL query (using LIMIT ?)

I wonder if anyone could help with a MySQL query I am trying to write to return relevant results.
I have a big table of change log data, and I want to retrieve a number of record 'groups'. For example, in this case a group would be where two or more records are entered with the same timestamp.
Here is a sample table.
==============================================
ID DATA TIMESTAMP
==============================================
1 Some text 1379000000
2 Something 1379011111
3 More data 1379011111
3 Interesting data 1379022222
3 Fascinating text 1379033333
If I wanted the first two grouped sets, I could use LIMIT 0,2 but this would miss the third record. The ideal query would return three rows (as two rows have the same timestamp).
==============================================
ID DATA TIMESTAMP
==============================================
1 Some text 1379000000
2 Something 1379011111
3 More data 1379011111
Currently I've been using PHP to process the entire table, which mostly works, but for a table of 1000+ records, this is not very efficient on memory usage!
Many thanks in advance for any help you can give...
Get the timestamps for the filtering using a join. For instance, the following would make sure that the second timestamp is in a completed group:
select t.*
from t join
(select timestamp
from t
order by timestamp
limit 2
) tt
on t.timestamp = tt.timestamp;
The following would get the first three groups, no matter what their size:
select t.*
from t join
(select distinct timestamp
from t
order by timestamp
limit 3
) tt
on t.timestamp = tt.timestamp;
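Against the sample data, the first query picks the two earliest timestamps (1379000000 and 1379011111) and so returns exactly the three rows shown in the question. For either version, an index on the timestamp column should keep the subquery and the join cheap (the index name is illustrative):
create index idx_timestamp on t (timestamp);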

Can SQL query do this?

I have a table "audit" with a "description" column, a "record_id" column and a "record_date" column. I want to select only those records where the description matches one of two possible strings (say, LIKE "NEW%" OR LIKE "ARCH%") and where the record_id of those two rows matches. I then need to calculate the difference in days between the record_date values of the two rows.
For instance, my table may contain:
id description record_id record_date
1 New Sub 1000 04/14/13
2 Mod 1000 04/14/13
3 Archived 1000 04/15/13
4 New Sub 1001 04/13/13
I would want to select only rows 1 and 3 and then calculate the number of days between 4/15 and 4/14 to determine how long it took to go from New to Archived for that record (1000). Both a New and an Archived entry must be present for any record for it to be counted (I don't care about ones that haven't been archived). Does this make sense and is it possible to calculate this in a SQL query? I don't know much beyond basic SQL.
I am using MySQL Workbench to do this.
The following is untested, but it should work assuming that any given record_id can only show up once with "New Sub" and "Archived":
select n.id as new_id
,a.id as archive_id
,record_id
,n.record_date as new_date
,a.record_date as archive_date
,DateDiff(a.record_date, n.record_date) as days_between
from audit n
join audit a using(record_id)
where n.description = 'New Sub'
and a.description = 'Archived';
I changed from OR to AND, because I thought you wanted the number of days only between records that were actually archived.
My test was in SQL Server so the syntax might need to be tweaked slightly for yours (especially the DATEDIFF function), but you can select from the same table twice, one side grabbing the 'new' and one grabbing the 'archived', then linking them by record_id...
SELECT
newsub.id,
newsub.description,
newsub.record_date,
arc.id,
arc.description,
arc.record_date,
DATEDIFF(day, newsub.record_date, arc.record_date) AS DaysBetween
FROM
foo1 arc
, foo1 newsub
WHERE
(newsub.description LIKE 'NEW%')
AND
(arc.description LIKE 'ARC%')
AND
(newsub.record_id = arc.record_id)
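For MySQL specifically, the only change should be DATEDIFF, which takes two arguments and already returns days; a sketch of the same query in MySQL syntax:
SELECT
    newsub.id,
    newsub.description,
    newsub.record_date,
    arc.id,
    arc.description,
    arc.record_date,
    DATEDIFF(arc.record_date, newsub.record_date) AS DaysBetween
FROM foo1 arc
JOIN foo1 newsub ON newsub.record_id = arc.record_id
WHERE newsub.description LIKE 'NEW%'
  AND arc.description LIKE 'ARC%';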

View to return row duplicated the # of times represented by amount field value

I have an accounting table that contains a dollar amount field. I want to create a view that will return one row per penny of that amount with some of the other fields from the table.
So as a very simple example let's say I have a row like this:
PK Amount Date
---------------------------
123 4.80 1/1/2012
The query/view should return 480 rows (one for each penny) that all look like this:
PK Date
-----------------
123 1/1/2012
What would be the most performant way to accomplish this? I have a solution that uses a table-valued function and a temp table, but in the back of my head I keep thinking there has got to be a way to accomplish this with a traditional view. Possibly a creative cross join or something that will return this result without having to declare too many resources in the form of temp tables, TVFs, etc. Any ideas?
With a little help from a numbers table you can do it like this:
select PK, [Date]
from YourTable as T
inner join number as N
on N.n between 1 and T.Amount * 100
If you don't have one around and you want to test this you can use master..spt_values.
declare @T table
(
  PK int,
  Amount money,
  [Date] date
)

insert into @T values
(123, 4.80, '20120101')

;with number(n) as
(
  select number
  from master..spt_values
  where type = 'P'
)
select PK, [Date]
from @T as T
inner join number as N
  on N.n between 1 and T.Amount * 100
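The join repeats the row once per penny, so this test should produce the 480 rows asked for in the question:
PK    Date
----  ----------
123   2012-01-01
...   (480 rows in total)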
Update:
From an article by Jeff Moden.
The "Numbers" or "Tally" Table: What it is and how it replaces a loop.
A Tally table is nothing more than a table with a single column of
very well indexed sequential numbers starting at 0 or 1 (mine start at
1) and going up to some number. The largest number in the Tally table
should not be just some arbitrary choice. It should be based on what
you think you'll use it for. I split VARCHAR(8000)'s with mine, so it
has to be at least 8000 numbers. Since I occasionally need to generate
30 years of dates, I keep most of my production Tally tables at 11,000
or more which is more than 365.25 days times 30 years.
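A persistent tally table along those lines can be built once and indexed; a minimal SQL Server sketch (table and index names are illustrative):
SELECT TOP (11000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
INTO dbo.number
FROM master..spt_values a
CROSS JOIN master..spt_values b;

CREATE UNIQUE CLUSTERED INDEX ux_number_n ON dbo.number (n);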
You could use a CTE, something like this:
WITH duplicationCTE AS
(
SELECT PK, Date, Amount, 1 AS Count
FROM myTable
UNION ALL
SELECT myTable.PK, myTable.Date, myTable.Amount, Count+1
FROM myTable
JOIN duplicationCTE ON myTable.PK = duplicationCTE.PK
WHERE Count+1 <= myTable.Amount*100
)
SELECT PK, Date
FROM duplicationCTE
OPTION (MAXRECURSION 0);
AND, do note the 0. That means this can run infinitely (dangerous, btw). Otherwise, 32767 is the max number of recursions you can set (the default is 100). However, if you are running over 32767 loops, then maybe you need to rethink your logic :)

Mysql subquery with sum causing problems

This is a summary version of the problems I am encountering, but hits the nub of my problem. The real problem involves huge UNION groups of monthly data tables, but the SQL would be huge and add nothing. So:
SELECT entity_id,
sum(day_call_time) as day_call_time
from (
SELECT entity_id,
sum(answered_day_call_time) as day_call_time
FROM XCDRDNCSum201108
where (day_of_the_month >= 10 AND day_of_the_month<=24)
and LPAD(core_range,4,"0")="0987"
and LPAD(subrange,3,"0")="654"
and SUBSTR(LPAD(core_number,7,"0"),4,7)="3210"
) as summary
is the problem: when the subquery against XCDRDNCSum201108 matches no rows, the SUM still returns a row whose columns are all null. And entity_id is part of the primary key, and cannot be null.
If I take out the SUM and just query entity_id, the subquery contains no rows and the outer query does not fail; but when I use SUM, I get error 1048: Column 'entity_id' cannot be null.
How do I work around this problem? Sometimes there simply is no data.
You are completely overworking the query... pre-summing inside, then summing again outside. In addition, I understand you are not a DBA, but if you are ever doing an aggregation, you TYPICALLY need to group by the non-aggregate criteria. In the case presented here, you are getting the sum of calls per entity ID, so you must group by any non-aggregates. However, if all you care about is the grand total WITHOUT respect to the entity_id, then you can skip the group by, but you should also NOT include the actual entity ID...
If you want inclusive to show actual time per specific entity ID...
SELECT
entity_id,
sum(answered_day_call_time) as day_call_time,
count(*) number_of_calls
FROM
XCDRDNCSum201108
where
(day_of_the_month >= 10 AND day_of_the_month<=24)
and LPAD(core_range,4,"0")="0987"
and LPAD(subrange,3,"0")="654"
and SUBSTR(LPAD(core_number,7,"0"),4,7)="3210"
group by
entity_id
This would result in something like (fictitious data)
Entity_ID Day_Call_Time Number_Of_Calls
1 10 3
2 45 4
3 27 2
If all you cared about were the total call times
SELECT
sum(answered_day_call_time) as day_call_time,
count(*) number_of_calls
FROM
XCDRDNCSum201108
where
(day_of_the_month >= 10 AND day_of_the_month<=24)
and LPAD(core_range,4,"0")="0987"
and LPAD(subrange,3,"0")="654"
and SUBSTR(LPAD(core_number,7,"0"),4,7)="3210"
This would result in something like (fictitious data)
Day_Call_Time Number_Of_Calls
82 9
Would changing
sum(answered_day_call_time) as day_call_time
to
ifnull(sum(answered_day_call_time),0) as day_call_time
work? I'm assuming MySQL here, but the COALESCE function would/should work too.
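Applied to the subquery from the question, the two suggestions together would look something like this (a sketch only; the real UNION of monthly tables is left out):
SELECT entity_id,
       IFNULL(SUM(answered_day_call_time), 0) AS day_call_time
FROM XCDRDNCSum201108
WHERE (day_of_the_month >= 10 AND day_of_the_month <= 24)
  AND LPAD(core_range,4,"0") = "0987"
  AND LPAD(subrange,3,"0") = "654"
  AND SUBSTR(LPAD(core_number,7,"0"),4,7) = "3210"
GROUP BY entity_id;
-- with the GROUP BY in place, an empty match returns no rows at all,
-- so no null entity_id ever reaches the outer query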