Is it possible to remove duplicates in Power BI based on a time interval between the data? - duplicates

I have a table of lead data.
The table has many fields, such as date, name, email and mobile number.
However, some of these leads are duplicates: the same person generated more than one lead.
What I want to do is remove the duplicate leads.
The problem is that, to be considered a duplicate, the email or the mobile number must appear in more than one row within a 30-day interval.
Only the later leads count as duplicates. The first one is never a duplicate.
E.g.
1) Jones generated a lead on 01/01/20 with his email abc#abc.com and then generated another lead 10 days later, on 10/01/20. The first lead is a single lead (not a duplicate) and the second lead must be considered a duplicate.
2) Maria generated a lead on 01/01/20 with her email xyz#abc.com and then generated another lead 40 days later, on 10/02/20. The first lead is a single lead (not a duplicate) and the second lead must also be considered single (not a duplicate).
To mark each lead as a duplicate or not, I want to generate a new column with the time elapsed since the previous lead from the same person (same email or same mobile number).
Then I want to generate another column with the label "Duplicate" or "Not Duplicate" based on the time shown in that column. If it is more than 30 days, it is a single lead. Otherwise (less than 30 days) it is a duplicate lead.
E.g picture:
Can someone please help me on how to do that?

Getting lag/lead data is not very straightforward in Power BI. You will have to use a combination of EARLIER and an aggregate function to get the result. For your specific scenario, the following calculated column might work:
Day Difference =
VAR name1 = 'Table'[Name]
// Latest earlier date for the same name; blank when this is the first lead
VAR Lastdate1 =
    MAXX(
        FILTER('Table', 'Table'[Name] = name1 && 'Table'[date] < EARLIER('Table'[date])),
        'Table'[date]
    )
RETURN
    IF(
        ISBLANK(Lastdate1),
        100,
        DATEDIFF(Lastdate1, 'Table'[date], DAY)
    )
Once the column is created, you can filter on it: rows with a value <= 30 are the duplicates, so keeping only rows with a value > 30 removes them. I have replaced the blanks with 100 so that the original (first) records don't get removed when applying the condition.
If you are looking for the "Tag" value, then the following calculation will get you the tag values directly:
Tag =
VAR name1 = 'Table'[Name]
// Latest earlier date for the same name; blank when this is the first lead
VAR Lastdate1 =
    MAXX(
        FILTER('Table', 'Table'[Name] = name1 && 'Table'[date] < EARLIER('Table'[date])),
        'Table'[date]
    )
RETURN
    IF(
        ISBLANK(Lastdate1),
        "Single",
        IF(DATEDIFF(Lastdate1, 'Table'[date], DAY) <= 30, "Duplicate", "Single")
    )
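For readers who want to check the logic outside Power BI, here is a minimal Python sketch of the same idea: for each row, find the latest earlier date for the same name and tag the row by the gap. The sample rows, the function name tag_leads and the window_days parameter are assumptions made for illustration; matching is done on name only, as in the DAX above, not on email or mobile number.

```python
from datetime import date

# Hypothetical sample rows: (name, lead date), using the dates from the question.
leads = [
    ("Jones", date(2020, 1, 1)),
    ("Jones", date(2020, 1, 10)),
    ("Maria", date(2020, 1, 1)),
    ("Maria", date(2020, 2, 10)),
]


def tag_leads(rows, window_days=30):
    """Return (name, date, days_since_previous, tag) for every row."""
    tagged = []
    for name, lead_date in rows:
        # Dates of earlier leads by the same person (the FILTER step).
        earlier = [d for n, d in rows if n == name and d < lead_date]
        if not earlier:
            # First lead for this person: never a duplicate.
            tagged.append((name, lead_date, None, "Single"))
            continue
        # Gap to the latest earlier lead (the MAXX + DATEDIFF step).
        days = (lead_date - max(earlier)).days
        tag = "Duplicate" if days <= window_days else "Single"
        tagged.append((name, lead_date, days, tag))
    return tagged


for row in tag_leads(leads):
    print(row)
```

This scans the whole list once per row, which is fine for a small table; sorting by name and date first would be the usual choice for larger data.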

Related

Transforming Results from Rows into Columns

I have a data set that contains both common and unique values, which I am attempting to return in a usable format to allow further analysis/work based on the results.
The desired result is a script that recognises the common values, such as mpan/serial_number/read_at, so as to return only a single row, but that also recognises the unique values, those being the read_at and identifier.
Currently my script returns a unique row based on the identifier and the value, but I would like to return a single row per read_at date, covering as many identifiers and values as are held. In most cases there are only two identifiers and values, but there could be as many as five.
The issue I have is that when I try to make DISTINCT work, it only returns the first result found, where I am expecting a pair of results at minimum. I am also unclear as to how I could stop getting a new row and instead return the result as an additional column.
My base script, which pulls everything, is below. I have tried a few variations on this, but I think it is the best place to start from for any help you may be able to offer.
SELECT *
FROM consumer.stg_d0010_v2_026_027
/*LEFT JOIN consumer.stg_d0010_v2_026_028_029
ON consumer.stg_d0010_v2_026_028_029.file_identifier = consumer.stg_d0010_v2_026_027.file_identifier
AND consumer.stg_d0010_v2_026_028_029.mpan = consumer.stg_d0010_v2_026_027.mpan*/
LEFT JOIN consumer.stg_d0010_v2_026_028_030_032
ON consumer.stg_d0010_v2_026_028_030_032.file_identifier = consumer.stg_d0010_v2_026_027.file_identifier
AND consumer.stg_d0010_v2_026_028_030_032.mpan = consumer.stg_d0010_v2_026_027.mpan
LEFT JOIN consumer.stg_d0010_v2_026_028_030_033
ON consumer.stg_d0010_v2_026_028_030_033.file_identifier = consumer.stg_d0010_v2_026_027.file_identifier
AND consumer.stg_d0010_v2_026_028_030_033.mpan = consumer.stg_d0010_v2_026_027.mpan
where consumer.stg_d0010_v2_026_028_030_032.read_At > '2022-10-01'
and consumer.stg_d0010_v2_026_027.mpan in (
)
Example dataset: [image]
Desired outcome: [image]
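As a rough illustration of the reshaping being asked for (not a SQL solution), the following Python sketch collapses rows that share mpan, serial_number and read_at into one record and spreads each identifier/value pair into its own column. The sample rows, the identifier codes "DY" and "NT", and the function name pivot_reads are hypothetical; the real column layout is only known from the images.

```python
from collections import defaultdict

# Hypothetical rows: (mpan, serial_number, read_at, identifier, value).
rows = [
    ("1001", "S1", "2022-10-05", "DY", 120.0),
    ("1001", "S1", "2022-10-05", "NT", 80.0),
    ("1001", "S1", "2022-11-05", "DY", 150.0),
]


def pivot_reads(records):
    """Return one dict per (mpan, serial_number, read_at), with a key per identifier."""
    grouped = defaultdict(dict)
    for mpan, serial, read_at, identifier, value in records:
        # Each identifier becomes a column holding its value.
        grouped[(mpan, serial, read_at)][identifier] = value
    return [
        {"mpan": m, "serial_number": s, "read_at": r, **values}
        for (m, s, r), values in grouped.items()
    ]


for record in pivot_reads(rows):
    print(record)
```

In SQL the equivalent is usually a GROUP BY on the common columns with one conditional aggregate per identifier, which works here because the number of identifiers is bounded (at most five).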

Trying to pull Max date less than Date on the row

I know this is a tough one, but basically I'm trying to say: give me a service call and its completion date, then give me the max date of all service calls whose date is earlier than that of the service call I'm asking about.
The end result I'm looking for is to tell whether there was another service call on the same piece of equipment within the previous 30 days.
As you can see in the image, for Asset 50698, service call 579032 has a date of 11/9/2020, and the call below it was on 10/22/2020, which is less than 30 days earlier. I want to find a way to count how many service calls I have where this has occurred. Is this possible?
I think you're looking for a context operator: In, ForEach or ForAll (In, in this case).
Add a variable "MaxAssetDate" and assign it a formula similar to the following, adjusted to your column headers:
=Max([Service Call Completion Date] In ([Asset ID];[Service Call])) In ([Asset ID])
Then add this as a column. Provided you have a prompt filtering for a given asset or date, this column will show the max date across the service calls of the same asset ID.
Then add a new variable, ServiceCallDaysDiff, using DatesBetween() with MaxAssetDate, the service call completion date and DayPeriod:
=DatesBetween([Service Call Completion Date];[MaxAssetDate];DayPeriod)
You should get a number from 0 to X. Then add a filter on that number: show the records where it is between 1 and 30 and hide the rest, or apply whatever logic is needed.
If you're dealing with hundreds of thousands of records this isn't ideal, as you're putting all the processing on the WebI engine when it would ideally happen as an object in the database layer. However, if you only have a few thousand records this should be manageable.
To add a count of service calls, add a variable ServiceCallsCount:
=Sum(Sum(If([ServiceCallDaysDiff]=0;0;1)) In ([Asset ID]))
This counts the non-zero day differences. Note that it extends beyond 30 days, so if you want to limit it to 30 days, adjust the If statement to zero out the values that are not between 1 and 30.
This is but one approach: there may be simpler ways.
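The target logic can also be sketched in plain Python, which may help when validating the report's numbers. Note that this computes the previous completion date per call directly rather than reproducing the WebI context operators. Only asset 50698, call 579032 and the two dates 11/9/2020 and 10/22/2020 come from the question; every other value and both function names are made up for illustration.

```python
from datetime import date

# Hypothetical rows: (asset_id, service_call, completion_date).
calls = [
    (50698, 579032, date(2020, 11, 9)),
    (50698, 577001, date(2020, 10, 22)),
    (50698, 570500, date(2020, 6, 1)),
    (61234, 578000, date(2020, 11, 2)),
]


def days_since_previous_call(rows):
    """Map each service call to the days since the asset's previous call (None if first)."""
    gaps = {}
    for asset, call, done in rows:
        # Completion dates of earlier calls on the same asset.
        earlier = [d for a, _, d in rows if a == asset and d < done]
        gaps[call] = (done - max(earlier)).days if earlier else None
    return gaps


def count_repeat_calls(rows, window_days=30):
    """Count calls that follow another call on the same asset within window_days."""
    gaps = days_since_previous_call(rows)
    return sum(1 for g in gaps.values() if g is not None and 1 <= g <= window_days)


print(days_since_previous_call(calls))
print(count_repeat_calls(calls))
```

The 1 to 30 range mirrors the filter suggested above; change the lower bound to 0 if two calls completed on the same day should also count.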

SQL/mysql - how to display two columns with different value from 1 table

I am trying to write a query for document approvals, where the result displays the name and signature with date. How can I get the dates for the two people approving the document?
Select Uname,
case when stepcode=1 then 'approver1' end as 'name of person',
case when stepcode=1 then 'approver1' end as 'date of signed noted',
case when stepcode=2 then 'approver2' end as 'date of signed approved'
from table
I tried this, but only one result showed up: only the name, signature and date of the first approval were displayed.
We can only answer this by making some assumptions:
the field stepcode denotes what stage of the sign-off process the record is at
a value of 1 means noted and a value of 2 means approved; a value of 0 means nothing has happened yet
approver1 and approver2 are NULL if the action has not yet taken place
If all of the above is true, there should be no need for a CASE expression on these fields: just including the fields in the SELECT statement will bring the values through if they have been completed.
Some validation of the data might be required, though, if you are not getting the results you expect. Running some rough counts for each of the steps, and for where the approver fields have values, would help to make sure your code is working. The following should give you something to work with:
SELECT
    stepcode,
    COUNT(TableID) AS NumberAtStep
FROM table
GROUP BY stepcode
Using these counts, you can then run your statement without the CASE expressions and do a manual count to check that you are seeing the right number of records with the relevant columns populated for each step.
However, more information is needed to dig into your problem any further.
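To try the per-step count without touching the real database, a small sketch using Python's built-in sqlite3 module could look like this. The table name approvals, the sample rows and the function name count_by_step are assumptions for illustration (the real table and key names are not given in the question), and SQLite stands in for MySQL here.

```python
import sqlite3

# In-memory database with a hypothetical "approvals" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE approvals (TableID INTEGER PRIMARY KEY, stepcode INTEGER)")
conn.executemany(
    "INSERT INTO approvals (stepcode) VALUES (?)",
    [(0,), (1,), (1,), (2,)],
)


def count_by_step(connection):
    """Return {stepcode: number of rows at that step}."""
    cursor = connection.execute(
        "SELECT stepcode, COUNT(TableID) AS NumberAtStep "
        "FROM approvals GROUP BY stepcode ORDER BY stepcode"
    )
    return dict(cursor.fetchall())


print(count_by_step(conn))
```

Comparing these counts with the number of rows where each approver column is populated is the sanity check described above.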

Access 2013 Count

I am working on a report in Access 2013. I need to separate the first 20 records in a column that contain a value and assign a name to them. For example, for records 1-20 I need it to insert Lot 1, for 21-40 Lot 2, etc. The report needs to be separated into lots of 20. I can also just insert a line when it reaches each set of 20, without a name, if that makes it easier. I just need something to show a break at sets of 20.
Example: As you can see, the report is separated by welder stencil. When the count in the VT column reaches 20, I need to insert a line or some type of divider to separate the data. Our client is asking that we separate the VT in sets of 20. I don't know what the easiest way to accomplish this is. I have researched it but haven't found anything.
Example Report with Divisions
Update the report's RecordSource query by adding "Lot" values for each row. There are multiple ways of doing this, but the easiest will be if your records already have a sequential, continuous numerical key. If they do not have such a key, you can research generating such sequential numbers for your query, but it is beyond the scope of this question and no details about the actual data schema were supplied in the question.
Let's imagine that you have such a key column [Seq]. You can use the modulo (Mod) and/or integer division (\, backslash) operators to find the rows that start a new lot of 20, e.g. ([Seq] - 1) Mod 20 = 0.
Generate a lot value for each row. An example SQL snippet: SELECT ("Lot " & ((([Seq] - 1) \ 20) + 1)) As LotNumber ...
Use the Access report sorting and grouping features (grouping on the new LotNumber field) to print a line and/or label at the start of each group. You can also have the report start a new page at the beginning or end of such a group.
The details about grouping can be found elsewhere in tutorials and Access documentation and are beyond the scope of this question.
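The lot arithmetic is easy to get off by one, so here is a short Python sketch of it, assuming a 1-based, gap-free sequence number. It adds 1 to the integer division result so that rows 1-20 land in Lot 1, as the question asks. The function names and the lot_size parameter are illustrative only.

```python
def lot_label(seq, lot_size=20):
    """Return the lot label for a 1-based sequence number."""
    # Integer division groups 1..20 -> 0, 21..40 -> 1, and so on; +1 makes lots 1-based.
    return f"Lot {(seq - 1) // lot_size + 1}"


def starts_new_lot(seq, lot_size=20):
    """True for the first row of each lot (1, 21, 41, ...)."""
    return (seq - 1) % lot_size == 0


for seq in (1, 20, 21, 40, 41):
    print(seq, lot_label(seq), starts_new_lot(seq))
```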

Searching ALL ROWS in a Group using IIF Expression

I am working on a report that displays patient names (as groups with drilldowns) and several fields related to their visits. I have created a column in the report to display whether or not a specific value appears in the 'LocationID' column. The expression I used is
=IIF(Fields!LocationID.Value="WELL","Y","N")
I thought this was working great: it displays Y or N next to each name to let me know if 'WELL' was in their 'LocationID'. I checked several to make sure this was going to work and discovered that there is also a LocationID code of 'WHS'. Since I have the rows ordered by Name and LocationID, a WHS visit shows up at the top of the group, and my expression only sees this top item. How can this expression be written so that it searches all rows in each group? Depending on the date range, a patient may have one visit or ten, and I need to check all visits that are returned. Perhaps there is a better method. Thanks in advance.
I agree with jimmy8ball that the easiest way to solve most issues like this is to push some logic back into the SQL layer.
However, if you really want to do this via SSRS functionality, then you could implement a substring search against a LookupSet. Assuming you have a patient id in your dataset that is unique for each patient (I hope your group isn't on the name), then:
=IIf(InStr(Join(LookupSet(Fields!patientid.Value, Fields!patientid.Value, Fields!LocationID.Value, "dataset"), ","), "WELL") > 0, "Y", "N")
This says: "Search through the dataset for all rows related to my patientid, join every location into a comma-delimited string, search the string for the text "WELL" and return "Y" if it's found."
Obviously, if you have locations in your dataset like "WELLY", these will produce false positives and you'll have to implement some more nested logic. Try appending a value (perhaps !) to the LookupSet return field so that you can search for "WELL!" or some other terminator character.
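The join-and-search idea, including the terminator trick, can be sketched in Python to see why the false positive occurs and how the terminator removes it. The sample visits, the function name has_location and its parameters are hypothetical; the list comprehension plays the role of LookupSet and the in operator plays the role of InStr.

```python
# Hypothetical visit rows: (patient_id, location_id).
visits = [
    (1, "WHS"),
    (1, "WELL"),
    (2, "WELLY"),
    (3, "WHS"),
]


def has_location(rows, patient_id, target="WELL", terminator="!"):
    """Return "Y" if any of the patient's visits has the target location, else "N"."""
    # Collect every location for this patient, appending the terminator to each.
    locations = [loc + terminator for pid, loc in rows if pid == patient_id]
    # Join into one comma-delimited string and search it for target + terminator.
    joined = ",".join(locations)
    return "Y" if (target + terminator) in joined else "N"


print(has_location(visits, 1))                 # WELL is among the visits
print(has_location(visits, 2))                 # WELLY! does not contain WELL!
print(has_location(visits, 2, terminator=""))  # without a terminator, WELLY matches
```

A terminator only guards the end of the value; a code such as "XWELL" would still match, so an exact comparison per row is safer where the tool allows it.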