Detect duplicates on 2 attributes: nifi - duplicates

I am trying to use NiFi to detect duplicates based on 2 attributes of flow files, so that within any one second there are no duplicate rows whose values for these 2 attributes are the same. The DetectDuplicate processor is configured as follows:
Cache Entry Identifier : ${attribute1_name}::${attribute2_name}
Age Off Duration : 1 sec
Distributed Cache Service : DistributedMapCacheClientService
Still, I am getting duplicate rows in which the values of these 2 attributes are the same within one second.
Help is much appreciated. Thanks.

An "Age Off Duration" of 1 second means that a Cache Entry Identifier value that duplicates one that arrived at least one second ago will NOT be considered a duplicate. That property lets entries "expire". Some users set it to 24 hours so that the next day the same values can show up again as "not previously seen". If you want to always maintain the "seen" values, leave "Age Off Duration" blank.
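As a rough illustration of that expiry behavior, here is a toy model in Python. It is not NiFi code: the class `AgeOffCache`, its method, and the explicit `now` argument are hypothetical names chosen for this sketch, and the real processor delegates storage to the cache service.

```python
class AgeOffCache:
    """Toy model of a 'seen' cache whose entries expire after age_off seconds."""

    def __init__(self, age_off=None):
        # age_off=None models a blank "Age Off Duration": entries never expire.
        self.age_off = age_off
        self.seen_at = {}

    def is_duplicate(self, key, now):
        """Return True if key was seen before and its entry has not aged off."""
        previous = self.seen_at.get(key)
        if previous is not None and (self.age_off is None or now - previous < self.age_off):
            return True
        # First sighting, or the old entry expired: record it as newly seen.
        self.seen_at[key] = now
        return False


cache = AgeOffCache(age_off=1.0)
key = "value1::value2"  # stands in for ${attribute1_name}::${attribute2_name}
print(cache.is_duplicate(key, now=0.0))  # first sighting
print(cache.is_duplicate(key, now=0.5))  # within one second of the first
print(cache.is_duplicate(key, now=1.5))  # entry has aged off
```

In this model the third call is treated as "not previously seen" because the entry is older than one second, which is the behavior described above.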

Is it possible to remove duplicates in Power BI based on a time interval between the data?

I have a list of leads data.
The table has a lot of information, such as date, name, email, mobile number, etc.
However, some of these leads are duplicates: the same person generated more than one lead.
What I want to do is remove the duplicate leads.
The problem is that, to be considered a duplicate, a lead's email or mobile number must appear in more than one row within a time interval of 30 days.
Only the later leads count as duplicates. The first one is never a duplicate.
E.g.
1) If Jones generated a lead on 01/01/20 with his email abc#abc.com and then generated another lead 10 days later, on 10/01/20, the first lead is a single lead (not a duplicate) and the second lead must be considered a duplicate.
2) If Maria generated a lead on 01/01/20 with her email xyz#abc.com and then generated another lead 40 days later, on 10/02/20, the first lead is a single lead (not a duplicate) and the second lead must also be considered single (not a duplicate).
To mark a lead as duplicate or not, I want to generate a new column with the time since the previous lead of the same person (same email or same mobile number).
Then I want to generate another column with the label "Duplicate" or "Not Duplicate" based on the time shown in that column. If it is greater than 30 days, the lead is single. Otherwise (less than 30 days) it is a duplicate.
Can someone please help me on how to do that?
Getting lag/lead data is not very straightforward in Power BI. You will have to use a combination of EARLIER and an aggregate function to get the specific result. For your scenario, the following calculated column might work:
Day Difference =
VAR name1 = 'Table'[Name]
// Most recent earlier date for the same name (blank if there is none)
VAR Lastdate1 =
    MAXX(
        FILTER('Table', 'Table'[Name] = name1 && 'Table'[date] < EARLIER('Table'[date])),
        'Table'[date]
    )
RETURN
    // No earlier record: return 100 so the row counts as more than 30 days
    IF(
        ISBLANK(Lastdate1),
        100,
        DATEDIFF(Lastdate1, 'Table'[date], DAY)
    )
Once the column is created, you can filter on it: records with a value <= 30 are the duplicates. Blanks are replaced with 100 so that first-time leads, which have no earlier record, are not caught by that condition.
If you are looking for the "Tag" value, then the following calculation will get you the tag values directly:
Tag =
VAR name1 = 'Table'[Name]
// Most recent earlier date for the same name (blank if there is none)
VAR Lastdate1 =
    MAXX(
        FILTER('Table', 'Table'[Name] = name1 && 'Table'[date] < EARLIER('Table'[date])),
        'Table'[date]
    )
VAR DayDiff = IF(ISBLANK(Lastdate1), 100, DATEDIFF(Lastdate1, 'Table'[date], DAY))
RETURN
    IF(DayDiff <= 30, "Duplicate", "Single")
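To make the rule easier to check outside Power BI, the same logic can be sketched in Python. This is only an illustration under assumptions: the function name `tag_leads` and the (name, date) tuple layout are made up for the sketch, and it matches leads on name only, like the DAX above. Unlike the DAX, two leads by the same person on the same date are treated as 0 days apart.

```python
from datetime import date


def tag_leads(leads, window_days=30):
    """Tag each (name, date) lead by the gap to that person's previous lead.

    Returns (name, date, days_since_previous, tag) tuples in date order.
    days_since_previous is None for a person's first lead.
    """
    last_seen = {}
    tagged = []
    for name, lead_date in sorted(leads, key=lambda lead: lead[1]):
        previous = last_seen.get(name)
        gap = (lead_date - previous).days if previous is not None else None
        is_duplicate = gap is not None and gap <= window_days
        tagged.append((name, lead_date, gap, "Duplicate" if is_duplicate else "Single"))
        last_seen[name] = lead_date
    return tagged


leads = [
    ("Jones", date(2020, 1, 1)),
    ("Jones", date(2020, 1, 10)),
    ("Maria", date(2020, 1, 1)),
    ("Maria", date(2020, 2, 10)),
]
for row in tag_leads(leads):
    print(row)
```

With the two examples from the question, the second Jones lead is tagged "Duplicate" and both Maria leads are tagged "Single".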

SQL/mysql - how to display two columns with different value from 1 table

I am trying to write a query for the approval of documents, where the result displays the name and signature with the date. How can I get the dates for the two people approving the document?
SELECT Uname,
    CASE WHEN stepcode = 1 THEN 'approver1' END AS 'name of person',
    CASE WHEN stepcode = 1 THEN 'approver1' END AS 'date of signed noted',
    CASE WHEN stepcode = 2 THEN 'approver2' END AS 'date of signed approved'
FROM table
I tried this, but only one result showed up. Only the name, signature and date of the first approval displayed.
We can only answer this by making some assumptions:
The field stepcode denotes which stage of the sign-off process the record is at.
A value of 1 means noted and a value of 2 means approved. A value of 0 means nothing has happened yet.
approver1 and approver2 are NULL if the action has not yet taken place.
If all of the above is true, then there should be no requirement to have a CASE statement for the fields... just including the fields within the SELECT statement will bring the values through if they have been completed.
Some validation of data might be required here though if you are not getting the results you are expecting. Running some rough counts for each of the steps and for where they have values in the approver fields would help to make sure your code is working. The following should give you something to work with:
SELECT
    stepcode,
    COUNT(TableID) AS NumberAtStep
FROM table
GROUP BY stepcode
Using these counts, you can then run your statement without the CASE statements and run a manual count to ensure you are seeing the right number of records with the relevant populated columns for each step.
However, more information is needed to dig into your problem any further.

How do I replace values in a column in KNIME?

I have a column of countries with 50 different values that I want to reduce to United States and Other.
Can someone help me with that?
Another example is Age which has 48 values that I'd like to reduce to only 4 like 1 to 18 = youth, 18-27 = starting, etc.
I've actually got about 5 columns that I want to reduce the values of. So would I need to repeat the process multiple times in KNIME or can I accomplish multiple column value replacements at once?
The latter one can easily be achieved with the Rule Engine:
$Col0$ >= 1 AND $Col0$ < 18 => "youth"
For the first problem I'd use a String Replace (Dictionary).
I don't think you can replace all columns at once, but you can loop over the columns.
For the second case I would use Numeric Binner:
For each column a number of intervals - known as bins - can be
defined. Each of these bins is given a unique name (for this column),
a defined range, and open or closed interval borders. They
automatically ensure that the ranges are defined in descending order
and that interval borders are consistent. In addition, each column is
either replaced with the binned, string-type column, or a new binned,
string-type column is appended.
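For readers who want to see the two reductions spelled out, here is a minimal Python sketch of the same idea outside KNIME. The function names and the dictionary-per-row layout are assumptions made for the illustration. Only the two age ranges given in the question are defined, and ages outside them are left unbinned.

```python
# Half-open ranges [low, high), taken from the question; the remaining
# bins were not specified there, so they are not invented here.
AGE_BINS = [(1, 18, "youth"), (18, 27, "starting")]


def reduce_country(country):
    """Collapse every country other than the United States into 'Other'."""
    return "United States" if country == "United States" else "Other"


def bin_age(age):
    """Return the label of the bin containing age, or None if no bin matches."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    return None


# One reducer per column lets several columns be handled in a single pass.
REDUCERS = {"country": reduce_country, "age": bin_age}


def reduce_row(row):
    """Apply the matching reducer to each column, leaving other columns as-is."""
    return {col: REDUCERS[col](val) if col in REDUCERS else val for col, val in row.items()}


print(reduce_row({"country": "Canada", "age": 17, "id": 7}))
```

Mapping each column name to its own reducer mirrors looping over columns: adding a column means adding one entry to `REDUCERS`.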

getting started with loops and iteration in sequel pro

I'm using sequel pro to select data from several tables. There are two things I need to do that seem to need a loop of some kind. I have never used any form of iteration in sql and can't find a beginners-level resource to learn from.
Can anyone suggest how to do the following two tasks, or suggest a tutorial where I can learn the fundamentals and figure it out from there:
Task 1: Go through a version history table, find the relevant history record for a given id that applied at a given date, and select the value from that record. The form of the history table is:
id, Item_id, version-created_at, value
eg
1, 123, 2014-05-01, 754
2, 456, 2014-05-10, 333
3, 123, 2014-05-27, 709
and I need to find what the value of item 123 was on the date 2014-05-25 (i.e. I need to find record id=1 and value = 754, because that is the most recent version for item 123 created prior to my target date).
So I figure I need to run through the table looking for item 123 and comparing dates of those records. But I don't know how to deal with the iteration of moving from one record to the next and comparing them.
Task 2: Go through a single text field that contains a number of product id and matching product prices in a string, and find the id of the product with the lowest price. Form of the string is a series of pairs of price "p" and id "i", in random order, like this:
"
- :p: 99.8
:i: 3
- :p: 59.0
:i: 5
- :p: 109.8
:i: 18
- :p: 82.45
:i: 46
"
and in this example I need to find "5", being the id of the product with the lowest price $59.
So I figure I need to step through each of the p/i sets, maybe by counting characters, but I have no idea how to iterate through and compare to find the best price.
A little help would go a long way.
Thanks.
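For the first task you do not need a loop. Filter on the item and on versions created on or before the target date, then take the most recent one, something like this:
SELECT value FROM history WHERE Item_id = 123 AND `version-created_at` <= '2014-05-25' ORDER BY `version-created_at` DESC LIMIT 1;
For the second task, you should do the parsing in the front end (application code) rather than in the back end.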
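Both tasks can be sketched in Python as a starting point. The names here are assumptions for illustration only: the table is called `history`, the date column is spelled `version_created_at` with an underscore, dates are stored as ISO strings, and within each pair the price (`:p:`) is assumed to come before the id (`:i:`), as in the sample. SQLite stands in for MySQL so the sketch is self-contained.

```python
import re
import sqlite3


def value_as_of(conn, item_id, target_date):
    """Task 1: value of the latest version created on or before target_date."""
    row = conn.execute(
        "SELECT value FROM history "
        "WHERE item_id = ? AND version_created_at <= ? "
        "ORDER BY version_created_at DESC LIMIT 1",
        (item_id, target_date),
    ).fetchone()
    return row[0] if row else None


def cheapest_product_id(text):
    """Task 2: id of the lowest-priced product in a ':p: price / :i: id' string."""
    pairs = re.findall(r":p:\s*([\d.]+)\s*:i:\s*(\d+)", text)
    if not pairs:
        return None
    _price, product_id = min(pairs, key=lambda pair: float(pair[0]))
    return int(product_id)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (id INTEGER, item_id INTEGER, version_created_at TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?)",
    [(1, 123, "2014-05-01", 754), (2, 456, "2014-05-10", 333), (3, 123, "2014-05-27", 709)],
)
prices = "- :p: 99.8\n:i: 3\n- :p: 59.0\n:i: 5\n- :p: 109.8\n:i: 18\n- :p: 82.45\n:i: 46\n"

print(value_as_of(conn, 123, "2014-05-25"))
print(cheapest_product_id(prices))
```

Neither task needs an explicit loop in SQL: the first is a filter plus ORDER BY and LIMIT, and the second is ordinary string parsing in the application.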

Sorting/Ordering sequenced pairs of data in MySQL?

I am trying to determine if there's a way to sort rows of a MySQL table that consists of start/finish columns. (Could also be thought of as parent/child relations or other linked list arrangement)
Here's an example of how the data is currently stored:
id start finish
2 stepthree stepfour
6 stepfive stepsix
9 stepone steptwo
78 stepfour stepfive
121 steptwo stepthree
(The id numbers in this are not relevant, just using them to indicate additional columns of arbitrary data)
I want to sort/display these rows in an order that traverses the start -> finish chain, assuming I always start with "stepone": each row's "finish" is followed by the row that has the same value as its "start".
desired output
9 stepone steptwo
121 steptwo stepthree
2 stepthree stepfour
78 stepfour stepfive
6 stepfive stepsix
There shouldn't be any branching/splits normally, just a sequential series of steps or states. I can't use simple alpha sorting (in my case the start and finish values are codes created by a customer), but can't figure out any other way to order these using SQL. I could programmatically do it using most languages, but stumped about doing it just with SQL.
Any clever ideas?
I would recommend having another table that has each step mapped to its precedence order.
Then you can write a query to sort each row in the order of precedence of the start step.
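If adding a precedence table is not an option, a different technique that stays in SQL is a recursive common table expression, which MySQL supports from version 8.0. The sketch below is an alternative to the suggestion above, not an implementation of it. It runs the query through Python's sqlite3 module so that it is self-contained; the table name `steps` is an assumption, and the chain is assumed to have no cycles or branches.

```python
import sqlite3

# Start from the row whose start is the given first step, then repeatedly
# join each row's finish to the next row's start, tracking the depth.
CHAIN_QUERY = """
WITH RECURSIVE chain(id, start, finish, depth) AS (
    SELECT id, start, finish, 1 FROM steps WHERE start = ?
    UNION ALL
    SELECT s.id, s.start, s.finish, c.depth + 1
    FROM steps AS s JOIN chain AS c ON s.start = c.finish
)
SELECT id, start, finish FROM chain ORDER BY depth
"""


def ordered_steps(conn, first_step="stepone"):
    """Return the rows of steps in chain order, beginning at first_step."""
    return conn.execute(CHAIN_QUERY, (first_step,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steps (id INTEGER, start TEXT, finish TEXT)")
conn.executemany(
    "INSERT INTO steps VALUES (?, ?, ?)",
    [
        (2, "stepthree", "stepfour"),
        (6, "stepfive", "stepsix"),
        (9, "stepone", "steptwo"),
        (78, "stepfour", "stepfive"),
        (121, "steptwo", "stepthree"),
    ],
)
for row in ordered_steps(conn):
    print(row)
```

Ordering by the depth counter gives the traversal order without relying on the step names or ids.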