Stata: from a bunch of duplicates, keep a particular one

In Stata, I have 3 variables: "objectid", "year", and "count". There are several duplicates in terms of "objectid" and "year". From these duplicates, I would like to keep the one with the highest value in "count".

This is standard stuff requiring only (1) getting the observations into a sort order where the one you want is identifiable and (2) working under the aegis of by:. See the manual entries for by: and/or http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
bysort objectid year (count) : keep if _n == _N
Note that if count is ever missing, that value will be the one kept.
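For readers who want to check the logic outside Stata, the same sort-then-keep-last idea can be sketched in Python. This is only an illustration: the sample rows and the helper name keep_max_count are hypothetical, and missing counts are represented as None and skipped here, which differs from the Stata one-liner above, where a missing count sorts last and is the one kept.

```python
from itertools import groupby

def keep_max_count(rows):
    """Keep, for each (objectid, year), the row with the highest non-missing count.

    rows: iterable of (objectid, year, count) tuples; count may be None (missing).
    Groups in which every count is missing are dropped entirely.
    """
    present = [r for r in rows if r[2] is not None]
    # Sort by group, then by count, so the wanted row is last within each group.
    present.sort(key=lambda r: (r[0], r[1], r[2]))
    kept = []
    for _, group in groupby(present, key=lambda r: (r[0], r[1])):
        kept.append(list(group)[-1])  # analogue of keeping _n == _N
    return kept

# Hypothetical data: two duplicates for (1, 2001), one missing count for (2, 2001)
rows = [(1, 2001, 5), (1, 2001, 9), (1, 2002, 3), (2, 2001, None), (2, 2001, 4)]
result = keep_max_count(rows)
```

Whether to skip missing counts or keep them is a design choice; the sketch skips them only to show the alternative to the behaviour noted above.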

Related

Access 2016 Table Calculated Field

I have a table containing the following: Five Y/N fields and a calculated field [Priority Results] that totals the number of 'Yeses' from those five y/n fields. I'm trying to create another calculated field that will return a value of Low, Medium or High dependent on the number of boxes that have been checked. [Priority Results] currently returns the values 0 through -5. Low = 0 & -1, Medium = -2, High = -3 or lower. I've tried SEVERAL different versions of If/Then, If/Else, Iif statements and always receive a syntax error. I've read a lot of different sites and the following expression seems to be the most commonly used, but I'm still getting the error. Anyone have any ideas? I've even tried this statement on a non-calculated field and can't get it to work.
IIf([Priority results]<="-1","Low",IIf([Priority results]="-2","Medium",IIf([Priority results]>="-3","High")))
Here are the calculated field [Priority results] properties.
Expression:
[Class Non-Attendance]+[Instructor Referral]+[Late Registration]+[Low Starting GPA]+[Talon Log-in]
Result Type: Long Integer
The part of the table this question relates to has the following fields:
Class Non-Attendance: Yes/No
Instructor Referral: Yes/No
Late Registration: Yes/No
Low Starting GPA: Yes/No
Talon Log-In: Yes/No
Priority Results: Calculated field counting the Yes/No fields above
Priority Outcome: Calculated field (that isn't working) prioritizing based on Priority Results

Don't put values that are compared against number fields in quotes.
Consider:
IIf(Abs([Priority Results])<=1, "Low", IIf(Abs([Priority Results])=2, "Medium", "High"))
In a query or textbox, expression could be:
Switch(Abs([Priority Results])<=1, "Low", Abs([Priority Results])=2, "Medium", True, "High")
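To sanity-check the thresholds, the same mapping can be sketched in Python. The function name priority_outcome is hypothetical; the input range 0 to -5 comes from the question (Access stores Yes as -1, so the sum of five Yes/No fields is zero or negative).

```python
def priority_outcome(priority_results):
    """Map the Yes/No total (0 to -5) to a priority label using its absolute value."""
    n = abs(priority_results)
    if n <= 1:
        return "Low"
    if n == 2:
        return "Medium"
    return "High"

# Labels for every possible total, from 0 down to -5
labels = [priority_outcome(v) for v in range(0, -6, -1)]
```

Taking the absolute value first avoids reasoning about reversed comparisons on negative numbers, which is what the nested IIf in the question gets wrong.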

Parts of the question still confuse me, which is why this answer will be brief. You have a calculated field PriorityOutcome based on another calculated field PriorityResults and that is the problem. Access doesn't calculate PriorityResults before calculating PriorityOutcome. Instead Access says PriorityResults doesn't exist yet and passes null to PriorityOutcome resulting in either an error or a silent fail.
There are several fixes you can mix and match. You can repeat the calculation for PriorityResults inside PriorityOutcome: wasteful but often the fastest solution. You can also add a code module with public functions to do part or all of the calculations, then refer to those public functions in your calculated fields; Access IntelliSense can find public functions.

Is it possible to remove duplicates in Power BI based on a time interval between the data?

I have a list of leads data.
The table has a lot of information such as date, name, email, mobile number, etc.
However, some of these leads are duplicates: the same person generated more than one lead.
What I want to do is remove the duplicate leads.
The problem is that, to be considered a duplicate lead, the email or the mobile number must appear in more than one row within a time interval of 30 days.
And only the leads that come later are considered duplicates. The first one is never a duplicate.
E.g.
1) If Jones generated a lead on 01/01/20 with his email abc#abc.com and then generated another lead 10 days later, on 10/01/20, the first lead is a single lead (not duplicated) and the second lead must be considered a duplicate.
2) If Maria generated a lead on 01/01/20 with her email xyz#abc.com and then generated another lead 40 days later, on 10/02/20, the first lead is a single lead (not duplicated) and the second lead must also be considered single (not a duplicate).
To mark a lead as duplicate or not, I want to generate a new column with the time since the previous lead of the same person (same email or same mobile number).
Then I want to generate another column with the label "Duplicate" or "Not Duplicate" based on the time shown in that column. If it is more than 30 days, the lead is single. Otherwise (less than 30 days) it is a duplicate.
Can someone please help me on how to do that?

Getting the lag/lead data is not very straightforward in Power BI. You will have to use a combination of EARLIER and some aggregate function to get the specific result. For your specific scenario, the following calculated column might work:
Day Difference =
VAR name1 = 'Table'[Name]
VAR Lastdate1 =
    MAXX(
        FILTER('Table', 'Table'[Name] = name1 && 'Table'[date] < EARLIER('Table'[date])),
        'Table'[date]
    )
RETURN
    IF(
        ISBLANK(Lastdate1),
        100,
        DATEDIFF(Lastdate1, 'Table'[date], DAY)
    )
Once the column is created, you can filter for records with a value greater than 30 to keep only the non-duplicate leads. I have replaced the blanks with 100 so that the first lead of each person is not removed when the filter is applied.
If you are looking for the "Tag" value, then the following calculation will get you the tag values directly:
Tag =
VAR name1 = 'Table'[Name]
VAR Lastdate1 =
    MAXX(
        FILTER('Table', 'Table'[Name] = name1 && 'Table'[date] < EARLIER('Table'[date])),
        'Table'[date]
    )
RETURN
    IF(
        IF(ISBLANK(Lastdate1), 100, DATEDIFF(Lastdate1, 'Table'[date], DAY)) <= 30,
        "Duplicate",
        "Single"
    )
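To check the expected tags against the two examples in the question, the previous-lead logic can be sketched in Python. This is an illustration under assumptions, not a Power BI implementation: the function name tag_leads is hypothetical, leads are matched by name only (as in the DAX above, not by email or mobile number), and a second lead on the same day counts as a gap of 0 days.

```python
from datetime import date

def tag_leads(leads, window_days=30):
    """Tag each lead as "Single" or "Duplicate".

    leads: list of (name, lead_date) tuples.
    A lead is a "Duplicate" when the same name has a previous lead
    at most window_days earlier; otherwise it is "Single".
    """
    last_seen = {}  # name -> date of that person's most recent lead so far
    tagged = []
    for name, lead_date in sorted(leads, key=lambda l: (l[1], l[0])):
        previous = last_seen.get(name)
        gap = None if previous is None else (lead_date - previous).days
        tag = "Duplicate" if gap is not None and gap <= window_days else "Single"
        tagged.append((name, lead_date, gap, tag))
        last_seen[name] = lead_date
    return tagged

# The two examples from the question (dates are day/month/year there)
leads = [
    ("Jones", date(2020, 1, 1)), ("Jones", date(2020, 1, 10)),
    ("Maria", date(2020, 1, 1)), ("Maria", date(2020, 2, 10)),
]
tags = {(name, d): tag for name, d, _, tag in tag_leads(leads)}
```

Here Jones's second lead comes out as "Duplicate" and Maria's second lead, 40 days after her first, as "Single".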

Detect duplicates on 2 attributes: nifi

I am trying to use NiFi to detect duplicates based on 2 attributes of flow files, so that within any one second there are no duplicate rows whose 2 particular attribute values are the same. In the DetectDuplicate processor, my property settings are:
Cache Entry Identifier : ${attribute1_name}::${attribute2_name}
Age Off Duration : 1 sec
Distributed Cache Service : DistributedMapCacheClientService
Still, I am getting duplicate rows for which the values of these 2 attributes are the same within one second.
Help is much appreciated. Thanks.

An "Age Off Duration" of 1 second means that a Cache Entry Identifier value that is a duplicate of one that arrived at least one second ago will NOT be considered a duplicate. That property is used to let entries "expire"; some users set it to 24 hours so that the next day the same values can show up again as "not previously seen". If you want to always maintain the "seen" values, leave "Age Off Duration" blank.
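The effect of an age-off duration can be illustrated with a toy model in Python. This is not NiFi's code or API: the class SeenCache and its method are hypothetical, times are plain numbers of seconds passed in by the caller, and the only behaviour modelled is the one described above (an entry older than the age-off duration no longer counts as seen).

```python
class SeenCache:
    """Toy model of duplicate detection with an optional age-off duration (seconds)."""

    def __init__(self, age_off=None):
        self.age_off = age_off   # None means entries never expire
        self.cached_at = {}      # key -> time at which the entry was cached

    def is_duplicate(self, key, now):
        cached = self.cached_at.get(key)
        expired = (
            cached is not None
            and self.age_off is not None
            and now - cached >= self.age_off
        )
        if cached is None or expired:
            self.cached_at[key] = now  # treat as new and (re)cache it
            return False
        return True

# With a 1-second age-off, the same key 1.2 s later is no longer a duplicate
short = SeenCache(age_off=1.0)
short_results = [short.is_duplicate("a::b", t) for t in (0.0, 0.5, 1.2)]

# With no age-off, the key stays "seen" indefinitely
forever = SeenCache()
forever_results = [forever.is_duplicate("a::b", t) for t in (0.0, 100.0)]
```

In this model the third arrival with the 1-second setting is treated as new, which matches the duplicates the question reports getting through.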

Stata related -selecting specific rows

I am currently working with a dataset that has information on individuals i = 1,...,N by time t = 1,...,T. I basically have a panel structure in my dataset. However, I want to select only one row of data from each individual. Specifically, I want to select only the last time period t=T for each individual i=1,...,N. How can I 'extract' this specific information from the bigger dataset?

In Stata [not STATA] rows are more properly called observations. You can "select" the last observation in each panel with the generic
bysort id (time) : ... if _n == _N
because, under the aegis of by:, the built-in variable _n identifies observations in each panel, and its sibling _N is the number of observations in each panel and therefore identifies the last observation in each panel.
This is well documented: e.g. see the help and manual entries explaining the by: prefix.

getting started with loops and iteration in sequel pro

I'm using sequel pro to select data from several tables. There are two things I need to do that seem to need a loop of some kind. I have never used any form of iteration in sql and can't find a beginners-level resource to learn from.
Can anyone suggest how to do the following two tasks, or suggest a tutorial where I can learn the fundamentals and figure it out from there:
Task 1: Go through a version history table, find the relevant history record for a given id that applied at a given date, and select the value from that record. The form of the history table is:
id, Item_id, version-created_at, value
eg
1, 123, 2014-05-01, 754
2, 456, 2014-05-10, 333
3, 123, 2014-05-27, 709
and I need to find what the value of item 123 was on the date 2014-05-25 (i.e. I need to find record id=1 and value = 754, because that is the most recent version for item 123 created prior to my target date).
So I figure I need to run through the table looking for item 123 and comparing dates of those records. But I don't know how to deal with the iteration of moving from one record to the next and comparing them.
Task 2: Go through a single text field that contains a number of product id and matching product prices in a string, and find the id of the product with the lowest price. Form of the string is a series of pairs of price "p" and id "i", in random order, like this:
"
- :p: 99.8
:i: 3
- :p: 59.0
:i: 5
- :p: 109.8
:i: 18
- :p: 82.45
:i: 46
"
and in this example I need to find "5", being the id of the product with the lowest price $59.
So I figure I need to step through each of the p/i sets, maybe by counting characters, but I have no idea how to iterate through and compare to find the best price.
A little help would go a long way.
Thanks.

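For the first task you can do something like this:
SELECT value FROM history WHERE Item_id = 123 AND `version-created_at` <= '2014-05-25' ORDER BY `version-created_at` DESC LIMIT 1;
This takes the most recent version of item 123 created on or before the target date, so no loop is needed. The column name contains a hyphen, so it has to be quoted with backticks.
For the second task, do the parsing in front-end (application) code rather than in the database.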
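As one possible way to handle the second task in application code, here is a minimal Python sketch. It assumes the string always alternates a ":p:" price with the ":i:" id that follows it, as in the sample from the question; the function name cheapest_product_id is hypothetical.

```python
import re

def cheapest_product_id(text):
    """Return the id paired with the lowest price in a ':p:' / ':i:' string."""
    # Each pair is ':p: <price>' followed by ':i: <id>', separated by whitespace.
    pairs = re.findall(r":p:\s*([0-9.]+)\s*:i:\s*(\d+)", text)
    if not pairs:
        return None
    _, product_id = min(pairs, key=lambda pair: float(pair[0]))
    return int(product_id)

# The sample string from the question
sample = """
- :p: 99.8
  :i: 3
- :p: 59.0
  :i: 5
- :p: 109.8
  :i: 18
- :p: 82.45
  :i: 46
"""
cheapest = cheapest_product_id(sample)
```

For the sample string this picks product 5, the one priced at 59.0.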
