SPSS: How do I generate ID numbers from client ID variable that contains duplicate IDs - duplicates

I have a dataset with thousands of rows, in which each person is assigned a ClientID. I would like to use the ClientID variable to generate a new ID variable which starts at 1. Some ClientIDs are duplicated, so I would like to make sure that duplicate ClientIDs are given the same ID number. ClientIDs are strings and my data has to be sorted by TimeStamp.
My data looks like:
ClientID TimeStamp
15137.45692 15/03/2021
10489.15789 03/02/2021
14143.96745 01/01/2021
15137.45692 15/01/2021
15137.45692 27/02/2021
14143.96745 08/03/2021
I would like it to look like:
ID ClientID TimeStamp
1 14143.96745 01/01/2021
2 15137.45692 15/01/2021
3 10489.15789 03/02/2021
2 15137.45692 27/02/2021
1 14143.96745 08/03/2021
2 15137.45692 15/03/2021
How do I do this?
I would do it in Excel, but I have over 250k rows of data and Excel keeps crashing.
Thanks

The following syntax creates ID=1 and then adds 1 only in case of a new ClientID:
sort cases by ClientID.
compute ID=1.
if $casenum>1 ID=lag(ID)+(ClientID<>lag(ClientID)).
exe.
EDIT:
Here's another nice way to do it using rank function:
RANK VARIABLES=ClientID (A) /RANK /PRINT=NO /TIES=CONDENSE.
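
For readers who want to see the lag logic outside SPSS, here is a minimal Python sketch of the same idea: sort the ClientIDs and bump a counter each time the value changes. The function name assign_ids is made up for illustration. Note that, like the syntax above, it numbers clients in ClientID order, not in order of first appearance.

```python
def assign_ids(client_ids):
    """Map each distinct ClientID to 1, 2, 3, ... in sorted ClientID order."""
    ids = {}
    current = 0
    previous = None
    for cid in sorted(client_ids):
        # Same idea as ID = lag(ID) + (ClientID <> lag(ClientID)):
        # the counter only moves when the sorted value changes.
        if cid != previous:
            current += 1
        ids[cid] = current
        previous = cid
    return ids


# Sample rows from the question: (ClientID, TimeStamp), IDs kept as strings.
rows = [
    ("15137.45692", "15/03/2021"),
    ("10489.15789", "03/02/2021"),
    ("14143.96745", "01/01/2021"),
    ("15137.45692", "15/01/2021"),
    ("15137.45692", "27/02/2021"),
    ("14143.96745", "08/03/2021"),
]

id_by_client = assign_ids(cid for cid, _ in rows)
# Attach the new ID to every row without changing the row order.
numbered = [(id_by_client[cid], cid, ts) for cid, ts in rows]
```

Duplicated ClientIDs always receive the same ID because the mapping is built per distinct value and then looked up for each row.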

Related

SPSS: How do I generate ID numbers from client ID variable that contains duplicate IDs in the order of the first date of each ID

Previously, I asked how to generate ID numbers from a client ID variable that contains duplicate IDs. I will use the same example data in this question but I would like to know how to generate ID numbers in the order of the first date of each ID. My client ID variable is string and has to remain as string.
My Data looks like:
ClientID TimeStamp
15137.45692 15/03/2021
10489.15789 03/02/2021
14143.96745 01/01/2021
15137.45692 15/01/2021
15137.45692 27/02/2021
14143.96745 08/03/2021
I would like it to look like:
ID ClientID TimeStamp
1 14143.96745 01/01/2021
1 14143.96745 08/03/2021
2 15137.45692 15/01/2021
2 15137.45692 27/02/2021
2 15137.45692 15/03/2021
3 10489.15789 03/02/2021
The previous code I tried was this:
sort cases by ClientID.
compute ID=1.
if $casenum>1 ID=lag(ID)+(ClientID<>lag(ClientID)).
exe.
However, whilst it gave me ID numbers for each ID, those ID numbers weren't ordered by TimeStamp.
To create the new ID, the data has to be sorted by ClientID. But then the new IDs follow the order of ClientID, whereas the order you want is by each client's first date of appearance. So first we need to calculate the first date for every ClientID, then we can use that as the leading sort key before creating the new ID.
Note: you need to make sure TimeStamp is defined as a date variable.
aggregate outfile=* mode=addvariables /break=ClientID /firstDate=min(TimeStamp).
sort cases by firstDate ClientID.
compute ID=1.
if $casenum>1 ID=lag(ID)+(ClientID<>lag(ClientID)).
exe.
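
The same two steps (find each client's first date, then number clients in that order) can be sketched in Python. This is only an illustration under the assumption that TimeStamp is in dd/mm/yyyy format; the function name ids_by_first_date is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # assumption: TimeStamp is day/month/year


def ids_by_first_date(rows):
    """Number ClientIDs 1, 2, 3, ... by the earliest date each one appears."""
    first_date = {}
    for cid, ts in rows:
        d = datetime.strptime(ts, DATE_FORMAT)
        # Equivalent of AGGREGATE ... /firstDate=min(TimeStamp) per ClientID.
        if cid not in first_date or d < first_date[cid]:
            first_date[cid] = d
    # Sort by first date, then ClientID to break ties deterministically.
    ordered = sorted(first_date, key=lambda cid: (first_date[cid], cid))
    return {cid: i for i, cid in enumerate(ordered, start=1)}


rows = [
    ("15137.45692", "15/03/2021"),
    ("10489.15789", "03/02/2021"),
    ("14143.96745", "01/01/2021"),
    ("15137.45692", "15/01/2021"),
    ("15137.45692", "27/02/2021"),
    ("14143.96745", "08/03/2021"),
]

id_by_client = ids_by_first_date(rows)
# Lay the rows out as in the desired output: by ID, then by date.
result = sorted(
    ((id_by_client[cid], cid, ts) for cid, ts in rows),
    key=lambda r: (r[0], datetime.strptime(r[2], DATE_FORMAT)),
)
```

Parsing the dates before comparing them matters for the same reason as the note above about defining TimeStamp as a date variable: comparing dd/mm/yyyy strings as text would give the wrong order.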

insert ignore or replace ignore not working

I'm moving data from one SQL table to a second table using
insert ignore or replace into
It isn't working, I believe because I don't have a unique key. I also don't know where I would put the key.
I need the second table to display the last zone the number was seen on that date. I can add a time column if needed.
Example Data:
number zone date
1 zone3 01-02-03
1 zone1 01-02-03
1 zone3 01-02-03
2 zone1 01-02-03
3 zone2 01-02-03
If I put number as a unique key it doesn't get added when the date changes.
If I add date as a unique key only one row gets added on that date.
The query:
REPLACE INTO database.table2 (number,zone,date)
SELECT number,zone,date
FROM database.table1
GROUP BY number,date;
I hoped that with number and date grouped it wouldn't duplicate records, but it is still adding multiple rows.
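
One possible direction, sketched here with Python's sqlite3 module and not tested against the asker's database: put a single unique key on the pair (number, date) instead of on either column alone, so one row is kept per number per date and a later row replaces an earlier one. SQLite's INSERT OR REPLACE stands in for REPLACE INTO here, and "last zone seen" is assumed to mean the row inserted last, since the table has no time column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (number INTEGER, zone TEXT, date TEXT);
    CREATE TABLE table2 (
        number INTEGER,
        zone   TEXT,
        date   TEXT,
        UNIQUE (number, date)  -- composite key: one row per number per date
    );
""")

# Example data from the question.
conn.executemany(
    "INSERT INTO table1 (number, zone, date) VALUES (?, ?, ?)",
    [
        (1, "zone3", "01-02-03"),
        (1, "zone1", "01-02-03"),
        (1, "zone3", "01-02-03"),
        (2, "zone1", "01-02-03"),
        (3, "zone2", "01-02-03"),
    ],
)

# Copy rows in insertion order (assumption: rowid order = order seen).
# A row that collides on (number, date) replaces the earlier one,
# so the last zone for each number and date is what survives.
source_rows = conn.execute(
    "SELECT number, zone, date FROM table1 ORDER BY rowid"
).fetchall()
for row in source_rows:
    conn.execute(
        "INSERT OR REPLACE INTO table2 (number, zone, date) VALUES (?, ?, ?)",
        row,
    )

result = conn.execute(
    "SELECT number, zone, date FROM table2 ORDER BY number"
).fetchall()
```

With a real time column, ordering the source rows by that column instead of rowid would make "last seen" explicit.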

Select all values from one table, check another table to see related columns and fetch more values

I really don't know how to phrase my question, which is probably why Google is not giving me the results I need, but I am going to try.
I have two tables, a required_files table and a submitted_files table. I have a page where I want to display to a user all the files required for submission and show which files he/she has submitted.
Required files table is as follows:
file_id file_name mandatory
1 Registration Certificate 0
2 KRA Clearance 1
3 3 Months Tax returns 0
4 Business Permit 1
5 Tour Permit 1
6 Country Govt Operating License 0
7 Certificate of good Conduct 0
file_id is unique; the mandatory column is a binary flag stating whether the file must be submitted before registration.
The submitted files table is as follows:
file_id user_id file_required_id original_file_name file_name_on_server submission_date
1 2 2 KRA_Form.docx 0a10f5291e9bcb6a345ac7a8f5705b8a.docx 2016-11-01
2 2 3 Tax_returns.docx 9f04361013df7e25235a03c506f347ed.docx 2016-11-03
3 3 3 Taxes.docx 86aea74cc87fb669510d9d4c488cbcf8.docx 2016-11-04
file_id is a unique auto-increment value, user_id is the unique ID of the currently logged-in user, and file_required_id refers to the files_required.file_id column.
When fetching the values I already have a user_id (in this case, let's use user_id = 2). Now I want to fetch all rows of the files_required table and check the submitted files table for rows with user_id = 2, meaning the user has submitted those files.
My SQL query is as follows:
SELECT files_required.*, submitted_files.* FROM submitted_files
RIGHT JOIN files_required ON files_required.id = submitted_files.file_required_id
WHERE submitted_files.user_id = 2
This gives me only the two rows where the user_id matched, but I want every row of the files_required table, showing which files the user has submitted. Could someone kindly assist?
In the meantime, I am fetching the files_required table first, then looping through the other table in a PHP script to look for files submitted by the given user. It works, but it is cumbersome, not what I wanted, and feels like a rookie move.
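Try putting the user_id condition in the RIGHT JOIN's ON clause, as in the query below. A WHERE filter on submitted_files.user_id discards the rows with no matching submission (where user_id is NULL), which effectively turns the outer join into an inner join.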
SELECT files_required.*, submitted_files.*
FROM submitted_files
RIGHT JOIN files_required ON files_required.id = submitted_files.file_required_id
AND submitted_files.user_id = 2
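
To see the difference between filtering in WHERE and filtering in ON, here is a small self-contained sketch using Python's sqlite3 module with the sample data from the question. It is written as a LEFT JOIN from files_required, which is equivalent to the RIGHT JOIN above and also runs on older SQLite versions that lack RIGHT JOIN. The id column name follows the asker's query, and only the columns needed for the demonstration are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files_required (
        id INTEGER PRIMARY KEY, file_name TEXT, mandatory INTEGER);
    CREATE TABLE submitted_files (
        file_id INTEGER PRIMARY KEY, user_id INTEGER,
        file_required_id INTEGER, original_file_name TEXT);
    INSERT INTO files_required VALUES
        (1, 'Registration Certificate', 0),
        (2, 'KRA Clearance', 1),
        (3, '3 Months Tax returns', 0),
        (4, 'Business Permit', 1),
        (5, 'Tour Permit', 1),
        (6, 'Country Govt Operating License', 0),
        (7, 'Certificate of good Conduct', 0);
    INSERT INTO submitted_files VALUES
        (1, 2, 2, 'KRA_Form.docx'),
        (2, 2, 3, 'Tax_returns.docx'),
        (3, 3, 3, 'Taxes.docx');
""")

# Condition in WHERE: unmatched required files have user_id NULL,
# so they are filtered out and only submitted files remain.
filtered_in_where = conn.execute("""
    SELECT r.id, s.original_file_name
    FROM files_required AS r
    LEFT JOIN submitted_files AS s ON r.id = s.file_required_id
    WHERE s.user_id = 2
    ORDER BY r.id
""").fetchall()

# Condition in ON: every required file is kept; the submission columns
# are NULL (None in Python) where user 2 has not submitted anything.
filtered_in_on = conn.execute("""
    SELECT r.id, s.original_file_name
    FROM files_required AS r
    LEFT JOIN submitted_files AS s
        ON r.id = s.file_required_id AND s.user_id = 2
    ORDER BY r.id
""").fetchall()
```

In the page that lists the files, a None in the second column would then mean "not yet submitted".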
You want this.
SELECT submitted_files.user_id, files_required.*, submitted_files.*
FROM submitted_files
RIGHT JOIN files_required ON files_required.id = submitted_files.file_required_id
Don't put the WHERE condition on user_id, as it filters the result down to just that user's rows. You want all the records, with the user shown as well, so just add user_id to the SELECT list.

ruby code to direct all users to a particular location instead of multiple locations and delete duplicates

I have a User model and a Location model. Each user belongs to a particular location in the Location model.
I have duplicate locations in the Location table.
How can I remove the duplicate rows in the Location table, keep one row, and make all users belong to that single row using Ruby? The two tables are connected through the location_id attribute.
I tried to do this through migration:
def dedupe(model, *key_attrs)
  model.select(key_attrs).group(key_attrs).having('count(*) > 1').each { |duplicates|
    dup_rows = model.where(duplicates.attributes.slice(key_attrs)).to_a
    # the first one we want to keep right?
    first_one = dup_rows.shift #stored the first one
    dup_rows.each{ |double| double.destroy } # duplicates can now be destroyed
  }
end
But the foreign key constraint on User prevents the migration from running. How can I achieve this?
Current Models are :
User
user_id name location_id
1 tim 1
2 adam 2
3 Joy 3
Location
location_id name
1 NewYork
2 NewYork
3 NewYork
Expected Output:
User
user_id name location_id
1 tim 1
2 adam 1
3 Joy 1
Location
location_id name
1 NewYork
Kinda ugly, but you can use a subquery:
First, grab the first occurrence of each set of duplicate records:
original_duplicate_locations = Location.select("MIN(id) AS id, name, user_id").group(:name, :user_id).having("COUNT(id) > 1")
The extra duplicates are then the locations that have the same name and user_id but a different id:
duplicates_not_including_originals = Location.joins("JOIN (#{original_duplicate_locations.to_sql}) dupes ON locations.name = dupes.name AND locations.user_id = dupes.user_id AND locations.id <> dupes.id")
Hey, you can try it this way:
1) First, point every user at the first matching entry in the locations table:
User.joins(:location).update_all("location_id = (select id from locations as l2 where l2.name = locations.name limit 1)")
Note: you can also add ORDER BY id to the subquery if it does not return the first entry from the table.
2) Destroy all entries in the locations table except the first one for each name.
Before doing this, make sure every user has actually been updated to the first id of its repeated location, because the data cannot be recovered after deletion. Then destroy the repeated entries, keeping only the first:
Location.where("id not in (?)", Location.select("min(id) as id").group("name").map(&:id)).destroy_all
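
The order of operations is what gets around the foreign key error: repoint the users first, then delete the locations nothing references any more. Here is a minimal sketch of that sequence using Python's sqlite3 module with foreign keys enforced and the sample data from the question. The table and column names (users, locations, id) are assumptions based on Rails conventions, and this is plain SQL standing in for the ActiveRecord calls above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the constraint, as in the question
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        location_id INTEGER REFERENCES locations(id)
    );
    INSERT INTO locations VALUES (1, 'NewYork'), (2, 'NewYork'), (3, 'NewYork');
    INSERT INTO users VALUES (1, 'tim', 1), (2, 'adam', 2), (3, 'Joy', 3);
""")

with conn:  # run both steps in one transaction
    # Step 1: repoint each user to the lowest id among locations
    # that share the name of the user's current location.
    conn.execute("""
        UPDATE users
        SET location_id = (
            SELECT MIN(l2.id)
            FROM locations AS l2
            WHERE l2.name = (
                SELECT l1.name FROM locations AS l1
                WHERE l1.id = users.location_id
            )
        )
    """)
    # Step 2: the duplicates are no longer referenced, so deleting
    # everything except the first id per name does not violate the key.
    conn.execute("""
        DELETE FROM locations
        WHERE id NOT IN (SELECT MIN(id) FROM locations GROUP BY name)
    """)

users = conn.execute("SELECT id, name, location_id FROM users ORDER BY id").fetchall()
locations = conn.execute("SELECT id, name FROM locations ORDER BY id").fetchall()
```

Running the two steps in the opposite order would fail on the delete, which matches the error seen in the migration.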

Display a table with just the second duplicate rows removed yet keep the first row

So, I have a table with 3 columns, of which the first column consists of IDs and the last column consists of dates. What I need is, to sort the table by dates, and remove any duplicate IDs with a later date (and keep the ID with the earliest date).
For example,
This is how my table originally looks like -
123 Ryan 01/01/2011
345 Carl 03/01/2011
123 Lisa 01/02/2012
870 Tiya 06/03/2012
345 Carl 07/01/2012
I want my resultant table to look like this -
123 Ryan 01/01/2011
345 Carl 03/01/2011
870 Tiya 06/03/2012
I'm using VBA Access Code to find a solution for the above, and used SQL Queries too, however my resultant table either has no duplicates whatsoever or displays all the records.
Any help will be appreciated.
This will create a new table:
SELECT tbl.SName, a.ID, a.BDate
INTO NoDups
FROM tbl
INNER JOIN (
SELECT ID, Min(ADate) As BDate
FROM tbl GROUP BY ID) AS a
ON (tbl.ADate = a.BDate) AND (tbl.ID = a.ID);
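
To check the query against the sample data, here is a small sketch using Python's sqlite3 module. SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT stands in for it. The dates are assumed to be dd/mm/yyyy and are stored in ISO form (yyyy-mm-dd) so that Min compares them correctly as text; in Access the column should be a real Date/Time field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (ID INTEGER, SName TEXT, ADate TEXT);
    INSERT INTO tbl VALUES
        (123, 'Ryan', '2011-01-01'),
        (345, 'Carl', '2011-01-03'),
        (123, 'Lisa', '2012-02-01'),
        (870, 'Tiya', '2012-03-06'),
        (345, 'Carl', '2012-01-07');

    -- Keep, for each ID, only the row whose date equals that ID's earliest date.
    CREATE TABLE NoDups AS
    SELECT tbl.SName, a.ID, a.BDate
    FROM tbl
    INNER JOIN (
        SELECT ID, MIN(ADate) AS BDate
        FROM tbl
        GROUP BY ID
    ) AS a
    ON tbl.ADate = a.BDate AND tbl.ID = a.ID;
""")

no_dups = conn.execute(
    "SELECT ID, SName, BDate FROM NoDups ORDER BY BDate"
).fetchall()
```

If an ID has two rows with the same earliest date, both rows pass the join, so an extra tie-breaker would be needed in that case.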