Delete duplicate rows in SAS - csv

I am trying to delete duplicate rows from a csv file using SAS but haven't been able to do so. My data looks like this:
site1,variable1,20151126000000,22.8,140,1
site1,variable1,20151126010000,22.8,140,1
site1,variable2,20151126000000,22.8,140,1
site1,variable2,20151126000000,22.8,140,1
site2,variable1,20151126000000,22.8,140,1
site2,variable1,20151126010000,22.8,140,1
The 4th row is a duplicate of the 3rd one. This is just an example; I have more than a thousand records in the file. I tried doing this by creating subsets but didn't get the desired results. Thanks in advance for any help.

I think you can use nodupkey for this; just reference your key, or you can use _all_:
proc sort data = file nodupkey;
by _all_;
run;

In this article you will find different options for removing duplicate rows: https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf
If all columns are sorted, the easiest way is to use the noduprecs option:
proc sort data = file noduprecs;
by some_column;
run;
In contrast to the nodupkey option, no matter which column or columns you state after the by statement, it will always remove duplicate rows based on all columns.
Edit: Apparently, all columns have to be sorted, because noduprecs only compares consecutive records (have a look at the comment below).

Related

How to make MySQL MATCH...AGAINST use various word separators?

I have a table with 300K string values. These values contain all types of word separators so it looks like this:
id value
1 A B C
2 A B_C
3 A_B-C
4 A-B-C
Let's say I want to find all four rows containing A and B. This query
SELECT * FROM table WHERE MATCH(value) AGAINST('+A +B' IN BOOLEAN MODE);
will return only one row with space separated values:
1 A B C
Is there a way to make MATCH...AGAINST use other word separators? I tried to use LIKE and it was too slow.
You will probably want to alter your app and schema just a little bit to solve this problem. You have two tasks:
Task 1: Transform your existing data
Assuming you need to keep the source data unchanged:
Step 1: Add a field to your schema, "searchFriendly", same datatype as the source data.
Step 2: Write a script to transform the data you already have. Read the whole data set and do string replaces to convert the separators to spaces.
Step 3: Save that transformed data to the new searchFriendly field.
Task 2: Modify the app so that all future saves/updates of this data also perform the transformation and store the result as well.
Step 1: Find the part of the app that saves these records.
Step 2: Before actually writing the data to the database, perform the transformation.
Step 3: Add the transformed data to your API call to save/update the record, under the searchFriendly field.
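For Task 1, a minimal backfill sketch in Python might look like the following (the table name strings, the connection details, and the use of PyMySQL are assumptions; only the searchFriendly column comes from the steps above):

import re
import pymysql

# Placeholder connection details -- adjust to your environment.
conn = pymysql.connect(host="localhost", user="app", password="secret", database="mydb")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, value FROM strings")
        for row_id, value in cur.fetchall():
            # Collapse any run of non-alphanumeric separators into a single space.
            search_friendly = re.sub(r"[^0-9A-Za-z]+", " ", value).strip()
            cur.execute("UPDATE strings SET searchFriendly = %s WHERE id = %s",
                        (search_friendly, row_id))
    conn.commit()
finally:
    conn.close()

Once the column is populated (and covered by a FULLTEXT index), point the MATCH...AGAINST query at searchFriendly instead of the original column.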

Apache NiFi: Creating a new column by comparing multiple rows with various data

I have a csv which looks like this:
first,second,third,num1,num2,num3
12,312,433,0787388393,0783452323,01123124
12,124345,453,07821323,077213424,0123124421
33,2432,214,077213424,07821323,0234234211
I have to create another column according to the data stored in num1 and num2. There can be various values in the columns, but the new column should only contain 2 values: either original or fake. (I should only compare the first 3 digits of both num1 and num2.)
For the mapping part I have another csv which looks like this (I have many more rows):
078,078,fake
072,078,original
077,078,original
My Output csv should look like this after mapping:
first,second,third,num1,num2,num3,status
12,312,433,0787388393,0783452323,01123124,fake
12,124345,453,07821323,072213424,0123124421,original
33,2432,214,078213424,07821323,0234234211,fake
Hope you can suggest a NiFi workflow to get this done.
You can use LookupRecord for this, but due to the special logic you'll likely have to write your own ScriptedLookupService to read in the mapping file and compare the first 3 digits.
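The NiFi scripting hooks themselves are not shown here, but the lookup logic such a ScriptedLookupService would implement can be sketched in plain Python (the file names and the fallback value used when no mapping entry matches are assumptions):

import csv

# Load the mapping file into a dict keyed by the two 3-digit prefixes.
# Assumed layout: prefix_of_num1,prefix_of_num2,status
mapping = {}
with open("mapping.csv", newline="") as f:
    for prefix1, prefix2, status in csv.reader(f):
        mapping[(prefix1, prefix2)] = status

def lookup_status(num1, num2):
    # Compare only the first 3 digits of num1 and num2.
    return mapping.get((num1[:3], num2[:3]), "fake")  # assumed fallback

# Enrich the data csv with the new status column.
with open("data.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["status"])
    writer.writeheader()
    for row in reader:
        row["status"] = lookup_status(row["num1"], row["num2"])
        writer.writerow(row)

Inside the NiFi flow, the same dictionary lookup would sit in the service's lookup method, with LookupRecord passing the num1 and num2 fields in as coordinates.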

MySQL to CSV - separating multiple values

I have downloaded a MySQL table as CSV, which has over thousand entries of the following type:
id,gender,garment-color
1,male,white
2,"male,female",black
3,female,"red,pink"
Now, when I am trying to create a chart out of this data, it is taking "male" as one value, and "male,female" as a separate value.
So, for the above example, rather than counting 2 "male", and 3 "female", the chart is showing 3 separate categories ("male", "female", "male,female"), with one count each.
I want the output as follows, for chart to have the correct count:
id,gender,garment-color
1,male,white
2,male,black
2,female,black
3,female,red
3,female,pink
The only way I know is to copy the row in MS Excel and adjust the values manually, which is too tedious for 1000+ entries. Is there a better way?
From MySQL command line or whatever tool you are using to send queries to MySQL:
select * from the_table
into outfile '/tmp/out.txt' fields terminated by ',' enclosed by '"'
Then download /tmp/out.txt from the server and it should be good to go, assuming your data is good. If it is not, you might need to massage it with some SQL functions in the select.
The csv likely came from a poorly designed/normalized database that had both of those values in the same row. You could try using selects and updates, along with some built-in string functions, to spawn additional rows containing the extra values and to strip those values out of the original rows; but you would have to repeat this until all commas are removed (some fields may contain more than one), and you would have to decide whether a row with comma-separated lists in several fields should be multiplied out (i.e. should 2 genders and 4 colors mean 8 rows total?).
More likely, you'll want to create additional tables such as X_garmentcolors and X_genders, where X is whatever the original table is supposed to describe. These tables would have an X_id field referencing the original row and a garmentcolor/gender value field holding one of the values from the original row's list. Ideally, they should reference gender/garmentcolor lookup tables instead of holding the actual values, but you'd have to do the grunt work of picking out all the unique colors and genders from your data first. Once that is done, you can do something like:
INSERT INTO X_[garmentcolor|gender] (X_id, Y_id)
SELECT X.X_id, Y.Y_id
FROM originalTable AS X
INNER JOIN valueTable AS Y
ON X.Y_valuelist LIKE CONCAT('%,', Y.value) -- Value at end of list
OR X.Y_valuelist LIKE CONCAT('%,', Y.value, ',%') -- Value in middle of list
OR X.Y_valuelist LIKE CONCAT(Y.value, ',%') -- Value at start of list
OR X.Y_valuelist = Y.value -- Value is entire list
;
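If all you need is a flattened CSV for the chart (rather than a redesigned schema), a short pandas sketch can split the comma-separated values into separate rows; the file names are placeholders:

import pandas as pd

# Read the exported CSV (placeholder file name).
df = pd.read_csv("garments.csv")

# Split each comma-separated list into a Python list, then explode
# every list entry onto its own row (explode requires pandas 0.25+).
for col in ["gender", "garment-color"]:
    df[col] = df[col].str.split(",")
    df = df.explode(col)

df.to_csv("garments_split.csv", index=False)

Exploding both columns produces the cross product for rows that have lists in both fields, which is the "2 genders and 4 colors mean 8 rows" case mentioned above.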

How to prevent use of the first row Pandas DataFrame as column names when using to_sql

I have a dataframe loaded from a CSV file, which includes a header row. After assigning the returned dataframe from read_csv, I'm trying to add the rows to a MySQL database table using an SQLAlchemy engine. My method call looks like this:
my_dataframe.to_sql(name="my_table",
con=alch_engine,
if_exists="append",
chunksize=50,
index=False,
index_label=None)
However, the table already exists, and the values of the dataframe header don't match the column names, and so I get a MySQL error (1054, "Unknown Column 'Col1' in 'field_list'")
I would like to not use the first row at all and run the insert query without specifying the column names. I have not found a solution for this in the Pandas manual.
Thank you for your help,
AFAIK you cannot do that with .to_sql(). But you can modify the dataframe to match the column names in the table. Provided db_cols is a list/array/series/iterable containing the names, this should do:
(my_dataframe
.rename(columns=dict(zip(my_dataframe.columns, db_cols)))
.to_sql(name="my_table",
con=alch_engine,
if_exists="append",
chunksize=50,
index=False,
index_label=None))
Old, but I came across this. As far as I know, when you create the dataframe in the first place you can specify header=None; then the dataframe has no column names and the first row is treated as data.
I've only used it for Excel, but I assume CSV is the same:
my_dataframe = pd.read_csv(full_path, header=None)
Then when you use to_sql, it won't have the column names. It seems pandas then attempts to use numbers as the column names for its insert statement. I suppose it depends on the db engine whether that is accepted as valid.
ie it generates something like:
INSERT INTO [table] (0, 1) VALUES (%(0)s, %(1)s)
(The column names above would normally be quoted in the generated statement; the quoting is omitted here.)
I found a simple way out of this problem.
First, read the very first line, i.e. the header, and save it as a list (header_list).
Second, create a dataframe without skipping any rows, and do not use the names argument:
df = pandas.read_csv(input_file, quotechar='"', skiprows = skip_row_count, nrows = num_of_lines_per_iter)
This will create the table with the first row as the table header and insert the rest of the rows as data.
Third, if the table already exists, create the data frame again, this time using the names argument:
df = pandas.read_csv(input_file, quotechar='"', skiprows=skip_row_count, nrows=num_of_lines_per_iter, names=header_list)
This will ensure the data in the data frame is inserted into the corresponding columns by matching the column names in data frame against the column names in the table.
Finally, you can use the skiprows argument to skip the header.
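Putting the pieces together, a minimal sketch of this chunked-load approach (input_file and alch_engine come from the question above; the table name and chunk size are assumptions) might look like:

import pandas as pd

chunk_size = 50  # assumed chunk size

# First, read just the header line and keep the column names as a list.
header_list = pd.read_csv(input_file, nrows=0).columns.tolist()

# Then read the data in chunks, skipping the header line and reusing
# the saved column names, and append each chunk to the existing table.
for chunk in pd.read_csv(input_file, quotechar='"', skiprows=1,
                         names=header_list, chunksize=chunk_size):
    chunk.to_sql(name="my_table", con=alch_engine,
                 if_exists="append", index=False)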

How can a CSV file with counter columns be loaded to a Cassandra CQL3 table

I have a CSV file in the following format
key1,key2,key3,counter1,counter2,counter3,counter4
1,2,1,0,0,0,1
1,2,2,0,1,0,4
The CQL3 table has all value columns of type 'counter'. When I try to use the COPY command to load the CSV I get the usual error which asks for an UPDATE instead of an INSERT.
The question is : how can I tell CQL to use an UPDATE ?
Is there any other way to do this ?
Using sstables solved this issue. Although it was a little slower than I expected, it does the job.
To update a counter column you have to delete it (with consistency set to ALL) and then insert a new value (same consistency).
So my advice is to use a HashMap in your program and determine which value you want to write to the counter column (oldest, highest, lowest, ...).
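Since COPY only generates INSERTs, another way to "tell CQL to use an UPDATE" is to load the file from a small driver script that increments the counters. A minimal sketch with the Python cassandra-driver (the contact point, keyspace, table name, file name, and integer key types are assumptions) could look like this:

import csv
from cassandra.cluster import Cluster

# Assumed contact point and keyspace.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

# Counter columns can only be changed by incrementing them in an UPDATE.
update = session.prepare(
    "UPDATE mytable "
    "SET counter1 = counter1 + ?, counter2 = counter2 + ?, "
    "    counter3 = counter3 + ?, counter4 = counter4 + ? "
    "WHERE key1 = ? AND key2 = ? AND key3 = ?"
)

with open("counters.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Key columns are assumed to be integers here; adjust to your schema.
        session.execute(update, (
            int(row["counter1"]), int(row["counter2"]),
            int(row["counter3"]), int(row["counter4"]),
            int(row["key1"]), int(row["key2"]), int(row["key3"]),
        ))

cluster.shutdown()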