Remove duplicate rows from multiple files in Notepad++

I have 350 data files, each containing about 4,000 rows. There are 3,000 unique rows, but some rows are duplicated, e.g.:
"2021-02-02",20.1,99,0,3.4
"2021-02-03",22.6,95,0,2.9
"2021-02-04",18.8,90,0,5.2
"2021-02-02",20.1,99,0,3.4
"2021-02-03",22.6,95,0,2.9
"2021-02-05",21.9,96,0.8,4.2
"2021-02-06",20.8,95,0,3.3
I would like to remove only the duplicate lines in each of the 350 files. However, the duplicate lines differ from file to file, i.e. some files may have other dates duplicated apart from the sample shown. The duplicate lines are random and not in any particular order. I used Line Operations in Notepad++ to sort the lines in ascending order and then remove duplicates. That works for one file, but repeating those steps 350 times would take a long time.

As mentioned in the comments, a script in your favorite scripting language is the best way.
But you may have a look at the steps below and adapt them to your needs.
I assume you have all the files, or part of them, in one directory. Please make a backup copy before you test.
Open one file in your workspace.
Open the Find dialog, e.g. with Ctrl+F.
Try for your needs Find what: ^(.*?)$\s+?^(?=.*^\1$)
Choose Regular expression and check ". matches newline".
Open the Find in Files tab, e.g. with Ctrl+Shift+F.
Replace with: nothing (leave the field empty).
Set Filters.
Set Directory.
Press Replace in Files (at your own risk!).
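If you do go the scripting route suggested above, here is a minimal Python sketch (the directory and file pattern are assumptions, not from the question). It removes duplicate lines from every matching file, keeping the first occurrence of each line; note that the Notepad++ regex above keeps the last occurrence instead:

import glob

# Deduplicate every matching file in place (pattern is hypothetical).
for path in glob.glob("data/*.csv"):
    seen = set()
    unique = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line not in seen:  # keep only the first occurrence of each line
                seen.add(line)
                unique.append(line)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(unique)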


Notepad++: how to delete duplicate lines up to a specific character, keeping alphabetical order?

I am working with a dictionary file en_GB.dic.
Problem 1: There is a new, updated en_GB.dic, and when comparing the two in Notepad++ I want to automatically transfer the missing words from the new en_GB.dic to my current en_GB.dic. I do not want to transfer words like abdominal/YS when I already have abdominal/SY.
Problem 2: Assuming problem 1 cannot be done and I have to copy and paste the new en_GB.dic into my current en_GB.dic, there will be thousands of duplicate lines. It is easy to remove the obvious duplicates, but I can't seem to remove duplicates like the example below.
abdominal/SY
abdominal/YS
I want to compare entries only up to the / character, treat them as duplicates, and remove abdominal/YS (as /YS is not in alphabetical order) while keeping abdominal/SY. There are thousands of such entries that need removing.
Thanks in advance for any replies and hopefully solutions.
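One way to tackle problem 2 outside Notepad++ is a small script. Here is a minimal Python sketch, assuming each line has the form word/FLAGS and that, among duplicates, the entry whose flag letters are in alphabetical order should be kept (both assumptions drawn from the example above):

def flags_sorted(flags):
    # True when the flag letters are in alphabetical order, e.g. "SY" but not "YS".
    return flags == "".join(sorted(flags))

best = {}
with open("en_GB.dic", encoding="utf-8") as f:
    for raw in f:
        line = raw.rstrip("\n")
        word, _, flags = line.partition("/")
        # Keep the first entry seen, but prefer one whose flags are sorted.
        if word not in best or (flags_sorted(flags) and not flags_sorted(best[word][1])):
            best[word] = (line, flags)

# Note: a real Hunspell .dic file starts with a word-count line,
# which this sketch does not handle.
with open("en_GB_deduped.dic", "w", encoding="utf-8") as f:
    for line, _ in best.values():
        f.write(line + "\n")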

Ignore last/corrupted record from flat file source in SSIS

I have the following CSV file:
col1, col2, col3
"r1", "r2", "r3"
"r11", "r22", "r33"
"totals","","",
followed by two blank lines. The import fails because of the extra comma at the end of the last data row, and it will most probably also fail because of the blank lines at the end.
Can I skip the last row somehow, or even better, stop the import when I reach that row? It always has the string "totals" in col1.
UPDATE:
As far as I understood from the answers, it is not possible to do this with the Flat File source alone. For now I have solved it with a Script Component as the source.
You can do it by reading each row in as a single string.
Conditionally split out NULL rows and rows where LEFT(col0) == "totals".
In the Script Component you then use the Split function,
and finally Trim the surrounding quotes ("\").
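A Script Component itself is written in C# or VB.NET, but the same read-filter-split-trim logic can be sketched in Python for illustration (the file name is hypothetical, and the naive comma split assumes no embedded commas in fields):

rows = []
with open("input.csv", encoding="utf-8") as f:
    for raw in f:
        line = raw.strip()
        # Conditionally split out blank rows and the trailing "totals" row.
        if not line or line.startswith('"totals"'):
            continue
        # Split on commas, then trim the surrounding quotes from each field.
        rows.append([part.strip().strip('"') for part in line.split(",")])
print(rows)  # header row plus the clean data rows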
I know of nothing built into SSIS that lets you ignore the last line of a CSV.
One way to handle this is to precede your data flow with a Script Task that uses the FileSystemObject to edit the CSV and remove the last line.
You will need to create a custom script within SSIS that reads all lines but the last.
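The preprocessing step those two answers describe, dropping the trailing blank lines and the final "totals" row before the data flow runs, could look like this as a minimal Python sketch (file names are hypothetical):

with open("input.csv", encoding="utf-8") as f:
    lines = [l for l in f.read().splitlines() if l.strip()]  # drop blank lines
if lines and lines[-1].startswith('"totals"'):
    lines.pop()  # drop the trailing "totals" row
with open("input_clean.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")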
This is old, but it came up for me when searching this topic. My solution was to redirect rows on the destination: the last row is redirected instead of failing, and the job completes. Of course you will potentially redirect rows you don't want to; it all depends on how much you can trust the data.

Value is repeated after copying it from Access and pasting it into Notepad++

I open a table inside the GUI of MS Access. I mark a cell which contains:
Myriam
I hit Ctrl+C to copy it. I open a new document in Notepad++ and paste with Ctrl+V. The result is:
"Myriam
Myriam"
I get two lines instead of one! There are 27,000 entries in that column, and ONLY for this one do I observe this behaviour. I was able to track it down to this level, but now I'm clueless about the 'why'...?
Using the "Zoom" feature (ShiftF2) in Access revealed that the field actually did contain
Myriam
Myriam
This was probably due to a user accidentally hitting Ctrl+Enter during data entry, which added the newline and made the field look empty, so they typed the name in a second time.
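To check whether any other of the 27,000 entries contain an embedded newline, one could export the table to CSV and scan it. A minimal Python sketch, where the file name contacts.csv and the column name "name" are hypothetical:

import csv

with open("contacts.csv", encoding="utf-8", newline="") as f:
    # The csv module parses quoted multi-line fields, so embedded newlines survive.
    for i, row in enumerate(csv.DictReader(f), start=2):
        if "\n" in row["name"] or "\r" in row["name"]:
            print(f"row {i}: embedded newline in {row['name']!r}")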

NiFi : Regular Expression in ExtractText gets CSV header instead of data

I'm working on a flow where I get CSV files. I want to put the records into different directories based on the first field in the CSV record.
For example, the CSV file would look like this:
country,firstname,lastname,ssn,mob_num
US,xxxx,xxxxx,xxxxx,xxxx
UK,xxxx,xxxxx,xxxxx,xxxx
US,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
I want to get the value of the first field, i.e. country, and put each record into the corresponding directory: US records go to the US directory, UK records go to the UK directory, and so on.
The flow that I have right now is:
GetFile ----> SplitText(line split count = 1 & header line count = 1) ----> ExtractText (line = (.+)) ----> PutFile(Directory = \tmp\data\${line:getDelimitedField(1)}). I need the header line to be replicated across all the split files for a different purpose, so I need it.
The thing is, the incoming CSV file gets split into multiple flow files with the header successfully. However, the regex I have given in the ExtractText processor is evaluated against each split flow file's CSV header instead of the record. So instead of getting US or UK in the "line" attribute, I always get "country", and all the files go to \tmp\data\country. Help me resolve this.
I believe getDelimitedField will only work on a single line and is likely not moving past the newline in your split file.
I would advocate for a slightly different approach in which you could alter your ExtractText to find the country code through a regular expression and avoid the need to include the contents of the file as an attribute.
Using a regex of ^.*\n+(\w+) will skip past the header line and capture the first run of word characters on the following line (i.e. up to the comma), placing it in the attribute name you specify plus the capture-group suffix (e.g. country.1).
I have created a template that should get the value you are looking for available at https://github.com/apiri/nifi-review-collateral/blob/master/stackoverflow/42022249/Extract_Country_From_Splits.xml
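As a quick sanity check of that pattern outside NiFi, here is a minimal Python sketch (the sample flow file content is made up to match the question):

import re

# A split flow file as produced by SplitText: header line plus one record.
split_flowfile = "country,firstname,lastname,ssn,mob_num\nUS,xxxx,xxxxx,xxxxx,xxxx"

m = re.search(r"^.*\n+(\w+)", split_flowfile)
print(m.group(1))  # -> "US", the value ExtractText would store as country.1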

Parse tab separated text file in Google Sheets

I have a txt file available on the web which contains tab separated values (TSV/CSV) like this:
Product_Id tab Color tab Price tab Quantity
Item1 tab Red tab $5.2 tab 5
Item2 tab Blue tab $7.5 tab 10
(where "tab" stands for a tab character)
I imported the txt file into a Google Spreadsheet using the IMPORTDATA(url) formula. The problem is that now I need to split the text to columns. I tried the following formulas without success:
Split(A1,"\t")
Split(A1," ")
Split(A1,"<tab>")
Another thing I tried is to use the Substitute function, but I just can't figure out how to match the tab character in Google Spreadsheets.
Google Sheets strips tabs by default when you paste text using a standard paste. Tab-delimited data can be pasted and automatically parsed using:
Right Click -> Paste special -> Paste values only
IMPORTDATA(url) seems to handle tabs automatically, as others have mentioned before, if the URL ends in ".tsv".
I had trouble trying to import a file from Dropbox even though the file was named "something.tsv", because the url was
"https://www.dropbox.com/s/xxxxxxx/something.tsv?dl=1"
I managed to solve the problem by adding a dummy query parameter to the url:
"https://www.dropbox.com/s/xxxxxxx/something.tsv?dl=1&x=.tsv"
NOTE: I know this question was asked back in 2014 and I am answering this question some 5 years later. I am posting the answer here in hopes that someone else who googles their way here will be saved the headache and can be helped by how I devised a solution.
SUMMARY OF THE ISSUE: By default the IMPORTDATA() function will properly process a tab-delimited file only if the file name ends with the extension .TSV
UPDATE Nov 14, 2019:
In a comment below, Poul shared that he found an undocumented parameter for the IMPORTDATA() function by which you can specify the delimiter used to split the data. As of this writing, the official documentation makes no reference to this parameter.
In effect the documentation should look something like the following:
IMPORTDATA("url","delimiter")
So, if you wanted to force a file to be split on the TAB character, it would look something like
IMPORTDATA("url","\t")
PRIOR ANSWER:
UPDATE: I am leaving my original answer just in case it might be helpful if the answer above, which includes undocumented functionality, does not continue to work.
ORIGINAL ANSWER: After seemingly countless attempts, I figured out how to coax Google Sheets into importing a tab-delimited file regardless of the extension.
For those looking for the quick and dirty answer, copy the following into a cell of a Google Sheet to give it a try:
=ARRAYFORMULA(IFERROR(SPLIT(IMPORTDATA("https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3_Latin1.tab"),CHAR(9),FALSE,FALSE)))
For those that want to know a bit more, I will try to explain how each of the nested functions are helping to create the final solution:
=ARRAYFORMULA( IFERROR( SPLIT( IMPORTDATA(URL-HERE) ,CHAR(9),FALSE,FALSE) ) )
IMPORTDATA() - the primary function that pulls in the data file from the web
SPLIT - splits each row on the tab character; note the use of CHAR(9) to generate the tab, and the FALSE final parameter, which in my case was required to ensure empty cells were not collapsed together
IFERROR - used to catch situations where an import might fail, the error will be trapped and not returned to the spreadsheet
ARRAYFORMULA - this function ensures that every line in the file is parsed; without this, only the first line of the file would be returned to the spreadsheet
It turns out that IMPORTDATA(url) can import a tab separated file, but it expects the file name to have the .tsv extension. This is inconsistent with Excel, where a tab-separated export results in *.txt.
If you can ensure that you use a .tsv extension, then your problem is solved.
You can also use the Sheets UI to import the file (into a new Spreadsheet). Select File > Import..., then Upload > Select a file from your computer. When the file selection dialog opens, paste the URL into the file name field, and click Open. The file will be downloaded to your PC then uploaded to Drive, through the Import dialog that will let you choose the delimiter.
(Validated on Windows 8.1 with Chrome; I don't know how this will behave on other OSes or browsers.)
Edit: See this gist.
importFromCSV(string fileName, string sheetName)
Populates a sheet with contents read from a CSV file located in the user's GDrive. If either parameter is not provided, the function will open inputBoxes to obtain them interactively.
Automatically detects tab or comma delimited input.
I had luck using SPLIT() with only a single space as the delimiter, even though the data I pasted in had tabs separating each "column": =SPLIT(A1, " ", TRUE), where A1 had data separated by one or more spaces. It seems that pasting in TSV data converts the tabs to spaces.
This can be done in two steps, leveraging the fact that the Trim whitespace cleanup treats a tab as whitespace.
Steps are as follows:
Select the columns that contain the tab-separated data, then reduce each tab to a single space using Data -> Data cleanup -> Trim whitespace.
Now the usual Data -> Split text to columns should work out of the box, or after selecting space as the separator.