Notepad++: how to delete duplicate lines, comparing only up to a specific character, and keep the alphabetically ordered one? - duplicates

I am working with a dictionary file en_GB.dic.
Problem 1: There is a new, updated en_GB.dic. When I compare it with my current file in Notepad++, I want to automatically transfer the missing words from the new en_GB.dic to my current en_GB.dic. I do not want to transfer words I already have, such as abdominal/YS when I have abdominal/SY.
Problem 2: Assuming problem 1 cannot be done and I have to copy and paste the new en_GB.dic into my current en_GB.dic, there will be thousands of duplicate lines. The obvious duplicates are easy to remove, but I can't seem to remove duplicates like the following example:
abdominal/SY
abdominal/YS
I want to compare lines only up to the / character when checking for duplicates, then remove abdominal/YS, because YS is not in alphabetical order, and keep abdominal/SY. There are thousands of such cases that need removing.
Thanks in advance for any replies and hopefully solutions.
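The rule described above (compare only the part before the /, keep the entry whose letters after the / are in alphabetical order) can be sketched outside Notepad++ in a few lines of Python. This is only a sketch under assumptions that are not in the question: every character after the / is treated as an independent one-letter marker, the markers of duplicate entries are merged, and the function name dedupe_dic_lines is made up for illustration.

```python
def dedupe_dic_lines(lines):
    """Keep one entry per word, comparing only the text before '/'.

    Assumption: each character after '/' is an independent one-letter
    marker, so 'YS' and 'SY' mean the same thing and can be sorted.
    """
    markers_by_word = {}  # word -> set of marker letters, in first-seen order
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        word, _, markers = line.partition("/")
        markers_by_word.setdefault(word, set()).update(markers)

    result = []
    for word, markers in markers_by_word.items():
        if markers:
            result.append(word + "/" + "".join(sorted(markers)))
        else:
            result.append(word)  # entry without any '/...' part
    return result


if __name__ == "__main__":
    sample = ["abdominal/SY", "abdominal/YS", "abacus"]
    for entry in dedupe_dic_lines(sample):
        print(entry)
```

Entries keep the order in which each word was first seen; wrapping the result in sorted() would additionally order the whole list alphabetically. Any header line in the real file (for example a word count) would need to be handled separately.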

Related

Remove duplicate rows from multiple files in Notepad++

I have 350 files of data with each containing about 4,000 rows. There are 3,000 unique rows but some rows are duplicated e.g.
"2021-02-02",20.1,99,0,3.4
"2021-02-03",22.6,95,0,2.9
"2021-02-04",18.8,90,0,5.2
"2021-02-02",20.1,99,0,3.4
"2021-02-03",22.6,95,0,2.9
"2021-02-05",21.9,96,0.8,4.2
"2021-02-06",20.8,95,0,3.3
I would like to remove only the duplicate lines in each of the 350 files. However, the duplicate lines are different in each file, i.e. some files may have other dates duplicated apart from the sample shown. The duplicate lines are random and not in any particular order. I used Line Operations in Notepad++ to sort the lines in ascending order and then remove the duplicates. It works fine for one file, but repeating this step 350 times will take a long time.
As mentioned in the comments, a script in your favorite scripting language is the best way.
But you can have a look at the screenshots below and adapt the steps to your needs.
I assume you have all the files, or some of them, in one directory. Please make a backup copy before testing.
Open one file in your workspace.
Open the Find dialog, e.g. with Ctrl+F.
Enter in Find what: ^(.*?)$\s+?^(?=.*^\1$)
Select Regular expression and ". matches newline".
Open the Find in Files tab, e.g. with Ctrl+Shift+F.
Replace with: leave empty.
Set Filters.
Set Directory.
Press Replace in Files (at your own risk!).
[Screenshots: file contents before and after the replacement]
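For the scripting route mentioned at the start of this answer, a minimal Python sketch could look like the following. The function names, the "*.csv" filter and the UTF-8 encoding are assumptions for illustration, not details from the question. Note one difference from the regular expression above: this sketch keeps the first occurrence of each line, while the regex removes the earlier copies and keeps the last one.

```python
from pathlib import Path


def unique_lines(lines):
    """Return the lines with later duplicates dropped, original order kept."""
    seen = set()
    kept = []
    for line in lines:
        key = line.rstrip("\r\n")  # compare without the line ending
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


def dedupe_directory(directory, pattern="*.csv"):
    """Rewrite every matching file in place without duplicate lines.

    Work on a backup copy first: files are overwritten.
    """
    for path in sorted(Path(directory).glob(pattern)):
        # newline="" keeps the original line endings untouched
        with path.open(newline="", encoding="utf-8") as handle:
            lines = handle.readlines()
        kept = unique_lines(lines)
        if len(kept) != len(lines):
            with path.open("w", newline="", encoding="utf-8") as handle:
                handle.writelines(kept)
```

Calling dedupe_directory on the folder that holds the 350 files would process them in one run, with no sorting needed.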

Google Spreadsheets ArrayFormula: How to split and transpose a cell-range?

Hello everybody and thanks a lot for your help.
Here's my problem:
What I have:
I have a table with raw data in 53 rows and numerous columns which I would like to reduce and restructure into three columns: City, Date and Value.
https://docs.google.com/spreadsheets/d/1bsdC8lrtSGk957ae8Z0VRGnDqTZfFLPpLkfoid0UbIQ/edit?usp=sharing
What I've done so far:
For a single row, I used the following formula to make everything work as I wanted it to:
ArrayFormula({SPLIT(TRANSPOSE(Base_Data!A2)&"|"&TRANSPOSE(Base_Data!AJ1:1&"|"&Base_Data!AJ2:2),"|")})
What I want:
I'd like to extend the formula to work for the entire area, all 53 rows. Does anyone have a tip for this? The solution doesn't have to be a formula; a script would work, too.
I've set up a new sheet called "New_Data [Erik]" and placed the following formula into A2:
=ArrayFormula(SPLIT(FLATTEN(Base_Data!A2:A&"\"&Base_Data!AJ1:1&"\"&Base_Data!AJ2:54),"\",0,1))
If this is a one-time conversion, I'd recommend copying the results in place. To do that, select A:C, hit Ctrl-C to Copy and then Ctrl-Alt-V to Paste Special. A small clipboard icon will appear. Click it and choose "Paste Values Only."
If you'll need this functionality on an ongoing basis, just understand that FLATTEN is a not-yet-official function of Google Sheets, which means that while Google may very well make it official, they may also decide to do away with it at any time. (This is why I suggest copying and pasting the results in place if it's just a one-time conversion.)
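To make clear what the formula produces, here is the same reshaping written as a small Python sketch. It is not a Google Sheets script; the function name unpivot and the sample values are made up for illustration. Each city is paired with each date header and the matching cell, row by row, which is the order in which FLATTEN walks the range.

```python
def unpivot(cities, dates, values):
    """Turn a wide table into (city, date, value) rows.

    cities: one label per data row
    dates:  one header per data column
    values: list of rows, each as long as dates
    """
    rows = []
    for city, row_values in zip(cities, values):
        for date, value in zip(dates, row_values):
            rows.append((city, date, value))
    return rows


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the linked sheet.
    table = unpivot(
        ["Berlin", "Hamburg"],
        ["2020-01-01", "2020-01-02"],
        [[10, 11], [20, 21]],
    )
    for row in table:
        print(row)
```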
I'm not sure what you're trying to achieve there. If you just want to leave out all but three columns, use ={Base_Data!A2:A, Base_Data!E2:E} and add as many columns as you need, comma-separated, within the curly brackets.

How to copy variable values within an SPSS file?

I have three separate SPSS files with information about roughly 7500 hemicolectomy patients. One file contains the information about the hemicolectomies, the second one is about other surgeries the patients have had during their lifetime, and the last one contains information about their sick leaves during their lifetime.
I have merged the files into a single SPSS document (idnumber is the common variable), but I ran into a problem with filtering out the surgeries and sick leaves that have nothing to do with the hemicolectomy. I'm quite new to SPSS, so the simplest way I could think of doing this is by somehow copying the hemicolectomy info to every case and then just using the date/time calculator to choose which sick leaves and surgeries to discard. Switching to wide format is impractical due to the large number of unrelated surgeries and sick leaves: I'd have thousands of variables.
So basically I'd like to do the following:
IF idnumber = idnumber THEN variable1=variable1 AND variable2=variable2 etc
How would I go about doing this?
All help will be appreciated!
The IF command can only be used with one transformation:
IF [condition] [transformation].
Assuming both of your files are sorted by idnumber:
UPDATE file=[master_file_reference]
/file=[secondary_file_reference]
/BY idnumber.
EXECUTE.
The file reference can be made either by their dataset name, or by their full path.
More on the UPDATE command:
https://www.ibm.com/support/knowledgecenter/en/SSLVMB_24.0.0/spss/base/syn_update_examples.html
I can't comment yet, so I'm sorry if I misunderstand the problem. I would've asked for clarification in the comments to the question... here goes...
So you have three sources of data which have dates (?) of hemicolectomies, one for each case; dates (?) of other surgeries, multiple for each case; and sick leaves, even more for each case. Is that right?
I'd try solving the problem before matching all three files, by first matching the file that contains one observation per patient (presumably hemicolectomies) to the one with the second most observations per patient (presumably other surgeries), using the /table keyword:
MATCH FILES /FILE= 'surgeries.sav' /table = 'hemicolectomies.sav'
/by idnumber.
EXECUTE.
This will "fill up" the blank cells for each patient with the hemicolectomy data.
Now use the dates to check which surgeries "belong" to the hemicolectomies, reduce your data accordingly, and then match it to the sick leave data using the /table keyword again.
Seems like the easiest solution to me.
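For readers who want to see the logic of this table lookup outside SPSS, here is a rough Python sketch. All names (attach_hemicolectomy, within_days, hemi_date, surgery_date) and the 30-day window are hypothetical; the question does not say which variables or which time span define a related surgery.

```python
from datetime import date


def attach_hemicolectomy(surgeries, hemicolectomies):
    """Copy each patient's single hemicolectomy record onto all of
    that patient's surgery rows (one-to-many lookup by idnumber)."""
    by_id = {h["idnumber"]: h for h in hemicolectomies}
    merged = []
    for surgery in surgeries:
        row = dict(surgery)
        lookup = by_id.get(surgery["idnumber"], {})
        for name, value in lookup.items():
            if name != "idnumber":
                row[name] = value
        merged.append(row)
    return merged


def within_days(rows, max_days):
    """Keep rows whose surgery date lies within max_days of the
    hemicolectomy date (the window itself is an assumption)."""
    return [
        r for r in rows
        if "hemi_date" in r
        and abs((r["surgery_date"] - r["hemi_date"]).days) <= max_days
    ]


if __name__ == "__main__":
    hemis = [{"idnumber": 1, "hemi_date": date(2020, 1, 10)}]
    surgs = [
        {"idnumber": 1, "surgery_date": date(2020, 1, 20)},
        {"idnumber": 1, "surgery_date": date(2015, 5, 1)},
    ]
    print(within_days(attach_hemicolectomy(surgs, hemis), 30))
```

The same two steps would then be repeated against the sick leave records.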

Reshape the dataset into more relational format (Transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I cannot work out how to do this, even though it seems to be a very primitive transformation. Is there a silver bullet for this?
The original records (csv) have a somewhat JSON-like structure, so my first approach was to represent the original data as a vector and then transpose it (but in this case my resulting table looks like a sparse matrix: the rows I have transposed are blank in the rest of their values).
Another way I'm considering is to **serialize it into JSON and then de-serialize** it into a new spreadsheet (jsonize()); in this case, I'm having problems with merging the pieces properly.
With both approaches I have it "half-working".
Can anyone suggest a simple and reliable algorithm for this?
Any language, RegEx, tools or code snippets are very much appreciated.
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, I think, but in all cases you can basically use the fact that the 'Course' rows start with "Code:", which is never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
Open the file in a text editor that supports regular expressions in find/replace.
Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for: ^Code: replace with: ,,,,,Code:
If you now import the file into OpenRefine, you'll have a project with 10 columns (the 10th column is caused by the trailing comma at the end of the course data row).
You can now use Transpose (or just rename) on the right-most columns, which contain the course data, while leaving the left-most columns, which contain the student details.
Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet).
Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines.
Finally, rename the columns as you want and remove any extraneous columns.
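Since the question also asks for code snippets, the same fill-down idea can be sketched in Python with the standard csv module. This assumes the layout shown in the question holds for every block (a 'Code:' row, then a 'Student Name' header row, then student rows); the function name reshape is made up for illustration.

```python
import csv
import io


def reshape(text):
    """Repeat the course code and description on every student row."""
    rows = []
    code = description = None
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue  # skip empty lines
        if row[0] == "Code:":
            # Course row: Code:,<code>,Course Description:,<description>,
            code, description = row[1], row[3]
        elif row[0] == "Student Name":
            continue  # per-course header row, not data
        else:
            rows.append([code, description] + row)
    return rows


if __name__ == "__main__":
    sample = (
        'Code:,101,Course Description:,"Introduction to Rocket Science",\n'
        "Student Name,Lecture Hours,Labs Hours,Test Score,Status\n"
        "John Galt,48,120,4.7,Passed\n"
        "James Taggart,50,120,4.9,Passed\n"
    )
    writer = csv.writer(io.StringIO())
    for out_row in reshape(sample):
        print(out_row)
```

A header row of your choice can be written first, followed by the returned rows, using csv.writer.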

Pentaho / crossing files

I'm trying to cross 2 different .csv files in order to produce an output file indicating the new, changed, deleted and identical entries.
I'm trying to do it as explained here:
http://wiki.pentaho.com/display/EAI/Merge+rows
I'm using Merge rows (diff) to try to achieve this, but no matter what I try, it's not working. As key fields I'm only using the value of the row that doesn't update, i.e. an ID.
What I tried was using the same file for both inputs. When I don't change anything, the flagfield value is "identical" for all rows, but if I then modify ONE single value in ONE row in ONE of the files, I get all rows as "changed" and maybe 3 or 4 as "identical". Any ideas why this is happening? I just can't figure it out. Thanks in advance.
Merge rows (diff) is the correct step here.
If you're using a target database after the diff, then you can pair it with "synchronise after merge", but in this case a text file output will do it.
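To check what the flags should look like for a given pair of files, the comparison can be reproduced in a short Python sketch. This is not Pentaho code; the function name merge_rows_diff and its parameters are hypothetical, and rows are matched through a dictionary lookup on the key. As far as I know, the Merge rows (diff) step instead expects both inputs to be sorted on the key fields, so unsorted input is worth ruling out when almost every row comes back as "changed".

```python
def merge_rows_diff(reference, compare, key, value_fields):
    """Flag each key as identical, changed, new or deleted.

    reference / compare: lists of dicts (old rows and new rows)
    key:          name of the field that identifies a row
    value_fields: names of the fields to compare
    """
    old = {row[key]: row for row in reference}
    new = {row[key]: row for row in compare}
    result = []
    for k in sorted(old.keys() | new.keys()):
        if k not in new:
            flag, row = "deleted", old[k]
        elif k not in old:
            flag, row = "new", new[k]
        elif all(old[k][f] == new[k][f] for f in value_fields):
            flag, row = "identical", new[k]
        else:
            flag, row = "changed", new[k]
        result.append({**row, "flagfield": flag})
    return result


if __name__ == "__main__":
    before = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
    after = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}, {"id": 4, "name": "d"}]
    for row in merge_rows_diff(before, after, "id", ["name"]):
        print(row)
```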