Notepad++ or Excel - delete duplicate and original rows

I have a single .txt file with more than 200k rows.
In this file, I want to delete both the duplicates and the original rows they duplicate.
This is what I have now:
text a
text a
text b
text c
text d
text d
text e
But I need a result like this:
text b
text c
text e
Any suggestions?
I have tried the normal "delete duplicate" procedure in Excel and Notepad++, but I get this:
text a
text b
text c
text d
text e
and that does not work for me.
Searching for existing discussions, I found something similar, but it applies to Access.

Just using Notepad++, do a regular expression replace of ^(.*\R)\1+ with nothing. Ensure that "Wrap around" is selected and that ". matches newline" is not selected. This assumes that the duplicate lines are grouped together; sorted lines would be ideal.
The regular expression is as follows
^ Beginning of a line
( Start capture group
.* Zero or more of any character
\R An end of line
) End of capture group
\1+ One or more repeats of the capture group
If the last line of the file is part of a duplicated group but has no end-of-line character(s), the above regular expression will not match that last line. If the last three or more lines are identical and the final one has no end-of-line character(s), the regular expression will remove most of the final group but leave one line. My recommendation is to ensure that the last line of the file ends with the end-of-line character(s) before using the regular expression. (Thanks to @Toto for pointing out this possible problem.)
Note that if the test for what is duplicated is more complex than exactly identical then the ideas in this Stack Overflow answer may be used.
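As a rough cross-check outside Notepad++, the same pattern can be tried in Python. This is only a sketch: Python's re module has no \R, so the hypothetical helper drop_repeated_groups below assumes LF line endings, and it appends a missing final newline first, as recommended above.

```python
import re


def drop_repeated_groups(text: str) -> str:
    """Remove every run of two or more identical adjacent lines.

    Hypothetical helper; assumes LF line endings and that duplicates
    are already grouped together (for example, because the file is sorted).
    """
    # Make sure the last line ends with a newline so it can be matched too.
    if text and not text.endswith("\n"):
        text += "\n"
    # ^(.*\n)  one whole line including its line ending
    # \1+      one or more exact repeats of that line
    return re.sub(r"^(.*\n)\1+", "", text, flags=re.MULTILINE)


sample = "text a\ntext a\ntext b\ntext c\ntext d\ntext d\ntext e"
result = drop_repeated_groups(sample)
print(result, end="")
```

Only lines that were never repeated survive; every member of a repeated group is removed, including the first one.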

Here is a way to achieve it using Python. Hope it helps :)
# Import pandas to build the DataFrame
import pandas as pd

# Replace this path with your own file.
# For a quick test without a file, use this list instead:
# lt = ["text a", "text a", "text b", "text c", "text d", "text d", "text e"]
with open(r"C:\Users\antch\Desktop\txt.txt", "r") as text_file:
    # Strip line endings so a last line without a newline still compares equal
    lt = text_file.read().splitlines()

df = pd.DataFrame({"cols": lt})
# Helper column so that count() has something to count in each group
df["dummy_count"] = 1
# Count how often each distinct line occurs (groupby sorts the lines)
df = df.groupby("cols").count()
df.reset_index(inplace=True)
# Keep only the lines that occur exactly once
df = df.loc[df["dummy_count"] == 1]
print(df["cols"])
For the sample data in the question, the code above prints the expected rows: text b, text c and text e.
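If pandas is not available, a similar count-and-filter idea can be sketched with the standard library alone. The helper name lines_occurring_once is hypothetical. Unlike the groupby approach, this sketch keeps the original line order and does not need the duplicates to be adjacent.

```python
from collections import Counter


def lines_occurring_once(lines):
    """Return the lines that appear exactly once, in their original order."""
    counts = Counter(lines)  # how many times each distinct line occurs
    return [line for line in lines if counts[line] == 1]


lt = ["text a", "text a", "text b", "text c", "text d", "text d", "text e"]
unique_only = lines_occurring_once(lt)
print("\n".join(unique_only))
```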

Related

Finding text within text delimited by new line character

I am trying to find text within text using MySQL. I have a field of values that is somewhat unstructured, but the data entry fortunately is delimited by new lines. I'm trying to see if I can pull the value for "Education", which would be basically a substring that starts after "Education:" and ends with \n new line character in data below:
'Children: 5
Education: College
Employment: Homemaker
Marital Status: Married'
I've looked at the MID function, but since the values for education vary, the length isn't standard. I have searched MySQL string functions, and I have not found a solution that will allow me to search between two positions, including one that is defined by a regex character -- the REGEX simply provides a match, not a position.
SELECT id,MID(value,POSITION('Education:' IN value),30)
FROM client_data
The code performed as expected, but because it uses a fixed length rather than the position of the \n newline character, the results were either truncated or included extra characters from the subsequent text.
I'm guessing there is a way to do this that I'm just not finding.
You can use REGEXP_SUBSTR to get the actual string that matched the regular expression:
REGEXP_SUBSTR(value, '^Education:.*', 1, 1, 'm')
This gets you the Education line. Then you just need to remove the Education: label, and any spaces that follow it, from the start of that string:
REGEXP_REPLACE(
  REGEXP_SUBSTR(value, '^Education:.*', 1, 1, 'm'),
  '^Education: *', '')
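To try the pattern outside MySQL, here is a small Python sketch of the same two steps: find the labelled line in multiline mode, then strip the label and the whitespace after it. The function name extract_field is hypothetical, and Python's re module stands in for MySQL's regex engine, so small dialect differences are possible.

```python
import re


def extract_field(value, label):
    """Return the text after 'label:' on its line, or None if the label is absent."""
    # Step 1: grab the whole line that starts with the label.
    # Without re.DOTALL, '.' stops at the end of the line.
    match = re.search(rf"^{re.escape(label)}:.*", value, flags=re.MULTILINE)
    if match is None:
        return None
    # Step 2: remove the label and any whitespace that follows it.
    return re.sub(rf"^{re.escape(label)}:\s*", "", match.group(0))


value = "Children: 5\nEducation: College\nEmployment: Homemaker\nMarital Status: Married"
print(extract_field(value, "Education"))
```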

Fail to load a 4 column CSV file into OCTAVE - output is only first column, or 1 array per line

I'm having trouble loading a CSV file into Octave.
EDIT: As pointed out by ANDY and Eliahu Aaron, I changed ; to ,.
csvread 4 returns separated columns, each named after the first line.
My matlab script throws these errors:
error: 'z' undefined near line 13 column 3
error: called from myScript at line 13 column 2
I can't find -z, even though there is now a column called z that it should be calculated from.
This fixed my issue in the end:
g = cell2mat(A(2:end-1,2));
My csv looks like this:
time;z;y;x
5;15084;-1360;-9664
7;15280;-1296;-9784
10;15032;-1384;-9688
30;15160;-1548;-9772
56;15116;-1532;-9660
First I had to delete the first row, because the matrix was unreadable for Octave.
If I try csv2cell on my file, I only get 1 column, with all the values of a line in each cell:
mycsvdata = csv2cell("file.csv")
If I try csvread, I get 1 column named "ans" with the values of the first column; the 2nd, 3rd and 4th columns are ignored.
csvread("file.csv")
When I drag and drop the same CSV into MATLAB and click on the green tick, every column is named after its first cell and becomes a variable. I end up with 4 variables called time, z, y and x.
In Octave I have not been able to achieve this.
What am I doing wrong?
This seems to be such a basic problem, but I haven't come across a solution on the internet.
I need 4 variables called time, z, y and x that hold all the values from the 1st (time), 2nd (z), 3rd (y) and 4th (x) columns.
I am new to Octave and have code written for MATLAB that I want to port to Octave. I am not even able to test my code, because I am not able to load the CSV properly. This is very frustrating for me.
Thanks in advance.
CSV by default uses , as column delimiters but your file has ; as column delimiters.
You can use dlmread("file.csv", ";") instead of csvread but it can't read the first row time;z;y;x.
You can use csv2cell("file.csv", ";"), the first row will be strings and the rest numbers.
To create a struct array with the fields time, z, y and x you can use the following code:
pkg load io
A = csv2cell("file.csv", ";");
B = cell2struct(A(2:end,:),A(1,:),2);
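For readers who want to compare the result with another environment, the same idea (header row becomes field names, remaining rows become records) can be sketched in Python with the standard csv module. This is only an illustration, not Octave code: io.StringIO stands in for file.csv, and the names rows and columns are chosen for this example.

```python
import csv
import io

# Sample data from the question; io.StringIO stands in for file.csv.
raw = (
    "time;z;y;x\n"
    "5;15084;-1360;-9664\n"
    "7;15280;-1296;-9784\n"
    "10;15032;-1384;-9688\n"
    "30;15160;-1548;-9772\n"
    "56;15116;-1532;-9660\n"
)

# The delimiter must be given explicitly because the file uses ';' instead of ','.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")

# One record per data row, keyed by the header names (similar to a struct array).
rows = [{name: int(text) for name, text in row.items()} for row in reader]

# One list per column, similar to having variables time, z, y and x.
columns = {name: [row[name] for row in rows] for name in reader.fieldnames}
print(columns["z"])
```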

Need to insert a column in csv and parse current date in it - JREPL.BAT?

Thanks to dbenham and his powerful JREPL.BAT, I was able to remove the header row from a CSV file with this command:
jrepl "^(Date,)?.*" "($1?i++:i)?$0:false" /jmatch /jbeg "var i=0" /f test.txt /o output.txt
I now need to insert the date (here 2016-03-27) in the first column of every row of the CSV below, and delete the last row (the total). Can jrepl do this too? Thanks!
Report,Begin Date,End Date,Currency,Change Currency
Activity Summary By Account,2016-03-27 00:00:00.000 -0600,2016-03-28 00:00:00.000 -0600,USD,Change Currency
Affiliate,Account,Screen Alias,Total Wagered,Total Payout,Net Win Loss,Percent Payout
FaridZ,BuF,BuFis,1153.00,828.00,325.00,71.81%
JohnX,adel,adel,104.70,71.70,33.00,68.48%
FaridZ,chat00,shat,49065.00,45987.50,3077.50,93.72%
,,Total:,"50,657.70","47,247.20","3,410.50",93.26%
Updated: screenshot of final csv output...
This can be done efficiently with JREPL.BAT, but the solution is fairly complicated if you want to do everything with a single pass of the input data. I'm not sure that the solution is any simpler than writing a custom JScript or VBS script.
Note that I discovered a minor JREPL.BAT bug while developing this solution, so there is an updated version 3.8 at the link with a few bug fixes.
jrepl "^$|^,,Total:,.*|^.*?,(.+?),.*"^
"i=1;''|false|if(ln==2){dt=$4;$0}else i?$0+','+((i++==1)?'Date':dt):$0"^
/jmatch /jbeg "var i=0, dt" /t "|" /f test.txt /o output.txt
I used line continuation to make the code easier to read.
A bit of explanation of the solution is in order.
/JBEG defines a couple of variables that are needed during the find/replace operation.
  dt - Holds the captured date string.
  i - Used to control whether anything is appended:
      if i=0 then no change
      if i=1 then append the Date header
      else append dt
I used /JMATCH along with the /T option with | as a delimiter. The /T option is similar to the unix tr command. For each delimited search in the find string, there is a corresponding JScript expression in the replacement string.
  $1 search ^$ - Looks for an empty line
     replace i=1;'' - Triggers i so that subsequent non-empty lines have the date column appended. The replacement value for this line is an empty string.
  $2 search ^,,Total:,.* - Looks for the final Total line
     replace false - Prevents the total line from being printed
  $3 search ^.*?,(.+?),.* - Looks for a line with at least 3 fields, capturing the 2nd field in $4
     replace if(ln==2){dt=$4;$0}else i?$0+','+((i++==1)?'Date':dt):$0 - This is where most of the complicated logic resides:
       If this is the 2nd line, then save the date string ($4) in dt and replace with the full matched string
       else if i is not 0, then increment i and replace with the full matched string plus append string ',Date' the first time, else append the dt value
       else i=0, so replace with the original string.
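The branching described above may be easier to follow as ordinary code. The Python sketch below mirrors the three search/replace pairs; it is an illustration, not a replacement for the JREPL command. The function name add_date_column is hypothetical, fields are split naively on commas, lines with fewer than three fields are simply kept (an assumption of this sketch), and the sample input assumes a blank line between the report header and the table, as the ^$ search implies.

```python
def add_date_column(lines):
    """Append a Date column to the table rows and drop the Total line."""
    out = []
    i = 0      # 0: no change, 1: append the 'Date' header, >1: append dt
    dt = None  # date string captured from the 2nd field of line 2
    for ln, line in enumerate(lines, start=1):
        if line == "":
            # Pair 1: a blank line switches appending on and stays empty.
            i = 1
            out.append("")
        elif line.startswith(",,Total:,"):
            # Pair 2: the Total line is not printed.
            continue
        else:
            fields = line.split(",")  # naive split, enough for this sketch
            if len(fields) < 3:
                out.append(line)      # assumption: keep short lines as they are
            elif ln == 2:
                dt = fields[1]        # Pair 3: remember the date on line 2
                out.append(line)
            elif i:
                out.append(line + "," + ("Date" if i == 1 else dt))
                i += 1
            else:
                out.append(line)
    return out


lines = [
    "Report,Begin Date,End Date,Currency,Change Currency",
    "Activity Summary By Account,2016-03-27 00:00:00.000 -0600,2016-03-28 00:00:00.000 -0600,USD,Change Currency",
    "",
    "Affiliate,Account,Screen Alias,Total Wagered,Total Payout,Net Win Loss,Percent Payout",
    "FaridZ,BuF,BuFis,1153.00,828.00,325.00,71.81%",
    ',,Total:,"50,657.70","47,247.20","3,410.50",93.26%',
]
result = add_date_column(lines)
print("\n".join(result))
```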

MySQL select column string with LIKE or REGEX based on sub-string with delimiters

I have to compare a column value with numbers entered by the user. The string in the column is in the format 8|5|12|7|
Now, I need to compare the user input values 2,5,3 with this column value.
When I use the LIKE operator as '%2|%', I get a match on the column value |12|
How do I match the string using a regular expression or any other way?
If I understand the question correctly, then to make sure that you match 2|.. or ..|2|.. or ..|2, you need to add OR clauses:
where col like '%|2|%'
or col like '2|%'
or col like '%|2'
or col='2'
Something similar to this, which tests for 2 in the example 12|8|12|5|12|7|2|12|22:
# (^|\|)2(\||$)
#
#
# Match the regex below and capture its match into backreference number 1 «(^|\|)»
# Match this alternative (attempting the next alternative only if this one fails) «^»
# Assert position at the beginning of the string «^»
# Or match this alternative (the entire group fails if this one fails to match) «\|»
# Match the character “|” literally «\|»
# Match the character “2” literally «2»
# Match the regex below and capture its match into backreference number 2 «(\||$)»
# Match this alternative (attempting the next alternative only if this one fails) «\|»
# Match the character “|” literally «\|»
# Or match this alternative (the entire group fails if this one fails to match) «$»
# Assert position at the end of the string, or before the line break at the end of the string, if any (line feed) «$»
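REGEXP '(^|\\|)2(\\||$)'
The backslashes are doubled because MySQL string literals treat a single backslash as an escape character, so \| would otherwise reach the regex engine as a bare |.
This allows the column value to be just 2, or 2|anything, or anything|2, or first thing|2|end thing.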
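The behaviour of this pattern can be checked quickly in Python before running it against the table. This is a sketch using Python's re module rather than MySQL, and contains_value is a hypothetical helper name.

```python
import re


def contains_value(column_value, wanted):
    """True if `wanted` appears as a complete item in a |-delimited string."""
    # (^|\|)  start of the string or a literal pipe before the value
    # (\||$)  a literal pipe or the end of the string after the value
    pattern = rf"(^|\|){re.escape(str(wanted))}(\||$)"
    return re.search(pattern, column_value) is not None


print(contains_value("12|8|12|5|12|7|2|12|22", 2))
print(contains_value("8|5|12|7|", 2))
```

The first call finds the standalone 2, while the second does not match the 2 inside 12, which is the false positive the LIKE '%2|%' attempt produced.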
Looking at your column design, one way you can do it is LIKE '%|2|%'
It is bad design to build "arrays" in a cell. Use a separate table.
Anyway, FIND_IN_SET() is a function that does the work a lot more easily than a regexp. (But you have to use ',' as the delimiter.)

Extract specific words from text field in mysql

I have a table that contains a text field; there are around 3 to 4 sentences in the field, depending on the row.
Now I am making an auto-complete HTML object, and I would like to type the beginning of a word and have the database return the words from the text field that start with those letters.
Example of a text field: I like fishsticks, fishhat are great too
In my auto-complete, if I type "fish", it should propose "fishsticks" and "fishhat".
Everything works but the query.
I can easily find the rows that contain a specific word, but I can't extract only the word rather than the full text.
select data_txt from mytable where match(data_txt) against('fish' IN BOOLEAN MODE) limit 10
I know it is dirty, but I cannot rearrange the database.
Thank you for your help!
EDIT:
Here's what I got, thanks to Brent Worden, it is not clean but it works:
SELECT DISTINCT
SUBSTRING(data_txt,
LOCATE('great', data_txt),
LOCATE(' ' , data_txt, LOCATE('great', data_txt)) - LOCATE('great', data_txt)
)
FROM mytable WHERE data_txt LIKE '% great%'
LIMIT 10
Any idea how to avoid using the same LOCATE expression over and over?
Use LOCATE to find the occurrence of the word.
Use LOCATE and the previous LOCATE return value to find the occurrence of the first space after the word.
Use SUBSTR and the previous 2 LOCATE return values to extract the whole word.
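The three steps can be sketched in Python to see what they return, with str.find standing in for LOCATE and slicing standing in for SUBSTR. The function name word_starting_with is hypothetical. Note that, like the SQL version, cutting at the next space keeps trailing punctuation; the regex line at the end is an alternative (an assumption of this sketch, not part of the steps above) that lists every matching word without punctuation.

```python
import re


def word_starting_with(text, prefix):
    """Return the first space-delimited chunk that begins at `prefix`."""
    start = text.find(prefix)    # step 1: locate the word
    if start == -1:
        return None
    end = text.find(" ", start)  # step 2: first space after the word
    if end == -1:
        end = len(text)          # the word runs to the end of the text
    return text[start:end]       # step 3: cut out the word


text = "I like fishsticks, fishhat are great too"
first = word_starting_with(text, "fish")
all_words = re.findall(r"\bfish\w*", text)
print(first)
print(all_words)
```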
$tagsql ="SELECT * from mytable";
$tagquery = mysql_query($tagsql);
$tags = array(); //Creates an empty array
while ($tagrow = mysql_fetch_array($tagquery)) {
$tags[] = $tagrow['tags']; //Fills the empty array
}
If the rows contain commas, you could use:
$comma_separated = implode(",", $tags);
You can replace the comma with a space if the values are separated by spaces in your table.
$exp = explode(",", $comma_separated);
If you require your data to be unique you may include the following:
$uniquetags = array_unique($exp, SORT_REGULAR);
You can use print_r to inspect the resulting array.
Here array_merge is used because $rt will not get displayed if you are using a 'jquery' autocomplete else $rt may work and array_merge can be ignored. However, you may use array_merge to include multiple tables by repeating the previous process.
$autocompletetags = array_merge((array)$uniquetags);
This sorts the values in alphabetical order:
sort($autocompletetags, SORT_REGULAR);