Find duplicate lines based on some delimited fields on a line

I have a file whose lines contain fields delimited by "|".
I have to extract the lines that are identical on some of the fields
(i.e. find lines which contain the same values for fields 1, 2, 3, 12 and 13).
The contents of the other fields do not matter for the comparison, but the extracted lines must be complete.
Can anyone tell me how I can do that in KSH scripting?
(For example, a script with some order-dependent arguments that define the field separator and the fields to compare when looking for duplicate lines in the input file.)
Thanks in advance and kind regards
Oli

This prints duplicate lines based on matching fields. It uses an associative array which could grow large depending on the nature of your input file. The output is not sorted, so most duplicates are not grouped together (except the first two of a set). The key joins the fields with FS so that, for example, "ab" followed by "c" and "a" followed by "bc" do not collide.
awk -F'|' '{ idx=$1 FS $2 FS $3 FS $12 FS $13; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}' inputfile.txt
You could probably build up your index list in a shell variable in a wrapper script something like this:
#!/bin/ksh
idx=
for arg
do
    case $arg in # validate input (could be better)
        +([0-9]) ) # integers only
            idx="${idx:+$idx FS }\$$arg" # builds e.g. $1 FS $2 FS $3
            ;;
        * )
            echo "Invalid field specifier: $arg" >&2
            exit 1
            ;;
    esac
done
awk -F'|' '{ idx='"$idx"'; if (array ...
You can sort the output by piping it through a command such as this:
awk ... | sort --field-separator='|' --key=1,1 --key=2,2 --key=3,3 --key=12,12 --key=13,13
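If keeping whole lines in the array is a concern, a two-pass variant might be worth a look. This is only a sketch and not the command above: it reads the file twice, keeps just a counter per key, and on the second pass prints every line whose key occurred more than once, in the original order. The sample data and the shortened key (fields 1 and 3 instead of 1, 2, 3, 12 and 13) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample input; the key is fields 1 and 3 only, for brevity.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a|x|1
b|y|2
a|z|1
c|w|3
b|q|2
EOF

# Pass 1 (NR == FNR): count each key.
# Pass 2: print the lines whose key was counted more than once.
dups=$(awk -F'|' '
    { key = $1 FS $3 }
    NR == FNR { count[key]++; next }
    count[key] > 1
' "$tmp" "$tmp")

printf '%s\n' "$dups"
rm -f "$tmp"
```

The trade-off is reading the input twice, so this only works on a regular file, not on a pipe.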

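This prints the first line seen for each combination of the key fields, so it removes duplicates rather than listing them:
awk -F'|' '!arr[$1$2$3$12$13]++' inputfile > outputfile
To print one line for each key that occurs more than once, test the counter instead:
awk -F'|' 'arr[$1$2$3$12$13]++ == 1' inputfile > outputfile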
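A small made-up input shows what `!seen[key]++` selects, next to a counter test (`seen[key]++ == 1`, which is an addition here and not part of the answer) that picks one line per repeated key. The key is reduced to field 1 to keep the sketch short.

```shell
#!/bin/sh
# Made-up input; the key is field 1 only, for brevity.
input=$(printf '%s\n' 'a|1' 'b|2' 'a|3' 'a|4' 'c|5' 'b|6')

# First line of every key: duplicates are removed.
first=$(printf '%s\n' "$input" | awk -F'|' '!seen[$1]++')

# One line per key that occurs more than once (its second occurrence).
repeated=$(printf '%s\n' "$input" | awk -F'|' 'seen[$1]++ == 1')

printf 'first of each key:\n%s\n' "$first"
printf 'repeated keys:\n%s\n' "$repeated"
```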

Related

can't understand this bash script $ awk -F

Can someone help me understand what these scripts mean?
$ awk -F',' '{ data[$4]+=$29;}END{c=0; for (i in data) { if (data[i]+0<1000000) {c++}} ;print c}' file.csv
thanks

This script iterates line by line over the input file file.csv. The file is apparently comma-separated, so the field separator is set with -F','. For every line, the array element data[$4], keyed by the content of the 4th field, is increased by the value of the 29th field ($29), so data ends up holding one running total per distinct value of field 4. After all lines have been processed, the END section runs: it iterates over the data array, checks whether each total is below 1000000 (the +0 forces a numeric comparison), and finally prints c, the number of totals for which that was true.
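The same pattern can be tried on a tiny input. The sketch below uses made-up data with the key in column 1 and the amount in column 2, and a threshold of 100, in place of columns 4 and 29 and 1000000 from the question.

```shell
#!/bin/sh
# Made-up data: column 1 is the group key, column 2 the amount.
below=$(printf '%s\n' 'east,60' 'west,30' 'east,70' 'north,99' |
    awk -F',' '
        { data[$1] += $2 }                  # running total per key
        END {
            c = 0
            for (i in data)                 # visit every key once
                if (data[i] + 0 < 100) c++  # count totals below the threshold
            print c
        }')
echo "$below"
```

Here east totals 130, west 30 and north 99, so two of the three totals fall below the threshold.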

Converting lines in chunks into tab delimited

I have the following lines in 2 chunks (actually there are ~10K of that).
And in this example each chunk contain 3 lines. The chunks are separated by an empty line. So the chunks are like "paragraphs".
xox
91-233
chicago

koko
121-111
alabama
I want to turn it into tab-delimited lines, like so:
xox 91-233 chicago
koko 121-111 alabama
How can I do that?
I tried tr "\n" "\t", but it doesn't do what I want.

$ awk -F'\n' '{$1=$1} 1' RS='\n\n' OFS='\t' file
xox 91-233 chicago
koko 121-111 alabama
How it works
Awk divides input into records and it divides each record into fields.
-F'\n'
This tells awk to use a newline as the field separator.
$1=$1
This tells awk to assign the first field to the first field. While this seemingly does nothing, it causes awk to treat the record as changed. As a consequence, the record is rebuilt and printed using our assigned value for OFS, the output field separator.
1
This is awk's cryptic shorthand for print the line.
RS='\n\n'
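This tells awk to treat two consecutive newlines as a record separator. (A multi-character RS is an extension found in gawk and mawk; POSIX leaves its behavior unspecified.)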
OFS='\t'
This tells awk to use a tab as the field separator on output.

This answer offers the following:
* It works with blocks of nonempty lines of any size, separated by any number of empty lines; John1024's helpful answer (which is similar and came first) works with blocks of lines separated by exactly one empty line.
* It explains the awk command used in detail.
A more idiomatic (POSIX-compliant) awk solution:
awk -v RS= -F '\n' -v OFS='\t' '$1=$1""' file
-v RS= tells awk to operate in paragraph mode: consider each run of nonempty lines a single record; RS is the input record separator.
Note: The implication is that this solution considers one or more empty lines as separating paragraphs (line blocks); empty means: no line-internal characters at all, not even whitespace.
-F '\n' tells awk to consider each line of an input paragraph its own field (breaks the multiline input record into fields by lines); -F sets FS, the input field separator.
-v OFS='\t' tells awk to separate fields with \t (tab chars.) on output; OFS is the output field separator.
$1=$1"" looks like a no-op, but, due to assigning to field variable $1 (the record's first field), tells awk to rebuild the input record, using OFS as the field separator, thereby effectively replacing the \n separators with \t.
The trailing "" is to guard against the edge case of the first line in a paragraph evaluating to 0 in a numeric context; appending "" forces treatment as a string, and any nonempty string - even if it contains "0" - is considered true in a Boolean context - see below.
Given that $1 is by definition nonempty and given that assignments in awk pass their value through, the result of assignment $1=$1"" is also a nonempty string; since the assignment is used as a pattern (a condition), and a nonempty string is considered true, and there is no associated action block ({ ... }), the implied action is to print the - rebuilt - input record, which now consists of the input lines separated with tabs, terminated by the default output record separator (ORS), \n.
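To see paragraph mode cope with uneven input, here is a small sketch on made-up data: two blocks of different sizes separated by two empty lines rather than one.

```shell
#!/bin/sh
# Made-up input: a 3-line block, two empty lines, then a 2-line block.
joined=$(printf 'xox\n91-233\nchicago\n\n\nkoko\n121-111\n' |
    awk -v RS= -F '\n' -v OFS='\t' '$1=$1""')
printf '%s\n' "$joined"
```

Each block comes out as one tab-separated line, whatever its length.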

another alternative,
$ sed '/^$/d' file | pr -3ats$'\t'
xox 91-233 chicago
koko 121-111 alabama
This removes the empty lines with sed and prints the rest in 3 columns with a tab delimiter. For your real file, replace 3 with the number of lines per block.
Note that this will only work if all your blocks are of the same size.

xargs -L3 < filename.log |tr ' ' '\t'
xox 91-233 chicago
koko 121-111 alabama

Another awk version, assuming every block has exactly 3 lines (it joins whole lines and avoids a trailing tab):
awk 'NF { a = (a == "" ? $0 : a "\t" $0); if (++i % 3 == 0) { print a; a = "" } }' input_file

Print first, penultimate and last fields in CSV file

I have a big comma-separated file with 20000 rows and five columns. I want to extract particular columns, but some fields hold several comma-separated values in quotes, so every row except the header contains extra commas. How can I cut such columns?
example file:
name,v1,v2,v3,v4,v5
as,"10,12,15",21,"12,11,10,12",5,7
bs,"11,15,16",24,"19,15,18,23",9,3
This is my desired output:
name,v4,v5
as,5,7
bs,9,3
I tried the following cut command, but it doesn't work:
cut -d, -f1,5,6

In general, for these scenarios it is best to use a proper CSV parser. You can find those in Python, for example.
However, since the fields with embedded commas in your data all come before the last two columns, you can print the first field and then the penultimate and last ones:
$ awk 'BEGIN{FS=OFS=","} {print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3

In TXR Lisp:
$ txr extract.tl < data
name,v4,v5
as,5,7
bs,9,3
Code in extract.tl:
(mapdo
  (lambda (line)
    (let ((f (tok-str line #/"[^"]*"|[^,]+/)))
      (put-line `@[f 0],@[f 4],@[f 5]`)))
  (get-lines))
As a condensed one liner:
$ txr -t '(mapcar* (do let ((f (tok-str @1 #/"[^"]*"|[^,]+/)))
`@[f 0],@[f 4],@[f 5]`) (get-lines))' < data

merge csv unix based on column 1

Hi I have two csv files having same columns like,
x.csv
column1,column2
A,2
B,1
y.csv
column1,column2
A,1
C,2
I want the output like:
z.csv
column1,column2
A,2
B,1
C,2
i.e. for matching data in the first column I want to keep the x.csv record (like A,2), and for a new key that appears only in y.csv I just want to append its record (like C,2).
Thanks

$ awk -F, 'NR==FNR{a[$1]; print; next} ! ($1 in a)' x.csv y.csv
column1,column2
A,2
B,1
C,2
How it works
-F,
This tells awk to use a comma as the field separator
NR==FNR{a[$1]; print; next}
While reading the first file (NR==FNR), this tells awk to (a) add $1 as a key to the associative array a, (b) print the line, and (c) skip the remaining commands and jump to the next input line.
! ($1 in a)
If we get here, that means we are working on the second file. In that case, we print the line if the first field is not a key of array a (in other words, if the first field did not appear in the first file).

Can awk deal with CSV file that contains comma inside a quoted field?

I am using awk to compute the sum of one column in a CSV file. The data format is something like:
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
I was using this awk script to compute the sum:
awk -F, '{sum+=$3} END {print sum}'
Some of the values in the name field contain commas, and this breaks my awk script.
My question is: can awk solve this problem? If yes, how can I do that?
Thank you.

One way using GNU awk and FPAT
awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt
Result:
192

I am using
`FPAT="([^,]+)|(\"[^\"]+\")" `
to define the fields with gawk. I found that when a field is empty, this does not recognize the correct number of fields, because "+" requires at least one character in the field.
I changed it to:
`FPAT="([^,]*)|(\"[^\"]*\")"`
replacing "+" with "*". It works correctly.
I also found that the GNU Awk User's Guide has the same problem:
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

You're probably better off doing it in perl with Text::CSV, since that's a fast and robust solution.

You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.
Here is the command:
csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'
see https://github.com/dbro/csvquote for the code

For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:
$ cat decsv.awk
BEGIN{ fs=FS; FS=RS }
{
    decsv()
    for (i=1;i<=NF;i++) {
        printf "Record %d, Field %d is <%s>\n" ,NR,i,$i
    }
    print ""
}
function decsv(    curr,head,tail)
{
    tail = $0
    while ( match(tail,/"[^"]+"/) ) {
        head = substr(tail, 1, RSTART-1);
        gsub(fs,RS,head)
        curr = curr head substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART + RLENGTH)
    }
    gsub(fs,RS,tail)
    $0 = curr tail
}
$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>
Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>
Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>
Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>
It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes and even then it's not too hard and it's all been done before...
See What's the most robust way to efficiently parse CSV using awk? for more information.

You can always tackle the problem from the source. Put quotes around the name field, just like the field of "I am the, question". This is much easier than spending your time coding workarounds for that.
Update(as Dennis requested). A simple example
$ s='id, "name1,name2", value 1, foo, 17 2, bar, 76 3, "I am the, question", 99'
$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99
$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question
As you can see, when the delimiter is set to the double quote, the quoted fields always land at even-numbered positions. Since the OP doesn't have the luxury of modifying the source data, this method will not be appropriate for them.

This article did help me solve this same data field issue. Most CSV will put a quote around fields with spaces or commas within them. This messes up the field count for awk unless you filter them out.
If you need the data within those fields that contain the garbage, this is not for you. ghostdog74 provided the answer, which empties those fields but keeps the total field count the same, which is key to keeping the data output consistent. I did not like how this solution introduced new lines, so this is the version I used.
The first three fields never had this problem in the data. The fourth field, containing the customer name, often did, but I needed that data. The remaining fields that exhibit the problem I could throw away, because they were not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74 gave to empty the remaining fields that have commas within them. This also removes the quotes, but I use printf to keep the data in a single record. I start off with 85 fields and end up with 85 fields in all cases from my 8000+ lines of messy data. A perfect score!
grep -i "$1" "$dbfile" | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf "%s", $i; printf "\n" }' > "$tmpfile"
The solution that empties the fields with commas within them but also keeps each record on one line is, of course:
awk -F'"' '{ for(i=1;i<=NF;i+=2) printf "%s", $i; printf "\n" }'
Megs of thanks to ghostdog74 for the great solution!
NetsGuy256

FPAT is the elegant solution because it can handle the dreaded commas within quotes problem, but to sum a column of numbers in the last column regardless of the number of preceding separators, $NF works well:
awk -F"," '{sum+=$NF} END {print sum}'
To access the second to last column, you would use this:
awk -F"," '{sum+=$(NF-1)} END {print sum}'

If you know for sure that the 'value' column is always the last column:
awk -F, '{sum+=$NF} END {print sum}'
NF holds the number of fields in the current record, so $NF is the last column.
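On the sample data from the question, the difference between summing $3 and summing $NF can be checked with a short sketch (the variable names are just for illustration):

```shell
#!/bin/sh
# The sample data from the question; the quoted name contains a comma.
data=$(printf '%s\n' 'id, name, value' '1, foo, 17' '2, bar, 76' '3, "I am the, question", 99')

# Summing field 3 misses the last row, where field 3 is part of the name.
by_third=$(printf '%s\n' "$data" | awk -F, '{ sum += $3 } END { print sum }')

# Summing the last field is unaffected by extra commas earlier in the row.
by_last=$(printf '%s\n' "$data" | awk -F, '{ sum += $NF } END { print sum }')

echo "third field: $by_third, last field: $by_last"
```

The header row contributes 0 in both cases because " value" is not numeric.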

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file
allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

You could try piping the file through a perl regex to convert the quoted , into something else, like a |.
cat test.csv | perl -p -e "s/(\".+?)(,)(.+?\")/\1\|\3/g" | awk -F, '{...
The regex above assumes there is exactly one comma within each pair of double quotes, so more work would be needed to handle quoted fields with no comma or with several.

You can write a function in awk like the one below:
$ awk 'func isnum(x){return(x==x+0)}BEGIN{print isnum("hello"),isnum("-42")}'
0 1
You can incorporate this function in your script and check whether the third field is numeric. If it is not, try the 4th field, and if that is not numeric either, try the 5th, and so on until you reach a numeric value. A loop will probably help here; add the value it finds to the sum.
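The loop described above might look like the following sketch. It is one possible reading of the idea, not a tested general solution: it spells the keyword as `function` (the portable form of `func`), splits on a comma followed by optional spaces, and assumes the first numeric-looking field from position 3 onward is the value.

```shell
#!/bin/sh
# Sketch: starting at field 3, add the first field that looks numeric.
total=$(printf '%s\n' 'id, name, value' '1, foo, 17' '2, bar, 76' '3, "I am the, question", 99' |
    awk -F', *' '
        function isnum(x) { return x == x + 0 }
        {
            for (i = 3; i <= NF; i++)
                if (isnum($i)) { sum += $i; break }
        }
        END { print sum }')
echo "$total"
```

This would pick the wrong value if the quoted text itself contained a number after a comma, so it is only suitable when the name field is known not to.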
