Given the following csv, with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
id | text
---+--------------
1  | line 1 line 2
2  | line 1 line 2
1  | line 1 line 2
If I use the following awk command to remove duplicate rows from this csv based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
which displays as:
id | text
---+--------------
1  | line 1 line 2
2  | line 1
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something though.
Additional info: I'm writing these CSVs according to the RFC 4180 standard. Most notably, fields containing line breaks, double quotes, or commas are enclosed in double quotes, and double quotes appearing inside a field are escaped with a preceding double quote.
Also, I'm writing the CSV in Node/JS, but I've found awk to be a really simple, fast way to dedupe very large files in the past; none of those had multiline fields, though.
I'm by no means bound to awk; I'm open to any and all suggestions. I just wanted to be clear about what I've tried. Thanks!
For the samples shown, please try the following awk code. It was written and tested in GNU awk, but should work in any awk.
awk -F',' '
FNR>1{
sub(/^"/,"",$2)
sub(/"$/,"",$3)
gsub(/"/,"",$1)
print $1 OFS $2 ORS " " $3
}
' <(awk '{printf("%s%s",$0!~/^"/?",":FNR>1?ORS:"",$0)} END{print ""}' Input_file)
Explanation: the first awk joins each record onto a single line (appending any line that does not start with ") and sends its output as input to the main awk, which prints the required id and line values.
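If it helps to see the pipeline in two halves, the first stage can be run on its own. A sketch, assuming the question's sample is saved as Input_file (the name used in the answer):

```shell
# Recreate the question's sample (Input_file is the name used in the answer)
cat > Input_file <<'EOF'
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
EOF

# Stage 1 only: a physical line that does not start with a double quote is a
# continuation of the previous record, so it is glued on with a comma
awk '{printf("%s%s",$0!~/^"/?",":FNR>1?ORS:"",$0)} END{print ""}' Input_file
# -> "id","text"
# -> "1","line 1,line 2"
# -> "2","line 1,line 2"
```

Each logical record is now one physical line, which is what lets the main awk split it on commas.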
As others have pointed out, you need a CSV-aware tool to properly handle the line breaks inside the rows.
GoCSV was made for this: it's fast, light on memory, CSV-savvy, and pre-built for a number of platforms.
Its unique subcommand will keep only the first row based on the occurrence of a value or set of values in a column or set of columns.
To drop duplicate rows based on the text column:
gocsv unique -c 'text' input.csv > de-duped.csv
It can even tell you how many dupes it found along the way:
gocsv unique -c 'text' -count input.csv > de-duped.csv
How fast, and how light on memory?
I mocked up a 1_000_000 row CSV with two columns of random text and embedded line breaks (also includes commas and quoted quotes):
ll -h gen_1000000x3.csv
-rw-r--r-- 1 zyoung staff 52M Apr 26 09:36 gen_1000000x3.csv
cat gen_1000000x3.csv
ID,Col1,Col2
0,"ddddd
"","" oooooo","wwwwww
"","" nnnnnnn"
1,"llllllll
"","" ccccccc","iiiiiiii
"","" wwwww"
2,"nnnnn
"","" iiiiiiii","ooooo
"","" kkkkkkkk"
...
On my M1 MacBook Air, de-duping the 1-million row, 52 MB CSV took a half-second and consumed only 13 MB of memory:
/usr/bin/time -l gocsv unique -c Col2 gen_1000000x3.csv > de-duped.csv
0.45 real 0.49 user 0.05 sys
...
13124608 peak memory footprint
Over 989_000 duplicate rows were dropped:
gocsv dims de-duped.csv
Dimensions:
Rows: 10816
Columns: 3
We can count the instances of each value in Col2 (counting consumed 175 MB of memory):
gocsv unique -c Col2 -count gen_1000000x3.csv > de-duped.csv
GoCSV can also display multi-line rows in the terminal:
+--------+---------------+---------------+-------+
| ID | Col1 | Col2 | Count |
+--------+---------------+---------------+-------+
| 0 | ddddd | wwwwww | 80 |
| | "," oooooo | "," nnnnnnn | |
+--------+---------------+---------------+-------+
| 1 | llllllll | iiiiiiii | 89 |
| | "," ccccccc | "," wwwww | |
+--------+---------------+---------------+-------+
| 2 | nnnnn | ooooo | 97 |
| | "," iiiiiiii | "," kkkkkkkk | |
...
I cannot compare the awk scripts suggested so far: one just doesn't do anything in my terminal, and the other requires GNU awk, which I don't have. But awk will come out slower: it took 3x longer just to run awk '{print $0}' gen_1000000x3.csv > /dev/null, and that's not even doing meaningful work. And then there are the hoops you have to jump through to program a CSV parser from scratch.
Awk is not CSV-aware, so it's not really the right tool for the job. There are a few CSV-aware tools floating around the internet; maybe you could take a look at them.
You did mention the file being large, but if it fits in your memory, this is a variation of something I needed a few weeks back. It uses GNU awk's FPAT, so it's not really fast:
$ gawk '
BEGIN {
RS="^$" # read in whole file
FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")" # regex magic
OFS=","
}
{
for(i=1;i<NF;i+=2) # iterate fields 2 at a time
if(!a[$i]++) # if first field not seen before
print $i,$(i+1) # output 2 fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, though.
A great and very simple CSV-aware tool is Miller.
Running
mlr --csv uniq -a input.csv >output.csv
you will get:
id,text
1,"line 1
line 2"
2,"line 1
line 2"
It also has great documentation: here is the page for the uniq verb.
I'm trying to use PowerShell to search a CSV file and output a list of its duplicate lines. I can accomplish this pretty easily in bash with the following:
uniq -d myfile.csv > list.csv
In Powershell I can output a list of unique lines but how do I modify Get-Unique to display only the duplicate lines like I did in bash?
Get-Content c:\file\myfile.csv | Get-Unique | Set-Content c:\file\list1.csv
It's a bit weird to use the unique tool to get the duplicates. How about:
gc .\test.csv | group -NoElement |? Count -gt 1 | select -expand name
This groups the lines by how many there are, identifies the ones with duplicates, and outputs them. e.g. if:
test.csv contains:
a,b,c
d,e,f
a,b,c
z,z,z
gc test.csv | group
Count Name Group
----- ---- -----
2 a,b,c {a,b,c, a,b,c}
1 d,e,f {d,e,f}
1 z,z,z {z,z,z}
1 {}
and -NoElement stops it from building the group contents, which are redundant in this case.
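A side note on the bash one-liner in the question: uniq -d only reports adjacent duplicate lines, so on an unsorted file some duplicates are missed unless you sort first. A quick sketch with the same sample data:

```shell
# The sample rows, with the duplicate pair deliberately non-adjacent
printf '%s\n' 'a,b,c' 'd,e,f' 'a,b,c' 'z,z,z' > test.csv

# Plain `uniq -d test.csv` would report nothing here, because the two
# "a,b,c" lines are not next to each other; sorting brings them together
sort test.csv | uniq -d
# -> a,b,c
```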
I have a text file with a list of IDs like this:
123
456
789
I would like to use them in queries to look up more information about these items. So far I have unsuccessfully tried:
awk '{print $1;}' 'pathto/file.txt' | xargs echo "SELECT * FROM table WHERE id='{}');" | mysql -u uname -p database
IDS=$(cat /path/to/file)
IDS=$(echo $IDS | sed 's/\s\s*/,/g')
#or IDS=$(echo $IDS | awk '{$1=$1}1' RS= OFS=,) on OSX
echo "SELECT * FROM table WHERE id in ($IDS)" | mysql -u uname -p database
Line 1 loads the IDs into a variable; since the expansion in line 2 is unquoted, the line breaks collapse to spaces, which sed then replaces with commas to build an IN (...) construct.
Edits 1 and 2 see comments
Edit 3
On OSX (which I don't use or have access to), it seems sed doesn't like my regex (BSD sed doesn't support \s). As the OP points out, using IDS=$(echo $IDS | awk '{$1=$1}1' RS= OFS=,) as line 2 works around this.
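If sed portability is a concern, one alternative worth sketching: POSIX paste can serialize the file into a single comma-joined line, which sidesteps the \s issue entirely (ids.txt below stands in for /path/to/file):

```shell
printf '%s\n' 123 456 789 > ids.txt   # stand-in for the ID file

# -s serializes all lines into one; -d, joins them with commas
IDS=$(paste -sd, ids.txt)
echo "SELECT * FROM table WHERE id IN ($IDS)"
# -> SELECT * FROM table WHERE id IN (123,456,789)
```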
I have 2 csv files and I'm looking for a way to compare them using a specific column, and once a match is found I need to take the value of another column from the matched row and put it in the corresponding column of the other record.
I'll try to explain a little bit more.
One csv has product_id,product_name,brand_name,price
the other has product_id,product_category,product_name,brand_name,price
I need to compare the 2 files by finding the rows that have a matching product_id value, and once found I need to take the price value from file 1 and put it to the matched record's price in file 2.
After extensive research I've come to the conclusion that this may be achievable with PowerShell.
Does anyone have any ideas about how I could do that? Thank you for your time.
Since this is just a one-time action, you could open the CSV files in a spreadsheet (Google Docs, Excel, ...) and do a VLOOKUP. It's easy:
To demonstrate this, imagine the following spreadsheet with both CSV files side by side: the first in columns A to B and the second in columns D to F.
| A | B | C | D | E | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price | | product_id | product_category | price
2 | 1 | 29.9 | | 2 | SOME CAT 1 | =IFERROR(VLOOKUP(D2;A:B;2;FALSE); "NULL")
3 | 2 | 35.5 | | 3 | SOME CAT 2 | =IFERROR(VLOOKUP(D3;A:B;2;FALSE); "NULL")
The VLOOKUP function will search for an exact match of the value of cell D2 in the first column of the region A:B, and return the value from the second column of that region. The IFERROR will return NULL if the VLOOKUP fails.
So in this case, cell F2 looks for the product id "2" (cell D2) in column A. It finds product id "2" in row 3 and returns the price "35.5" (the second column of the range A:B). After all rows have been calculated, the result will be:
| A | B | C | D | E | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price | | product_id | product_category | price
2 | 1 | 29.9 | | 2 | SOME CAT 1 | 35.5
3 | 2 | 35.5 | | 3 | SOME CAT 2 | NULL
One could also use awk for this; say you have:
$ cat a.csv
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
20,pname20,bname20,300
$ cat b.csv
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
With the "FNR==NR" approach (see e.g. The Unix shell: comparing two files with awk):
$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0;next}}($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}' a.csv b.csv
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
With reading each file into an array (see e.g. Awking it – how to load a file into an array in awk | Tapping away):
$ awk -F, 'BEGIN{while(getline < "a.csv"){if(!/^#/){a[$1]=$0;}}close("a.csv");while(getline < "b.csv"){if($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}}close("b.csv");}'
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
In essence, the two approaches do the same thing:
read the first file (a.csv), and store its lines in an associative array a, keyed/indexed by the first field $1 of that line (in this case, product_id);
then read the second file (b.csv); and if the first field of each of its lines is found in the array a; then output the first four fields of the current line of b.csv; and the fourth field (price) from the corresponding entry in array a
The difference is that with the FNR==NR approach, you specify the input files on the command line as arguments to awk, and you can basically only identify the first file as "special" to store it in an array. With the second approach, each input file could be parsed into its own array; however, the input files are named in the awk script itself rather than passed as arguments, and since no file arguments are used, the entire awk script has to happen within a BEGIN{...} block.
When lines are read from the files, they are automatically split into fields according to the -F, command-line option, which sets comma as the delimiter; however, when retrieving lines stored in the array, we have to split() them ourselves.
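That split() behavior can be checked in isolation: it returns the number of fields it produced and fills the array by index. A minimal sketch using one of the stored a.csv lines:

```shell
# split() the stored a.csv record at commas: n is the field count,
# and tmp[4] is the price column that gets spliced back into the output
awk 'BEGIN{
  n = split("10,pname10,bname10,200", tmp, ",")
  print n, tmp[4]
}'
# -> 4 200
```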
Breakdown for the first:
FNR==NR # if FNR (input record number in the current input file) equals NR (total num records so far)
# only true when the first file is being read
{
if(!/^#/) # if the current line does not `!` match regex `/.../` of start `^` with `#`
{
a[$1]=$0; # assign current line `$0` to array `a`, with index/key being first field in current line `$1`
next # skip the rest, and start processing next line
}
}
# --this section below executes when FNR does not equal NR;--
($1 in a) # first, check if first field `$1` of current line is in array `a`
{
split(a[$1],tmp,","); # split entry `a[$1]` at commas into array `tmp`
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # print reconstructed current line,
# taking the fourth field from the `tmp` array
}
Breakdown for the second:
BEGIN{ # since no file arguments here, everything goes in BEGIN block
while(getline < "a.csv"){ # while reading lines from first file
if(!/^#/){ # if the current line does not `!` match regex `/.../` of start `^` with `#`
a[$1]=$0; # store current line `$0` to array `a`, with index/key being first field in current line `$1`
}
}
close("a.csv");
while(getline < "b.csv"){ # while reading lines from second file
if($1 in a){ # first, check if first field `$1` of current line is in array `a`
split(a[$1],tmp,","); # (same as above)
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # (same as above)
}
}
close("b.csv");
} # end BEGIN
Note about the execution with FNR==NR:
$ awk -F, 'FNR==NR{print "-";} (1){print;}' a.csv b.csv # or:
$ awk -F, 'FNR==NR{print "-";} {print;}' a.csv b.csv
-
#product_id,product_name,brand_name,price
-
1,pname1,bname1,100
-
10,pname10,bname10,200
-
20,pname20,bname20,300
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
$ awk -F, 'FNR==NR{print "-";} FNR!=NR{print;}' a.csv b.csv
-
-
-
-
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
That means that the "this section below executes when FNR does not equal NR;" comment above is in principle wrong - even if that is how that particular example ends up behaving.
I have been trying to delete rows (records) from a csv in which entries in a specific column match the entries in the other csv.
The csv structure is roughly like this:
1.csv
Col1,Col2,Col3,Col4,Col5
sasdf,3432,fjkdk,fjjof,1234
efvr,4565,fhjs,dihi,9999
asa,234,rgs,fkjf,0102
aaa,456,jfvv,dofh,4565
ths,7865,fhjf,fhks,3212
2.csv
Col1
1234
3212
0102
4565
So as you can see, there are some values in col5 of 1.csv that appear in col1 of 2.csv
I want to use awk to delete the rows (records) from 1.csv that match col1 of 2.csv
So the output would look like this:
3.csv
Col1,Col2,Col3,Col4,Col5
efvr,4565,fhjs,dihi,9999
Here is the awk script I used:
awk -F"," 'NR==FNR{array1[FNR]=$1};NR>FNR{array1[FNR]!~$5}' 2.csv 1.csv > 3.csv
It did not work.
This will do the trick:
$ awk -F, 'NR==FNR{a[$1];next}!($5 in a)' 2.csv 1.csv
Col1,Col2,Col3,Col4,Col5
efvr,4565,fhjs,dihi,9999
$ awk -F, 'NR==FNR{a[$1];next}!($5 in a)' 2.csv 1.csv > 3.csv
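For what it's worth, here is why this works: NR==FNR is true only while the first file (2.csv) is being read, so a[$1] records each of its values as an array key and next skips the rest of the script; for 1.csv, the bare pattern !($5 in a) prints only the rows whose fifth field was never recorded. A self-contained rerun of the example above:

```shell
# Recreate the two sample files from the question
cat > 2.csv <<'EOF'
Col1
1234
3212
0102
4565
EOF
cat > 1.csv <<'EOF'
Col1,Col2,Col3,Col4,Col5
sasdf,3432,fjkdk,fjjof,1234
efvr,4565,fhjs,dihi,9999
asa,234,rgs,fkjf,0102
aaa,456,jfvv,dofh,4565
ths,7865,fhjf,fhks,3212
EOF

# Keep only the 1.csv rows whose Col5 never appears in 2.csv's Col1
# (the header survives because "Col5" is not a key in the array)
awk -F, 'NR==FNR{a[$1];next}!($5 in a)' 2.csv 1.csv
# -> Col1,Col2,Col3,Col4,Col5
# -> efvr,4565,fhjs,dihi,9999
```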