Given the following csv, with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
id
text
1
line 1 line 2
2
line 1 line 2
1
line 1 line 2
If I use the following awk command to remove duplicate rows from this csv based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
which displays as:
id
text
1
line 1 line 2
2
line 1
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something though.
Additional info: I'm writing these csv's according to RFC4180 standards—most notably, fields containing line breaks, double quotes, and commas are enclosed in double-quotes. And double quotes appearing inside a field are escaped with a preceding double quote.
Also, I'm writing the csv in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past—none had multiline fields though.
I'm by no means bound to awk—I'm open to any/all suggestions—just wanted to be clear about what I've tried. Thanks!
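For reference, the root cause can be reproduced in isolation: by default awk treats each physical line as a record, so a quoted field containing a line break spans two records. A minimal sketch (any POSIX awk; `demo.csv` is a throwaway file name):

```shell
# Build a two-row CSV whose second row has a multiline "text" field.
printf '%s\n' '"id","text"' '"1","line 1' 'line 2"' > demo.csv

# awk sees THREE records, not two; $1 of record 3 is the continuation text.
awk -F, '{print NR ": " $1}' demo.csv
# → 1: "id"
# → 2: "1"
# → 3: line 2"
```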
Based only on your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any awk.
awk -F',' '
FNR > 1 {
  sub(/^"/, "", $2)   # strip the leading quote from the text field
  sub(/"$/, "", $3)   # strip the trailing quote from the continuation part
  gsub(/"/, "", $1)   # strip the quotes around the id
  print $1 OFS $2 ORS " " $3
}
' <(awk '{printf("%s%s", $0 !~ /^"/ ? "," : FNR > 1 ? ORS : "", $0)} END{print ""}' Input_file)
Explanation: the first awk joins each multiline record onto a single line (any line that does not start with " is treated as a continuation and appended to the previous line with a comma separator), and its output is fed as input to the main awk, which prints the required id and line values.
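A self-contained variation of the same join-then-process idea, as a sketch: join continuation lines onto their record using a placeholder byte, dedupe whole records on the first field, then restore the newlines. Like the code above, it assumes every record starts with `"` and additionally that no field contains a literal \001 byte; `file-01.csv` is recreated from the question's sample.

```shell
# Recreate the question's sample (records start with a quote;
# continuation lines do not).
printf '%s\n' '"id","text"' '"1","line 1' 'line 2"' \
  '"2","line 1' 'line 2"' '"1","line 1' 'line 2"' > file-01.csv

# 1) Join continuation lines, using \001 as a stand-in for the embedded newline.
# 2) Dedupe whole records on field 1 with the usual !seen[$1]++ idiom.
# 3) Turn the placeholder back into a real newline.
awk '{printf "%s%s", (NR > 1 ? ($0 ~ /^"/ ? "\n" : "\001") : ""), $0}
     END {print ""}' file-01.csv \
  | awk -F, '!seen[$1]++' \
  | tr '\001' '\n'
```

This prints the deduped CSV with the multiline fields intact (the duplicate id "1" record is dropped).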
As others have pointed out, you need a CSV-aware tool to properly handle the line breaks inside the rows.
GoCSV was made for this: it's fast, pretty good w/memory, is CSV savvy, and is pre-built for a number of platforms.
Its unique subcommand will keep only the first row based on the occurrence of a value or set of values in a column or set of columns.
To drop duplicate rows based on the text column:
gocsv unique -c 'text' input.csv > de-duped.csv
It can even tell you how many dupes it found along the way:
gocsv unique -c 'text' -count input.csv > de-duped.csv
How fast, and how good with memory?
I mocked up a 1_000_000 row CSV with two columns of random text and embedded line breaks (also includes commas and quoted quotes):
ll -h gen_1000000x3.csv
-rw-r--r-- 1 zyoung staff 52M Apr 26 09:36 gen_1000000x3.csv
cat gen_1000000x3.csv
ID,Col1,Col2
0,"ddddd
"","" oooooo","wwwwww
"","" nnnnnnn"
1,"llllllll
"","" ccccccc","iiiiiiii
"","" wwwww"
2,"nnnnn
"","" iiiiiiii","ooooo
"","" kkkkkkkk"
...
On my M1 MacBook Air, de-duping the 1-million row, 52 MB CSV took a half-second and consumed only 13 MB of memory:
/usr/bin/time -l gocsv unique -c Col2 gen_1000000x3.csv > de-duped.csv
0.45 real 0.49 user 0.05 sys
...
13124608 peak memory footprint
Over 989_000 duplicate rows were dropped:
gocsv dims de-duped.csv
Dimensions:
Rows: 10816
Columns: 3
We can count the instances of each value in Col2 (counting consumed 175 MB of memory):
gocsv unique -c Col2 -count gen_1000000x3.csv > de-duped.csv
GoCSV can also display multi-line rows in the terminal:
+--------+---------------+---------------+-------+
| ID | Col1 | Col2 | Count |
+--------+---------------+---------------+-------+
| 0 | ddddd | wwwwww | 80 |
| | "," oooooo | "," nnnnnnn | |
+--------+---------------+---------------+-------+
| 1 | llllllll | iiiiiiii | 89 |
| | "," ccccccc | "," wwwww | |
+--------+---------------+---------------+-------+
| 2 | nnnnn | ooooo | 97 |
| | "," iiiiiiii | "," kkkkkkkk | |
...
I cannot compare the awk scripts suggested so far: one just doesn't do anything in my terminal, and the other requires GNU awk, which I don't have. But awk will come out slower: it takes 3x longer just to run awk '{print $0}' gen_1000000x3.csv > /dev/null, and that's not even doing meaningful work. And that's before the hoops you have to jump through to program a CSV parser from scratch.
Awk is not CSV-aware, so it's not really the right tool for the job. There are a few CSV implementations floating around the internet; maybe you could take a look at them.
You did mention the file being large, but if it fits in your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT, so it's not really fast:
$ gawk '
BEGIN {
RS="^$" # read in whole file
FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")" # a field is either unquoted text (no comma/newline) or a quoted string with "" as an escaped quote
OFS=","
}
{
for(i=1;i<NF;i+=2) # iterate fields 2 at a time
if(!a[$i]++) # if first field not seen before
print $i,$(i+1) # output 2 fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, tho.
A great and very simple CSV-aware tool is Miller.
Running
mlr --csv uniq -a input.csv >output.csv
You will have
id,text
1,"line 1
line 2"
2,"line 1
line 2"
It also has great documentation: this is the page for the uniq verb.
Related
I have a csv log file in the following format. I'm not very good at awk/sed. Would someone tell me how to extract data for the past hour, 6h, and 24h?
This is the format of my log:
blabla,11:04:44,Alarm,121,TBD,TBD
blabla,11:04:50,Alarm,121,TBD,00:00:05
blabla,11:04:54,Warning,121,00:00:09,00:00:05
blabla,11:06:12,Alarm,125,TBD,TBD
blabla,11:06:42,Alarm,125,TBD,00:00:29
blabla,16:06:55,Warning,125,00:00:41,00:00:29
blabla,16:09:13,Alarm,125,TBD,TBD
blabla,16:10:32,Alarm,125,TBD,TBD
blabla,16:14:50,Alarm,125,TBD,TBD
blabla,16:15:00,Normal,125,00:00:10,TBD
blabla,16:15:03,Normal,125,00:00:10,00:00:13
blabla,20:04:08,Alarm,121,TBD,TBD
blabla,20:04:21,Normal,121,00:00:13,TBD
blabla,20:04:25,Normal,121,00:00:13,00:00:16
blabla,20:06:12,Alarm,125,TBD,TBD
So assuming that the time now is 21:00:00, and I need data from past hour, the output would be:
blabla,20:04:08,Alarm,121,TBD,TBD
blabla,20:04:21,Normal,121,00:00:13,TBD
blabla,20:04:25,Normal,121,00:00:13,00:00:16
blabla,20:06:12,Alarm,125,TBD,TBD
For past 6h the output should be:
blabla,16:06:55,Warning,125,00:00:41,00:00:29
blabla,16:09:13,Alarm,125,TBD,TBD
blabla,16:10:32,Alarm,125,TBD,TBD
blabla,16:14:50,Alarm,125,TBD,TBD
blabla,16:15:00,Normal,125,00:00:10,TBD
blabla,16:15:03,Normal,125,00:00:10,00:00:13
blabla,20:04:08,Alarm,121,TBD,TBD
blabla,20:04:21,Normal,121,00:00:13,TBD
blabla,20:04:25,Normal,121,00:00:13,00:00:16
blabla,20:06:12,Alarm,125,TBD,TBD
Etc.
I tried to come up with something on my own, just by looking at other answers, but I seem to get no output with these:
awk '$0>=from && $0<=to' from="$(date +"%H:%M:%S -d '1 hour ago'")" to="$(date +"%H:%M:%S")" logfile.csv (this actually produces error unexpected EOF while looking for matching)
and
sed -n "/^[^,]*,[^$(date --date='24 hours ago' '+%H:%M:%S'),],[^,]*,[^,]*,[^,]*,[^,]*/,\$p" logfile.csv
Using Miller (https://github.com/johnkerl/miller) and running
mlr --csv -N put '$sourcetime=$2' \
then nest --explode --values --across-fields --nested-fs ":" -f 2 \
then put '$seconds=$2_1*3600+$2_2*60+$2_3' \
then filter '(21*3600-$seconds)<3600' \
then cut -x -r -f '(_|sec)' input
you will have
+--------+--------+-----+----------+----------+------------+
| blabla | Alarm | 121 | TBD | TBD | 20:04:08 |
| blabla | Normal | 121 | 00:00:13 | TBD | 20:04:21 |
| blabla | Normal | 121 | 00:00:13 | 00:00:16 | 20:04:25 |
| blabla | Alarm | 125 | TBD | TBD | 20:06:12 |
+--------+--------+-----+----------+----------+------------+
I converted the time into seconds ($seconds=$2_1*3600+$2_2*60+$2_3);
then, starting from the 21:00:00 reference time (21*3600 in seconds), I filtered for all the records of the last hour (3600 seconds) using '(21*3600-$seconds)<3600'.
You can change the filter parameters as you want.
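The seconds conversion itself is independent of Miller; the same arithmetic in plain awk, as a sketch:

```shell
# HH:MM:SS → seconds since midnight, e.g. for 20:04:08.
# (awk converts "04" and "08" as decimal numbers, so leading zeros are safe.)
echo "20:04:08" | awk -F: '{print $1 * 3600 + $2 * 60 + $3}'
# → 72248
```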
Your AWK attempt is nearly there:
$ awk -F, '$2 >= from' from=$(date -d "6 hours ago" +%H:%M:%S) sample.txt
blabla,20:04:08,Alarm,121,TBD,TBD
blabla,20:04:21,Normal,121,00:00:13,TBD
blabla,20:04:25,Normal,121,00:00:13,00:00:16
blabla,20:06:12,Alarm,125,TBD,TBD
You need -F, to tell awk to split on commas. This takes advantage of the fact that fixed-width HH:MM:SS strings compare correctly even as strings: "20:04:08" > "20:00:00". Note that this only holds within a single day (at 00:30, "1 hour ago" wraps past midnight), and for other time formats you might need to do some math.
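The string comparison can be checked directly on a few rows from the question's sample (a sketch; `sample.txt` is recreated inline, and the cutoff is hard-coded instead of coming from date so the example is reproducible):

```shell
printf '%s\n' \
  'blabla,11:04:44,Alarm,121,TBD,TBD' \
  'blabla,16:06:55,Warning,125,00:00:41,00:00:29' \
  'blabla,20:04:08,Alarm,121,TBD,TBD' \
  'blabla,20:06:12,Alarm,125,TBD,TBD' > sample.txt

# Keep rows whose HH:MM:SS field is at or after the cutoff; plain string
# comparison works because the format is fixed-width.
awk -F, -v from='20:00:00' '$2 >= from' sample.txt
# → blabla,20:04:08,Alarm,121,TBD,TBD
# → blabla,20:06:12,Alarm,125,TBD,TBD
```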
I don't think sed is going to work: it can't compare strings, and you probably won't have an exact match in the log. If you know that the time exists in the file, it's trivial:
sed -n '/20:04:08/,$p' sample.txt
blabla,20:04:08,Alarm,121,TBD,TBD
blabla,20:04:21,Normal,121,00:00:13,TBD
blabla,20:04:25,Normal,121,00:00:13,00:00:16
blabla,20:06:12,Alarm,125,TBD,TBD
I need the IP output below to be printed in a single line like this "10.88.4.92", "10.88.5.203", "10.87.5.215"
cat ec2.json | jq ".[] | .private_ip"
"10.88.4.92"
"10.88.5.203"
"10.87.5.215"
How can I achieve this with jq?
One approach would be to use @csv, e.g. along the lines of:
< ec2.json jq -r "[.[] | .private_ip] | @csv"
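Note that jq's @csv joins with bare commas, while the requested output has a space after each comma. One way to get the exact format is to post-process the per-line jq output; a sketch using the three quoted lines from the question as stand-in input:

```shell
# Join the quoted lines with commas, then add a space after each comma.
# (The sed step is safe here because IP strings contain no commas of
# their own.)
printf '%s\n' '"10.88.4.92"' '"10.88.5.203"' '"10.87.5.215"' \
  | paste -s -d, - \
  | sed 's/,/, /g'
# → "10.88.4.92", "10.88.5.203", "10.87.5.215"
```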
I'm trying to use Powershell to search a csv file and output a list of duplicate lines in a csv file. I can accomplish this pretty easily in bash with the following:
uniq -d myfile.csv > list.csv
In Powershell I can output a list of unique lines but how do I modify Get-Unique to display only the duplicate lines like I did in bash?
Get-Content c:\file\myfile.csv | Get-Unique | Set-Content c:\file\list1.csv
It's a bit weird to use the unique tool to get the duplicates. How about:
gc .\test.csv | group -NoElement |? Count -gt 1 | select -expand name
This groups identical lines with a count, keeps the groups that occur more than once, and outputs their names. e.g. if:
test.csv contains:
a,b,c
d,e,f
a,b,c
z,z,z
gc test.csv | group
Count Name Group
----- ---- -----
2 a,b,c {a,b,c, a,b,c}
1 d,e,f {d,e,f}
1 z,z,z {z,z,z}
1 {}
and -NoElement stops it from building the group contents, which are redundant in this case.
I have a text file with a list of IDs like this:
123
456
789
I would like to use them in queries to look up more information about these items. So far I have unsuccessfully tried:
awk '{print $1;}' 'pathto/file.txt' | xargs echo "SELECT * FROM table WHERE id='{}');" | mysql -u uname -p database
IDS=$(cat /path/to/file)
IDS=$(echo $IDS | sed 's/\s\s*/,/g')
#or IDS=$(echo $IDS | awk '{$1=$1}1' RS= OFS=,) on OSX
echo "SELECT * FROM table WHERE id in ($IDS)" | mysql -u uname -p database
The first line loads the ids into a variable, which makes the line breaks easy to replace in the second line: the whitespace is replaced with commas so the ids can be used in an IN () construct.
Edit 3: on OSX (which I don't use or have access to) it seems sed doesn't like my regex. As the OP points out, using IDS=$(echo $IDS | awk '{$1=$1}1' RS= OFS=,) as the second line works around this.
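The newline-to-comma join can also be done in one step with `paste`, sidestepping the sed/awk portability issue entirely (a sketch; the table and column names are the question's placeholders, and `ids.txt` stands in for the question's ID file):

```shell
# One ID per line, as in the question.
printf '%s\n' 123 456 789 > ids.txt

# -s joins all lines into one, -d, sets the separator.
ids=$(paste -s -d, ids.txt)
echo "SELECT * FROM table WHERE id IN ($ids);"
# → SELECT * FROM table WHERE id IN (123,456,789);
```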
I have 2 csv files and I'm looking for a way to compare them using a specific column, and once a match is found I need to take the value of another column from the matched row and put it in the corresponding column of the other record.
I'll try to explain a little bit more.
One csv has product_id,product_name,brand_name,price
the other has product_id,product_category,product_name,brand_name,price
I need to compare the 2 files by finding the rows that have a matching product_id value, and once found I need to take the price value from file 1 and put it to the matched record's price in file 2.
After extensive research I've come to the conclusion that this may be achievable with PowerShell.
Does anyone have any ideas about how I could do that? Thank you for your time.
Since this is just a one-time action, you could open the csv files in a spreadsheet (Google Docs, Excel, ...) and do a VLOOKUP. It's easy:
To demonstrate this, imagine the following spreadsheet where both csv files sit side by side: the first in columns A to B and the second in columns D to F.
| A | B | C | D | E | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price | | product_id | product_category | price
2 | 1 | 29.9 | | 2 | SOME CAT 1 | =IFERROR(VLOOKUP(D2;A:B;2;FALSE); "NULL")
3 | 2 | 35.5 | | 3 | SOME CAT 2 | =IFERROR(VLOOKUP(D3;A:B;2;FALSE); "NULL")
The VLOOKUP function will search for an exact match of the value of cell D2 in the first column of the region A:B, and return the value from the second column of that region. The IFERROR will return NULL if the VLOOKUP fails.
So in this case, cell F2 will look for the product id "2" (cell D2) in column A. It finds the product id "2" in row 3 and returns the price "35.5" (the second column of the range A:B). After all rows have been calculated, the result will be:
| A | B | C | D | E | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price | | product_id | product_category | price
2 | 1 | 29.9 | | 2 | SOME CAT 1 | 35.5
3 | 2 | 35.5 | | 3 | SOME CAT 2 | NULL
One could also use awk for this; say you have:
$ cat a.csv
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
20,pname20,bname20,300
$ cat b.csv
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
With the "FNR==NR" approach (see e.g. The Unix shell: comparing two files with awk):
$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0;next}}($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}' a.csv b.csv
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
With reading each file into an array (see e.g. Awking it – how to load a file into an array in awk | Tapping away):
$ awk -F, 'BEGIN{while(getline < "a.csv"){if(!/^#/){a[$1]=$0;}}close("a.csv");while(getline < "b.csv"){if($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}}close("b.csv");}'
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
In essence, the two approaches do the same thing:
read the first file (a.csv), and store its lines in an associative array a, keyed/indexed by the first field $1 of that line (in this case, product_id);
then read the second file (b.csv); and if the first field of each of its lines is found in the array a; then output the first four fields of the current line of b.csv; and the fourth field (price) from the corresponding entry in array a
The difference is that with the FNR==NR approach you specify the input files as command-line arguments to awk, and you can basically only mark the first file as "special" so it can be stored in an array; with the second approach, each input file could be parsed into a separate array. However, the input files are then named in the awk script itself rather than passed as arguments, and since awk gets no file arguments, the entirety of the script needs to happen within a BEGIN{...} block.
When lines are being read from the files, they are automatically split into fields according to the -F, command-line option, which sets comma as the delimiter; however, when retrieving whole lines stored in the array, we have to split() them ourselves.
Breakdown for the first:
FNR==NR # if FNR (input record number in the current input file) equals NR (total num records so far)
# only true when the first file is being read
{
if(!/^#/) # if the current line does not (`!`) match the regex `/^#/`, i.e. does not start with `#`
{
a[$1]=$0; # assign current line `$0` to array `a`, with index/key being first field in current line `$1`
next # skip the rest, and start processing next line
}
}
# --this section below executes when FNR does not equal NR;--
($1 in a) # first, check if first field `$1` of current line is in array `a`
{
split(a[$1],tmp,","); # split entry `a[$1]` at commas into array `tmp`
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # print reconstructed current line,
# taking the fourth field from the `tmp` array
}
Breakdown for the second:
BEGIN{ # since no file arguments here, everything goes in BEGIN block
while(getline < "a.csv"){ # while reading lines from first file
if(!/^#/){ # if the current line does not (`!`) match the regex `/^#/`, i.e. does not start with `#`
a[$1]=$0; # store current line `$0` to array `a`, with index/key being first field in current line `$1`
}
}
close("a.csv");
while(getline < "b.csv"){ # while reading lines from second file
if($1 in a){ # first, check if first field `$1` of current line is in array `a`
split(a[$1],tmp,","); # (same as above)
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # (same as above)
}
}
close("b.csv");
} # end BEGIN
Note about the execution with FNR==NR:
$ awk -F, 'FNR==NR{print "-";} (1){print;}' a.csv b.csv # or:
$ awk -F, 'FNR==NR{print "-";} {print;}' a.csv b.csv
-
#product_id,product_name,brand_name,price
-
1,pname1,bname1,100
-
10,pname10,bname10,200
-
20,pname20,bname20,300
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
$ awk -F, 'FNR==NR{print "-";} FNR!=NR{print;}' a.csv b.csv
-
-
-
-
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
That means the "this section below executes when FNR does not equal NR;" comment above is in principle wrong: the second block runs for any line whose pattern matches, including while the first file is being read, and it is only the next in the first block that keeps it from firing there - even if that is how this particular example ends up behaving.
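A related caveat worth knowing, as a sketch: FNR==NR is also true while the second file is being read if the first file is empty, because NR never advanced, so the "first file" block can fire on the wrong file.

```shell
: > empty.csv              # a zero-byte "first" file
printf '%s\n' x y > b.txt  # the "second" file

# empty.csv contributes no records, so while reading b.txt we still have
# FNR == NR, and every line lands in the first-file block.
awk 'FNR == NR {print "first-file block: " $0; next}
     {print "second-file block: " $0}' empty.csv b.txt
# → first-file block: x
# → first-file block: y
```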