Finding duplicate lines in a CSV file - csv

I'm trying to use PowerShell to search a CSV file and output a list of its duplicate lines. I can accomplish this pretty easily in bash with the following:
uniq -d myfile.csv > list.csv
In PowerShell I can output a list of unique lines, but how do I modify Get-Unique to display only the duplicate lines, as I did in bash?
Get-Content c:\file\myfile.csv | Get-Unique | Set-Content c:\file\list1.csv

It's a bit weird to use the unique tool to get the duplicates. How about:
gc .\test.csv | group -NoElement |? Count -gt 1 | select -expand name
This groups identical lines and counts them, identifies the ones with duplicates, and outputs them. E.g. if:
test.csv contains:
a,b,c
d,e,f
a,b,c
z,z,z
gc test.csv | group
Count Name  Group
----- ----  -----
    2 a,b,c {a,b,c, a,b,c}
    1 d,e,f {d,e,f}
    1 z,z,z {z,z,z}
    1       {}
and -NoElement stops it from building the group contents, which are redundant in this case.
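One caveat on the bash one-liner in the question: uniq -d only reports *adjacent* duplicate lines, so unsorted input needs a sort first. A minimal sketch using the same sample data:

```shell
# uniq -d alone would miss the two non-adjacent "a,b,c" lines;
# sorting first makes duplicates adjacent so uniq -d can see them.
printf 'a,b,c\nd,e,f\na,b,c\nz,z,z\n' | sort | uniq -d
# → a,b,c
```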

Related

Remove duplicates rows from large csv with multiline fields

Given the following csv, with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
id  text
1   line 1 line 2
2   line 1 line 2
1   line 1 line 2
If I use the following awk command to remove duplicate rows from this csv based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
which displays as:
id  text
1   line 1 line 2
2   line 1
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something though.
Additional info: I'm writing these csv's according to RFC4180 standards—most notably, fields containing line breaks, double quotes, and commas are enclosed in double-quotes. And double quotes appearing inside a field are escaped with a preceding double quote.
Also, I'm writing the csv in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past—none had multiline fields though.
I'm by no means bound to awk—I'm open to any/all suggestions—just wanted to be clear about what I've tried. Thanks!
With the samples shown, please try the following awk code. Written and tested in GNU awk; the awk itself should work in any awk, but the process substitution requires bash.
awk -F',' '
FNR>1{
  sub(/^"/,"",$2)
  sub(/"$/,"",$3)
  gsub(/"/,"",$1)
  print $1 OFS $2 ORS " " $3
}
' <(awk '{printf("%s%s",$0!~/^"/?",":FNR>1?ORS:"",$0)} END{print ""}' Input_file)
Explanation: simply put, the first awk joins each multiline record onto a single row (appending every line that does not start with ") and sends its output as input to the main awk, which prints the required id and line values.
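To see just the pre-processing step in isolation (a minimal sketch with inline sample data): any line that does not start with a double quote is treated as a continuation and is appended to the previous line with a comma, so each record ends up on one row.

```shell
# Join continuation lines (those not starting with ") onto the previous line.
printf '"1","line 1\nline 2"\n"2","line 1\nline 2"\n' |
  awk '{printf("%s%s", $0!~/^"/ ? "," : FNR>1 ? ORS : "", $0)} END{print ""}'
# → "1","line 1,line 2"
# → "2","line 1,line 2"
```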
As others have pointed out, you need a CSV-aware tool to properly handle the line breaks inside the rows.
GoCSV was made for this: it's fast, pretty good w/memory, is CSV savvy, and is pre-built for a number of platforms.
Its unique subcommand will keep only the first row based on the occurrence of a value or set of values in a column or set of columns.
To drop duplicate rows based on the text column:
gocsv unique -c 'text' input.csv > de-duped.csv
It can even tell you how many dupes it found along the way:
gocsv unique -c 'text' -count input.csv > de-duped.csv
How fast? How good with memory?
I mocked up a 1_000_000 row CSV with two columns of random text and embedded line breaks (also includes commas and quoted quotes):
ll -h gen_1000000x3.csv
-rw-r--r-- 1 zyoung staff 52M Apr 26 09:36 gen_1000000x3.csv
cat gen_1000000x3.csv
ID,Col1,Col2
0,"ddddd
"","" oooooo","wwwwww
"","" nnnnnnn"
1,"llllllll
"","" ccccccc","iiiiiiii
"","" wwwww"
2,"nnnnn
"","" iiiiiiii","ooooo
"","" kkkkkkkk"
...
On my M1 MacBook Air, de-duping the 1-million row, 52 MB CSV took a half-second and consumed only 13 MB of memory:
/usr/bin/time -l gocsv unique -c Col2 gen_1000000x3.csv > de-duped.csv
0.45 real 0.49 user 0.05 sys
...
13124608 peak memory footprint
Over 989_000 duplicate rows were dropped:
gocsv dims de-duped.csv
Dimensions:
Rows: 10816
Columns: 3
We can count instances of each value in Col2 (counting consumed 175 MB of memory):
gocsv unique -c Col2 -count gen_1000000x3.csv > de-duped.csv
GoCSV can also display multi-line rows in the terminal:
+--------+---------------+---------------+-------+
| ID     | Col1          | Col2          | Count |
+--------+---------------+---------------+-------+
| 0      | ddddd         | wwwwww        | 80    |
|        | "," oooooo    | "," nnnnnnn   |       |
+--------+---------------+---------------+-------+
| 1      | llllllll      | iiiiiiii      | 89    |
|        | "," ccccccc   | "," wwwww     |       |
+--------+---------------+---------------+-------+
| 2      | nnnnn         | ooooo         | 97    |
|        | "," iiiiiiii  | "," kkkkkkkk  |       |
...
I cannot compare the awk scripts suggested so far: one just doesn't do anything in my terminal, and the other requires GNU awk, which I don't have. But awk will come out slower: it took 3x longer just to run awk '{print $0}' gen_1000000x3.csv > /dev/null, and that's not even doing meaningful work. And consider the hoops you have to jump through to program a CSV parser from scratch.
Awk is not CSV-aware, so it's not really the right tool for the job. There are a few CSV implementations floating around the internets; maybe you could take a look at them.
You did mention the file being large, but if it fits your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT so it's not really fast:
$ gawk '
BEGIN {
    RS="^$"                              # read in whole file
    FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")" # regex magic
    OFS=","
}
{
    for(i=1;i<NF;i+=2)       # iterate fields 2 at a time
        if(!a[$i]++)         # if first field not seen before
            print $i,$(i+1)  # output 2 fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, tho.
A great and very simple CSV-aware tool is Miller.
Running
mlr --csv uniq -a input.csv >output.csv
You will have
id,text
1,"line 1
line 2"
2,"line 1
line 2"
It also has great documentation: this is the page for the uniq verb.

jq - combine multiple lines into single comma-separated line

I need the IP output below to be printed on a single line, like this: "10.88.4.92", "10.88.5.203", "10.87.5.215"
cat ec2.json | jq ".[] | .private_ip"
"10.88.4.92"
"10.88.5.203"
"10.87.5.215"
How can I achieve this with jq?
One approach would be to use @csv, e.g. along the lines of:
< ec2.json jq -r "[.[] | .private_ip] | @csv"
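A runnable sketch, with a made-up ec2.json piped inline; note that @csv joins with bare commas (no space after each comma), so add a sed step if you want ", " separators:

```shell
# @csv quotes each string and joins the array elements with commas.
printf '%s' '[{"private_ip":"10.88.4.92"},{"private_ip":"10.88.5.203"},{"private_ip":"10.87.5.215"}]' |
  jq -r '[.[] | .private_ip] | @csv'
# → "10.88.4.92","10.88.5.203","10.87.5.215"
```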

mysql count distinct occurrences in an array field

I have a text column which contains hashtags used by users. Each row contains a different number of hashtags, separated by a space, like this:
USERS | HASHTAG COLUMN:
------------------------
user1 | hashtag1 hashtag2
user2 | hashtag2
user1 | hashtag1 hashtag2 hashtag3 hashtag4
I want to get the most used hashtags, in this case my desired output should be:
OCCURRENCES | TAG
----------------------
3 | hashtag2
2 | hashtag1
1 | hashtag3
1 | hashtag4
I have NO IDEA how to get it, any help is much appreciated. Thank you
Assuming you can't redesign your database to be in 1NF, you can do this in bash:
echo "select hashtag from table" | \
mysql --user=foo --password=bar --host=hostname --database=dbname --skip-column-names | \
sed -e 's/ /\n/g' | \
sort | \
uniq -c | \
sort -rn
The sed command puts each hashtag on its own line. The first sort puts all the duplicate hashtags next to each other so that the uniq -c command can count the occurrences of each one. The second sort orders the output in reverse numerical order by count.
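The same pipeline can be exercised without a database by feeding the sample rows in directly; tr is used here as a portable stand-in for the sed step (BSD sed does not interpret \n in the replacement):

```shell
# One hashtag per line, group duplicates, count them, sort by count descending.
printf 'hashtag1 hashtag2\nhashtag2\nhashtag1 hashtag2 hashtag3 hashtag4\n' |
  tr ' ' '\n' | sort | uniq -c | sort -rn
# hashtag2 appears 3 times, hashtag1 twice, hashtag3 and hashtag4 once each.
```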

Simple way to print structured SQL SELECT in bash

I'm trying to echo the output of a SELECT request in bash, in structured form with column names. The issue is that I cannot do it properly with more than 2 fields, or if a value is larger than the column name.
Example :
My table has two columns: value1, value2
If I do in bash: echo "select value1, value2 from table" | mysql -uUSER -pPASS
The result looks like this in bash :
value1 value2
a d
b e
c f
Now if I have 3 fields or a large value, the result looks like this:
value1 value2 value3
aaaaaaaaa ddddddddddd ggg
bbbb eeeeeeeee hhhh
ccccccc fffffffff iiii
Is there a simple way to get a structured result, i.e. with column names correctly spaced? I know it's possible to sort to find the largest value and add the number of spaces needed, but that seems like too much for a simple problem like this.
Do you have an idea? Thanks!
Use the mysql -e option to execute your query, and -t to print table output to stdout:
mysql -uUSER -t -e "select value1, value2, value3 from table" -pPASS
Your output will look something like this:
+-----------+-------------+--------+
| value1    | value2      | value3 |
+-----------+-------------+--------+
| aaaaaaaaa | ddddddddddd | ggg    |
| bbbb      | eeeeeeeee   | hhhh   |
| ccccccc   | fffffffff   | iiii   |
+-----------+-------------+--------+
From the mysql manpage:
--execute=statement, -e statement
Execute the statement and quit. The default output format is like that produced with --batch. See Section 4.2.3.1, “Using Options on
the Command Line”, for some examples. With this option, mysql does not use the history file.
and
--table, -t
Display output in table format. This is the default for interactive use, but can be used to produce table output in batch mode.
echo "select value1, value2 from table" | mysql -uUSER -pPASS | column -t
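A quick way to see what column -t does, assuming the util-linux/BSD column utility is installed: it pads whitespace-separated fields so each column lines up, which fixes the jagged output without mysql's -t mode.

```shell
# Align whitespace-separated columns from the sample output.
printf 'value1 value2 value3\naaaaaaaaa ddddddddddd ggg\nbbbb eeeeeeeee hhhh\n' | column -t
```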
A good way to do this is to not use echo to print everything directly to the page; instead, assign each value to an array variable. In pseudocode:
int x[50];
x[i] = $value  // value from the SELECT query
i++
Put this in a loop, so each result is stored at a new array index. Once you have saved all your results in an array, it is much easier to organize them: you can use an HTML
<table />
element, with two loops overall - the first loop to fill the array from the database attributes, and the second loop to print the array inside the HTML table.

Compare 2 csv files and replace updated values

I have 2 csv files and I'm looking for a way to compare them using a specific column, and once a match is found I need to take the value of another column from the matched row and put it in the corresponding column of the other record.
I'll try to explain a little bit more.
One csv has product_id,product_name,brand_name,price
the other has product_id,product_category,product_name,brand_name,price
I need to compare the 2 files by finding the rows that have a matching product_id value, and once found I need to take the price value from file 1 and put it to the matched record's price in file 2.
After extensive research I've come to the conclusion that this maybe achievable with powershell.
Does anyone have any ideas about how I could do that? Thank you for your time.
Since this is just a one-time action, you could open the CSV files in a spreadsheet (Google Docs, Excel, ...) and do a VLOOKUP. It's easy:
To demonstrate, imagine the following spreadsheet where both csv files sit side by side: the first in columns A to B and the second in columns D to F.
  | A          | B     | C | D          | E                | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price |   | product_id | product_category | price
2 | 1          | 29.9  |   | 2          | SOME CAT 1       | =IFERROR(VLOOKUP(D2;A:B;2;FALSE); "NULL")
3 | 2          | 35.5  |   | 3          | SOME CAT 2       | =IFERROR(VLOOKUP(D3;A:B;2;FALSE); "NULL")
The VLOOKUP function searches for an exact match of the value of cell D2 in the first column of the region A:B, and returns the value from the second column of that region. The IFERROR returns NULL if the VLOOKUP fails.
So in this case, cell F2 looks for product id "2" (cell D2) in column A. It finds product id "2" in row 3 and returns the price "35.5" (from the second column of the range A:B). After all rows have been calculated, the result will be:
  | A          | B     | C | D          | E                | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price |   | product_id | product_category | price
2 | 1          | 29.9  |   | 2          | SOME CAT 1       | 35.5
3 | 2          | 35.5  |   | 3          | SOME CAT 2       | NULL
One could also use awk for this; say you have:
$ cat a.csv
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
20,pname20,bname20,300
$ cat b.csv
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
With the "FNR==NR" approach (see e.g. The Unix shell: comparing two files with awk):
$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0;next}}($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}' a.csv b.csv
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
With reading each file into an array (see e.g. Awking it – how to load a file into an array in awk | Tapping away):
$ awk -F, 'BEGIN{while(getline < "a.csv"){if(!/^#/){a[$1]=$0;}}close("a.csv");while(getline < "b.csv"){if($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}}close("b.csv");}'
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
In essence, the two approaches do the same thing:
read the first file (a.csv), and store its lines in an associative array a, keyed/indexed by the first field $1 of that line (in this case, product_id);
then read the second file (b.csv); and if the first field of each of its lines is found in the array a; then output the first four fields of the current line of b.csv; and the fourth field (price) from the corresponding entry in array a
The difference is that with the FNR==NR approach, you specify the input files on the command line as arguments to awk, and you can only treat the first file as "special" in order to store it in an array. With the second approach, each input file could be parsed into a separate array; however, the input files are named in the awk script itself rather than passed as arguments - and since no file arguments are used, the entire awk script must run inside a BEGIN{...} block.
When lines are read from the files, they are automatically split into fields according to the -F, command line option, which sets comma as the delimiter; however, when retrieving lines stored in the array, we have to split() them ourselves.
Breakdown for the first:
FNR==NR   # if FNR (input record number in the current input file) equals NR (total num records so far);
          # only true while the first file is being read
{
    if(!/^#/)      # if the current line does not (!) match the regex /^#/ (starts with #)
    {
        a[$1]=$0;  # assign current line $0 to array a, keyed by the first field $1 of the line
        next       # skip the rest, and start processing next line
    }
}
# --this section below executes when FNR does not equal NR;--
($1 in a)  # first, check if first field $1 of current line is in array a
{
    split(a[$1],tmp,",");  # split entry a[$1] at commas into array tmp
    printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];  # print reconstructed current line,
                                                   # taking the fourth field from the tmp array
}
Breakdown for the second:
BEGIN{  # since no file arguments here, everything goes in the BEGIN block
    while(getline < "a.csv"){  # while reading lines from first file
        if(!/^#/){             # if the current line does not (!) match the regex /^#/ (starts with #)
            a[$1]=$0;          # store current line $0 in array a, keyed by the first field $1 of the line
        }
    }
    close("a.csv");
    while(getline < "b.csv"){  # while reading lines from second file
        if($1 in a){           # first, check if first field $1 of current line is in array a
            split(a[$1],tmp,",");                          # (same as above)
            printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];  # (same as above)
        }
    }
    close("b.csv");
}  # end BEGIN
Note about the execution with FNR==NR:
$ awk -F, 'FNR==NR{print "-";} (1){print;}' a.csv b.csv # or:
$ awk -F, 'FNR==NR{print "-";} {print;}' a.csv b.csv
-
#product_id,product_name,brand_name,price
-
1,pname1,bname1,100
-
10,pname10,bname10,200
-
20,pname20,bname20,300
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
$ awk -F, 'FNR==NR{print "-";} FNR!=NR{print;}' a.csv b.csv
-
-
-
-
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
That means that the "this section below executes when FNR does not equal NR" comment above is in principle wrong - even if that is how this particular example ends up behaving. The ($1 in a){...} rule is actually evaluated for every input line; lines from the first file just never reach it, because the next statement fires first.
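Finally, a hedged simplification of the FNR==NR join above: storing only the price field (a[$1]=$4) instead of the whole line avoids the split() step entirely. A minimal self-contained sketch (the /tmp/a_demo.csv and /tmp/b_demo.csv file names are made up for the demo):

```shell
# Build small stand-ins for a.csv and b.csv.
cat > /tmp/a_demo.csv <<'EOF'
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
EOF
cat > /tmp/b_demo.csv <<'EOF'
#product_id,product_category,product_name,brand_name,price
10,pcat10,pname10,bname10,199
30,pcat30,pname30,bname30,420
EOF
# First file: remember each product's price. Second file: for matching
# product_ids, reprint the row with the price taken from the first file.
awk -F, 'FNR==NR{if(!/^#/)a[$1]=$4;next} ($1 in a){print $1","$2","$3","$4","a[$1]}' \
  /tmp/a_demo.csv /tmp/b_demo.csv
# → 10,pcat10,pname10,bname10,200
```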