I have 2 csv files and I'm looking for a way to compare them using a specific column, and once a match is found I need to take the value of another column from the matched row and put it in the corresponding column of the other record.
I'll try to explain a little bit more.
One csv has product_id,product_name,brand_name,price
the other has product_id,product_category,product_name,brand_name,price
I need to compare the two files by finding the rows that have a matching product_id value, and once found I need to take the price value from file 1 and put it into the matched record's price column in file 2.
After extensive research I've come to the conclusion that this may be achievable with PowerShell.
Does anyone have any ideas about how I could do that? Thank you for your time.
Since it's just a one-time action, you could open the CSV files in a spreadsheet (Google Docs, Excel, ...) and do a VLOOKUP. It's easy:
To demonstrate, imagine the following spreadsheet where both CSV files sit side by side: the first in columns A to B and the second in columns D to F.
| A | B | C | D | E | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price | | product_id | product_category | price
2 | 1 | 29.9 | | 2 | SOME CAT 1 | =IFERROR(VLOOKUP(D2;A:B;2;FALSE); "NULL")
3 | 2 | 35.5 | | 3 | SOME CAT 2 | =IFERROR(VLOOKUP(D3;A:B;2;FALSE); "NULL")
The VLOOKUP function will search for an exact match of the value of cell D2 in the first column of the range A:B, and return the value from the second column of that range. The IFERROR will return NULL if the VLOOKUP fails.
So in this case the formula in cell F2 will look for the product id "2" (cell D2) in column A. It finds product id "2" in row 3, and returns the price "35.5" (from the second column of the range A:B). After all rows have been calculated the result will be:
| A | B | C | D | E | F
--+------------+-------+---+------------+------------------+-------
1 | product_id | price | | product_id | product_category | price
2 | 1 | 29.9 | | 2 | SOME CAT 1 | 35.5
3 | 2 | 35.5 | | 3 | SOME CAT 2 | NULL
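If you would rather script it (the question mentions PowerShell, but any CSV-aware scripting language works), a minimal Python sketch of the same merge could look like this; the file names file1.csv, file2.csv and file2_updated.csv are placeholders, and it assumes product_id values are unique in file 1:
import csv
# Build a product_id -> price lookup from file 1 (product_id,product_name,brand_name,price).
prices = {}
with open("file1.csv", newline="") as f1:
    for row in csv.DictReader(f1):
        prices[row["product_id"]] = row["price"]
# Rewrite file 2, overwriting the price whenever a matching product_id exists in file 1.
with open("file2.csv", newline="") as f2, open("file2_updated.csv", "w", newline="") as out:
    reader = csv.DictReader(f2)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["price"] = prices.get(row["product_id"], row["price"])
        writer.writerow(row)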
One could also use awk for this; say you have:
$ cat a.csv
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
20,pname20,bname20,300
$ cat b.csv
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
With the "FNR==NR" approach (see e.g. > The Unix shell: comparing two files with awk):
$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0;next}}($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}' a.csv b.csv
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
With reading each file into an array (see e.g. Awking it – how to load a file into an array in awk | Tapping away):
$ awk -F, 'BEGIN{while(getline < "a.csv"){if(!/^#/){a[$1]=$0;}}close("a.csv");while(getline < "b.csv"){if($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}}close("b.csv");}'
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
In essence, the two approaches do the same thing:
read the first file (a.csv), and store its lines in an associative array a, keyed/indexed by the first field $1 of that line (in this case, product_id);
then read the second file (b.csv), and if the first field of a line is found in the array a, output the first four fields of that line of b.csv plus the fourth field (price) from the corresponding entry in array a.
The difference is that with the FNR==NR approach, you specify the input files on the command line as arguments to awk, and you can really only treat the first file as "special" and store it in an array; with the second approach, each input file could be parsed into its own array - however, the input files are named in the awk script itself rather than passed as arguments, and since you then don't use any file arguments at all, the entirety of the awk script needs to happen within a BEGIN{...} block.
When lines are read from the files, they are automatically split into fields according to the -F, command-line option, which sets comma as the delimiter; however, when retrieving lines stored in the array, we have to split() them ourselves.
Breakdown for the first:
FNR==NR # if FNR (input record number in the current input file) equals NR (total num records so far)
# only true when the first file is being read
{
if(!/^#/) # if the current line does not (`!`) match the regex `/^#/`, i.e. does not start (`^`) with `#`
{
a[$1]=$0; # assign current line `$0` to array `a`, with index/key being first field in current line `$1`
next # skip the rest, and start processing next line
}
}
# --this section below executes when FNR does not equal NR;--
($1 in a) # first, check if first field `$1` of current line is in array `a`
{
split(a[$1],tmp,","); # split entry `a[$1]` at commas into array `tmp`
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # print reconstructed current line,
# taking the fourth field from the `tmp` array
}
Breakdown for the second:
BEGIN{ # since no file arguments here, everything goes in BEGIN block
while(getline < "a.csv"){ # while reading lines from first file
if(!/^#/){ # if the current line does not (`!`) match the regex `/^#/`, i.e. does not start (`^`) with `#`
a[$1]=$0; # store current line `$0` to array `a`, with index/key being first field in current line `$1`
}
}
close("a.csv");
while(getline < "b.csv"){ # while reading lines from second file
if($1 in a){ # first, check if first field `$1` of current line is in array `a`
split(a[$1],tmp,","); # (same as above)
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # (same as above)
}
}
close("b.csv");
} # end BEGIN
Note about the execution with FNR==NR:
$ awk -F, 'FNR==NR{print "-";} (1){print;}' a.csv b.csv # or:
$ awk -F, 'FNR==NR{print "-";} {print;}' a.csv b.csv
-
#product_id,product_name,brand_name,price
-
1,pname1,bname1,100
-
10,pname10,bname10,200
-
20,pname20,bname20,300
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
$ awk -F, 'FNR==NR{print "-";} FNR!=NR{print;}' a.csv b.csv
-
-
-
-
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
That means that the "this section below executes when FNR does not equal NR" comment above is in principle wrong: the second block is evaluated for every input line, and it is only the `next` in the first block that stops it from running while a.csv is being read - even if that is how that particular example ends up behaving.
Related
Given the following csv, with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
id | text
---+--------------
1  | line 1 line 2
2  | line 1 line 2
1  | line 1 line 2
If I use the following awk command to remove duplicate rows from this csv based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
which displays as:
id | text
---+--------------
1  | line 1 line 2
2  | line 1
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something though.
Additional info: I'm writing these CSVs according to the RFC 4180 standard—most notably, fields containing line breaks, double quotes, and commas are enclosed in double quotes, and double quotes appearing inside a field are escaped with a preceding double quote.
Also, I'm writing the CSVs in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past—none had multiline fields though.
I'm by no means bound to awk—I'm open to any/all suggestions—just wanted to be clear about what I've tried. Thanks!
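One CSV-aware scripting option: Python's csv module understands RFC 4180 quoting, including embedded line breaks, so a minimal sketch of the dedupe could look like this (the file names mirror the example above; it keeps the first occurrence of each id, header row included):
import csv
seen = set()
with open("file-01.csv", newline="") as src, open("file-01-deduped.csv", "w", newline="") as dst:
    reader = csv.reader(src)                         # parses quoted, multi-line fields correctly
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)  # keep the all-quoted style of the input
    for row in reader:
        if row and row[0] not in seen:               # dedupe on the first column (id)
            seen.add(row[0])
            writer.writerow(row)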
With the samples shown, you could try the following awk code. It was written and tested in GNU awk, but should work in any awk.
awk -F',' '
FNR>1{
sub(/^"/,"",$2)
sub(/"$/,"",$3)
gsub(/"/,"",$1)
print $1 OFS $2 ORS " " $3
}
' <(awk '{printf("%s%s",$0!~/^"/?",":FNR>1?ORS:"",$0)} END{print ""}' Input_file)
Explanation: in short, the first awk joins each multi-line record into a single row (appending any line that does not start with " to the previous line), and its output is fed into the main awk, which prints the required id and line values in the requested layout.
As others have pointed out, you need a CSV-aware tool to properly handle the line breaks inside the rows.
GoCSV was made for this: it's fast, pretty good w/memory, is CSV savvy, and is pre-built for a number of platforms.
Its unique subcommand will keep only the first row based on the occurrence of a value or set of values in a column or set of columns.
To drop duplicate rows based on the text column:
gocsv unique -c 'text' input.csv > de-duped.csv
It can even tell you how many dupes it found along the way:
gocsv unique -c 'text' -count input.csv > de-duped.csv
How fast, and how good with memory?
I mocked up a 1_000_000 row CSV with two columns of random text and embedded line breaks (also includes commas and quoted quotes):
ll -h gen_1000000x3.csv
-rw-r--r-- 1 zyoung staff 52M Apr 26 09:36 gen_1000000x3.csv
cat gen_1000000x3.csv
ID,Col1,Col2
0,"ddddd
"","" oooooo","wwwwww
"","" nnnnnnn"
1,"llllllll
"","" ccccccc","iiiiiiii
"","" wwwww"
2,"nnnnn
"","" iiiiiiii","ooooo
"","" kkkkkkkk"
...
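(For reference, a test file of roughly this shape can be mocked up with a short Python script; the alphabet, field lengths and file name below are assumptions for illustration, not the exact generator used for this benchmark.)
import csv
import random
import string
def blob():
    # a field with an embedded newline, a quoted-quote pair and a comma, like the sample rows above
    def word():
        return random.choice(string.ascii_lowercase) * random.randint(5, 8)
    return word() + '\n"," ' + word()
with open("gen_1000000x3.csv", "w", newline="") as f:
    writer = csv.writer(f)   # quotes fields and doubles embedded quotes as needed
    writer.writerow(["ID", "Col1", "Col2"])
    for i in range(1_000_000):
        writer.writerow([i, blob(), blob()])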
On my M1 MacBook Air, de-duping the 1-million row, 52 MB CSV took a half-second and consumed only 13 MB of memory:
/usr/bin/time -l gocsv unique -c Col2 gen_1000000x3.csv > de-duped.csv
0.45 real 0.49 user 0.05 sys
...
13124608 peak memory footprint
Over 989_000 duplicate rows were dropped:
gocsv dims de-duped.csv
Dimensions:
Rows: 10816
Columns: 3
We can count instances of each value in Col2 that was found (counting consumed 175 MB of memory):
gocsv unique -c Col2 -count gen_1000000x3.csv > de-duped.csv
GoCSV can also display multi-line rows in the terminal:
+--------+---------------+---------------+-------+
| ID | Col1 | Col2 | Count |
+--------+---------------+---------------+-------+
| 0 | ddddd | wwwwww | 80 |
| | "," oooooo | "," nnnnnnn | |
+--------+---------------+---------------+-------+
| 1 | llllllll | iiiiiiii | 89 |
| | "," ccccccc | "," wwwww | |
+--------+---------------+---------------+-------+
| 2 | nnnnn | ooooo | 97 |
| | "," iiiiiiii | "," kkkkkkkk | |
...
I cannot compare the awk scripts suggested so far: one just doesn't do anything in my terminal, and the other requires GNU awk, which I don't have. But awk will come out slower: it takes 3x longer just to run awk '{print $0}' gen_1000000x3.csv > /dev/null, and that's not even doing meaningful work. And that's before the hoops you have to jump through to program a CSV parser from scratch.
Awk is not CSV-aware, so it's not really the right tool for the job. There are a few CSV implementations floating around the internet; maybe you could take a look at them.
You did mention the file being large, but if it fits in your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT, so it's not really fast:
$ gawk '
BEGIN {
RS="^$" # read in whole file
FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")" # regex magic
OFS=","
}
{
for(i=1;i<NF;i+=2) # iterate fields 2 at a time
if(!a[$i]++) # if first field not seen before
print $i,$(i+1) # output 2 fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, tho.
A great and very simple CSV-aware tool is Miller.
Running
mlr --csv uniq -a input.csv >output.csv
you will get:
id,text
1,"line 1
line 2"
2,"line 1
line 2"
It also has great documentation: this is the page for the uniq verb.
I have a dataframe and I want to write it as a JSON array into a single file in Scala.
attempt 1:
dataframe.coalesce(1).write.format("json").save(destDir)
output 1:
One row per line, where each row is a json
attempt 2:
dataframe.toJSON.coalesce(1).write.format("json").save(destDir)
output 2:
same as output 1, but a weird-looking JSON on each row
{value: {key1: value1, key2: value2, ... }}
attempt 3 (writing as String using java PrintWriter):
printWriter.write(dataframe.toJSON.collect.mkString("[",",","]"))
output3:
It writes an array of JSON objects to a local path.
If the path is on HDFS, it fails with FileNotFound, even though the path and file exist.
To write a dataframe as a JSON array, you first transform your dataframe to JSON strings, then transform those strings so that each row is a line of your future JSON file, and finally write the file with text instead of json.
Analysis
To write a dataframe to JSON, you can start from the .toJSON method, as in your attempts 2 and 3:
val rawJson = dataframe.toJSON
Now you have a dataframe with one column value containing a json representation of rows as a String.
To transform this dataframe to a dataframe whose each row represents a line of your future file, you need to:
add a new row containing [ as first row of the dataframe
add a comma to all rows representing your json data
except for the last row with json data
add a new row containing ] as last row of the dataframe
As you can see, concepts like "first" and "last" are important in your case, so you need to build an ordering of the rows in your dataframe. You can picture it like this:
+-------+--------------------+------------+
| order | row | value |
+-------+--------------------+------------+
| 0 | first row | "[" |
| 1 | row with json | " {...}," |
| 1 | row with json | " {...}," |
| ... | ... | ... |
| 1 | row with json | " {...}," |
| 1 | row with json | " {...}," |
| 2 | last row with json | " {...}" |
| 3 | last row | "]" |
+-------+--------------------+------------+
First, you can distinguish the last row with json from the others. To do so, you can use Window functions.
You count the number of rows in a window that contains the current row and the next row, which associates each row with 2, except the last one, which has no next row and is therefore associated with 1.
val window = Window.rowsBetween(Window.currentRow, 1)
val jsonWindow = rawJson.withColumn("order", count("value").over(window))
However, you want the last row to have 2 in the "order" column and the other rows to have 1. You can use the modulo (%) operator to achieve this:
val jsonRowsWithOrder = jsonWindow.withColumn("order", (col("order") % lit(2)) + 1)
Then you add comma to all rows except the last one, meaning you add comma to all rows whose column "order" is set to 1:
val jsonRowsWithCommas = jsonRowsWithOrder.withColumn("value", when(col("order").equalTo(1), concat(col("value"), lit(","))).otherwise(col("value")))
The JSON lines should be indented in the final file, so you prepend spaces to them:
val indentedJsonRows = jsonRowsWithCommas.withColumn("value", concat(lit(" "), col("value")))
You add first and last rows, which contain open and close square brackets:
val unorderedRows = indentedJsonRows.unionByName(Seq((0, "["), (3, "]")).toDF("order", "value"))
You order it:
val orderedRows = unorderedRows.orderBy("order").drop("order")
You coalesce to have only one partition as you want only one file at the end:
val partitionedRows = orderedRows.coalesce(1)
And you write it as a text:
partitionedRows.write.text(destDir)
And you're done!
Complete solution
Here is the complete solution, with imports. This solution works from spark 2.3 (tested with spark 3.0):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val window = Window.rowsBetween(Window.currentRow, 1)
dataframe.toJSON
.map(jsonString => s" $jsonString")
.withColumn("order", (count("value").over(window) % lit(2)) + lit(1))
.withColumn("value", when(col("order").equalTo(1), concat(col("value"), lit(","))).otherwise(col("value")))
.unionByName(Seq((0, "["), (3, "]")).toDF("order", "value"))
.orderBy("order")
.select("value")
.coalesce(1)
.write.text(destDir)
Conclusion
You can write a Spark dataframe as a JSON array using only Spark.
However, Spark is a parallel computing framework, so enforcing an order and shrinking to one partition is not the way it is supposed to work. Moreover, as you can't change the names of the files Spark writes, the saved file will have a .txt extension (even though its content is a JSON array).
It may be better to save your dataframe with .write.json(destDir) and then rework the output with classic tools, instead of creating complicated logic to do it with Spark.
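For example, if you go the .write.json(destDir) route with coalesce(1), the rework step can be a short script outside Spark. A hedged Python sketch (the part-file and output names are placeholders; Spark names its output part-*.json) that turns the JSON-lines output into a single JSON array:
import json
# coalesce(1).write.json(destDir) produces one part file containing one JSON object per line
with open("destDir/part-00000.json") as src:
    records = [json.loads(line) for line in src if line.strip()]
with open("result.json", "w") as dst:
    json.dump(records, dst, indent=2)   # a single, indented JSON array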
I'm trying to use Powershell to search a csv file and output a list of duplicate lines in a csv file. I can accomplish this pretty easily in bash with the following:
uniq -d myfile.csv > list.csv
In Powershell I can output a list of unique lines but how do I modify Get-Unique to display only the duplicate lines like I did in bash?
Get-Content c:\file\myfile.csv | Get-Unique | Set-Content c:\file\list1.csv
It's a bit weird to use the unique tool to get the duplicates. How about:
gc .\test.csv | group -NoElement |? Count -gt 1 | select -expand name
This groups identical lines, keeps the groups that occur more than once, and outputs their values. E.g. if:
test.csv contains:
a,b,c
d,e,f
a,b,c
z,z,z
gc test.csv | group
Count Name Group
----- ---- -----
2 a,b,c {a,b,c, a,b,c}
1 d,e,f {d,e,f}
1 z,z,z {z,z,z}
1 {}
and -NoElement stops it from building the group contents, which are redundant in this case.
I need a formula/function to concatenate cell values from one column and multiple rows. The matching criterion is applied to a different column. Here is my example of what I have to do:
Islington | "Bunhill" | EC2M
Islington | "Bunhill" | EC2Y
Islington | "Bunhill" | N1
Barnet | "Burnt Oak" | HA8
Barnet | "Burnt Oak" | NW7
Barnet | "Burnt Oak" | NW9
The end result needs to look like this:
Islington | "Bunhill" | EC2M, EC2Y, N1
Barnet | "Burnt Oak" | HA8, NW7, NW9
Basically, I need to remove all duplicates from the second column, but save the data from the third column that is paired with each of the duplicates, and concatenate it in one cell.
You can do this in a series of steps using built-in functions. Start with the UNIQUE function; put this in a cell where it is convenient to list all the unique values of column B:
=UNIQUE(B:B)
Gets all the unique values in column B.
Google Support - Unique Function
The result from the UNIQUE function will be a single column listing each distinct value from column B once.
Now that you have all the unique values from column B, you can use the FILTER function to retrieve all the rows that match that unique value.
=FILTER(D1:D6, B1:B6=A8)
The FILTER function lists all the results down the column, but you can use the CONCATENATE function to avoid that.
You will need to adjust the FILTER function to now use column D, rather than column C.
=CONCATENATE(FILTER(D1:D6, B1:B6=A8))
This solves the problem of getting data in multiple rows, but now there is no separator between the values.
To get around that problem, you can create a fourth column with a formula that appends a comma to the end of each value (this is why the FILTER above reads from column D rather than column C).
That leaves an extra comma at the end of the concatenated result, which you can trim off with the LEFT function.
If this isn't required too often, it is quite practical without a script. Assuming EC2M is in C2, D1 is blank, and your data is sorted, in D2:
=if(B1=B2,D1&", "&C2,C2)
and in E2, both formulae copied down to suit:
=B2=B3
Select all, Ctrl+C, Edit, Paste special, Paste values only over the top, and filter to select and delete rows with TRUE in column E.
TEXTJOIN has 2 advantages over CONCATENATE: (1) customizable delimiter, and (2) can skip blanks.
Example:
AA | BB | CC | __ | EE
=TEXTJOIN(",",TRUE,A1:E1)
Will produce: AA,BB,CC,EE
(skipping the blank DD and putting a comma in between every term except last)
I have numerous csv files that will form the basis of a mysql database. My problem is as follows:
The input CSV files are of the format:
TIME | VALUE PARAM 1 | VALUE PARAM 2 | VALUE PARAM 3 | ETC.
0.00001 | 10 | 20 | 30 | etc.
This is not the structure I want to use in the database. There I would like one big table for all of the data, structured something like:
TIME | PARAMETER | VALUE | Unit of Measure | Version
This means that I would like to insert the combination of TIME and VALUE PARAM 1 from the CSV into the table, then the combination of TIME and VALUE PARAM 2, and so on, and so on.
I haven't done anything like this before, but could a possible solution be to set up a BASH script that loops through the columns and on each iteration inserts the combination of time + value into my database?
I have a reasonable understanding of MySQL, but very limited knowledge of bash scripting, and I couldn't find a way to do this with the MySQL LOAD DATA INFILE command.
If you need more info to help me out, I'm happy to provide more info!
Regards,
Erik
I do this all day, every day, and as a rule have the most success with the least headaches by using LOAD DATA INFILE into a temporary table, then leveraging the power of MySQL to get it into the final table/format. Details at this answer.
To illustrate this further: we process log files for every video event of 80K high schools/colleges around the country (that's every pause/play/seek/stop/start for hundreds of thousands of videos).
They're served from a number of different servers, depending on the type of video (WMV, FLV, MP4, etc.), so there's some 200 GB to handle every night, with each format having a different log layout. The old way we did it with CSV/PHP took literally days to finish, but changing it to LOAD DATA INFILE into temporary tables, unifying them into a second, standardized temporary table, and then using SQL to group and otherwise slice and dice the data cut the execution time to a few hours.
It would probably be easiest to preprocess your CSV with an awk script first, and then (as Greg P said) use LOAD DATA LOCAL INFILE. If I understand your requirements correctly, this awk script should work:
#!/usr/bin/awk -f
BEGIN { FS = " *[|] *" }  # split on "|" and swallow the surrounding spaces
NR==1 {
for(col = 2; col <= NF; col++) label[col] = $col
printf("TIME | PARAM | VALUE | UNIT | VERSION\n")
next
}
{
for(col = 2; col <= NF; col++) {
printf("%s | %s | %s | [unit] | [version]\n", $1, label[col], $col)
}
}
Output:
$ ./test.awk test.in
TIME | PARAM | VALUE | UNIT | VERSION
0.00001 | VALUE PARAM 1 | 10 | [unit] | [version]
0.00001 | VALUE PARAM 2 | 20 | [unit] | [version]
0.00001 | VALUE PARAM 3 | 30 | [unit] | [version]
0.00001 | ETC. | etc. | [unit] | [version]
Then
mysql> LOAD DATA LOCAL INFILE 'processed.csv'
mysql> INTO TABLE `table`
mysql> FIELDS TERMINATED BY '|'
mysql> IGNORE 1 LINES;
(Note: I haven't tested the MySQL)
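If awk feels unfamiliar, roughly the same preprocessing can be sketched in Python (the input/output file names mirror the example above, and the [unit]/[version] placeholders are still yours to fill in):
import csv
with open("test.in", newline="") as src, open("processed.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|")
    header = [h.strip() for h in next(reader)]   # TIME, VALUE PARAM 1, VALUE PARAM 2, ...
    writer.writerow(["TIME", "PARAM", "VALUE", "UNIT", "VERSION"])
    for row in reader:
        time = row[0].strip()
        for label, value in zip(header[1:], row[1:]):
            # one long-format row per (time, parameter) combination
            writer.writerow([time, label, value.strip(), "[unit]", "[version]"])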