Editing CSV files in batch

I need to edit this CSV file in batch:
id;category;name/code;description;sku;price;weight;options;enable discounts;discounts;availability type;available;pending;images
iqhk8mjh;Software;Quick Heal Antivirus Pro;Quick Heal Antivirus Pro;quickh1;29,90;0;;0;;Dynamic;10;0;C:\Users\Matteo\Dropbox\siti\quick\av2021.png
Delete columns 2, 3, 4, 5, 7, 8, 9, 10, 11, 13 and 14.
Change the column title from available to quantity.
Delete rows with quantity = 0.
Add a column titled codice with a static value, for example 1234.
Then convert to .txt (with a \t separator).
Is it possible? I tried with PowerShell.

You can do it with PowerShell, but I would recommend Miller (available for several OSs) instead:
mlr --icsv --ifs ';' --otsv cut -f 'id,price,available' then rename 'available,quantity' then filter '$quantity != 0' then put '$codice = 1234' file.csv
Output:
id price quantity codice
iqhk8mjh 29,90 10 1234
Explanations:
--icsv --ifs ';' => set the input format to CSV with ; as the field separator.
--otsv => set the output format to TSV (tab-separated values).
cut -f 'id,price,available' => keep only the specified fields.
rename 'available,quantity' => rename the field available to quantity.
filter '$quantity != 0' => keep only the rows whose quantity isn't 0.
put '$codice = 1234' => add a field codice with an arbitrary value to each row.
Note: then chains operations in Miller.
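If you prefer to stay with standard tools, here is a rough awk equivalent (a sketch only, assuming the column order shown in the sample header, where id, price and available are fields 1, 6 and 12; file.txt is just an example output name):
awk -F';' -v OFS='\t' '
NR==1    { print "id", "price", "quantity", "codice"; next }   # write the new header
$12 != 0 { print $1, $6, $12, 1234 }                           # keep rows whose available field is not 0
' file.csv > file.txt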

Related

CSV Column Insertion via awk

I am trying to insert a column in front of the first column in a comma-separated value file (CSV). At first blush, awk seems to be the way to go, but I'm struggling with how to move down the new column.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution, which is available in bash. For other shells, use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file
With awk only, you could try the following solution, written and tested with the shown samples.
awk -v value="$(echo -e "col\n0\nA\n2\nC")" '
BEGIN{
  FS=OFS=","                 # set input/output field separators to ","
  num=split(value,arr,ORS)   # split the shell variable on newlines into arr
  for(i=1;i<=num;i++){
    newVal[i]=arr[i]         # keep a copy of each value, indexed 1..num
  }
}
{
  $1=arr[FNR] OFS $1         # prepend the value for this line number to field 1
}
1
' Input_file
Explanation:
First, an awk variable named value is created from the output of the shell command echo. NOTE: the -e option makes echo interpret \n as newlines rather than literal characters.
In the BEGIN section of the awk program, FS and OFS are set to , for all lines of Input_file.
The split function splits the value variable into an array named arr, using ORS (a newline) as the delimiter.
A for loop then runs up to num (the number of values produced by the echo command), copying each arr value into an array named newVal indexed by i (1, 2, 3 and so on).
In the main awk program, the first field is set to the arr value for the current line number (FNR), followed by OFS and the original $1, and the line is then printed.
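If the new column's values live in their own file rather than in a shell variable, a minimal variation of the same idea (a sketch only; newcol.txt is a hypothetical file holding the header and one value per line) would be:
awk 'BEGIN{FS=OFS=","} NR==FNR{col[FNR]=$0; next} {print col[FNR], $0}' newcol.txt file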

remove comma in jsonpath template using bash

I have a JSONPath template query:
oc get event -o jsonpath='{range .items[*]}{.name}{","}{.message}{","}{.eventname}{"\n"}{end}' > /tmp/test.csv
I'm redirecting the output to a CSV file:
name1,message of event one,eventname1
name2,message of,event,two,eventname2
name3,message of event three,eventname3
name4,message of, event four,eventname4
Some messages contain commas, as in the output above. I want to replace the commas with spaces in the second column (message) in the bash script.
Does anyone have any thoughts on how I can achieve this?
Expected result
name1,message of event one,eventname1
name2,message of event two,eventname2
name3,message of event three,eventname3
name4,message of event four,eventname4
Assuming you can change the field delimiter to a character known not to exist in the data (e.g., |), you would now be generating:
name1|message of event one|eventname1
name2|message of,event,two|eventname2
name3|message of event three|eventname3
name4|message of, event four|eventname4
From here we can use sed to a) remove/replace , with <space> and then b) replace | with ,:
$ sed 's/[ ]*,[ ]*/,/g;s/,/ /g;s/|/,/g'
NOTE: the s/[ ]*,[ ]*/,/g is needed to address the additional requirement of stripping the spaces around commas (otherwise replacing , with <space> would leave repeated spaces, as in line #4).
When applied to the data this generates:
name1,message of event one,eventname1
name2,message of event two,eventname2
name3,message of event three,eventname3
name4,message of event four,eventname4
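Putting it together, a hedged sketch: if the template's {","} literals are swapped for {"|"} so the query writes a pipe-delimited file (say, a hypothetical /tmp/test.psv), the sed above turns it back into a clean CSV:
sed 's/[ ]*,[ ]*/,/g;s/,/ /g;s/|/,/g' /tmp/test.psv > /tmp/test.csv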
Another option using awk (for the OP's current data, using , as the field delimiter):
awk -F',' '                     # input field delimiter = ","
{ x=$1","                       # start new string as field #1 + ","
  sep=""                        # initial separator = "" for fields 2 to (NF-1)
  for (i=2;i<NF;i++) {          # loop through fields 2 to (NF-1)
    gsub(/^[ ]+|[ ]+$/,"",$i)   # trim leading/trailing spaces
    x=x sep $i                  # append current field to x along with sep
    sep=" "                     # use " " as separator for rest of fields
  }
  printf "%s,%s\n", x, $NF      # print "x" plus "," plus the last field (NF)
}'
When applied to the data this generates:
name1,message of event one,eventname1
name2,message of event two,eventname2
name3,message of event three,eventname3
name4,message of event four,eventname4

Awk 4.1.4 Error when processing large file

I am using awk 4.1.4 on CentOS 7.6 (x86_64) with 250 GB of RAM to transform a row-wide CSV file into a column-wide CSV based on the last column (Sample_Key). Here is a small example row-wide CSV:
Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387
This is what the correct output looks like for the small CSV example above:
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
Here is the awk code (based on https://unix.stackexchange.com/questions/522046/how-to-convert-a-3-column-csv-file-into-a-table-or-matrix) to achieve the row-to-column-wide transformation:
BEGIN{
    printf "Probe_Key,ind_beta,Sample_Key\n";
}
NR > 1{
    ks[$3 $1] = $2; # save the second column using the first and third as index
    k1[$1]++;       # save the first column
    k2[$3]++;       # save the third column
}
END {
    # After processing input
    for (i in k2)   # loop over third column
    {
        printf "%s,", i ;          # print it as first value in the row
        for (j in k1)              # loop over the first column (index)
        {
            if ( j < length(k1) )
            {
                printf "%s,",ks[i j]; # and print values ks[third_col first_col]
            }
            else
                printf "%s",ks[i j];  # print last value
        }
        print "";   # newline
    }
}
However, when I input a relatively large row-wide CSV file (5 GB in size), I get tons of values without any commas in the output, then values start to appear with commas, then values without commas again, and so on. Here is a small excerpt from the portion without commas:
0.04510.03580.81470.57690.8020.89630.90950.10880.66560.92240.05060.78130.86910.07330.03080.0590.06440.80520.05410.91280.16010.19420.08960.0380.95010.7950.92760.9410.95710.2830.90790.94530.69330.62260.90520.1070.95480.93220.01450.93390.92410.94810.87380.86920.9460.93480.87140.84660.33930.81880.94740.71890.11840.05050.93760.94920.06190.89280.69670.03790.8930.84330.9330.9610.61760.04640.09120.15520.91850.76760.94840.61340.02310.07530.93660.86150.79790.05090.95130.14380.06840.95690.04510.75220.03150.88550.82920.11520.11710.5710.94340.50750.02590.97250.94760.91720.37340.93580.84730.81410.95510.93080.31450.06140.81670.04140.95020.73390.87250.93680.20240.05810.93660.80870.04480.8430.33120.88170.92670.92050.71290.01860.93260.02940.91820
and when I use the largest row-wide CSV file (126 GB in size), I get the following error:
ERROR (EXIT CODE 255) Unknow error code
How do I debug these two situations, given that the code works for small input sizes?
Instead of trying to hold all 5 GB (or 126 GB) of data in memory at once and printing everything together at the end, here's an approach using sort and GNU datamash to group each set of values together as they come through the input:
$ datamash --header-in -t, -g3 collapse 2 < input.csv | sort -t, -k1,1n
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
This assumes your file is already grouped with all the identical third-column values together in blocks, and the first/second columns already sorted in the appropriate order, like your sample input. If that's not the case, use the slower:
$ tail -n +2 input.csv | sort -t, -k3,3n -k1,1n | datamash -t, -g3 collapse 2
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
If you can get rid of that header line so sort can be passed the file directly instead of in a pipe, it might be able to pick a more efficient sorting method knowing the full size in advance.
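For example, one hedged way to do that (body.csv is just a hypothetical name for the header-less copy):
tail -n +2 input.csv > body.csv
sort -t, -k3,3n -k1,1n body.csv | datamash -t, -g3 collapse 2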
If your data is already grouped on field 3 and sorted on field 1, you can simply do:
$ awk -F, 'NR==1 {next}
{if(p!=$3)
{if(p) print v; v=$3 FS $2; p=$3}
else v=v FS $2}
END{print v}' file
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
If not, pre-sorting is a better idea than caching all the data in memory, which will blow up for large input files.
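A hedged sketch of that pre-sort, reusing the same streaming awk on unsorted input (tail drops the header, so the NR==1 guard is no longer needed):
tail -n +2 file | sort -t, -k3,3n -k1,1n |
    awk -F, '{if(p!=$3){if(p) print v; v=$3 FS $2; p=$3} else v=v FS $2} END{print v}'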

Extract column data from csv file based on row values

I am trying to use awk/sed to extract specific column data based on row values. My actual files have 15 columns and over 1,000 rows (from a .csv file).
Simple EXAMPLE: Input: a CSV file with a total of 5 columns and 100 rows. Output: data from columns 2 through 5, based on specific row values from column 2. (I have a specific list of the row values I want to match. The values are numbers.)
The file looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The resulting output:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output example: I want the full rows where column 2 equals 7439 or 7500.
“Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain, other than that the convenience variable q, defined as a double quote, helps to eliminate escaping.
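To keep the header and match both IDs from the expected output, the same trick extends naturally (a sketch only):
awk -F, -v q='"' 'NR==1 || $2==q"7439"q || $2==q"7500"q' file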
awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"

Remove Rows From CSV Where A Specific Column Matches An Input File

I have a CSV that contains multiple columns and rows [File1.csv].
I have another CSV file (just one column) that lists specific words [File2.csv].
I want to be able to remove rows within File1 if any column matches any of the words listed in File2.
I originally used this:
grep -v -F -f File2.csv File1.csv > File3.csv
This worked, to a certain extent. The issue I ran into was with columns that had more than one word in them (e.g. word1,word2,word3). File2 contained word2, but that row was not deleted.
I tried spreading the words apart to look like this: (word1 , word2 , word3), but the original command did not work.
How can I remove a row that contains a word from File2 but may have other words in it?
One way using awk.
Content of script.awk:
BEGIN {
    ## Split each line on a double quote surrounded by optional spaces.
    FS = "[ ]*\"[ ]*"
}

## First file (the word list): save the words in a hash.
FNR == NR {
    words[ $2 ] = 1;
    next;
}

## Second file (the one with multiple columns).
FNR < NR {
    ## Print the line unchanged if the eighth field has no interesting
    ## value (N/A) or it is the first line of the file (header), and
    ## skip the remaining checks.
    if ( $8 == "N/A" || FNR == 1 ) {
        print $0
        next
    }

    ## Split the field of interest on commas. Traverse it searching for a
    ## word saved from the first file. Print the line only if none is found.
    ## Change due to an error pointed out in comments.
    ##--> split( $8, array, /[ ]*,[ ]*/ )
    ##--> for ( i = 1; i <= length( array ); i++ ) {
    len = split( $8, array, /[ ]*,[ ]*/ )
    for ( i = 1; i <= len; i++ ) {
    ## END change.
        if ( array[ i ] in words ) {
            found = 1
            break
        }
    }
    if ( ! found ) {
        print $0
    }
    found = 0
}
Assuming File1.csv and File2.csv have the content provided in the comments of Thor's answer (I suggest adding that information to the question), run the script like:
awk -f script.awk File2.csv File1.csv
With the following output:
"DNSName","IP","OS","CVE","Name","Risk"
"ex.example.com","1.2.3.4","Linux","N/A","HTTP 1.1 Protocol Detected","Information"
"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"
"ex.example.com","1.2.3.4","Linux","CVE-2012-2141","Net-SNMP Denial of Service (Zero-Day) - RHEL5","Medium"
"ex.example.com","1.2.3.4","Linux","N/A","Web Application index.php?s=-badrow Detected","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","Apache HTTPD Server Version Out Of Date","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","PHP Unsupported Version Detected","High"
"ex.example.com","1.2.3.4","Linux","N/A","HBSS Common Management Agent - UNIX/Linux","High"
You could split lines containing multiple patterns in File2.csv into one pattern per line.
Below, tr converts lines like word1,word2 into separate lines before they are used as patterns. The <(...) construct temporarily acts as a file/FIFO (tested in bash):
grep -v -F -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv
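If substring matches are a concern (grep -F also matches partial words inside a field), here is a hedged awk sketch that only drops rows whose comma-separated sub-values equal a File2 word exactly (assuming File2.csv holds one word per line, quoted or not):
awk -F'"' '
NR==FNR { gsub(/"/,""); words[$0]=1; next }   # load the File2 words, stripping any quotes
{
    drop=0
    for (i=2; i<=NF; i+=2) {                  # even-numbered fields are the quoted values
        n = split($i, a, /[ ]*,[ ]*/)         # split multi-word fields on commas
        for (j=1; j<=n; j++) if (a[j] in words) drop=1
    }
}
!drop                                         # print only rows with no exact match
' File2.csv File1.csv > File3.csv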