I have a bunch of csv files, each of which is an 84 x 84 matrix of numbers. I'm trying to use awk to sum all of the cells (to generate a single number). So far, all I've been able to come up with is the following, which can sum a single column at a time (for example, column 75), but not all columns together:
awk -F ',' '{sum += $75} END {print sum}' file_name.csv
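I'm guessing the grand total could be obtained by looping over every field, something like this untested sketch, but the second part is where I'm stuck:

awk -F ',' '{for (i = 1; i <= NF; i++) sum += $i} END {print sum}' file_name.csv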
Then, I would like to create a new csv file in the same directory, where each column is the sum of that column from the previous csv divided by the total sum generated by the previous awk command. In other words, a csv with only one row, where each column holds that column's sum divided by the total sum.
Any help would be massively appreciated!
If the required final output is simply a single line CSV file with the column sum divided by the total sum for each of the columns in the input, then this should do the job with a single awk command.
{
    for (i = 1; i <= NF; i++)
    {
        colsum[i] += $i        # per-column running sum
        totsum += $i           # grand total over all cells
        if (NF > maxfld)
            maxfld = NF        # track the widest record seen
    }
}
END {
    pad = ""
    for (i = 1; i <= maxfld; i++)
    {
        printf("%s%.2f", pad, colsum[i] / totsum)   # column sum / total sum
        pad = ","
    }
    print ""
}
I'd store that in a file such as script.awk and run:
awk -F, -f script.awk data
Given sample input (8 rows and 8 columns) — a set of random numbers between 10 and 99:
34,98,18,16,62,86,21,37
39,10,62,33,81,16,70,36
23,23,56,16,86,82,30,74
18,10,42,46,99,93,83,76
90,10,76,50,12,24,13,96
11,40,89,92,31,71,25,90
87,82,33,24,32,25,98,17
86,50,70,33,93,30,98,67
the output is:
0.12,0.10,0.13,0.09,0.15,0.13,0.13,0.15
Clearly, you can tweak the format used to present the final values; I chose 2 decimal places, but you can choose any format you prefer.
The question asks about an 84x84 matrix; this code works with such matrices too. The matrices don't have to be square. The input data doesn't even have to have the same number of fields in each line. You could add validation to insist on either condition, or both. If you need the intermediate results files, you could arrange for this code to generate them too.
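For instance, if you also wanted the intermediate results on disk, the END block could be extended along these lines (just a sketch; colsums.csv and totalsum.txt are file names I've made up here):

END {
    pad = ""
    for (i = 1; i <= maxfld; i++)
    {
        printf("%s%.2f", pad, colsum[i] / totsum)        # ratios, as before
        printf("%s%g", pad, colsum[i]) > "colsums.csv"   # raw per-column sums
        pad = ","
    }
    print ""
    print "" > "colsums.csv"          # finish the colsums.csv line
    print totsum > "totalsum.txt"     # grand total on its own
}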
I am using Awk 4.1.4 on CentOS 7.6 (x86_64) with 250 GB of RAM to transform a row-wide csv file into a column-wide csv based on the last column (Sample_Key). Here is a small example of a row-wide csv:
Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387
This is what the correct output looks like for the above small csv example
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
Here is the awk code (based on https://unix.stackexchange.com/questions/522046/how-to-convert-a-3-column-csv-file-into-a-table-or-matrix) to achieve the row-wide to column-wide transformation:
BEGIN{
    printf "Probe_Key,ind_beta,Sample_Key\n";
}
NR > 1{
    ks[$3 $1] = $2;   # save the second column using the first and third as index
    k1[$1]++;         # save the first column
    k2[$3]++;         # save the third column
}
END {
    # After processing input
    for (i in k2)               # loop over third column
    {
        printf "%s,", i ;       # print it as first value in the row
        for (j in k1)           # loop over the first column (index)
        {
            if ( j < length(k1) )
            {
                printf "%s,",ks[i j];   # and print values ks[third_col first_col]
            }
            else
                printf "%s",ks[i j];    # print last value
        }
        print "";               # newline
    }
}
However, when I input a relatively large row-wide csv file (5 GB in size), I get long runs of values without any commas in the output, then values with commas, then values without commas again, and so on. Here is a small excerpt from a portion without commas:
0.04510.03580.81470.57690.8020.89630.90950.10880.66560.92240.05060.78130.86910.07330.03080.0590.06440.80520.05410.91280.16010.19420.08960.0380.95010.7950.92760.9410.95710.2830.90790.94530.69330.62260.90520.1070.95480.93220.01450.93390.92410.94810.87380.86920.9460.93480.87140.84660.33930.81880.94740.71890.11840.05050.93760.94920.06190.89280.69670.03790.8930.84330.9330.9610.61760.04640.09120.15520.91850.76760.94840.61340.02310.07530.93660.86150.79790.05090.95130.14380.06840.95690.04510.75220.03150.88550.82920.11520.11710.5710.94340.50750.02590.97250.94760.91720.37340.93580.84730.81410.95510.93080.31450.06140.81670.04140.95020.73390.87250.93680.20240.05810.93660.80870.04480.8430.33120.88170.92670.92050.71290.01860.93260.02940.91820
and when I use the largest row-wide csv file (126 GB in size), I get the following error:
ERROR (EXIT CODE 255) Unknow error code
How do I debug these two situations, given that the code works for small input sizes?
Instead of trying to hold all 5 GB (or 126 GB) of data in memory at once and printing everything out together at the end, here's an approach using sort and GNU datamash to group each set of values together as they come through its input:
$ datamash --header-in -t, -g3 collapse 2 < input.csv | sort -t, -k1,1n
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
This assumes your file is already grouped with all the identical third-column values together in blocks, and the first/second columns already sorted in the appropriate order, like your sample input. If that's not the case, use the slower alternative:
$ tail -n +2 input.csv | sort -t, -k3,3n -k1,1n | datamash -t, -g3 collapse 2
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
If you can get rid of that header line so sort can be passed the file directly instead of in a pipe, it might be able to pick a more efficient sorting method knowing the full size in advance.
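For example, something like this (body.csv is just a scratch file name of my choosing):

$ tail -n +2 input.csv > body.csv
$ sort -t, -k3,3n -k1,1n -o body.csv body.csv
$ datamash -t, -g3 collapse 2 < body.csv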
If your data is already grouped by field 3 and sorted by field 1, you can simply do:
$ awk -F, 'NR==1 {next}
           {if (p!=$3)
              {if (p) print v; v=$3 FS $2; p=$3}
            else v=v FS $2}
           END{print v}' file
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
If not, pre-sorting is a better idea than caching all the data in memory, which will blow up for large input files.
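For example, a pre-sort in front of essentially the same awk keeps memory use flat (the NR==1 header skip goes away because tail already drops the header):

$ tail -n +2 file | sort -t, -k3,3n -k1,1n |
  awk -F, '{if (p!=$3)
              {if (p) print v; v=$3 FS $2; p=$3}
            else v=v FS $2}
           END{print v}'
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111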
I have a table which has been exported to a file on UNIX, with data in CSV format, for example:
File 1:
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123456,09-09-2019,Prisi,Kumar
Now I need to mask ACCT_NUM and FIRST_NAME and replace the masked values in File 1; the output should look something like this:
File 2:
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123XXX,09-09-2019,PRXXX,Kumar
I have separate masking functions for numerical and string fields; I need to know how to replace the masked columns in the original file.
I'm not sure what you want to do with FNR and what the point of assigning to array a should be. This is how I would do it:
$ cat x.awk
#!/bin/sh
awk -F, -vOFS=, '               # Set input and output field separators.
NR == 1 {                       # First record?
    print                       # Just output.
    next                        # Then continue with next line.
}
NR > 1 {                        # Second and subsequent records?
    if (length($1) < 4) {       # Short account number?
        $1 = "XXX"              # Replace the whole number.
    } else {
        sub(/...$/, "XXX", $1)  # Change last three characters.
    }
    if (length($3) < 4) {       # Short first name?
        $3 = "XXX"              # Replace the whole name.
    } else {
        sub(/...$/, "XXX", $3)  # Change last three characters.
    }
    print                       # Output the changed line.
}'
Showtime!
$ cat input
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123456,09-09-2019,Prisi,Kumar
123,29-12-2017,Jim,Kirk
$ ./x.awk < input
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123XXX,09-09-2019,PrXXX,Kumar
XXX,29-12-2017,XXX,Kirk
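To produce File 2 rather than writing to the terminal, just redirect the output (File1.csv and File2.csv stand in for your real file names):

$ ./x.awk < File1.csv > File2.csv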
I am trying to use awk/sed to extract specific column data based on row values. My actual files are .csv files with 15 columns and over 1,000 rows.
Simple example: the input is a csv file with a total of 5 columns and 100 rows; the output is the data from columns 2 through 5, based on specific row values in column 2. (I have a specific list of the row values I want to filter on. The values are numbers.)
File looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The response I get:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN-489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN-489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output example: I want the full rows where column 2 equals 7439 or 7500.
“Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain, other than that the convenience variable q, defined as a double quote, eliminates the need for escaping.
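For comparison, the same test with the quotes escaped inline would be:

$ awk -F, '$2 == "\"7439\""' file
"06/02/16","7439","Yellow","57","3"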
awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
I want to compare each line of a csv file with the previous line and keep only the lines that meet the following conditions:
1. the first pattern is the same as the one in the previous line, and
2. the absolute difference between the values in the second column equals 1.
For example, if I have these lines:
aaaa;12
aaaa;13
bbbb;11
bbbb;9
cccc;9
cccc;8
I will keep only
aaaa;12
aaaa;13
cccc;9
cccc;8
The logic would work this way:
If the previous pattern is not equal to this pattern, then remember this pattern and this value as the new "previous", and move to the next line.
Otherwise, if the difference between the previous value and this value equals 1 or -1 (awk does not have a built-in abs() function), then print the previous pattern and value, and print this line.
Take a stab at translating that into code, and come back when you have questions.
Given:
$ echo "$test"
aaaa;12
aaaa;13
bbbb;11
bbbb;9
cccc;9
cccc;8
You can do something like:
$ echo "$test" | awk -F ";" 'function abs(v) {return v < 0 ? -v : v} $1==l1 && abs($2-l2)==1 {print l1 FS l2 RS $0} {l1=$1;l2=$2}'
aaaa;12
aaaa;13
cccc;9
cccc;8
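If the one-liner is hard to follow, the same logic spread over a script file (compare.awk is just a name I picked; run it as awk -f compare.awk file) might be clearer:

BEGIN { FS = ";" }
function abs(v) { return v < 0 ? -v : v }   # awk has no built-in abs()
$1 == l1 && abs($2 - l2) == 1 {             # same pattern as the previous line, values 1 apart
    print l1 FS l2                          # print the previous line...
    print $0                                # ...and the current one
}
{ l1 = $1; l2 = $2 }                        # remember this line for the next comparison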
I have a csv file with over 5k fields/columns with header names. I would like to import only some specific fields into my database.
I am using LOAD DATA LOCAL INFILE for other, smaller files that need to be imported:
LOAD DATA
LOCAL INFILE 'C:/wamp/www/imports/new_export.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(colour,shape,size);
Assigning dummy variables for the columns to skip would be cumbersome. Also, I would prefer to reference columns by their header names, to future-proof the process in case the file gains additional fields.
I am considering running awk on the file before loading it into the database, but the examples I have found so far don't seem to work.
Any suggestions on best approach for this would be appreciated.
This is similar to MvG's answer, but it doesn't require gawk 4 and thus uses -F as suggested in that answer. It also shows a technique for listing the desired fields and iterating over the list. This may make the code easier to maintain if there is a large list.
#!/usr/bin/awk -f
BEGIN {
    col_list = "colour shape size"  # continuing with as many as desired for output
    num_cols = split(col_list, cols)
    FS = OFS = ","
}
NR==1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i   # remember column for name
    }
    # next # enable this line to suppress headers.
}
{
    delim = ""
    for (i = 1; i <= num_cols; i++) {
        printf "%s%s", delim, $p[cols[i]]
        delim = OFS
    }
    printf "\n"
}
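To tie this back to the LOAD DATA step, one option is to run the script over the export first and point LOCAL INFILE at the trimmed file (pick_cols.awk and new_export_trimmed.csv are placeholder names):

awk -f pick_cols.awk new_export.csv > new_export_trimmed.csv

Then load new_export_trimmed.csv with just (colour,shape,size) in the column list, either enabling the commented-out next line so the header row is dropped, or adding IGNORE 1 LINES to the LOAD DATA statement.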
Does your actual data have any commas? If not, you might be best served using cut:
cut -d, -f1,2,5,8-12
will select the listed fields, splitting lines at each comma. If any of your quote-enclosed text fields does contain a comma, things will break, as cut doesn't know about quoting.
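With several thousand headers, one way to find the field numbers to hand to cut is to number the header fields first (new_export.csv is the file name from the question); each match is printed as field-number:name:

head -1 new_export.csv | tr ',' '\n' | grep -nx -e colour -e shape -e size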
Here is a full-featured solution which can deal with all kinds of quotes and commas in the values of the csv table, and can extract columns by name. It requires gawk and is based on the FPAT feature suggested in this answer.
BEGIN {
    # Allow simple values, quoted values and even doubled quotes
    FPAT="\"[^\"]*(\"\"[^\"]*)*\"|[^,]*"
}
NR==1 {
    for (i = 1; i <= NF; i++) {
        p[$i]=i   # remember column for name
    }
    # next # enable this line to suppress headers.
}
{
    print $p["colour"] "," $p["shape"] "," $p["size"]
}
Write this to a file, to be invoked by gawk -f file.awk.
As the column-splitting and index-by-header features are orthogonal, you could use part of the script with a non-GNU awk to select columns by name, using a simple -F, instead of FPAT.
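For the record, the portable variant might look like this; it relies on plain -F, splitting, so it only holds up when no field contains an embedded comma:

BEGIN {
    FS = OFS = ","
}
NR==1 {
    for (i = 1; i <= NF; i++) {
        p[$i] = i   # remember column for name
    }
    # next # enable this line to suppress headers.
}
{
    print $p["colour"], $p["shape"], $p["size"]
}

Invoke it with awk -f file.awk as before.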