Converting Multiple CSV Rows to Individual Columns [closed] - csv

I have a CSV file in this format:
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
I want to transform it into this
#Time,CPU 0 Data,CPU 1 Data
x,a,b
y,c,d
But I don't know in advance how many CPU cores a system will have (represented by the CPU column), and there are multiple data columns, not just the single Data column shown above.
How would I go about doing this?
Example input
# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:10:01 UTC,2,3.87,0,1.38,0.07,0.12,94.55
hostname,600,2018-07-24 00:10:01 UTC,3,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
hostname,600,2018-07-24 00:20:01 UTC,2,5.26,0,1.53,0.07,0.12,93.03
hostname,600,2018-07-24 00:20:01 UTC,3,5.64,0,1.52,0.04,0.12,92.68
This would be the output for that file (CPU -1 turns into CPU ALL; also, the key is just the timestamp, since the hostname and interval stay constant):
# hostname,interval,timestamp,CPU ALL %user,CPU ALL %nice,CPU ALL %system,CPU ALL %iowait,CPU ALL %steal,CPU ALL %idle,CPU 0 %user,CPU 0 %nice,CPU 0 %system,CPU 0 %iowait,CPU 0 %steal,CPU 0 %idle,CPU 1 %user,CPU 1 %nice,CPU 1 %system,CPU 1 %iowait,CPU 1 %steal,CPU 1 %idle,CPU 2 %user,CPU 2 %nice,CPU 2 %system,CPU 2 %iowait,CPU 2 %steal,CPU 2 %idle,CPU 3 %user,CPU 3 %nice,CPU 3 %system,CPU 3 %iowait,CPU 3 %steal,CPU 3 %idle
hostname,600,2018-07-24 00:10:01 UTC,5.19,0,1.52,0.09,0.13,93.07,5.37,0,1.58,0.15,0.15,92.76,8.36,0,1.75,0.08,0.1,89.7,3.87,0,1.38,0.07,0.12,94.55,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,5.13,0,1.52,0.08,0.13,93.15,4.38,0,1.54,0.13,0.15,93.8,5.23,0,1.49,0.07,0.11,93.09,5.26,0,1.53,0.07,0.12,93.03,5.64,0,1.52,0.04,0.12,92.68

Your question isn't clear and doesn't contain the expected output for your posted larger, presumably more realistic sample CSV, so I don't know exactly what output you were hoping for, but this will at least show you the right approach:
$ cat tst.awk
BEGIN {
    FS = OFS = ","
}
NR==1 {
    for (i=1; i<=NF; i++) {
        fldName2nmbr[$i] = i
    }
    tsFldNmbr = fldName2nmbr["timestamp"]
    cpuFldNmbr = fldName2nmbr["CPU"]
    next
}
{
    tsVal = $tsFldNmbr
    cpuVal = $cpuFldNmbr
    if ( !(seenTs[tsVal]++) ) {
        tsVal2nmbr[tsVal] = ++numTss
        tsNmbr2val[numTss] = tsVal
    }
    if ( !(seenCpu[cpuVal]++) ) {
        cpuVal2nmbr[cpuVal] = ++numCpus
        cpuNmbr2val[numCpus] = cpuVal
    }
    tsNmbr = tsVal2nmbr[tsVal]
    cpuNmbr = cpuVal2nmbr[cpuVal]
    cpuData = ""
    for (i=1; i<=NF; i++) {
        if ( (i != tsFldNmbr) && (i != cpuFldNmbr) ) {
            cpuData = (cpuData == "" ? "" : cpuData OFS) $i
        }
    }
    data[tsNmbr,cpuNmbr] = cpuData
}
END {
    printf "%s", "timestamp"
    for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
        printf "%sCPU %s Data", OFS, cpuNmbr2val[cpuNmbr]
    }
    print ""
    for (tsNmbr=1; tsNmbr<=numTss; tsNmbr++) {
        printf "%s", tsNmbr2val[tsNmbr]
        for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
            printf "%s\"%s\"", OFS, data[tsNmbr,cpuNmbr]
        }
        print ""
    }
}
$ awk -f tst.awk file
timestamp,CPU -1 Data,CPU 0 Data,CPU 1 Data,CPU 2 Data,CPU 3 Data
2018-07-24 00:10:01 UTC,"hostname,600,5.19,0,1.52,0.09,0.13,93.07","hostname,600,5.37,0,1.58,0.15,0.15,92.76","hostname,600,8.36,0,1.75,0.08,0.1,89.7","hostname,600,3.87,0,1.38,0.07,0.12,94.55","hostname,600,3.16,0,1.36,0.05,0.14,95.29"
2018-07-24 00:20:01 UTC,"hostname,600,5.13,0,1.52,0.08,0.13,93.15","hostname,600,4.38,0,1.54,0.13,0.15,93.8","hostname,600,5.23,0,1.49,0.07,0.11,93.09","hostname,600,5.26,0,1.53,0.07,0.12,93.03","hostname,600,5.64,0,1.52,0.04,0.12,92.68"
I put the per-CPU data within double quotes so you could import it to Excel or similar without worrying about the commas between the sub-fields.

If we assume that the CSV input file is sorted according to increasing timestamps, you could try something like this:
use feature qw(say);
use strict;
use warnings;

my $fn = 'log.csv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %info;
my @times;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $time, $cpu, $data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, $data;
}
close $fh;
for my $time (@times) {
    say join ",", $time, @{ $info{$time} };
}
Output:
x,a,b
y,c,d

Related

Efficient code to do a countif, ranking on huge csv file

I need help accelerating the Perl code below; a run time of 4 to 6 hours would be fine, faster is better :)
The CSV file is about 14M to 14.5M rows and around 1100 to 1500 columns; roughly 62 GB.
what it does:
do a count (like a countif in excel)
get the percent (based on 14m rows)
get the rank based on count
My current code:
use List::MoreUtils qw(uniq);
$x="Room_reserve";
$in = "D:\\package properties\\${x}.csv";
$out = "D:\\package properties\\output\\${x}_output.csv";
open($fh, '<', $in) or die "Could not open file '$file' $!";
@data = <$fh>;
close($fh);
%counts;
@columns;
$first = 1;
#counter
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',',$dat);
    if ($first == 1) {
        $first = 0;
        next;
    }
    else {
        $count = 1;
        foreach $i (0..$#rows) {
            if ( exists($columns[$i]{$rows[$i]}) ) {
                $columns[$i]{$rows[$i]}++;
            }
            else {
                $columns[$i]{$rows[$i]} = int($count);
            }
        }
    }
}
#output
$first = 1;
open($fh, '>', $out) or die "Could not open file '$file' $!";
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',',$dat);
    foreach $i (0..$#rows) {
        if ($i > 6) {
            #for modifying name
            if ( $first == 1 ) {
                $line = join( ",", "Rank_$rows[$i]", "Percent_$rows[$i]",
                              "Count_$rows[$i]", $rows[$i]);
                print $fh "$line,";
                if ( $i == $#rows ) {
                    $first = 0;
                }
            }
            else {
                @dat_val = reverse sort { $a <=> $b } values %{$columns[$i]};
                %ranks = {};
                $rank_cnt = 0;
                foreach $val (@dat_val) {
                    if ( ! exists($ranks{$val}) ) {
                        $rank_cnt++;
                    }
                    $ranks{$val} = $rank_cnt;
                }
                $rank = $ranks{$columns[$i]{$rows[$i]}};
                $cnt = $columns[$i]{$rows[$i]};
                $ave = ($cnt / 14000000) * 100;
                $line = join( ",", $rank, $ave, $cnt, $rows[$i]);
                print $fh "$line,";
            }
        }
        else {
            print $fh "$rows[$i],";
        }
    }
    print $fh "\n";
}
close($fh);
thanks in advance.
My table:
Col,_1,_2,_3,_4,_5,_6,Col2,Col3,Col,Col5
FALSE,1,2,3,4,5,6,6,6,1,4
FALSE,1,2,3,4,5,6,6,6,1,4
FALSE,1,2,3,4,5,7,6,6,1,3
edited to show sample table and correct $x
Sample output:
Col,_1,_2,_3,_4,_5,_6,Col2,rank_Col2,percent_rank_Col2,count_Col2,Col3,rank_Col3,percent_rank_Col3,count_Col3,Col,rank_Col,percent_rank_Col,count_Col,Col5,rank_Col5,percent_rank_Col5,count_Col5
FALSE,1,2,3,4,5,6,9,2,0.33,1,6,1,0.67,2,1,1,0.67,2,11,1,0.33,1
FALSE,1,2,3,4,5,6,6,1,0.67,2,6,1,0.67,2,2,2,0.33,1,4,1,0.33,1
FALSE,1,2,3,4,5,7,6,1,0.67,2,4,2,0.33,1,1,1,0.67,2,3,1,0.33,1
Presume you have this file:
% ls -lh file
-rw-r--r-- 1 dawg wheel 57G Jul 24 13:15 file
% time wc -l file
14000000 file
wc -l file 29.24s user 7.27s system 99% cpu 36.508 total
% awk -F, 'FNR==1{print $NF; exit}' file
1099
So we have a 57 GB CSV file of 14,000,000 lines by 1099 columns, filled with random numbers.
It only takes 20 to 30 SECONDS to read the entire file in a line-by-line fashion.
How long in Perl?
% time perl -lnE '' file
perl -lnE '' file 5.70s user 9.67s system 99% cpu 15.381 total
So only 15 seconds in Perl line by line. How long to 'gulp' it?
% time perl -0777 -lnE '' file
perl -0777 -lnE '' file 12.13s user 23.86s system 98% cpu 36.688 total
But that is on THIS computer which has 255GB of RAM...
It took this Python script approximately 23 minutes to write that file:
total=0
cols=1100
with open('/tmp/file', 'w') as f_out:
    for cnt in range(14_000_000):
        line=','.join(f'{x}' for x in range(cols))
        total+=cols
        f_out.write(f'{cnt},{total},{line}\n')
The issue with your Perl script is that you are gulping the whole file into memory (with @data = <$fh>;) and the OS may be running out of RAM. It is likely swapping, and that is very slow. (How much RAM do you have?)
Rewrite the script to process the file line by line. You should be able to do your entire analysis in less than an hour.
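As a rough, untested sketch of that line-by-line idea (written in awk here; the assumption that the first 7 columns are row keys comes from the $i > 6 test in your script), the counting pass alone could look like this:

# counting pass: only one line is held in memory at a time
# counts[col, value] = how often "value" occurs in column "col" (columns 8 and up)
BEGIN { FS = "," }
FNR == 1 { next }                     # skip the header line
{
    for (i = 8; i <= NF; i++)
        counts[i, $i]++
}
END {
    for (k in counts) {
        split(k, parts, SUBSEP)
        printf "column %s value %s: %d\n", parts[1], parts[2], counts[k]
    }
}

A second line-by-line pass (or precomputing the ranks, as in the script below) can then attach rank/percent/count to each row without ever holding the whole 62 GB file in memory.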
Can you try the following script? It precomputes the ranks instead of recalculating them for each row. It also avoids saving the first 7 columns in @columns, since they do not appear to be used for anything.
use feature qw(say);
use strict;
use warnings;

{
    my $x = "Room_reserve";
    my $in = "D:\\package properties\\${x}.csv";
    my $out = "D:\\package properties\\output\\${x}_output.csv";
    my $min_col = 7;
    my ($data1, $data2) = read_input_file($in, $min_col);
    my $columns = create_counter($data2);
    my $ranks = create_ranks( $columns );
    write_output($data1, $data2, $columns, $ranks, $min_col, $out);
}

sub create_counter {
    my ($data) = @_;
    print "Creating counter..";
    my @columns;
    my $first = 1;
    for my $row (@$data) {
        if ( $first ) {
            $first = 0; next;  # skip header
        }
        for my $i (0..$#$row) {
            $columns[$i]{$row->[$i]}++;
        }
    }
    say "done.";
    return \@columns;
}

sub create_ranks {
    my ( $columns ) = @_;
    print "Creating ranks..";
    my @ranks;
    for my $col (@$columns) {
        # sort the column values according to highest frequency.
        my @freqs = sort { $b <=> $a } values %$col;
        my $idx = 1;
        my %ranks = map {$_ => $idx++} @freqs;
        push @ranks, \%ranks;
    }
    say "done.";
    return \@ranks;
}

sub read_input_file {
    my ($in, $min_col) = @_;
    print "Reading input file $in ..";
    open(my $fh, '<', $in) or die "Could not open file '$in' $!";
    my @data1;
    my @data2;
    while( my $line = <$fh> ) {
        chomp $line;
        my @fields = split(',', $line);
        next if @fields == 0;
        die "Unexpected column count at line $.\n" if @fields < $min_col;
        push @data1, [@fields[0..($min_col-1)]];
        push @data2, [@fields[$min_col..$#fields]];
    }
    say " $. lines.";
    close($fh);
    return \@data1, \@data2;
}

sub write_output {
    my ($data1, $data2, $columns, $ranks, $min_col, $out) = @_;
    say "Saving output file $out..";
    open(my $fh, '>', $out) or die "Could not open file '$out': $!";
    for my $i (0..$#$data1) {
        my $row1 = $data1->[$i];
        my $row2 = $data2->[$i];
        if ($i == 0) {
            print $fh join ",", @$row1;
            print $fh ",";
            for my $j (0..$#$row2) {
                print $fh join ",", $row2->[$j],
                                    "Rank_" . $row2->[$j],
                                    "Percent_". $row2->[$j],
                                    "Count_" . $row2->[$j];
                print $fh "," if $j < $#$row2;
            }
            print $fh "\n";
            next;
        }
        print $fh join ",", @$row1;
        print $fh ",";
        for my $j (0..$#$row2) {
            my $cnt = $columns->[$j]{$row2->[$j]};
            my $rank = $ranks->[$j]{$cnt};
            my $ave = ($cnt / 14000000) * 100;
            print $fh join ",", $row2->[$j], $rank, $ave, $cnt;
            print $fh "," if $j < $#$row2;
        }
        print $fh "\n";
    }
    close $fh;
    say "Done.";
}

AWK: statistics operations of multi-column CSV data

With the aim of performing some statistical analysis of multi-column data, I am analyzing a large number of CSV files using the following bash + AWK routine:
#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results
#cd "${home}"/results
cd ${storage}
csv_pattern='*_filt.csv'
while read -r d; do
    awk -v rescore="$rescore" '
    FNR==1 {
        if (n)
            mean[suffix] = s/n
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix)
        sub(/^.*_/, "", suffix)
        s=n=0
    }
    FNR > 1 {
        s += $3
        ++n
    }
    END {
        out = rescore "/" prefix ".csv"
        mean[suffix] = s/n
        print prefix ":", "dG(mean)" > out
        for (i in mean)
            printf "%s: %.2f\n", i, mean[i] >> out
        close(out)
    }' "${d}_"*/${csv_pattern} #> "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
Basically the script takes an ensemble of CSV files belonging to the same prefix (defined as the naming pattern at the beginning of the directory containing the CSV, for example 10V1 from 10V1_cne_lig1) and calculates for it the mean of the numbers in the third column:
# input *_filt.csv located in the folder 10V1_cne_lig1001
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
and adds one line to 10V1.csv, which is organized in a two-column format: i) the suffix of the folder containing the initial CSV; ii) the mean value calculated over all numbers in the third column (dG) of the input CSV:
# this is two column format of output.csv: 10V1.csv
10V1: dG(mean)
lig1001: -5.37
In this way, for 100 CSV files such an output.csv should contain 100 lines with the mean values, and so on.
I need to introduce a small modification to the AWK part of my routine that would add a third column to the output CSV with the RMSD value (as a measure of the spread of the initial dG values that were used to calculate the mean). Using AWK syntax, for a particular mean value the RMSD could be expressed as:
mean=$(awk -F , '{sum+=$3}END{printf "%.2f", sum/NR}' $csv)
rmsd=$(awk -v mean=$mean '{++n;sum+=($NF-mean)^2} END{if(n) printf "%.2f", sqrt(sum/n)}' $csv)
Here is the expected output with 5 means and 5 RMSD values calculated from 5 CSV logs (the first one corresponds to my example above):
10V1: dG(mean): RMSD (error)
lig1001 -5.37 0.30
lig1002 -8.53 0.34
lig1003 -6.57 0.25
lig1004 -9.53 0.00 # rmsd=0 since initial csv has only 1 line: no data variance
lig1005 -8.11 0.39
How could this addition be incorporated into my main bash+AWK code so that a third RMSD column (for each processed CSV, using its corresponding mean) is added to the output.csv?
You can calculate both the mean and the RMSD within the awk code.
Would you please try the following awk code:
awk -v rescore="$rescore" '
FNR==1 {
    if (n) {                     # calculate the results of previous file
        m = s / n                # mean
        var = s2 / n - m * m     # variance
        if (var < 0) var = 0     # avoid an exception due to round-off error
        mean[suffix] = m         # store the mean in an array
        rmsd[suffix] = sqrt(var)
    }
    prefix=suffix=FILENAME
    sub(/_.*/, "", prefix)
    sub(/\/[^\/]+$/, "", suffix)
    sub(/^.*_/, "", suffix)
    s = 0                        # sum of $3
    s2 = 0                       # sum of $3 ** 2
    n = 0                        # count of samples
}
FNR > 1 {
    s += $3
    s2 += $3 * $3
    ++n
}
END {
    out = rescore "/" prefix ".csv"
    m = s / n
    var = s2 / n - m * m
    if (var < 0) var = 0
    mean[suffix] = m
    rmsd[suffix] = sqrt(var)
    print prefix ":", "dG(mean)", "dG(rmsd)" > out
    for (i in mean)
        printf "%s: %.2f %.2f\n", i, mean[i], rmsd[i] >> out
    close(out)
}'
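For reference, the code above relies on the one-pass identity variance = (sum of squares)/n - mean^2. As a standalone, untested sketch (the file path is only an example), the same mean/RMSD pair for a single CSV can be computed in one pass over its third column:

awk -F, 'FNR > 1 { s += $3; s2 += $3 * $3; n++ }
         END {
             m = s / n
             var = s2 / n - m * m          # E[x^2] - mean^2
             if (var < 0) var = 0          # guard against round-off error
             printf "%.2f %.2f\n", m, sqrt(var)
         }' 10V1_cne_lig1001/some_filt.csv

This should give the same pair of numbers as the two separate awk calls shown in the question, but in a single pass.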
Here is the version to print the lowest value of dG.
awk -v rescore="$rescore" '
FNR==1 {
    if (n) {                     # calculate the results of previous file
        m = s / n                # mean
        var = s2 / n - m * m     # variance
        if (var < 0) var = 0     # avoid an exception due to round-off error
        mean[suffix] = m         # store the mean in an array
        rmsd[suffix] = sqrt(var)
        lowest[suffix] = min
    }
    prefix=suffix=FILENAME
    sub(/_.*/, "", prefix)
    sub(/\/[^\/]+$/, "", suffix)
    sub(/^.*_/, "", suffix)
    s = 0                        # sum of $3
    s2 = 0                       # sum of $3 ** 2
    n = 0                        # count of samples
    min = 0                      # lowest value of $3
}
FNR > 1 {
    s += $3
    s2 += $3 * $3
    ++n
    if ($3 < min) min = $3       # update the lowest value
}
END {
    if (n) {                     # just to avoid division by zero
        m = s / n
        var = s2 / n - m * m
        if (var < 0) var = 0
        mean[suffix] = m
        rmsd[suffix] = sqrt(var)
        lowest[suffix] = min
    }
    out = rescore "/" prefix ".csv"
    print prefix ":", "dG(mean)", "dG(rmsd)", "dG(lowest)" > out
    for (i in mean)
        printf "%s: %.2f %.2f %.2f\n", i, mean[i], rmsd[i], lowest[i] > out
}' file_*.csv
I've assumed all dG values are negative. If there is any chance a value is greater than zero, modify the line min = 0, which initializes the variable, to a considerably large value (10,000 or whatever).
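Alternatively, an untested tweak that seeds min from the first data row of each file, so no assumption about the sign of dG is needed: drop the min = 0 line from the FNR==1 block and change the FNR > 1 rule to

FNR > 1 {
    s += $3
    s2 += $3 * $3
    ++n
    if (n == 1 || $3 < min) min = $3   # seed min from the first data row
}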
Please apply your modifications regarding the filenames, if needed.
The suggestions by Ed Morton are also included although the results will be the same.

Complex CSV parsing with Linux commands

I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (separated by the above header).
I would like to extract the 3rd property (HC) of every entry.
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
Whenever there are n lines of HC recorded for an entry, I want to extract the sum of those n values.
The expected output for the above file:
14
28
51
0
37
10
I know I can write a program for this, but is there an easy way to get it with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.
awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }'
awk solution:
$ awk -F';' '$3=="HC" && p{
    print sum          # print current total
    sum=p=0            # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)     # make sure $3 is converted to a number; sum it up
    p=1                # set p to 1
}
# print the last sum
END{print sum}' input.txt
output:
14
28
51
0
37
10
one-liner:
$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
For given inputs:
$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10
It takes a little more care in some cases, for example:
$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10
# Find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)"
Explanation:
Get contents of the file using cat
Take only the third column using cut delimiter of ;
Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions to eventually yield expr 0 + 14;
Replace \n newlines temporarily with # to circumvent possible BSD sed limitations
Replace double ## with single # to avoid blank lines turning into spaces and causing expr to bomb out.
Replace # with + to add the numbers together.
Execute the command, but with a true || 0; expr ... to avoid a guaranteed syntax error on the first line.
Which creates this:
true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10
The output looks like this:
14
28
51
0
37
10
This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if this helps you.
awk -F";" '
/^H/ && $3!="HC"{
flag="";
next
}
/^H/ && $3=="HC"{
if(NR>1){
printf("%d\n",sum)
};
sum=0;
flag=1;
next
}
flag{
sum+=$3
}
END{
printf("%d\n",sum)
}
' Input_file
Output will be as follows.
14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10

Use multiple lines with Awk

I have a CSV that has underscore delimiters. I have 8 lines that need to be converted to one in this way:
101_1_variableName_(value)
101_1_variableName1_(value2)
into:
101 1 (value) (value2)
(in different boxes preferably)
The problem is that I don't know how to use multiple lines in awk to form a single line. Any help is appreciated.
UPDATE: (input + output)
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
(last part all one line)
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
You can use Perl:
use strict;
use warnings;

my %hash=();
while (<DATA>) {
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{join(' ', split('_', $1) )} }, $2;
    }
}
print "$_ ". join(' ', @{ $hash{$_} })."\n" for (keys %hash);
__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
Prints:
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Or, perl one line version:
$ perl -lane '
> push @{ $hash{join(" ", split("_", $1) )} }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
> END { print "$_ ". join(" ", @{ $hash{$_}})."\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
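Since the question asked about awk, here is a rough, untested awk sketch of the same grouping idea; it assumes the key is the first three underscore-separated fields and that the single field right after the key is the variable name, with everything after that kept as the value:

# join values that share the same key (first three "_"-separated fields)
BEGIN { FS = "_" }
{
    key = $1 " " $2 " " $3
    val = $0
    sub(/^[^_]*_[^_]*_[^_]*_[^_]*_/, "", val)   # drop the key and the variable name
    if (!(key in joined))
        order[++n] = key                        # remember first-seen key order
    joined[key] = (key in joined) ? (joined[key] " " val) : val
}
END {
    for (i = 1; i <= n; i++)
        print order[i], joined[order[i]]
}

Saved as, say, join.awk (the name is arbitrary), it would be run as awk -f join.awk file.txt and should print the same one-line-per-key output as the Perl versions.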

Using AWK to get text and looping in csv

I'm new to awk and I want to ask...
I have a CSV file like this:
IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012

STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211
and I want to change it to look like this:
IVALSTART IVALEND
23:00:00 23:30:00

STATUS LSN LOC IVALDATE
K lskpg 1201 4/9/2012
K lntrjkt 1201 4/9/2012
K lbkkstp 1211 4/9/2012
How can I do it in awk?
Thanks and best regards!
Try this:
awk '
NR == 1 { name = $3; print $1, $2 }
NR == 2 { date = $3; print $1, $2 }
NR == 3 { print "" }
NR == 4 { $4 = name; print }
NR > 4 { $4 = date; print }
' FILE
If you need formatting, change print to printf with appropriate format specifiers.
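For instance, a rough sketch (the column widths are arbitrary) of the last two rules using printf:

NR == 4 { printf "%-10s %-9s %-6s %s\n", $1, $2, $3, name }
NR > 4  { printf "%-10s %-9s %-6s %s\n", $1, $2, $3, date }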