Converting Multiple CSV Rows to Individual Columns [closed] - csv
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a CSV file in this format:
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
I want to transform it into this:
#Time,CPU 0 Data,CPU 1 Data
x,a,b
y,c,d
But I don't know how many CPU cores a given system will have (represented by the CPU column), and I also have multiple columns of data, not just the single Data column shown above.
How would I go about doing this?
Example input
# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:10:01 UTC,2,3.87,0,1.38,0.07,0.12,94.55
hostname,600,2018-07-24 00:10:01 UTC,3,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
hostname,600,2018-07-24 00:20:01 UTC,2,5.26,0,1.53,0.07,0.12,93.03
hostname,600,2018-07-24 00:20:01 UTC,3,5.64,0,1.52,0.04,0.12,92.68
This would be the output for this file (CPU -1 becomes CPU ALL; the key is just the timestamp, since the hostname and interval stay constant):
# hostname,interval,timestamp,CPU ALL %user,CPU ALL %nice,CPU ALL %system,CPU ALL %iowait,CPU ALL %steal,CPU ALL %idle,CPU 0 %user,CPU 0 %nice,CPU 0 %system,CPU 0 %iowait,CPU 0 %steal,CPU 0 %idle,CPU 1 %user,CPU 1 %nice,CPU 1 %system,CPU 1 %iowait,CPU 1 %steal,CPU 1 %idle,CPU 2 %user,CPU 2 %nice,CPU 2 %system,CPU 2 %iowait,CPU 2 %steal,CPU 2 %idle,CPU 3 %user,CPU 3 %nice,CPU 3 %system,CPU 3 %iowait,CPU 3 %steal,CPU 3 %idle
hostname,600,2018-07-24 00:10:01 UTC,5.19,0,1.52,0.09,0.13,93.07,5.37,0,1.58,0.15,0.15,92.76,8.36,0,1.75,0.08,0.1,89.7,3.87,0,1.38,0.07,0.12,94.55,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,5.13,0,1.52,0.08,0.13,93.15,4.38,0,1.54,0.13,0.15,93.8,5.23,0,1.49,0.07,0.11,93.09,5.26,0,1.53,0.07,0.12,93.03,5.64,0,1.52,0.04,0.12,92.68
Your question isn't entirely clear about the expected output for your larger, presumably more realistic sample CSV, so I can't be sure exactly what output you were hoping for, but this will show you the right approach at least:
$ cat tst.awk
BEGIN {
    FS = OFS = ","
}
NR==1 {
    for (i=1; i<=NF; i++) {
        fldName2nmbr[$i] = i
    }
    tsFldNmbr  = fldName2nmbr["timestamp"]
    cpuFldNmbr = fldName2nmbr["CPU"]
    next
}
{
    tsVal  = $tsFldNmbr
    cpuVal = $cpuFldNmbr
    if ( !(seenTs[tsVal]++) ) {
        tsVal2nmbr[tsVal]  = ++numTss
        tsNmbr2val[numTss] = tsVal
    }
    if ( !(seenCpu[cpuVal]++) ) {
        cpuVal2nmbr[cpuVal]  = ++numCpus
        cpuNmbr2val[numCpus] = cpuVal
    }
    tsNmbr  = tsVal2nmbr[tsVal]
    cpuNmbr = cpuVal2nmbr[cpuVal]
    cpuData = ""
    for (i=1; i<=NF; i++) {
        if ( (i != tsFldNmbr) && (i != cpuFldNmbr) ) {
            cpuData = (cpuData == "" ? "" : cpuData OFS) $i
        }
    }
    data[tsNmbr,cpuNmbr] = cpuData
}
END {
    printf "%s", "timestamp"
    for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
        printf "%sCPU %s Data", OFS, cpuNmbr2val[cpuNmbr]
    }
    print ""
    for (tsNmbr=1; tsNmbr<=numTss; tsNmbr++) {
        printf "%s", tsNmbr2val[tsNmbr]
        for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
            printf "%s\"%s\"", OFS, data[tsNmbr,cpuNmbr]
        }
        print ""
    }
}
$ awk -f tst.awk file
timestamp,CPU -1 Data,CPU 0 Data,CPU 1 Data,CPU 2 Data,CPU 3 Data
2018-07-24 00:10:01 UTC,"hostname,600,5.19,0,1.52,0.09,0.13,93.07","hostname,600,5.37,0,1.58,0.15,0.15,92.76","hostname,600,8.36,0,1.75,0.08,0.1,89.7","hostname,600,3.87,0,1.38,0.07,0.12,94.55","hostname,600,3.16,0,1.36,0.05,0.14,95.29"
2018-07-24 00:20:01 UTC,"hostname,600,5.13,0,1.52,0.08,0.13,93.15","hostname,600,4.38,0,1.54,0.13,0.15,93.8","hostname,600,5.23,0,1.49,0.07,0.11,93.09","hostname,600,5.26,0,1.53,0.07,0.12,93.03","hostname,600,5.64,0,1.52,0.04,0.12,92.68"
I put the per-CPU data within double quotes so you could import it to Excel or similar without worrying about the commas between the sub-fields.
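If you need the exact layout from the question (one named column per CPU per metric, hostname and interval kept once, and CPU -1 shown as ALL), a variation along the following lines might work. It is only a rough, untested sketch, not the script above: it hard-codes the assumptions that timestamp is field 3, CPU is field 4, the metrics start at field 5, and that hostname/interval are constant for a given timestamp.

# pivot_wide.awk  (the file name is arbitrary) - rough, untested sketch
BEGIN { FS = OFS = "," }
NR==1 {                                   # remember the metric names from the header
    for (i=1; i<=NF; i++) name[i] = $i
    next
}
{
    ts = $3; cpu = $4
    if ( !(ts in host) ) {                # first time this timestamp is seen
        tsOrder[++numTs] = ts
        host[ts] = $1; ival[ts] = $2
    }
    if ( !(cpu in cpuSeen) ) {            # first time this CPU id is seen
        cpuSeen[cpu] = 1
        cpuOrder[++numCpu] = cpu
    }
    for (i=5; i<=NF; i++) val[ts, cpu, i] = $i
    nf = NF
}
END {
    printf "%s", "# hostname" OFS "interval" OFS "timestamp"
    for (c=1; c<=numCpu; c++)
        for (i=5; i<=nf; i++)
            printf "%sCPU %s %s", OFS, (cpuOrder[c] == "-1" ? "ALL" : cpuOrder[c]), name[i]
    print ""
    for (t=1; t<=numTs; t++) {
        ts = tsOrder[t]
        printf "%s", host[ts] OFS ival[ts] OFS ts
        for (c=1; c<=numCpu; c++)
            for (i=5; i<=nf; i++)
                printf "%s%s", OFS, val[ts, cpuOrder[c], i]
        print ""
    }
}

You would run it as awk -f pivot_wide.awk file; if a CPU is missing for some timestamp, the corresponding cells simply come out empty.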
If we assume that the CSV input file is sorted according to increasing timestamps, you could try something like this:
use feature qw(say);
use strict;
use warnings;

my $fn = 'log.csv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %info;
my @times;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $time, $cpu, $data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, $data;
}
close $fh;
for my $time (@times) {
    say join ",", $time, @{ $info{$time} };
}
Output:
x,a,b
y,c,d
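Since the real data has several value columns rather than the single Data column of the simplified example, the read loop above could be adjusted roughly like this (untested sketch; it still assumes the key column comes first and everything after the CPU column is data):

    my ( $time, $cpu, @data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, @data;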
Related
Efficient code to do a countif, ranking on huge csv file
I need help accelerating the run of the Perl code below. 4 to 6 hours would be fine; faster is better :) The CSV file is about 14m to 14.5m rows and around 1100 to 1500 columns; 62 gig.

What it does:
- do a count (like a COUNTIF in Excel)
- get the percent (based on 14m rows)
- get the rank based on count

My current code:

use List::MoreUtils qw(uniq);
$x="Room_reserve";
$in  = "D:\\package properties\\${x}.csv";
$out = "D:\\package properties\\output\\${x}_output.csv";

open($fh, '<', $in) or die "Could not open file '$file' $!";
@data = <$fh>;
close($fh);

%counts;
@columns;

$first = 1;    #counter
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',',$dat);
    if ($first == 1) {
        $first = 0;
        next;
    }
    else {
        $count = 1;
        foreach $i (0..$#rows) {
            if ( exists($columns[$i]{$rows[$i]}) ) {
                $columns[$i]{$rows[$i]}++;
            }
            else {
                $columns[$i]{$rows[$i]} = int($count);
            }
        }
    }
}

#output
$first = 1;
open($fh, '>', $out) or die "Could not open file '$file' $!";
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',',$dat);
    foreach $i (0..$#rows) {
        if ($i > 6) {
            #for modifying name
            if ( $first == 1 ) {
                $line = join( ",", "Rank_$rows[$i]", "Percent_$rows[$i]", "Count_$rows[$i]", $rows[$i]);
                print $fh "$line,";
                if ( $i == $#rows ) {
                    $first = 0;
                }
            }
            else {
                @dat_val = reverse sort { $a <=> $b } values %{$columns[$i]};
                %ranks = {};
                $rank_cnt = 0;
                foreach $val (@dat_val) {
                    if ( ! exists($ranks{$val}) ) {
                        $rank_cnt++;
                    }
                    $ranks{$val} = $rank_cnt;
                }
                $rank = $ranks{$columns[$i]{$rows[$i]}};
                $cnt = $columns[$i]{$rows[$i]};
                $ave = ($cnt / 14000000) * 100;
                $line = join( ",", $rank, $ave, $cnt, $rows[$i]);
                print $fh "$line,";
            }
        }
        else {
            print $fh "$rows[$i],";
        }
    }
    print $fh "\n";
}
close($fh);

Thanks in advance.

My table:

Col   _1 _2 _3 _4 _5 _6 Col2 Col3 Col Col5
FALSE 1  2  3  4  5  6  6    6    1   4
FALSE 1  2  3  4  5  6  6    6    1   4
FALSE 1  2  3  4  5  7  6    6    1   3

Edited to show the sample table and correct $x.

Sample output:

Col _1 _2 _3 _4 _5 _6 Col2 rank_Col2 percent_rank_Col2 count_Col2 Col3 rank_Col3 percent_rank_Col3 count_Col3 Col rank_Col percent_rank_Col count_Col Col5 rank_Col5 percent_rank_Col5 count_Col5
FALSE 1 2 3 4 5 6 9 2 0.33 1 6 1 0.67 2 1 1 0.67 2 11 1 0.33 1
FALSE 1 2 3 4 5 6 6 1 0.67 2 6 1 0.67 2 2 2 0.33 1 4 1 0.33 1
FALSE 1 2 3 4 5 7 6 1 0.67 2 4 2 0.33 1 1 1 0.67 2 3 1 0.33 1
Presume you have this file:

% ls -lh file
-rw-r--r--  1 dawg  wheel    57G Jul 24 13:15 file
% time wc -l file
 14000000 file
wc -l file  29.24s user 7.27s system 99% cpu 36.508 total
% awk -F, 'FNR==1{print $NF; exit}' file
1099

So we have a 57GB CSV file of 14,000,000 lines by roughly 1100 columns of random numbers.

It only takes 20 to 30 SECONDS to read the entire file in a line-by-line fashion. How long in Perl?

% time perl -lnE '' file
perl -lnE '' file  5.70s user 9.67s system 99% cpu 15.381 total

So only 15 seconds in Perl, line by line. How long to 'gulp' it?

% time perl -0777 -lnE '' file
perl -0777 -lnE '' file  12.13s user 23.86s system 98% cpu 36.688 total

But that is on THIS computer, which has 255GB of RAM...

It took this Python script approximately 23 minutes to write that file:

total=0
cols=1100
with open('/tmp/file', 'w') as f_out:
    for cnt in range(14_000_000):
        line=','.join(f'{x}' for x in range(cols))
        total+=cols
        f_out.write(f'{cnt},{total},{line}\n')

The issue with your Perl script is that you are gulping the whole file (with @data = <$fh>;) and the OS may be running out of RAM. It is likely swapping, and that is very slow. (How much RAM do you have?)

Rewrite the script to do it line by line. You should be able to do your entire analysis in less than 1 hour.
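To make that advice concrete, the counting pass can be done line by line without ever holding the file in memory. This is only a sketch (the path is taken from the question, and it only builds the per-column counters; the ranking and output passes would still follow, much as in the next answer):

use strict;
use warnings;

my $in = 'D:\\package properties\\Room_reserve.csv';   # path assumed from the question
open my $fh, '<', $in or die "Could not open file '$in': $!";
my $header = <$fh>;
my @columns;                  # $columns[$i]{$value} = occurrences of $value in column $i
while ( my $line = <$fh> ) {
    chomp $line;
    my @fields = split /,/, $line;
    $columns[$_]{ $fields[$_] }++ for 0 .. $#fields;
}
close $fh;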
Can you try the following script? It precomputes the ranks instead of recalculating them for each row. It also avoids saving the first 7 columns in @columns, since they appear not to be used for anything.

use feature qw(say);
use strict;
use warnings;

{
    my $x = "Room_reserve";
    my $in  = "D:\\package properties\\${x}.csv";
    my $out = "D:\\package properties\\output\\${x}_output.csv";
    my $min_col = 7;
    my ($data1, $data2) = read_input_file($in, $min_col);
    my $columns = create_counter($data2);
    my $ranks = create_ranks( $columns );
    write_output($data1, $data2, $columns, $ranks, $min_col, $out);
}

sub create_counter {
    my ($data) = @_;
    print "Creating counter..";
    my @columns;
    my $first = 1;
    for my $row (@$data) {
        if ( $first ) {
            $first = 0;
            next;  # skip header
        }
        for my $i (0..$#$row) {
            $columns[$i]{$row->[$i]}++;
        }
    }
    say "done.";
    return \@columns;
}

sub create_ranks {
    my ( $columns ) = @_;
    print "Creating ranks..";
    my @ranks;
    for my $col (@$columns) {
        # sort the column values according to highest frequency.
        my @freqs = sort { $b <=> $a } values %$col;
        my $idx = 1;
        my %ranks = map {$_ => $idx++} @freqs;
        push @ranks, \%ranks;
    }
    say "done.";
    return \@ranks;
}

sub read_input_file {
    my ($in, $min_col) = @_;
    print "Reading input file $in ..";
    open(my $fh, '<', $in) or die "Could not open file '$in' $!";
    my @data1;
    my @data2;
    while( my $line = <$fh> ) {
        chomp $line;
        my @fields = split(',', $line);
        next if @fields == 0;
        die "Unexpected column count at line $.\n" if @fields < $min_col;
        push @data1, [@fields[0..($min_col-1)]];
        push @data2, [@fields[$min_col..$#fields]];
    }
    say " $. lines.";
    close($fh);
    return \@data1, \@data2;
}

sub write_output {
    my ($data1, $data2, $columns, $ranks, $min_col, $out) = @_;
    say "Saving output file $out..";
    open(my $fh, '>', $out) or die "Could not open file '$out': $!";
    for my $i (0..$#$data1) {
        my $row1 = $data1->[$i];
        my $row2 = $data2->[$i];
        if ($i == 0) {
            print $fh join ",", @$row1;
            print $fh ",";
            for my $j (0..$#$row2) {
                print $fh join ",", $row2->[$j],
                    "Rank_" . $row2->[$j], "Percent_" . $row2->[$j], "Count_" . $row2->[$j];
                print $fh "," if $j < $#$row2;
            }
            print $fh "\n";
            next;
        }
        print $fh join ",", @$row1;
        print $fh ",";
        for my $j (0..$#$row2) {
            my $cnt = $columns->[$j]{$row2->[$j]};
            my $rank = $ranks->[$j]{$cnt};
            my $ave = ($cnt / 14000000) * 100;
            print $fh join ",", $row2->[$j], $rank, $ave, $cnt;
            print $fh "," if $j < $#$row2;
        }
        print $fh "\n";
    }
    close $fh;
    say "Done.";
}
AWK: statistics operations of multi-column CSV data
With the aim of performing some statistical analysis of multi-column data, I am analyzing a big number of CSV files using the following bash + AWK routine:

#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results
#cd "${home}"/results
cd ${storage}
csv_pattern='*_filt.csv'
while read -r d; do
    awk -v rescore="$rescore" '
        FNR==1 {
            if (n) mean[suffix] = s/n
            prefix=suffix=FILENAME
            sub(/_.*/, "", prefix)
            sub(/\/[^\/]+$/, "", suffix)
            sub(/^.*_/, "", suffix)
            s=n=0
        }
        FNR > 1 {
            s += $3
            ++n
        }
        END {
            out = rescore "/" prefix ".csv"
            mean[suffix] = s/n
            print prefix ":", "dG(mean)" > out
            for (i in mean)
                printf "%s: %.2f\n", i, mean[i] >> out
            close(out)
        }' "${d}_"*/${csv_pattern} #> "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')

Basically, the script takes an ensemble of CSV files belonging to the same prefix (defined as the naming pattern at the beginning of the directory containing the CSV, for example 10V1 from 10V1_cne_lig1) and calculates the mean value of the numbers in the third column:

# input *_filt.csv located in the folder 10V1_cne_lig1001
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500

and adds one line to 10V1.csv, which is organized in a two-column format: i) the name of the suffix of the folder with the initial CSV; ii) the mean value calculated for all numbers in the third column (dG) of input.csv:

# this is the two-column format of output.csv: 10V1.csv
10V1: dG(mean)
lig1001: -5.37

In this way, for 100 CSV files such an output.csv should contain 100 lines with the mean values, etc.

I need to introduce a small modification to the AWK part of my routine that would add a 3rd column to the output CSV with the RMSD value (as a measure of the spread of the initial dG values) of the data that had been used to calculate the MEAN value. Using AWK syntax, for a particular MEAN value the RMSD could be expressed as:

mean=$(awk -F , '{sum+=$3}END{printf "%.2f", sum/NR}' $csv)
rmsd=$(awk -v mean=$mean '{++n;sum+=($NF-mean)^2} END{if(n) printf "%.2f", sqrt(sum/n)}' $csv)

Here is the expected output for 5 means and 5 RMSD values calculated for 5 CSV logs (the first one corresponds to my example above!):

10V1: dG(mean): RMSD (error)
lig1001 -5.37 0.30
lig1002 -8.53 0.34
lig1003 -6.57 0.25
lig1004 -9.53 0.00   # rmsd=0 since the initial csv has only 1 line: no data variance
lig1005 -8.11 0.39

How could this addition be incorporated into my main bash-AWK code so that the third RMSD column (for each of the processed CSVs, thus using each of the calculated MEANs) is added to the output.csv?
You can calculate both the mean and the RMSD within the awk code. Would you please try the following:

awk -v rescore="$rescore" '
    FNR==1 {
        if (n) {                        # calculate the results of previous file
            m = s / n                   # mean
            var = s2 / n - m * m        # variance
            if (var < 0) var = 0        # avoid an exception due to round-off error
            mean[suffix] = m            # store the mean in an array
            rmsd[suffix] = sqrt(var)
        }
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix)
        sub(/^.*_/, "", suffix)
        s = 0                           # sum of $3
        s2 = 0                          # sum of $3 ** 2
        n = 0                           # count of samples
    }
    FNR > 1 {
        s += $3
        s2 += $3 * $3
        ++n
    }
    END {
        out = rescore "/" prefix ".csv"
        m = s / n
        var = s2 / n - m * m
        if (var < 0) var = 0
        mean[suffix] = m
        rmsd[suffix] = sqrt(var)
        print prefix ":", "dG(mean)", "dG(rmsd)" > out
        for (i in mean)
            printf "%s: %.2f %.2f\n", i, mean[i], rmsd[i] >> out
        close(out)
    }'

Here is the version that also prints the lowest value of dG:

awk -v rescore="$rescore" '
    FNR==1 {
        if (n) {                        # calculate the results of previous file
            m = s / n                   # mean
            var = s2 / n - m * m        # variance
            if (var < 0) var = 0        # avoid an exception due to round-off error
            mean[suffix] = m            # store the mean in an array
            rmsd[suffix] = sqrt(var)
            lowest[suffix] = min
        }
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix)
        sub(/^.*_/, "", suffix)
        s = 0                           # sum of $3
        s2 = 0                          # sum of $3 ** 2
        n = 0                           # count of samples
        min = 0                         # lowest value of $3
    }
    FNR > 1 {
        s += $3
        s2 += $3 * $3
        ++n
        if ($3 < min) min = $3          # update the lowest value
    }
    END {
        if (n) {                        # just to avoid division by zero
            m = s / n
            var = s2 / n - m * m
            if (var < 0) var = 0
            mean[suffix] = m
            rmsd[suffix] = sqrt(var)
            lowest[suffix] = min
        }
        out = rescore "/" prefix ".csv"
        print prefix ":", "dG(mean)", "dG(rmsd)", "dG(lowest)" > out
        for (i in mean)
            printf "%s: %.2f %.2f %.2f\n", i, mean[i], rmsd[i], lowest[i] > out
    }' file_*.csv

I've assumed all dG values are negative. If there is any chance a value is greater than zero, modify the line min = 0, which initializes the variable, to a considerably big value (10,000 or whatever). Please apply your modifications regarding the filenames, if needed. The suggestions by Ed Morton are also included, although the results will be the same.
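As a quick sanity check of the s/s2 bookkeeping, feeding the three-line sample from the question through the same formulas reproduces the expected lig1001 numbers (a throwaway one-liner, not part of the routine):

$ printf '%s\n' 'ID, POP, dG' '1, 142, -5.6500' '2, 10, -5.5000' '3, 2, -4.9500' |
  awk -F',' 'NR>1 {s+=$3; s2+=$3*$3; n++}
             END  {m=s/n; var=s2/n-m*m; if (var<0) var=0; printf "%.2f %.2f\n", m, sqrt(var)}'
-5.37 0.30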
Complex CSV parsing with Linux commands
I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (separated by the above header). I would like to extract the 3rd property (HC) of every entry.

HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e

Whenever there are n lines of HC recorded per entry, I want to extract the sum of those n values. The expected output for the above file:

14
28
51
0
37
10

I know I can write a program for this, but is there an easy way to get this with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.

awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }'
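For reference, saved as sum_hc.awk (any file name will do) and run against the sample data above, it should print the six sums the question asks for:

$ awk -F';' -f sum_hc.awk file
14
28
51
0
37
10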
awk solution:

$ awk -F';' '$3=="HC" && p{
    print sum          # print current total
    sum=p=0            # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)     # make sure $3 is converted to integer. sum it up.
    p=1                # set p to 1
}
# print last sum
END{print sum}' input.txt

output:

14
28
51
0
37
10

one-liner:

$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile

For the given input:

$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e

$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10

It takes a little more care when, for example, HC is missing from a header:

$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE    <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e

# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10

# find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)" Explanation: Get contents of the file using cat Take only the third column using cut delimiter of ; Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions to eventually yield expr 0 + 14; Replace \n newlines temporarily with # to circumvent possible BSD sed limitations Replace double ## with single # to avoid blank lines turning into spaces and causing expr to bomb out. Replace # with + to add the numbers together. Execute the command, but with a true || 0; expr ... to avoid a guaranteed syntax error on the first line. Which creates this: true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10 The output looks like this: 14 28 51 0 37 10 This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if it helps you.

awk -F";" '
/^H/ && $3!="HC"{
    flag=""
    next
}
/^H/ && $3=="HC"{
    if(NR>1){
        printf("%d\n",sum)
    }
    sum=0
    flag=1
    next
}
flag{
    sum+=$3
}
END{
    printf("%d\n",sum)
}' Input_file

Output will be as follows.

14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file 14 28 51 0 37 10
Use multiple lines with Awk
I have a CSV that has underscore delimiters. I have 8 lines that need to be converted to one in this way:

101_1_variableName_(value)
101_1_variableName1_(value2)

into:

101 1 (value) (value2)

(in different boxes preferably). The problem is that I don't know how to use multiple lines in awk to form a single line. Any help is appreciated.

UPDATE: (input + output)

101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11

(last part all one line)

101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
You can use Perl:

use strict;
use warnings;

my %hash=();

while (<DATA>) {
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{join(' ', split('_', $1) )} }, $2;
    }
}

print "$_ ". join(' ', @{ $hash{$_} })."\n" for (keys %hash);

__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11

Prints:

101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11

Or, the Perl one-line version:

$ perl -lane '
>   push @{ $hash{join(" ", split("_", $1) )} }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
>   END { print "$_ ". join(" ", @{ $hash{$_}})."\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Using AWK to get text and looping in csv
I'm new to awk and I want to ask: I have a CSV file like this

IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012

STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211

and I want to change it to this

IVALSTART IVALEND
23:00:00 23:30:00

STATUS LSN LOC IVALDATE
K lskpg 1201 4/9/2012
K lntrjkt 1201 4/9/2012
K lbkkstp 1211 4/9/2012

How do I do that in awk? Thanks and best regards!
Try this:

awk '
    NR == 1 { name = $3; print $1, $2 }
    NR == 2 { date = $3; print $1, $2 }
    NR == 3 { print "" }
    NR == 4 { $4 = name; print }
    NR > 4  { $4 = date; print }
' FILE

If you need formatting, it's necessary to change print to printf with appropriate specifiers.
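For example (only a sketch; the column widths here are arbitrary), the last two rules could become something like:

NR == 4 { $4 = name; printf "%-10s %-10s %-6s %s\n", $1, $2, $3, $4 }
NR > 4  { $4 = date; printf "%-10s %-10s %-6s %s\n", $1, $2, $3, $4 }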