Using AWK to get text and looping in CSV
I'm new to awk and I want to ask something.
I have a CSV file like this:
IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012

STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211
and I want to change it to this:
IVALSTART IVALEND
23:00:00 23:30:00

STATUS LSN LOC IVALDATE
K lskpg 1201 4/9/2012
K lntrjkt 1201 4/9/2012
K lbkkstp 1211 4/9/2012
How can I do this in awk?
Thanks and best regards!
Try this:
awk '
NR == 1 { name = $3; print $1, $2 }
NR == 2 { date = $3; print $1, $2 }
NR == 3 { print "" }
NR == 4 { $4 = name; print }
NR > 4 { $4 = date; print }
' FILE
If you need formatting, change print to printf with appropriate format specifiers.
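As a minimal sketch of that printf variant: the column widths below are illustrative choices of mine, not from the original answer, and the sample file assumes (as the answer's NR == 3 rule implies) that a blank line separates the two sections.

```shell
# The question's sample file, with the blank line between sections
cat > ivals.txt <<'EOF'
IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012

STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211
EOF

# Same logic as the answer, but with printf and fixed-width columns
awk '
NR == 1 { name = $3; printf "%-10s %s\n", $1, $2 }
NR == 2 { date = $3; printf "%-10s %s\n", $1, $2 }
NR == 3 { print "" }
NR == 4 { printf "%-8s %-8s %-5s %s\n", $1, $2, $3, name }
NR > 4  { printf "%-8s %-8s %-5s %s\n", $1, $2, $3, date }
' ivals.txt > formatted.txt
cat formatted.txt
```

The %-10s and similar specifiers left-justify each field in a fixed width, which keeps the columns aligned even when field lengths vary.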
Converting Multiple CSV Rows to Individual Columns

I have a CSV file in this format:

#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d

I want to transform it into this:

#Time,CPU 0 Data,CPU 1 Data
x,a,b
y,c,d

But I don't know the number of CPU cores there will be in a system (represented by the CPU column). I also have multiple columns of data (not just the singular data column). How would I go about doing this?

Example input:

# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:10:01 UTC,2,3.87,0,1.38,0.07,0.12,94.55
hostname,600,2018-07-24 00:10:01 UTC,3,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
hostname,600,2018-07-24 00:20:01 UTC,2,5.26,0,1.53,0.07,0.12,93.03
hostname,600,2018-07-24 00:20:01 UTC,3,5.64,0,1.52,0.04,0.12,92.68

This would be the output for this file (CPU -1 turns into CPU ALL; the key value is just the timestamp, since the hostname and interval stay constant):

# hostname,interval,timestamp,CPU ALL %user,CPU ALL %nice,CPU ALL %system,CPU ALL %iowait,CPU ALL %steal,CPU ALL %idle,CPU 0 %user,CPU 0 %nice,CPU 0 %system,CPU 0 %iowait,CPU 0 %steal,CPU 0 %idle,CPU 1 %user,CPU 1 %nice,CPU 1 %system,CPU 1 %iowait,CPU 1 %steal,CPU 1 %idle,CPU 2 %user,CPU 2 %nice,CPU 2 %system,CPU 2 %iowait,CPU 2 %steal,CPU 2 %idle,CPU 3 %user,CPU 3 %nice,CPU 3 %system,CPU 3 %iowait,CPU 3 %steal,CPU 3 %idle
hostname,600,2018-07-24 00:10:01 UTC,5.19,0,1.52,0.09,0.13,93.07,5.37,0,1.58,0.15,0.15,92.76,8.36,0,1.75,0.08,0.1,89.7,3.87,0,1.38,0.07,0.12,94.55,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,5.13,0,1.52,0.08,0.13,93.15,4.38,0,1.54,0.13,0.15,93.8,5.23,0,1.49,0.07,0.11,93.09,5.26,0,1.53,0.07,0.12,93.03,5.64,0,1.52,0.04,0.12,92.68
Your question isn't clear and doesn't contain the expected output for your posted larger/presumably more realistic sample CSV, so I don't know what output you were hoping for, but this will show you the right approach at least:

$ cat tst.awk
BEGIN{ FS = OFS = "," }
NR==1 {
    for (i=1; i<=NF; i++) {
        fldName2nmbr[$i] = i
    }
    tsFldNmbr = fldName2nmbr["timestamp"]
    cpuFldNmbr = fldName2nmbr["CPU"]
    next
}
{
    tsVal = $tsFldNmbr
    cpuVal = $cpuFldNmbr
    if ( !(seenTs[tsVal]++) ) {
        tsVal2nmbr[tsVal] = ++numTss
        tsNmbr2val[numTss] = tsVal
    }
    if ( !(seenCpu[cpuVal]++) ) {
        cpuVal2nmbr[cpuVal] = ++numCpus
        cpuNmbr2val[numCpus] = cpuVal
    }
    tsNmbr = tsVal2nmbr[tsVal]
    cpuNmbr = cpuVal2nmbr[cpuVal]
    cpuData = ""
    for (i=1; i<=NF; i++) {
        if ( (i != tsFldNmbr) && (i != cpuFldNmbr) ) {
            cpuData = (cpuData == "" ? "" : cpuData OFS) $i
        }
    }
    data[tsNmbr,cpuNmbr] = cpuData
}
END {
    printf "%s", "timestamp"
    for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
        printf "%sCPU %s Data", OFS, cpuNmbr2val[cpuNmbr]
    }
    print ""
    for (tsNmbr=1; tsNmbr<=numTss; tsNmbr++) {
        printf "%s", tsNmbr2val[tsNmbr]
        for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
            printf "%s\"%s\"", OFS, data[tsNmbr,cpuNmbr]
        }
        print ""
    }
}

$ awk -f tst.awk file
timestamp,CPU -1 Data,CPU 0 Data,CPU 1 Data,CPU 2 Data,CPU 3 Data
2018-07-24 00:10:01 UTC,"hostname,600,5.19,0,1.52,0.09,0.13,93.07","hostname,600,5.37,0,1.58,0.15,0.15,92.76","hostname,600,8.36,0,1.75,0.08,0.1,89.7","hostname,600,3.87,0,1.38,0.07,0.12,94.55","hostname,600,3.16,0,1.36,0.05,0.14,95.29"
2018-07-24 00:20:01 UTC,"hostname,600,5.13,0,1.52,0.08,0.13,93.15","hostname,600,4.38,0,1.54,0.13,0.15,93.8","hostname,600,5.23,0,1.49,0.07,0.11,93.09","hostname,600,5.26,0,1.53,0.07,0.12,93.03","hostname,600,5.64,0,1.52,0.04,0.12,92.68"

I put the per-CPU data within double quotes so you could import it to Excel or similar without worrying about the commas between the sub-fields.
If we assume that the CSV input file is sorted according to increasing timestamps, you could try something like this:

use feature qw(say);
use strict;
use warnings;

my $fn = 'log.csv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %info;
my @times;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $time, $cpu, $data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, $data;
}
close $fh;
for my $time (@times) {
    say join ",", $time, @{ $info{$time} };
}

Output:

x,a,b
y,c,d
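The same first-seen-order grouping can be sketched in awk as well, under the same sorted-input assumption. This is my own translation of the Perl answer's idea, not code from the thread; the file name log.csv follows the Perl example.

```shell
# Small sample input, as in the question
cat > log.csv <<'EOF'
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
EOF

# Skip the header; remember the first-seen order of each timestamp and
# append each Data value to that timestamp's output row
awk -F',' 'NR > 1 {
    if (!($1 in row)) { order[++n] = $1; row[$1] = $1 }
    row[$1] = row[$1] "," $3
} END {
    for (i = 1; i <= n; i++) print row[order[i]]
}' log.csv
```

This prints one row per timestamp (x,a,b then y,c,d), matching the Perl answer's output, and like it assumes a single data column.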
Complex CSV parsing with Linux commands
I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (separated by the above header). I would like to extract the 3rd property (HC) of every entry.

HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e

Whenever there are n lines of HC recorded per entry, I want to extract the sum of those n values. The expected output for the above file:

14
28
51
0
37
10

I know I can write a program for this, but is there an easy way to get this with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.

awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }'
awk solution:

$ awk -F';' '$3=="HC" && p{
    print sum        # print current total
    sum=p=0          # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)   # make sure $3 is converted to an integer; sum it up
    p=1              # set p to 1
}
END{print sum}       # print the last sum
' input.txt

output:

14
28
51
0
37
10

one-liner:

$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile

For the given input:

$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e

$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10

It takes a little more care in a case like this:

$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE      <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e

# find only HC in the 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10

# find HD in the 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)"

Explanation:

- Get the contents of the file using cat
- Take only the third column using cut with a delimiter of ;
- Replace HC lines with "0; expr 0" values to start building eval-worthy bash expressions that eventually yield expr 0 + 14;
- Replace \n newlines temporarily with # to circumvent possible BSD sed limitations
- Replace double ## with single # to avoid blank lines turning into spaces and causing expr to bomb out
- Replace # with + to add the numbers together
- Execute the command, but with a true || 0; expr ... prefix to avoid a guaranteed syntax error on the first line

Which creates this:

true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10

The output looks like this:

14
28
51
0
37
10

This was tested on Bash 3.2 and macOS El Capitan.
Could you please try the following and let me know if it helps you?

awk -F";" '
/^H/ && $3!="HC"{
  flag=""
  next
}
/^H/ && $3=="HC"{
  if(NR>1){
    printf("%d\n",sum)
  }
  sum=0
  flag=1
  next
}
flag{
  sum+=$3
}
END{
  printf("%d\n",sum)
}
' Input_file

Output will be as follows:

14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10
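For anyone who wants to verify these answers locally, here is a self-contained run of the one-liner above against the question's sample (creating the input file is my addition):

```shell
# The question's sample log, saved as "file"
cat > file <<'EOF'
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
EOF

# On each header line, print the running sum and reset it;
# on data lines, accumulate field 3 (awk coerces "07" to 7 numerically)
awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
```

This prints 14, 28, 51, 0, 37, 10, one per line, matching the expected output.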
AWK: Converting CSV with Headers to Summary Table
I have a need to document 100+ CSV files as far as the format of those files, including sample data. What I would like to do is take a CSV of the following format:

Name, Phone, State
Fred, 1234567, TX
John, 2345678, NC

and convert it to:

Field | Sample
--- | ----
Name | Fred
Phone | 1234567
State | TX

Is this possible with AWK? From my example below, you will see I am trying to format it as a markdown table. I currently have it transposing the header row with:

#!/usr/bin/awk -v RS='\r\n' -f
BEGIN { printf "| Field \t| Critical |\n"}
{
  printf "|---\t|---\t|\n"
  for (i=1; i<=NF; i++) {print "|", toupper($i), "| sample |"}
}
END {}

But I am not sure how to use the first row of data, after the header, to display the sample data.
awk is the right tool for data parsing. You can try something like:

awk '
BEGIN { FS=", "; OFS=" | " }
NR==1 {
    for(tag = 1; tag <= NF; tag++) {
        hdr[tag] = sprintf ("%-7s", $tag)
    }
    next
}
{
    for(fld = 1; fld <= NF; fld++) {
        data[NR,fld] = $fld
    }
}
END {
    print "Field | Sample\n------- | -------"
    for(rec = 2; rec <= NR; rec++) {
        for(line = 1; line <= NF; line++) {
            print hdr[line], data[rec,line]
        }
    }
}' file

Output:

Field | Sample
------- | -------
Name    | Fred
Phone   | 1234567
State   | TX
Name    | John
Phone   | 2345678
State   | NC
Here is a simpler way to do it with awk. There is no need to store everything in an array and then print at the end.

awk -F", " 'NR==1{split($0,a,FS);print "Field | Sample\n------- | -------";next} {for (i=1;i<=NF;i++) printf "%-8s| %s\n",a[i],$i}' file

Field | Sample
------- | -------
Name    | Fred
Phone   | 1234567
State   | TX
Name    | John
Phone   | 2345678
State   | NC

How it works:

awk -F", " '                      # set field separator to ", "
NR==1{                            # on the first line:
  split($0,a,FS)                  # split the first line into array "a" to get the labels
  print "Field | Sample"          # print the header
  print "------- | -------"       # print the separator
  next}                           # nothing more is run for the first line
{                                 # for all lines except the first:
  for (i=1;i<=NF;i++)             # loop through all elements in the line
    printf "%-8s| %s\n",a[i],$i   # print the label and data for every element
}' file
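One wrinkle worth noting: the question's shebang sets RS='\r\n', which suggests the real files have CRLF line endings. A portable sketch of the same approach that strips the carriage return with sub() instead of relying on a custom record separator (the sample file and column widths here are my own):

```shell
# Create a CRLF-terminated sample like the question's
printf 'Name, Phone, State\r\nFred, 1234567, TX\r\nJohn, 2345678, NC\r\n' > crlf.csv

awk -F', ' '
{ sub(/\r$/, "") }   # drop the trailing CR on every line before using fields
NR==1 { split($0, hdr, FS); print "Field | Sample\n------- | -------"; next }
{ for (i=1; i<=NF; i++) printf "%-8s| %s\n", hdr[i], $i }' crlf.csv
```

Without the sub() call, the CR would stay attached to the last field of every row (here State), which garbles the table output.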
Dynamically edit lists within a csv file
I have a csv file that looks like this:

col1|col2

1|a
2|g
3|f

1|m
3|k
2|n

2|a
1|d
4|r
3|s

where | separates the columns, and I would like to transform it into something homogeneous like:

------------------------
fields >  1  2  3  4
record1   a  g  f
record2   m  n  k
record3   d  a  s  r
------------------------

Is there a way to do that? What would be better: using MySQL, or editing the csv file?
I wrote this; it works for your example (gawk is required):

awk -F'|' -v RS="" '{for(i=1;i<=NF;i+=2)a[$i]=$(i+1);asorti(a,d); for(i=1;i<=length(a);i++)printf "%s", a[d[i]]((i==length(a))?"":" ");delete a;delete d;print ""}' file

example:

kent$ cat file
1|a
2|g
3|f

1|m
3|k
2|n

2|a
1|d
4|r
3|s

kent$ awk -F'|' -v RS="" '{for(i=1;i<=NF;i+=2)a[$i]=$(i+1);asorti(a,d); for(i=1;i<=length(a);i++)printf "%s", a[d[i]]((i==length(a))?"":" ");delete a;delete d;print ""}' file
a g f
m n k
d a s r
Here is an awk solution:

BEGIN{
  RS=""
  FS="\n"
}
FNR==NR&&FNR>1{
  for (i=1;i<=NF;i++) {
    split($i,d,"|")
    if (d[1] > max) max = d[1]
  }
  next
}
FNR>1&&!header{
  printf "%s\t","fields >"
  for (i=1;i<=max;i++)
    printf "%s\t",i
  print ""
  header=1
}
FNR>1{
  printf "record%s\t\t",FNR-1
  for (i=1;i<=NF;i++) {
    split($i,d,"|")
    val[d[1]] = d[2]
  }
  for (i=1;i<=max;i++)
    printf "%s\t",val[i]?val[i]:"NULL"
  print ""
  delete val
}

Save it as script.awk and run it like this (notice it uses a two-pass approach, so you need to give the file twice):

$ awk -f script.awk file file
fields >  1     2     3     4
record1   a     g     f     NULL
record2   m     n     k     NULL
record3   d     a     s     r

Adding the line 5|b to the first record in file gives the output:

$ awk -f script.awk file file
fields >  1     2     3     4     5
record1   a     g     f     NULL  b
record2   m     n     k     NULL  NULL
record3   d     a     s     r     NULL
$ cat file
col1|col2

1|a
2|g
3|f
5|b

1|m
3|k
2|n

2|a
1|d
4|r
3|s

$ awk -f tst.awk file
fields >    1    2    3    4    5
record1     a    g    f NULL    b
record2     m    n    k NULL NULL
record3     d    a    s    r NULL

$ cat tst.awk
BEGIN{ RS=""; FS="\n" }
NR>1 {
    ++numRecs
    for (i=1;i<=NF;i++) {
        split($i,fldNr2val,"|")
        fldNr = fldNr2val[1]
        val = fldNr2val[2]
        recNrFldNr2val[numRecs,fldNr] = val
        numFlds = (fldNr > numFlds ? fldNr : numFlds)
    }
}
END {
    printf "fields >"
    for (fldNr=1;fldNr<=numFlds;fldNr++) {
        printf " %4s", fldNr
    }
    print ""
    for (recNr=1; recNr<=numRecs; recNr++) {
        printf "record%d ", recNr
        for (fldNr=1;fldNr<=numFlds;fldNr++) {
            printf " %4s", ((recNr,fldNr) in recNrFldNr2val ? recNrFldNr2val[recNr,fldNr] : "NULL")
        }
        print ""
    }
}
Enhance awk script by printing top 5 occurring data elements from each column
I have an awk script that processes a csv file and produces a report that counts the number of rows for each column, named in the header field, that contain data /[A-Za-z0-9]/. What I would like to do is enhance the script and print the top 5 most duplicated data elements in each column.

Here is sample data:

Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No

This is the current script, which SiegeX has substantially contributed to:

$ cat matrix2.awk
NR==1{
  for(i=1;i<=NF;i++)
    head[i]=$i
  next
}
{
  for(i=1;i<=NF;i++) {
    if($i && !arr[i,$i]++)
      n[i]++
    if(arr[i,$i] > 1)
      f[i]=1
  }
}
END{
  for(i=1;i<=length(head);i++) {
    printf("%-6d%s\n",n[i],head[i])
    if(f[i]) {
      for(x in arr) {
        split(x,b,SUBSEP)
        if(b[1]==i && b[2])
          printf("% -6d %s\n",arr[i,b[2]],b[2])
      }
    }
  }
}

This is the current output:

$ awk -F "|" -f matrix2.awk testlist.csv
20    Food
6     Type
6      Fruit
4      Vegetable
3      Meat
1      Fish
4      Dairy
1      Bread
2     Spicy
17     No
2      Yes

And this is the desired output:

$ awk -F "|" -f matrix2.awk testlist.csv
20    Food
6     Type
6      Fruit
4      Vegetable
4      Dairy
3      Meat
1      Fish
2     Spicy
17     No
2      Yes

The only thing left that I would like to add is a general function that limits each column's output to the top 5 most duplicated fields. As mentioned below, a columnar version of sort | uniq -c | sort -nr | head -5.
The following script is both extensible and scalable, as it will work with an arbitrary number of columns. Nothing is hardcoded.

awk -F'|' '
NR==1{
  for(i=1;i<=NF;i++)
    head[i]=$i
  next
}
{
  for(i=1;i<=NF;i++) {
    if($i && !arr[i,$i]++)
      n[i]++
    if(arr[i,$i] > 1)
      f[i]=1
  }
}
END{
  for(i=1;i<=length(head);i++) {
    printf("%-32s%d\n",head[i],n[i])
    if(f[i]) {
      for(x in arr) {
        split(x,b,SUBSEP)
        if(b[1]==i && b[2])
          printf(" %-28s%d\n",b[2],arr[i,b[2]])
      }
    }
  }
}' infile

Output:

$ ./report
Food                            9
Type                            5
 Meat                        2
 Bread                       1
 Vegetable                   2
 Fruit                       2
 Fish                        1
Spicy                           2
 Yes                         2
 No                          6
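The top-5 limit the question asks for can also be approximated outside awk, as a per-column version of the sort | uniq -c | sort -nr | head -5 pipeline the asker mentions. This is a sketch of mine, not from the thread; the column loop and file handling are my own choices.

```shell
# A small sample in the question's format (subset of the posted data)
cat > testlist.csv <<'EOF'
Food|Type|Spicy
Broccoli|Vegetable|No
Jalapeno|Vegetable|Yes
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
EOF

# For each column: print its header, then its 5 most frequent non-empty values
ncols=$(head -n 1 testlist.csv | awk -F'|' '{print NF}')
for col in $(seq 1 "$ncols"); do
    head -n 1 testlist.csv | cut -d'|' -f"$col"
    tail -n +2 testlist.csv | cut -d'|' -f"$col" | grep -v '^$' \
        | sort | uniq -c | sort -rn | head -5
done
```

This trades a single awk pass for one pipeline per column, which is fine for occasional reporting; folding the same top-5 limit into the awk END block would need gawk's sorted-traversal features or an external sort.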
Not a complete solution, but something to get you started:

awk -F"|" '
NR>1{ a[$1]++; b[$2]++; c[$3]++ }
END{
  print "Food\t\t\t" length(a)
  print "Type\t\t\t" length(b)
  for (x in b)
    if (x!="") {
      printf ("\t%-16s%s\n",x,b[x])
    }
  print "Spicy\t\t\t" length(c)
  for (y in c)
    if (y!="") {
      printf ("\t%-16s%d\n",y,c[y])
    }
}' testlist.csv

TEST:

[jaypal:~/Temp] cat testlist.csv
Food|Type|Spicy
Broccoli|Vegetable|No
Jalapeno|Vegetable|Yes
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No

[jaypal:~/Temp] awk -F"|" 'NR>1{a[$1];b[$2]++;c[$3]++}END{print "Food\t\t\t" length(a); print "Type\t\t\t"length(b); for (x in b) if (x!="") printf ("\t%-16s%s\n",x,b[x]) ;print "Spicy\t\t\t"length(c); for (y in c) if (y!="") {printf ("\t%-16s%d\n",y,c[y])}}' testlist.csv
Food            9
Type            6
    Fruit           2
    Vegetable       2
    Bread           1
    Meat            2
    Fish            1
Spicy           3
    Yes             2
    No              6