Using AWK to get text and looping in CSV

I'm new to awk and I want to ask...
I have a CSV file like this:
IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012
STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211
and I want to change it to this:
IVALSTART IVALEND
23:00:00 23:30:00
STATUS LSN LOC IVALDATE
K lskpg 1201 4/9/2012
K lntrjkt 1201 4/9/2012
K lbkkstp 1211 4/9/2012
How can I do this in awk?
Thanks and best regards!

Try this:
awk '
NR == 1 { name = $3; print $1, $2 }
NR == 2 { date = $3; print $1, $2 }
NR == 3 { $4 = name; print }
NR > 3  { $4 = date; print }
' FILE
If you need formatting, change print to printf with appropriate format specifiers.
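For example, a variant of the above that pads the fields into fixed-width columns (the 10- and 6-character widths are arbitrary choices, not anything from the question) might look like:

```shell
awk '
NR == 1 { name = $3; printf "%-10s %s\n", $1, $2 }
NR == 2 { date = $3; printf "%-10s %s\n", $1, $2 }
NR == 3 { printf "%-10s %-10s %-6s %s\n", $1, $2, $3, name }
NR > 3  { printf "%-10s %-10s %-6s %s\n", $1, $2, $3, date }
' FILE
```

The widths just need to be at least as wide as the longest value expected in each column.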


Converting Multiple CSV Rows to Individual Columns [closed]

I have a CSV file in this format:
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
I want to transform it into this
#Time,CPU 0 Data,CPU 1 Data
x,a,b
y,c,d
But I don't know the number of CPU cores there will be in a system (represented by the CPU column). I also have multiple columns of data (not just the singular data column).
How would I go about doing this?
Example input
# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:10:01 UTC,2,3.87,0,1.38,0.07,0.12,94.55
hostname,600,2018-07-24 00:10:01 UTC,3,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
hostname,600,2018-07-24 00:20:01 UTC,2,5.26,0,1.53,0.07,0.12,93.03
hostname,600,2018-07-24 00:20:01 UTC,3,5.64,0,1.52,0.04,0.12,92.68
This would be the output for this file (CPU -1 turns into CPU ALL; also, the key value is just the timestamp, since the hostname and interval stay constant):
# hostname,interval,timestamp,CPU ALL %user,CPU ALL %nice,CPU ALL %system,CPU ALL %iowait,CPU ALL %steal,CPU ALL %idle,CPU 0 %user,CPU 0 %nice,CPU 0 %system,CPU 0 %iowait,CPU 0 %steal,CPU 0 %idle,CPU 1 %user,CPU 1 %nice,CPU 1 %system,CPU 1 %iowait,CPU 1 %steal,CPU 1 %idle,CPU 2 %user,CPU 2 %nice,CPU 2 %system,CPU 2 %iowait,CPU 2 %steal,CPU 2 %idle,CPU 3 %user,CPU 3 %nice,CPU 3 %system,CPU 3 %iowait,CPU 3 %steal,CPU 3 %idle
hostname,600,2018-07-24 00:10:01 UTC,5.19,0,1.52,0.09,0.13,93.07,5.37,0,1.58,0.15,0.15,92.76,8.36,0,1.75,0.08,0.1,89.7,3.87,0,1.38,0.07,0.12,94.55,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,5.13,0,1.52,0.08,0.13,93.15,4.38,0,1.54,0.13,0.15,93.8,5.23,0,1.49,0.07,0.11,93.09,5.26,0,1.53,0.07,0.12,93.03,5.64,0,1.52,0.04,0.12,92.68
Your question isn't clear and doesn't contain the expected output for your posted larger, presumably more realistic sample CSV, so I don't know exactly what output you were hoping for, but this will show you the right approach at least:
$ cat tst.awk
BEGIN {
    FS = OFS = ","
}
NR==1 {
    for (i=1; i<=NF; i++) {
        fldName2nmbr[$i] = i
    }
    tsFldNmbr  = fldName2nmbr["timestamp"]
    cpuFldNmbr = fldName2nmbr["CPU"]
    next
}
{
    tsVal  = $tsFldNmbr
    cpuVal = $cpuFldNmbr
    if ( !(seenTs[tsVal]++) ) {
        tsVal2nmbr[tsVal] = ++numTss
        tsNmbr2val[numTss] = tsVal
    }
    if ( !(seenCpu[cpuVal]++) ) {
        cpuVal2nmbr[cpuVal] = ++numCpus
        cpuNmbr2val[numCpus] = cpuVal
    }
    tsNmbr  = tsVal2nmbr[tsVal]
    cpuNmbr = cpuVal2nmbr[cpuVal]
    cpuData = ""
    for (i=1; i<=NF; i++) {
        if ( (i != tsFldNmbr) && (i != cpuFldNmbr) ) {
            cpuData = (cpuData == "" ? "" : cpuData OFS) $i
        }
    }
    data[tsNmbr,cpuNmbr] = cpuData
}
END {
    printf "%s", "timestamp"
    for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
        printf "%sCPU %s Data", OFS, cpuNmbr2val[cpuNmbr]
    }
    print ""
    for (tsNmbr=1; tsNmbr<=numTss; tsNmbr++) {
        printf "%s", tsNmbr2val[tsNmbr]
        for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
            printf "%s\"%s\"", OFS, data[tsNmbr,cpuNmbr]
        }
        print ""
    }
}
$ awk -f tst.awk file
timestamp,CPU -1 Data,CPU 0 Data,CPU 1 Data,CPU 2 Data,CPU 3 Data
2018-07-24 00:10:01 UTC,"hostname,600,5.19,0,1.52,0.09,0.13,93.07","hostname,600,5.37,0,1.58,0.15,0.15,92.76","hostname,600,8.36,0,1.75,0.08,0.1,89.7","hostname,600,3.87,0,1.38,0.07,0.12,94.55","hostname,600,3.16,0,1.36,0.05,0.14,95.29"
2018-07-24 00:20:01 UTC,"hostname,600,5.13,0,1.52,0.08,0.13,93.15","hostname,600,4.38,0,1.54,0.13,0.15,93.8","hostname,600,5.23,0,1.49,0.07,0.11,93.09","hostname,600,5.26,0,1.53,0.07,0.12,93.03","hostname,600,5.64,0,1.52,0.04,0.12,92.68"
I put the per-CPU data within double quotes so you could import it to Excel or similar without worrying about the commas between the sub-fields.
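If the goal really is the merged per-timestamp rows shown in the question (minus the long generated header row), a narrower sketch is possible; unlike the script above, it hardcodes the sample's layout (timestamp in field 3, CPU in field 4, data from field 5 on) instead of looking the columns up by name:

```shell
awk -F',' '
NR == 1 { next }                  # skip the "# hostname,..." header line
{
    key = $1 FS $2 FS $3          # hostname,interval,timestamp
    if (!(key in row)) order[++n] = key
    for (i = 5; i <= NF; i++)     # append the data fields, skipping CPU ($4)
        row[key] = row[key] FS $i
}
END { for (i = 1; i <= n; i++) print order[i] row[order[i]] }
' file
```

For the posted sample this prints the two merged data rows; emitting the matching CPU ALL/CPU 0/... header row is left out for brevity.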
If we assume that the CSV input file is sorted according to increasing timestamps, you could try something like this in Perl:
use feature qw(say);
use strict;
use warnings;

my $fn = 'log.csv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %info;
my @times;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $time, $cpu, $data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, $data;
}
close $fh;
for my $time (@times) {
    say join ",", $time, @{ $info{$time} };
}
Output:
x,a,b
y,c,d

Complex CSV parsing with Linux commands

I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (each preceded by the header line).
I would like to extract the 3rd property (HC) of every entry.
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
Whenever there are n lines of HC recorded for an entry, I want to output the sum of those n values.
The expected output for the above file:
14
28
51
0
37
10
I know I can write a program for this, but is there an easy way to get this with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.
awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }' file
awk solution:
$ awk -F';' '$3=="HC" && p{
    print sum           # print current total
    sum=p=0             # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)      # make sure $3 is converted to a number; sum it up
    p=1                 # set p to 1
}
END{print sum}          # print the last sum
' input.txt
output:
14
28
51
0
37
10
one-liner:
$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
For given inputs:
$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10
It takes a little more care for cases like this one:
$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE <---- say HC is not found
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10
# Find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)"
Explanation:
Get contents of the file using cat
Take only the third column using cut delimiter of ;
Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions to eventually yield expr 0 + 14;
Replace \n newlines temporarily with # to circumvent possible BSD sed limitations
Remove doubled ## (from any blank lines) so they don't turn into stray + signs and make expr bomb out.
Replace each remaining # with + to add the numbers together.
Execute the command, with a leading true || so the stray 0; produced for the first header line doesn't cause an error.
Which creates this:
true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10
The output looks like this:
14
28
51
0
37
10
This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if it helps you.
awk -F";" '
/^H/ && $3!="HC"{
    flag="";
    next
}
/^H/ && $3=="HC"{
    if(NR>1){
        printf("%d\n",sum)
    };
    sum=0;
    flag=1;
    next
}
flag{
    sum+=$3
}
END{
    printf("%d\n",sum)
}
' Input_file
Output will be as follows.
14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10

AWK: Converting CSV with Headers to Summary Table

I have a need to document 100+ CSV files, covering the format of each file and including sample data. What I would like to do is take a CSV of the following format:
Name, Phone, State
Fred, 1234567, TX
John, 2345678, NC
and convert it to:
Field | Sample
--- | ----
Name | Fred
Phone | 1234567
State | TX
Is this possible with AWK? From my example below, you will see I am trying to format it as a markdown table. I currently have it transposing the header row with:
#!/usr/bin/awk -v RS='\r\n' -f
BEGIN { printf "| Field \t| Critical |\n"}
{
printf "|---\t|---\t|\n"
for (i=1; i<=NF; i++) {print "|", toupper($i), "| sample |"}
}
END {}
But I am not sure now how to use the first row of data, after the header, to display the sample data.
awk is the right tool for data parsing. You can try something like:
awk '
BEGIN { FS=", "; OFS=" | " }
NR==1 {
    for(tag = 1; tag <= NF; tag++) {
        hdr[tag] = sprintf ("%-7s", $tag)
    }
    next
}
{
    for(fld = 1; fld <= NF; fld++) {
        data[NR,fld] = $fld
    }
}
END {
    print "Field | Sample\n------- | -------";
    for(rec = 2; rec <= NR; rec++) {
        for(line = 1; line <= NF; line++) {
            print hdr[line], data[rec,line]
        }
    }
}' file
Output:
Field | Sample
------- | -------
Name | Fred
Phone | 1234567
State | TX
Name | John
Phone | 2345678
State | NC
Here is a simpler way to do it with awk.
No need to store everything in an array and print at the end.
awk -F", " 'NR==1{split($0,a,FS);print "Field | Sample\n------- | -------";next} {for (i=1;i<=NF;i++) printf "%-8s| %s\n",a[i],$i}' file
Field | Sample
------- | -------
Name | Fred
Phone | 1234567
State | TX
Name | John
Phone | 2345678
State | NC
How it works:
awk -F", " ' # set field separator to ", "
NR==1{ # if first line do:
split($0,a,FS) # split first line to an array named "a" to get the labels
print "Field | Sample" # print header
print "------- | -------" # print separator
next} # skip the rest of the script for the first line
{ # for all lines except first do:
for (i=1;i<=NF;i++) # loop through all elements in line
printf "%-8s| %s\n",a[i],$i # print data for every element
}
' file

Dynamically edit lists within a csv file

I have a csv file that looks like this, with records separated by blank lines:
col1|col2

1|a
2|g
3|f

1|m
3|k
2|n

2|a
1|d
4|r
3|s
where | separates the columns; I would like to transform it into something homogeneous like:
------------------------
fields > 1 2 3 4
record1 a g f
record2 m n k
record3 d a s r
------------------------
Is there a way to do that? What would be better, using mysql or editing the csv file?
I wrote this; it works for your example (gawk is required for asorti):
awk -F'|' -v RS="" '{for(i=1;i<=NF;i+=2)a[$i]=$(i+1);asorti(a,d);
for(i=1;i<=length(a);i++)printf "%s", a[d[i]]((i==length(a))?"":" ");delete a;delete d;print ""}' file
example:
kent$ cat file
1|a
2|g
3|f

1|m
3|k
2|n

2|a
1|d
4|r
3|s
kent$ awk -F'|' -v RS="" '{for(i=1;i<=NF;i+=2)a[$i]=$(i+1);asorti(a,d);
for(i=1;i<=length(a);i++)printf "%s", a[d[i]]((i==length(a))?"":" ");delete a;delete d;print ""}' file
a g f
m n k
d a s r
Here is an awk solution:
BEGIN{
    RS=""
    FS="\n"
}
FNR==NR&&FNR>1{
    for (i=1;i<=NF;i++) {
        split($i,d,"|")
        if (d[1] > max)
            max = d[1]
    }
    next
}
FNR>1&&!header{
    printf "%s\t","fields >"
    for (i=1;i<=max;i++)
        printf "%s\t",i
    print ""
    header=1
}
FNR>1{
    printf "record%s\t\t",FNR-1
    for (i=1;i<=NF;i++) {
        split($i,d,"|")
        val[d[1]] = d[2]
    }
    for (i=1;i<=max;i++)
        printf "%s\t",val[i]?val[i]:"NULL"
    print ""
    delete val
}
Save it as script.awk and run it like this (note that it uses a two-pass approach, so you need to give the file twice):
$ awk -f script.awk file file
fields > 1 2 3 4
record1 a g f NULL
record2 m n k NULL
record3 d a s r
Adding the line 5|b to the first record in file gives the output:
$ awk -f script.awk file file
fields > 1 2 3 4 5
record1 a g f NULL b
record2 m n k NULL NULL
record3 d a s r NULL
$ cat file
col1|col2

1|a
2|g
3|f
5|b

1|m
3|k
2|n

2|a
1|d
4|r
3|s
$
$ awk -f tst.awk file
fields > 1 2 3 4 5
record1 a g f NULL b
record2 m n k NULL NULL
record3 d a s r NULL
$
$ cat tst.awk
BEGIN{ RS=""; FS="\n" }
NR>1 {
    ++numRecs
    for (i=1;i<=NF;i++) {
        split($i,fldNr2val,"|")
        fldNr = fldNr2val[1]
        val   = fldNr2val[2]
        recNrFldNr2val[numRecs,fldNr] = val
        numFlds = (fldNr > numFlds ? fldNr : numFlds)
    }
}
END {
    printf "fields >"
    for (fldNr=1;fldNr<=numFlds;fldNr++) {
        printf " %4s", fldNr
    }
    print ""
    for (recNr=1; recNr<=numRecs; recNr++) {
        printf "record%d ", recNr
        for (fldNr=1;fldNr<=numFlds;fldNr++) {
            printf " %4s", ((recNr,fldNr) in recNrFldNr2val ? recNrFldNr2val[recNr,fldNr] : "NULL")
        }
        print ""
    }
}

Enhance awk script by printing top 5 occurring data elements from each column

I have an awk script that processes a CSV file and produces a report that counts, for each column named in the header, the number of rows that contain data matching /[A-Za-z0-9]/. What I would like to do is enhance the script to also print the top 5 most duplicated data elements in each column.
Here is sample data:
Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
This is the current script, to which SiegeX has substantially contributed:
$ cat matrix2.awk
NR==1{
    for(i=1;i<=NF;i++)
        head[i]=$i
    next
}
{
    for(i=1;i<=NF;i++) {
        if($i && !arr[i,$i]++)
            n[i]++
        if(arr[i,$i] > 1)
            f[i]=1
    }
}
END{
    for(i=1;i<=length(head);i++) {
        printf("%-6d%s\n",n[i],head[i])
        if(f[i]) {
            for(x in arr) {
                split(x,b,SUBSEP)
                if(b[1]==i && b[2])
                    printf("% -6d %s\n",arr[i,b[2]],b[2])
            }
        }
    }
}
This is the current output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
3 Meat
1 Fish
4 Dairy
1 Bread
2 Spicy
17 No
2 Yes
And this is the desired output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
4 Dairy
3 Meat
1 Fish
2 Spicy
17 No
2 Yes
The only thing left that I would like to add is a general function that limits each column's output to the top 5 most duplicated fields; as mentioned below, a columnar version of sort | uniq -c | sort -nr | head -5.
The following script is both extensible and scalable, as it will work with an arbitrary number of columns. Nothing is hardcoded.
awk -F'|' '
NR==1{
    for(i=1;i<=NF;i++)
        head[i]=$i
    next
}
{
    for(i=1;i<=NF;i++) {
        if($i && !arr[i,$i]++)
            n[i]++
        if(arr[i,$i] > 1)
            f[i]=1
    }
}
END{
    for(i=1;i<=length(head);i++) {
        printf("%-32s%d\n",head[i],n[i])
        if(f[i]) {
            for(x in arr) {
                split(x,b,SUBSEP)
                if(b[1]==i && b[2])
                    printf(" %-28s%d\n",b[2],arr[i,b[2]])
            }
        }
    }
}' infile
Output
$ ./report
Food 9
Type 5
Meat 2
Bread 1
Vegetable 2
Fruit 2
Fish 1
Spicy 2
Yes 2
No 6
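None of this limits the output to 5 values per column, though. A sketch of just that missing piece: count values per column, then in the END block pipe each column's count lines through an external sort | head (the columnar sort | uniq -c | sort -nr | head -5 the question mentions). It assumes an awk that provides fflush("") (gawk, mawk, and BWK awk all do) so the column headings don't interleave with the piped output, and uses the question's testlist.csv:

```shell
awk -F'|' '
NR == 1 { for (i = 1; i <= NF; i++) head[i] = $i; nf = NF; next }
{ for (i = 1; i <= nf; i++) if ($i != "") cnt[i SUBSEP $i]++ }
END {
    for (i = 1; i <= nf; i++) {
        print head[i]
        fflush("")                   # keep the heading ahead of the piped output
        cmd = "sort -rn | head -5"   # columnar sort | uniq -c | sort -nr | head -5
        for (key in cnt) {
            split(key, parts, SUBSEP)
            if (parts[1] == i)
                printf "%-6d%s\n", cnt[key], parts[2] | cmd
        }
        close(cmd)                   # finish this column before starting the next
    }
}' testlist.csv
```

Closing the pipe after each column is what resets the sort | head pipeline, so every column gets its own independent top 5.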
Not a complete solution, but something to get you started -
awk -F"|" '
NR>1{
    a[$1]++;
    b[$2]++;
    c[$3]++
}
END{
    print "Food\t\t\t" length(a);
    print "Type\t\t\t" length(b);
    for (x in b)
        if (x!="") {
            printf ("\t%-16s%s\n",x,b[x]);
        }
    print "Spicy\t\t\t" length(c);
    for (y in c)
        if (y!="") {
            printf ("\t%-16s%d\n",y,c[y])
        }
}' testlist.csv
TEST:
[jaypal:~/Temp] cat testlist.csv
Food|Type|Spicy
Broccoli|Vegetable|No
Jalapeno|Vegetable|Yes
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
[jaypal:~/Temp] awk -F"|" 'NR>1{a[$1];b[$2]++;c[$3]++}END{print "Food\t\t\t" length(a); print "Type\t\t\t"length(b); for (x in b) if (x!="") printf ("\t%-16s%s\n",x,b[x]) ;print "Spicy\t\t\t"length(c); for (y in c) if (y!="") {printf ("\t%-16s%d\n",y,c[y])}}' testlist.csv
Food 9
Type 6
Fruit 2
Vegetable 2
Bread 1
Meat 2
Fish 1
Spicy 3
Yes 2
No 6