awk, header: ignore when reading, include when writing - csv

I know how to ignore the column header when reading a file. I can do something like this:
awk 'FNR > 1 { #process here }' table > out_table
But if I do this, everything except the COLUMN HEADER is written into the output file, and I want the output file to contain the COLUMN HEADER as well.
Of course I can do something like this after I execute the first statement:
awk 'BEGIN {print "Column Headers\t"} {print}' out_table > out_table_with_header
But this becomes a two-step process. So is there a way to do this in a SINGLE STEP?
In short, is there a way for me to ignore the column header while reading the file, perform operations on the data, and then include the column header when writing the output file, all in a single step (or a block of steps that takes very little time)?

Not sure if I got you correctly, but you can simply:
awk 'NR==1{print}; NR>1 { # process }' file
which can be simplified to:
awk 'NR==1; NR>1 { # process }' file
That works for a single input file.
If you want to process more than one file, all having the same column headers at line 1, use this:
awk 'FNR==1 && !h {print; h=1}; FNR>1 { # process }' file1 file2 ...
I'm using the variable h to check whether the headers have been printed already or not.
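For illustration, here is what that looks like with two small made-up files (jan.csv and feb.csv are just hypothetical names) sharing the same header line, with the # process placeholder replaced by a plain print:
$ cat jan.csv
id,amount
1,10
$ cat feb.csv
id,amount
2,20
$ awk 'FNR==1 && !h {print; h=1}; FNR>1 {print}' jan.csv feb.csv
id,amount
1,10
2,20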

Related

Prefix every header column with a string using awk

I have a bunch of big CSV files, and I want to prefix every header column with a fixed string. There are more than 500 columns in every file.
suppose my header is:
number;date;customer;key;amount
I tried this awk line:
awk -F';' 'NR==1{gsub(/[^a-z_]/,"input_file.")} { print }'
but I get (note the first column is missing the prefix and the separator is removed):
numberinput_file.dateinput_file.customerinput_file.keyinput_file.amount
expected output:
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
In any awk that'd be:
$ awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
but sed exists to do simple substitutions like that, e.g. using a sed that has -E to enable EREs (e.g. GNU and BSD sed):
$ sed -E '1s/^|;/&input_file./g' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
If you're using GNU tools then you could use either of the above to change all of your CSV files at once with either of these:
awk -i inplace 'NR==1{gsub(/^|;/,"&input_file.")} 1' *.csv
sed -i -E '1s/^|;/&input_file./g' *.csv
Your gsub would brutally replace every character that is not a lowercase letter or an underscore anywhere in the header line with the prefix, including your column separators.
The print can be abbreviated to the common idiom 1 at the very end of your script; it simply means "this condition is true; perform the default action for every line (i.e. print it)". This is just a stylistic change, though.
awk -F';' 'NR==1 {
    sub(/^/, "input_file."); gsub(/;/, ";input_file.")
}
1' filename
If you want to perform this on multiple files, probably put a shell loop around it. If you only want to concatenate everything to standard output, you can give all the files to Awk in one go (in which case you probably don't want to print the header line for any file after the first; maybe change the 1 to NR==1 || FNR != 1).
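For example, a minimal sketch of such a shell loop, assuming the files match *.csv and you want to rewrite each file in place without relying on awk's -i inplace or sed -i (the .tmp suffix is just an arbitrary choice):
for f in *.csv; do
    awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done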
I would use GNU AWK in the following way. Let file.txt content be
number;date;customer;key;amount
1;2;3;4;5
6;7;8;9;10
then
awk 'BEGIN{FS=";";OFS=";input_file."}NR==1{$1="input_file." $1}{print}' file.txt
output
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
1;2;3;4;5
6;7;8;9;10
Explanation: I set OFS to ; followed by the prefix. Then in the first line I add the prefix to the first column, which triggers rebuilding of the record. No modification is done in any other line, so those are printed as is.
(tested in GNU Awk 5.0.1)
Also with awk using for loop and printf:
awk 'BEGIN{FS=OFS=";"} NR==1{for (i=1; i<=NF; i++) printf "%s%s", "input_file." $i, (i<NF ? OFS : ORS)}' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
That I'd like to output to multiple files that each have as content the JSON part that comes after the timestamp. So I would get as a result the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files? but the name of the file should be extracted from each line (removing the extra dot), not generated from an index.
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, plus issues with specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
    # build the output file name from the line itself:
    fname = $0
    sub(/\./,"",fname)         # drop the dot inside the timestamp
    sub(/:.*/,".json",fname)   # replace everything from the ":" onwards with ".json"
    # strip the "timestamp:" prefix from the record, then write the JSON part out
    sub(/[^:]+:/,"")
    print >> fname
    close(fname)               # close each file to avoid "too many open files"
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing filename
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}

Delete rows of CSV file based on the value of a column

Here's an example of a few lines of my CSV file:
movieID,actorID,actorName,ranking
1,don_rickles,Don Rickles,3
1,jack_angel,Jack Angel,6
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1
1,wallace_shawn,Wallace Shawn,5
I would like to remove all rows that have a ranking > 4. So far I've been trying to use this awk line:
awk -F ',' 'BEGIN {OFS=","} { if (($4) < 5) print }' file.csv > file_out.csv
It should print all the rows with a ranking (4th column) of less than 5 to a new file. I can't tell exactly what this line actually does, but it's not what I want. Can someone tell me where I've gone wrong with that line?
Instead of deleting the records, think of which ones you're going to print. I guess it's <=4. In idiomatic awk you can write this as
$ awk -F, '$4<=4' file
1,don_rickles,Don Rickles,3
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1
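If you also want to keep the header row (the sample file starts with movieID,actorID,actorName,ranking), one possible variation, not part of the original answer, is to print line 1 unconditionally:
$ awk -F, 'NR==1 || $4<=4' file
movieID,actorID,actorName,ranking
1,don_rickles,Don Rickles,3
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1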

How to create a string with x numbers of the same character without looping in AWK?

I'm trying to write a CSV join program in AWK which joins the first.csv file with the second.csv file. The program now works perfectly fine, assuming the number of rows in both files is the same.
The problem arises when one of the files contains more rows than the other; in this case, I'd have to add a number of commas (which depends on the number of fields in the input files) to the file with fewer rows, so that the columns are not misplaced.
The question is: how can I create and assign strings containing different numbers of commas? For instance,
if NumberOfFields==5, then I want to create the string ",,,,," and add it to Array[i].
Here is another answer with sample code using your variable and array name.
BEGIN {
    NumberOfFields = 5;
    i = 1;
    Array[i] = gensub(/0/, ",", "g", sprintf("%0*d", NumberOfFields, 0));
    print Array[i];
}
Run it with awk -f x.awk where x.awk is the above code in a text file. Be aware that it always prints at least 1 comma, even if you specify zero.
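With NumberOfFields set to 5 as above, the run should look something like this:
$ awk -f x.awk
,,,,,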
$ awk -v num=3 'BEGIN {var=sprintf("%*s",num,""); gsub(/ /,",",var); print var}'
,,,
$
Use an array instead of var if/when you like. Note that unlike another solution posted, the above will work with any awk, not just gawk, and it will not print any commas if the number requested is zero:
$ awk -v num=0 'BEGIN {var=sprintf("%*s",num,""); gsub(/ /,",",var); print var}'
$
The equivalent with GNU awk and gensub() would be:
$ awk -v num=3 'BEGIN {var=gensub(/ /,",","g",sprintf("%*s",num,"")); print var}'
,,,
$
$ awk -v num=0 'BEGIN {var=gensub(/ /,",","g",sprintf("%*s",num,"")); print var}'
$
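To tie this back to the question, here is a minimal sketch of filling Array[i] with a padding string of NumberOfFields commas (the variable names are taken from the question; this is just one way to wire it up):
$ awk -v NumberOfFields=5 'BEGIN {
    pad = sprintf("%*s", NumberOfFields, "")   # a string of NumberOfFields spaces
    gsub(/ /, ",", pad)                        # turn each space into a comma
    i = 1; Array[i] = pad
    print Array[i]
}'
,,,,,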

Most effective way to retrieve, edit, and then store data from many files in bash

So I have a bunch of data in .log files. The columns are tab separated, but I only need data from columns 2 and 7 (although there is not always guaranteed to be data in column 7, and there are more columns after 7; in that case there would be a double tab before column 8).
My current method is EXTREMELY slow and I feel like there must be a better way, as I am going through the data more often than I should be.
#First I iterate through all the files and append them to data.raw.log
cat $f >> data.raw.log
#Then cut out unneeded data.
cut -f2,7 data.raw.log > data.log
#I then need to parse the data into JSON
while IFS=$'\t' read -r -a entry
do
if [ ! -z ${entry[1]} ]; then
echo "FORMATTED JSON HERE WITH ${entry[0]} AND ${entry[1]}" >> data.json
fi
done < data.log
The obvious issue is that I am going through the data twice: once to cut it and once to format it, when I only need to go through it once. This is proving to be EXTREMELY slow; any ideas on speed improvement would be helpful.
Use awk:
awk -F'\t' '$7 != "" { print "FORMATTED JSON HERE WITH " $2 " AND " $7 }' * > data.json
Here, I assume that all the files are in the current directory. You should be able to adjust this easily to accommodate the actual location of the files.
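For instance, if the logs actually live in another directory (the path below is hypothetical), you can just point the glob there:
awk -F'\t' '$7 != "" { print "FORMATTED JSON HERE WITH " $2 " AND " $7 }' /path/to/logs/*.log > data.json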
Ok, so you cut the data into a file, then parse that file. That is kind of lengthy. Not only that, but you first copy all the data from one file to another.
You can achieve the same thing with a single little awk script:
$ cat file*.log | awk -F'\t' '{if ($7 != "") print "Formatted data here with " $2 " and " $7}' >output.log
Awk splits the input into tab-separated fields and checks whether $7 is empty or not. If it isn't, it prints columns 2 and 7 formatted as you like.