CSV Column Insertion via awk - csv

I am trying to insert a column in front of the first column in a comma-separated values (CSV) file. At first blush, awk seems to be the way to go, but I'm struggling with how to fill in the new column row by row.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9

This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution, which is available in bash. For other shells, use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file

With awk alone, you could try the following solution, written and tested with the shown samples.
awk -v value="$(echo -e "col\n0\nA\n2\nC")" '
BEGIN{
  FS=OFS=","
  num=split(value,arr,ORS)
  for(i=1;i<=num;i++){
    newVal[i]=arr[i]
  }
}
{
  $1=arr[FNR] OFS $1
}
1
' Input_file
Explanation:
First, create an awk variable named value that holds the output of the shell's echo command. NOTE: the -e option makes echo interpret \n as a newline rather than as literal characters.
In the BEGIN section, set FS and OFS to , for all lines of Input_file.
Use the split function to break value into an array named arr, with ORS (a newline) as the delimiter.
Loop from 1 to num (the number of values supplied by echo).
Inside the loop, copy each arr element into an array named newVal under the same index i (1, 2, 3 and so on). The main block reads arr directly, so newVal is not actually used afterwards.
In the main block, set the first field to the arr value for the current line number (FNR), followed by OFS and the original $1, then print the line.
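As a variation on the awk-only approach, the new column values could also live in their own file and be read with the common NR==FNR two-file idiom. This is only a sketch: the function name prepend_column and the file names are made up for illustration, and it assumes the column file has one value per CSV row.

```shell
# Hypothetical helper: prepend the lines of a "column file" to the rows of a CSV.
# $1 = file holding one new value per line, $2 = CSV file.
prepend_column() {
    awk 'BEGIN { FS = OFS = "," }
         NR == FNR { col[FNR] = $0; next }   # first file: remember each value by line number
         { print col[FNR], $0 }              # second file: new value, OFS, original row
    ' "$1" "$2"
}

# Example with the sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
printf 'col\n0\nA\n2\nC\n' > "$dir/newcol.txt"
printf 'A,B,C,D,E,F\n1,2,3,4,5,6\n2,3,4,5,6,7\n3,4,5,6,7,8\n4,5,6,7,8,9\n' > "$dir/data.csv"
prepend_column "$dir/newcol.txt" "$dir/data.csv"
```

Keeping the values in a file avoids passing newlines through -v, at the cost of a second input file.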

Related

Increment field value provided another field matches a string

I am trying to increment a value in a csv file, provided it matches a search string. Here is the script that was utilized:
awk -i inplace -F',' '$1 == "FL" { print $1, $2+1} ' data.txt
Contents of data.txt:
NY,1
FL,5
CA,1
Current Output:
FL 6
Intended Output:
NY,1
FL,6
CA,1
Thanks.
$ awk 'BEGIN{FS=OFS=","} $1=="FL"{++$2} 1' data.txt
NY,1
FL,6
CA,1
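The command above prints to standard output and leaves data.txt untouched, while the -i inplace option used in the question is a gawk extension. If the file itself has to be updated with any awk, one possible sketch is to write to a temporary file and move it over the original. The helper name increment_in_place is hypothetical, and mktemp is assumed to be available.

```shell
# Hypothetical helper (not from the original answers): increment field 2 on rows
# whose field 1 equals KEY, then replace FILE only if awk succeeded.
increment_in_place() {
    key=$1
    file=$2
    tmp=$(mktemp) || return 1
    if awk -v key="$key" 'BEGIN { FS = OFS = "," } $1 == key { ++$2 } 1' "$file" > "$tmp"
    then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example: rebuild the sample file in a scratch directory and update it.
dir=$(mktemp -d)
printf 'NY,1\nFL,5\nCA,1\n' > "$dir/data.txt"
increment_in_place FL "$dir/data.txt"
cat "$dir/data.txt"
```

Note that the replaced file takes on the temporary file's permissions, which may differ from the original's.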
Intended Output:
NY,1 FL,6 CA,1
I would use GNU AWK for this task in the following way. Let the content of file.txt be
NY,1
FL,5
CA,1
then
awk 'BEGIN{FS=OFS=",";ORS=" "}{print $1,$2+($1=="FL")}' file.txt
gives output
NY,1 FL,6 CA,1
Explanation: I inform GNU AWK that the field separator (FS) and the output field separator (OFS) are , and that the output record separator (ORS) is a space, in accordance with your requirements. Then for each line I print the 1st field followed by the 2nd field increased by the result of the comparison $1=="FL", which is 1 when it holds and 0 when it does not. If you want to know more about FS, OFS, or ORS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
Use this Perl one-liner:
perl -i -F',' -lane 'if ( $F[0] eq "FL" ) { $F[1]++; } print join ",", @F;' data.txt
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

transform multiline text into csv with awk sed and grep

I run a shell command that returns a list of repeated values like this (note the indentation):
Name: vm346
  cpu 1 (12%) 6150m (76%)
  memory 1130Mi (7%) 1130Mi (7%)
Name: vm847
  cpu 6 (75%) 30150m (376%)
  memory 12980Mi (87%) 12980Mi (87%)
Name: vm848
  cpu 3500m (43%) 17150m (214%)
  memory 6216Mi (41%) 6216Mi (41%)
I am trying to transform that data like this (in csv):
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The problem is that any given dataset like the one above is always on more than one line.
When I pipe that into awk, it drives me mad, because even if I use:
BEGIN{ FS="\n" }
to try and stitch the data together in one line, it doesn't work. No matter what I do, awk keeps the name value as a separated line above everything else.
I am sorry I haven't much code to share but I have been spinning my wheels with this for a few hours now and I am running out of ideas...
I can solve this in Perl:
perl -ane 'print join ",", @F[1 .. $#F]; print $F[0] eq "memory" ? "\n" : ","'
It should be easy to translate it to awk if you need it.
How does it work?
-a splits each line on whitespace into the @F array
-n reads the input line by line and runs the code specified after -e for each line
We print all the elements but the first one separated by commas (see join)
We then look at the first column, if it's memory, we are at the last line of the block, so we print a newline, otherwise we print a comma
With AWK, one option is to set RS to "Name: ", and ignore the first record with NR > 1, e.g.
awk -v RS="Name: " 'BEGIN{OFS=","} NR > 1 {print $1, $3, $4, $5, $6, $8, $9, $10, $11}' file
#> vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
#> vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
#> vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
awk '{$1=""}1' | paste -sd'  \n' - | awk '{$1=$1}1' OFS=,
Get rid of the first column. Join every three rows (the paste delimiter list is two spaces followed by a newline). Same idea with sed:
sed 's/^ *[^ ]* *//' | paste -sd'  \n' - | sed 's/  */,/g'
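Joining every three rows with paste -s depends on a delimiter list of exactly two spaces followed by a newline, which is easy to mistype. A possibly more readable sketch of the same step passes standard input to paste three times, so each output row takes three consecutive input lines. The function name join_three is made up for illustration.

```shell
# Hypothetical helper: join every three input lines into one, separated by spaces.
# Each "-" operand reads the next line of standard input in turn.
join_three() {
    paste -d' ' - - -
}

printf '%s\n' a b c d e f | join_three
# a b c
# d e f
```

If the number of input lines is not a multiple of three, the last row is padded with empty fields, so it ends in trailing spaces.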
Something else:
awk -v OFS=, '
  $1=="Name:" {
    sep=ors
    ors=ORS
  } {
    for (i=2;i<=NF;++i) {
      printf "%s%s",sep,$i
      sep=OFS
    }
  } END {printf "%s",ors}'
Or if you want to print an ORS based on the first field being "memory" (note that this program may end without printing a terminating ORS):
awk -v OFS=, '{for (i=2;i<=NF;++i) printf "%s%s",$i,(i==NF && $1=="memory" ? ORS : OFS)}'
something else else:
awk -v OFS=, '
index($0,$1)==1 {
OFS=ors
ors=ORS
} {
$1=""
printf "%s",$0
OFS=ofs
} END {printf "%s",ors} BEGIN {ofs=OFS}'
This might work for you (GNU sed):
sed -nE '/^ +\S+ +/{s///;H;$!d};x;/./s/\s+/,/gp;x;s/^\S+ +//;h' file
In overview the sed program processes indented lines, already gathered lines (except in the case that the current line is the first line of the file) and non-indented lines.
Turn off implicit printing and enable extended regexp's. (-nE).
If the current line is indented, remove the indent, the first field and any following spaces, append the result to the hold space and if it is not the last line, delete it.
Otherwise, check the hold space for gathered lines and if found, replace one or more whitespaces by commas and print the result. Then prep the current line by removing the first field and any following spaces and replace the hold space with the result.
The solution seems logically back-to-front, but programming in this style avoids having to check for end-of-file multiple times and invoking labels and gotos.
N.B. This solution will work for any number of indented lines.
Here is a Ruby script to do that:
ruby -e '
s=$<.read
s.scan(/^([^ \t]+:)([\s\S]+?)(?=^\1|\z)/m). # parse blocks
map(&:last). # get data part
# parse and join the data fields:
map{|block| block.split(/\n[ \t]+[^ \t]+[ \t]+/)}.
map{|lines| lines.map(&:strip).join(" ").split().join(",")}.
each{|l| puts "#{l}"}
' file
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The advantage is that this is not dependent on the number of lines or the number of fields. It is parsing data that is in blocks of the form:
START: ([ \t]+[data_with_no_space])*\n
l1 ([ \t]+[data_with_no_space])*\n
...
START:
...
Works this way:
Parse the blocks with the regex passed to scan;
Save an array of the data elements;
Join the sub arrays and then split into data fields;
Join(',') to make a csv.

AWK parse CSV, extract substring from cell and add new column with extracted value

AWK parse CSV, extract substring from cell and add new column. Where there is no matching pattern (i.e. no substring to extract), add blank cell to CSV.
Source Data (3 example columns, actual data is 20+ columns)
"col1txtA","col2txtA","TYPE=ARRAY&ID=111&OPERATINGSYSTEM=WINDOWS%2010&DATE=0000"
"col1txtB","col2txtB","TYPE=ARRAY&ID=112&DATE=0000"
Attempted code
awk -F, -v OFS=, '
NR>1
{$4=match($3,/OPERATINGSYSTEM=[^&]*/)}
1'
Desired output data (new column even if blank result)
"col1txtA","col2txtA","TYPE=ARRAY&ID=111&OPERATINGSYSTEM=WINDOWS%2010&DATE=0000","WINDOWS%2010"
"col1txtB","col2txtB","TYPE=ARRAY&ID=112&DATE=0000",""
With GNU awk:
You can save the result of your match in an array a and access the text matched inside the parentheses of your regex as a[1]. The array argument is a gawk extension.
awk -F',' -v OFS=',' '
{
if (match($3, /OPERATINGSYSTEM=([^&]*)/, a)){
$(NF+1)="\"" a[1] "\""
}
else {
$(NF+1)="\"\""
}
}
1' input.csv
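Since the array argument to match is a gawk extension, a more portable sketch can rely on RSTART and RLENGTH, which match sets in any POSIX awk, and cut the value out with substr. The function name add_os_column is hypothetical. As a small deviation from the regex above, this version also excludes a double quote from the value, on the assumption that OPERATINGSYSTEM could be the last parameter in the cell. Like the original, it assumes the quoted fields contain no commas.

```shell
# Hypothetical helper: append a quoted OPERATINGSYSTEM value (or "") as a new last column.
add_os_column() {
    awk -F',' -v OFS=',' '
    {
        key = "OPERATINGSYSTEM="
        os = ""
        # match() sets RSTART and RLENGTH; keep only the text after the key
        if (match($3, key "[^&\"]*"))
            os = substr($3, RSTART + length(key), RLENGTH - length(key))
        $(NF + 1) = "\"" os "\""
    }
    1' "$@"
}

# Example with the two sample rows from the question.
printf '%s\n' \
  '"col1txtA","col2txtA","TYPE=ARRAY&ID=111&OPERATINGSYSTEM=WINDOWS%2010&DATE=0000"' \
  '"col1txtB","col2txtB","TYPE=ARRAY&ID=112&DATE=0000"' | add_os_column
```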

Extract column data from csv file based on row values

I am trying to use awk/sed to extract specific column data based on row values. My actual files have 15 columns and over 1,000 rows (From a .csv file.)
Simple EXAMPLE: Input: a csv file with a total of 5 columns and 100 rows. Output: data from columns 2 through 5 based on specific row values from column 2. (I have a specific list of the row values I want the operator to filter out. The values are numbers.)
File looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The operator response:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output Example: I want the full rows where column 2 equals 7439 or 7500.
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain, other than that the convenience variable q, defined as a double quote, helps to avoid escaping.
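The question mentions a whole list of ID values rather than a single one. One way to extend the exact comparison to a list, sketched here under the assumption that the IDs are passed as a space-separated string, is to load them into an awk array and test membership with in. The function name filter_ids is hypothetical. Unlike a regex test such as $2~/7439/, an exact comparison does not also match longer values that merely contain those digits.

```shell
# Hypothetical helper: print the header plus every row whose 2nd field is one of the IDs.
# $1 = space-separated list of IDs, remaining arguments = CSV files (or standard input).
filter_ids() {
    ids=$1
    shift
    awk -F',' -v ids="$ids" '
    BEGIN {
        n = split(ids, list, " ")
        # fields keep their surrounding double quotes, so store the quoted form
        for (i = 1; i <= n; i++) want["\"" list[i] "\""] = 1
    }
    NR == 1 || ($2 in want)' "$@"
}

# Example with the sample rows from the question.
printf '%s\n' \
  '"Date","IdNo","Color","Height","Education"' \
  '"06/02/16","7438","Red","54","4"' \
  '"06/02/16","7439","Yellow","57","3"' \
  '"06/03/16","7500","Red","55","3"' | filter_ids "7439 7500"
```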
awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"

Read tab-delimited data into Hive array

Data format I need:
12cef8e1b711a351 [1377045694501,1377045728475,1377045709652]
12cf3cb988f10a87 [1380741459591,1380739871201,1380739785397,1380740303830,1380739849591]
12d1be8adb90a88b [1375541238666,1375541281821]
12d29ba61341e7ce [1377855844089,1377855785342]
12d2e28e50d42d19 [1381974506104,1381973579872,1377988785664,1381976074258]
Data format I have - everything is tab-delimited:
12cef8e1b711a351 1377045694501 377045728475 1377045709652
12cf3cb988f10a87 1380741459591 1380739871201 1380739785397 1380740303830 1380739849591
12d1be8adb90a88b 1375541238666 1375541281821
12d29ba61341e7ce 1377855844089 1377855785342
12d2e28e50d42d19 1381974506104 1381973579872 1377988785664 1381976074258
How do I process tab-delimited data so that the first field is separated from the rest by a tab, and everything else is comma-delimited and surrounded by []? Possibly, each comma-delimited item also has to be enclosed in "".
I need to read these data into Hive table
CREATE TABLE id_timestamps (id STRING, timestamps array<STRING>);
Can I read it directly into Hive with some tricks, or should I transform the tab-delimited data with awk or sed? Please help with some suggestions and recipes.
Thanks!
This awk script produces the desired format:
awk '{printf "%s\t[", $1; for(i=2;i<=NF;++i) printf "%s%s", $i, (i<NF?",":"]\n")}' file
Print the first column, followed by a tab character and the opening "[". Print the rest of the columns followed by a ",", except the last, which is followed by a "]" and a newline.
Testing it out:
$ awk '{printf "%s\t[", $1; for(i=2;i<=NF;++i) printf "%s%s", $i, (i<NF?",":"]\n")}' file
12cef8e1b711a351 [1377045694501,377045728475,1377045709652]
12cf3cb988f10a87 [1380741459591,1380739871201,1380739785397,1380740303830,1380739849591]
12d1be8adb90a88b [1375541238666,1375541281821]
12d29ba61341e7ce [1377855844089,1377855785342]
12d2e28e50d42d19 [1381974506104,1381973579872,1377988785664,1381976074258]
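The question also wonders whether each item might need to be wrapped in double quotes. Whether Hive actually requires that is not settled here, but if it does, a variant along these lines could produce it. This is a sketch only: the function name to_hive_array is hypothetical, and it assumes the input is strictly tab-delimited.

```shell
# Hypothetical helper: keep field 1, then emit the remaining fields as a
# bracketed, comma-separated list with each item in double quotes.
to_hive_array() {
    awk -F'\t' '{
        printf "%s\t[", $1
        for (i = 2; i <= NF; ++i)
            printf "\"%s\"%s", $i, (i < NF ? "," : "")
        print "]"
    }' "$@"
}

# Example with two rows taken from the question.
printf '12d1be8adb90a88b\t1375541238666\t1375541281821\n12d29ba61341e7ce\t1377855844089\t1377855785342\n' |
    to_hive_array
```

Printing the closing bracket separately means a row with only an id still produces a well-formed empty list.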