Convert single column to multiple, ensuring column count on last line - csv

I would like to use AWK (on Windows) to convert a text file with a single column into multiple columns, with the column count specified in the script or on the command line.
This question has been asked before, but my final data file needs to have the same column count on every line, including the last.
Example of input:
L1
L2
L3
L4
L5
L6
L7
Split into 3 columns with ";" as the separator:
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created at the end, since only one value remains for the last line.
I tried to modify variants of the typical solution given, NR%4 {printf $0",";next} 1, together with a counter, but could not quite get it right.
I would prefer not to count the lines beforehand, which would mean running over the file multiple times.

You may use this awk solution:
awk -v n=3 '{
sub(/\r$/, "") # removes DOS line break, if present
printf "%s", $0(NR%n ? ";" : ORS)
}
END {
# now we need to add empty columns in last record
if (NR % n) {
for (i=1; i < (n - (NR % n)); ++i)
printf ";"
print ""
}
}' file
L1;L2;L3
L4;L5;L6
L7;;

With your shown samples, please try the following xargs + awk combination to achieve the outcome:
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'

For an awk I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
  if (NR%n) {                  # pad only when the last row is incomplete
    for(i=NR%n; i<n-1; i++) printf(";")
    printf ORS
  }
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # every n lines: re-split, pad to n fields, print with OFS
END { if (NR%n) { $0=row; NF=n; print } } # flush an incomplete last row the same way
' file
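The padding in this variant relies on the fact that assigning to NF makes awk rebuild $0 with OFS and create any missing fields as empty strings; gawk behaves this way, though some other awks may not. A minimal illustration:
$ echo 'a b' | gawk -v OFS=';' '{NF = 4; print}'
a;b;;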
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
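If n is large, the list of - arguments for paste can be generated rather than typed out; a small sketch, assuming a POSIX-ish shell with seq available:
n=5
paste -d';' $(printf -- '- %.0s' $(seq "$n")) < file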

Related

How to combine several GAWK statements?

I have the following:
cat *.csv > COMBINED.csv
sort -k1 -n -t, COMBINED.csv > A.csv
gawk -F ',' '{sub(/[[:lower:]]+/,"",$1)}1' OFS=',' A.csv # REMOVE LOWER CASE CHARACTERS FROM 1st COLUMN
gawk -F ',' 'length($1) == 14 { print }' A.csv > B.csv # REMOVE ANY LINE FROM CSV WHERE VALUE IN FIRST COLUMN IS NOT 14 CHARACTERS
gawk -F ',' '{ gsub("/", "-", $2) ; print }' OFS=',' B.csv > C.csv # REPLACE FORWARD SLASH WITH HYPHEN IN SECOND COLUMN
gawk -F ',' '{print > ("processed/"$1".csv")}' C.csv # SPLIT CSV INTO FILES GROUPED BY VALUE IN FIRST COLUMN AND SAVE THE FILE WITH THAT VALUE
However, I think 4 separate lines is a bit overkill and was wondering whether I could optimise it or at least streamline it into a one-liner?
I've tried piping the data but I get stuck in a mix of errors.
Thanks
In awk you can chain multiple pattern-action pairs:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
So every time a record is read, awk processes it by first applying pattern-action 1, then pattern-action 2, and so on.
In your case, it seems like you can do:
awk 'BEGIN{FS=OFS=","}
# remove lower case characters from first column
{sub(/[[:lower:]]+/,"",$1)}
# process only lines with 14 characters in first column
(length($1) != 14) { next }
# replace forward slash with hyphen
{ gsub("/", "-", $2) }
{ print > ("processed/" $1 ".csv") }' <(sort -k1 -n -t, combined.csv)
You could essentially also do the sorting inside GNU awk, but to mimic your sort invocation exactly we would need to know your input format.
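For completeness, here is a sketch of what sorting inside GNU awk could look like, using gawk's PROCINFO["sorted_in"] controlled array traversal. It buffers the cleaned rows in memory and may not reproduce sort -k1 -n -t, exactly (tie order, locale), so treat it as an illustration rather than a drop-in replacement:
gawk 'BEGIN{FS=OFS=","}
{
# same cleaning steps as above
sub(/[[:lower:]]+/,"",$1)
if (length($1) != 14) next
gsub("/", "-", $2)
key[NR] = $1 + 0                         # numeric sort key (assumes the cleaned column 1 is numeric)
row[NR] = $0
dest[NR] = "processed/" $1 ".csv"
}
END{
PROCINFO["sorted_in"] = "@val_num_asc"   # traverse key[] in ascending numeric order of its values
for (i in key)
print row[i] > dest[i]
}' combined.csv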

finding patterns in csv file

I am trying to sort a csv file based on its repeating rows. Here is what I tried:
awk -F, 'NR>1{arr[$4,",",$5,",",$6,",",$7,",",$8,",",$9]++}END{for (a in arr) printf "%s\n", arr[a] "-->" a}' test.txt
Input file
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
Create a file with
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,2,t,z,s-->2
a,b,k,1,4,5,5,5,6,1,k,r,s-->1
where the last column contains the number of occurrences of the pattern of numbers found in the 4th through 9th columns.
Count and sort the duplicate lines
I got to the point that I have the patterns with the count - but I don't know how to add the rest of the columns to the line:
Thank you for the support.
A solution where the data is read twice: on the first pass the duplicates are counted, and on the second the lines are printed:
$ awk -F, '
NR==FNR {
a[$4 ORS $5 ORS $6 ORS $7 ORS $8 ORS $9]++ # count
next
}
{
print $0 "-->" a[$4 ORS $5 ORS $6 ORS $7 ORS $8 ORS $9] # output
}' file file
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,t,z,s-->2
a,b,k,1,4,5,5,5,6,k,r,s-->1
Could you please try the following, which reads Input_file a single time only.
awk '
BEGIN{
FS=OFS=","
}
{
a[FNR]=$0
b[FNR]=$4 FS $5 FS $6 FS $7 FS $8 FS $9
c[$4 FS $5 FS $6 FS $7 FS $8 FS $9]++
}
END{
for(i=1;i<=FNR;i++){
print a[i]" ---->" c[b[i]]
}
}' Input_file
The answer of James Brown is a very simple double-pass solution, which has the advantage that you don't need to store the file in memory but the disadvantage of having to read the file twice. The following solution does the inverse: it reads the file only once, but has to keep it in memory. To this end we need 3 arrays: array c to keep track of the counts, array b to buffer the original lines, and array a to remember each line's key in input order.
Furthermore, we will make use of multidimensional array indices:
A valid array index shall consist of one or more <comma>-separated expressions, similar to the way in which multi-dimensional arrays are indexed in some programming languages. Because awk arrays are really one-dimensional, such a <comma>-separated list shall be converted to a single string by concatenating the string values of the separate expressions, each separated from the other by the value of the SUBSEP variable. Thus, the following two index operations shall be equivalent:
var[expr1, expr2, ... exprn]
var[expr1 SUBSEP expr2 SUBSEP... SUBSEP exprn]
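A quick way to convince yourself of this equivalence (a standalone illustration, not part of the solution):
awk 'BEGIN{ x["a","b"] = 1; if (("a" SUBSEP "b") in x) print "same element" }'
This prints same element, because both subscript forms collapse to the identical string "a" SUBSEP "b".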
The solution now reads:
{ a[NR] = $4 SUBSEP $5 SUBSEP $6 SUBSEP $7 SUBSEP $8 SUBSEP $9
  b[NR] = $0
  c[$4,$5,$6,$7,$8,$9]++ }
END { for(i=1;i<=NR;++i) print b[i] "-->" c[a[i]] }
Since the problem resembles an SQL grouping pattern, you can also use SQLite. Check this out:
$ cat shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
$ cat sqllite_cols4_to_9.sh
#!/bin/sh
sqlite3 <<EOF
create table data(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12);
.separator ','
.import "$1" data
select t1.*, " --> " || t2.cw from data t1, ( select c4,c5,c6,c7,c8,c9, count(*) as cw from data group by c4,c5,c6,c7,c8,c9 ) t2
where t1.c4=t2.c4 and t1.c5=t2.c5 and t1.c6=t2.c6 and t1.c7=t2.c7 and t1.c8=t2.c8 and t1.c9=t2.c9;
EOF
$ ./sqllite_cols4_to_9.sh shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z, --> 2
k,s,t,1,2,3,4,5,6,t,z,s, --> 2
a,b,k,1,4,5,5,5,6,k,r,s, --> 1
$
You can also try Perl. The file is read only once, so it should be faster. Check this out:
$ cat shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
$ perl -F, -lane ' $v=join(",",@F[3..8]);$kv{$_}{$v}=$kv2{$v}++; END { while(($x,$y)=each (%kv)){ while(($p,$q)=each (%{$y})) { print "$x --> $kv2{$p}" }}}' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
$
Another Perl - shorter code
$ perl -F, -lane ' $kv{$_}=$kv2{join(",",@F[3..8])}++; END { for(keys %kv) { $t=join(",",(split /,/)[3..8]); print "$_ --> $kv2{$t}" } } ' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
or
$ perl -F, -lane ' $kv{$_}=$kv2{join(",",@F[3..8])}++; END { for(keys %kv) { print "$_ --> ",$kv2{join(",",(split /,/)[3..8])} } } ' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
$

Adding a column to multiple csv files using awk

I want to add a column to multiple (500) CSV files that all have the same dimensionality. The new column should act as an identifier for the individual file. I want to create a bash script using awk (I am a newbie in awk). The CSV files come with headers.
For eg.
Input File1.csv
#name,#age,#height
A,12,4.5
B,13,5.0
Input File2.csv
#name,#age,#height
C,11,4.6
D,12,4.3
I want to add a new column "#ID" to both files, where the value of ID is the same for every row within an individual file but differs between files.
Expected Output
File1.csv
#name,#age,#height,#ID
A,12,4.5,1
B,13,5.0,1
Expected File2.csv
#name,#age,#height,#ID
C,11,4.6,2
D,12,4.3,2
Please suggest.
If you don't need to extract the id number from the filename, this should do.
$ c=1; for f in File*.csv;
do
sed -i '1s/$/,#ID/; 2,$s/$/,'$c'/' "$f";
c=$((c+1));
done
Note that this is an in-place edit. Perhaps make a backup or test first.
UPDATE
If you don't need the individual files to be updated, this may work better for you
$ awk -v OFS=, 'BEGIN {f="allFiles.csv"}
FNR==1 {c++; print $0,"#ID" > f; next}
{print $0,c > f}' File*.csv
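With the two sample files above, allFiles.csv would then contain (note that the header line is repeated once per input file):
#name,#age,#height,#ID
A,12,4.5,1
B,13,5.0,1
#name,#age,#height,#ID
C,11,4.6,2
D,12,4.3,2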
awk -F, -v OFS=, '
FNR == 1 {
$(NF + 1) = "#ID"
i++
f = FILENAME
sub(/Input/, "Output", f)
} FNR != 1 {
$(NF + 1) = i
} {
print > f
}' Input*.csv
With GNU awk for inplace editing and ARGIND:
awk -i inplace -v OFS=, '{print $0, (FNR==1 ? "#ID" : ARGIND)}' File*.csv
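If your awk is not GNU awk (no -i inplace, no ARGIND), a portable sketch with the same effect uses a shell counter and a temporary file per input (the temp-file naming here is arbitrary):
id=1
for f in File*.csv; do
awk -v id="$id" -v OFS=, '{print $0, (FNR==1 ? "#ID" : id)}' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
id=$((id+1))
done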

Change delimiters in files

Below I show the files as they should be, and further down, what I have made so far. I think the delimiters in my code are the source of the problem, but I can't get it much better.
My source file uses ; as the delimiter, while the files for my database use , as the separator and have the strings wrapped in double quotes:
The category file should be like this:
"1","1","testcategory","testdescription"
And the manufacturers file, like this:
"24","ASUS",NULL,NULL,NULL
"23","ASROCK",NULL,NULL,NULL
"22","ARNOVA",NULL,NULL,NULL
What I have at this moment:
- category file:
1;2;Alarmen en beveiligingen;
2;2;Apparatuur en toebehoren;
3;2;AUDIO;
- manufacturers file:
315;XTREAMER;NULL;NULL;NULL
316;XTREMEMAC;NULL;NULL;NULL
317;Y-CAM;NULL;NULL;NULL
318;ZALMAN;NULL;NULL;NULL
I tried a few things with sed; first, on the categories file:
cut -d ";" -f1 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/categories_description-in.csv
sed 's/^/;2;/g' /home/arno/pixtmp/categories_description-in.csv > /home/arno/pixtmp/categories_description-in.tmp
sed -e "s/$/;/" /home/arno/pixtmp/categories_description-in.tmp > /home/arno/pixtmp/categories_description-in.tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/categories_description-in.tmp2 > /home/arno/pixtmp/categories_description$
And then on the manufacturers file:
cut -d ";" -f5 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/manufacturers-in
sed 's/^/;/g' /home/arno/pixtmp/manufacturers-in > /home/arno/pixtmp/manufacturers-tmp
sed -e "s/$/;NULL;NULL;NULL/" /home/arno/pixtmp/manufacturers-tmp > /home/arno/pixtmp/manufacturers-tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/manufacturers-tmp2 > /home/arno/pixtmp/manufacturers.ok
You were trying to solve the problem by using cut, sed, and AWK. AWK by itself is powerful enough to solve your problem.
I wrote one AWK program that can handle both of your examples. If NULL is not a special case and the manufacturers file simply has a different format, you will need two AWK programs, but I think it should be clear how to do that.
All we do here is tell AWK that the "field separator" is the semicolon. Then AWK splits the input lines into fields for us. We loop over the fields, printing as we go.
#!/usr/bin/awk -f
BEGIN {
FS = ";"
DQUOTE = "\""
}
function add_quotes(s) {
if (s == "NULL")
return s
else
return DQUOTE s DQUOTE
}
NF > 0 {
# if input ended with a semicolon, last field will be empty
if ($NF == "")
NF -= 1 # subtract one from NF to forget the last field
if (NF > 0)
{
for (i = 1; i <= NF - 1; ++i)
printf("%s,", add_quotes($i))
printf("%s\n", add_quotes($i))
}
}
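Saved as convert.awk (the name is arbitrary) and run against the ;-separated manufacturers file produced by your pipeline (manufacturers.ok), it produces the target format:
$ awk -f convert.awk manufacturers.ok
"315","XTREAMER",NULL,NULL,NULL
"316","XTREMEMAC",NULL,NULL,NULL
"317","Y-CAM",NULL,NULL,NULL
"318","ZALMAN",NULL,NULL,NULL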

(sed/awk) Extract values from text to csv file - even/odd lines pattern

I need to extract some numeric values from a given ASCII text file and export them to a specifically formatted csv file. The input file has an even/odd line pattern:
SCF Done: E(UHF) = -216.432419652 A.U. after 12 cycles
CCSD(T)= -0.21667965032D+03
SCF Done: E(UHF) = -213.594303492 A.U. after 10 cycles
CCSD(T)= -0.21379841974D+03
SCF Done: E(UHF) = -2.86120139864 A.U. after 6 cycles
CCSD(T)= -0.29007031339D+01
and so on
I need the value from the 5th column of the odd lines and the value from the 2nd column of the even lines. They should be printed to a semicolon-separated csv file, with 10 values in each row. So the output should look like
-216.432419652;-0.21667965032D+03;-213.594303492;-0.21379841974D+03;-2.86120139864;-0.29007031339D+01; ... (line break after 5 pairs of values)
I started with awk '{print $5}' and awk '{print $2}', but I was not successful in creating a pattern that acts only on even or odd lines.
Is there a simple way to do that?
The following script doesn't use a lot of the great power of awk, but will do the job for you and is hopefully understandable:
NR % 2 { printf "%s;", $5 }
NR % 2 == 0 { printf "%s;", $2 }
NR % 10 == 0 { print "" }
END { print "" }
Usage (save the above as script.awk):
awk -f script.awk input.txt
Given a file called data.txt, try:
awk '/SCF/{ printf "%s;", $5 } /CCSD/{ printf "%s;", $2 } NR % 10 == 0 { printf "\n" }' data.txt
Something like this could work -
awk '{x = NF > 3 ? $5 : $2; printf("%s;", x)} NR % 10 == 0 {print ""}' file
The ternary operator checks the number of fields (NF): a line with more than 3 fields is an "SCF Done" line, so we take $5; otherwise it is a "CCSD(T)=" line and we take $2. Each value is printed followed by a ";". NR is a built-in that keeps track of the current record (line) number, so the modulo test NR % 10 == 0 prints a newline after every 10 input lines, i.e. after 5 pairs of values.
This might work for you:
tr -s ' ' ';' <file | paste -sd';\n' | cut -d';' -f5,11 | paste -sd';;;;\n'
Here tr squeezes each run of spaces into a single semicolon, the first paste joins every SCF/CCSD pair of lines onto one line, cut keeps the 5th and 11th fields (the two wanted values), and the final paste glues five of those pairs together per output row.