Truncate CSV Header Names

I'm looking for a relatively simple method for truncating CSV header names to a given maximum length. For example a file like:
one,two,three,four,five,six,seven
data,more data,words,,,data,the end
Could limit all header names to a max of 3 characters and become:
one,two,thr,fou,fiv,six,sev
data,more data,words,,,data,the end
Requirements:
Only the first row is affected
I don't know what the headers are going to be, so it has to dynamically read and write the values and lengths
I tried a few things with awk and sed, but am not proficient at either. The closest I found was this snippet:
csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' '{ for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) } { printf("\"%s\"\n", $0) }' >tmp-3rd
But it operates on a single column rather than the header row, and pulling in csvcut also feels more complicated than necessary.
Any help is appreciated.

With GNU sed:
sed -E '1s/([^,]{1,3})[^,]*/\1/g' file
Output:
one,two,thr,fou,fiv,six,sev
data,more data,words,,,data,the end
See: man sed and The Stack Overflow Regular Expressions FAQ
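If your sed lacks -E, the same substitution should also work as a POSIX basic regular expression (a sketch; interval expressions like \{1,3\} are POSIX but worth testing on your sed):
sed '1s/\([^,]\{1,3\}\)[^,]*/\1/g' file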

With your shown samples, please try the following awk program. A brief explanation: set the field separator and output field separator to a comma; then on the first line, shorten each field to 3 characters and print it (with a newline after the last field of the first line); print the remaining lines as they are.
awk '
BEGIN { FS=OFS="," }
FNR==1 {
  for (i=1; i<=NF; i++) {
    printf("%s%s", substr($i, 1, 3), (i==NF ? ORS : OFS))
  }
  next
}
1
' Input_file
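If you prefer to let awk rebuild the record itself, here is a shorter sketch of the same idea; assigning to $i forces awk to rejoin the fields with OFS, so only the first line is rewritten:
awk 'BEGIN{FS=OFS=","} FNR==1{for(i=1;i<=NF;i++) $i=substr($i,1,3)} 1' Input_file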

Related

CSV Column Insertion via awk

I am trying to insert a column in front of the first column in a comma separated value file (CSV). At first blush, awk seems to be the way to go, but I'm struggling with how to move down the new column.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution available in bash. For other shells use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file
With awk only, you could try the following solution, written and tested with the shown samples.
awk -v value="$(echo -e "col\n0\nA\n2\nC")" '
BEGIN{
  FS=OFS=","
  num=split(value,arr,ORS)
  for(i=1;i<=num;i++){
    newVal[i]=arr[i]
  }
}
{
  $1=newVal[FNR] OFS $1
}
1
' Input_file
Explanation:
First of all, create an awk variable named value whose value is the output of the shell's echo command. NOTE: using the -e option with echo makes sure the \n sequences are treated as newlines rather than literal characters.
Then, in the BEGIN section of the awk program, set FS and OFS to , for all lines of Input_file.
Use the split function on the value variable to fill an array named arr, with ORS (newline) as the delimiter.
Then traverse a for loop from 1 till the value of num (the total number of values produced by echo).
Create an array named newVal whose index is i (1, 2, 3 and so on) and whose value is the corresponding arr value.
In the main awk program, prepend the newVal entry for the current line number (FNR), followed by OFS, to the first field, and then print the line.
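For comparison, a minimal sketch of the same idea without the helper loop, assuming the new column values are known inline; print arr[FNR], $0 joins the new value and the old line with OFS:
awk 'BEGIN{FS=OFS=","; split("col,0,A,2,C",arr,",")} {print arr[FNR], $0}' Input_file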

transform multiline text into csv with awk sed and grep

I run a shell command that returns a list of repeated values like this (note the indentation):
Name: vm346
cpu 1 (12%) 6150m (76%)
memory 1130Mi (7%) 1130Mi (7%)
Name: vm847
cpu 6 (75%) 30150m (376%)
memory 12980Mi (87%) 12980Mi (87%)
Name: vm848
cpu 3500m (43%) 17150m (214%)
memory 6216Mi (41%) 6216Mi (41%)
I am trying to transform that data like this (in csv):
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The problem is that any given dataset like the one above is always on more than one line.
When I pipe that into awk it drives me mad, because even if I use:
BEGIN{ FS="\n" }
to try and stitch the data together in one line, it doesn't work. No matter what I do, awk keeps the name value as a separated line above everything else.
I am sorry I haven't much code to share but I have been spinning my wheels with this for a few hours now and I am running out of ideas...
I can solve this in Perl:
perl -ane 'print join ",", @F[1 .. $#F]; print $F[0] eq "memory" ? "\n" : ","'
It should be easy to translate it to awk if you need it.
How does it work?
-a splits each line on whitespace into the @F array
-n reads the input line by line and runs the code specified after -e for each line
We print all the elements but the first one separated by commas (see join)
We then look at the first column, if it's memory, we are at the last line of the block, so we print a newline, otherwise we print a comma
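For reference, a rough awk sketch of the same logic, assuming (like the Perl version) that memory always marks the last line of a block:
awk '{
  for (i = 2; i <= NF; i++)                     # print everything but the first field
    printf "%s%s", $i, (i < NF ? "," : "")
  printf "%s", ($1 == "memory" ? "\n" : ",")    # newline at end of block, comma otherwise
}' file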
With AWK, one option is to set RS to "Name: ", and ignore the first record with NR > 1, e.g.
awk -v RS="Name: " 'BEGIN{OFS=","} NR > 1 {print $1, $3, $4, $5, $6, $8, $9, $10, $11}' file
#> vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
#> vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
#> vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
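If the number of metric fields can vary, a hedged variant of the same RS idea skips the label words instead of hardcoding field positions (this assumes the labels are always cpu and memory):
awk -v RS="Name: " -v OFS=, 'NR > 1 {
  out = $1                                  # the vm name
  for (i = 2; i <= NF; i++)
    if ($i != "cpu" && $i != "memory")      # drop the row labels
      out = out OFS $i
  print out
}' file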
awk '{$1=""}1' | paste -sd'  \n' - | awk '{$1=$1}1' OFS=,
Get rid of the first column. Join every three rows. Same idea with sed:
sed 's/^ *[^ ]* *//' | paste -sd'  \n' - | sed 's/  */,/g'
Something else:
awk -v OFS=, '
$1=="Name:" {
  sep=ors
  ors=ORS
} {
  for (i=2;i<=NF;++i) {
    printf "%s%s",sep,$i
    sep=OFS
  }
} END {printf "%s",ors}'
Or if you want to print an ORS based on the first field being "memory" (note that this program may end without printing a terminating ORS):
awk -v OFS=, '{for (i=2;i<=NF;++i) printf "%s%s",$i,(i==NF && $1=="memory" ? ORS : OFS)}'
something else else:
awk -v OFS=, '
index($0,$1)==1 {
  OFS=ors
  ors=ORS
} {
  $1=""
  printf "%s",$0
  OFS=ofs
} END {printf "%s",ors} BEGIN {ofs=OFS}'
This might work for you (GNU sed):
sed -nE '/^ +\S+ +/{s///;H;$!d};x;/./s/\s+/,/gp;x;s/^\S+ +//;h' file
In overview the sed program processes indented lines, already gathered lines (except in the case that the current line is the first line of the file) and non-indented lines.
Turn off implicit printing and enable extended regexps (-nE).
If the current line is indented, remove the indent, the first field and any following spaces, append the result to the hold space and if it is not the last line, delete it.
Otherwise, check the hold space for gathered lines and if found, replace one or more whitespaces by commas and print the result. Then prep the current line by removing the first field and any following spaces and replace the hold space with the result.
The solution seems logically back-to-front, but programming in this style avoids having to check for end-of-file multiple times and invoking labels and gotos.
N.B. This solution will work for any number of indented lines.
Here is a ruby to do that:
ruby -e '
s=$<.read
s.scan(/^([^ \t]+:)([\s\S]+?)(?=^\1|\z)/m). # parse blocks
map(&:last). # get data part
# parse and join the data fields:
map{|block| block.split(/\n[ \t]+[^ \t]+[ \t]+/)}.
map{|lines| lines.map(&:strip).join(" ").split().join(",")}.
each{|l| puts "#{l}"}
' file
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The advantage is that this is not dependent on the number of lines or the number of fields. It is parsing data that is in blocks of the form:
START: ([ \t]+[data_with_no_space])*\n
l1 ([ \t]+[data_with_no_space])*\n
...
START:
...
Works this way:
Parse the blocks with the regex shown above;
Save an array of the data elements;
Join the sub arrays and then split into data fields;
Join(',') to make a csv.

Awk 4.1.4 Error when processing large file

I am using Awk 4.1.4 on CentOS 7.6 (x86_64) with 250 GB RAM to transform a row-wide csv file into a column-wide csv based on the last column (Sample_Key). Here is a small example row-wide csv
Probe_Key,Ind_Beta,Sample_Key
1,0.6277,7417
2,0.9431,7417
3,0.9633,7417
4,0.8827,7417
5,0.9761,7417
6,0.1799,7417
7,0.9191,7417
8,0.8257,7417
9,0.9111,7417
1,0.6253,7387
2,0.9495,7387
3,0.5551,7387
4,0.8913,7387
5,0.6197,7387
6,0.7188,7387
7,0.8282,7387
8,0.9157,7387
9,0.9336,7387
This is what the correct output looks like for the above small csv example
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
Here is the awk code (based on https://unix.stackexchange.com/questions/522046/how-to-convert-a-3-column-csv-file-into-a-table-or-matrix) to achieve the row-to-column-wide transformation
BEGIN{
  printf "Probe_Key,ind_beta,Sample_Key\n";
}
NR > 1 {
  ks[$3 $1] = $2;  # save the second column using the first and third as index
  k1[$1]++;        # save the first column
  k2[$3]++;        # save the third column
}
END {
  # After processing input
  for (i in k2)                    # loop over third column
  {
    printf "%s,", i;               # print it as first value in the row
    for (j in k1)                  # loop over the first column (index)
    {
      if ( j < length(k1) )
      {
        printf "%s,", ks[i j];     # and print values ks[third_col first_col]
      }
      else
        printf "%s", ks[i j];      # print last value
    }
    print "";                      # newline
  }
}
However, when I input a relatively large row-wide csv file (5 GB in size), I get tons of values without any commas in the output, then values with commas start to appear, then values without commas again. This keeps going. Here is a small excerpt from the portion without commas
0.04510.03580.81470.57690.8020.89630.90950.10880.66560.92240.05 060.78130.86910.07330.03080.0590.06440.80520.05410.91280.16010.19420.08960.0380.95010.7950.92760.9410.95710.2830.90790 .94530.69330.62260.90520.1070.95480.93220.01450.93390.92410.94810.87380.86920.9460.93480.87140.84660.33930.81880.94740 .71890.11840.05050.93760.94920.06190.89280.69670.03790.8930.84330.9330.9610.61760.04640.09120.15520.91850.76760.94840. 61340.02310.07530.93660.86150.79790.05090.95130.14380.06840.95690.04510.75220.03150.88550.82920.11520.11710.5710.94340 .50750.02590.97250.94760.91720.37340.93580.84730.81410.95510.93080.31450.06140.81670.04140.95020.73390.87250.93680.20240.05810.93660.80870.04480.8430.33120.88170.92670.92050.71290.01860.93260.02940.91820
and when I use the largest row-wide csv file (126 GB in size), I get the following error
ERROR (EXIT CODE 255) Unknow error code
How do I debug the two situations, given that the code works for small input sizes?
Instead of trying to hold all 5 GB (or 126 GB) of data in memory at once and printing everything out together at the end, here's an approach using sort and GNU datamash to group each set of values together as they come through its input:
$ datamash --header-in -t, -g3 collapse 2 < input.csv | sort -t, -k1,1n
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
This assumes your file is already grouped with all the identical third column values together in blocks, and the first/second columns sorted in the appropriate order already, like your sample input. If that's not the case, the slower:
$ tail -n +2 input.csv | sort -t, -k3,3n -k1,1n | datamash -t, -g3 collapse 2
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
If you can get rid of that header line so sort can be passed the file directly instead of in a pipe, it might be able to pick a more efficient sorting method knowing the full size in advance.
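For example, a sketch of that idea (noheader.csv is an arbitrary temporary name):
tail -n +2 input.csv > noheader.csv
sort -t, -k3,3n -k1,1n noheader.csv | datamash -t, -g3 collapse 2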
If your data is already grouped on field 3 and sorted on field 1, you can simply do
$ awk -F, 'NR==1 {next}
{if(p!=$3)
{if(p) print v; v=$3 FS $2; p=$3}
else v=v FS $2}
END{print v}' file
7417,0.6277,0.9431,0.9633,0.8827,0.9761,0.1799,0.9191,0.8257,0.9111
7387,0.6253,0.9495,0.5551,0.8913,0.6197,0.7188,0.8282,0.9157,0.9336
If not, pre-sorting is a better idea than caching all the data in memory, which will blow up for large input files.
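For example, a sketch combining pre-sorting with the grouping loop above; the header is stripped first, so the NR==1 guard is no longer needed:
tail -n +2 file | sort -t, -k3,3n -k1,1n |
awk -F, '{if(p!=$3){if(p) print v; v=$3 FS $2; p=$3} else v=v FS $2} END{print v}'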

find and replace multiple patterns in a specific csv column with sed

I have a csv file like this:
2018-May-17 21:33:16,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
2018-May-17 21:34:15,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
2018-May-17 21:35:17,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
I need to convert only the first column into a YYYYMMDDHHmmss format like this:
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
How can I achieve this with sed without modifying the other columns?
$ awk -F'[- :,]' '{
t = $1 sprintf("%02d",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3) $3 $4 $5 $6
sub(/[^,]+/,t)
}1' file
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
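The compact month lookup works because index returns the 1-based position of the month name inside the packed string of all month names, and (position+2)/3 converts that position into a month number. A tiny illustration:
awk 'BEGIN {
  pos = index("JanFebMarAprMayJunJulAugSepOctNovDec", "May")  # pos is 13
  printf "%02d\n", (pos + 2) / 3                              # prints 05
}'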
There are two ways to do the replacement, but both need a helper shell script.
PHP version
sed -r 's/([^,]*),(.*)/echo $(echo "\1"|.\/php.sh),\2/e' file
php.sh
#!/bin/sh
read str
php -r "echo date('YmdHis', strtotime('$str'));"
bash version
sed -r 's/([^-]*)-([^-]*)-([0-9]{1,2})[[:space:]]*([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}),(.*)/echo \1$(echo "\2"\|.\/help.sh)\3\4\5\6,\7/e' file
help.sh
#!/bin/sh
read str
case $str in
Jan) MON=01 ;;
Feb) MON=02 ;;
Mar) MON=03 ;;
Apr) MON=04 ;;
May) MON=05 ;;
Jun) MON=06 ;;
Jul) MON=07 ;;
Aug) MON=08 ;;
Sep) MON=09 ;;
Oct) MON=10 ;;
Nov) MON=11 ;;
Dec) MON=12 ;;
esac
echo $MON
Output:
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
For more information about the use of echo embedded in sed, see the description of the e flag to the s command in the GNU sed manual.
The following awk may help you with the same.
awk -F"," '
BEGIN{
num=split("jan,feb,mar,apr,may,jun,jul,aug,sept,oct,nov,dec",array,",");
for(i=1;i<=num;i++){
month[array[i]]=sprintf("%02d",i)}
}
{
split($1,a,"[- ]");
a[2]=month[tolower(a[2])];
$1=a[1] a[2] a[4];
gsub(/:/,"",$1)
}
1' OFS="," Input_file
Explanation of code:
awk -F"," '  ##Setting field separator as comma here for all lines.
BEGIN{  ##Starting BEGIN section of awk here.
num=split("jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec",array,",");  ##Using split to create an array of month names; its length is stored in the num variable.
for(i=1;i<=num;i++){  ##Starting a for loop from i=1 till the value of num.
month[array[i]]=sprintf("%02d",i)}  ##Creating an array month whose index is the month name and whose value is i, zero-padded to 2 digits.
}
{  ##Starting main section here, which will be executed while awk reads Input_file.
split($1,a,"[- ]");  ##Using split to split $1 into array a on the - and space delimiters.
a[2]=month[tolower(a[2])];  ##Replacing the 2nd element of array a with its month array value, to get the month in digit form.
$1=a[1] a[2] a[3] a[4];  ##Re-creating the first field from the first, second, third and fourth elements of array a.
gsub(/:/,"",$1)  ##Globally substituting colons with NULL in the first field.
}
1  ##Using 1 here to print the current line.
' OFS="," Input_file  ##Setting output field separator as comma and mentioning Input_file name here.
awk -F, '{ gsub(/:| /, "", $1);
           x=(match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1,6,3))+2)/3;
           x=x>9?x:"0"x; gsub(/-.*-/, x, $1) }1' OFS=, infile
Output:
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
How it works
this -F, defines the delimiter that separates fields.
this gsub(/:| /, "", $1) removes spaces and colons from the first field.
this substr($1,6,3) returns the month name from the first field.
this match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1,6,3)) returns the character position (index) at which the month name begins in the string of all month names; for May that is 13. The result of match(...) is always one of 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34. Since each month name is 3 characters long, we add 2 and divide by 3 to turn the position into a month number: (13+2)/3 = 5.
this x=x>9?x:"0"x prepends a 0 to the number above if it is less than 10.
this gsub(/-.*-/, x, $1) replaces the match between hyphens (which is the month name) with the value of x, in the first field only.
this 1 is an always-true condition and causes awk to print the line it read.
this OFS=, sets the Output Field Separator back to the comma ,.
Sed one-liner:
$ cat file.csv | sed 's/^\([[:digit:]]*\)-\([^ ]*\)\(.*\)/\2-\1\3/g' | sed 's/\([^,]*\),\(.*\)/echo $(date -d "\1" +%Y%m%d%H%M%S ),\2/e'
Explanation
Convert the %Y-%b-%d date to %b-%d-%Y format so that it can be consumed by date -d.
Use sed to substitute only the first column.
Use date's -d option to read the date input.
Use date's +%Y%m%d%H%M%S format to print the output.
This might work for you (GNU sed):
m="Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12"
sed -E 's/$/\n'"$m"'/;s/-(...)-(..) (..):(..):(.*)\n.*\1(..).*/\6\2\3\4\5/' file
Append a lookup table to the end of each line and using pattern matching, grouping and back references, transform the first column to the required specification.
Alternative, less messy and more efficient:
cat <<\! | sed -Ef - file
1{x;s/^/Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;x}
G
s/-(...)-(..) (..):(..):(.*)\n.*\1(..).*/\6\2\3\4\5/
P
d
!

Extract column data from csv file based on row values

I am trying to use awk/sed to extract specific column data based on row values. My actual files have 15 columns and over 1,000 rows (from a .csv file).
Simple EXAMPLE: Input: a csv file with a total of 5 columns and 100 rows. Output: data from columns 2 through 5 based on specific row values from column 2. (I have a specific list of the row values I want the operator to filter out. The values are numbers.)
File looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The operator response:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output Example: I want the full rows where column 2 equals 7439 or 7500.
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain, other than that the convenience variable q, defined as a double quote, helps eliminate escaping.
awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"