Extract column data from csv file based on row values - csv

I am trying to use awk/sed to extract specific column data based on row values. My actual files have 15 columns and over 1,000 rows (from a .csv file).
Simple EXAMPLE: Input: a csv file with a total of 5 columns and 100 rows. Output: data from columns 2 through 5 based on specific row values in column 2. (I have a specific list of the row values I want to filter on. The values are numbers.)
File looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The terminal response:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output example: I want the full rows where column 2 equals 7439 or 7500.
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"

here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain: the convenience variable q, defined as a double quote, avoids the need for escaping inside the script.

awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
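Since the question mentions a specific list of row values to filter on, one way (a sketch; the file names `file` and `ids.txt` are made up for the example) is to load the IDs from a second file in a first pass, then print the header plus any matching rows:

```shell
# Build the sample input and a hypothetical id list.
cat > file <<'EOF'
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
EOF
printf '7439\n7500\n' > ids.txt

# First pass (NR==FNR) loads the ids, wrapped in quotes to match the data;
# second pass prints the header line plus any row whose $2 is in the list.
awk -F, -v q='"' 'NR==FNR{want[q $0 q]; next} FNR==1 || $2 in want' ids.txt file
# "Date","IdNo","Color","Height","Education"
# "06/02/16","7439","Yellow","57","3"
# "06/03/16","7500","Red","55","3"
```

This scales to any number of IDs without editing the awk program itself.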

CSV Column Insertion via awk

I am trying to insert a column in front of the first column in a comma separated value file (CSV). At first blush, awk seems to be the way to go, but I'm struggling with how to vary the new column's values from row to row.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution available in bash. For other shells use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file
With awk alone you could try the following solution, written and tested against the samples shown.
awk -v value="$(printf 'col\n0\nA\n2\nC')" '
BEGIN{
FS=OFS=","
num=split(value,arr,ORS)
}
{
$1=arr[FNR] OFS $1
}
1
' Input_file
Explanation:
First, an awk variable named value is created from the output of printf (preferred here over echo -e, whose handling of \n escapes varies between shells).
Then, in the BEGIN section of the awk program, FS and OFS are set to , for all lines of Input_file.
The split function breaks value into an array named arr on the delimiter ORS (the newline), storing the element count in num.
In the main awk program, the arr value for the current line number (FNR) and OFS are prepended to the first field, and the trailing 1 prints each rebuilt line.
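As a quick sanity check, the approach above can be exercised end to end (the file name `file` is illustrative; `printf` is used so the `\n` escapes are handled portably):

```shell
cat > file <<'EOF'
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
EOF

# Prepend one value from the list to each line, in order.
awk -v value="$(printf 'col\n0\nA\n2\nC')" '
BEGIN{ FS=OFS=","; split(value,arr,"\n") }
{ $1 = arr[FNR] OFS $1 } 1' file
# col,A,B,C,D,E,F
# 0,1,2,3,4,5,6
# A,2,3,4,5,6,7
# 2,3,4,5,6,7,8
# C,4,5,6,7,8,9
```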

Create CSV file with below output line

I have the output line below, and from it I want to create a CSV file. The whole line should go in the first column, and the string after the second ":" delimiter in the second column. The script I am using below separates the data wherever "," is present instead. Please help me sort the data into the proper format.
output line :/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe
shell script:
input="out.txt"
while IFS= read -r LINES
do
  #echo "$LINES"
  if [[ $LINES = /* ]]
  then
    filename=$(echo "$LINES" | cut -d ":" -f1)
    echo "$LINES,$filename" >> out.csv
  fi
done < "$input"
I don't think I understand your question correctly.
You currently have this output
:/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe
And you would like to have this kind of two-column CSV output:
column 2: /home/nagios/NaCl/files/chk_raid.pl:token=13704value=undef;next};my($lhys,$lytrn,$ccdethe
column 3: token=13704value=undef;next};my($lhys,$lytrn,$ccdethe
If that's what you want, you can use Miller like this
echo ":/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my(\$lhys,\$lytrn,\$ccdethe" | mlr --n2c --ifs ":" cut -x -f 1 then put '$2=$2.":".$3'
and you will have this two-column CSV
2,3
"/home/nagios/NaCl/files/chk_raid.pl:token=13704value=undef;next};my($lhys,$lytrn,$ccdethe","token=13704value=undef;next};my($lhys,$lytrn,$ccdethe"
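If Miller isn't available, a plain awk sketch of the same split is possible (note that inside single quotes `$$` stays literal rather than expanding to the shell's PID, which is where the 13704 above came from):

```shell
line=':/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe'
printf '%s\n' "$line" | awk '{
  whole = substr($0, 2)                          # everything after the leading ":"
  rest  = substr(whole, index(whole, ":") + 1)   # text after the second ":"
  printf "\"%s\",\"%s\"\n", whole, rest
}'
# "/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe","token=$$value=undef;next};my($lhys,$lytrn,$ccdethe"
```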

find and replace multiple patterns in a specific csv column with sed

I have a csv file like this:
2018-May-17 21:33:16,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
2018-May-17 21:34:15,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
2018-May-17 21:35:17,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
I need to convert only the first column into a YYYYMMDDHHmmss format like this:
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
How can I achieve this with sed without modifying the other columns?
$ awk -F'[- :,]' '{
t = $1 sprintf("%02d",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3) $3 $4 $5 $6
sub(/[^,]+/,t)
}1' file
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
There are two ways to do the replacement, but both need a helper shell script.
PHP version
sed -r 's/([^,]*),(.*)/echo $(echo "\1"|.\/php.sh),\2/e' file
php.sh
#!/bin/sh
read str
php -r "echo date('YmdHis', strtotime('$str'));"
bash version
sed -r 's/([^-]*)-([^-]*)-([0-9]{1,2})[[:space:]]*([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}),(.*)/echo \1$(echo "\2"\|.\/help.sh)\3\4\5\6,\7/e' file
help.sh
#!/bin/sh
read str
case $str in
Jan) MON=01 ;;
Feb) MON=02 ;;
Mar) MON=03 ;;
Apr) MON=04 ;;
May) MON=05 ;;
Jun) MON=06 ;;
Jul) MON=07 ;;
Aug) MON=08 ;;
Sep) MON=09 ;;
Oct) MON=10 ;;
Nov) MON=11 ;;
Dec) MON=12 ;;
esac
echo $MON
Output:
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
For more information about the e flag of the s command (which executes the pattern space as a shell command), see the GNU sed manual.
Following awk may help you on same.
awk -F"," '
BEGIN{
num=split("jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec",array,",");
for(i=1;i<=num;i++){
month[array[i]]=sprintf("%02d",i)}
}
{
split($1,a,"[- ]");
a[2]=month[tolower(a[2])];
$1=a[1] a[2] a[3] a[4];
gsub(/:/,"",$1)
}
1' OFS="," Input_file
Explanation of code:
awk -F"," ' ##Setting the field separator as comma for all lines.
BEGIN{ ##Starting the BEGIN section of the awk program.
num=split("jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec",array,",");##Using split to create an array of month names; its length is stored in the variable num.
for(i=1;i<=num;i++){ ##Looping from i=1 to the value of num.
month[array[i]]=sprintf("%02d",i)} ##Creating an array month, indexed by month name, whose value is the zero-padded month number.
}
{ ##Main section, executed for each line of Input_file.
split($1,a,"[- ]"); ##Splitting $1 into array a on the space and - delimiters.
a[2]=month[tolower(a[2])]; ##Replacing the month name in a[2] with its two-digit number.
$1=a[1] a[2] a[3] a[4]; ##Re-creating the first field from the first four values of array a (year, month, day, time).
gsub(/:/,"",$1) ##Globally substituting colons with nothing in the first field.
}
1 ##Using 1 here to print the current line.
' OFS="," Input_file ##Setting the output field separator to comma and naming the Input_file.
awk -F, '{ gsub(/:| /, "", $1);
x=(match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1,6,3))+2)/3;
x=x>9?x:"0"x; gsub(/-.*-/, x, $1) }1' OFS=, infile
Output:
20180517213316,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213415,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
20180517213517,VF-AUDI-prod,Start:2018-May-17:End:2018-May-19
How it works
this -F, defines the delimiter that separates fields.
this gsub(/:| /, "", $1) removes spaces and colons from the first field.
this substr($1,6,3) returns the month name from the first field.
this match("JanFebMarAprMayJunJulAugSepOctNovDec", substr($1,6,3)) returns the character position (index) at which the month name begins in the string of all month names; the result is always one of 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34. For "May" we get 13, and since each month name is 3 characters long, adding 2 and dividing by 3 turns the position into the month number: (13+2)/3 = 5.
this ternary prepends a 0 to the number above if it is less than 10.
this gsub(/-.*-/, x, $1) replaces the match between hyphens (the month name) with the value of x, in the first field only.
this 1 is an always-true condition and causes awk to print the line it read.
this OFS=, sets the Output Field Separator back to the comma ,.
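The index arithmetic above can be checked in isolation with a throwaway one-liner:

```shell
# index("Jan...Dec", "May") is 13; (13+2)/3 gives month number 5, padded to "05".
awk 'BEGIN{
  months = "JanFebMarAprMayJunJulAugSepOctNovDec"
  printf "%02d\n", (index(months, "May") + 2) / 3
}'
# prints 05
```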
Sed one-liner:
$ sed 's/^\([[:digit:]]*\)-\([^ ]*\)\(.*\)/\2-\1\3/g' file.csv | sed 's/\([^,]*\),\(.*\)/echo $(date -d "\1" +%Y%m%d%H%M%S ),\2/e'
Explanation
Convert the %Y-%b-%d date to %b-%d-%Y format so it can be consumed by date -d.
Use sed to substitute only the first column.
Use date's -d option to read the date input.
Use date's +%Y%m%d%H%M%S format to print the output.
This might work for you (GNU sed):
m="Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12"
sed -E 's/$/\n'"$m"'/;s/-(...)-(..) (..):(..):(.*)\n.*\1(..).*/\6\2\3\4\5/' file
Append a lookup table to the end of each line and, using pattern matching, grouping and back references, transform the first column to the required specification.
Alternative, less messy and more efficient:
cat <<\! | sed -Ef - file
1{x;s/^/Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;x}
G
s/-(...)-(..) (..):(..):(.*)\n.*\1(..).*/\6\2\3\4\5/
P
d
!

I just want the last 3 characters of a column returned to the original file

first 2 lines of my data:
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
I only want the last 3 characters of column 2 and column 3; I don't want the column header affected.
Happy for a solution that does column 2 first and then column 3.
I am fiddling with sed and awk at the minute but have no joy yet.
this is what I want:
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
edit1: this gives me the last 3 digits (plus a "), I just need to write this back to the original file?
$ awk -F"," 'NR>1{ print $2}' head_test_real.csv | sed 's/.*\(....\)/\1/'
427"
592"
007"
592"
409"
742"
387"
731"
556"
edit2: this works but I lose the double quotes ("123427" becomes 427); I would like to keep the double quotes.
* NR>1 operates on the rows after the 1st row.
$ awk -F, 'NR>1{$2=substr($2,length($2)-3,3)}1' OFS=, head_test_real.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06",427,"456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
edit3: thanks @Mark for the correct answer; here, just for my reference, the effect of the quoting options.
$ ####csv.QUOTE_ALL
$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
$ ####csv.QUOTE_MINIMAL
$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan
$ ###csv.QUOTE_NONNUMERIC
$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
$ ###csv.QUOTE_NONE
$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan
While awk seems like a natural fit for comma-separated data, it doesn't deal well with the quoted-fields version. I would recommend using a dedicated CSV-processing library like the one that ships with Python (both 2 and 3):
import csv
with open('in.csv','r') as infile:
    reader = csv.reader(infile)
    with open('out.csv','w') as outfile:
        writer = csv.writer(outfile,delimiter=',',quotechar='"',quoting=csv.QUOTE_ALL)
        writer.writerow(next(reader))
        for row in reader:
            row[1] = row[1][-3:]
            row[2] = row[2][-3:]
            writer.writerow(row)
Put the above code into a file named e.g. fixcsv.py and make the filenames match what you have and want, then just run it with python fixcsv.py (or python3 fixcsv.py).
I set it to quote everything in the output (QUOTE_ALL); if you don't want it to do that, you can set it to QUOTE_MINIMAL, QUOTE_NONNUMERIC or QUOTE_NONE.
The row assignments replace the second and third fields (row[1] and row[2], since the first field is row[0]) with their last three characters ([-3:]). You could also do it arithmetically with e.g. row[1] = int(row[1]) % 1000.
$ awk 'BEGIN{FS=OFS="\",\""} NR>1{for (i=2;i<=3;i++) $i=substr($i,length($i)-2)} 1' file
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
As with any command, to write back to the original file is just:
command file > tmp && mv tmp file
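Putting the awk answer and the write-back idiom together (file names `file` and `tmp` are illustrative):

```shell
cat > file <<'EOF'
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
EOF

# The "," field separator treats the quotes as part of the delimiter,
# so the quotes survive; NR>1 leaves the header untouched.
awk 'BEGIN{FS=OFS="\",\""} NR>1{for (i=2;i<=3;i++) $i=substr($i,length($i)-2)} 1' file > tmp && mv tmp file
cat file
# "Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
# "2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
```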
Perl to the rescue!
perl -pe 's/",".*?(...",")/","$1/ if $. > 1' < input > output
-p reads the input line by line and prints the result
s/regex/replacement/ is a substitution
.*? matches anything (like .*), but the question mark makes it "frugal", i.e. it matches the shortest string possible
(...",") creates a capture group starting three characters before ","; it can be referenced as $1.
$. is the line number, no replacement happens on line 1.
Make sure the first two columns are always quoted and the second column is never shorter than 3 characters.
To modify the third column, you can modify the regex to
perl -pe 's/^("(?:.*?","){2}).*?(...",")/$1$2/ if $. > 1'
Change the {2} count to handle any column you like.

(sed/awk) Extract values from text to csv file - even/odd lines pattern

I need to extract some numeric values from a given ASCII text file and export them to a specifically formatted csv file. The input file has got an even/odd line pattern:
SCF Done: E(UHF) = -216.432419652 A.U. after 12 cycles
CCSD(T)= -0.21667965032D+03
SCF Done: E(UHF) = -213.594303492 A.U. after 10 cycles
CCSD(T)= -0.21379841974D+03
SCF Done: E(UHF) = -2.86120139864 A.U. after 6 cycles
CCSD(T)= -0.29007031339D+01
and so on
I need the 5th-column value from the odd lines and the 2nd-column value from the even lines. They should be printed in a semicolon-separated csv file, with 10 values in each row. So the output should look like
-216.432419652;-0.21667965032D+03;-213.594303492;-0.21379841974D+03;-2.86120139864;-0.29007031339D+01; ...linebreak after 5 pairs of values
I started with awk '{print $5}' and awk '{print $2}', however I was not successful in creating a pattern that just acts on even/odd lines.
A simple way to do that?
The following script doesn't use a lot of the great power of awk, but will do the job for you and is hopefully understandable:
NR % 2 { printf "%s;", $5 }
NR % 2 == 0 { printf "%s;", $2 }
NR % 10 == 0 { print "" }
END { print "" }
Usage (save the above as script.awk):
awk -f script.awk input.txt
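The script above can be checked quickly (the file names `input.txt` and `script.awk` match the usage shown):

```shell
cat > input.txt <<'EOF'
SCF Done: E(UHF) = -216.432419652 A.U. after 12 cycles
CCSD(T)= -0.21667965032D+03
SCF Done: E(UHF) = -213.594303492 A.U. after 10 cycles
CCSD(T)= -0.21379841974D+03
SCF Done: E(UHF) = -2.86120139864 A.U. after 6 cycles
CCSD(T)= -0.29007031339D+01
EOF

cat > script.awk <<'EOF'
NR % 2 { printf "%s;", $5 }
NR % 2 == 0 { printf "%s;", $2 }
NR % 10 == 0 { print "" }
END { print "" }
EOF

awk -f script.awk input.txt
# -216.432419652;-0.21667965032D+03;-213.594303492;-0.21379841974D+03;-2.86120139864;-0.29007031339D+01;
```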
Given a file called data.txt, try:
awk '/SCF/{ printf "%s;", $5 } /CCSD/{ printf "%s;", $2 } NR % 10 == 0 { printf "\n" }' data.txt
Something like this could work -
awk '{x = NF > 3 ? $5 : $2 ; printf("%s;",x)}(NR % 10 == 0){print ""}' file
The ternary operator checks the number of fields (NF): if a line has more than 3 fields we assign $5 to x, else $2. printf("%s;",x) prints each value with a trailing ";". NR is a built-in that keeps track of the number of lines read; the modulo check fires each time 10 lines have been crossed, and print "" emits the line break.
This might work for you:
tr -s ' ' ',' <file | paste -sd',\n' | cut -d, -f5,11 | paste -sd',,,,\n'
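Roughly: `tr` squeezes the spaces into commas, the first `paste` joins each odd/even pair onto one line, `cut` keeps fields 5 and 11 (the two energies), and the final `paste` joins five pairs per output row. The result is comma-separated; since the question asked for semicolons, a trailing `tr` can convert them (a sketch, GNU coreutils assumed, with the sample saved as `data.txt`):

```shell
cat > data.txt <<'EOF'
SCF Done: E(UHF) = -216.432419652 A.U. after 12 cycles
CCSD(T)= -0.21667965032D+03
SCF Done: E(UHF) = -213.594303492 A.U. after 10 cycles
CCSD(T)= -0.21379841974D+03
SCF Done: E(UHF) = -2.86120139864 A.U. after 6 cycles
CCSD(T)= -0.29007031339D+01
EOF

# Pair up lines, pick the two energy fields, group pairs, then switch to semicolons.
tr -s ' ' ',' <data.txt | paste -sd',\n' | cut -d, -f5,11 | paste -sd',,,,\n' | tr ',' ';'
# -216.432419652;-0.21667965032D+03;-213.594303492;-0.21379841974D+03;-2.86120139864;-0.29007031339D+01
```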