Print first, penultimate and last fields in CSV file

I have a big comma-separated file with 20,000 rows and five value columns. I want to extract particular columns, but some fields contain embedded commas (the header does not), so I can't simply cut on every comma.
example file:
name,v1,v2,v3,v4,v5
as,"10,12,15",21,"12,11,10,12",5,7
bs,"11,15,16",24,"19,15,18,23",9,3
This is my desired output:
name,v4,v5
as,5,7
bs,9,3
I tried the following cut command, but it doesn't work:
cut -d, -f1,5,6

In general, for these scenarios it is best to use a proper CSV parser; you can find those in Python, for example.
However, since your data seems to have its comma-containing fields only near the beginning, you can simply print the first field and then the penultimate and last ones:
$ awk 'BEGIN{FS=OFS=","} {print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3
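If you'd rather address the columns by position even though some fields contain quoted commas, GNU awk's FPAT (discussed at length in the last question below) also works here. A sketch, assuming gawk is available:
$ gawk -v FPAT='[^,]*|"[^"]*"' -v OFS=',' '{print $1, $5, $6}' file
name,v4,v5
as,5,7
bs,9,3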

In TXR Lisp:
$ txr extract.tl < data
name,v4,v5
as,5,7
bs,9,3
Code in extract.tl:
(mapdo
  (lambda (line)
    (let ((f (tok-str line #/"[^"]*"|[^,]+/)))
      (put-line `#[f 0],#[f 4],#[f 5]`)))
  (get-lines))
As a condensed one-liner:
$ txr -t '(mapcar* (do let ((f (tok-str #1 #/"[^"]*"|[^,]+/)))
`#[f 0],#[f 4],#[f 5]`) (get-lines))' < data

Related

Update a CSV file to drop the first number and insert a decimal place in a particular column

I need help to perform the following. My CSV file looks like this:
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
What I need to do is generate a new CSV file as follows:
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
If I were to do this as a test:
echo '036921' | awk -v range=1 '{print substr($0,range+1)}' | sed 's/.$/.&/'
I get
3692.1
Can anyone help me incorporate that (or anything similar) to change my CSV file?
Try
sed 's/,0*\([0-9]*\)\([0-9]\),/,\1.\2,/' myfile.csv
(The capture groups need backslash escapes in sed's default BRE syntax; this strips the leading zeros and re-inserts the last digit after a decimal point.)
Using awk, and with the conditions specified in the comments, you can use:
$ awk -F, '{ printf "%s,%06.1f,%s\n", $1, $2 / 10, $3 }' data
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
$
With the printf format string providing the commas, there's no need to set OFS (because OFS is not used by printf).
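If you did want the field-assignment style, a roughly equivalent sketch that sets OFS (so the rebuilt record keeps its commas) would be:
$ awk -F, -v OFS=, '{ $2 = sprintf("%06.1f", $2 / 10) } 1' data
Here assigning to $2 makes awk rebuild the record, joining the fields with OFS.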
Assuming that values with leading zeros appear solely in the 2nd column, I would use GNU AWK for this task in the following way. Let file.txt contain:
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
then
awk 'BEGIN{FS=",0?";OFS=","}{$2=gensub(/([0-9])$/, ".\\1", 1, $2);print}' file.txt
output
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
Explanation: I set the field separator (FS) to , optionally followed by 0, so a single leading zero is discarded as part of the separator. In the 2nd field I replace the last digit with . followed by that digit. Finally I print the changed line, using , as the output separator.
(tested in gawk 4.2.1)
I wish to have 4 numbers (including zeros) and the last value (5th value) separated from the 4 values by a decimal point.
If I understand correctly, you don't need all the digits of that field, only the last five.
Using awk you can get the last five with the substr function and then print the field with the last digit separated from the previous four by a decimal point, using the sub() function:
awk -F',' -v OFS=',' '{$2 = substr($2, length($2) - 4); sub(/[[:digit:]]$/, ".&", $2); print}' file
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated

Can't understand this bash script: $ awk -F ...

Can someone help me understand what this script does?
$ awk -F',' '{ data[$4]+=$29;}END{c=0; for (i in data) { if (data[i]+0<1000000) {c++}} ;print c}' file.csv
Thanks.
This script iterates line by line over the input file file.csv. The file is apparently comma-separated, so the field separator is set via -F',' accordingly. For each line, the data array is indexed by the content of the 4th field ($4), and the value of the 29th field ($29) is added to that entry. After all lines are processed, the END section runs: it iterates over the data array, counts how many of the accumulated sums are below 1,000,000 (the +0 forces a numeric comparison), and prints that count c.
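The same one-liner spread out with comments, in case that makes the flow easier to follow:
awk -F',' '
    { data[$4] += $29 }                  # group by column 4, summing column 29

    END {
        c = 0
        for (i in data)                  # each distinct value of column 4
            if (data[i] + 0 < 1000000)   # +0 forces a numeric comparison
                c++
        print c                          # count of groups summing below 1,000,000
    }
' file.csv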

Change column if some regex expression is true with awk or sed

I have a file (let's call it data.csv) similar to this:
"123","456","ud,h-match","moredata"
with many rows in the same format and embedded commas. What I need to do is look at the third column and see if it matches an expression. In this case I want to know if the third column contains "match" anywhere (which it does). If it does, I want to replace the whole column with something else, like "replaced". So, relating it to the example data.csv file, I would want it to look like this:
"123","456","replaced","moredata"
Ideally, I want the file data.csv itself to be changed (time is of the essence since I have a big file), but it's also fine if you write it to another file.
Edit:
I have tried using awk:
awk -F'","' -OFS="," '{if(tolower($3) ~ "stringI'mSearchingFor"){$3="replacement"; print}else print}' file
but it doesn't change anything. If I remove the OFS portion then it works, but the output gets separated by spaces and the columns don't get enclosed in double quotes.
Depending on the answer to my question about what you mean by column, this may be what you want (uses GNU awk for FPAT):
$ awk -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' file
"123","456","replaced","moredata"
Use gawk -i inplace ... if you want to do "in place" editing.
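For example (a sketch; -i inplace requires GNU awk 4.1 or later, and testing on a copy of the file first is wise):
$ gawk -i inplace -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' file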
With any awk (but slightly more fragile than the above since it leaves the leading/trailing " on the first and last fields, and has no -i inplace):
$ awk 'BEGIN{FS=OFS="\",\""} $3~/match/{$3="replaced"} 1' file
"123","456","replaced","moredata"

Can awk deal with CSV file that contains comma inside a quoted field?

I am using awk to compute the sum of one column in a CSV file. The data format is something like:
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
I was using this awk script to count the sum:
awk -F, '{sum+=$3} END {print sum}'
Some of the values in the name field contain commas, and this breaks my awk script.
My question is: can awk solve this problem? If yes, how can I do that?
Thank you.
One way using GNU awk and FPAT
awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt
Result:
192
I am using
FPAT="([^,]+)|(\"[^\"]+\")"
to define the fields with gawk. I found that when a field is empty this doesn't recognize the correct number of fields, because + requires at least one character in the field.
I changed it to:
FPAT="([^,]*)|(\"[^\"]*\")"
replacing + with *, and it works correctly.
The GNU Awk User's Guide has the same issue:
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
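A quick way to see the difference on a line with an empty field (a minimal sketch; it assumes a reasonably recent gawk, since older releases handled null FPAT matches inconsistently):
$ echo 'a,,c' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF }'
2
$ echo 'a,,c' | gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" } { print NF }'
3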
You're probably better off doing it in Perl with Text::CSV, since that's a fast and robust solution.
You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.
Here is the command:
csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'
see https://github.com/dbro/csvquote for the code
For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:
$ cat decsv.awk
BEGIN { fs=FS; FS=RS }   # remember the real FS; split fields on RS instead
{
    decsv()
    for (i=1; i<=NF; i++) {
        printf "Record %d, Field %d is <%s>\n", NR, i, $i
    }
    print ""
}
# Rewrite every real FS that lies outside double quotes as RS,
# so the subsequent field splitting on RS ignores quoted commas.
function decsv(  curr,head,tail)
{
    tail = $0
    while ( match(tail, /"[^"]+"/) ) {
        head = substr(tail, 1, RSTART-1)         # unquoted text before the match
        gsub(fs, RS, head)                       # real separators -> RS
        curr = curr head substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART + RLENGTH)    # continue after the quoted part
    }
    gsub(fs, RS, tail)                           # the remainder has no quotes
    $0 = curr tail
}
$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>
Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>
Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>
Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>
It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes, and even then it's not too hard; it's all been done before...
See What's the most robust way to efficiently parse CSV using awk? for more information.
You can always tackle the problem at the source: put quotes around the name field, just like the "I am the, question" field. This is much easier than spending your time coding workarounds.
Update (as Dennis requested): a simple example.
$ s='id, "name1,name2", value 1, foo, 17 2, bar, 76 3, "I am the, question", 99'
$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99
$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question
As you can see, by setting the delimiter to the double quote, the fields that belong inside quotes always land at even-numbered positions. Since the OP doesn't have the luxury of modifying the source data, though, this method is not appropriate for him.
This article helped me solve this same data-field issue. Most CSVs put quotes around fields that contain spaces or commas. Those quotes mess up the field count for awk unless you filter them out.
If you need the data within the fields that contain the garbage, this is not for you. ghostdog74 provided the answer, which empties that field but maintains the total field count, which is key to keeping the data output consistent. I did not like how that solution introduced new lines, so this is the version of it I used. The first three fields never had this problem in my data. The fourth field, containing the customer name, often did, but I needed that data. The remaining fields that exhibit the problem I could throw away without issue, because they were not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74 gave to empty the remaining fields that have commas within them - this also removes the quotes, but I use printf to maintain the data in a single record. I start off with 85 fields and end up with 85 fields in all cases from my 8000+ lines of messy data. A perfect score!
grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile
The solution that empties the fields with commas within them but also maintains the record, of course is:
awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}
Megs of thanks to ghostdog74 for the great solution!
FPAT is the elegant solution because it can handle the dreaded commas-within-quotes problem, but if you just want to sum a column of numbers in the last column, regardless of the number of preceding separators, $NF works well:
awk -F"," '{sum+=$NF} END {print sum}'
To access the second to last column, you would use this:
awk -F"," '{sum+=$(NF-1)} END {print sum}'
If you know for sure that the 'value' column is always the last column:
awk -F, '{sum+=$NF} END {print sum}'
NF represents the number of fields, so $NF is the last column
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file
allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk
You could try piping the file through a Perl regex to convert the quoted commas into something else, like a |.
cat test.csv | perl -p -e "s/(\".+?)(,)(.+?\")/\1\|\3/g" | awk -F, '{...
The above regex assumes there is always a comma within the double quotes, so more work would be needed to make the comma optional.
You could write a function in awk like the one below:
$ awk 'func isnum(x){return(x==x+0)}BEGIN{print isnum("hello"),isnum("-42")}'
0 1
You can incorporate this function in your script and check whether the third field is numeric. If it is not, move on to the 4th field; if the 4th field in turn is not numeric, go to the 5th, and so on, until you reach a numeric value (a loop will help here), and add that to the sum.
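A sketch of that loop, run against the sample data from this question (fields holding pieces of a quoted string fail the numeric test, so the first numeric field from column 3 onward is the one added):
awk -F, '{
    for (i = 3; i <= NF; i++) {    # scan from the 3rd field onward
        if ($i == $i + 0) {        # true only for numeric fields
            sum += $i
            break                  # take the first numeric one
        }
    }
}
END { print sum }' file.csv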

Find duplicate lines based on some delimited fields

I have a file with lines whose fields are delimited by "|".
I have to extract the lines that are identical based on some of the fields
(i.e. find lines which contain the same values for fields 1, 2, 3, 12 and 13).
The contents of the other fields don't matter for the search, but the extracted lines have to be complete.
Can anyone tell me how I can do that in KSH scripting?
(For example, a script with some arguments (order-dependent) that define the field separator and the fields which have to be compared to find duplicate lines in the input file.)
Thanks in advance and kind regards
Oli
This prints duplicate lines based on matching fields. It uses an associative array, which could grow large depending on the nature of your input file. The output is not sorted, so most duplicates are not grouped together (except for the first two of a set).
awk -F'|' '{ idx=$1$2$3$12$13; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}' inputfile.txt
You could probably build up your index list in a shell variable in a wrapper script something like this:
#!/bin/ksh
for arg
do
    case $arg in              # validate input (could be better)
    +([0-9]) )                # integers only
        idx="$idx\$$arg"      # append $1, $2, ... to the index expression
        ;;
    * )
        echo "Invalid field specifier"
        exit 1
        ;;
    esac
done
awk -F'|' '{ idx='$idx'; if (array ...
You can sort the output by piping it through a command such as this:
awk ... | sort --field-separator='|' --key=1,1 --key=2,2 --key=3,3 --key=12,12 --key=13,13
This removes the duplicates, keeping just the first line of each set of identical keys:
awk -F'|' '!arr[$1$2$3$12$13]++' inputfile > outputfile
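One caveat with both scripts above: building the key by plain concatenation means fields "ab", "c" and "a", "bc" produce the same key. Subscripting with commas (awk joins the parts with SUBSEP) avoids that; a sketch of the last one-liner with that change:
awk -F'|' '!arr[$1,$2,$3,$12,$13]++' inputfile > outputfile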