Filtering CSV files by multiple columns, sorting them, and creating 2 new files - csv

I have been searching for how to do the following for a couple of hours and could not find it. I apologize if I am repeating something.
I have 22 csv files with 14 columns and 17,392 lines each. I am using awk to filter the original files with the following commands:
First, I need to get the lines whose value in column 14 is smaller than 0.05:
awk -F '\t' '$14 < 0.05 { print $0 }' file1 > file2
Next, I need to get the lines with column 10 values higher than 1 or smaller than -1:
awk -F '\t' '$10 < -1 { print $0 }' file2 > file3
awk -F '\t' '$10 > 1 { print $0 }' file2 > file4
My last step is to get the lines with values in column 7 OR 8 higher than 1 (e.g. column 7 could be 0 if column 8 is 1):
awk -F '\t' '$7<=1 {print $0}' file3 > file5
awk -F '\t' '$8>=1 {print $0}' file4 > file6
My problem is that I create several intermediate files. I need just two files at the end: file3 and file4, where column 7 or 8 has a value equal to or greater than 1. How can I write an awk command that does all of that at once?
Thank you.

Your question is ambiguous, so there are many possible answers. However, you can combine conditions in awk and you can write to separate files in a single pass, so you might mean:
awk -F '\t' '$14 < 0.05 && $10 < -1 && $7 > 1 { print > "file5" }
$14 < 0.05 && $10 > +1 && $8 > 1 { print > "file6" }' file1
This command should give you the same output in file5 and file6 as your original sequence of operations, but it makes only one pass over the data instead of many. (Strictly, it produces the same answer only if you change your $7<=1 to $7>1 to agree with your description of wanting column 7 or 8 higher than 1, though that contradicts your example 'on 7 could be 0 if on 8 it is 1'.)
Given an input file:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
The output in file5 is:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
and the output in file6 is:
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
If you need to combine the conditions differently, then you need to clarify your question.

You could try:
awk -F'\t' '($14 < 0.05) && ($10 < -1) && ($7 <= 1) {print}' file1 > file3
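If you instead want to keep the question's literal conditions ($7 <= 1 for file3, $8 >= 1 for file4), the same single-pass idea from the first answer still applies; a sketch, using the file names from the question:
awk -F'\t' '
$14 < 0.05 && $10 < -1 && $7 <= 1 { print > "file3" }
$14 < 0.05 && $10 >  1 && $8 >= 1 { print > "file4" }
' file1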

Related

awk extract rows with N columns

I have a tsv file with a varying number of columns:
1 123 123 a b c
1 123 b c
1 345 345 a b c
I would like to extract only the rows with 6 columns:
1 123 123 a b c
1 345 345 a b c
How can I do that in bash (awk, sed, or something else)?
Using Awk
$ awk -F'\t' 'NF==6' file
1 123 123 a b c
1 345 345 a b c
FYI, most of the existing solutions have one potential pitfall:
$ printf '1\t2\t3\t4\t5\t\n' | awk -F'\t' '{ print "NF == " NF }'
NF == 6
If the input file happens to have a trailing tab \t, awk still reports the line as having an NF count of 6. Whether such a line has 5 columns or 6 in the logical sense is open to interpretation.
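If you want a line with a trailing tab to count as 5 columns, one minimal guard (my addition, not from the answers above) is to also require the last field to be non-empty:
awk -F'\t' 'NF == 6 && $NF != ""' file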
Using GNU sed. Let the content of file.txt be
1 123 123 a b c
1 123 b c
1 345 345 a b c
1 777 777 a b c d
then
sed -n '/^[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$/p' file.txt
gives output
1 123 123 a b c
1 345 345 a b c
Explanation: -n turns off default printing; the sole action is to print (p) any line matching the pattern, which is anchored at the beginning (^) and end ($) and consists of 6 columns of non-TABs separated by single TABs. This code uses only very basic sed features but, as you can see, is longer than the awk solution and not as easy to adjust for a different N.
(tested in GNU sed 4.2.2)
This might work for you (GNU sed):
sed -nE 's/\S+/&/6p' file
This will print lines with 6 or more fields.
sed -nE 's/\S+/&/6;T;s//&/7;t;p' file
This will print lines with only 6 fields.
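The sed answer above noted that adjusting N is not as easy as in awk; for completeness, the awk version can be parameterized with a standard -v variable:
awk -F'\t' -v n=6 'NF == n' file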

Complex CSV parsing with Linux commands

I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (each introduced by the header line shown).
I would like to extract the 3rd property (HC) of every entry.
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
Whenever there are n lines recorded for an entry, I want the sum of the n HC values.
The expected output for the above file:
14
28
51
0
37
10
I know I can write a program for this, but is there an easy way to get it with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.
awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }' input.txt
awk solution:
$ awk -F';' '$3=="HC" && p{
    print sum          # print the current total
    sum=p=0            # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)     # make sure $3 is converted to a number; sum it up
    p=1                # set p to 1
}
END{print sum}' input.txt  # print the last sum
output:
14
28
51
0
37
10
one-liner:
$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
For given inputs:
$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10
It takes a little more care for cases like this one:
$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10
# Find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)"
Explanation:
Get the contents of the file using cat
Take only the third column using cut with a delimiter of ;
Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions that eventually yield expr 0 + 14
Temporarily replace \n newlines with # to work around possible BSD sed limitations
Replace double ## with a single # to keep blank lines from turning into spaces and causing expr to bomb out
Replace # with + to add the numbers together
Execute the command, but prefixed with true || so the 0; from the first substitution does not cause a guaranteed syntax error on the first line.
Which creates this:
true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10
The output looks like this:
14
28
51
0
37
10
This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if it helps you.
awk -F";" '
/^H/ && $3!="HC"{
flag="";
next
}
/^H/ && $3=="HC"{
if(NR>1){
printf("%d\n",sum)
};
sum=0;
flag=1;
next
}
flag{
sum+=$3
}
END{
printf("%d\n",sum)
}
' Input_file
Output will be as follows.
14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10

AWK wrong math on first line only

This is the input file, input.awk (DOS type):
06-13-2014,08:43:11
RLS007817
RRC001021
yes,71.61673,0,150,37,1
no,11,156,1.35,306.418
4,3,-1,2.5165,20,-1.4204
-4,0,11,0,0,0
1.00E-001,0.2,3.00E-001,0.6786031,0.5,6.37E-002
110,40,30,222,200,-539
120,50,35,215,220,-547
130,60,40,207,240,-553
140,70,45,196,260,-560
150,80,50,184,280,-566
160,90,55,170,300,-573
170,100,60,157,320,-578
180,110,65,141,340,-582
190,120,70,126,360,-586
200,130,75,110,380,-590
This is what I basically need:
Ignore the first 8 lines (OK)
Pick and split the numbers on lines 6,7 & 8 (OK)
Do AWK math on columns (Error only in first line?)
BASH code
#!/bin/bash
myfile="input.awk"
vzeros=$(sed '6q;d' $myfile)
vshift=$(sed '7q;d' $myfile)
vcalib=$(sed '8q;d' $myfile)
IFS=','
read -a avz <<< "${vzeros}"
read -a avs <<< "${vshift}"
read -a avc <<< "${vcalib}"
z1=${avz[0]};s1=${avs[0]};c1=${avc[0]}
z2=${avz[1]};s2=${avs[1]};c2=${avc[1]}
z3=${avz[2]};s3=${avs[2]};c3=${avc[2]}
z4=${avz[4]};s4=${avs[4]};c4=${avc[4]}
#The single variables will be passed to awk
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" -v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" -v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" 'NR>8 { FS = "," ;
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile > test.plot
This is the result on the file test.plot
11 -0.6 -3 -10
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
This is the weird part: only on the first line, everything after the first column is wrong, and I have no idea why.
This is the expected result file:
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
I've printed the correction factors captured from lines 6, 7 & 8 and everything is fine. All the math is fine, except on the first line, after the first column.
OS: Slackware 13.37.
AWK: GNU Awk 3.1.6 Copyright (C) 1989, 1991-2007 Free Software Foundation.
I agree with @jeanrjc.
I copied your file and script to my machine and reduced it to processing the first 2 lines of your data.
With your code as is, I duplicate your results, i.e.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2= z2=3 s2=0
11 -0.6 -3 -10
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2= z2=3 s2=0
12 -0.6 -3 -10
With FS=","; commented out, and -F, added in the option list the output is what you are looking for.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2=40 z2=3 s2=0
11 7.4 6 90
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2=50 z2=3 s2=0
12 9.4 7.5 100
So make sure you have removed FS=","; from the block of code and are using -F,. In any case, I would say that resetting FS="," for each line that is processed is not useful.
If that still doesn't solve it, try the corrected code on a machine with a newer version of awk.
It would take a small magazine article to completely illustrate what is happening: while reading the first 8 records, FS is the default whitespace; when the first row that meets your rule NR>8 arrives, its fields are still parsed with that whitespace FS; only then is FS set to ",", and that first row is not rescanned.
IHTH!
Your sample is too complex to reproduce easily, but I guess you should try:
awk -F"," 'NR>8{...
instead of
awk 'NR>8 { FS = "," ;
You can also try with BEGIN:
awk 'BEGIN{FS=","}NR>8{...
I eventually tested your script, and you should change the position of the FS parameter, as I told you:
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" \
-v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" \
-v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" -F"," 'NR>8 {
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
0 -0.6 -3 -10
Why did you have a problem?
Because awk parses each line before executing the block, so if you tell it to change something related to parsing, the change takes effect from the next line.
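A minimal demonstration of that parse-before-execute timing, using any POSIX awk:
$ printf 'a,b\nc,d\n' | awk '{ FS = "," } { print $1 }'
a,b
c
The first record is already split on whitespace by the time the block assigns FS, so $1 is the whole line; the assignment only affects how the second record is split.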
HTH

Enhance awk script by printing top 5 occurring data elements from each column

I have an awk script that processes a csv file and produces a report that counts, for each column named in the header, the number of rows that contain data /[A-Za-z0-9]/. What I would like to do is enhance the script to print the top 5 most duplicated data elements in each column.
Here is sample data:
Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
This is the current script that SiegeX has substantially contributed:
$ cat matrix2.awk
NR==1{
    for(i=1;i<=NF;i++)
        head[i]=$i
    next
}
{
    for(i=1;i<=NF;i++) {
        if($i && !arr[i,$i]++)
            n[i]++
        if(arr[i,$i] > 1)
            f[i]=1
    }
}
END{
    for(i=1;i<=length(head);i++) {
        printf("%-6d%s\n",n[i],head[i])
        if(f[i]) {
            for(x in arr) {
                split(x,b,SUBSEP)
                if(b[1]==i && b[2])
                    printf("% -6d %s\n",arr[i,b[2]],b[2])
            }
        }
    }
}
This is the current output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
3 Meat
1 Fish
4 Dairy
1 Bread
2 Spicy
17 No
2 Yes
And this is the desired output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
4 Dairy
3 Meat
1 Fish
2 Spicy
17 No
2 Yes
The only thing left that I would like to add is a general function that limits each column's output to the top 5 most duplicated fields. As mentioned below, a columnar version of sort | uniq -c | sort -nr | head -5.
The following script is both extensible and scalable, as it will work with an arbitrary number of columns. Nothing is hardcoded.
awk -F'|' '
NR==1{
    for(i=1;i<=NF;i++)
        head[i]=$i
    next
}
{
    for(i=1;i<=NF;i++) {
        if($i && !arr[i,$i]++)
            n[i]++
        if(arr[i,$i] > 1)
            f[i]=1
    }
}
END{
    for(i=1;i<=length(head);i++) {
        printf("%-32s%d\n",head[i],n[i])
        if(f[i]) {
            for(x in arr) {
                split(x,b,SUBSEP)
                if(b[1]==i && b[2])
                    printf(" %-28s%d\n",b[2],arr[i,b[2]])
            }
        }
    }
}' infile
Output
$ ./report
Food 9
Type 5
Meat 2
Bread 1
Vegetable 2
Fruit 2
Fish 1
Spicy 2
Yes 2
No 6
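Neither script above limits each column to its top 5 values. Here is one hedged sketch of that last piece; it assumes GNU awk 4+ (arrays of arrays and PROCINFO["sorted_in"] are gawk extensions) and keeps the question's rule of listing values only for columns that contain at least one duplicate:
gawk -F'|' '
NR==1 { ncol = NF; for (i=1; i<=ncol; i++) head[i] = $i; next }
{
    for (i=1; i<=NF; i++)
        if ($i != "" && ++cnt[i][$i] > 1)
            dup[i] = 1                       # column i has a repeated value
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"  # traverse subarrays by count, descending
    for (i=1; i<=ncol; i++) {
        printf "%-6d%s\n", length(cnt[i]), head[i]
        if (!(i in dup))
            continue
        top = 0
        for (v in cnt[i]) {                  # highest counts first
            if (++top > 5) break
            printf "%-6d %s\n", cnt[i][v], v
        }
    }
}' testlist.csv
This is the columnar sort | uniq -c | sort -nr | head -5 in a single pass: cnt[i] plays the role of uniq -c for column i, and the @val_num_desc scanning order stands in for sort -nr | head -5.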
Not a complete solution, but something to get you started -
awk -F"|" '
NR>1{
a[$1]++;
b[$2]++;
c[$3]++}
END{
print "Food\t\t\t" length(a);
print "Type\t\t\t" length(b);
for (x in b)
if (x!="")
{
printf ("\t%-16s%s\n",x,b[x]);
}
print "Spicy\t\t\t" length(c);
for (y in c)
if (y!="")
{
printf ("\t%-16s%d\n",y,c[y])
}
}' testlist.csv
TEST:
[jaypal:~/Temp] cat testlist.csv
Food|Type|Spicy
Broccoli|Vegetable|No
Jalapeno|Vegetable|Yes
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
[jaypal:~/Temp] awk -F"|" 'NR>1{a[$1];b[$2]++;c[$3]++}END{print "Food\t\t\t" length(a); print "Type\t\t\t"length(b); for (x in b) if (x!="") printf ("\t%-16s%s\n",x,b[x]) ;print "Spicy\t\t\t"length(c); for (y in c) if (y!="") {printf ("\t%-16s%d\n",y,c[y])}}' testlist.csv
Food 9
Type 6
Fruit 2
Vegetable 2
Bread 1
Meat 2
Fish 1
Spicy 3
Yes 2
No 6

How can I generate a file of random negative and positive integers in serial?

I want a file of randomly generated positive and negative serial integers. For now, I ask that the file contain roughly (no guarantee required) equal numbers of negatives and positives, but make it easy to change the proportions later. By "serial", I mean the kth random negative is equal to -k, and the kth random positive is equal to +k.
This GNU Bash script one-liner would satisfy the file format, but just wouldn't be random.
$ seq -1 -1 -5 && seq 1 5
-1
-2
-3
-4
-5
1
2
3
4
5
This example shows what I'm looking for even better, but is still not random since the integers alternate predictably between negative and positive.
$ paste <(seq -1 -1 -5) <(seq 1 5) | tr '\t' '\n'
-1
1
-2
2
-3
3
-4
4
-5
5
Sending one of these through the shuf command makes them randomly negative or positive, but they lose their serial-ness.
$ paste <(seq -1 -1 -5) <(seq 1 5) | tr '\t' '\n' | shuf
-5
4
3
2
-2
1
-1
-4
5
-3
Note: I'm trying to test algorithms that sort lists/arrays of bits (zeros and ones), but if I use 0s and 1s I won't be able to analyse the sort's behaviour or tell if stability was preserved.
If I understand correctly, you want to interleave the positive integers and the negative integers randomly. For example: 1 2 -1 3 -2 4 5 -3.
my $count = 10;
my $pos   = 1;
my $neg   = -1;
my @random = map {
    int(rand 2)
        ? $pos++
        : $neg--
} 1..$count;
print "@random\n";
Update:
To change proportions I'd do this:
use strict;
use warnings;
my $next = get_list_generator(.5);
my @random = map $next->(), 1..10;
print "@random\n";
my $again = get_list_generator(.25);
my @another = map $again->(), 1..10;
print "@another\n";
sub get_list_generator {
    my $prob_positive = shift;
    my $pos = 1;
    my $neg = -1;
    return sub {
        return rand() <= $prob_positive ? scalar $pos++ : scalar $neg--;
    }
}
The get_list_generator() function returns a closure. This way you can even have multiple list generators going at once.
Let's start a golf contest? (44)
perl -le'print rand>.5?++$a:--$b for 1..10'
Edit: daotoad's 40 chars version
seq 1 10|perl -ple'$_=rand>.5?++$a:--$b'
Here 15 is the total number of values generated and tp is the number of positives you want (effectively setting the pos/neg ratio):
tp=8
unset p n
for i in $(printf '%s\n' {1..15} | gsort -R); do
(( i <= tp )) && \
echo $((++p)) || \
echo $((--n))
done
#!/bin/bash
pos=0 neg=0
for i in {1..10}
do
if (( ($RANDOM > 16384 ? ++pos : --neg) > 0 ))
then echo $pos
else echo $neg
fi
done
I could not quite fit this into a one-liner. Anyone else?
Edit: Ah, a one-liner, 65 characters (you need to reset a and b if you're repeatedly invoking this in the same shell):
a=0 b=0;for i in {1..10}; do echo $(($RANDOM>16384?++a:--b));done
Here's a Bash one-liner (2?) inspired by lhunath and Brian's answers.
RANDOM=$$; pos=1; neg=-1; for i in {1..10}; do \
echo $(( $(echo $RANDOM / 32767 \> 0.5 | bc -l) ? pos++ : neg-- )); done
Here's an Awk script that competes in the golf contest (44).
seq 1 10|awk '{print(rand()>0.5?++p:--n);}'
This is the clearer idiomatic way to write it:
seq 1 10 | awk 'BEGIN{srand(); pos=1; neg=-1;}
{print (rand() > 0.5 ? pos++ : neg--);}'
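The question also asks that the proportions be easy to change later; a parameterized sketch of the same idea (the variable name p, the probability of emitting a positive, is my own):
seq 1 10 | awk -v p=0.25 'BEGIN{srand()} {print (rand() < p ? ++pos : --neg)}'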
There is no set of numbers that will fit all your criteria. You can't say you want random but at the same time say that the kth negative value == -k and the kth positive value == k. You can either have it random, or not.
As to what you're trying to do, why not separate the two concerns and test the sort on something like an array of pairs of integers of length n? The first of the pair can be zero or 1, and the second of the pair will be your stability tracker (just a count from 0 to n).
Generate the list of 0's and 1's that you want and shuffle them, then add on the tracker integer. Now sort the pairs by their first element.
The input to your sort will look something like this.
0, 1
1, 2
0, 3
1, 4
1, 5
0, 6
0, 7
1, 8
1, 9
1, 10
0, 11
0, 12
0, 13
Stable sorts will produce this
0, 1
0, 3
0, 6
0, 7
0, 11
0, 12
0, 13
1, 2
1, 4
1, 5
1, 8
1, 9
1, 10
Unstable ones will produce the 0's and 1's with the tracker integers out of order.
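A minimal bash sketch of this pair-generation approach, assuming GNU shuf and sort (shuf -r repeats the 0/1 choices; sort -s requests a stable sort):
n=13
# Column 1: n random bits; column 2: the serial stability tracker.
paste -d' ' <(shuf -r -n "$n" -e 0 1) <(seq 1 "$n") > pairs.txt

# After a stable sort on the bit column, the trackers must still
# be ascending within each group of equal bits.
sort -s -k1,1n pairs.txt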