AWK: filtering of data based on information from TWO columns - CSV

I am working on the post-processing of a multi-column CSV arranged in the following format:
ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000
Previously I have used the following AWK solution, which processes the input log two times, detects pop(MAX) and saves the lines matching $2 > (.8 * max):
awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR+1} next} FNR <= 2 || (FNR == n && $2 > (.4*max)) || $2 > (.8 * max)' input.csv{,} > output.csv
which reduced the input log, keeping just the two lines with the highest POP:
ID, POP, dG
1, 10, -5.6200
16, 10, -3.8800
Now I need to change the search algorithm to take into account both the 2nd (POP) and 3rd (dG) columns: i) always take the first line as the reference, since it always has the most negative number in the 3rd column (dG); ii) find the line with the biggest number in the second column, pop(MAX);
iii) take all lines between (i) and (ii) that match the following rules applied to BOTH columns:
a) the (negative) number in the 3rd column should satisfy $3 <= (.5 * dG(min)), where dG(min) is the dG of the first line (always the most negative), i.e. its absolute value should be at least half of the reference;
b) additionally the line should match the old rule for the second column with a decreased threshold: $2 >= (.5 * max), where max is pop(MAX).
So the expected output should be:
ID, POP, dG
1, 10, -5.6200 # the first line, with the most negative dG
8, 5, -4.4500 # POP (5) and dG (-4.4500) match both rules
10, 8, -4.1400 # POP (8) and dG (-4.1400) match both rules
16, 10, -3.8800 # this is pop(MAX), the line with the highest POP
ADDED 8-04:
For the case when the first line has a very low POP (which does not match the rule $2 >= (.5 * maxPop)), e.g.:
ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300
the output should not contain the first line, while its dG value should still be used as the reference for the second condition ($3 <= (.5 * minD)), which is applied when selecting the other lines of the output:
13, 25, -4.3000
21, 27, -4.0800

You may use this awk solution:
awk -F ', ' 'NR == 1 {next} FNR==NR {if (maxP < $2) maxP=$2; if (minD=="" || minD > $3) minD=$3; next} FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))' file{,}
ID, POP, dG
1, 10, -5.6200
8, 5, -4.4500
10, 8, -4.1400
13, 5, -4.0100
16, 10, -3.8800
To make it more readable:
awk -F ', ' '
NR == 1 {next}                       # skip 1st record 1st time
FNR == NR {
    if (maxP < $2)                   # compute max(POP)
        maxP = $2
    if (minD == "" || minD > $3)     # compute min(dG)
        minD = $3
    next
}
# print if 1st 2 lines OR "$2 >= .5 * max(POP) && $3 <= .5 * min(dG)"
FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
' file{,}
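Note that the condition FNR <= 2 above always keeps the first data line. For the 8-04 addition, where the first line may fail the POP rule, a minimal sketch of a variation (not part of the original answer) keeps only the header unconditionally and lets the first data line compete like any other:
awk -F ', ' '
NR == 1 {next}                         # skip the header on the 1st pass
FNR == NR {                            # 1st pass: collect max(POP) and min(dG)
    if (maxP < $2) maxP = $2
    if (minD == "" || minD > $3) minD = $3
    next
}
# 2nd pass: keep the header plus any line passing both thresholds
FNR == 1 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
' file{,}
For the 8-04 data this prints the header followed by lines 13 and 21; drop the FNR == 1 term if the header is not wanted either.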

Related

AWK: filtering of the lines based on the column similarity

I am dealing with the post-processing of a multi-column CSV arranged in a fixed format:
ID, POP, dG
1, 12, -5.6500
2, 10, -5.5100
3, 14, -5.3500
4, 17, -5.3400
5, 8, -5.3000
6, 1, -5.1800
7, 12, -5.1700
8, 7, -5.1500
9, 3, -5.1100
10, 6, -5.0200
11, 2, -5.0100
12, 2, -4.9500
13, 1, -4.9000
14, 14, -4.8400
15, 4, -4.8300
16, 2, -4.8300
17, 6, -4.7700
18, 7, -4.7600
19, 3, -4.7200
20, 6, -4.7100
I need to reduce the number of lines in this data by i) searching for the line with the highest number in the second column, pop(MAX); ii) keeping in the output the lines with POP >= 0.6*pop(MAX); and iii) always keeping the first line (with ID 1). I've used the following AWK expression for the practical realisation:
awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) max=$2; next} FNR == 2 || $2 > (.6 * max)' input.csv input.csv > output.csv
which gives me the following output.csv:
ID, POP, dG
1, 12, -5.6500
3, 14, -5.3500
4, 17, -5.3400
7, 12, -5.1700
14, 14, -4.8400
How could I modify my AWK expression to additionally include the next line after pop(MAX) if it matches another condition with a lower percentage threshold relative to pop(MAX): POP >= 0.4*pop(MAX)?
So the output should contain an additional line if its POP column fits that rule (I added # comments to clarify the selection of each line):
ID, POP, dG
1, 12, -5.6500 # the first line is always taken
3, 14, -5.3500 # POP >= 0.6*pop(MAX)
4, 17, -5.3400 # this is pop(MAX)
5, 8, -5.3000 # the line next to pop(MAX): POP >= 0.4*pop(MAX)
7, 12, -5.1700 # POP >= 0.6*pop(MAX)
14, 14, -4.8400 # POP >= 0.6*pop(MAX)
You may use this awk:
awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR+1} next} FNR <= 2 || FNR == n || $2 > (.6 * max)' input.csv input.csv
ID, POP, dG
1, 12, -5.6500
3, 14, -5.3500
4, 17, -5.3400
5, 8, -5.3000
7, 12, -5.1700
14, 14, -4.8400
To make it more readable:
awk -F ', ' '
NR == 1 {next}
FNR == NR {
    if (max < $2) {
        max = $2
        n = FNR+1
    }
    next
}
FNR <= 2 || FNR == n || $2 > (.6 * max)
' input.csv{,}
input.csv{,} is brace expansion in bash that just repeats the string twice, making it input.csv input.csv.
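For example, echo shows the expansion:
echo input.csv{,}
input.csv input.csv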

AWK: post-processing of the data based on two columns

I am dealing with the post-processing of CSV logs arranged in a multi-column format in the following order: the first column corresponds to the line number (ID), the second contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of this ID (which is always negative):
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600
In order to reduce the number of lines, I filtered this CSV with the aim of finding the line with the highest number in the second column (POP), using the following AWK expression:
# search the CSV for the line with the highest POP and save all lines before it, while keeping a minimal number of lines (3) in case this line is found at the beginning of the CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv
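Broken over several lines with comments (behaviour unchanged), the expression reads:
awk -v min_lines=3 -F ", " '
a < $2 {                               # current line has a higher POP than any seen so far
    for (idx = 0; idx < i; idx++)      # flush the lines buffered since the previous maximum
        print arr[idx]
    print $0                           # print the new maximum line itself
    a = int($2)                        # remember the new maximum POP
    i = 0
    printed = NR                       # remember the record number where it was printed
}
a > $2 && NR > 1 {                     # strictly lower POP: buffer the line for a later flush
    arr[i] = $0
    i++
}
END {                                  # if the maximum sat within the first min_lines records,
    if (printed <= min_lines)          # pad the output with a few of the buffered lines after it
        for (idx = 0; idx <= min_lines - printed; idx++)
            print arr[idx]
}' input.csv > output.csv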
thus obtaining the following reduced output CSV, which still has many lines, since the target line (with the highest POP) is located on the 26th line:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
How would it be possible to further customize my filter by modifying my AWK expression (or piping it to something else) so that it additionally considers only the lines whose negative value in the third column (dG) differs little from the first line (which has the most negative value)? For example, to consider only the lines that differ by no more than 20% in dG from the first line, while keeping all other conditions the same:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
Both tasks can be done in a single awk:
awk -F ', ' 'NR==1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR}; if (FNR==2) dg = $3 * .8; next} $3+0 == $3 && (FNR == n+1 || $3 > dg) {exit} 1' file file
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
To make it more readable:
awk -F ', ' '
NR == 1 {                    # skip the header on the 1st pass
    next
}
FNR == NR {                  # 1st pass
    if (max < $2) {
        max = $2             # track max(POP) ...
        n = FNR              # ... and the record number where it occurs
    }
    if (FNR == 2)
        dg = $3 * .8         # 80% of the 1st data line dG (the most negative)
    next
}
# 2nd pass: $3+0 == $3 is true only for numeric lines (i.e. not the header);
# stop right after the max(POP) record or once dG rises above the 20% window
$3 + 0 == $3 && (FNR == n+1 || $3 > dg) {
    exit
}
1' file file

AWK: multi-step filtering of data based on the selected column

I am dealing with the post-processing of a multi-column CSV arranged in a fixed format: the first column corresponds to the line number (ID), the second contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of this ID (always negative):
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600
In order to reduce the number of lines, I filtered this CSV with the aim of finding the line with the highest number in the second column (POP), using the following AWK expression:
# search the CSV for the line with the highest POP and save all lines before it, while keeping a minimal number of lines (3) in case this line is found at the beginning of the CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv
For the simple case, when the line with the maximum POP is located on the first line, the script saves this line (POP max) plus the 2 lines after it (= min_lines=3).
For the more complicated case, when the line with POP max is located in the middle of the CSV, the script detects this line plus all the preceding lines from the beginning of the CSV and lists them in the new CSV, keeping the original order. However, in that case output.csv contains too many lines, since the target line (with the highest POP) is located on the 26th line:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
In order to reduce the total number of lines to 3-5 in the output CSV, how would it be possible to customize my filter so that it saves only the lines with a minor difference (e.g. the values in the POP column should match POP > 0.5*max(POP)), comparing each line against the line with the biggest value in the POP column? Finally, I always need to keep the first line as well as the line with the maximal value in the output. So the AWK solution should filter the multi-line CSV in the following manner (please ignore the comments after #):
ID, POP, dG
1, 7, -9.6000
9, 16, -7.8100
26, 20, -6.6500 # this is POP max detected over all lines
This 2-phase awk should work for you:
awk -F ', ' -v n=2 'NR == 1 {next}
FNR==NR { if (max < $2) {max=$2; if (FNR==n) n++} next}
FNR <= n || $2 > (.5 * max)' file file
ID, POP, dG
1, 7, -9.6000
9, 16, -7.8100
26, 20, -6.6500

Boolean check if datetime.now() is between the values of any tuple in a list of tuples

I have a list of tuples looking like this:
import datetime as dt
hours = [(dt.datetime(2019,3,9,23,0), dt.datetime(2019,3,10,22,0)),
(dt.datetime(2019,3,10,23,0), dt.datetime(2019,3,11,22,0))]
The list has a variable length and I just need a boolean if datetime.now() is between the first and second element of any tuple in the list.
In NumPy I would do:
((start <= now) & (end >= now)).any()
What is the most efficient, Pythonic way to do this? Sorry about the beginner's question.
this works but I don't like the len():
from itertools import takewhile
len(list(takewhile(lambda x: x[0] <= now and now <= x[1], hours ))) > 0
any better suggestions?
any(map(lambda d: d[0] <= now <= d[1], hours))
any: Logical OR across all elements
map: runs a function on every element of the list
As @steff pointed out, map is redundant because we can iterate over the list directly:
any(d[0] <= now <= d[1] for d in hours)
It would be even better if we could avoid indexing into the tuple and use tuple unpacking somehow (this was the reason I started with map).
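For example, unpacking each pair directly in the generator expression avoids the indexing:
any(start <= now <= end for start, end in hours)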
A more verbose alternative (but more readable in my eyes):
import datetime as dt
def in_time_ranges(ranges):
    now = dt.datetime.now()
    return any([r for r in ranges if r[0] <= now and r[1] >= now])
ranges1 = [(dt.datetime(2019, 3, 9, 23, 0), dt.datetime(2019, 3, 10, 22, 0)),
(dt.datetime(2019, 3, 10, 23, 0), dt.datetime(2019, 3, 11, 22, 0)),
(dt.datetime(2019, 4, 10, 23, 0), dt.datetime(2019, 5, 11, 22, 0))]
print(in_time_ranges(ranges1))
ranges2 = [(dt.datetime(2017, 3, 9, 14, 0), dt.datetime(2018, 3, 10, 22, 0)),
(dt.datetime(2018, 3, 10, 23, 0), dt.datetime(2018, 3, 11, 22, 0)),
(dt.datetime(2018, 4, 10, 23, 0), dt.datetime(2018, 5, 11, 22, 0))]
print(in_time_ranges(ranges2))
Output
True
False

Concatenate two-line header by column

I have a CSV file with a header consisting of two lines:
A, A, B, B, B
a, b, c, d, e
1, 2, 3, 4, 5
2, 3, 4, 5, 6
I'd like to concatenate the header to this form:
A_a, A_b, B_c, B_d, B_e
1, 2, 3, 4, 5
2, 3, 4, 5, 6
How to achieve that in command-line, using bash, sed, etc.?
Awk solution:
awk 'BEGIN{ FS = OFS = ", " }
NR == 1{ split($0, a, ", "); next }
NR == 2{ for(i=1; i <= NF; i++) $i = a[i]"_"$i }1' file
The output:
A_a, A_b, B_c, B_d, B_e
1, 2, 3, 4, 5
2, 3, 4, 5, 6
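The same awk, annotated (behaviour unchanged):
awk 'BEGIN{ FS = OFS = ", " }                            # fields separated by ", " on input and output
NR == 1 { split($0, a, ", "); next }                     # store the 1st header line in array a
NR == 2 { for (i = 1; i <= NF; i++) $i = a[i] "_" $i }   # prefix each field of the 2nd header line
1' file                                                  # print every record from line 2 onwards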
a bash solution:
#!/bin/bash
argfile=$1
line1=($(sed -n '1s/,//gp' $argfile))   # words of the 1st header line, commas removed
line2=($(sed -n 2p $argfile))           # words of the 2nd header line (commas kept)
line12=()
for ((i=0; i<${#line1[*]}; i++))
do
    line12+=${line1[$i]}"_"${line2[$i]}" "   # glue the i-th words together with "_"
done
echo $line12                            # print the merged header
sed -n '3,$p' $argfile                  # print the remaining data lines unchanged
The output:
A_a, A_b, B_c, B_d, B_e
1, 2, 3, 4, 5
2, 3, 4, 5, 6
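Assuming the script is saved as, say, concat_header.sh (the name is arbitrary), it can be run as:
bash concat_header.sh file.csv > merged.csv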