How to remove rows with similar data to keep only highest value in a specific column (tsv file) with awk in bash?

I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns, the last of which represents a score. Some positions have multiple score entries, and I need to keep only the row with the highest value for each position.
For example, these positions have multiple entries for each combination:
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
The desired output would look like this:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Doing it in python/pandas is not feasible: the file is too large, or it takes too long. Therefore, I am looking for a solution using bash, in particular awk.
The input file has been sorted with the following command:
sort -t$'\t' -k1 -n -o sorted_file original_file
The command would basically need to:
compare the data from the first 4 columns in the sorted_file
if all of those are the same, then only the row with the highest value in column 5 should be printed to the output file.
I am not very familiar with awk syntax. I have seen relatively similar questions in other forums, but I was unable to adapt them to my particular case. I have tried to adapt one of those solutions like this:
awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv
However, the output file does not look like it should, at all.
Any help would be very much appreciated.

A more robust way is to sort so that the last field is in reverse numerical order and let awk pick the first value for each key. If your fields don't have spaces, there is no need to specify the delimiter.
$ sort -k1n -k5,5nr original_file | awk '!a[$1,$2,$3,$4]++' > max_value_file
As @Fravadona commented, since this stores the keys, it will have a large memory footprint if there are many unique records. One alternative is to delegate to uniq to pick the first record over repeated entries.
$ sort -k1n -k5,5nr original_file |
awk '{print $5,$1,$2,$3,$4}' |
uniq -f1 |
awk '{print $2,$3,$4,$5,$1}'
We change the order of the fields so that the value is skipped during the comparison, and change it back afterwards. This won't have any memory footprint in awk or uniq (aside from sort, which manages its own).
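As a small illustration of what uniq -f1 does at that point (a hypothetical two-record sample with the score already moved to the front): it skips the first field when comparing, so repeated keys collapse to their first, and therefore highest, entry.
$ printf '0.148 1 861265 C A\n0.071 1 861265 C A\n' | uniq -f1
0.148 1 861265 C A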
If you're not a purist, this should work the same as the previous one
$ sort -k1n -k5,5nr original_file | rev | uniq -f1 | rev
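Here rev simply reverses each line character by character, so the score ends up as the first "field", which uniq -f1 then ignores. A quick demo with a space-separated sample (tabs behave the same way):
$ echo '1 861265 C A 0.148' | rev
841.0 A C 562168 1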

It's not awk, but with Miller it's very easy and interesting:
mlr --tsv -N sort -f 1,2,3,4 -n 5 then top -f 5 -g 1,2,3,4 -a input.tsv >output.tsv
You will have
1 861265 C A 1 0.148
1 861265 C G 1 0.108
1 861265 C T 1 0.216
2 193456 G A 1 0.094
2 193456 G C 1 0.152
2 193456 G T 1 0.056

You can try this approach. It also works when the last column is not sorted; only the first 4 columns have to be grouped/sorted (a quick check of this is shown after the data below).
% awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
$5>=m{m=$5; line=$0}
{str=$1" "$2" "$3" "$4} END{print line}' file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Data
% cat file
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
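To sanity-check that the score column does not need to be pre-sorted, here is a quick test with the highest score placed first in each group (a minimal made-up sample piped straight into the same awk program):
% printf '1 861265 C A 0.148\n1 861265 C A 0.071\n1 861265 C G 0.108\n1 861265 C G 0.001\n' |
awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
$5>=m{m=$5; line=$0}
{str=$1" "$2" "$3" "$4} END{print line}'
1 861265 C A 0.148
1 861265 C G 0.108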

Assumptions/Understandings:
file is sorted by the first field
no guarantee on the ordering of fields #2, #3 and #4
must maintain the current row ordering (this would seem to rule out (re)sorting the file as we could lose the current row ordering)
the complete set of output rows for a given group will fit into memory (i.e., into the awk arrays)
General plan:
we'll call field #1 the group field; all rows with the same value in field #1 are considered part of the same group
for a given group we keep track of all output rows via the awk array arr[] (index will be a combo of fields #2, #3, #4)
we also keep track of the incoming row order via the awk array order[]
update arr[] if we see a value in field #5 that's higher than the previous value
when the group changes, flush the current contents of arr[] to stdout
One awk idea:
awk '
function flush() {                        # function to flush current group to stdout
    for (i=1; i<=seq; i++)
        print group,order[i],arr[order[i]]
    delete arr                            # reset arrays
    delete order
    seq=0                                 # reset index for order[] array
}
BEGIN { FS=OFS="\t" }
$1!=group { flush()
            group=$1
          }
          { key=$2 OFS $3 OFS $4
            if ( key in arr && $5 <= arr[key] )
                next
            if ( ! (key in arr) )
                order[++seq]=key
            arr[key]=$5
          }
END       { flush() }                     # flush last group to stdout
' input.dat
This generates:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056

Updated
Extract from the sort manual:
-k, --key=KEYDEF
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
It means that by using sort -t$'\t' -k1 -n like you did, the sort key runs from field 1 to the end of the line, so all the fields of the file have contributed to the numerical sorting.
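A minimal illustration of the difference with a tab-separated toy input (GNU sort assumed):
printf '1\tB\n1\tA\n' | sort -t$'\t' -k1       # key = field 1 to end of line: the second column breaks the tie, A comes first
printf '1\tB\n1\tA\n' | sort -t$'\t' -s -k1,1  # key = field 1 only; -s keeps tied lines in input order, B stays first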
Here's probably the fastest awk solution that makes use of the numerical sorting in ascending order:
awk '
BEGIN {
    FS = "\t"
    if ((getline line) > 0) {                  # consume the first record manually
        split(line, arr)
        prev_key = arr[1] FS arr[2] FS arr[4]  # field 3 is implied by fields 1-2, so it is left out of the key
        prev_line = line                       # getline var does not set $0, so save the variable itself
    }
}
{
    curr_key = $1 FS $2 FS $4
    if (curr_key != prev_key) {
        print prev_line                        # last line of the previous group = its highest score
        prev_key = curr_key
    }
    prev_line = $0
}
END {
    if (prev_key) print prev_line
}
' file.tsv
Note: As you're handling a file that has around 4 billion lines, I tried to keep the number of operations to a minimum. For example:
Saving 80 billion operations just by setting FS to "\t". Indeed, why would you allow awk to compare each character of the file with " " when you're dealing with a TSV?
Saving 4 billion comparisons by processing the first line with getline in the BEGIN block. Some people might say that it's safer/better/cleaner to use (NR == 1) and/or (NR > 1), but that would mean doing 2 comparisons per input line instead of 0.
It may be worth comparing the execution time of this code with @EdMorton's code, which uses the same algorithm without those optimisations. The disk speed will probably flatten the difference though ^^
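A hypothetical harness for that comparison could be as simple as the following (the file name fravadona.awk is made up; tst.awk is the name used in the answer below):
time awk -f fravadona.awk file.tsv > /dev/null   # this answer, saved to a file
time awk -f tst.awk file.tsv > /dev/null         # the answer below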

Assuming your real input is sorted by key then ascending value the same way as your example is:
$ cat tst.awk
{ key = $1 FS $2 FS $3 FS $4 }
key != prevKey {
    if ( NR > 1 ) {
        print prevRec
    }
    prevKey = key
}
{ prevRec = $0 }
END {
    print prevRec
}
$ awk -f tst.awk file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
If your data isn't already sorted then just sort it first:
sort file | awk ..
That way only sort has to handle the whole file at once, and it's designed to do so by using demand paging, etc., so it's far less likely to run out of memory than reading the whole file into awk or python or any other tool.
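For this particular file, a sort invocation that satisfies the assumption above (rows grouped by the first four columns, scores ascending within each group) might look like the sketch below; the tab delimiter and field numbers are taken from the question, so adjust as needed:
# group by columns 1-4 with the numeric score ascending within each group, then keep the last row per group
sort -t$'\t' -k1,1n -k2,2n -k3,3 -k4,4 -k5,5n file | awk -f tst.awk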

With sort and awk:
sort -t$'\t' -k1,1n -k4,4 -k5,5rn file | awk 'BEGIN{FS=OFS="\t"} !seen[$1,$4]++'
Prints:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
This assumes the 'group' is defined as column 1.
It works by sorting first on column 1 (the group), then on column 4 (each letter), then by a reverse numeric sort on column 5.
The awk then prints the first row seen for each group/letter pair, which will be the maximum thanks to that sort.
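If the group should instead cover all four leading columns (in the question's sample, column 2 is the position and varies too), a hedged variant of the same idea keyed on columns 1-4 would be:
# sort on the full 4-column key with the highest score first, then keep the first row per key
sort -t$'\t' -k1,1n -k2,2n -k3,3 -k4,4 -k5,5rn file | awk 'BEGIN{FS=OFS="\t"} !seen[$1,$2,$3,$4]++'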

Related

Creating new column in linux

This is my input file
chr SNP position gene effect_allele other_allele eaf beta se pval I2
1 rs143038218 11388792 UBIAD1 T A 0.992 0.54 0.087 6.00E-10 0
1 rs945211 32191798 ADGRB2 G C 0.383 -0.073 0.013 3.00E-08 0
I want to create a new column called CHR:POS_A1_A2 using 1st, 3rd, 5th and 6th columns respectively.
I'm using the following script:
awk -v OFS='\t' '{ if(NR==1) { print "chr\tSNP\tposition\tgene\teffect_allele\tother_allele\teaf\tbeta\tse\tpval\tI2\tCHR:POS_A1_A2" } else { print $0, $1":"$3"_"$5"_"$6 } }' IOP_2018_133_SNPs.txt > IOP_2018_133_SNPs_CPA1A2.txt
My output is as follows:
==> IOP_2018_133_SNPs_CPA1A2.txt <==
chr SNP position gene effect_allele other_allele eaf beta se pval I2 CHR:POS_A1_A2
1 1:11388792_T_A 11388792 UBIAD1 T A 0.992 0.54 0.087 6.00E-10 0
1 1:32191798_G_C 32191798 ADGRB2 G C 0.383 -0.073 0.013 3.00E-08 0
For some reason the second column 'SNP' is missing in my output file. How can I address this issue? TIA

awk extract rows with N columns

I have a tsv file with a varying number of columns
1 123 123 a b c
1 123 b c
1 345 345 a b c
I would like to extract only rows with 6 columns
1 123 123 a b c
1 345 345 a b c
How can I do that in bash (awk, sed or something else)?
Using Awk
$ awk -F'\t' 'NF==6' file
1 123 123 a b c
1 345 345 a b c
FYI, most of the existing solutions have one potential pitfall:
printf '1\t2\t3\t4\t5\t\n' |
mawk '$!NF = "\n\n\t NF == "( NF ) \
" :\f\b<( "( $_ )" )>\n\n"' FS='\11'
NF == 6 :
<( 1 2 3 4 5 )>
If the input file happens to have a trailing tab \t, the line would still be reported by awk as having an NF count of 6. Whether this test case line actually has 5 columns or 6 in the logical sense is open to interpretation.
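If you prefer to treat such a line as having only 5 logical columns, one hedged tweak is to also require the last field to be non-empty:
# require 6 fields and a non-empty 6th field, so a trailing tab is not counted as a column
awk -F'\t' 'NF==6 && $6!=""' file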
Using GNU sed. Let file.txt content be
1 123 123 a b c
1 123 b c
1 345 345 a b c
1 777 777 a b c d
then
sed -n '/^[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$/p' file.txt
gives output
1 123 123 a b c
1 345 345 a b c
Explanation: -n turns off default printing; the sole action is to print (p) lines matching the pattern, which is anchored at the beginning (^) and end ($) and consists of 6 columns of non-TABs separated by single TABs. This code uses only very basic sed features but, as you might observe, is longer than the AWK solution and not as easy to adjust for a different N.
(tested in GNU sed 4.2.2)
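As a hedged middle ground, an extended regex with a repetition count makes N easier to adjust (GNU sed assumed, since it understands \t here just like the command above):
# {5} tab-terminated fields plus one final field = exactly 6 columns; change {5} for a different N
sed -nE '/^([^\t]*\t){5}[^\t]*$/p' file.txt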
This might work for you (GNU sed):
sed -nE 's/\S+/&/6p' file
This will print lines with 6 or more fields.
sed -nE 's/\S+/&/6;T;s//&/7;t;p' file
This will print lines with only 6 fields.

Search for a string and return its value in MySQL

I have a text column whose data is automatically populated from a machine. Below is the data format which is populated in the database. The format of the data is almost the same for all the records I have in this table.
==========================
=== S U P E R S O N Y ===
========================
START AT 12:16:29A
ON 02-18-19
MACHINE COUNT 1051
OPERATOR ______________
SERIAL # 0777218-15
V=inHg
- TIME T=F P=psig
------------------------
D 12:16:31A 104.6 0.0P
D 12:16:41A 104.1 0.0P
D 12:26:41A 167.2 28.7V
D 12:31:41A 108.1 28.5V
MACHINE VALUE IS:
1.5 mg/min
L 12:41:41A 95.1 28.4V
L 12:43:54A 97.2 1.9V
Z 12:45:23A 97.5 0.0P
========================
= CHECK COMPLETE =
========================
I need to find the exact value after "MACHINE VALUE IS:" and before the "mg/min" word. In the above case, the query must return "1.5". The query I have written is failing because of some spaces after the "MACHINE VALUE IS:" text.
SELECT REPLACE(REPLACE(SUBSTRING(contents,
LOCATE('IS:', contents), 10),'IS:', ''),'mg','') as value from machine_content
This will get you the number behind IS:. Of course, in your case you must change '\n' to whatever breaks the line, e.g. CR LF.
SET @sql := '==========================
=== S U P E R S O N Y ===
========================
START AT 12:16:29A
ON 02-18-19
MACHINE COUNT 1051
OPERATOR ______________
SERIAL # 0777218-15
V=inHg
- TIME T=F P=psig
------------------------
D 12:16:31A 104.6 0.0P
D 12:16:41A 104.1 0.0P
D 12:26:41A 167.2 28.7V
D 12:31:41A 108.1 28.5V
MACHINE VALUE IS:
1.5 mg/min
L 12:41:41A 95.1 28.4V
L 12:43:54A 97.2 1.9V
Z 12:45:23A 97.5 0.0P
========================
= CHECK COMPLETE =
========================
'
SELECT REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(@sql, 'IS:', -1), 'mg', 1),'\n','')
| REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(@sql, 'IS:', -1), 'mg', 1),'\n','') |
| :-------------------------------------------------------------------------- |
| 1.5 |
db<>fiddle here

SED remove everything between 2 instances of a character

I have a database dump with approx. 6.0000 lines.
They all look like this:
{"student”:”12345”,”achieved_date":1576018800,"expiration_date":1648677600,"course_code”:”SOMECODE,”certificate”:”STRING WITH A LOT OF CHARACTERS”,”certificate_code”:”ABCDE,”certificate_date":1546297200}
"STRING WITH A LOT OF CHARACTERS" is a string with around 600.000 characters (!)
I need those characters on each line removed... I tried with:
sed 's/certificate\":\"*","certificate_code//'
But it seems it did not do the trick.
I also couldn't find an answer to work with here, so I'm reaching out to you, hoping you can help me. Is this best done with sed, or with some other method?
For now I don't care whether all the characters of "STRING WITH A LOT OF CHARACTERS" are removed or replaced by, e.g., a 0; even that would make it workable for me ;)
The output for od -xc filename | head is:
0000000 2d2d 4d20 5379 4c51 6420 6d75 2070 3031
- - M y S Q L d u m p 1 0
0000020 312e 2033 4420 7369 7274 6269 3520 372e
. 1 3 D i s t r i b 5 . 7
0000040 322e 2c39 6620 726f 4c20 6e69 7875 2820
. 2 9 , f o r L i n u x (
0000060 3878 5f36 3436 0a29 2d2d 2d0a 202d 6f48
x 8 6 _ 6 4 ) \n - - \n - - H o
0000100 7473 203a 3231 2e37 2e30 2e30 2031 2020
s t : 1 2 7 . 0 . 0 . 1
hope you can help me!
When I do the od command on the sample text you've supplied, the output includes:
0000520 454d 4f43 4544 e22c 9d80 6563 7472 6669
M E C O D E , ” ** ** c e r t i f
0000540 6369 7461 e265 9d80 e23a 9d80 5453 4952
i c a t e ” ** ** : ” ** ** S T R I
0000560 474e 5720 5449 2048 2041 4f4c 2054 464f
N G W I T H A L O T O F
0000600 4320 4148 4152 5443 5245 e253 9d80 e22c
C H A R A C T E R S ” ** ** , ”
0000620 9d80 6563 7472 6669 6369 7461 5f65 6f63
** ** c e r t i f i c a t e _ c o
0000640 6564 80e2 3a9d 80e2 419d 4342 4544 e22c
d e ” ** ** : ” ** ** A B C D E , ”
So you can see the "quotes" are the byte sequences e2 80 9d, which is unicode U+201d (see https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128 )
Probably the simplest would be to simply skip over these unicode characters with the single-character wildcard '.':
sed "s/certificate.:.*.certificate_code/certificate_code/"
Unfortunately, sed doesn't appear to accept the unicode \u201d syntax, so some other answers suggest using the hex sequence (\xe2\x80\x9d) - e.g. Escaping double quotation marks in sed (but unfortunately I haven't got that to work just yet, and I have to sign off now).
This answer explains why it could have happened, with some remedial action if that's possible in your situation : Unknown UTF-8 code units closing double quotes
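For what it's worth, the GNU sed manual does document \xHH escapes, so an attempt along these lines might work (untested against the real dump; the bash $'...' trick in the next answer may be the more reliable route):
# \xe2\x80\x9d is the UTF-8 byte sequence of the closing quote U+201D
sed 's/certificate\xe2\x80\x9d:\xe2\x80\x9d.*\xe2\x80\x9d,\xe2\x80\x9dcertificate_code/certificate_code/' file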
If you are working with bash, would you please try the following:
q=$'\xe2\x80\x9d'
sed "s/certificate${q}:${q}.*${q},${q}certificate_code//" file
Result:
{"student”:”12345”,”achieved_date":1576018800,"expiration_date":1648677600,"course_code”:”SOMECODE,””:”ABCDE,”certificate_date":1546297200}

Identify the empty line in text file and loop over with that list in tcl

I have a file which has the following kind of data
A 1 2 3
B 2 2 2

c 2 4 5

d 4 5 6
From the above file I want to execute a loop with three iterations, where the first iteration will have the A and B elements, the second iteration the c element, and the third the d element, so that my HTML table will look like
Week1 | week2 | week3
----------------------------
A 1 2 3 | c 2 4 5 | d 4 5 6
B 2 2 2
I found this on SO: catch multiple empty lines in file in tcl, but it's not exactly what I want.
I would suggest using arrays:
# Counter
set week 1
# Create file channel
set file [open filename.txt r]
# Read file contents line by line and store the line in the variable called $line
while {[gets $file line] != -1} {
    if {$line != ""} {
        # if line not empty, add line to current array element with counter $week
        lappend Week($week) $line
    } else {
        # else, increment week number
        incr week
    }
}
# close file channel
close $file
# print Week array
parray Week
# Week(1) = {A 1 2 3} {B 2 2 2}
# Week(2) = {c 2 4 5}
# Week(3) = {d 4 5 6}
ideone demo