Creating a new column in Linux - multiple-columns

This is my input file
chr SNP position gene effect_allele other_allele eaf beta se pval I2
1 rs143038218 11388792 UBIAD1 T A 0.992 0.54 0.087 6.00E-10 0
1 rs945211 32191798 ADGRB2 G C 0.383 -0.073 0.013 3.00E-08 0
I want to create a new column called CHR:POS_A1_A2 using the 1st, 3rd, 5th and 6th columns respectively.
I'm using the following script:
awk -v OFS='\t' '{ if(NR==1) { print "chr\tSNP\tposition\tgene\teffect_allele\tother_allele\teaf\tbeta\tse\tpval\tI2\tCHR:POS_A1_A2" } else { print $0, $1":"$3"_"$5"_"$6 } }' IOP_2018_133_SNPs.txt > IOP_2018_133_SNPs_CPA1A2.txt
My output is as follows:
==> IOP_2018_133_SNPs_CPA1A2.txt <==
chr SNP position gene effect_allele other_allele eaf beta se pval I2 CHR:POS_A1_A2
1 1:11388792_T_A 11388792 UBIAD1 T A 0.992 0.54 0.087 6.00E-10 0
1 1:32191798_G_C 32191798 ADGRB2 G C 0.383 -0.073 0.013 3.00E-08 0
For some reason the second column 'SNP' is missing in my output file. How can I address this issue? TIA.

Related

awk extract rows with N columns

I have a TSV file with a varying number of columns per row:
1 123 123 a b c
1 123 b c
1 345 345 a b c
I would like to extract only rows with 6 columns
1 123 123 a b c
1 345 345 a b c
How can I do that in bash (awk, sed or something else)?
Using Awk
$ awk -F'\t' 'NF==6' file
1 123 123 a b c
1 345 345 a b c
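If the required column count changes, a small variant of the above (a sketch using the same idea) passes the count in as a variable:
$ awk -F'\t' -v n=6 'NF==n' file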
FYI, most of the existing solutions have one potential pitfall:
$ printf '1\t2\t3\t4\t5\t\n' | awk -F'\t' '{ print "NF == " NF }'
NF == 6
If the input file happens to have a trailing tab (\t), awk will still report an NF count of 6. Whether such a line has 5 columns or 6 in the logical sense is open to interpretation.
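If you decide that a trailing empty field should not count, one possible guard (my own variant, not from the answers above) is to also require the last field to be non-empty:
$ awk -F'\t' 'NF==6 && $NF!=""' file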
Using GNU sed. Let file.txt content be
1 123 123 a b c
1 123 b c
1 345 345 a b c
1 777 777 a b c d
then
sed -n '/^[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$/p' file.txt
gives output
1 123 123 a b c
1 345 345 a b c
Explanation: -n turns off default printing; the sole action is to print (p) lines matching a pattern that is anchored at the beginning (^) and end ($) and consists of 6 columns of non-TABs separated by single TABs. This code uses only very basic sed features but, as you can see, it is longer than the AWK solution and not as easy to adjust for a different N.
(tested in GNU sed 4.2.2)
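If adjusting N matters, a shorter equivalent of the pattern above (my variant, still GNU sed) uses an interval, where the repeat count is one less than the number of columns:
sed -nE '/^([^\t]*\t){5}[^\t]*$/p' file.txt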
This might work for you (GNU sed):
sed -nE 's/\S+/&/6p' file
This will print lines with 6 or more fields.
sed -nE 's/\S+/&/6;T;s//&/7;t;p' file
This will print lines with only 6 fields.
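For readability, here is the same command spelled out as a multi-line script with comments (my reading of the logic; still GNU sed):
sed -nE '
# no-op substitution targeting the 6th whitespace-separated field
s/\S+/&/6
# T: if that failed (fewer than 6 fields), skip the rest for this line
T
# the empty pattern // reuses \S+ ; now try a 7th field
s//&/7
# t: if that succeeded (7 or more fields), skip the print
t
# exactly 6 fields: print the line
p
' file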

How to remove rows with similar data to keep only highest value in a specific column (tsv file) with awk in bash?

I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns. The last column represents a score. Some positions have multiple score entries, and I need to keep only the row with the highest value for each position.
For example, these positions have multiple entries for each combination:
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
The desired output would look like this:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Doing it in Python/pandas is not possible as the file is too large, or it takes too long. Therefore, I am looking for a solution using bash, in particular awk.
This input file has been sorted with the following command:
sort -t$'\t' -k1 -n -o sorted_file original_file
The command would basically need to:
compare the data from the first 4 columns in the sorted_file
if all of those are the same, then only the row with the highest value in column 5 should be printed to the output file.
I am not very familiar with awk syntax. I have seen relatively similar questions in other forums, but I was unable to adapt them to my particular case. I have tried to adapt one of those solutions like this:
awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv
However, the output file does not look like it should, at all.
Any help would be very much appreciated.
A more robust way is to sort so that rows with the same key are adjacent and the last field is in descending numerical order, then let awk pick the first value per key. If your fields don't contain spaces, there is no need to specify the delimiter.
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file | awk '!a[$1,$2,$3,$4]++' > max_value_file
As #Fravadona commented, since this stores the keys, it will have a large memory footprint if there are many unique records. One alternative is to delegate to uniq the job of picking the first record among repeated entries.
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file |
awk '{print $5,$1,$2,$3,$4}' |
uniq -f1 |
awk '{print $2,$3,$4,$5,$1}'
We change the order of the fields so that the value is skipped during comparison, and then change the order back afterwards. This has no memory footprint of its own (aside from sort, which manages its own).
If you're not a purist, this should work the same as the previous one
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file | rev | uniq -f1 | rev
It's not awk, but with Miller it is very easy and interesting:
mlr --tsv -N sort -f 1,2,3,4 -n 5 then top -f 5 -g 1,2,3,4 -a input.tsv >output.tsv
You will have
1 861265 C A 1 0.148
1 861265 C G 1 0.108
1 861265 C T 1 0.216
2 193456 G A 1 0.094
2 193456 G C 1 0.152
2 193456 G T 1 0.056
You can try this approach. It also works when the last column is not sorted; only the first 4 columns have to be sorted.
% awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
$5>=m{m=$5; line=$0}
{str=$1" "$2" "$3" "$4} END{print line}' file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Data
% cat file
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
Assumptions/Understandings:
file is sorted by the first field
no guarantee on the ordering of fields #2, #3 and #4
must maintain the current row ordering (this would seem to rule out (re)sorting the file as we could lose the current row ordering)
the complete set of output rows for a given group will fit into memory (aka the awk arrays)
General plan:
we'll call field #1 the group field; all rows with the same value in field #1 are considered part of the same group
for a given group we keep track of all output rows via the awk array arr[] (index will be a combo of fields #2, #3, #4)
we also keep track of the incoming row order via the awk array order[]
update arr[] if we see a value in field #5 that's higher than the previous value
when the group changes, flush the current contents of arr[] to stdout
One awk idea:
awk '
function flush() { # function to flush current group to stdout
for (i=1; i<=seq; i++)
print group,order[i],arr[order[i]]
delete arr # reset arrays
delete order
seq=0 # reset index for order[] array
}
BEGIN { FS=OFS="\t" }
$1!=group { flush()
group=$1
}
{ key=$2 OFS $3 OFS $4
if ( key in arr && $5 <= arr[key] )
next
if ( ! (key in arr) )
order[++seq]=key
arr[key]=$5
}
END { flush() } # flush last group to stdout
' input.dat
This generates:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Updated
Extract from the sort manual:
-k, --key=KEYDEF
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
This means that with sort -t$'\t' -k1 -n, as you used it, the key runs from field 1 to the end of the line, so all the fields of the file contributed to the numerical sorting.
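For comparison, restricting each key explicitly would look something like this (a sketch; I'm assuming the intended order is numeric on the first two fields, lexical on the alleles, and ascending numeric on the score):
sort -t$'\t' -k1,1n -k2,2n -k3,3 -k4,4 -k5,5n file.tsv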
Here's probably the fastest awk solution that makes use of the numerical sorting in ascending order:
awk '
BEGIN {
FS = "\t"
if ((getline line) > 0) {
split(line, arr)
prev_key = arr[1] FS arr[2] FS arr[4]
prev_line = line
}
}
{
curr_key = $1 FS $2 FS $4
if (curr_key != prev_key) {
print prev_line
prev_key = curr_key
}
prev_line = $0
}
END {
if (prev_key) print prev_line
}
' file.tsv
Note: As you're handling a file that has around 4 billion lines, I tried to keep the number of operations to a minimum. For example:
Saving 80 billion operations just by setting FS to "\t". Indeed, why would you let awk compare each character of the file with " " when you're dealing with a TSV?
Saving 4 billion comparisons by processing the first line with getline in the BEGIN block. Some people might say that it's safer/better/cleaner to use (NR == 1) and/or (NR > 1), but that would mean doing 2 comparisons per input line instead of 0.
It may be worth comparing the execution time of this code with #EdMorton's code, which uses the same algorithm without those optimisations. The disk speed will probably flatten the difference though ^^
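A simple way to run that comparison (the two script file names here are hypothetical; put each answer's awk program in its own file):
$ time awk -f fravadona.awk file.tsv > /dev/null
$ time awk -f edmorton.awk file.tsv > /dev/null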
Assuming your real input is sorted by key then ascending value the same way as your example is:
$ cat tst.awk
{ key = $1 FS $2 FS $3 FS $4 }
key != prevKey {
if ( NR > 1 ) {
print prevRec
}
prevKey = key
}
{ prevRec = $0 }
END {
print prevRec
}
$ awk -f tst.awk file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
if your data isn't already sorted then just sort it with:
sort file | awk ..
That way only sort has to handle the whole file at once, and it's designed to do so by using demand paging, etc., so it is far less likely to run out of memory than if you read the whole file into awk or python or any other tool.
With sort and awk:
sort -t$'\t' -k1,1n -k4,4 -k5,5rn file | awk 'BEGIN{FS=OFS="\t"} !seen[$1,$4]++'
Prints:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
This assumes the 'group' is defined as column 1.
It works by sorting first on column 1, then on column 4 (each letter), then by a reverse numeric sort on column 5.
The awk then prints the first (group, letter) combination seen, which will be the maximum based on the sort.
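If the group should really be the full first four columns, as in the question, a variant along the same lines (a sketch, not from the original answer) might be:
sort -t$'\t' -k1,1n -k2,2n -k3,3 -k4,4 -k5,5rn file | awk 'BEGIN{FS=OFS="\t"} !seen[$1,$2,$3,$4]++'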

How can I merge/join multiple columns from two dataframes, depending on a matching pattern

I would like to merge two dataframes based on matching values in the chromosome column. I made various attempts with R and bash, for example with "data.table", "tidyverse", and merge(). Could someone provide alternative solutions in R, bash, Python, Perl, etc.? I would like to merge based on the chromosome information and retain both the counts and the RXNs.
NOTE: These two DFs are not aligned and I am also curious what happens if some values are missing.
Thanks and Cheers:
DF1:
Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"
DF2:
Chromosome;Count1;Count2;Count3;Count4;Count5
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0
Expected Result:
Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
1009250;q9hxn4;5;0;0;17;0
1010820;p16256;152;7;0;11;4
31783;p16588;1;0;0;0;0
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
As bash was mentioned in the text body, I offer you an awk solution. The dataframes are in files df1 and df2:
$ awk '
BEGIN {
FS=OFS=";" # input and output field delimiters
}
NR==FNR { # process df1
a[$1]=$2 # hash to an array, 1st is the key, 2nd the value
next # process next record
}
{ # process df2
$2=(a[$1] OFS $2) # prepend RXN field to 2nd field of df2
}1' df1 df2 # 1 is output command, mind the file order
The last two lines could perhaps be written more clearly:
...
{
print $1,a[$1],$2,$3,$4,$5,$6
}' df1 df2
Output:
Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
1010820;p16256;152;7;0;11;4
1009250;q9hxn4;5;0;0;17;0
31783;p16588;1;0;0;0;0;0
Output will be in the order of df2. A chromosome present in df1 but not in df2 will not be included. A chromosome present in df2 but not in df1 will be output from df2 with an empty RXN field. Also, if there are duplicate chromosomes in df1, the last one is used. This can be fixed if it is an issue.
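For example, if you would rather keep the first RXN instead of the last for duplicate chromosomes in df1 (an assumption about the desired behaviour), the df1 block could be changed to:
NR==FNR {            # process df1
    if (!($1 in a))  # keep only the first RXN seen for each chromosome
        a[$1]=$2
    next             # process next record
}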
If I understand your request correctly, this should do it in Python. I've made the Chromosome column into the index of each DataFrame.
import pandas as pd
from io import StringIO
txt1 = '''Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"'''
txt2 = """Chromosome;Count1;Count2;Count3;Count4;Count5;Count6
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0"""
df1 = pd.read_csv(
    StringIO(txt1),
    sep=';',
    index_col=0,
    header=0
)
df2 = pd.read_csv(
    StringIO(txt2),
    sep=';',
    index_col=0,
    header=0
)
DF1:
RXN ID
Chromosome
1009250 q9hxn4 NaN
1010820 p16256 NaN
31783 p16588 PNTOt4;PNTOt4pp
203 3-DEHYDROQUINATE-DEHYDRATASE-RXN DHQTi;DQDH
DF2:
Count1 Count2 Count3 Count4 Count5 Count6
Chromosome
203 1 31 1 0 0 0.0
1010820 152 7 0 11 4 NaN
1009250 5 0 0 17 0 NaN
31783 1 0 0 0 0 0.0
result = pd.concat(
    [df1.sort_index(), df2.sort_index()],
    axis=1
)
print(result)
RXN ID Count1 Count2 Count3 Count4 Count5 Count6
Chromosome
203 3-DEHYDROQUINATE-DEHYDRATASE-RXN DHQTi;DQDH 1 31 1 0 0 0.0
31783 p16588 PNTOt4;PNTOt4pp 1 0 0 0 0 0.0
1009250 q9hxn4 NaN 5 0 0 17 0 NaN
1010820 p16256 NaN 152 7 0 11 4 NaN
The concat command also handles mismatched indices by simply filling in NaN values for columns in e.g. df1 if df2 doesn't have the same index, and vice versa.

which post-hoc test after welch-anova

I'm doing the statistical evaluation for my master's thesis. The Levene test was significant, so I did the Welch ANOVA, which was also significant. Now I have tried the Games-Howell post hoc test, but it didn't work.
Can anybody send me the exact functions I have to run in R to do the Games-Howell post hoc test and to get some kind of compact letter display showing which treatments are not significantly different from each other? I also wanted to ask whether I did the Welch ANOVA the right way (you can find the R output below).
Here is the output of what I have done so far for the statistical evaluation:
'data.frame': 30 obs. of 3 variables:
$ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
$ WH : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...
> leveneTest(TSO2~Dauer, data=TSO2R)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 3.3491 0.01956 *
24
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> oneway.test(TSO2 ~ Dauer, data=TSO2R, var.equal = FALSE) ### Welch ANOVA
One-way analysis of means (not assuming equal variances)
data: TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
Thank you very much!

How do I read an HTML table with mismatching columns and headers?

An HTML table body has one column more than is defined in the table header. This leads to the last column being skipped and, of course, a column mismatch. How can I add the additional column to the resulting data.frame/table in R while reading in the HTML table with the htmltab package? Obviously, post-processing does not help.
Here is an example:
code
install.packages("htmltab")
library(htmltab)
bu<- 0
bu <- data.table("Pl.", "Mannschaft", "Kurzname" , "Spiele", "G.", "U.", "V.", "Tore", "Diff.", "Pkt.")
#https://www.bundesliga-prognose.de/1/2009/1/
url <- "https://www.bundesliga-prognose.de/1/2009/1/"
bu <- htmltab(doc = url, column=10,columnnames=c ("Pl." , "Mannschaft", "Kurzname" , "Spiele", "G.", "U.", "V.", "Tore", "Diff.", "Pkt."), which = "//th[text() = 'Pl.']/ancestor::table")
bu <- data.table(bu)
head(bu)
This results in
Pl. Mannschaft Spiele G. U. V. Tore Diff. Pkt.
1: 1. VfL Wolfsburg Wolfsburg 1 1 0 0 2:0 2
2: 2. Eintracht Frankfurt E. Frankfurt 1 1 0 0 3:2 1
3: 3. FC Schalke 04 FC Schalke 04 1 1 0 0 2:1 1
4: 4. Borussia Dortmund B. Dortmund 1 1 0 0 1:0 1
5: NA Hertha BSC Berlin H. BSC Berlin 1 1 0 0 1:0 1
6: 6. Bor. Mönchengladbach M´gladbach 1 0 1 0 3:3 0
As the short name ("Kurzname") is not specified in the header, the short name is displayed under the games ("Spiele") column and so on, so the last column is skipped. How can I add the additional short-name ("Kurzname") column while reading the header using the htmltab package?
In addition, I would like to replace the NA in row 5 with the row id/number, ideally also using the htmltab package.
This does indeed seem to be a problem for htmltab. The only solution I have found is to read the tbody of the table directly. You then need to add the header manually.
htmltab(doc = url, which = "//table[2]/tbody")
With that help I found a quite simple solution:
specify that the header should be skipped (header = 0)
list/define all columns via colNames
url <- "https://www.bundesliga-prognose.de/1/2007/5/"
sp_2007_5<- htmltab(doc = url, which = "//table[1]/tbody", header = 0 , colNames = c("Datum" , "Anpfiff", "Heim" , "Heim_Kurzname","Gast", "Gast_Kurzname","Ergebnis", "Prognose"), rm_nodata_cols = F,encoding = "UTF-8")
head(sp_2007_5)