Singularity container - DeepVariant binding directories to $PATH?

I am trying to use a DeepVariant Singularity container on my HPC.
However, there is something wrong with the way I am using the container and where it is binding.
This is my code:
#!/bin/bash --login
#SBATCH -J AmyHouseman_deepvariant
#SBATCH -o %x.stdout.%J.%N
#SBATCH -e %x.stderr.%J.%N
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH -p c_compute_wgp
#SBATCH --account=scw1581
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=HousemanA@cardiff.ac.uk # Where to send mail
#SBATCH --array=1-33
#SBATCH --time=02:00:00
#SBATCH --time=072:00:00
#SBATCH --mem-per-cpu=32GB
module purge
module load singularity
module load parallel
# Set bash error trapping to exit on first error.
set -eu
WDPATH=/scratch/$USER/$SLURM_ARRAY_JOB_ID
CONTAINER_FILE=deepvariant_1.3.0.sif
MY_CONTAINER_PATH=/scratch/$USER/containers
CONTAINER=$MY_CONTAINER_PATH/$CONTAINER_FILE
if [ "$SLURM_ARRAY_TASK_ID" == "1" ]
then
mkdir -p $MY_CONTAINER_PATH
[ -f "$CONTAINER" ] || ssh cl1 wget -O $CONTAINER https://wotan.cardiff.ac.uk/containers/$CONTAINER_FILE
mkdir $WDPATH
fi
while [ ! -d $WDPATH ]
do
sleep 10
done
cd /scratch/c.c21087028/
sed -n "${SLURM_ARRAY_TASK_ID}p" Polyposis_Exome_Analysis/fastp/All_fastp_input/List_of_33_exome_IDs | parallel -j 1 "singularity run $CONTAINER --model_type=WES \
-ref=Polyposis_Exome_Analysis/bwa/index/HumanRefSeq/GRCh38_latest_genomic.fna \
--reads=Polyposis_Exome_Analysis/samtools/index/indexed_picardbamfiles/{}PE_markedduplicates.bam \
--output_vcf=Polyposis_Exome_Analysis/deepvariant/vcf/{}PE_output.vcf.gz \
--output_gvcf=Polyposis_Exome_Analysis/deepvariant/gvcf/{}PE_output.vcf.gz \
--intermediate_results_dir=Polyposis_Exome_Analysis/deepvariant/intermediateresults/{}PE_output_intermediate"
The error message I get is:
set: invalid option: "--"
FATAL: "--model_type": executable file not found in $PATH
I find it a bit strange because I haven't had a problem with any of the other Singularity containers I've used before, so I'm not really sure how to proceed. I've tried adding in the bind from their manual, singularity run -B /usr/lib/locale/:/usr/lib/locale/,
but I'm still unsure why I would need this step when I haven't needed it previously for my other tools like bwa and samtools.
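For reference, the pattern in the DeepVariant quick-start looks roughly like the sketch below (the explicit /opt/deepvariant/bin/run_deepvariant entrypoint and the locale bind are taken from their docs; the paths here are placeholders, not my real ones):
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
    "$CONTAINER" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WES \
    --ref=/path/to/reference.fna \
    --reads=/path/to/sample.bam \
    --output_vcf=/path/to/output.vcf.gz \
    --output_gvcf=/path/to/output.g.vcf.gz \
    --intermediate_results_dir=/path/to/intermediate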
I also know I can get rid of the parallel bits, as it's not actually running in parallel, so I am aware of that.
I hope this makes sense!
Thank you!
Amy

Related

tshark: do not reassemble TCP fragments into large packets

I have a simple pcap with some web traffic and am using tshark to obtain some header information from it:
I use the following command:
tshark -r ./capture-1-5 -Y "http2" -o tls.keylog_file:ssl-key.log \
-T fields -e frame.number -e _ws.col.Time -e ip.src -e tcp.srcport \
-e ip.dst -e tcp.dstport -e _ws.col.Protocol -e frame.len \
-e _ws.col.Info -E header=y -E separator="," -E quote=d \
-E occurrence=f > desegmented.csv
I realized that in this case all fragments are reassembled resulting in huge packets. However, I do not want reassembled packets. So, I add an extra option to tshark:
tshark -r ./capture-1-5 -Y "http2" -o tls.keylog_file:ssl-key.log \
-T fields -e frame.number -e _ws.col.Time -e ip.src -e tcp.srcport \
-e ip.dst -e tcp.dstport -e _ws.col.Protocol -e frame.len \
-e _ws.col.Info -E header=y -E separator="," -E quote=d \
-E occurrence=f -o tcp.desegment_tcp_streams:FALSE > segmented.csv
My intuition is that the resulting segmented.csv file should be larger and contain more rows, given that the "packets above the MTU" will be shown as more than one packet.
However, I observe the opposite. The resulting file without reassembly is smaller and has almost half the number of rows.
-rw-r--r-- 1 root root 210K May 18 18:21 desegmented.csv
-rw-r--r-- 1 root root 97K May 18 18:21 segmented.csv
# cat desegmented.csv |wc -l
2635
# cat segmented.csv |wc -l
1233
Is this normal behavior? Inspecting manually, I can't see where (or why) the packets start to disappear, nor any pattern, because of the two-way communication (packets are missing here and there).
I assume that maybe, in the segmented.csv case, every packet, or even the whole packet stream, that resulted in at least one packet above the MTU is dropped completely.
I also tried applying ip.defragment:FALSE, but the results are the same.
Thanks
To reproduce, the files can be downloaded from here
Thanks, @JimD., I have already come to a similar conclusion!
The packet capture itself has to be segmented to do this precisely.
So I tried to go one layer below and make the packet capture itself segmented via
ethtool -K eth0 gso off tso off gro off sg off tx off rx off
(just to make sure).
The problem is that the packet capturing is done in a Docker container, so I have to issue this command in several places for it to work fully.
These places include the docker0 bridge, eth0 inside the container, and the corresponding vethXXXXXX on the host; the second of these requires a privileged container, which poses further issues :)
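A rough sketch of what that ends up looking like (interface names such as vethXXXXXX are placeholders; the in-container command needs a privileged container or NET_ADMIN):
# on the host: the docker0 bridge and the container's veth end
ethtool -K docker0 gso off tso off gro off sg off tx off rx off
ethtool -K vethXXXXXX gso off tso off gro off sg off tx off rx off
# inside the container itself
ethtool -K eth0 gso off tso off gro off sg off tx off rx off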

How to pass an argument in the sbatch command line?

I would like to pass an argument into the sbatch command line.
RHO_COR.sh
#!/bin/bash
#SBATCH -o job-%A_task.out
#SBATCH --job-name=paral_cor
#SBATCH --partition=normal
#SBATCH --time=1-00:00:00
#SBATCH --mem=200G
#SBATCH --cpus-per-task=16
#SBATCH --array=1-10
#Set up whatever package we need to run with
module load gcc/8.1.0 openblas/0.3.3 R
# SET UP DIRECTORIES
OUTPUT="$HOME"/PROJET_M2/data/$(date +"%Y%m%d")_parallel_nodes_test
mkdir -p "$OUTPUT"
export FILENAME="$HOME"/vipailler/PROJET_M2/bin/RHO_COR.R
subset=$((SLURM_ARRAY_TASK_ID))
file="$HOME"/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv
#Run the program
echo "Start job :"`date` >> "$OUTPUT"/"$SLURM_ARRAY_TASK_ID".txt
echo "Start job :"`date`
echo PWD $PWD
Rscript $FILENAME --file $file --subset $subset > "$OUTPUT"/"$SLURM_ARRAY_TASK_ID"
wait
echo "Stop job : "`date` >> "$OUTPUT"/"$SLURM_ARRAY_TASK_ID".txt
echo "Stop job : "`date`
I execute this code with :
sbatch --partition normal --array 1-10 RHO_COR.sh
What I would like is to pass the file as an argument on the command line above, something like sbatch --partition normal --array 1-10 --file name_of_my_file RHO_COR.sh
I don't want to specify the name of my file in the Slurm script, but on the sbatch command line, so that I never have to change the Slurm script.
Thanks
You can pass an argument after the script, just as if you were running it directly in the shell, like this:
sbatch --partition normal --array 1-10 RHO_COR.sh name_of_my_file
And then the argument will be available inside the shell script as $1
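For example, inside RHO_COR.sh you could replace the hard-coded path with the first positional argument (a sketch reusing the variable names from the script above):
# take the input file from the first argument given after the script name
file=$1
Rscript $FILENAME --file "$file" --subset "$subset" > "$OUTPUT"/"$SLURM_ARRAY_TASK_ID"
and then submit with:
sbatch --partition normal --array 1-10 RHO_COR.sh "$HOME"/PROJET_M2/raw/truelength2.prok2.uniref2.rares.tsv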

Setting the SGE cluster job name with Snakemake while using DRMAA?

Problem
I'm not sure if the -N argument is being applied (SGE cluster). Everything works except for the -N argument: Snakemake requires a valid -N call, but it doesn't set the job name properly and always reverts to the default name. This is my call, which gives the same results with or without the -N argument.
snakemake --jobs 100 --drmaa "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1 -N {rule}.{wildcards}.varScan"
The only way I have found to influence the job name is to use --jobname.
snakemake --jobs 100 --drmaa "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1 -N {rule}.{wildcards}.varScan" --jobname "{rule}.{wildcards}.{jobid}"
Background
I've tried a variety of things. Usually I just use a cluster configuration file, but that isn't working either, which is why in the call above I ditched the config file, to make sure it's the '-N' option that isn't being applied.
My usual call is:
snakemake --drmaa "{cluster.clusterSpec}" --jobs 10 --cluster-config input/config.json
1) If I use '-n' instead of '-N', I receive a workflow error:
drmaa.errors.DeniedByDrmException: code 17: ERROR! invalid option argument "-n"
2) If I use '-N', but give it an incorrect wildcard, say {rule.name}:
AttributeError: 'str' object has no attribute 'name'
3) I cannot use both --drmaa AND --cluster:
snakemake: error: argument --cluster/-c: not allowed with argument --drmaa
4) If I specify the {jobid} in the config.json file, then Snakemake doesn't know what to do with it.
RuleException in line 13 of /extscratch/clc/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/modules/mpileup/mpileupSPLIT:
NameError: The name 'jobid' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
EDIT Added #5 w/ Solution
5) I can set the job name using the config.json and just concatenate the jobid on afterwards in my snakemake call. That way I have a generic snakemake call (--jobname "{cluster.jobName}.{jobid}"), and a highly configurable and specific job name ({rule}-{wildcards.sampleMPUS}_chr{wildcards.chrMPUS}) which results in:
mpileupSPLIT-Pfeiffer_chr19.1.e7152298
The 1 is the Snakemake jobid according to the DAG.
The 7152298 is my cluster's job number.
2nd EDIT - Just tried v3.12, same thing. The concatenation must occur in the snakemake call.
Alternative solution
I would also be okay with something like this:
snakemake --drmaa "{cluster.clusterSpec}" --jobname "{cluster.jobName}" --jobs 10 --cluster-config input/config.json
With my cluster file like this:
"mpileupSPLIT": {
"clusterSpec": "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1 -n {rule}.{wildcards}.varScan",
"jobName": "{rule}-{wildcards.sampleMPUS}_chr{wildcards.chrMPUS}.{jobid}"
}
Documentation Reviewed
I've read the documentation but I was unable to figure it out.
http://snakemake.readthedocs.io/en/latest/executable.html?-highlight=job_name#cluster-execution
http://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#snakefiles-cluster-configuration
https://groups.google.com/forum/#!topic/snakemake/whwYODy_I74
System
Snakemake v3.10.2 (Will try newest conda version tomorrow)
Red Hat Enterprise Linux Server release 5.4
SGE Cluster
Solution
1) Use '--jobname' in your snakemake call instead of '-N' in your qsub parameter submission.
2) Set up your cluster config file to have a targetable parameter for the job name suffix. In this case, these are the overrides for my Snakemake rule named "mpileupSPLIT":
"mpileupSPLIT": {
"clusterSpec": "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1",
"jobName": "{rule}-{wildcards.sampleMPUS}_chr{wildcards.chrMPUS}"
}
3) Utilize a generic Snakemake call which includes {jobid}. On a cluster (SGE), the 'jobid' variable contains both the Snakemake job number and the cluster job number; both are valuable, as the former corresponds to the Snakemake DAG and the latter is used for cluster logging. (E.g. --jobname "{cluster.jobName}.{jobid}") A full example call is sketched below.
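Putting it together, the generic call looks something like this (a sketch based on the cluster file above, with {jobid} appended in the call rather than in the config):
snakemake --drmaa "{cluster.clusterSpec}" \
          --jobname "{cluster.jobName}.{jobid}" \
          --jobs 10 \
          --cluster-config input/config.json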
EDIT Added solution to resolve post.

How to pass arguments from cmd to tcl script of ModelSim

I run ModelSim from the command line out of a Python program.
I use the following code, which calls a Tcl script that runs ModelSim:
os.system("vsim -c -do top_tb_simulate_reg.tcl " )
The Tcl script contains the following:
vsim -voptargs="+acc" +UVM_TESTNAME=test_name +UVM_MAX_QUIT_COUNT=1 +UVM_VERBOSITY=UVM_LOW \
-t 1ps -L unisims_verm -L generic_baseblocks_v2_1_0 -L axi_infrastructure_v1_1_0 \
-L dds_compiler_v6_0_12 -lib xil_defaultlib xil_defaultlib.girobo2_tb_top \
xil_defaultlib.glbl
I want the value of +UVM_TESTNAME to be an argument that I pass from the command line when I execute:
os.system("vsim -c -do top_tb_simulate_reg.tcl " )
How can I do it?
I tried the following with no success:
Python script:
os.system("vsim -c -do top_tb_simulate_reg.tcl axi_rd_only_test" )
Simulation file (tcl script)
vsim -voptargs="+acc" +UVM_TESTNAME=$argv +UVM_MAX_QUIT_COUNT=1 +UVM_VERBOSITY=UVM_LOW \
-t 1ps -L unisims_verm -L generic_baseblocks_v2_1_0 -L axi_infrastructure_v1_1_0 \
-L dds_compiler_v6_0_12 -lib xil_defaultlib xil_defaultlib.girobo2_tb_top \
xil_defaultlib.glbl
I got the following error:
# ** Error: (vsim-3170) Could not find 'C:/raft/raftortwo/girobo2/ver/sim/work.axi_rd_only_test'.
The problem is that the vsim binary is doing its own processing of the arguments, and that is interfering. While you can probably find a way around this by reading the vsim documentation, the simplest approach is to pass values via environment variables. They're inherited by a process from its parent process and are fine for passing most things. (The exception is security tokens, which should always be passed in files with correctly-set permissions rather than via environment variables or command-line arguments.)
In your python code:
# Store the value in the *inheritable* environment
os.environ["MY_TEST_CASE"] = "axi_rd_only_test"
# Do the call; the environment gets passed over behind the scenes
os.system("vsim -c -do top_tb_simulate_reg.tcl " )
In your tcl code:
# Read out of the inherited environment
set name $env(MY_TEST_CASE)
# Use it! (Could do this as one line, but that's hard to read)
vsim -voptargs="+acc" +UVM_TESTNAME=$name +UVM_MAX_QUIT_COUNT=1 +UVM_VERBOSITY=UVM_LOW \
-t 1ps -L unisims_verm -L generic_baseblocks_v2_1_0 -L axi_infrastructure_v1_1_0 \
-L dds_compiler_v6_0_12 -lib xil_defaultlib xil_defaultlib.girobo2_tb_top \
xil_defaultlib.glbl
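The same inheritance works if you launch vsim from a shell rather than from Python; a shell-only sketch of the same idea:
# export the value so the child vsim process (and its Tcl interpreter) inherits it
export MY_TEST_CASE=axi_rd_only_test
vsim -c -do top_tb_simulate_reg.tcl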
Late to the party, but I found a good workaround for this obstacle: the do command within ModelSim's Tcl instance does accept parameters (see the command reference).
vsim -c -do filename.tcl can't take parameters, but you can use vsim -c -do "do filename.tcl params".
In your case this translates to os.system('vsim -c -do "do top_tb_simulate_reg.tcl axi_rd_only_test"'). Your .tcl script will find the parameter passed through the variable $1.
I hope this helps someone!

Docker Error: container id followed by "command not found"

I'm having difficulty with a script I'm writing. The script is largely incomplete, but so far I expect it to be able to run containers successfully. When I execute the script I'm given an error with a container ID and "command not found". For example: ./wordpress: line 73: 3c0fba4984f3b70f0eb3f1c15a7b157f4862b9b243657a3d2f7141029fb6641a: command not found
The script I'm using is as follows:
#!/bin/bash
echo "Setting Constants"
MYSQL_ROOT_PASSWORD='password'
MYSQL_DATABASE='wordpress'
WORDPRESS_DB_PASSWORD='password'
WP_PORT='80'
DB_PORT='3306'
EPOCH=$(date +%s) # append EPOCH to container names for uniqueness
#FILE='blogcontainers' # filename containing container IDs
DB_CONTAINER_NAME="myblogdb$EPOCH"
WP_CONTAINER_NAME="myblog$EPOCH"
DB_IMG_NAME='blogdb' # MySQL Docker image
WP_IMG_NAME='blog' # WordPress Docker image
cd ~/myblog
WP_CID_FILE="$PWD/blog.cid"
DB_CID_FILE="$PWD/blogdb.cid"
if [ -f $DB_CID_FILE ]; then
DB_IMG_ID=$(sed -n '1p' $DB_CID_FILE)
else
echo "dbcid not found"
# set to baseline image
DB_IMG_ID="f09a5b2903dc"
fi
if [ -f $WP_CID_FILE ]; then
WP_IMG_ID=$(sed -n '1p' $WP_CID_FILE)
else
echo "wpcid not found"
# set to baseline image
WP_IMG_ID="a8d48bc2313d"
fi
DB_PATH='/var/lib/mysql' # standard MySQL path
WP_PATH='/var/www/html' # standard WordPress path
LOCAL_DB_PATH="/$PWD$DB_PATH"
LOCAL_WP_PATH="/$PWD$WP_PATH"
echo "Starting MySQL Container"
#DB_ID=
$(docker run \
-e MYSQL_ROOT_PASSWORD=$MYSQL_ROOT_PASSWORD \
-e MYSQL_DATABASE=$MYSQL_DATABASE \
-v $LOCAL_WP_PATH:$DB_PATH \
-v /$PWD/.bash_history:$WP_PATH \
--name $DB_CONTAINER_NAME \
-p $DB_PORT:3306 \
--cidfile $DB_CID_FILE \
-d \
$DB_IMG_ID)
echo "Starting WordPress Container"
#WP_ID=
$(docker run \
-e WORDPRESS_DB_PASSWORD=$WORDPRESS_DB_PASSWORD \
--link $DB_CONTAINER_NAME:$DB_IMG_NAME \
-p $WP_PORT:80 \
-v $LOCAL_WP_PATH:$WP_PATH \
-v /$PWD/.bash_history:/root/.bash_history \
--name $WP_CONTAINER_NAME \
--cidfile $WP_CID_FILE \
-d \
$WP_IMG_ID)
echo $WP_CONTAINER_NAME
echo $WP_IMG_ID
echo "reached end"
#echo $WP_ID > $FILE # copy WordPress container ID to file
#echo $DB_ID >> $FILE # append MySQL container ID to file
After executing the code there usually is a MySQL container instance running. For example:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4f2e9ab14c2e f09a5b2903dc "/entrypoint.sh mysql" 2 seconds ago Up 2 seconds 0.0.0.0:3306->3306/tcp myblogdb1449768739
Also, both blog.cid and blogdb.cid are created successfully containing container IDs.
$ cat blog.cid
e6005bcb4dba524b121d02b301fbe421d67d60986c55d554a0e20443df27ed18
$ cat blogdb.cid
4f2e9ab14c2ea5361557a3714477d7758c993af3b08bbc7db529282a41f90959
I've been troubleshooting and searching around for answers, but I think it's time to have another set of eyes take a look at it. As always, any input/criticism are welcome.
You are using $(docker run ...) instead of simply docker run .... The command substitution ($(...)) runs the command, captures its output, and expands to that output. With -d, docker run prints the new container's ID, so the shell then tries to execute that ID as a command, which is exactly the "container ID: command not found" error you are seeing. Either drop the $(...) or assign its result to a variable.
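A minimal sketch of the fix, reusing the variable names from the script above: either drop the $(...) entirely, or keep it and assign the captured ID to a variable, as the commented-out DB_ID= line suggests.
# with -d, docker run prints the new container's ID; capture it instead of executing it
DB_ID=$(docker run \
    -e MYSQL_ROOT_PASSWORD=$MYSQL_ROOT_PASSWORD \
    -e MYSQL_DATABASE=$MYSQL_DATABASE \
    --name $DB_CONTAINER_NAME \
    -p $DB_PORT:3306 \
    --cidfile $DB_CID_FILE \
    -d \
    $DB_IMG_ID)
echo "Started MySQL container $DB_ID"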