Hard-coded output without expansion in Snakefile

I have a Snakefile as follows:
SAMPLES, = glob_wildcards("data/{sample}_R1.fq.gz")

rule all:
    input:
        expand("samtools_sorted_out/{sample}.raw.snps.indels.g.vcf", sample=SAMPLES),
        expand("samtools_sorted_out/combined_gvcf")

rule combine_gvcf:
    input: "samtools_sorted_out/{sample}.raw.snps.indels.g.vcf"
    output: directory("samtools_sorted_out/combined_gvcf")
    params:
        gvcf_file_list="gvcf_files.list",
        gatk4="/storage/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
    shell: """
        java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
        -jar {params.gatk4} GenomicsDBImport \
        -V {params.gvcf_file_list} \
        --genomicsdb-workspace-path {output}
        """
When I test it with a dry run, I get this error:
RuleException in line 335 of /data/yifangt/exomecapture/Snakefile:
Wildcards in input, params, log or benchmark file of rule combine_gvcf cannot be determined from output files:
'sample'
There are two points where I need some help:
The {output} is a folder that will be created by the shell part of the rule;
The {output} folder is hard-coded manually, as required by the command line (and its contents are unknown ahead of time).
The problem seems to be that {output} is not expanded, whereas {input} is.
How should I handle this situation? Thanks a lot!
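For reference, here is a minimal sketch of one common fix (it assumes SAMPLES as defined above; the params.vcfs lambda that builds the repeated -V flags is my own addition, not something from the question): because the output contains no wildcards, the input has to be aggregated over all samples with expand(), so the rule is left with no unresolved wildcards and runs exactly once.

rule combine_gvcf:
    input:
        expand("samtools_sorted_out/{sample}.raw.snps.indels.g.vcf", sample=SAMPLES)
    output:
        directory("samtools_sorted_out/combined_gvcf")
    params:
        gatk4="/storage/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar",
        # build "-V a.g.vcf -V b.g.vcf ..." from the expanded input list
        vcfs=lambda wildcards, input: " ".join("-V " + f for f in input)
    shell: """
        java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
        -jar {params.gatk4} GenomicsDBImport \
        {params.vcfs} \
        --genomicsdb-workspace-path {output}
        """

With this, rule all can simply request "samtools_sorted_out/combined_gvcf", and the per-sample GVCFs are still requested by the first expand.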

Related

Problem with JSON argument when using the Aspera Faspex command-line interface

On Windows, from the command line, I am not able to send packages with the Aspera Faspex command-line interface. Here are two examples of the commands I tried:
aspera faspex send -t "Send-test" -r "*Attachment" --source 1 -f y:\folder-test --metadata= { "Ticket ID":"test name" }
aspera faspex send -t "Send-test" -r "*Attachment" --source 1 -f y:\folder-test --metadata= {"metadata_fields" : {"Ticket ID": "test name"}}
The output is always ‘JSON Exception’. How should the JSON argument be written so that it is parsed correctly?
Try quoting the parameter and doubling the QUOTATION MARK characters inside the parameter.
"--metadata={""metadata_fields"" : {""Ticket ID"": ""test name""}}"
Or possibly:
--metadata= "{""metadata_fields"" : {""Ticket ID"": ""test name""}}"
--metadata "{""metadata_fields"" : {""Ticket ID"": ""test name""}}"
The actual parsing of the command line depends on the operating system and shell used.
The command-line tool ("aspera") receives an array of strings as input; that array is built either by the invoking shell, by the system call, or by the language runtime (C/C++).
In your case, I guess you are running on Windows, so the doubled-quote trick solves it.
Refer to:
https://daviddeley.com/autohotkey/parameters/parameters.htm
http://www.windowsinspired.com/understanding-the-command-line-string-and-arguments-received-by-a-windows-program/
Concerning the Aspera command line itself, you may consider:
https://github.com/IBM/aspera-cli

How can I format a json file into a bash environment variable?

I'm trying to take the contents of a config file (JSON format), strip out extraneous new lines and spaces to be concise and then assign it to an environment variable before starting my application.
This is how far I've got:
pwr_config=`echo "console.log(JSON.stringify(JSON.parse(require('fs').readFileSync(process.argv[2], 'utf-8'))));" | node - config.json | xargs -0 printf '%q\n'` npm run start
This pipes a short Node.js script into the node runtime, taking the file name as an argument; it parses and stringifies the JSON file to validate it and remove any unnecessary whitespace. So far so good.
The result of this is then piped to printf, or at least it would be, but printf apparently doesn't accept input this way, so I'm using xargs to pass it in a way it does support.
I'm using the %q format specifier to escape any characters that would be a problem as part of a command, but when calling printf through xargs, printf claims it doesn't support %q. I think this is because there is more than one version of printf, but I'm not exactly sure how to resolve that.
Any help would be appreciated, even if the solution is completely different from what I've started :) Thanks!
Update
Here's the output I get on MacOS:
$ cat config.json | xargs -0 printf %q
printf: illegal format character q
My JSON file looks like this:
{
  "hue_host": "192.168.1.2",
  "hue_username": "myUsername",
  "port": 12000,
  "player_group_config": [
    {
      "name": "Family Room",
      "player_uuid": "ATVUID",
      "hue_group": "3",
      "on_events": ["media.play", "media.resume"],
      "off_events": ["media.stop", "media.pause"]
    },
    {
      "name": "Lounge",
      "player_uuid": "STVUID",
      "hue_group": "1",
      "on_events": ["media.play", "media.resume"],
      "off_events": ["media.stop", "media.pause"]
    }
  ]
}
Two ways:
Use xargs to invoke bash, so that bash's printf builtin is used instead of the standalone printf(1) executable (probably /usr/bin/printf), which doesn't support %q (thanks to @GordonDavisson):
pwr_config=`echo "console.log(JSON.stringify(JSON.parse(require('fs').readFileSync(process.argv[2], 'utf-8'))));" | node - config.json | xargs -0 bash -c 'printf "%q\n" "$0"'` npm run start
Simpler: you don't have to escape the output of a command if you quote it. In the same way that echo "<|>" is OK in bash, this should also work:
pwr_config="$(echo "console.log(JSON.stringify(JSON.parse(require('fs').readFileSync(process.argv[2], 'utf-8'))));" | node - config.json )" npm run start
This uses the newer $(...) form instead of `...`, and so the result of the command is a single word stored as-is into the pwr_config variable.*
Even simpler: if your npm run start script cares about the whitespace in your JSON, it's fundamentally broken :). Just do:
pwr_config="$(< config.json)" npm run start
The $(<...) returns the contents of config.json. They are all stored as a single word (thanks to the "") into pwr_config, newlines and all.* If something breaks, either config.json has an error and should be fixed, or the code you're running has an error and needs to be fixed.
* You actually don't need the "" around $(). E.g., foo=$(echo a b c) and foo="$(echo a b c)" have the same effect. However, I like to include the "" to remind myself that I am specifically asking for all the text to be kept together.

How do I make this into a function for input files? [duplicate]

I'm having trouble figuring out how to make the input directive select only the files belonging to the current {samples} wildcard in the rule below.
rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
               samples=samples['sample'],
               lanes=samples['lane'],
               flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tempdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"
Currently it creates correctly named output files, but it selects all files that match the pattern across all wildcards, so I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive like so:
expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
lanes=samples['lane'],
flowcells=samples['flowcell']),`
but this broke the previous rule somehow. So the solution is something like
input:
    "{sample}_*.bam"
But clearly this doesn't work.
Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?
If you just want all the files in the directory, you can use a lambda function
from glob import glob

rule MarkDup:
    input:
        lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...
Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
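For illustration, here is a minimal sketch of that flag idea; the upstream rule name and the .done path are invented, glob is imported as above, and the flag exists purely to enforce ordering:

rule merge_sample:   # hypothetical upstream rule
    output:
        flag=touch("Outputs/MergeBamAlignment/{samples}.merge.done")
    shell:
        "..."

rule MarkDup:
    input:
        # the flag forces merge_sample to finish first; it is never used in the command
        flag="Outputs/MergeBamAlignment/{samples}.merge.done",
        bams=lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...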
If I understand correctly, zip needs to be applied only to {lanes} and {flowcells}, and not to {samples}. In that case, using two expand instances can achieve that.
input:
    expand(expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam",
                  zip, lanes=samples['lane'], flowcells=samples['flowcell']),
           samples=samples['sample'])
PS: output.tmp file uses {sample} instead of {samples}. Typo?
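For what it's worth, here is a toy illustration of how the two expand calls resolve, runnable as plain Python with invented lane, flowcell and sample values:

from snakemake.io import expand

lanes = ["L001", "L002"]
flowcells = ["FCA", "FCB"]
sample_names = ["NA12878"]

# inner expand: {{samples}} stays escaped as {samples}; zip pairs lanes with flowcells
inner = expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam",
               zip, lanes=lanes, flowcells=flowcells)
# inner == ['Outputs/MergeBamAlignment/{samples}_L001_FCA.merged.bam',
#           'Outputs/MergeBamAlignment/{samples}_L002_FCB.merged.bam']

# outer expand then fills in the remaining {samples} wildcard for every pattern
outer = expand(inner, samples=sample_names)
print(outer)
# ['Outputs/MergeBamAlignment/NA12878_L001_FCA.merged.bam',
#  'Outputs/MergeBamAlignment/NA12878_L002_FCB.merged.bam']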

Input error: Expected '--nodes' to have at least 1 valid item, but had 0 []

I've read plenty of articles about this on here, but I still can't seem to get around the issue. I've been trying to use neo4j-import on some large genome data CSVs I have, but it doesn't seem to recognise the files. My command-line input is as follows:
user@LenovoPC ~/.config/Neo4j Desktop/Application/neo4jDatabases/database-2f182948-e170-45b1-b9f4-19d236ff5d43/installation-3.5.1 $ \
bin/neo4j-import --into data/databases/graph.db --id-type string \
--nodes:Allele variants.csv --nodes:Chromosome chromosome.csv --nodes:Phenotype phenotypes.csv \
--nodes:Sample samples.csv --relationships:BELONGS_TO variant_chromosomes.csv \
--relationships: sample_phenotypes.csv --relationships:ALTERNATIVE_TO variant_variants.csv \
--relationships:HAS sample_variants50-99.csv.gz
But I'm getting the following error:
WARNING: neo4j-import is deprecated and support for it will be removed in a future version of Neo4j; please use neo4j-admin import instead.
Input error: Expected '--nodes' to have at least 1 valid item, but had 0 []
Caused by: Expected '--nodes' to have at least 1 valid item, but had 0 []
java.lang.IllegalArgumentException: Expected '--nodes' to have at least 1 valid item, but had 0 []
    at org.neo4j.kernel.impl.util.Validators.lambda$atLeast$6(Validators.java:144)
    at org.neo4j.helpers.Args.validated(Args.java:670)
    at org.neo4j.helpers.Args.interpretOptionsWithMetadata(Args.java:637)
    at org.neo4j.tooling.ImportTool.extractInputFiles(ImportTool.java:623)
    at org.neo4j.tooling.ImportTool.main(ImportTool.java:445)
    at org.neo4j.tooling.ImportTool.main(ImportTool.java:380)
I included the file path, as I'm using Neo4j Desktop and am not sure whether it has a different file structure. My CSV files are stored in the import folder (but I also have copies in the current folder and in the graph.db folder, just in case).
The import directory is as follows:
user@LenovoPC ~/.config/Neo4j Desktop/Application/neo4jDatabases/database-2f182948-e170-45b1-b9f4-19d236ff5d43/installation-3.5.1/import $ dir
chromosomes.csv samples.csv variants.csv
phenotypes.csv sample_variants50-99.csv.gz variants.csv.gz
sample_phenotypes.csv variant_chromosomes.csv
variant_variants.csv
I can only assume that it's my filepath, but I've tried quite a few alternatives and had no luck at all. If anyone could shed some light on what the issue is, I would really appreciate it!
Best is to cd into the installation directory and place the CSV files into the import folder.
Then you can do:
cd ~/".config/Neo4j Desktop/Application/neo4jDatabases/database-2f182948-e170-45b1-b9f4-19d236ff5d43/installation-3.5.1"
bin/neo4j-import --into data/databases/graph.db --id-type string \
--nodes:Allele import/variants.csv \
--nodes:Chromosome import/chromosome.csv \
--nodes:Phenotype import/phenotypes.csv \
--nodes:Sample import/samples.csv \
--relationships:BELONGS_TO import/variant_chromosomes.csv \
--relationships import/sample_phenotypes.csv \
--relationships:ALTERNATIVE_TO import/variant_variants.csv \
--relationships:HAS import/sample_variants50-99.csv.gz
Some more notes:
HAS is a pretty generic relationship type.
I left off the colon here (--relationships import/sample_phenotypes.csv) because I'm not sure whether you have the rel-type in the file.
Is this a single file? --relationships:HAS import/sample_variants50-99.csv.gz

Snakemake: Error when trying to generate multiple output files

I'm writing a Snakemake pipeline to take publicly available SRA files, convert them to fastq files, and then run them through alignment, peak calling and LD score regression.
I'm having an issue with the rule called SRA2fastq below, in which I use parallel-fastq-dump to convert SRA files to paired-end fastq files. This rule generates two outputs for each SRA file: SRRXXXXXXX_1 and SRRXXXXXXX_2.
Here is my config file:
samples:
    fullard2018_NpfcATAC_1: SRR5367824
    fullard2018_NpfcATAC_2: SRR5367798
    fullard2018_NpfcATAC_3: SRR5367778
    fullard2018_NpfcATAC_4: SRR5367754
    fullard2018_NpfcATAC_5: SRR5367729
And here are the first few rules of my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        "FastQC/fastq_multiqc.html",
        expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
        "peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
        expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))

rule SRA_prefetch:
    params:
        SRA="{SRA}"
    output:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    log:
        "logs/prefetch/{SRA}.log"
    shell:
        "prefetch {params.SRA}"

rule SRA2fastq:
    input:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    output:
        "fastq_files/{SRA}_1.fastq.gz",
        "fastq_files/{SRA}_2.fastq.gz"
    log:
        "logs/SRA2fastq/{SRA}.log"
    shell:
        """
        parallel-fastq-dump --sra-id {input} --threads 8 \
        --outdir fastq_files --split-files --gzip
        """

rule fastqc:
    input:
        rules.SRA2fastq.output
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{SRA}_{num}_fastqc.html"
    log:
        "logs/FASTQC/{SRA}_{num}.log"
    wrapper:
        "0.27.1/bio/fastqc"

rule multiqc_fastq:
    input:
        lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.27.1/bio/multiqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build)
        extra=""
    threads: 8
    wrapper:
        "0.27.1/bio/bowtie2/align"
However, when I run the Snakefile I get the following error:
Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz
I've seen this error many times before, and it's usually caused when the name of the output file generated by the program does not exactly match the output file name specified in the corresponding Snakemake rule. However, that is not the case here: if I run the command Snakemake generates for this particular rule separately, the files are created as expected and the file names match. Here is an example of one instance of the rule, taken after running snakemake -np:
rule SRA2fastq:
    input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
    output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
    log: logs/SRA2fastq/SRR5367779.log
    jobid: 18
    wildcards: SRA=SRR5367779

parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip
Note that the output files generated by the parallel-fastq-dump command when run separately (i.e. not via Snakemake) are named as specified in the SRA2fastq rule:
ls fastq_files
SRR5367729_1.fastq.gz SRR5367729_2.fastq.gz
I'm a bit stumped by this, as this error is usually easy to rectify, but I can't work out what the issue is. I've tried changing the output section of SRA2fastq to:
output:
    file1="fastq_files/{SRA}_1.fastq.gz",
    file2="fastq_files/{SRA}_2.fastq.gz"
However, this throws the same error. I've also tried specifying just one output file, but that affects the bowtie2 rule later on, as I then get a missing input files error.
Any ideas what's going on here? Is there something I'm missing when trying to look for multiple output files in a single rule?
Many Thanks