Match-zero-or-more operator in Nextflow glob

I am trying to create a robust glob pattern that will match most of the different naming conventions used for fastq files we receive. However, the version of nextflow I am using (20.10.0) on the HPC doesn't seem to accept what I've written.
Here are some examples of file names:
19_S8_R1_001.fastq.gz 19_S8_R2_001.fastq.gz
F1HD1_S28_R1.fastq.gz F1HD1_S28_R2.fastq.gz
SRR3137747_1.fastq SRR3137747_2.fastq
The pattern I originally wrote to go with the fromFilePairs operator was *_?(R){1,2}?(_001).f?(ast)q?(.gz), which I tested in a bash environment. Here is the output from testing in the directory with the top two example files:
-bash-4.2$ shopt -s extglob
-bash-4.2$ ls -1 *_?(R){1,2}?(_001).f?(ast)q?(.gz)
19_S8_R1_001.fastq.gz
19_S8_R2_001.fastq.gz
But when I tried to run this with nextflow, it just gave me the error message I put into the ifEmpty operator.
I eventually got it working, but only by using this pattern: *_{R1,R2,1,2}{.fastq.gz,.fq.gz,.fastq,.fq,_001.fastq.gz,_001.fq.gz,_001.fastq,_001.fq}, which isn't particularly robust.
Unless I've missed it in the nextflow documentation (and the information I've found about glob), I don't see alternatives to match-zero-or-more operators in nextflow. Any alternative solutions?
Thanks in advance.

The following glob pattern seems to match some of the more common FASTQ filenames:
Channel
.fromFilePairs( '*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}' )
.view()
Or with a parameterized directory prefix:
Channel
.fromFilePairs( "${params.input_dir}/*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}" )
.view()
Results:
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [serene_austin] - revision: 5527b9b3c0
[SRR3137747, [/path/to/fasta/SRR3137747_1.fastq, /path/to/fasta/SRR3137747_2.fastq]]
[19_S8, [/path/to/fasta/19_S8_R1_001.fastq.gz, /path/to/fasta/19_S8_R2_001.fastq.gz]]
[F1HD1_S28, [/path/to/fasta/F1HD1_S28_R1.fastq.gz, /path/to/fasta/F1HD1_S28_R2.fastq.gz]]
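To see why the empty alternatives such as {,R} and {,.gz} behave like "optional" groups, here is a minimal Python sketch that expands the braces by hand and checks the example filenames. It is only an approximation: Python's fnmatch is not the Java glob engine Nextflow uses, and the helper names expand_braces and matches are made up for this illustration.

```python
import fnmatch
import re

PATTERN = "*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}"


def expand_braces(pattern):
    """Expand {a,b,...} groups one at a time; an empty alternative means 'nothing here'."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def matches(filename, pattern=PATTERN):
    """True if the filename matches any brace-free expansion of the pattern."""
    return any(fnmatch.fnmatchcase(filename, p) for p in expand_braces(pattern))


names = [
    "19_S8_R1_001.fastq.gz",
    "F1HD1_S28_R2.fastq.gz",
    "SRR3137747_1.fastq",
    "sample_R3.fastq",  # hypothetical non-matching name
]
for name in names:
    print(name, matches(name))
```

Each brace group with an empty first alternative doubles the number of plain patterns, so this one pattern stands in for 32 brace-free globs.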
Another option, which might be more robust (and readable), is to make use of the fact that you can specify more than one glob pattern using a list as argument, and build your list of glob patterns dynamically:
nextflow.enable.dsl=2
params.input_dir = '/path/to/fasta'
def cartesian_product(A, B) {
A.collectMany{ a -> B.collect { b -> [a, b] } }
}
def extensions = [
'.fastq.gz',
'.fastq',
'.fq.gz',
'.fq',
]
def suffixes = [
'*_R{1,2}_001',
'*_R{1,2}',
'*_{1,2}',
]
workflow {
def patterns = cartesian_product(suffixes, extensions).collect {
"${params.input_dir}/${it.join()}"
}
Channel.fromFilePairs( patterns ).view()
}
Results:
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [deadly_payne] - revision: 6d2472ef23
[19_S8, [/path/to/fasta/19_S8_R1_001.fastq.gz, /path/to/fasta/19_S8_R2_001.fastq.gz]]
[F1HD1_S28, [/path/to/fasta/F1HD1_S28_R1.fastq.gz, /path/to/fasta/F1HD1_S28_R2.fastq.gz]]
[SRR3137747, [/path/to/fasta/SRR3137747_1.fastq, /path/to/fasta/SRR3137747_2.fastq]]
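For comparison, the same "cartesian product of suffixes and extensions" idea can be sketched in Python with itertools.product. This is only an illustration of how the pattern list is built, not something Nextflow runs; build_patterns is a hypothetical helper and the directory is the same placeholder used above.

```python
from itertools import product

input_dir = "/path/to/fasta"  # placeholder directory, as in the answer
suffixes = ["*_R{1,2}_001", "*_R{1,2}", "*_{1,2}"]
extensions = [".fastq.gz", ".fastq", ".fq.gz", ".fq"]


def build_patterns(input_dir, suffixes, extensions):
    """Pair every suffix with every extension and prefix the input directory."""
    return [
        f"{input_dir}/{suffix}{extension}"
        for suffix, extension in product(suffixes, extensions)
    ]


patterns = build_patterns(input_dir, suffixes, extensions)
for pattern in patterns:
    print(pattern)
```

Three suffixes and four extensions give twelve patterns, and supporting a new naming convention means adding one entry to one of the two lists.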


Nextflow rename barcodes and concatenate reads within barcodes

My current working directory has the following sub-directories
My Bash script
Hi there
I have compiled the above Bash script to do the following tasks:
rename the sub-directories (barcode01-12) taking information from the metadata.csv
concatenate the individual reads within a sub-directory and move them up in the $PWD
then I use these concatenated reads (one per barcode) for my Nextflow script below:
Query:
How can I get the above pre-processing tasks (renaming and concatenating) or the Bash script added at the beginning of my following Nextflow script?
In my experience, FASTQ files can get quite large. Without knowing too much of the specifics, my recommendation would be to move the concatenation (and renaming) to a separate process. In this way, all of the 'work' can be done inside Nextflow's working directory. Here's a solution that uses the new DSL 2. It uses the splitCsv operator to parse the metadata and identify the FASTQ files. The collection can then be passed into our 'concat_reads' process. To handle optionally gzipped files, you could try the following:
params.metadata = './metadata.csv'
params.outdir = './results'
process concat_reads {
tag { sample_name }
publishDir "${params.outdir}/concat_reads", mode: 'copy'
input:
tuple val(sample_name), path(fastq_files)
output:
tuple val(sample_name), path("${sample_name}.${extn}")
script:
if( fastq_files.every { it.name.endsWith('.fastq.gz') } )
extn = 'fastq.gz'
else if( fastq_files.every { it.name.endsWith('.fastq') } )
extn = 'fastq'
else
error "Concatenation of mixed filetypes is unsupported"
"""
cat ${fastq_files} > "${sample_name}.${extn}"
"""
}
process pomoxis {
tag { sample_name }
publishDir "${params.outdir}/pomoxis", mode: 'copy'
cpus 18
input:
tuple val(sample_name), path(fastq)
"""
mini_assemble \\
-t ${task.cpus} \\
-i "${fastq}" \\
-o results \\
-p "${sample_name}"
"""
}
workflow {
fastq_extns = [ '.fastq', '.fastq.gz' ]
Channel.fromPath( params.metadata )
| splitCsv()
| map { dir, sample_name ->
all_files = file(dir).listFiles()
fastq_files = all_files.findAll { fn ->
fastq_extns.find { fn.name.endsWith( it ) }
}
tuple( sample_name, fastq_files )
}
| concat_reads
| pomoxis
}
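The reason a plain cat is enough for the gzipped case is that concatenated gzip members form a valid gzip stream, while mixing compressed and uncompressed inputs would not. A small Python sketch of both the extension check and that property follows; pick_extension is a hypothetical stand-in for the if/else in the script block, and the FASTQ records are made-up test data.

```python
import gzip


def pick_extension(names):
    """Mirror the script block: all inputs must share one of the two supported extensions."""
    if all(name.endswith(".fastq.gz") for name in names):
        return "fastq.gz"
    if all(name.endswith(".fastq") for name in names):
        return "fastq"
    raise ValueError("Concatenation of mixed filetypes is unsupported")


# Two made-up FASTQ records, compressed separately like two input files.
record_a = b"@read1\nACGT\n+\nIIII\n"
record_b = b"@read2\nTTGA\n+\nIIII\n"

# Byte-wise concatenation is what `cat a.fastq.gz b.fastq.gz` does.
combined = gzip.compress(record_a) + gzip.compress(record_b)

# Decompressing the concatenation yields both records in order.
print(gzip.decompress(combined) == record_a + record_b)
print(pick_extension(["a.fastq.gz", "b.fastq.gz"]))
```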

Is there a way to lookup a value from a CSV in nextflow? Or, alternately, reuse a CSV?

I have a simple csv created as part of a workflow, like below:
sample,value
A,1
B,0.5
Separately, I have another channel with file names matching the sample names. I'd like to be able to use the values associated with each sample name within a new process.
I've tried splitting the CSV using .splitCsv but (unsurprisingly) sometimes the incorrect value gets used with a sample, although it does run the correct number of times. I've also tried just using awk within the script to pull out the corresponding value and save it to a variable, and this causes the correct value to be used, but it consumes the CSV file and so only one sample gets processed.
Super simplified nextflow (DSL2) script:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process foo {
input:
path input_file
output:
path 'file.csv', emit: csv
"""
script that creates csv
"""
}
process bar {
input:
path input_file2
output:
path 'file.bam', emit: bam
"""
script that creates bam files
"""
}
process help_me {
input:
path csv
path bam
output:
path 'result'
"""
script that uses value from csv on associated bam file
"""
}
workflow {
foo(params.input)
bar(params.input2)
help_me(foo.out.csv, bar.out.bam)
}
Thanks!!
Edit: In essence, is there a way to synchronize two channels such that I can use a csv's individual rows with associated files?
If you have a value channel, you can reuse a file (like a CSV) an unlimited number of times without consuming the channel. For example:
workflow {
input1 = file( params.input1 )
input2 = file( params.input2 )
foo( input1 )
bar( input2 )
help_me(foo.out.csv, bar.out.bam)
}
Here, both input1 and input2 are value channels. Also, (emphasis mine):
A value channel is implicitly created by a process when an input
specifies a simple value in the from clause. Moreover, a value channel
is also implicitly created as output for a process whose inputs are
only value channels.
This means that both foo.out.csv and bar.out.bam are also value channels. Additionally, help_me.out is also a value channel. If input2 were instead a queue channel, you can see that input1 can still be re-used an unlimited number of times:
$ mkdir -p ./path/to/bams
$ touch ./path/to/bams/{A,B,C}.bam
$ touch ./foo.txt
params.input1 = './foo.txt'
params.input2 = './path/to/bams/*.bam'
workflow {
input1 = file( params.input1 )
input2 = Channel.fromPath( params.input2 )
foo( input1 )
bar( input2 )
help_me(foo.out.csv, bar.out.bam)
}
Results:
$ nextflow run script.nf
N E X T F L O W ~ version 22.04.0
Launching `script.nf` [trusting_allen] DSL2 - revision: 75209e4c85
executor > local (7)
[24/d459f7] process > foo [100%] 1 of 1 ✔
[04/a903e4] process > bar (2) [100%] 3 of 3 ✔
[24/7a9a1d] process > help_me (3) [100%] 3 of 3 ✔
Note that bar.out.bam and help_me.out are now queue channels.
If instead you have one CSV per sample (or a similar configuration), you will need some way to join these channels beforehand and to adjust your new process's input declaration accordingly. What you want to avoid is declaring two (or more) queue channels in your input block. This part of the docs is well worth the time investment: Understand how multiple input channels work. It would explain why you saw the incorrect value being associated with a particular sample when consuming the splitCsv output. To join these channels, you can use the join operator. For example, given your simple CSV (as 'foo.csv') and the test bams created previously:
nextflow.enable.dsl=2
params.input1 = './foo.csv'
params.input2 = './path/to/bams/*.bam'
process help_me {
debug true
input:
tuple val(sample), val(myval), path(bam)
output:
path 'result'
"""
echo -n "sample: ${sample}, myval: ${myval}, bam: ${bam}"
touch result
"""
}
workflow {
Channel.fromPath( params.input1 ) \
| splitCsv( header:true ) \
| map { row -> tuple( row.sample, row.value ) } \
| set { rows_ch }
Channel.fromPath( params.input2 ) \
| map { bam -> tuple( bam.baseName, bam ) } \
| join( rows_ch ) \
| map { sample, bam, myval -> tuple( sample, myval, bam ) } \
| help_me
}
Results:
$ nextflow run script.nf
N E X T F L O W ~ version 22.04.0
Launching `script.nf` [lethal_mayer] DSL2 - revision: 395732babc
executor > local (2)
[c5/e96085] process > help_me (1) [100%] 2 of 2 ✔
sample: B, myval: 0.5, bam: B.bam
sample: A, myval: 1, bam: A.bam
If your CSV has more than one value for a particular sample and these are specified on separate lines, you probably want the combine operator instead. For example, if your 'foo.csv' contains:
sample,value
A,1
B,0.5
B,2
and you replace join( rows_ch ) with combine( rows_ch, by:0 ) in the above example, the results are:
nextflow run script.nf
N E X T F L O W ~ version 22.04.0
Launching `script.nf` [festering_miescher] DSL2 - revision: f8de1e0d20
executor > local (3)
[ee/8af543] process > help_me (3) [100%] 3 of 3 ✔
sample: A, myval: 1, bam: A.bam
sample: B, myval: 0.5, bam: B.bam
sample: B, myval: 2, bam: B.bam
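The difference between the two operators can be modelled in plain Python. This is a rough sketch of the matching behaviour only, under the assumption that join pairs items one-to-one on the first tuple element and combine with by:0 emits every pairing that shares a key. The function names are illustrative, and unlike Nextflow channels the output order here is deterministic.

```python
def join(left, right):
    """One-to-one match on the first element; assumes keys are unique on both sides."""
    right_by_key = {key: rest for key, *rest in right}
    return [
        (key, *rest, *right_by_key[key])
        for key, *rest in left
        if key in right_by_key
    ]


def combine_by_key(left, right):
    """Every pairing of left and right items that share the same first element."""
    return [
        (left_key, *left_rest, *right_rest)
        for left_key, *left_rest in left
        for right_key, *right_rest in right
        if left_key == right_key
    ]


bams = [("A", "A.bam"), ("B", "B.bam"), ("C", "C.bam")]
rows = [("A", "1"), ("B", "0.5")]
rows_multi = [("A", "1"), ("B", "0.5"), ("B", "2")]

print(join(bams, rows))
print(combine_by_key(bams, rows_multi))
```

In both cases C.bam has no matching row and is dropped, which is why help_me runs twice with join and three times with combine.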

jq - How to extract domains and remove duplicates

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
"name": "Visma Public",
"domains": [
"accountsettings.connect.identity.stagaws.visma.com",
"admin.stage.vismaonline.com",
"api.home.stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"app.workbox.co.uk",
"authz.workbox.dk",
"connect.identity.stagaws.visma.com",
"eaccounting.stage.vismaonline.com",
"eaccountingprinting.stage.vismaonline.com",
"http://myservices-api.stage.vismaonline.com/",
"identity.stage.vismaonline.com",
"myservices.stage.vismaonline.com"
]
}
How can I transform the data into the form below? That is, identify the domains present in the format site.SLD.TLD and then remove the duplicates (not including the subdomains, protocols or paths, as illustrated below).
{
"name": "Visma Public",
"domains": [
"workbox.co.uk",
"workbox.dk",
"visma.com",
"vismaonline.com"
]
}
I would like to do so in jq, as that is what I've used to wrangle the data into this format so far, but at this stage any solution that I can run on Debian (I'm using bash), ideally without any extraneous tooling, would be fine.
I'm aware that regex can be used within jq, so I assume the best way is to regex out the domain and then pipe to unique. However, I've been unable to get anything working so far. I'm currently trying this version, which seems to need only the text transformation stage added in, either during the jq process or with a pass over the output afterwards using something like awk:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution was offered that uses a regex match to extract the last two strings separated by . and calls the unique function on the result. It works up to a point, but doesn't cover a site.SLD.TLD whose suffix has two parts. For example, google.co.uk would return only co.uk with this jq:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'
A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import pprint
import urllib.request
from urllib.parse import urlparse
import os
def get_tlds():
f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
content = f.read()
lines = content.decode('utf-8').split("\n")
# remove comments
tlds = [line for line in lines if not line.startswith("//") and not line == ""]
return tlds
def extract_domain(url, tlds):
# get domain
url = url.replace("http://", "").replace("https://", "")
url = url.split("/")[0]
# get tld/sld
parts = url.split(".")
suffix1 = parts[-1]
sld1 = parts[-2]
if len(parts) > 2:
suffix2 = ".".join(parts[-2:])
sld2 = parts[-3]
else:
suffix2 = suffix1
sld2 = sld1
# try the longer suffix first
if suffix2 in tlds:
tld = suffix2
sld = sld2
else:
tld = suffix1
sld = sld1
return sld + "." + tld
def clean(site, tlds):
site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
return site
if __name__ == "__main__":
filename = "Hzt9bq2a.json"
cache_path = "tlds.json"
if os.path.exists(cache_path):
with open(cache_path, "r") as f:
tlds = json.load(f)
else:
tlds = get_tlds()
with open(cache_path, "w") as f:
json.dump(tlds, f)
with open(filename) as f:
d = json.load(f)
d = [clean(site, tlds) for site in d]
pprint.pprint(d)
with open("clean.json", "w") as f:
json.dump(d, f)
May I offer a way of achieving the same query with jtc. The same could be achieved in other languages (and of course in jq); the question is mostly how to come up with a regex that satisfies your ask:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
"domains": [
"stagaws.visma.com",
"stage.vismaonline.com",
"stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"workbox.co.uk",
"authz.workbox.dk"
],
"name": "Visma Public"
}
bash $
Note: it does extract only DOMAIN.TLD, as per your ask. If you like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.
Update:
Modified solution as per the comment: extract domain.sld.tld where 3 or more levels and domain.tld where there’s only 2
PS. I'm the creator of jtc, a JSON processing utility. This disclaimer is an SO requirement.
One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
sub("^[^:]*://";"")
| sub("/.*$";"")
| split(".")
| (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
then -3 else -2 end) as $ix
| .[$ix : ]
| join(".") ;
{name, domain: (.domains | map(extract) | unique)}
Output
{
"name": "Visma Public",
"domain": [
"visma.com",
"vismaonline.com",
"workbox.co.uk",
"workbox.dk"
]
}
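For readers more at home in Python, the same heuristic can be sketched as below. It is a direct port of the jq filter above, not a Public Suffix List lookup: it keeps three labels only when the last label has two characters and the one before it has at most three, so a name such as www.bbc.de would be kept whole. The sample list is a subset of the domains in the question.

```python
import re


def extract(url):
    """Strip scheme and path, then keep the last two or three dot-separated labels."""
    host = re.sub(r"^[^:]*://", "", url)
    host = re.sub(r"/.*$", "", host)
    labels = host.split(".")
    looks_like_two_part_suffix = (
        len(labels) >= 2 and len(labels[-1]) == 2 and len(labels[-2]) <= 3
    )
    keep = 3 if looks_like_two_part_suffix else 2
    return ".".join(labels[-keep:])


domains = [
    "accountsettings.connect.identity.stagaws.visma.com",
    "admin.stage.vismaonline.com",
    "api.workbox.dk",
    "app.workbox.co.uk",
    "http://myservices-api.stage.vismaonline.com/",
]
result = {"name": "Visma Public", "domain": sorted({extract(d) for d in domains})}
print(result)
```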
Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).
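To give a feel for what the wildcard and exception rules involve, here is a minimal Python sketch of the matching step. The RULES list is a tiny hand-copied subset chosen for illustration, not the real list, and the function names are hypothetical. In practice an existing Public Suffix List module would be the better choice, as suggested above.

```python
# Tiny illustrative subset of rules; the real list is far longer.
RULES = ["com", "dk", "uk", "co.uk", "br", "com.br", "*.ck", "!www.ck"]


def public_suffix_length(labels, rules):
    """Return how many trailing labels make up the public suffix."""
    best = 1  # implicit default rule "*": with no match, the last label is the suffix
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = rule.lstrip("!").split(".")
        if len(rule_labels) > len(labels):
            continue
        tail = labels[-len(rule_labels):]
        if all(r == "*" or r == label for r, label in zip(rule_labels, tail)):
            if is_exception:
                # An exception rule wins; its suffix is the rule minus its leftmost label.
                return len(rule_labels) - 1
            best = max(best, len(rule_labels))
    return best


def registrable_domain(hostname, rules=RULES):
    """Public suffix plus one more label, or None if the hostname is itself a suffix."""
    labels = hostname.lower().rstrip(".").split(".")
    suffix_length = public_suffix_length(labels, rules)
    if len(labels) <= suffix_length:
        return None
    return ".".join(labels[-(suffix_length + 1):])


for host in ["www.foo.com.br", "app.workbox.co.uk", "www.ck", "foo.bar.ck", "co.uk"]:
    print(host, registrable_domain(host))
```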

Prevent double compilation of c files in cython

I am writing a wrapper over a C library, and this library has a file containing almost all of its functions, let's say all_funcs.c. This file in turn requires compilation of lots of other C files.
I have created all_funcs.pyx, where I wrapped all the functions, but I also want to create a submodule that has access to functions from all_funcs.c. What works for now is adding all the C files to both Extensions in setup.py; however, each C file is then compiled twice: first for all_funcs.pyx and second for the submodule extension.
Is there any way to provide common source files to each Extension?
Example of current setup.py:
ext_helpers = Extension(name=SRC_DIR + '.wrapper.utils.helpers',
sources=[SRC_DIR + '/wrapper/utils/helpers.pyx'] + source_files_paths,
include_dirs=[SRC_DIR + '/include/'])
ext_all_funcs = Extension(name=SRC_DIR + '.wrapper.all_funcs',
sources=[SRC_DIR + '/wrapper/all_funcs.pyx'] + source_files_paths,
include_dirs=[SRC_DIR + '/include/'])
EXTENSIONS = [
ext_helpers,
ext_all_funcs,
]
if __name__ == "__main__":
setup(
packages=PACKAGES,
zip_safe=False,
name='some_name',
ext_modules=cythonize(EXTENSIONS, language_level=3)
)
source_files_paths - the list with common c source files
Note: this answer only explains how to avoid multiple compilation of c/cpp-files using libraries-argument of setup-function. It doesn't however explain how to avoid possible problems due to ODR-violation - for that see this SO-post.
Adding libraries-argument to setup will trigger build_clib prior to building of ext_modules (when running setup.py build or setup.py install commands), the resulting static library will also be automatically passed to the linker, when extensions are linked.
For your setup.py, this means:
from setuptools import setup, find_packages, Extension
...
#common c files compiled to a static library:
mylib = ('mylib', {'sources': source_files_paths}) # possible further settings
# no common c-files (taken care of in mylib):
ext_helpers = Extension(name=SRC_DIR + '.wrapper.utils.helpers',
sources=[SRC_DIR + '/wrapper/utils/helpers.pyx'],
include_dirs=[SRC_DIR + '/include/'])
# no common c-files (taken care of in mylib):
ext_all_funcs = Extension(name=SRC_DIR + '.wrapper.all_funcs',
sources=[SRC_DIR + '/wrapper/all_funcs.pyx'],
include_dirs=[SRC_DIR + '/include/'])
EXTENSIONS = [
ext_helpers,
ext_all_funcs,
]
if __name__ == "__main__":
setup(
packages=find_packages(where=SRC_DIR),
zip_safe=False,
name='some_name',
ext_modules=cythonize(EXTENSIONS, language_level=3),
# will be built as static libraries and automatically passed to linker:
libraries = [mylib]
)
To build the extensions inplace one should invoke:
python setup.py build_clib build_ext --inplace
as build_ext alone is not enough: we need the static libraries to build before they can be used in extensions.

How to extract the name of a Party?

In a DAML contract, how do I extract the name of a party from a Party field?
Currently, toText p gives me Party(Alice). I'd like to only keep the name of the party.
That you care about the precise formatting of the resulting string suggests that you are implementing a codec in DAML. As a general principle DAML excels as a modelling/contract language, but consequently has limited features to support the sort of IO-oriented work this question implies. You are generally better off returning DAML values, and implementing codecs in Java/Scala/C#/Haskell/etc interfacing with the DAML via the Ledger API.
Still, once you have a Text value you also have access to the standard List manipulation functions via unpack, so converting "Party(Alice)" to "Alice" is not too difficult:
daml 1.0 module PartyExtract where
import Base.List
def pack (cs: List Char) : Text =
foldl (fun (acc: Text) (c: Char) -> acc <> singleton c) "" cs;
def partyToText (p: Party): Text =
pack $ reverse $ drop 2 $ reverse $ drop 7 $ unpack $ toText p
test foo : Scenario {} = scenario
let p = 'Alice'
assert $ "Alice" == partyToText p
In DAML 1.2 the standard library has been expanded, so the code above can be simplified:
daml 1.2
module PartyExtract2
where
import DA.Text
traceDebug : (Show a, Show b) => b -> a -> a
traceDebug b a = trace (show b <> show a) $ a
partyToText : Party -> Text
partyToText p = dropPrefix "'" $ dropSuffix "'" $ traceDebug "show party: " $ show p
foo : Scenario ()
foo = do
p <- getParty "Alice"
assert $ "Alice" == (traceDebug "partyToText party: " $ partyToText p)
NOTE: I have left the definition and calls to traceDebug so you can see the exact strings being generated in the scenario trace output.