Nextflow rename barcodes and concatenate reads within barcodes - containers

My current working directory has the following sub-directories
My Bash script
Hi there
I have written the above Bash script to do the following tasks:
rename the sub-directories (barcode01-12) using information from metadata.csv
concatenate the individual reads within each sub-directory and move the result up into $PWD
then use these concatenated reads (one per barcode) in my Nextflow script below:
Query:
How can I add the above pre-processing tasks (renaming and concatenating), or the Bash script itself, to the beginning of my Nextflow script below?

In my experience, FASTQ files can get quite large. Without knowing too much about the specifics, my recommendation would be to move the concatenation (and renaming) into a separate process. That way, all of the 'work' can be done inside Nextflow's working directory. Here's a solution that uses the new DSL 2. It uses the splitCsv operator to parse the metadata and identify the FASTQ files. The collection of files can then be passed into our 'concat_reads' process. To handle optionally gzipped files, you could try the following:
params.metadata = './metadata.csv'
params.outdir = './results'
process concat_reads {
    tag { sample_name }
    publishDir "${params.outdir}/concat_reads", mode: 'copy'

    input:
    tuple val(sample_name), path(fastq_files)

    output:
    tuple val(sample_name), path("${sample_name}.${extn}")

    script:
    if( fastq_files.every { it.name.endsWith('.fastq.gz') } )
        extn = 'fastq.gz'
    else if( fastq_files.every { it.name.endsWith('.fastq') } )
        extn = 'fastq'
    else
        error "Concatenation of mixed filetypes is unsupported"

    """
    cat ${fastq_files} > "${sample_name}.${extn}"
    """
}
process pomoxis {
    tag { sample_name }
    publishDir "${params.outdir}/pomoxis", mode: 'copy'
    cpus 18

    input:
    tuple val(sample_name), path(fastq)

    """
    mini_assemble \\
        -t ${task.cpus} \\
        -i "${fastq}" \\
        -o results \\
        -p "${sample_name}"
    """
}
workflow {
    fastq_extns = [ '.fastq', '.fastq.gz' ]

    Channel.fromPath( params.metadata )
        | splitCsv()
        | map { dir, sample_name ->
            all_files = file(dir).listFiles()
            fastq_files = all_files.findAll { fn ->
                fastq_extns.find { fn.name.endsWith( it ) }
            }
            tuple( sample_name, fastq_files )
        }
        | concat_reads
        | pomoxis
}
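Note that, for the map above to work, metadata.csv is assumed to contain (at least) two unnamed columns in this order: the barcode directory first, then the sample name. A hypothetical example (the paths here are made up):
/path/to/fastq_pass/barcode01,sample01
/path/to/fastq_pass/barcode02,sample02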

Related

Is there a way to lookup a value from a CSV in nextflow? Or, alternately, reuse a CSV?

I have a simple csv created as part of a workflow, like below:
sample,value
A,1
B,0.5
Separately, I have another channel with file names matching the sample names. I'd like to be able to use the values associated with each sample name within a new process.
I've tried splitting the CSV using .splitCsv but (unsurprisingly) sometimes the incorrect value gets used with a sample, although it does run the correct number of times. I've also tried just using awk within the script to pull out the corresponding value and save it to a variable, and this causes the correct value to be used, but it consumes the CSV file and so only one sample gets processed.
Super simplified nextflow (DSL2) script:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

process foo {
    input:
    path input_file

    output:
    path 'file.csv', emit: csv

    """
    script that creates csv
    """
}

process bar {
    input:
    path input_file2

    output:
    path 'file.bam', emit: bam

    """
    script that creates bam files
    """
}

process help_me {
    input:
    path csv
    path bam

    output:
    path 'result'

    """
    script that uses value from csv on associated bam file
    """
}

workflow {
    foo(params.input)
    bar(params.input2)
    help_me(foo.out.csv, bar.out.bam)
}
Thanks!!
Edit: In essence, is there a way to synchronize two channels such that I can use a csv's individual rows with associated files?
If you have a value channel, you can reuse a file (like a CSV) an unlimited number of times without consuming the channel. For example:
workflow {
    input1 = file( params.input1 )
    input2 = file( params.input2 )

    foo( input1 )
    bar( input2 )

    help_me(foo.out.csv, bar.out.bam)
}
Here, both input1 and input2 are value channels. Also, (emphasis mine):
"A value channel is implicitly created by a process when an input specifies a simple value in the from clause. Moreover, a value channel is also implicitly created as output for a process whose inputs are only value channels."
This means that both foo.out.csv and bar.out.bam are also value channels. Additionally, help_me.out is also a value channel. If input2 were instead a queue channel, you can see that input1 can be re-used an unlimited number of times:
$ mkdir -p ./path/to/bams
$ touch ./path/to/bams/{A,B,C}.bam
$ touch ./foo.txt
params.input1 = './foo.txt'
params.input2 = './path/to/bams/*.bam'
workflow {
    input1 = file( params.input1 )
    input2 = Channel.fromPath( params.input2 )

    foo( input1 )
    bar( input2 )

    help_me(foo.out.csv, bar.out.bam)
}
Results:
$ nextflow run script.nf
N E X T F L O W ~ version 22.04.0
Launching `script.nf` [trusting_allen] DSL2 - revision: 75209e4c85
executor > local (7)
[24/d459f7] process > foo [100%] 1 of 1 ✔
[04/a903e4] process > bar (2) [100%] 3 of 3 ✔
[24/7a9a1d] process > help_me (3) [100%] 3 of 3 ✔
Note that bar.out.bam and help_me.out are now queue channels.
If, instead, you have one CSV per sample (or a similar configuration), you will need some way to join these channels beforehand and adjust your new process's input declaration accordingly. What you want to avoid is declaring two (or more) queue channels in your input block. This part of the docs is well worth the time investment: Understand how multiple input channels work. It explains why you saw the incorrect value being associated with a particular sample when consuming the splitCsv output. To join these channels, you can use the join operator. For example, given your simple CSV (as 'foo.csv') and the test bams created previously:
nextflow.enable.dsl=2

params.input1 = './foo.csv'
params.input2 = './path/to/bams/*.bam'

process help_me {
    debug true

    input:
    tuple val(sample), val(myval), path(bam)

    output:
    path 'result'

    """
    echo -n "sample: ${sample}, myval: ${myval}, bam: ${bam}"
    touch result
    """
}

workflow {
    Channel.fromPath( params.input1 ) \
        | splitCsv( header:true ) \
        | map { row -> tuple( row.sample, row.value ) } \
        | set { rows_ch }

    Channel.fromPath( params.input2 ) \
        | map { bam -> tuple( bam.baseName, bam ) } \
        | join( rows_ch ) \
        | map { sample, bam, myval -> tuple( sample, myval, bam ) } \
        | help_me
}
Results:
$ nextflow run script.nf
N E X T F L O W ~ version 22.04.0
Launching `script.nf` [lethal_mayer] DSL2 - revision: 395732babc
executor > local (2)
[c5/e96085] process > help_me (1) [100%] 2 of 2 ✔
sample: B, myval: 0.5, bam: B.bam
sample: A, myval: 1, bam: A.bam
If your CSV has more than one value for a particular sample and these are specified on separate lines, you probably want the combine operator instead. For example, if your 'foo.csv' contains:
sample,value
A,1
B,0.5
B,2
Then replace join( rows_ch ) with combine( rows_ch, by:0 ) in the above example. Results:
$ nextflow run script.nf
N E X T F L O W ~ version 22.04.0
Launching `script.nf` [festering_miescher] DSL2 - revision: f8de1e0d20
executor > local (3)
[ee/8af543] process > help_me (3) [100%] 3 of 3 ✔
sample: A, myval: 1, bam: A.bam
sample: B, myval: 0.5, bam: B.bam
sample: B, myval: 2, bam: B.bam
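For reference, here is the workflow block from the join example with only that one substitution applied; everything else stays the same:
workflow {
    Channel.fromPath( params.input1 ) \
        | splitCsv( header:true ) \
        | map { row -> tuple( row.sample, row.value ) } \
        | set { rows_ch }

    Channel.fromPath( params.input2 ) \
        | map { bam -> tuple( bam.baseName, bam ) } \
        | combine( rows_ch, by:0 ) \
        | map { sample, bam, myval -> tuple( sample, myval, bam ) } \
        | help_me
}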

Match-zero-or-more operator in nextflow glob

I am trying to create a robust glob pattern that will match most of the different naming conventions used for fastq files we receive. However, the version of nextflow I am using (20.10.0) on the HPC doesn't seem to accept what I've written.
Here are some examples of file names:
19_S8_R1_001.fastq.gz 19_S8_R2_001.fastq.gz
F1HD1_S28_R1.fastq.gz F1HD1_S28_R2.fastq.gz
SRR3137747_1.fastq SRR3137747_2.fastq
The pattern I originally wrote to go with the fromFilePairs operator was *_?(R){1,2}?(_001).f?(ast)q?(.gz), which I tested in a bash environment. Here is the output from testing in the directory with the top two example files:
-bash-4.2$ shopt -s extglob
-bash-4.2$ ls -1 *_?(R){1,2}?(_001).f?(ast)q?(.gz)
19_S8_R1_001.fastq.gz
19_S8_R2_001.fastq.gz
But when I tried to run this with nextflow, it just gave me the error message I put into the ifEmpty operator.
I've eventually got it working, but using this pattern: *_{R1,R2,1,2}{.fastq.gz,.fq.gz,.fastq,.fq,_001.fastq.gz,_001.fq.gz,_001.fastq,_001.fq}, which isn't particularly robust.
Unless I've missed it in the Nextflow documentation (and in what I've found about glob syntax generally), I don't see an alternative to the match-zero-or-more operator in Nextflow. Any alternative solutions?
Thanks in advance.
The following glob pattern seems to match some of the more common FASTQ filenames:
Channel
.fromFilePairs( '*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}' )
.view()
Or with a parameterized directory prefix:
Channel
.fromFilePairs( "${params.input_dir}/*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}" )
.view()
Results:
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [serene_austin] - revision: 5527b9b3c0
[SRR3137747, [/path/to/fasta/SRR3137747_1.fastq, /path/to/fasta/SRR3137747_2.fastq]]
[19_S8, [/path/to/fasta/19_S8_R1_001.fastq.gz, /path/to/fasta/19_S8_R2_001.fastq.gz]]
[F1HD1_S28, [/path/to/fasta/F1HD1_S28_R1.fastq.gz, /path/to/fasta/F1HD1_S28_R2.fastq.gz]]
Another option, which might be more robust (and readable), is to make use of the fact that you can specify more than one glob pattern using a list as argument, and build your list of glob patterns dynamically:
nextflow.enable.dsl=2

params.input_dir = '/path/to/fasta'

def cartesian_product(A, B) {
    A.collectMany{ a -> B.collect { b -> [a, b] } }
}

def extensions = [
    '.fastq.gz',
    '.fastq',
    '.fq.gz',
    '.fq',
]

def suffixes = [
    '*_R{1,2}_001',
    '*_R{1,2}',
    '*_{1,2}',
]

workflow {
    def patterns = cartesian_product(suffixes, extensions).collect {
        "${params.input_dir}/${it.join()}"
    }

    Channel.fromFilePairs( patterns ).view()
}
Results:
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [deadly_payne] - revision: 6d2472ef23
[19_S8, [/path/to/fasta/19_S8_R1_001.fastq.gz, /path/to/fasta/19_S8_R2_001.fastq.gz]]
[F1HD1_S28, [/path/to/fasta/F1HD1_S28_R1.fastq.gz, /path/to/fasta/F1HD1_S28_R2.fastq.gz]]
[SRR3137747, [/path/to/fasta/SRR3137747_1.fastq, /path/to/fasta/SRR3137747_2.fastq]]

jq - How to extract domains and remove duplicates

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
  "name": "Visma Public",
  "domains": [
    "accountsettings.connect.identity.stagaws.visma.com",
    "admin.stage.vismaonline.com",
    "api.home.stag.visma.com",
    "api.workbox.dk",
    "app.workbox.dk",
    "app.workbox.co.uk",
    "authz.workbox.dk",
    "connect.identity.stagaws.visma.com",
    "eaccounting.stage.vismaonline.com",
    "eaccountingprinting.stage.vismaonline.com",
    "http://myservices-api.stage.vismaonline.com/",
    "identity.stage.vismaonline.com",
    "myservices.stage.vismaonline.com"
  ]
}
How can I transform the data into the form below? That is, identify the domains present in the format site.SLD.TLD and then remove the duplicates among them (not including the subdomains, protocols or paths, as illustrated below).
{
  "name": "Visma Public",
  "domains": [
    "workbox.co.uk",
    "workbox.dk",
    "visma.com",
    "vismaonline.com"
  ]
}
I would like to do so in jq, as that is what I've used to wrangle the data into this format so far, but at this stage any solution that I can run on Debian (I'm using bash) without extraneous tooling would ideally be fine.
I'm aware that regex can be used within jq, so I assume the best approach is to regex out the domain and then pipe to unique. However, I'm unable to get anything working so far. I'm currently trying this version, which seems to need only the text-transformation stage added in somehow, either during the jq processing or with a pass over the output with something like awk afterwards:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution was offered which does a regex match to extract the last two strings separated by '.' and calls the unique function on that. It works up to a point, but doesn't cover a site.SLD.TLD whose suffix itself has two parts: google.co.uk, for example, would return only co.uk with this jq:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'
A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import pprint
import urllib.request
from urllib.parse import urlparse
import os

def get_tlds():
    f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
    content = f.read()
    lines = content.decode('utf-8').split("\n")
    # remove comments
    tlds = [line for line in lines if not line.startswith("//") and not line == ""]
    return tlds

def extract_domain(url, tlds):
    # get domain
    url = url.replace("http://", "").replace("https://", "")
    url = url.split("/")[0]
    # get tld/sld
    parts = url.split(".")
    suffix1 = parts[-1]
    sld1 = parts[-2]
    if len(parts) > 2:
        suffix2 = ".".join(parts[-2:])
        sld2 = parts[-3]
    else:
        suffix2 = suffix1
        sld2 = sld1
    # try the longer suffix first
    if suffix2 in tlds:
        tld = suffix2
        sld = sld2
    else:
        tld = suffix1
        sld = sld1
    return sld + "." + tld

def clean(site, tlds):
    site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
    return site

if __name__ == "__main__":
    filename = "Hzt9bq2a.json"
    cache_path = "tlds.json"

    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            tlds = json.load(f)
    else:
        tlds = get_tlds()
        with open(cache_path, "w") as f:
            json.dump(tlds, f)

    with open(filename) as f:
        d = json.load(f)

    d = [clean(site, tlds) for site in d]
    pprint.pprint(d)

    with open("clean.json", "w") as f:
        json.dump(d, f)
May I offer you a way of achieving the same query with jtc? The same could be achieved in other languages (and of course in jq); the query is mostly a matter of coming up with the regex to satisfy your ask:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
  "domains": [
    "stagaws.visma.com",
    "stage.vismaonline.com",
    "stag.visma.com",
    "api.workbox.dk",
    "app.workbox.dk",
    "workbox.co.uk",
    "authz.workbox.dk"
  ],
  "name": "Visma Public"
}
bash $
Note: it extracts only DOMAIN.TLD, as per your ask. If you would like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.
Update:
Modified the solution as per the comment: extract domain.sld.tld where there are 3 or more levels and domain.tld where there are only 2.
PS. I'm the creator of jtc - the JSON processing utility. This disclaimer is an SO requirement.
One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
  sub("^[^:]*://";"")
  | sub("/.*$";"")
  | split(".")
  | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
     then -3 else -2 end) as $ix
  | .[$ix : ]
  | join(".") ;

{name, domain: (.domains | map(extract) | unique)}
Output
{
  "name": "Visma Public",
  "domain": [
    "visma.com",
    "vismaonline.com",
    "workbox.co.uk",
    "workbox.dk"
  ]
}
Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).
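As a sketch of that suggestion (rather than a jq answer): Python's third-party tldextract package uses the Public Suffix List and exposes the registered domain directly. Assuming the single-object input shown above is saved as testfile.json, something like this should produce the deduplicated list:
import json

import tldextract  # third-party: pip install tldextract

with open("testfile.json") as f:
    site = json.load(f)

# registered_domain collapses e.g. www.foo.com.br to foo.com.br using the Public Suffix List
site["domains"] = sorted({tldextract.extract(url).registered_domain for url in site["domains"]})

print(json.dumps(site, indent=2))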

Parsing JSON-like configuration file using R or AWK

I need your help, as I last worked with AWK many years ago and my knowledge is rusty now. Despite refreshing my memory by reading several guides, I'm sure my code contains some mistakes. Most related questions I've read on SO deal with parsing standard JSON, so the advice is not applicable to my case. The only answer close to what I'm looking for is the accepted answer to this SO question: using awk sed to parse update puppet file. But I'm trying to implement a two-pass parse, which I don't see in that answer (or don't understand well enough).
After considering other options (from R itself to m4 and various template engines in between), I thought about implementing the solution purely in R via the jsonlite and stringr packages, but it's not elegant. I've decided to write a short AWK script that parses my R project's data-collection configuration files before they are read by my R code. Such a file is, for the most part, a JSON file, but with some additions:
1) it contains embedded variables that are parameters, referring to values of JSON elements in the same file (which for simplicity I decided to place in the root of the JSON tree);
2) parameters are denoted by placing a star character ('*') immediately before the corresponding elements' names.
Initially I planned two types of embedded variables, which you can see here - internal (references to JSON elements in the same file, format: ${var}) and external (user-supplied, format: %{var}). However, the mechanism and benefits of passing values for the external parameters are still unclear to me, so for now I focus on parsing a configuration file with internal variables only. So, please disregard the external variables for now.
Example configuration file:
{
  "*source":"SourceForge",
  "*action":"import",
  "*schema":"sf0314",
  "data":[
    {
      "indicatorName":"test1",
      "indicatorDescription":"Test Indicator 1",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
    },
    {
      "indicatorName":"test2",
      "indicatorDescription":"Test Indicator 2",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT *
                    FROM sf1104.users a, sf1104.artifact b
                    WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
    },
    {
      "indicatorName":"totalProjects",
      "indicatorDescription":"Total number of unique projects",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
    },
    {
      "indicatorName":"totalDevs",
      "indicatorDescription":"Total number of developers per project",
      "indicatorType":"numeric",
      "resultType":"data.frame",
      "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
    }
  ]
}
AWK script:
#!/usr/bin/awk -f
BEGIN {
first_pass = true;
param = "\"\*[a-zA-Z^0-9]+?\"";
regex = "\$\{[a-zA-Z^0-9]+?\}";
params[""] = 0;
}
{
if (first_pass)
if (match($0, param)) {
print(substr($0, RSTART, RLENGTH));
params[param] = substr($0, RSTART, RLENGTH);
}
else
gsub(regex, params[regex], $0);
}
END {
if (first_pass) {
ARGC++;
ARGV[ARGIND++] = FILENAME;
first_pass = false;
nextfile;
}
}
Any help will be much appreciated! Thanks!
UPDATE (based on G. Grothendieck's answer)
The following code (wrapped in a function and slightly modified from the original answer) behaves incorrectly, unexpectedly outputting values of all marked (with '_') configuration keys instead of only the referenced ones:
generateConfig <- function(configTemplate, configFile) {
    suppressPackageStartupMessages(suppressWarnings(library(tcltk)))
    if (!require(gsubfn)) install.packages('gsubfn')
    library(gsubfn)

    regexKeyValue <- '"_([^"]*)":"([^"]*)"'
    regexVariable <- "[$]{([[:alpha:]][[:alnum:].]*)}"

    cfgTmpl <- readLines(configTemplate)

    defns <- strapplyc(cfgTmpl, regexKeyValue, simplify = rbind)
    dict <- setNames(defns[, 2], defns[, 1])
    config <- gsubfn(regexVariable, dict, cfgTmpl)

    writeLines(config, con = configFile)
}
The function is called as follows:
if (updateNeeded()) {
    <...>
    generateConfig(SRDA_TEMPLATE, SRDA_CONFIG)
}
UPDATE 2 (per G. Grothendieck's request)
The updateNeeded() function checks the existence and modification times of both files and then decides whether the configuration file needs to be (re)generated (it returns a boolean).
The following is the contents of the template configuration file (SRDA_TEMPLATE <- "./SourceForge.cfg.tmpl"):
{
  "_source":"SourceForge",
  "_action":"import",
  "_schema":"sf0314",
  "data":[
    {
      "indicatorName":"test1",
      "indicatorDescription":"Test Indicator 1",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
    },
    {
      "indicatorName":"test2",
      "indicatorDescription":"Test Indicator 2",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT *
                    FROM sf1104.users a, sf1104.artifact b
                    WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
    },
    {
      "indicatorName":"totalProjects",
      "indicatorDescription":"Total number of unique projects",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
    },
    {
      "indicatorName":"totalDevs",
      "indicatorDescription":"Total number of developers per project",
      "indicatorType":"numeric",
      "resultType":"data.frame",
      "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = 78745"
    }
  ]
}
The following is the contents of the auto-generated configuration file (SRDA_CONFIG <- "./SourceForge.cfg.json"):
{
  "_source":"SourceForge",
  "_action":"import",
  "_schema":"sf0314",
  "data":[
    {
      "indicatorName":"test1",
      "indicatorDescription":"Test Indicator 1",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
    },
    {
      "indicatorName":"test2",
      "indicatorDescription":"Test Indicator 2",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT *
                    FROM sf1104.users a, sf1104.artifact b
                    WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
    },
    {
      "indicatorName":"totalProjects",
      "indicatorDescription":"Total number of unique projects",
      "indicatorType":"numeric",
      "resultType":"numeric",
      "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM SourceForge import sf0314.user_group"
    },
    {
      "indicatorName":"totalDevs",
      "indicatorDescription":"Total number of developers per project",
      "indicatorType":"numeric",
      "resultType":"data.frame",
      "requestSQL":"SELECT COUNT(*) FROM SourceForge import sf0314.user_group WHERE group_id = 78745"
    }
  ]
}
Notice SourceForge and import, which are unexpectedly substituted before sf0314.
Help from the answer's author would be much appreciated!
I am assuming the objective is to replace each occurrence of ${...} with the definition given on the star lines. The post indicates that you are looking at awk because an R solution was not elegant, but I think that may have been due to the approach taken in R, and I am assuming an R solution is still acceptable if a different approach yields one that is reasonably compact.
Here config.json is the name of the input json file and config.out.json is the output file with the definitions substituted in.
We read in the file and use strapplyc to extract out a 2 column matrix of the definitions, defns. We rework this into a vector, dict, whose values are the values of the variables and whose names are the names of the variables. Then we use gsubfn to insert the definitions using the dict list. Finally we write it back out.
library(gsubfn)
Lines <- readLines("config.json")
defns <- strapplyc(Lines, '"\\*([^"]*)":"([^"]*)"', simplify = rbind)
dict <- setNames(as.list(defns[, 2]), defns[, 1])
Lines.out <- gsubfn("[$]{([[:alpha:]][[:alnum:].]*)}", dict, Lines)
writeLines(Lines.out, con = "config.out.json")
REVISED dict should be a list rather than a named character vector.
I believe:
#!/usr/bin/awk -f
BEGIN {
    param = "\"\\*([a-zA-Z]+?)\":\"([^\"]*)\"";
    regex = "\\${([a-zA-Z]+?)}";
}

NR == FNR {
    if (match($0, param, a)) {
        params[a[1]] = a[2]
    }
    next
}

match($0, regex, a) {
    gsub(regex, params[a[1]], $0);
}

1
does what you want (when run as awk -f file.awk input.conf input.conf) for your given input.

How to get list of changed files since last build in Jenkins/Hudson

I have set up Jenkins, but I would like to find out what files were added/changed between the current build and the previous build. I'd like to run some long running tests depending on whether or not certain parts of the source tree were changed.
Having scoured the Internet I can find no mention of this ability within Hudson/Jenkins though suggestions were made to use SVN post-commit hooks. Maybe it's so simple that everyone (except me) knows how to do it!
Is this possible?
I have done it the following way. I am not sure if that is the right way, but it seems to be working. You need to get the Jenkins Groovy plugin installed and do the following script.
import hudson.model.*;
import hudson.util.*;
import hudson.scm.*;
import hudson.plugins.accurev.*
def thr = Thread.currentThread();
def build = thr?.executable;
def changeSet= build.getChangeSet();
changeSet.getItems();
ChangeSet.getItems() gives you the changes. Since I use accurev, I did List<AccurevTransaction> accurevTransList = changeSet.getItems();.
Here, the modified list contains duplicate files/names if a file has been committed more than once during the current build window.
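If the duplicates are a problem, one way to collapse them (a rough sketch, and it assumes the SCM plugin implements getAffectedPaths() on its change-set entries) is to collect the paths into a set:
def touched = [] as Set
build.getChangeSet().getItems().each { entry ->
    // each entry is one commit/transaction; getAffectedPaths() lists the files it touched
    touched.addAll(entry.getAffectedPaths())
}
println touched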
The CI server will show you the list of changes, if you are polling for changes and using SVN update. However, you seem to want to be changing the behaviour of the build depending on which files were modified. I don't think there is any out-of-the-box way to do that with Jenkins alone.
A post-commit hook is a reasonable idea. You could parameterize the job, and have your hook script launch the build with the parameter value set according to the changes committed. I'm not sure how difficult that might be for you.
However, you may want to consider splitting this into two separate jobs - one that runs on every commit, and a separate one for the long-running tests that you don't always need. Personally I prefer to keep job behaviour consistent between executions. Otherwise traceability suffers.
echo $SVN_REVISION
svn_last_successful_build_revision=`curl $JOB_URL'lastSuccessfulBuild/api/json' | python -c 'import json,sys;obj=json.loads(sys.stdin.read());print obj["'"changeSet"'"]["'"revisions"'"][0]["'"revision"'"]'`
diff=`svn di -r$SVN_REVISION:$svn_last_successful_build_revision --summarize`
You can use the Jenkins Remote Access API to get a machine-readable description of the current build, including its full change set. The subtlety here is that if you have a 'quiet period' configured, Jenkins may batch multiple commits to the same repository into a single build, so relying on a single revision number is a bit naive.
I like to keep my Subversion post-commit hooks relatively simple and hand things off to the CI server. To do this, I use wget to trigger the build, something like this...
/usr/bin/wget --output-document "-" --timeout=2 \
https://ci.example.com/jenkins/job/JOBID/build?token=MYTOKEN
The job is then configured on the Jenkins side to execute a Python script that leverages the BUILD_URL environment variable and constructs the URL for the API from that. The URL ends up looking like this:
https://ci.example.com/jenkins/job/JOBID/BUILDID/api/json/
Here's some sample Python code that could be run inside the shell script. I've left out any error handling or HTTP authentication stuff to keep things readable here.
import os
import json
import urllib2
# Make the URL
build_url = os.environ['BUILD_URL']
api = build_url + 'api/json/'
# Call the Jenkins server and figure out what changed
f = urllib2.urlopen(api)
build = json.loads(f.read())
change_set = build['changeSet']
items = change_set['items']
touched = []
for item in items:
    touched += item['affectedPaths']
Using the Build Flow plugin and Git:
final changeSet = build.getChangeSet()
final changeSetIterator = changeSet.iterator()
while (changeSetIterator.hasNext()) {
    final gitChangeSet = changeSetIterator.next()
    for (final path : gitChangeSet.getPaths()) {
        println path.getPath()
    }
}
With Jenkins pipelines (pipeline supporting APIs plugin 2.2 or above), this solution is working for me:
def changeLogSets = currentBuild.changeSets
for (int i = 0; i < changeLogSets.size(); i++) {
    def entries = changeLogSets[i].items
    for (int j = 0; j < entries.length; j++) {
        def entry = entries[j]
        def files = new ArrayList(entry.affectedFiles)
        for (int k = 0; k < files.size(); k++) {
            def file = files[k]
            println file.path
        }
    }
}
See How to access changelogs in a pipeline job.
Through Groovy:
<!-- CHANGE SET -->
<% changeSet = build.changeSet
if (changeSet != null) {
hadChanges = false %>
<h2>Changes</h2>
<ul>
<% changeSet.each { cs ->
hadChanges = true
aUser = cs.author %>
<li>Commit <b>${cs.revision}</b> by <b><%= aUser != null ? aUser.displayName : it.author.displayName %>:</b> (${cs.msg})
<ul>
<% cs.affectedFiles.each { %>
<li class="change-${it.editType.name}"><b>${it.editType.name}</b>: ${it.path} </li> <% } %> </ul> </li> <% }
if (!hadChanges) { %>
<li>No Changes !!</li>
<% } %> </ul> <% } %>
#!/bin/bash
set -e
job_name="whatever"
JOB_URL="http://myserver:8080/job/${job_name}/"
FILTER_PATH="path/to/folder/to/monitor"
python_func="import json, sys
obj = json.loads(sys.stdin.read())
ch_list = obj['changeSet']['items']
_list = [ j['affectedPaths'] for j in ch_list ]
for outer in _list:
    for inner in outer:
        print inner
"
_affected_files=`curl --silent ${JOB_URL}${BUILD_NUMBER}'/api/json' | python -c "$python_func"`
if [ -z "`echo \"$_affected_files\" | grep \"${FILTER_PATH}\"`" ]; then
    echo "[INFO] no changes detected in ${FILTER_PATH}"
    exit 0
else
    echo "[INFO] changed files detected: "
    for a_file in `echo "$_affected_files" | grep "${FILTER_PATH}"`; do
        echo " $a_file"
    done;
fi;
This is slightly different: I needed a script for Git on a particular folder, so I wrote a check based on jollychang's answer.
It can be added directly to the job's exec shell script. If no files are detected it will exit 0, i.e. SUCCESS... this way you can always trigger on check-ins to the repository, but build when files in the folder of interest change.
But if you want to build on demand (i.e. by clicking Build Now) with the changes since the last build, you would change _affected_files to:
_affected_files=`curl --silent $JOB_URL'lastSuccessfulBuild/api/json' | python -c "$python_func"`
Note: You have to use Jenkins' own SVN client to get a change list. Doing it through a shell build step won't list the changes in the build.
It's simple, but this works for me:
$DirectoryA = "D:\Jenkins\jobs\projectName\builds" #### Jenkins directory
$firstfolder = Get-ChildItem -Path $DirectoryA | Where-Object {$_.PSIsContainer} | Sort-Object LastWriteTime -Descending | Select-Object -First 1
$DirectoryB = $DirectoryA + "\" + $firstfolder
$sVnLoGfIle = $DirectoryB + "\" + "changelog.xml"
write-host $sVnLoGfIle
I tried to add this as a comment, but code doesn't format well in comments:
Just want to prettify code from heroin's answer:
def changedFiles = []
def changeLogSets = currentBuild.changeSets
for (entries in changeLogSets) {
    for (entry in entries) {
        for (file in entry.affectedFiles) {
            echo "Found changed file: ${file.path}"
            changedFiles += "${file.path}"
        }
    }
}
Keep in mind that in some cases the Git plugin returns an empty changeSet, for example:
the first run in a newly created branch
a build started with the 'Build Now' button
Refer to https://issues.jenkins-ci.org/browse/JENKINS-26354 for more details.