Parsing a JSON-like configuration file using R or AWK

I need your help: I worked with AWK many years ago and my knowledge of it is rusty now. Despite refreshing my memory by reading several guides, I'm sure my code contains some mistakes. Most related questions I've read on SO deal with parsing standard JSON, so the advice is not applicable to my case. The only answer close to what I'm looking for is the accepted answer to this SO question: using awk sed to parse update puppet file. But I'm trying to implement a two-pass parse, which I don't see in that answer (or don't understand well enough).
After considering other options (from R itself to m4 and various template engines in between), I thought about implementing the solution purely in R via the jsonlite and stringr packages, but that didn't seem elegant. So I've decided to write a short AWK script to parse my R project's data-collection configuration files before they are read by my R code. Such a file is for the most part a JSON file, but with some additions:
1) it contains embedded variables that act as parameters, referring to values of JSON elements in the same file (which, for simplicity, I decided to place at the root of the JSON tree);
2) parameters are denoted by placing a star character ('*') immediately before the corresponding elements' names.
Initially I planned two types of embedded variables, which you can see here: internal (references to JSON elements in the same file, format: ${var}) and external (user-supplied, format: %{var}). However, the mechanism and benefits of passing values for the external parameters are still unclear to me, so for now I'm focusing on parsing configuration files with internal variables only. Please disregard the external variables for now.
Example configuration file:
{
    "*source":"SourceForge",
    "*action":"import",
    "*schema":"sf0314",
    "data":[
        {
            "indicatorName":"test1",
            "indicatorDescription":"Test Indicator 1",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
        },
        {
            "indicatorName":"test2",
            "indicatorDescription":"Test Indicator 2",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT *
                FROM sf1104.users a, sf1104.artifact b
                WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
        },
        {
            "indicatorName":"totalProjects",
            "indicatorDescription":"Total number of unique projects",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
        },
        {
            "indicatorName":"totalDevs",
            "indicatorDescription":"Total number of developers per project",
            "indicatorType":"numeric",
            "resultType":"data.frame",
            "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
        }
    ]
}
AWK script:
#!/usr/bin/awk -f
BEGIN {
    first_pass = true;
    param = "\"\*[a-zA-Z^0-9]+?\"";
    regex = "\$\{[a-zA-Z^0-9]+?\}";
    params[""] = 0;
}
{
    if (first_pass)
        if (match($0, param)) {
            print(substr($0, RSTART, RLENGTH));
            params[param] = substr($0, RSTART, RLENGTH);
        }
    else
        gsub(regex, params[regex], $0);
}
END {
    if (first_pass) {
        ARGC++;
        ARGV[ARGIND++] = FILENAME;
        first_pass = false;
        nextfile;
    }
}
Any help will be much appreciated! Thanks!
UPDATE (based on G. Grothendieck's answer)
The following code (wrapped in a function and slightly modified from the original answer) behaves incorrectly: it unexpectedly inserts the values of all marked (with '_') configuration keys instead of only the referenced ones:
generateConfig <- function(configTemplate, configFile) {
    suppressPackageStartupMessages(suppressWarnings(library(tcltk)))
    if (!require(gsubfn)) install.packages('gsubfn')
    library(gsubfn)
    regexKeyValue <- '"_([^"]*)":"([^"]*)"'
    regexVariable <- "[$]{([[:alpha:]][[:alnum:].]*)}"
    cfgTmpl <- readLines(configTemplate)
    defns <- strapplyc(cfgTmpl, regexKeyValue, simplify = rbind)
    dict <- setNames(defns[, 2], defns[, 1])
    config <- gsubfn(regexVariable, dict, cfgTmpl)
    writeLines(config, con = configFile)
}
The function is called as follows:
if (updateNeeded()) {
    <...>
    generateConfig(SRDA_TEMPLATE, SRDA_CONFIG)
}
UPDATE 2 (per G. Grothendieck's request)
The function updateNeeded() checks the existence and modification times of both files and, based on that, decides whether the configuration file needs to be (re)generated (it returns a boolean).
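For reference, a minimal sketch of what such a check might look like (this is an assumption on my part, not the actual implementation, which is omitted here; SRDA_TEMPLATE and SRDA_CONFIG are the paths used above):
## Hedged sketch only -- the real updateNeeded() is not shown in the question.
## Regenerate when the generated config is missing or older than its template.
updateNeeded <- function(template = SRDA_TEMPLATE, config = SRDA_CONFIG) {
    if (!file.exists(template)) stop("template configuration file not found")
    if (!file.exists(config)) return(TRUE)
    file.info(template)$mtime > file.info(config)$mtime
}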
The following is the contents of the template configuration file (SRDA_TEMPLATE <- "./SourceForge.cfg.tmpl"):
{
    "_source":"SourceForge",
    "_action":"import",
    "_schema":"sf0314",
    "data":[
        {
            "indicatorName":"test1",
            "indicatorDescription":"Test Indicator 1",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
        },
        {
            "indicatorName":"test2",
            "indicatorDescription":"Test Indicator 2",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT *
                FROM sf1104.users a, sf1104.artifact b
                WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
        },
        {
            "indicatorName":"totalProjects",
            "indicatorDescription":"Total number of unique projects",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
        },
        {
            "indicatorName":"totalDevs",
            "indicatorDescription":"Total number of developers per project",
            "indicatorType":"numeric",
            "resultType":"data.frame",
            "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = 78745"
        }
    ]
}
The following is the contents of the auto-generated configuration file (SRDA_CONFIG <- "./SourceForge.cfg.json"):
{
    "_source":"SourceForge",
    "_action":"import",
    "_schema":"sf0314",
    "data":[
        {
            "indicatorName":"test1",
            "indicatorDescription":"Test Indicator 1",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
        },
        {
            "indicatorName":"test2",
            "indicatorDescription":"Test Indicator 2",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT *
                FROM sf1104.users a, sf1104.artifact b
                WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
        },
        {
            "indicatorName":"totalProjects",
            "indicatorDescription":"Total number of unique projects",
            "indicatorType":"numeric",
            "resultType":"numeric",
            "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM SourceForge import sf0314.user_group"
        },
        {
            "indicatorName":"totalDevs",
            "indicatorDescription":"Total number of developers per project",
            "indicatorType":"numeric",
            "resultType":"data.frame",
            "requestSQL":"SELECT COUNT(*) FROM SourceForge import sf0314.user_group WHERE group_id = 78745"
        }
    ]
}
Notice that SourceForge and import are unexpectedly inserted before sf0314.
Help from the answer's author would be much appreciated!

I am assuming the objective is to replace each occurrence of ${...} with the definition given on the starred lines. The post indicates that you are looking at awk because an R solution was not elegant, but I think that may have been due to the approach taken in R, and I am assuming an R solution is still acceptable if a different approach yields one that is reasonably compact.
Here config.json is the name of the input JSON file and config.out.json is the output file with the definitions substituted in.
We read in the file and use strapplyc to extract a two-column matrix of the definitions, defns. We rework this into a named list, dict, whose values are the values of the variables and whose names are the names of the variables. Then we use gsubfn to insert the definitions using the dict list. Finally we write it back out.
library(gsubfn)
Lines <- readLines("config.json")
defns <- strapplyc(Lines, '"\\*([^"]*)":"([^"]*)"', simplify = rbind)
dict <- setNames(as.list(defns[, 2]), defns[, 1])
Lines.out <- gsubfn("[$]{([[:alpha:]][[:alnum:].]*)}", dict, Lines)
writeLines(Lines.out, con = "config.out.json")
REVISED: dict should be a list rather than a named character vector.
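For what it's worth, here is a minimal one-line illustration of the list-based lookup that gsubfn performs (same pattern as above; the single input string is just for demonstration):
library(gsubfn)
dict <- list(schema = "sf0314")   # a named list, not a named character vector
gsubfn("[$]{([[:alpha:]][[:alnum:].]*)}", dict,
       "SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group")
## [1] "SELECT COUNT(DISTINCT group_id) FROM sf0314.user_group"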

I believe:
#!/usr/bin/awk -f
BEGIN {
    param = "\"\\*([a-zA-Z]+?)\":\"([^\"]*)\"";
    regex = "\\${([a-zA-Z]+?)}";
}
NR == FNR {
    if (match($0, param, a)) {
        params[a[1]] = a[2]
    }
    next
}
match($0, regex, a) {
    gsub(regex, params[a[1]], $0);
}
1
does what you want (when run as awk -f file.awk input.conf input.conf) for your given input. Note that the three-argument form of match() used here is a GNU Awk (gawk) extension.

Related

Nextflow rename barcodes and concatenate reads within barcodes

My current working directory has the following sub-directories
My Bash script
Hi there
I have put together the above Bash script to do the following tasks:
rename the sub-directories (barcode01-12) taking information from the metadata.csv
concatenate the individual reads within a sub-directory and move them up in the $PWD
then I use these concatenated reads (one per barcode) for my Nextflow script below:
Query:
How can I add the above pre-processing tasks (renaming and concatenating), or the Bash script itself, at the beginning of my Nextflow script below?
In my experience, FASTQ files can get quite large. Without knowing too much of the specifics, my recommendation would be to move the concatenation (and renaming) to a separate process. In this way, all of the 'work' can be done inside Nextflow's working directory. Here's a solution that uses the new DSL 2. It uses the splitCsv operator to parse the metadata and identify the FASTQ files. The collection can then be passed into our 'concat_reads' process. To handle optionally gzipped files, you could try the following:
params.metadata = './metadata.csv'
params.outdir = './results'

process concat_reads {

    tag { sample_name }

    publishDir "${params.outdir}/concat_reads", mode: 'copy'

    input:
    tuple val(sample_name), path(fastq_files)

    output:
    tuple val(sample_name), path("${sample_name}.${extn}")

    script:
    if( fastq_files.every { it.name.endsWith('.fastq.gz') } )
        extn = 'fastq.gz'
    else if( fastq_files.every { it.name.endsWith('.fastq') } )
        extn = 'fastq'
    else
        error "Concatenation of mixed filetypes is unsupported"

    """
    cat ${fastq_files} > "${sample_name}.${extn}"
    """
}

process pomoxis {

    tag { sample_name }

    publishDir "${params.outdir}/pomoxis", mode: 'copy'

    cpus 18

    input:
    tuple val(sample_name), path(fastq)

    """
    mini_assemble \\
        -t ${task.cpus} \\
        -i "${fastq}" \\
        -o results \\
        -p "${sample_name}"
    """
}

workflow {

    fastq_extns = [ '.fastq', '.fastq.gz' ]

    Channel.fromPath( params.metadata )
        | splitCsv()
        | map { dir, sample_name ->
            all_files = file(dir).listFiles()
            fastq_files = all_files.findAll { fn ->
                fastq_extns.find { fn.name.endsWith( it ) }
            }
            tuple( sample_name, fastq_files )
        }
        | concat_reads
        | pomoxis
}

jq - How to extract domains and remove duplicates

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
    "name": "Visma Public",
    "domains": [
        "accountsettings.connect.identity.stagaws.visma.com",
        "admin.stage.vismaonline.com",
        "api.home.stag.visma.com",
        "api.workbox.dk",
        "app.workbox.dk",
        "app.workbox.co.uk",
        "authz.workbox.dk",
        "connect.identity.stagaws.visma.com",
        "eaccounting.stage.vismaonline.com",
        "eaccountingprinting.stage.vismaonline.com",
        "http://myservices-api.stage.vismaonline.com/",
        "identity.stage.vismaonline.com",
        "myservices.stage.vismaonline.com"
    ]
}
How can I transform the data into the form below? That is, identify the domains present in the site.SLD.TLD format and then remove the duplicates (not including the subdomains, protocols, or paths, as illustrated below).
{
    "name": "Visma Public",
    "domains": [
        "workbox.co.uk",
        "workbox.dk",
        "visma.com",
        "vismaonline.com"
    ]
}
I would like to do this in jq, as that is what I've used to wrangle the data into this format so far, but at this stage any solution that I can run on Debian (I'm using bash) without extraneous tooling would ideally be fine.
I'm aware that regex can be used within jq, so I assume the best way is to extract the domain with a regex and then pipe to unique; however, I'm unable to get anything working so far. I'm currently trying this version, which seems to need only the text-transformation stage added in somehow, either during the jq processing or with a follow-up pass using something like awk:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution was offered which does a regex match to extract the last two strings separated by '.' and calls the unique function on that. It works up to a point, but doesn't cover a site.SLD.TLD whose suffix has two parts; for example, google.co.uk would return only co.uk with this jq:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'
A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import pprint
import urllib.request
from urllib.parse import urlparse
import os

def get_tlds():
    f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
    content = f.read()
    lines = content.decode('utf-8').split("\n")
    # remove comments
    tlds = [line for line in lines if not line.startswith("//") and not line == ""]
    return tlds

def extract_domain(url, tlds):
    # get domain
    url = url.replace("http://", "").replace("https://", "")
    url = url.split("/")[0]
    # get tld/sld
    parts = url.split(".")
    suffix1 = parts[-1]
    sld1 = parts[-2]
    if len(parts) > 2:
        suffix2 = ".".join(parts[-2:])
        sld2 = parts[-3]
    else:
        suffix2 = suffix1
        sld2 = sld1
    # try the longer suffix first
    if suffix2 in tlds:
        tld = suffix2
        sld = sld2
    else:
        tld = suffix1
        sld = sld1
    return sld + "." + tld

def clean(site, tlds):
    site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
    return site

if __name__ == "__main__":
    filename = "Hzt9bq2a.json"
    cache_path = "tlds.json"
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            tlds = json.load(f)
    else:
        tlds = get_tlds()
        with open(cache_path, "w") as f:
            json.dump(tlds, f)
    with open(filename) as f:
        d = json.load(f)
    d = [clean(site, tlds) for site in d]
    pprint.pprint(d)
    with open("clean.json", "w") as f:
        json.dump(d, f)
May I offer a way of achieving the same query with jtc? The same could be achieved in other languages (and of course in jq); the task is mostly about coming up with a regex that satisfies your requirement:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
    "domains": [
        "stagaws.visma.com",
        "stage.vismaonline.com",
        "stag.visma.com",
        "api.workbox.dk",
        "app.workbox.dk",
        "workbox.co.uk",
        "authz.workbox.dk"
    ],
    "name": "Visma Public"
}
bash $
Note: it extracts only DOMAIN.TLD, as per your ask. If you'd like to extract DOMAIN.SLD.TLD, the task becomes a bit less trivial.
Update:
Modified the solution as per the comment: extract domain.sld.tld where there are 3 or more levels and domain.tld where there are only 2.
PS: I'm the creator of jtc, the JSON processing utility. This disclaimer is an SO requirement.
One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worth pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as in any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
    sub("^[^:]*://";"")
  | sub("/.*$";"")
  | split(".")
  | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
     then -3 else -2 end) as $ix
  | .[$ix : ]
  | join(".") ;

{name, domain: (.domains | map(extract) | unique)}
Output
{
    "name": "Visma Public",
    "domain": [
        "visma.com",
        "vismaonline.com",
        "workbox.co.uk",
        "workbox.dk"
    ]
}
Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).
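To illustrate that advice with one concrete (non-jq) possibility: the urltools R package bundles a Public Suffix List helper, so the registrable domain could be recovered along these lines (a sketch, assuming urltools is installed; the expected output is hedged, not verified against the full input file):
library(urltools)
urls  <- c("http://myservices-api.stage.vismaonline.com/", "app.workbox.co.uk")
parts <- suffix_extract(domain(urls))   # splits each host against the Public Suffix List
unique(paste(parts$domain, parts$suffix, sep = "."))
## expected: "vismaonline.com" "workbox.co.uk"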

How Do I Consume an Array of JSON Objects using Plumber in R

I have been experimenting with Plumber in R recently, and am having success when I pass the following data using a POST request;
{"Gender": "F", "State": "AZ"}
This allows me to write a function like the following to return the data.
#* @post /score
score <- function(Gender, State){
    data <- list(
        Gender = as.factor(Gender)
        , State = as.factor(State))
    return(data)
}
However, when I try to POST an array of JSON objects, I can't seem to access the data through the function
[{"Gender":"F","State":"AZ"},{"Gender":"F","State":"NY"},{"Gender":"M","State":"DC"}]
I get the following error
{
    "error": [
        "500 - Internal server error"
    ],
    "message": [
        "Error in is.factor(x): argument \"Gender\" is missing, with no default\n"
    ]
}
Does anyone have an idea of how Plumber parses JSON? I'm not sure how to access and assign the fields to vectors to score the data.
Thanks in advance
I see two possible solutions here. The first is a command-line-based approach, which I assume you were attempting. I tested this on Windows and used column-based data.frame encoding, which I prefer due to shorter JSON string lengths. Make sure to escape quotation marks correctly to avoid 'argument "..." is missing, with no default' errors:
curl -H "Content-Type: application/json" --data "{\"Gender\":[\"F\",\"F\",\"M\"],\"State\":[\"AZ\",\"NY\",\"DC\"]}" http://localhost:8000/score
# [["F","F","M"],["AZ","NY","DC"]]
The second approach is R native and has the advantage of having everything in one place:
library(jsonlite)
library(httr)

## sample data
lst = list(
    Gender = c("F", "F", "M")
    , State = c("AZ", "NY", "DC")
)

## jsonify
jsn = lapply(
    lst
    , toJSON
)

## query
request = POST(
    url = "http://localhost:8000/score?"
    , query = jsn # values must be length 1
)
response = content(
    request
    , as = "text"
    , encoding = "UTF-8"
)
fromJSON(
    response
)
#      [,1]
# [1,] "[\"F\",\"F\",\"M\"]"
# [2,] "[\"AZ\",\"NY\",\"DC\"]"
Be aware that httr::POST() expects a list of length-1 values as query input, so the array data should be jsonified beforehand. If you want to avoid the additional package imports altogether, some system(), sprintf(), etc. magic should do the trick.
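For illustration, a rough sketch of that system()/sprintf() route (this is only an assumption of what it might look like; it shells out to the same curl call shown earlier and assumes a POSIX shell with curl on the PATH and the API listening on port 8000):
## hedged sketch -- shell out to curl instead of using httr/jsonlite
body <- '{"Gender":["F","F","M"],"State":["AZ","NY","DC"]}'
cmd  <- sprintf(
    "curl -s -H 'Content-Type: application/json' --data '%s' http://localhost:8000/score",
    body
)
system(cmd, intern = TRUE)   # should return something like [["F","F","M"],["AZ","NY","DC"]]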
Finally, here is my plumber endpoint (living in R/plumber.R and condensed a little bit):
#* @post /score
score = function(Gender, State){
    lapply(
        list(Gender, State)
        , as.factor
    )
}
and code to fire up the API:
pr = plumber::plumb("R/plumber.R")
pr$run(port = 8000)

Livy Server: return a dataframe as JSON?

I am executing a statement in Livy Server using an HTTP POST call to localhost:8998/sessions/0/statements, with the following body:
{
    "code": "spark.sql(\"select * from test_table limit 10\")"
}
I would like an answer in the following format
(...)
"data": {
    "application/json": "[
        {"id": "123", "init_date": 1481649345, ...},
        {"id": "133", "init_date": 1481649333, ...},
        {"id": "155", "init_date": 1481642153, ...},
    ]"
}
(...)
but what I'm getting is
(...)
"data": {
    "text/plain": "res0: org.apache.spark.sql.DataFrame = [id: string, init_date: timestamp ... 64 more fields]"
}
(...)
This is the toString() version of the DataFrame.
Is there some way to return a dataframe as JSON using the Livy Server?
EDIT
Found a JIRA issue that addresses the problem: https://issues.cloudera.org/browse/LIVY-72
Judging by the comments, is it fair to say that Livy does not and will not support such a feature?
I recommend using the built-in %json and %table magics (documentation for them is admittedly hard to find):
%json

session_url = host + "/sessions/1"
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""\
        val d = spark.sql("SELECT COUNT(DISTINCT food_item) FROM food_item_tbl")
        val e = d.collect
        %json e
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print r.json()
%table

session_url = host + "/sessions/21"
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""\
        val x = List((1, "a", 0.12), (3, "b", 0.63))
        %table x
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print r.json()
Related: Apache Livy: query Spark SQL via REST: possible?
I don't have a lot of experience with Livy, but as far as I know this endpoint is used as an interactive shell, and the output will be a string with the actual result that would be shown by a shell. So, with that in mind, I can think of a way to emulate the result you want, but it may not be the best way to do it:
{
    "code": "println(spark.sql(\"select * from test_table limit 10\").toJSON.collect.mkString(\"[\", \",\", \"]\"))"
}
Then, you will have a JSON wrapped in a string, so your client could parse it.
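For instance, if the client happens to be R, the parsing step could look roughly like this (a hypothetical sketch: statement_output stands for the already-retrieved output object of the statement, whose layout mirrors the response shown in the question, and the surrounding request/polling code is omitted):
library(jsonlite)
wrapped <- statement_output$data$`text/plain`   # the println()'ed JSON string
records <- fromJSON(wrapped)                    # one row per record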
I think in general your best bet is to write your output to a database of some kind. If you write to a randomly named table, you could have your code read it after the script is done.

Iteratively read a fixed number of lines into R

I have a JSON file I'm working with that contains multiple JSON objects in a single file. R is unable to read the file as a whole, but since each object occurs at regular intervals, I would like to iteratively read a fixed number of lines into R.
There are a number of SO questions on reading single lines into R, but I have been unable to extend those solutions to a fixed number of lines. For my problem I need to read 16 lines into R at a time (e.g. 1-16, 17-32, etc.).
I have tried using a loop but can't seem to get the syntax right:
## File
file <- "results.json"

## Create connection
con <- file(description=file, open="r")

## Loop over a file connection
for(i in 1:1000) {
    tmp <- scan(file=con, nlines=16, quiet=TRUE)
    data[i] <- fromJSON(tmp)
}
The file contains over 1000 objects of this form:
{
    "object": [
        [
            "a",
            0
        ],
        [
            "b",
            2
        ],
        [
            "c",
            2
        ]
    ]
}
With @tomtom's inspiration I was able to find a solution.
## File
file <- "results.json"

## Loop over a file
for(i in 1:1000) {
    tmp <- paste(scan(file=file, what="character", sep="\n", nlines=16, skip=(i-1)*16, quiet=TRUE), collapse=" ")
    assign(x = paste("data", i, sep = "_"), value = fromJSON(tmp))
}
I couldn't create a connection as each time I tried the connection would close before the file had been completely read. So I got rid of that step.
I had to include the what="character" argument, as scan() expects numeric input by default.
I included sep="\n", paste(), and collapse=" " to create a single string rather than the character vector that scan() produces by default.
Finally, I switched the final assignment to assign() to have a bit more control over the names of the output.
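For what it's worth, a connection-based variant is also possible (a hedged sketch, assuming jsonlite and that every object spans exactly 16 lines); keeping the connection open in a repeat loop avoids re-reading the file from the start on every iteration:
library(jsonlite)

con  <- file("results.json", open = "r")
data <- list()
i <- 0
repeat {
    chunk <- readLines(con, n = 16)
    if (length(chunk) == 0) break                 # end of file reached
    i <- i + 1
    data[[i]] <- fromJSON(paste(chunk, collapse = " "))
}
close(con)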
This might help:
EDITED to make it use a list and Reduce into one file
## Loop over a file connection
data <- NULL
for(i in 1:1000) {
    tmp <- scan(file=con, nlines=16, skip=(i-1)*16, quiet=TRUE)
    data[[i]] <- fromJSON(tmp)
}
df <- Reduce(function(x, y) {paste(x, y, collapse = " ")}, data)
You would have to make sure that you don't read past the end of the file, though ;-)