jq - How to extract domains and remove duplicates - json

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
"name": "Visma Public",
"domains": [
"accountsettings.connect.identity.stagaws.visma.com",
"admin.stage.vismaonline.com",
"api.home.stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"app.workbox.co.uk",
"authz.workbox.dk",
"connect.identity.stagaws.visma.com",
"eaccounting.stage.vismaonline.com",
"eaccountingprinting.stage.vismaonline.com",
"http://myservices-api.stage.vismaonline.com/",
"identity.stage.vismaonline.com",
"myservices.stage.vismaonline.com"
]
}
How can I transform the data into the form below? That is, identify the domains present in site.SLD.TLD format and remove the duplicates, ignoring subdomains, protocols and paths:
{
"name": "Visma Public",
"domains": [
"workbox.co.uk",
"workbox.dk",
"visma.com",
"vismaonline.com"
]
}
I would like to do this in jq, as that is what I've used to wrangle the data into this format so far, but at this stage any solution that runs on Debian (I'm using bash), ideally without extraneous tooling, would be fine.
I'm aware that regex can be used within jq, so I assume the best way is to regex out the domain and then pipe to unique. However, I've been unable to get anything working so far. I'm currently trying the version below, which seems to need only the text-transformation stage adding, either during the jq process or with a pass of something like awk afterwards:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution that was offered uses a regex match to extract the last two strings separated by . and calls unique on the result. It works up to a point, but it doesn't cover a site.SLD.TLD whose suffix has two parts: google.co.uk would return only co.uk with this jq, for example:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'

A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import os
import pprint
import urllib.request


def get_tlds():
    # download the Public Suffix List
    f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
    content = f.read()
    lines = content.decode('utf-8').split("\n")
    # remove comments and blank lines
    tlds = [line for line in lines if not line.startswith("//") and not line == ""]
    return tlds


def extract_domain(url, tlds):
    # get domain: strip the protocol and anything after the first slash
    url = url.replace("http://", "").replace("https://", "")
    url = url.split("/")[0]
    # get tld/sld
    parts = url.split(".")
    if len(parts) < 2:
        # a single label has no suffix to split off
        return url
    suffix1 = parts[-1]
    sld1 = parts[-2]
    if len(parts) > 2:
        suffix2 = ".".join(parts[-2:])
        sld2 = parts[-3]
    else:
        suffix2 = suffix1
        sld2 = sld1
    # try the longer suffix first
    if suffix2 in tlds:
        tld = suffix2
        sld = sld2
    else:
        tld = suffix1
        sld = sld1
    return sld + "." + tld


def clean(site, tlds):
    # deduplicate the extracted domains with a set
    site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
    return site


if __name__ == "__main__":
    filename = "Hzt9bq2a.json"
    cache_path = "tlds.json"
    # cache the suffix list locally so it is downloaded only once
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            tlds = json.load(f)
    else:
        tlds = get_tlds()
        with open(cache_path, "w") as f:
            json.dump(tlds, f)
    with open(filename) as f:
        d = json.load(f)
    d = [clean(site, tlds) for site in d]
    pprint.pprint(d)
    with open("clean.json", "w") as f:
        json.dump(d, f)

Here is the same query done with jtc. The same could be achieved in other languages (and of course in jq); the task is mostly about coming up with a regex that satisfies your requirement:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
"domains": [
"stagaws.visma.com",
"stage.vismaonline.com",
"stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"workbox.co.uk",
"authz.workbox.dk"
],
"name": "Visma Public"
}
bash $
Note: it extracts only DOMAIN.TLD, as per your request. If you would like to extract DOMAIN.SLD.TLD, the task becomes a bit less trivial.
Update:
Modified the solution as per the comment: it now extracts domain.sld.tld where there are 3 or more levels and domain.tld where there are only 2.
PS. I'm the creator of jtc, the JSON processing utility. This disclaimer is an SO requirement.

One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
  sub("^[^:]*://"; "")
  | sub("/.*$"; "")
  | split(".")
  | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
     then -3 else -2 end) as $ix
  | .[$ix : ]
  | join(".") ;

{name, domain: (.domains | map(extract) | unique)}
Output
{
"name": "Visma Public",
"domain": [
"visma.com",
"vismaonline.com",
"workbox.co.uk",
"workbox.dk"
]
}
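For readers more comfortable outside jq, the same heuristic can be sketched in Python. This is only a port of the rule used above (a two-letter final label preceded by a label of at most three letters is treated as a two-part suffix); the function name `extract` is carried over for illustration, and the heuristic is not a substitute for the Public Suffix List.

```python
import re


def extract(value):
    """Strip protocol and path, then keep the last 2 or 3 labels of the host."""
    host = re.sub(r"^[^:]*://", "", value)   # drop "http://", "https://", ...
    host = re.sub(r"/.*$", "", host)         # drop the path, if any
    labels = host.split(".")
    # Heuristic: "co.uk"-style suffix = 2-letter TLD after a label of <= 3 letters
    two_part_suffix = (
        len(labels) >= 2 and len(labels[-1]) == 2 and len(labels[-2]) <= 3
    )
    keep = 3 if two_part_suffix else 2
    return ".".join(labels[-keep:])


if __name__ == "__main__":
    domains = [
        "api.workbox.dk",
        "app.workbox.co.uk",
        "connect.identity.stagaws.visma.com",
        "http://myservices-api.stage.vismaonline.com/",
    ]
    print(sorted({extract(d) for d in domains}))
```

Like the jq version, this misfires when a short name sits directly under a two-letter TLD (a host such as www.abc.dk would be kept whole), which is why the suffix list is needed for a general solution.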

Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).
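To make the wildcard and exception handling concrete, here is a minimal Python sketch of the matching step. It assumes a tiny hand-picked set of rules written in the list's syntax (`SAMPLE_RULES` is illustrative, not the real list), and the helper names `hostname`, `public_suffix_length` and `registrable_domain` are made up for this example. In practice an existing Public Suffix List module is likely the safer choice.

```python
from urllib.parse import urlsplit

# Illustrative sample only; the real list is far longer and changes over time.
SAMPLE_RULES = ["com", "dk", "uk", "co.uk", "br", "com.br", "*.ck", "!www.ck"]


def hostname(value):
    """Return the host part of a URL or of a bare host name."""
    if "://" not in value:
        value = "//" + value  # lets urlsplit treat a bare host as a netloc
    return (urlsplit(value).hostname or "").rstrip(".")


def public_suffix_length(labels, rules):
    """Return how many trailing labels form the public suffix."""
    exceptions = {r[1:] for r in rules if r.startswith("!")}
    normal = {r for r in rules if not r.startswith("!")}
    # Exception rules win: the suffix is the rule minus its leftmost label.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in exceptions:
            return len(labels) - i - 1
    # Otherwise the longest matching rule wins; "*" matches any single label.
    for i in range(len(labels)):
        tail = labels[i:]
        if ".".join(tail) in normal or ".".join(["*"] + tail[1:]) in normal:
            return len(tail)
    return 1  # no rule matched: fall back to the implicit "*" rule


def registrable_domain(value, rules=SAMPLE_RULES):
    """Return the public suffix plus one label, or None if there is none."""
    labels = hostname(value).split(".")
    n = public_suffix_length(labels, rules)
    if len(labels) <= n:
        return None  # the host is itself a public suffix (or empty)
    return ".".join(labels[-(n + 1):])


if __name__ == "__main__":
    for h in ["www.foo.com.br", "app.workbox.co.uk", "api.workbox.dk"]:
        print(h, "->", registrable_domain(h))
```

Deduplicating is then a matter of collecting the results in a set, as in the other answers.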


Loading Multiple CSV files across all subfolder levels with Wildcard file name

I want to load multiple CSV files whose names match a pattern into a dataframe. Currently I loop through the whole folder, build a list of file names, load those CSVs into a list of dataframes, and then concatenate them.
The approach I would like to use (if possible) is to bypass all that code and read all the files in a one-liner.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
|-- Subfolder1
|   `-- Subfolder 2
|       |-- X01.csv
|       |-- Y01.csv
|       `-- Z01.csv
|-- Subfolder3
|   `-- Subfolder4
|       |-- X01.csv
|       `-- Y01.csv
`-- Subfolder5
    |-- X01.csv
    `-- Y01.csv
I want to read all "X01.csv" files, starting from Root Folder.
Is there a way I can read all the required files with code like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFilelookup","true").option("header","true").load(filepath)
This code works fine for a single level of subfolders. Is there an equivalent for multi-level folders? I thought the "recursiveFilelookup" option would look across all levels of subfolders, but apparently that is not how it works.
Currently I am getting a
Path not found ... filepath
exception.
Any help, please?
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment:
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line looks through all files inside
# rootpath recursively, trying to match the pattern
# specified. In this case, it finds any CSV file whose
# name is one of the letters X, Y, or Z followed by
# two digits (each ranging from 0 to 9). Note that "|"
# inside [...] would be matched literally, so it is omitted.
glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv capability of reading a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
to return a list of all CSV files inside rootpath or any of its subdirectories.
Additionally, to consider only the files directly inside rootpath (without descending into subdirectories), you could use something like:
glob.glob(rootpath + "*.csv")
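As a self-contained illustration of the recursive search, the sketch below rebuilds a folder tree like the one in the question inside a temporary directory and collects every file matching X*.csv with pathlib. The helper names `build_demo_tree` and `find_files` are hypothetical, and this only applies to paths visible to the local filesystem.

```python
import tempfile
from pathlib import Path


def build_demo_tree(root):
    """Create a small nested folder tree with a few CSV files in it."""
    for rel in [
        "Subfolder1/Subfolder 2/X01.csv",
        "Subfolder1/Subfolder 2/Y01.csv",
        "Subfolder1/Subfolder 2/Z01.csv",
        "Subfolder3/Subfolder4/X01.csv",
        "Subfolder3/Subfolder4/Y01.csv",
        "Subfolder5/X01.csv",
        "Subfolder5/Y01.csv",
    ]:
        path = Path(root, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("id\n1\n")


def find_files(root, pattern):
    """Return sorted paths of files at any depth under root matching pattern."""
    return sorted(str(p) for p in Path(root).rglob(pattern) if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        for path in find_files(tmp, "X*.csv"):
            print(path)
```

The resulting list of paths can be passed to spark.read.csv in the same way as the glob.glob result above.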
Edit
Based on your comments to this answer, does something like this work on Databricks?
from notebookutils import mssparkutils as ms
from py4j.protocol import Py4JJavaError

# Databricks has a function called dbutils.fs.ls
# that works similarly to mssparkutils.fs.ls, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls


def scan_dir(
    initial_path: str,
    search_str: str,
    account_name: str = "",
):
    """Scan a directory and subdirectories for a string.

    Parameters
    ----------
    initial_path : str
        The path to start the search. Accepts either a valid container name,
        or the entire connection string.
    search_str : str
        The string to search.
    account_name : str
        The name of the account to access the container folders.
        This value is only used when `initial_path` doesn't
        conform with the format: "abfss://<initial_path>@<account_name>.dfs.core.windows.net/"

    Raises
    ------
    FileNotFoundError
        If the `initial_path` informed doesn't exist.
    ValueError
        If `initial_path` is not a string.
    """
    if not isinstance(initial_path, str):
        raise ValueError(
            f'`initial_path` needs to be of type string, not {type(initial_path)}'
        )
    elif not initial_path.startswith('abfss'):
        initial_path = f'abfss://{initial_path}@{account_name}.dfs.core.windows.net/'
    try:
        fdirs = ms.fs.ls(initial_path)
    except Py4JJavaError as exc:
        raise FileNotFoundError(
            f'The path you informed "{initial_path}" doesn\'t exist'
        ) from exc
    found = []
    for path in fdirs:
        p = path.path
        if path.isDir:
            # Recurse into subdirectories, passing account_name along
            found = [*found, *scan_dir(p, search_str, account_name)]
        if search_str.lower() in path.name.lower():
            # print(p.split('.net')[-1])
            # Keep the parent folder of every matching entry
            found = [*found, p.replace(path.name, "")]
    # Deduplicate the folders found
    return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME@ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet", "ACCOUNTNAME"))
The method above worked for me on Azure Synapse.

Match-zero-or-more operator in nextflow glob

I am trying to create a robust glob pattern that will match most of the different naming conventions used for fastq files we receive. However, the version of nextflow I am using (20.10.0) on the HPC doesn't seem to accept what I've written.
Here are some examples of file names:
19_S8_R1_001.fastq.gz 19_S8_R2_001.fastq.gz
F1HD1_S28_R1.fastq.gz F1HD1_S28_R2.fastq.gz
SRR3137747_1.fastq SRR3137747_2.fastq
The pattern I originally wrote to go with the fromFilePairs operator was *_?(R){1,2}?(_001).f?(ast)q?(.gz), which I tested in a bash environment. Here is the output from testing in the directory with the top two example files:
-bash-4.2$ shopt -s extglob
-bash-4.2$ ls -1 *_?(R){1,2}?(_001).f?(ast)q?(.gz)
19_S8_R1_001.fastq.gz
19_S8_R2_001.fastq.gz
But when I tried to run this with nextflow, it just gave me the error message I put into the ifEmpty operator.
I've eventually got it working, but using this pattern: *_{R1,R2,1,2}{.fastq.gz,.fq.gz,.fastq,.fq,_001.fastq.gz,_001.fq.gz,_001.fastq,_001.fq}, which isn't particularly robust.
Unless I've missed it in the nextflow documentation (and the information I've found about glob), I don't see alternatives to match-zero-or-more operators in nextflow. Any alternative solutions?
Thanks in advance.
The following glob pattern seems to match some of the more common FASTQ filenames:
Channel
    .fromFilePairs( '*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}' )
    .view()
Or with a parameterized directory prefix:
Channel
    .fromFilePairs( "${params.input_dir}/*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}" )
    .view()
Results:
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [serene_austin] - revision: 5527b9b3c0
[SRR3137747, [/path/to/fasta/SRR3137747_1.fastq, /path/to/fasta/SRR3137747_2.fastq]]
[19_S8, [/path/to/fasta/19_S8_R1_001.fastq.gz, /path/to/fasta/19_S8_R2_001.fastq.gz]]
[F1HD1_S28, [/path/to/fasta/F1HD1_S28_R1.fastq.gz, /path/to/fasta/F1HD1_S28_R2.fastq.gz]]
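The trick in this pattern is that an empty alternative inside braces, as in {,R} or {,.gz}, plays the role of an optional group. The Python sketch below only illustrates that idea: it expands non-nested brace groups into plain wildcard patterns and checks file names against them with fnmatch. The helpers `expand_braces` and `matches_any` are hypothetical and are not how Nextflow itself evaluates globs.

```python
import fnmatch
import itertools
import re


def expand_braces(pattern):
    """Expand non-nested {a,b} groups into a list of plain wildcard patterns."""
    # re.split with one capture group alternates: literal, group, literal, ...
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


def matches_any(name, patterns):
    """Return True if name matches at least one wildcard pattern."""
    return any(fnmatch.fnmatchcase(name, p) for p in patterns)


if __name__ == "__main__":
    patterns = expand_braces("*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}")
    for name in [
        "19_S8_R1_001.fastq.gz",
        "F1HD1_S28_R2.fastq.gz",
        "SRR3137747_1.fastq",
    ]:
        print(name, matches_any(name, patterns))
```

Each brace group here has two alternatives, so the single pattern stands for 2 x 2 x 2 x 2 x 2 = 32 plain patterns.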
Another option, which might be more robust (and readable), is to make use of the fact that you can specify more than one glob pattern using a list as argument, and build your list of glob patterns dynamically:
nextflow.enable.dsl=2

params.input_dir = '/path/to/fasta'

def cartesian_product(A, B) {
    A.collectMany{ a -> B.collect { b -> [a, b] } }
}

def extensions = [
    '.fastq.gz',
    '.fastq',
    '.fq.gz',
    '.fq',
]

def suffixes = [
    '*_R{1,2}_001',
    '*_R{1,2}',
    '*_{1,2}',
]

workflow {
    def patterns = cartesian_product(suffixes, extensions).collect {
        "${params.input_dir}/${it.join()}"
    }
    Channel.fromFilePairs( patterns ).view()
}
Results:
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [deadly_payne] - revision: 6d2472ef23
[19_S8, [/path/to/fasta/19_S8_R1_001.fastq.gz, /path/to/fasta/19_S8_R2_001.fastq.gz]]
[F1HD1_S28, [/path/to/fasta/F1HD1_S28_R1.fastq.gz, /path/to/fasta/F1HD1_S28_R2.fastq.gz]]
[SRR3137747, [/path/to/fasta/SRR3137747_1.fastq, /path/to/fasta/SRR3137747_2.fastq]]

Search and replace based on a dictionary

I have a json file filled with a list of data where each element has one field called url.
[
{ ...,
...,
"url": "us.test.com"
},
...
]
In a different file I have a list of mappings that I need to replace the affected url fields with, formatted like this:
us.test.com test.com
hello.com/se hello.com
...
So the end result should be:
[
{ ...,
...,
"url": "test.com"
},
...
]
Is there a way to do this in Vim or do I need to do it programmatically?
Well, I'd do this programmatically in Vim ;-) As you'll see it's quite similar to Python and many other scripting languages.
Let's suppose we have the JSON file open. Then
:let foo = json_decode(join(getline(1, '$')))
will load the JSON into a Vim script variable, so :echo foo will show [{'url': 'us.test.com'}, {'url': 'hello.com/se'}].
Now let's switch to the "mapping" file. We're going to split every line and build a Dictionary like this:
:let bar = {}
:for line in getline(1, '$') | let field = split(line) | let bar[field[0]] = field[1] | endfor
Now :echo bar shows {'hello.com/se': 'hello.com', 'us.test.com': 'test.com'} as expected.
To perform a substitution we do simply:
:for field in foo | let field.url = bar->get(field.url, field.url) | endfor
And now foo contains [{'url': 'test.com'}, {'url': 'hello.com'}] which is what we want. The remaining step is to write the new value into a buffer with
:put =json_encode(foo)
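For comparison, the same three steps (decode the JSON, build a dictionary from the mapping lines, substitute with a fallback to the original value) might look like this outside Vim. It is a minimal Python sketch; the helper names `load_mappings` and `replace_urls` are made up for illustration, and it assumes each mapping line holds exactly two whitespace-separated fields.

```python
import json


def load_mappings(lines):
    """Build {old_url: new_url} from lines of two whitespace-separated fields."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


def replace_urls(records, mapping):
    """Replace each record's "url" if it has a mapping; otherwise keep it."""
    for record in records:
        record["url"] = mapping.get(record["url"], record["url"])
    return records


if __name__ == "__main__":
    records = json.loads('[{"url": "us.test.com"}, {"url": "hello.com/se"}]')
    mapping = load_mappings(["us.test.com test.com", "hello.com/se hello.com"])
    print(json.dumps(replace_urls(records, mapping)))
```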
You could…
turn those lines in your mappings file (/tmp/mappings for illustration purpose):
us.test.com test.com
hello.com/se hello.com
...
into:
g/"url"/s#us.test.com#test.com#g
g/"url"/s#hello.com/se#hello.com#g
...
with:
:%normal Ig/"url"/s#
:%s/ /#
The idea is to turn the file into a script that will perform all those substitutions on all lines matching "url".
If you are confident that those strings are only in "url" lines, you can just do:
:%normal I%s#
:%s/ /#
to obtain:
%s#us.test.com#test.com#g
%s#hello.com/se#hello.com#g
...
write the file:
:w
and source it from your JSON file:
:source /tmp/mappings
See :help :g, :help :s, :help :normal, :help :range, :help :source, and :help pattern-delimiter.

Convert data from CSV to JSON with grouping

Example csv data (top row is column header followed by three data lines);
floor,room,note1,note2,note3
floor1,room1,2people
floor2,room4,6people,projector
floor6,room5,20people,projector,phone
I need the output in json, but grouped by floor, like this;
floor
room
note1
note2
note3
room
note1
note2
note3
floor
room
note1
note2
note3
room
note1
note2
note3
So all floor1 rooms are in their own json grouping, then floor2 rooms etc.
Please could someone point me in the right direction in terms of which tools to look at and any specific functions, e.g. jq + categories. I've done some searching already and got muddled up between lots of different posts relating to csvtojson, jq and some Python scripts. Ideally I would like to include the solution in a shell script rather than a separate program/language (I have sysadmin experience but am not a programmer).
Many thanks
Perhaps this can get you started.
Use a programming language like Python to convert the CSV data into a dictionary data structure by splitting on the commas, and use the JSON library to dump your dictionary out as JSON.
I have assumed that actually you expect to have more than one room per floor and thus I took the liberty to adjust your input data a little.
import json

csv = """floor1,room1,note1,note2,note3
floor1,room2,2people
floor1,room3,3people
floor2,room4,6people,projector
floor2,room5,3people,projector
floor3,room6,1person
"""

# Build {floor: {room: [notes...]}} one line at a time
response = {}
for line in csv.splitlines():
    fields = line.split(",")
    floor, room, data = fields[0], fields[1], fields[2:]
    if floor not in response:
        response[floor] = {}
    response[floor][room] = data

print(json.dumps(response))
If you then run that script and pipe it into jq (where jq is used only for pretty-printing the output on your screen; it is not really required) you will see output like this (the order of the keys may vary with the Python version):
$ python test.py | jq .
{
"floor1": {
"room2": [
"2people"
],
"room3": [
"3people"
],
"room1": [
"note1",
"note2",
"note3"
]
},
"floor2": {
"room4": [
"6people",
"projector"
],
"room5": [
"3people",
"projector"
]
},
"floor3": {
"room6": [
"1person"
]
}
}
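A variant that reads from a file-like object with the standard csv module and skips the header row could look like the sketch below. It assumes the column layout from the question (floor, room, then any number of notes); the function name `group_by_floor` is illustrative.

```python
import csv
import io
import json
from collections import defaultdict


def group_by_floor(csv_file, has_header=True):
    """Return {floor: {room: [notes...]}} from rows of floor,room,note..."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # discard the column-header row
    floors = defaultdict(dict)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        floor, room, notes = row[0], row[1], row[2:]
        floors[floor][room] = [note for note in notes if note]
    return dict(floors)


if __name__ == "__main__":
    sample = io.StringIO(
        "floor,room,note1,note2,note3\n"
        "floor1,room1,2people\n"
        "floor2,room4,6people,projector\n"
        "floor6,room5,20people,projector,phone\n"
    )
    print(json.dumps(group_by_floor(sample), indent=2))
```

From a shell script, such a file could be run with the CSV opened via open("rooms.csv", newline="") in place of the in-memory sample (the file name is a placeholder).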

parsing nested structures in R

I have a JSON-like string that represents a nested structure. It is not real JSON, in that the names and values are not quoted. I want to parse it into a nested structure, e.g. a list of lists.
#example:
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
and the result should be like this:
x_list = list(a=1,b=2,c=c(1,2,3),d=list(e="something"))
is there any convenient function that I don't know that does this kind of parsing?
Thanks.
If all of your data is consistent, there is a simple solution involving regex and jsonlite package. The code is:
if(!require(jsonlite, quietly=TRUE)){
  # if the library is not installed: install it and load it into the R session for use.
  install.packages("jsonlite",repos="https://ftp.heanet.ie/mirrors/cran.r-project.org")
  library(jsonlite)
}
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
json_x_string = "{\"a\":1, \"b\":2, \"c\":[1,2,3], \"d\":{\"e\":\"something\"}}"
fromJSON(json_x_string)
s <- gsub( "([A-Za-z]+)", "\"\\1\"", gsub( "([A-Za-z]*)=", "\\1:", x_string ) )
fromJSON( s )
The first section checks if the package is installed. If it is it loads it, otherwise it installs it and then loads it. I usually include this in any R code I'm writing to make it simpler to transfer between pcs/people.
Your string is x_string, we want it to look like json_x_string which gives the desired output when we call fromJSON().
The regex is split into two parts because it's been a while - I'm pretty sure this could be made more elegant. Then again, this depends on if your data is consistent so I'll leave it like this for now. First it changes "=" to ":", then it adds quotation marks around all groups of letters. Calling fromJSON(s) gives the output:
fromJSON(s)
$a
[1] 1
$b
[1] 2
$c
[1] 1 2 3
$d
$d$e
[1] "something"
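If the data turns out to be less regular than the regex rewrite assumes, a small tokenizer plus a recursive parser is one alternative. The sketch below shows the idea in Python rather than R. It assumes bare keys and values contain no whitespace or any of the characters {}[],= and the names `tokenize`, `parse_value` and `parse` are illustrative.

```python
import re

# A token is either one structural character or a run of other characters.
TOKEN = re.compile(r"\s*([{}\[\],=]|[^{}\[\],=\s]+)")


def tokenize(text):
    return TOKEN.findall(text)


def scalar(token):
    """Convert a bare token to int or float when possible, else keep the string."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token


def parse_value(tokens, i):
    """Parse the value starting at tokens[i]; return (value, next_index)."""
    if tokens[i] == "{":
        result, i = {}, i + 1
        while tokens[i] != "}":
            key = tokens[i]
            if tokens[i + 1] != "=":
                raise ValueError(f"expected '=' after key {key!r}")
            value, i = parse_value(tokens, i + 2)
            result[key] = value
            if tokens[i] == ",":
                i += 1
        return result, i + 1
    if tokens[i] == "[":
        items, i = [], i + 1
        while tokens[i] != "]":
            item, i = parse_value(tokens, i)
            items.append(item)
            if tokens[i] == ",":
                i += 1
        return items, i + 1
    return scalar(tokens[i]), i + 1


def parse(text):
    tokens = tokenize(text)
    value, end = parse_value(tokens, 0)
    if end != len(tokens):
        raise ValueError("unexpected trailing input")
    return value


if __name__ == "__main__":
    print(parse("{a=1, b=2, c=[1,2,3], d={e=something}}"))
```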
I would rather avoid JSON parsing because of its lack of extensibility and flexibility, and stick to a solution based on regex and recursion.
Here is an extensible base implementation that parses your input string as desired.
The main recursion function:
# Parse string
parse.string = function(.string){
  regex = "^((.*)=)??\\{(.*)\\}"
  # Recursion termination: element parsing
  if(iselement(.string)){
    return(parse.element(.string))
  }
  # Extract components
  elements.str = gsub(regex, "\\3", .string)
  elements.vector = get.subelements(elements.str)
  # Recursively parse each element
  parsed.elements = list(sapply(elements.vector, parse.string, USE.NAMES = F))
  # Extract list's name and return
  name = gsub(regex, "\\2", .string)
  names(parsed.elements) = name
  return(parsed.elements)
}
Helping functions:
library(stringr)

# Test if the string is a base element
iselement = function(.string){
  grepl("^[^[:punct:]]+=[^\\{\\}]+$", .string)
}

# Parse element
parse.element = function(element.string){
  splits = strsplit(element.string, "=")[[1]]
  element = splits[2]
  # Parse numeric elements
  if(!is.na(as.numeric(element))){
    element = as.numeric(element)
  }
  # TODO: Extend here to include vectors
  # Reformat and return
  element = list(element)
  names(element) = splits[1]
  return(element)
}

# Get subelements from a string
get.subelements = function(.string){
  # Regex of allowed elements - Extend here to include more types
  elements.regex = c("[^, ]+?=\\{.+?\\}", #Sublist
                     "[^, ]+?=\\[.+?\\]", #Vector
                     "[^, ]+?=[^=,]+")    #Base element
  str_extract_all(.string, pattern = paste(elements.regex, collapse = "|"))[[1]]
}
Parsing results:
string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
string_2 = "{a=1, b=2, c=[1,2,3], d=somthing}"
named_string = "xyz={a=1, b=2, c=[1,2,3], d={e=something, f=22}}"
named_string_2 = "xyz={d={e=something, f=22}}"
parse.string(string)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 2
#
# [[1]]$c
# [1] "[1,2,3]"
#
# [[1]]$d
# [[1]]$d$e
# [1] "something"