Is there a JOLT documentation? What's the meaning of the &, # etc. operators? (NiFi, JoltTransformJSON) - json

Yeah there is! I made this question to share my knowledge, Q&A style since I had a hard time finding it myself :)
Thanks to https://stackoverflow.com/a/67821482/1561441 (Barbaros Özhan, see comments) for pointing me into the correct direction

The answer is: look here and here
Correct me if I'm wrong, but: Wow, currently to my knowledge a single .java file on GitHub, last commit in 2017, holds relevant parts of the official documentation of the JOLT syntax. I had to use its syntax since I'm working with NiFi and applied its JoltTransformJSON processor (hence the SEO abuses in my question, so more people find the answer)
Here are some of the most relevant parts copied from https://github.com/bazaarvoice/jolt/blob/master/jolt-core/src/main/java/com/bazaarvoice/jolt/Shiftr.java and slightly edited. The documentation itself is more extensive and also shows examples.
'*' Wildcard
Valid only on the LHS ( input JSON keys ) side of a Shiftr Spec
The '*' wildcard can be used by itself or to match part of a key.
'&' Wildcard
Valid on the LHS (left hand side - input JSON keys) and RHS (output data path)
Means, dereference against a "path" to get a value and use that value as if were a literal key.
The canonical form of the wildcard is "&(0,0)".
The first parameter is where in the input path to look for a value, and the second parameter is which part of the key to use (used with * key).
There are syntactic sugar versions of the wildcard, all of the following mean the same thing; Sugar : '&' = '&0' = '&(0)' = '&(0,0)
The syntactic sugar versions are nice, as there are a set of data transforms that do not need to use the canonical form, eg if your input data does not have any "prefixed" keys.
'$' Wildcard
Valid only on the LHS of the spec.
The existence of this wildcard is a reflection of the fact that the "data" of the input JSON, can be both in the "values" and the "keys" of the input JSON
The base case operation of Shiftr is to copy input JSON "values", thus we need a way to specify that we want to copy the input JSON "key" instead.
Thus '$' specifies that we want to use an input key, or input key derived value, as the data to be placed in the output JSON.
'$' has the same syntax as the '&' wildcard, and can be read as, dereference to get a value, and then use that value as the data to be output.
There are two cases where this is useful
when a "key" in the input JSON needs to be a "id" value in the output JSON, see the ' "$": "SecondaryRatings.&1.Id" ' example above.
you want to make a list of all the input keys.
'#' Wildcard
Valid both on the LHS and RHS, but has different behavior / format on either side.
The way to think of it, is that it allows you to specify a "synthentic" value, aka a value not found in the input data.
On the RHS of the spec, # is only valid in the the context of an array, like "[#2]".
What "[#2]" means is, go up the three levels and ask that node how many matches it has had, and then use that as an index in the arrays.
This means that, while Shiftr is doing its parallel tree walk of the input data and the spec, it tracks how many matches it has processed at each level of the spec tree.
This useful if you want to take a JSON map and turn it into a JSON array, and you do not care about the order of the array.
On the LHS of the spec, # allows you to specify a hard coded String to be place as a value in the output.
The initial use-case for this feature was to be able to process a Boolean input value, and if the value is boolean true write out the string "enabled". Note, this was possible before, but it required two Shiftr steps.
'#' Wildcard
Valid on both sides of the spec.
The basic '#' on the LHS.
This wildcard is necessary if you want to put both the input value and the input key somewhere in the output JSON.
Thus the '#' wildcard is the mean "copy the value of the data at this level in the tree, to the output".
Advanced '#' sign wildcard
The format is lools like "#(3,title)", where "3" means go up the tree 3 levels and then lookup the key "title" and use the value at that key.

I would love to know if there is an alternative to JoltTransformJSON simply because I'm struggling a lot with understanding it (not coming from a programming background myself). When it works (thanks to all the help here) it does simplify things a lot!
Here are a few other sites that help:
https://intercom.help/godigibee/en/articles/4044359-transformer-getting-to-know-jolt
https://erbalvindersingh.medium.com/applying-jolttransform-on-json-object-array-and-fetching-specific-fields-48946870b4fc
https://cool-cheng.blogspot.com/2019/12/json-jolt-tutorial.html

Related

How to marshal a predicate from JSON in Prolog?

In Python it is common to marshal objects from JSON. I am seeking similar functionality in Prolog, either swi-prolog or scryer.
For instance, if we have JSON stating
{'predicate':
{'mortal(X)', ':-', 'human(X)'}
}
I'm hoping to find something like load_predicates(j) and have that data immediately consulted. A version of json.dumps() and loads() would also be extremely useful.
EDIT: For clarity, this will allow interoperability with client applications which will be collecting rules from users. That application is probably not in Prolog, but something like React.js.
I agree with the commenters that it would be easier to convert the JSON data to a .pl file in the proper format first and then load that.
However, you can load the predicates from JSON directly, convert them to a representation that Prolog understands, and use assertz to add them to the knowledge base.
If indeed the data contains all the syntax needed for a predicate (as is the case in the example data in the question) then converting the representation is fairly simple as you just need to concatenate the elements of the list into a string and then create a term out of the string. Note that this assumption skips step 2 in the first comment by Guy Coder.
Note that the Prolog JSON library is rather strict in which format it accepts: only double quotes are valid as string delimiters, and lists with singleton values (i.e., not key-value pairs) need to use the notation [a,b,c] instead of {a,b,c}. So first the example data needs to be rewritten:
{"predicate":
["mortal(X)", ":-", "human(X)"]
}
Then you can load it in SWI-Prolog. Minimal working example:
:- use_module(library(http/json)).
% example fact for testing
human(aristotle).
load_predicate(J) :-
% open the file
open(J, read, JSONstream, []),
% parse the JSON data
json_read(JSONstream, json(L)),
% check for an occurrence of the predicate key with value L2
member(predicate=L2, L),
% concatenate the list into a string
atomics_to_string(L2, S),
% create a term from the string
term_string(T, S),
% add to knowledge base
assertz(T).
Example run:
?- consult('mwe.pl').
true.
?- load_predicate('example_predicate.json').
true.
?- mortal(X).
X = aristotle.
Detailed explanation:
The predicate json_read stores the data in the following form:
json([predicate=['mortal(X)', :-, 'human(X)']])
This is a list inside a json term with one element for each key-value pair. The element has the syntax key=value. In the call to json_read you can already strip the json() term and store the list directly in the variable L.
Then member/2 is used to search for the compound term predicate=L2. If you have more than one predicate in the JSON file then you should turn this into a foreach or in a recursive call to process all predicates in the list.
Since the list L2 already contains a syntactically well-formed Prolog predicate it can just be concatenated, turned into a term using term_string/2 and asserted. Note that in case the predicate is not yet in the required format, you can construct a predicate out of the various pieces using built-in predicate manipulation functionality, see https://www.swi-prolog.org/pldoc/doc_for?object=copy_predicate_clauses/2 for some pointers.

Alternative ways to extract the contents of a JSON string

Consider the following query:
select '"{\"foo\":\"bar\"}"'::json;
This will return a single record of a single element containing a JSON string. See:
test=# select json_typeof('"{\"foo\":\"bar\"}"'::json); json_typeof
-------------
string
(1 row)
It is possible to extract the contents of the string as follows:
=# select ('"{\"foo\":\"bar\"}"'::json) #>>'{}';
json
---------------
{"foo":"bar"}
(1 row)
From this point onward, the result can be cast as a JSON object:
test=# select json_typeof((('"{\"foo\":\"bar\"}"'::json) #>>'{}')::json);
json_typeof
-------------
object
(1 row)
This way seems magical.
I define no path within the extraction operator, yet what is returned is not what I passed. This seems like passing no index to an array accessor, and getting an element back.
I worry that I will confuse the next maintainer to look at this logic.
Is there a less magical way to do this?
But you did define a path. Defining "root" as path is just another path. And that's just what the #>> operator is for:
Extracts JSON sub-object at the specified path as text.
Rendering as text effectively applies the escape characters in the string. When casting back to json the special meaning of double-quotes (not escaped any more) kicks in. Nothing magic there. No better way to do it.
If you expect it to be confusing to the afterlife, add comments explaining what you are doing there.
Maybe, in the spirit of clarity, you might use the equivalent function json_extract_path_text() instead. The manual:
Extracts JSON sub-object at the specified path as text. (This is functionally equivalent to the #>> operator.)
Now, the function has a VARIADIC parameter, and you typically enter path elements one-by-one, like the example in the manual demonstrates:
json_extract_path_text('{"f2":{"f3":1},"f4":{"f5":99,"f6":"foo"}}',
'f4', 'f6') → foo
You cannot enter the "root" path this way. But (what the manual does not add at this point) you can alternatively provide an actual array after adding the keyword VARIADIC. See:
Pass multiple values in single parameter
So this does the trick:
SELECT json_extract_path_text('"{\"foo\":\"bar\"}"'::json, VARIADIC '{}')::json;
And since we are all about being explicit, use the verbose SQL standard cast syntax:
SELECT cast(json_extract_path_text('"{\"foo\":\"bar\"}"'::json, VARIADIC '{}') AS json)
Any clearer, yet? (I would personally prefer your shorter original, but I may be biased, being a "native speaker" of Postgres..)
The question is, why do you have that odd JSON literal including escapes as JSON string in the first place?

Regex for matching with and without quotes for dynamic JSON

I have the following text strings:
"Name":"John"}]
"Age":36
"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"
"Gender":null
I need to get two groups (key value pair) from this such that the output would be only:
Key|Value
Name|John
Age|36
Address|ABC,PQR234[]/.,#ANYCHARACTERS
The requirement is to have a single regex to grab everything in the double quotes if the double quotes are present. If not, take the value without the quotes.
In our example above, 36 and null are the one without the quotes and they need to be captured as well.
I have tried a lot but have failed to do so.
UPDATE:
I don't know why I am getting down votes for this question. Yes this is JSON that I am trying to parse but there is a reason behind why I am doing this and not using any document parser.
I am supposed to use Talend for getting a dynamic JSON converted into Key Value Pair. What I mean by dynamic is the fields of the JSON can vary and hence I do not have a fixed schema and hence cannot use a document parser (which demands a fixed structure of JSON). I am devising a solution to get around this using Normalizer (on comma) and then extracting the key value pair which will be in double quotes using Regular Expressions. I tried many things on my own and since I am not an expert in Regular expressions, I have come here to get inputs.
If you know any better solution to this, I would be very happy to get your inputs.
How about this?
/"?([^\n"]*)"?:"?([^\n"]*)"?/
Explained in detail at:
https://regex101.com/r/UM0rl2/1/

Using regex to extract data from structured data

The problem I'm facing here is that I have a blob of text which contains structured data (in the form of a JSON payload) and I'm interested in extracting the value of one of the keys for a specific JSON instance, picture the structured data inside as the following:
"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", "key3":"item1_key3_value"}, "Item 2": {"key1":"item2_key1_value", "key2":"item2_key2_value", "key3":"item2_key3_value"}
What I would like to use is use regex to grab item1_key2_value for instance. The keys all have the same name but the items are different. So I know which key for which Item I need but am not quite sure of the regex to retrieve that value. I've tried a few approaches to some basic matching but was wondering if any other more experienced regex users could direct me a bit here and explain what I'm doing wrong
1(.)(?=item1_key2_value.) will match a chunk of data from here but I'm not sure of the best way to reduce it to the value that I need.
The regex syntax for JSON is clearly specified at http://www.json.org. If you scroll down a little to where it says "A string is a sequence of", you will find the proper string structure.
Assuming the string follows the correct JSON structure, you could use
"key2"\s*:\s*"((\\.|[^\\"])*)"
where \s means whitespace and * means 0 or more times. \\ means a slosh (backslash) character and can be followed by . (any character). If it does not encounter a slosh, then it instead looks for [^\\"], which means not slosh nor quote.
If you want to be a little more strict to the exact JSON form, you could try
"key2"\s*:\s*"((\\["\\/bfnrtu]|[^\\"])*)"
which you can see follows the string form on the webpage more closely.

Are double definitions allowed in JSON, and if so, how should they be interpreted?

Is this valid JSON?
{
"name": "foo",
"name": "bar"
}
If so, how should it be interpreted?
It's technically legal, but strongly discouraged, according to the RFC:
The names within an object SHOULD be unique.
You can go one of two routes:
The JavaScript route: In JavaScript, this is illegal. Since JSON is supposed to be a subset, reject the input as invalid.
The Postel/Python route: Overwrite the "var" entry with the latest value.
According to RFC 4627, duplicate names are discouraged. See section 2.2. Objects:
The names within an object SHOULD be unique.
The above URL also refers us to RFC 2119, which specifies how the word SHOULD is interpreted:
SHOULD
This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.
However, many parsers & JSON APIs implement this as SHOULD ALWAYS, and throw an error or ignore multiple values upon encountering duplicate properties. This includes jQuery.parseJSON() as well as .NET's JSON serialization.
It is not valid JSON as there are two name variables. Take a read of this to help you understand JSON a bit better.
JSon object, like any other object, can not have two attribute with same name. That's illegal in the same way as having same key twice in a map.
JSONObject would throw an exception if you have two keys with same name in one object. You may want to alter your object so that keys are not repeated under same object.
In this case the change would be to make your json key name have value as an array
No, is not. You have two attributes with the same label/name/title. Here is a very simple and short explanation of the JSON