NiFi - Extract values from an array - JSON

I have a flowfile with:
{"ocorrencias":[129539290,129539291]}
I need to create an attribute with each value, for example:
ocorrencias = 129539290
ocorrencias = 129539291
Is this possible with any processor?
Note: I'm testing with SplitJson, but it only returns the value, and I also need the attribute name ocorrencias.

Use two processors: SplitJson (splits the JSON into FlowFiles) -> ExtractText (extracts the content to an attribute).
SplitJson processor:
JsonPath Expression: $.ocorrencias[*]
ExtractText processor:
ocorrencias (dynamic property): (.*)
Output: 2 FlowFiles, each with an attribute:
ocorrencias = 129539290
ocorrencias = 129539291
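If it helps to see what happens to the flowfile, here is a minimal Python sketch (illustrative only, not NiFi itself) emulating what the two processors do:

import json

flowfile = '{"ocorrencias":[129539290,129539291]}'

# SplitJson with $.ocorrencias[*]: one flowfile per array element
splits = [str(v) for v in json.loads(flowfile)["ocorrencias"]]

# ExtractText with the dynamic property ocorrencias = (.*):
# the whole content is captured into the attribute
for content in splits:
    attributes = {"ocorrencias": content}
    print(attributes)  # {'ocorrencias': '129539290'}, then {'ocorrencias': '129539291'}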

Most efficient way to read concatenated JSON in PySpark?

I'm getting one JSON file where each line is itself a JSON document of 1000 objects, like this:
{"id":"test1", "results": [{"property1": "sample1"},{"property2": "sample2"}]}
{"id":"test2", "results": [{"property1": "sample3"},{"property2": "sample4"}]}
If I read it as JSON using spark.read.json(filepath), I get:
+-----+--------------------+
|   id|             results|
+-----+--------------------+
|test1|[{sample1, null},...|
+-----+--------------------+
(which is only the first JSON in the concatenated file)
While I'm trying to get:
+-----+---------+---------+
|id   |property1|property2|
+-----+---------+---------+
|test1|sample1  |sample2  |
|test2|sample3  |sample4  |
+-----+---------+---------+
I ended up reading the JSON as text and iterating over each row, treating each one as JSON and unioning the DataFrames:
df = spark.read.text(data[self.files])
dataCollect = df.collect()
i = 0
for row in dataCollect:
    df_row = flatten_json(spark.read.json(spark.sparkContext.parallelize(row)))
    if i == 0:
        df_all = df_row
    else:
        df_all = df_row.unionByName(df_all, allowMissingColumns=True)
    i = i + 1
flatten_json is a helper that automatically flattens the JSON.
I guess there is a better approach; any help would be much appreciated.
Your JSON file is in a format called JSON Lines, or JSONL, which PySpark can handle natively. So use the regular spark.read.json to read it and perform additional transformations to match what you want.
from pyspark.sql import functions as F

df = spark.read.json('yourfile.json or json/directory')

# Explode the array into structs. This will generate lots of nulls.
df = (df.select('id', F.explode('results').alias('results'))
        .select('id', 'results.*'))

# Group them and aggregate to remove the nulls.
df = (df.groupby('id')
        .agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id']))
I think this works fine for a 1000-line JSONL file; however, if you are curious about an alternative solution that doesn't involve generating and removing nulls, please check here: By using PySpark how to parse nested json. In some situations the alternative, which doesn't do the explode, could be more performant.
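For illustration, if the struct positions inside results are fixed (property1 always the first element, property2 the second, as in the sample), a minimal sketch of such a no-explode variant could look like this; the fixed positions and the file name are assumptions:

from pyspark.sql import functions as F

df = spark.read.json('yourfile.json')

# Pick each property straight out of its (assumed fixed) slot in the array,
# avoiding the explode/groupby round trip and its intermediate nulls.
df = df.select(
    'id',
    F.col('results').getItem(0).getField('property1').alias('property1'),
    F.col('results').getItem(1).getField('property2').alias('property2'),
)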

Play JSON Parse and Extract Elements Without a Key Path

I have a JSON that looks like this (yes, the JSON is valid):
[
  2,
  "19223201",
  "BootNotification",
  {
    "reason": "PowerUp",
    "chargingStation": {
      "model": "SingleSocketCharger",
      "vendorName": "VendorX"
    }
  }
]
I'm using the Play framework's JSON library, and I would like to understand how I could parse the third element and extract the BootNotification value as a String.
If it had a key, I could use that key to traverse the JSON and get the corresponding value, but that is not the case here. I also don't have the possibility to load this line by line and infer from line number 3 as in the example above.
Any suggestions on how I could do this?
I think I have found a way after trying all this out in Ammonite. Here is what I could do:
# val input: JsValue = Json.parse("""[2,"12345678","BNR",{"reason":"PowerUp"}]""")
input: JsValue = JsArray(ArrayBuffer(JsNumber(2), JsString("12345678"), JsString("BNR"), JsObject(Map("reason" -> JsString("PowerUp")))))
Parsing the JSON, I get a nice array, and I know that I always expect just 4 elements in it, so explicitly looking up an element by its array index is what I need. So to get the text at position 3 (index 2), I could do the following:
# (input \ 2)
res2: JsLookupResult = JsDefined(JsString("BNR"))
# (input \ 2).toOption
res3: Option[JsValue] = Some(JsString("BNR"))
# (input \ 2).toOption.isDefined
res4: Boolean = true
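For comparison, the same positional lookup works outside Play as well; a quick Python sketch (purely illustrative) to show that index 2 is the only structure needed:

import json

doc = json.loads('[2, "19223201", "BootNotification", {"reason": "PowerUp"}]')
print(doc[2])  # BootNotification - the third element, no key path required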

JSON schema validation does not work for dependencies?

As far as I understand, using:
dependencies:
  cake:
    - eggs
    - flour
in a JSON schema validation (I actually write the schema in YAML, but it should not really matter) should forbid the presence of cake without entries for eggs and flour. So this should be rejected:
receip:
  cake: crepes
while this should be accepted:
receip:
  eggs: 3 eggs, given by farmer
  flour: 500g
  cake: crepes
Unfortunately, both cases are accepted. Any idea what I did wrong? I also tried to add another level of required as proposed in jsonSchema attribute conditionally required, but the problem is the same.
MWE
In case it helps, here are all the files that I'm using.
debug_schema.yml:
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  receip:
    type: object
    properties:
      eggs:
        type: string
      flour:
        type: string
      cake:
        type: string
    dependencies:
      cake:
        - eggs
        - flour
b.yml:
receip:
  ## I'd like to forbid a cake without eggs and flour
  ## Should not validate until I uncomment:
  # eggs: 3 eggs, given by farmer
  # flour: 500g
  cake: crepes
validate_yaml.py:
#!/usr/bin/env python3
import yaml
from jsonschema import validate, ValidationError
import argparse

# Instantiate the parser
parser = argparse.ArgumentParser(description='This tool not only checks if the YAML file is correct in terms'
                                 + ' of syntax, but it also checks if the structure of the YAML file follows'
                                 + ' the expectations of this project (e.g. that it contains a list of articles,'
                                 + ' with each article having a title etc…).')
# Optional arguments
parser.add_argument('--schema', type=str, default="schema.yml",
                    help='Path to the schema')
parser.add_argument('--yaml', type=str, default="science_zoo.yml",
                    help='Path to the YAML file to verify')
args = parser.parse_args()

class YAMLNotUniqueKey(Exception):
    pass

# By default, PyYAML does not raise an error when two keys are identical, which is an issue here;
# see https://github.com/yaml/pyyaml/issues/165#issuecomment-430074049
class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = set()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in mapping:
                raise YAMLNotUniqueKey(f"Duplicate key {key!r} found.")
            mapping.add(key)
        return super().construct_mapping(node, deep)

with open(args.schema, "r") as schema_stream:
    try:
        schema = yaml.load(schema_stream, UniqueKeyLoader)
        with open(args.yaml, "r") as yaml_stream:
            try:
                validate(yaml.load(yaml_stream, UniqueKeyLoader), schema)
                print("The YAML file is correct and follows the schema, congrats! :D")
            except ValidationError as exc:
                print(exc.message)
                print("The problem is located in the JSON at the path {}.".format(exc.json_path))
            except YAMLNotUniqueKey as exc:
                print("ERROR: {}".format(exc))
                print("Rename in the YAML one of the two instances of this key to ensure all keys are unique.")
            except yaml.YAMLError as exc:
                print("Errors when trying to load the YAML file:")
                print(exc)
    except YAMLNotUniqueKey as exc:
        print("ERROR: {}".format(exc))
        print("Rename in the schema one of the two instances of this key to ensure all keys are unique.")
    except yaml.YAMLError as exc:
        print("Errors when trying to load the schema file:")
        print(exc)
Command:
./validate_yaml.py --yaml b.yml --schema debug_schema.yml
Oh, so it seems that this keyword has been deprecated, and split into dependentRequired (see e.g. this example) and dependentSchemas (see e.g. this example). Just using dependentRequired solves the issue:
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  receip:
    type: object
    properties:
      eggs:
        type: string
      flour:
        type: string
      cake:
        type: string
    dependentRequired:
      cake:
        - eggs
        - flour
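As a quick sanity check, a minimal sketch using Python's jsonschema (assuming a version recent enough to understand the 2020-12 draft) shows the cake-without-eggs case now being rejected:

from jsonschema import validate, ValidationError

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "receip": {
            "type": "object",
            "dependentRequired": {"cake": ["eggs", "flour"]},
        }
    },
}

try:
    validate({"receip": {"cake": "crepes"}}, schema)
    print("accepted")
except ValidationError as exc:
    print(exc.message)  # rejected; the message names the missing dependency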

Deserialize JSON data using regular expressions

I'm facing an issue while fetching keys and values from the data using regular expressions when the JSON contains \ and ".
{
  "KeyOne": "Value One",
  "KeyTwo": "Value \\ two",
  "KeyThree": "Value \" Three",
  "KeyFour": "ValueFour\\"
}
This is sample data; from it I want to read the keys and values. How can I achieve this with regular expressions?
Note: I'm deserializing this JSON data on the server side (SAP ABAP).
On earlier releases (below 7.2, from memory) you can use the class /ui2/cl_json.
If on 7.3 or later, use the kernel sXML reader/writer, which supports JSON and is orders of magnitude faster than /ui2/cl_json.
You can use the identity-transformation approach where the source structure is known and you can create that structure in ABAP, or it already has an ABAP equivalent defined. Otherwise, just traverse the JSON document.
The example string was easily parsed.
EDIT: Add sample code
REPORT zjsondemo.

CLASS lcl DEFINITION CREATE PUBLIC.
  PUBLIC SECTION.
    METHODS json_stru_known.
    METHODS json_stru_traverse.
ENDCLASS.

CLASS lcl IMPLEMENTATION.
  METHOD json_stru_known.
    DATA l_src_json TYPE string.
    DATA l_mara     TYPE mara.

    WRITE: / 'DEMO 1 Known structure Identity transformation'.
    l_src_json = `{"MARA":{"MATNR":"012345678", "MATKL": "DUMMY" }}`.
    WRITE: / 'Convert to MARA -> ', l_src_json.
    CALL TRANSFORMATION id SOURCE XML l_src_json
                           RESULT mara = l_mara.
    WRITE: / 'MARA - MATNR', l_mara-matnr,
           / '       MATKL', l_mara-matkl.

    TYPES:
      BEGIN OF lty_foo_bar,
        keyone   TYPE string,
        keytwo   TYPE string,
        keythree TYPE string,
        keyfour  TYPE string,
      END OF lty_foo_bar.

    DATA:
      lv_json_string TYPE string,
      ls_data        TYPE lty_foo_bar.

    " In this example we use upper-case attribute names because we map to an
    " SAP target structure which has upper-case names. If you need lower-case
    " variables then you cannot map straight to an SAP type; you need to use
    " the traverse technique instead. See example 2.
    lv_json_string = |\{| &&
      |"KEYONE":"Value One",| &&
      |"KEYTWO": "Value \\\\ two", | &&
      |"KEYTHREE": "Value \\" Three", | &&
      |"KEYFOUR": "ValueFour\\\\" | &&
      |\}|.
    lv_json_string = `{"JUNK":` && lv_json_string && `}`.
    CALL TRANSFORMATION id SOURCE XML lv_json_string
                           RESULT junk = ls_data.
    WRITE: / ls_data-keyone, ls_data-keytwo, ls_data-keythree, ls_data-keyfour.
  ENDMETHOD.

  METHOD json_stru_traverse.
    DATA l_src_json TYPE string.
    DATA lo_node    TYPE REF TO if_sxml_node.
    DATA: lif_element       TYPE REF TO if_sxml_open_element,
          lif_element_close TYPE REF TO if_sxml_close_element,
          lif_value_node    TYPE REF TO if_sxml_value,
          l_val             TYPE string,
          l_attr            TYPE if_sxml_attribute=>attributes,
          l_att_val         TYPE string.
    FIELD-SYMBOLS: <attr> LIKE LINE OF l_attr.

    WRITE: / 'DEMO 2 Traverse any JSON document'.
    l_src_json = `{"MATNR":"012345678", "MATKL": "DUMMY", "SOMENODE": "With this value" }`.
    WRITE: / 'Parse as JSON with 3 nodes -> ', l_src_json.

    DATA(reader) = cl_sxml_string_reader=>create( cl_abap_codepage=>convert_to( l_src_json ) ).

    lo_node = reader->read_next_node( ). " the opening {
    IF lo_node IS INITIAL.
      EXIT.
    ENDIF.

    DO 3 TIMES.
      lif_element ?= reader->read_next_node( ).
      l_attr = lif_element->get_attributes( ).
      LOOP AT l_attr ASSIGNING <attr>.
        l_att_val = <attr>->get_value( ).
        WRITE: / 'Attribute:', l_att_val.
      ENDLOOP.
      lif_value_node ?= reader->read_next_node( ).
      l_val = lif_value_node->get_value( ).
      WRITE: '=>', l_val.
      lif_element_close ?= reader->read_next_node( ).
    ENDDO.
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  DATA lo_lcl TYPE REF TO lcl.
  CREATE OBJECT lo_lcl.
  lo_lcl->json_stru_known( ).
  lo_lcl->json_stru_traverse( ).
The SAP system is supplied with many example programs: search for demo*json. See also the SAP documentation on JSON parsing.
As @mrzasa and @joanis said in their comments: do not use RegEx to parse JSON!
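To see why, a small Python sketch (illustrative only, using the question's sample data) shows a naive key/value regex tripping over the escaped quote while a real JSON parser handles it:

import json
import re

raw = '{"KeyOne":"Value One","KeyTwo":"Value \\\\ two","KeyThree":"Value \\" Three","KeyFour":"ValueFour\\\\"}'

# A real parser resolves the escapes correctly:
print(json.loads(raw))
# {'KeyOne': 'Value One', 'KeyTwo': 'Value \\ two', 'KeyThree': 'Value " Three', 'KeyFour': 'ValueFour\\'}

# A naive key/value regex cannot tell an escaped quote from a closing quote,
# so KeyThree's value is cut off at the \" :
print(re.findall(r'"([^"]+)"\s*:\s*"([^"]*)"', raw))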
For small objects or when performance is not a concern, you can use /ui2/cl_json:
TYPES:
  BEGIN OF lty_foo_bar,
    KeyOne   TYPE string,
    KeyTwo   TYPE string,
    KeyThree TYPE string,
    KeyFour  TYPE string,
  END OF lty_foo_bar.

DATA:
  lv_json_string TYPE string,
  ls_data        TYPE lty_foo_bar.

lv_json_string = |\{| &&
  |"KeyOne":"Value One",| &&
  |"KeyTwo": "Value \\\\ two", | &&
  |"KeyThree": "Value \\" Three", | &&
  |"KeyFour": "ValueFour\\\\" | &&
  |\}|.

/ui2/cl_json=>deserialize(
  EXPORTING
    json = lv_json_string
  CHANGING
    data = ls_data ).
ls_data-KeyOne contains 'Value One' and so on.
For larger objects and/or better performance, check the sXML approach from @phil soady's answer above. The correct handling of upper- and lower-case letters still causes headaches in ABAP anyway.

How to parse a string into key-value pairs using regex?

What is the best way to parse the string into key-value pairs using regex?
Sample input:
application="fre" category="MessagingEvent" messagingEventType="MessageReceived"
Expected output:
application "fre"
Category "MessagingEvent"
messagingEventType "MessageReceived"
We have already tried the following regex and it works:
application=(?<application>(...)*) *category=(?<Category>\S*) *messagingEventType=(?<messagingEventType>\S*)
But we want a generic regex that will parse the sample input into the expected output as key-value pairs. Any idea or solution would be helpful.
input = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
puts input.
  scan(/(\w+)="([^"]+)"/).                      # scan for KV pairs
  map { |k, v| %Q|#{k.ljust(30, ' ')}"#{v}"| }. # pad keys as requested
  join($/)                                      # join with platform-dependent line delimiters
#⇒ application                   "fre"
#  category                      "MessagingEvent"
#  messagingEventType            "MessageReceived"
Instead of using regex, it can be done by splitting the string and storing it in a hash, like below:
input = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
res = {}
input.split.each { |str| a,b = str.split('='); res[a] = b}
puts res
==> {"application"=>"\"fre\"", "category"=>"\"MessagingEvent\"", "messagingEventType"=>"\"MessageReceived\""}
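For what it's worth, the same regex-based extraction translates directly to other languages; here is a Python sketch (which, unlike the split approach above, strips the quotes from the values):

import re

line = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
pairs = dict(re.findall(r'(\w+)="([^"]*)"', line))
print(pairs)
# {'application': 'fre', 'category': 'MessagingEvent', 'messagingEventType': 'MessageReceived'}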