Jackson: Stream JSON array via databinding? - json

I have a JSON file with a structure like this:
[
obj1,
obj2,
...
objN
]
All of the sub-objects are entirely self-contained, i.e. there are no cross references between them. The problem is that the file as a whole is huge ( >100k entries in the root array).
Is there any way in Jackson to stream in the contents of the root array via databinding, such that the root array never resides fully in main memory? I would like to avoid the low level JsonGenerator/JsonParser API.

Yes. Check out ObjectReader (constructed using various methods in ObjectMapper, like .readerFor(ElementType.class)), and then its readValues() method, which returns MappingIterator<ElementType> (for whatever type you use). This method will only bind one item at a time.
If the values are in root-level array, this should work as is. If they were somewhere deeper in JSON structure, you would have to construct JsonParser first, then iterate (with nextToken()) to the first value, but after that you could still create MappingIterator for efficient item-by-item binding.

Related

Processing a Kafka message using KSQL that has a field that can be either an ARRAY or a STRUCT

I'm consuming a Kafka topic published by another team (so I have very limited influence over the message format). The message has a field that holds an ARRAY of STRUCTS (an array of objects), but if the array has only one value then it just holds that STRUCT (no array, just an object). I'm trying to transform the message using Confluent KSQL. Unfortunately, I cannot figure out how to do this.
For example:
{ "field": {...} } <-- STRUCT (single element)
{ "field": [ {...}, {...} ] } <-- ARRAY (multiple elements)
{ "field": [ {...}, {...}, {...} ] <-- ARRAY (multiple elements)
If I configure the field in my message schema as a STRUCT then all messages with multiple values error. If I configure the field in my message schema as an ARRAY then all messages with a single value error. I could create two streams and merge them, but then my error log will be polluted with irrelevant errors.
I've tried capturing this field as a STRING/VARCHAR which is fine and I can split the messages into two streams. If I do this, then I can parse the single value messages and extract the data I need, but I cannot figure out how to parse the multivalue messages. None of the KSQL JSON functions seem to allow parsing of JSON Arrays out of JSON Strings. I can use EXTRACTJSONFIELD() to extract a particular element of the array, but not all of the elements.
Am I missing something? Is there any way to handle this reasonably?
In my experience, this is one use-case where KSQL just doesn't work. You would need to use Kafka Streams or a plain consumer to deserialize the event as a generic JSON type, then check object.get("field").isArray() or isObject(), and handle accordingly.
Even if you used a UDF in KSQL, the STREAM definition would be required to know ahead of time if you have field ARRAY<?> or field STRUCT<...>
I finally solved this in a roundabout way...
First, I created an initial stream reading the transaction as a stream of bytes using KAFKA format instead of JSON format. This allows me to put a filter conditional filter on the data so I can fork the stream into a version for the single (STRUCT) variation and a version for the multiple (ARRAY) variation.
The initial stream looks like:
CREATE OR REPLACE STREAM `my-topic-stream` (
id STRING KEY,
data BYTES
)
WITH (
KAFKA_TOPIC='my-topic',
VALUE_FORMAT='KAFKA'
);
Forking that stream looks like this with a second for a multiple version filtering for IS NOT NULL:
CREATE OR REPLACE STREAM `my-single-stream`
WITH (
kafka_topic='my-single-topic'
) AS
SELECT *
FROM `my-topic-stream`
WHERE JSON_ARRAY_LENGTH(EXTRACTJSONFIELD(FROM_BYTES(data, 'utf8'), '$.field')) IS NULL;
At this point I can create a schema for both variations, explode field, and merge the two streams back together. I don't know if this can be refined to be more efficient, but this successfully processes the transactions as I wanted.

Understanding NewtonSoft in PowerShell

I making a foray into the world of JSON parsing and NewtonSoft and I'm confused, to say the least.
Take the below PowerShell script:
$json = #"
{
"Array1": [
"I am string 1 from array1",
"I am string 2 from array1"
],
"Array2": [
{
"Array2Object1Str1": "Object in list, string 1",
"Array2Object1Str2": "Object in list, string 2"
}
]
}
"#
#The newtonSoft way
$nsObj = [Newtonsoft.Json.JsonConvert]::DeserializeObject($json, [Newtonsoft.Json.Linq.JObject])
$nsObj.GetType().fullname #Type = Newtonsoft.Json.Linq.JObject
$nsObj[0] #Returns nothing. Why?
$nsObj.Array1 #Again nothing. Maybe because it contains no key:value pairs?
$nsObj.Array2 #This does return, maybe because has object with kv pairs
$nsObj.Array2[0].Array2Object1Str1 #Returns nothing. Why? but...
$nsObj.Array2[0].Array2Object1Str1.ToString() #Cool. I get the string this way.
$nsObj.Array2[0] #1st object has a Path property of "Array2[0].Array2Object1Str1" Great!
foreach( $o in $nsObj.Array2[0].GetEnumerator() ){
"Path is: $($o.Path)"
"Parent is: $($o.Parent)"
} #??? Why can't I see the Path property like when just output $nsObj.Array2[0] ???
#How can I find out what the root parent (Array2) is for a property? Is property even the right word?
I'd like to be able to find the name of the root parent for any given position. So above, I'd like to know that the item I'm looking at (Array2Object1Str1) belongs to the Array2 root parent.
I think I'm not understanding some fundamentals here. Is it possible to determine the root parent? Also, any help in understanding my comments in the script would be great. Namely why I can't return things like path or parent, but can see it when I debug in VSCode.
dbc's answer contains helpful background information, and makes it clear that calling the NewtonSoft Json.NET library from PowerShell is cumbersome.
Given PowerShell's built-in support for JSON parsing - via the ConvertFrom-Json and ConvertTo-Json cmdlets - there is usually no reason to resort to third-party libraries (directly[1]), except in the following cases:
When performance is paramount.
When the limitations of PowerShell's JSON parsing must be overcome (lack of support for empty key names and keys that differ in letter case only).
When you need to work with the Json.NET types and their methods rather than with the method-less "property-bag" [pscustomobject] instances ConvertFrom-Json constructs.
While working with NewtonSoft's Json.NET directly in PowerShell is awkward, it is manageable, if you observe a few rules:
Lack of visible output doesn't necessarily mean that there isn't any output at all:
Due to a bug in PowerShell (as of v7.0.0-preview.4), [JValue] instances and [JProperty] instances containing them produce no visible output by default; access their (strongly typed) .Value property instead (e.g., $nsObj.Array1[0].Value or $nsProp.Value.Value (sic))
To output the string representation of a [JObject] / [JArray] / [JProperty] / [JValue] instance, do not rely on output as-is (e.g, $nsObj), use explicit stringification with .ToString() (e.g., $nsObj.ToString()); while string interpolation (e.g., "$nsObj") does generally work, it doesn't with [JValue] instances, due to the above-mentioned bug.
[JObject] and [JArray] objects by default show a list of their elements' instance properties (implied Format-List applied to the enumeration of the objects); you can use the Format-* cmdlets to shape output; e.g., $nsObj | Format-Table Path, Type.
Due to another bug (which may have the same root cause), as of PowerShell Core 7.0.0-preview.4, default output for [JObject] instances is actually broken in cases where the input JSON contains an array (prints error format-default : Target type System.Collections.IEnumerator is not a value type or a non-abstract class. (Parameter 'targetType')).
To numerically index into a [JObject] instance, i.e. to access properties by index rather than by name, use the following idiom: #($nsObj)[<n>], where <n> is the numerical index of interest.
$nsObj[<n>] actually should work, because, unlike C#, PowerShell exposes members implemented via interfaces as directly callable type members, so the numeric indexer that JObject implements via the IList<JToken> interface should be accessible, but isn't, presumably due to this bug (as of PowerShell Core 7.0.0-preview.4).
The workaround based on #(...), PowerShell's array-subexpression operator, forces enumeration of a [JObject] instance to yield an array of its [JProperty] members, which can then be accessed by index; note that this approach is simple, but not efficient, because enumeration and construction of an aux. array occurs; however, given that a single JSON object (as opposed to an array) typically doesn't have large numbers of properties, this is unlikely to matter in practice.
A reflection-based solution that accesses the IList<JToken> interface's numeric indexer is possible, but may even be slower.
Note that additional .Value-based access may again be needed to print the result (or to extract the strongly typed property value).
Generally, do not use the .GetEnumerator() method; [JObject] and [JArray] instances are directly enumerable.
Keep in mind that PowerShell may automatically enumerate such instances in contexts where you don't expect it, notably in the pipeline; notably, when you send a [JObject] to the pipeline, it is its constituent [JProperty]s that are sent instead, individually.
Use something like #($nsObj.Array1).Value to extract the values of an array of primitive JSON values (strings, numbers, ...) - i.e, [JValue] instances - as an array.
The following demonstrates these techniques in context:
$json = #"
{
"Array1": [
"I am string 1 from array1",
"I am string 2 from array1",
],
"Array2": [
{
"Array2Object1Str1": "Object in list, string 1",
"Array2Object1Str2": "Object in list, string 2"
}
]
}
"#
# Deserialize the JSON text into a hierarchy of nested objects.
# Note: You can omit the target type to let Newtonsoft.Json infer a suitable one.
$nsObj = [Newtonsoft.Json.JsonConvert]::DeserializeObject($json)
# Alternatively, you could more simply use:
# $nsObj = [Newtonsoft.Json.Linq.JObject]::Parse($json)
# Access the 1st property *as a whole* by *index* (index 0).
#($nsObj)[0].ToString()
# Ditto, with (the typically used) access by property *name*.
$nsObj.Array1.ToString()
# Access a property *value* by name.
$nsObj.Array1[0].Value
# Get an *array* of the *values* in .Array1.
# Note: This assumes that the array elements are JSON primitives ([JValue] instances.
#($nsObj.Array1).Value
# Access a property value of the object contained in .Array2's first element by name:
$nsObj.Array2[0].Array2Object1Str1.Value
# Enumerate the properties of the object contained in .Array2's first element
# Do NOT use .GetEnumerator() here - enumerate the array *itself*
foreach($o in $nsObj.Array2[0]){
"Path is: $($o.Path)"
"Parent is: $($o.Parent.ToString())"
}
[1] PowerShell Core - but not Windows PowerShell - currently (v7) actually uses NewtonSoft's Json.NET behind the scenes.
You have a few separate questions here:
$nsObj[0] #Returns nothing. Why?
This is because nsObj corresponds to a JSON object, and, as explained in this answer to How to get first key from JObject?, JObject does not directly support accessing properties by integer index (rather than property name).
JObject does, however, implement IList<JToken> explicitly so if you could upcast nsObj to such a list you could access properties by index -- but apparently it's not straightforward in PowerShell to call an explicitly implemented method. As explained in the answers to How can I call explicitly implemented interface method from PowerShell? it's necessary to do this via reflection.
First, define the following function:
Function ChildAt([Newtonsoft.Json.Linq.JContainer]$arg1, [int]$arg2)
{
$property = [System.Collections.Generic.IList[Newtonsoft.Json.Linq.JToken]].GetProperty("Item")
$item = $property.GetValue($nsObj, #([System.Object]$arg2))
return $item
}
And then you can do:
$firstItem = ChildAt $nsObj 0
Try it online here.
#??? Why can't I see the Path property like when just output $nsObj.Array2[0] ???
The problem here is that JObject.GetEnumerator() does not return what you think it does. Your code assumes it returns the JToken children of the object, when in fact it is declared as
public IEnumerator<KeyValuePair<string, JToken>> GetEnumerator()
Since KeyValuePair<string, JToken> doesn't have the properties Path or Parent your output method fails.
JObject does implement interfaces like IList<JToken> and IEnumerable<JToken>, but it does so explicitly, and as mentioned above calling the relevant GetEnumerator() methods would require reflection.
Instead, use the base class method JContainer.Children(). This method works for both JArray and JObject and returns the immediate children in document order:
foreach( $o in $nsObj.Array2[0].Children() ){
"Path is: $($o.Path)"
"Parent is: $($o.Parent)"
}
Try it online here.
$nsObj.Array1 #Again nothing. Maybe because it contains no key:value pairs?
Actually this does return the value of Array1, if I do
$nsObj.Array1.ToString()
the JSON corresponding to the value of Array1 is displayed. The real issue seems to be that PowerShell doesn't know how to automatically print a JArray with JValue contents -- or even a simple, standalone JValue. If I do:
$jvalue = New-Object Newtonsoft.Json.Linq.JValue 'my jvalue value'
'$jvalue' #Nothing output
$jvalue
'$jvalue.ToString()' #my jvalue value
$jvalue.ToString()
Then the output is:
$jvalue
$jvalue.ToString()
my jvalue value
Try it online here and, relatedly, here.
Thus the lesson is: when printing a JToken hierarchy in PowerShell, always use ToString().
As to why printing a JObject produces some output while printing a JArray does not, I can only speculate. JToken implements the interface IDynamicMetaObjectProvider which is also implemented by PSObject; possibly something about the details of how this is implemented for JObject but not JValue or JArray are compatible with PowerShell's information printing code.

Deserialize an anonymous JSON array?

I got an anonymous array which I want to deserialize, here the example of the first array object
[
{ "time":"08:55:54",
"date":"2016-05-27",
"timestamp":1464332154807,
"level":3,
"message":"registerResourcePath ('', '/sap/bc/ui5_ui5/ui2/ushell/resources/')",
"details":"","component":"sap.ui.ModuleSystem"},
{"time":"08:55:54","date":"2016-05-27","timestamp":1464332154808,"level":3,"message":"URL prefixes set to:","details":"","component":"sap.ui.ModuleSystem"},
{"time":"08:55:54","date":"2016-05-27","timestamp":1464332154808,"level":3,"message":" (default) : /sap/bc/ui5_ui5/ui2/ushell/resources/","details":"","component":"sap.ui.ModuleSystem"}
]
I tried deserializing using CL_TREX_JSON_SERIALIZER, but it is corrupt and does not work with my JSON, here is why
Then I tried /UI2/CL_JSON, but it needs a "structure" that perfectly fits the object given by the JSON Object. "Structure" means in my case an internal table of objects with the attributes time, date, timestamp, level, messageanddetails. And there was the problem: it does not properly handle references and uses class description to describe the field assigned to the field-symbol. Since I can not have a list of objects but only a list of references to objects that solution also doesn't works.
As a third attempt I tried with the CALL TRANSFORMATION as described by Horst Keller, but with this method I was not able to read in an anonymous array, and here is why
My major points:
I do not want to change the JSON, since that is what I get from sap.ui.log
I prefere to use built-in functionality and not a thirdparty framework
Your problem comes out not from the anonymity of array, but from the awkwardness of SAP JSON (De)serializer, which doesn't respect double quotes, which enclose JSON attributes. The issue is thoroughly described in this answer.
If you don't want to change your JSON on-the-fly, the only way you have is to change CL_TREX_JSON_DESERIALIZER class like this.
/UI5/CL_JSON_PARSER parses JSONs with unknown format.
Note that it's got "for internal use" written on it so many times that you probably should take it seriously and clone its code to fixate it.

Return empty array when transforming xml data to json

I am returning some xml structure as json using the built-in MarkLogic json module. For the most part it does what I expect. However, when an element marked as an array is empty, it returns an empty string instead of an empty array. Here is an example:
xquery version "1.0-ml";
import module namespace json = "http://marklogic.com/xdmp/json"
at "/MarkLogic/json/json.xqy";
let $config := json:config("custom")
return (
map:put( $config, "array-element-names", ("item") ),
json:transform-to-json(<result>
<item>21</item>
<item>22</item>
<item>23</item>
</result>, $config),
json:transform-to-json(<result></result>, $config))
Result:
{"result":{"item":["21", "22", "23"]}}
{"result":""}
I would expect an empty array if there were no items matching in the array-element-name called "item". i.e.
{"result":{"item":[]}}
Is there some way to configure it so it knows the element is required ?
No - it will not create anything that is not there. In your case, what if the XML was more complex. There is no context of 'where' such an element might live - so it could not create it even if it wanted to.
Solution is to repair the content if needed by adding one element - or transforming it into the json-basic namespace - where those elements live inside of of an element noted as an array (which can be empty) - or third, use an XSD to hint to the processor what to do . But that would still need a containing element for the 'array' - and then the items would be minOccurance=0. But if this is the case, then repair and transform into the json/basic namespace is probably nice and simple for your example.

Use JSON.stringify but the data still have array, what does it mean?

I have a data which are object array. It contains object arrays in a tree structure. I use JSON.stringify(myArray) but the data still contain array because I see [] inside the converted data.
In my case, I want all the data to be converted into json object not array regarding I need to used the data on TreeTable of SAPUI5.
Maybe I misunderstand. Please help me clear.
This is the example of the data that I got from JSON.stringify.
[{"value":{"Id":"00145E5BB2641EE284F811A7907717A3",
"Text":"BI-RA Reporting, analysis, and dashboards",
"Parent":"00145E5BB2641EE284F811A79076F7A3","Type":"BMF"},
"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907737A3",
"Text":"WebIntelligence_4.1","Parent":"00145E5BB2641EE284F811A7907717A3",
"Type":"TWB"},"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907757A3",
"Text":"Functional Areas","Parent":"00145E5BB2641EE284F811A7907737A3","Type":"TWB"},
"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907777A3",
"Text":"CHARTING","Parent":"00145E5BB2641EE284F811A7907757A3","Type":"TWB"},
"children":[{"value":{"Id":"001999E0B9081EE28AB706BE26631E93",
"Text":"Drill","Parent":"00145E5BB2641EE284F811A7907777A3","Type":"TWB"},
"children":[{"value":{"Id":"001999E0B9081EE28AB706BE26633E93",
"Text":"[AUTO][ACCEPT] Drill on charts DHTML","Parent":"001999E0B9081EE28AB706BE26631E93",
"Type":"TWB","Ref":"UT_WEBI_CHARTS_DRILL_HTML"}},{"value":{"Id":"001999E0B9081EE28AB706BE26635E93",
"Text":"[AUTO][ACCEPT] Drill on charts JAVA","Parent":"001999E0B9081EE28AB706BE26631E93",
"Type":"TWB","Ref":"UT_WEBI_CHARTS_DRILL_JAVA"}}]},...
The output that I want shouldn't be array of object but should be something like...
{{"value":{
"Id":"00145E5BB2641EE284F811A7907717A3",
"Text":"BI-RA Reporting, analysis, and dashboards",
"Parent":"00145E5BB2641EE284F811A79076F7A3","Type":"BMF"},
"children":{
{"value":{
"Id":"00145E5BB2641EE284F811A7907737A3",
"Text":"WebIntelligence_4.1",
"Parent":"00145E5BB2641EE284F811A7907717A3",
"Type":"TWB"},
"children":{
{"value":{
"Id":"00145E5BB2641EE284F811A7907757A3",
"Text":"Functional Areas",
"Parent":"00145E5BB2641EE284F811A7907737A3",
"Type":"TWB"},...
JSON.stringify merely converts JavaScript data structures to a JSON-formatted string for consumption by other parsers (including JSON.parse). If you want it to stringify to a different value, you must change the source data structures first.
However, it seems that this can't be represented as anything other than an array because you have duplicate keys (i.e. value appears more than once). That would not be valid for a JavaScript object or a JSON representation of such.
I think what you want is
JSON.stringify(data[0]);
or perhaps
JSON.stringify(data[0].value);
where data is the object you passed in the question