Efficient way to parse a file with different JSON schemas in Spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (the schema for a given type is known and consistent) in Spark, in order to split it by "type" and store it in Parquet:
{"type":1, "data":{"data_of_type1" : 1}}
{"type":2, "data":{"data_of_type2" : "can be any type"}}
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}
I also want to reduce the I/O because I am dealing with huge volumes, so I don't want to do a first split (by type) and then process each type independently...
Current idea (not working):
Load the JSON and parse only the type ("data" is kept as a string)
Attach to each row the corresponding schema (a DDL string in a new column)
Try to parse "data" with the DDL from the previous column (using from_json)
=> This throws the error: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
Do you have any idea if this is possible?
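One possible way around this, sketched below under a few assumptions (the input/output paths and the per-type DDL strings are made up from the sample records): since from_json only accepts a literal schema, read the file once with "data" kept as a raw string, cache it, and then apply the known schema per type before writing each type to its own Parquet location.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType
// Read once, keeping "data" as a raw JSON string so the mixed schemas don't clash
val raw = spark.read
  .schema(StructType.fromDDL("type INT, data STRING"))
  .json("/input/events.json") // hypothetical input path
  .cache()                    // avoid re-reading the source for every type
// One known DDL schema per type (field names taken from the sample records)
val schemasByType = Map(
  1 -> "data_of_type1 INT",
  2 -> "data_of_type2 STRING",
  3 -> "data_of_type3 STRING, anotherone INT"
)
// from_json needs a literal schema, so apply it per type rather than per row
schemasByType.foreach { case (t, ddl) =>
  raw.filter(col("type") === t)
    .withColumn("data", from_json(col("data"), StructType.fromDDL(ddl)))
    .write
    .mode("overwrite")
    .parquet(s"/output/type=$t") // hypothetical output location
}
The cache keeps Spark from re-scanning the source file for every type; each type is still a separate write, but the raw JSON is only read from disk once.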

Related

Processing a Kafka message using KSQL that has a field that can be either an ARRAY or a STRUCT

I'm consuming a Kafka topic published by another team (so I have very limited influence over the message format). The message has a field that holds an ARRAY of STRUCTS (an array of objects), but if the array has only one value then it just holds that STRUCT (no array, just an object). I'm trying to transform the message using Confluent KSQL. Unfortunately, I cannot figure out how to do this.
For example:
{ "field": {...} } <-- STRUCT (single element)
{ "field": [ {...}, {...} ] } <-- ARRAY (multiple elements)
{ "field": [ {...}, {...}, {...} ] <-- ARRAY (multiple elements)
If I configure the field in my message schema as a STRUCT then all messages with multiple values fail. If I configure the field in my message schema as an ARRAY then all messages with a single value fail. I could create two streams and merge them, but then my error log will be polluted with irrelevant errors.
I've tried capturing this field as a STRING/VARCHAR which is fine and I can split the messages into two streams. If I do this, then I can parse the single value messages and extract the data I need, but I cannot figure out how to parse the multivalue messages. None of the KSQL JSON functions seem to allow parsing of JSON Arrays out of JSON Strings. I can use EXTRACTJSONFIELD() to extract a particular element of the array, but not all of the elements.
Am I missing something? Is there any way to handle this reasonably?
In my experience, this is one use-case where KSQL just doesn't work. You would need to use Kafka Streams or a plain consumer to deserialize the event as a generic JSON type, then check object.get("field").isArray() or isObject(), and handle accordingly.
Even if you used a UDF in KSQL, the STREAM definition would still need to know ahead of time whether you have field ARRAY<?> or field STRUCT<...>.
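A minimal sketch of that approach with a plain consumer and Jackson (the field name comes from the example above; the extractField helper is hypothetical):
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
val mapper = new ObjectMapper()
// Normalise "field" to a sequence of objects, whether the producer sent a
// single STRUCT or an ARRAY of STRUCTs
def extractField(messageJson: String): Seq[JsonNode] = {
  val field = mapper.readTree(messageJson).get("field")
  if (field.isArray) (0 until field.size()).map(i => field.get(i)) // ARRAY variant
  else Seq(field)                                                  // STRUCT variant
}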
I finally solved this in a roundabout way...
First, I created an initial stream reading the transaction as a stream of bytes using KAFKA format instead of JSON format. This allows me to put a conditional filter on the data so I can fork the stream into a version for the single (STRUCT) variation and a version for the multiple (ARRAY) variation.
The initial stream looks like:
CREATE OR REPLACE STREAM `my-topic-stream` (
id STRING KEY,
data BYTES
)
WITH (
KAFKA_TOPIC='my-topic',
VALUE_FORMAT='KAFKA'
);
Forking that stream looks like this (the multiple/ARRAY version is a second stream of the same shape, filtering on IS NOT NULL instead):
CREATE OR REPLACE STREAM `my-single-stream`
WITH (
kafka_topic='my-single-topic'
) AS
SELECT *
FROM `my-topic-stream`
WHERE JSON_ARRAY_LENGTH(EXTRACTJSONFIELD(FROM_BYTES(data, 'utf8'), '$.field')) IS NULL;
At this point I can create a schema for both variations, explode the field, and merge the two streams back together. I don't know if this can be refined to be more efficient, but it successfully processes the transactions as I wanted.

Clickhouse/Kafka: reading a JSON Object type into a field

I have this kind of data in a Kafka Topic:
{..., fields: { "a": "aval", "b": "bval" } }
If I create a Kafka Engine table, I get an error when using a field definition like this:
fields String
because it (correctly) doesn't recognize it as a String:
2018.07.09 17:09:54.362061 [ 27 ] <Error> void DB::StorageKafka::streamThread(): Code: 26, e.displayText() = DB::Exception: Cannot parse JSON string: expected opening quote: (while read the value of key fields): (at row 1)
As ClickHouse does not currently have a Map or JSONObject type, what would be the best way to work around this, given that I don't know the names of the inner fields in advance ("a" or "b" in the example, so I cannot see Nested structures helping)?
Apparently, at the moment ClickHouse does not support complex JSON parsing.
From this answer in ClickHouse Github:
ClickHouse uses a quick and dirty JSON parser, which does not know how to read complex deep structures. So it can't skip that field as it does not know where that nested structure ends.
Sorry. :/
So you should preprocess your JSON with some external tools, or you can contribute to ClickHouse and improve the JSON parser.
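As a rough illustration of that kind of preprocessing (only a sketch: the stringifyFields helper and the idea of rewriting messages before they reach the Kafka Engine table are assumptions, not ClickHouse features), the nested object can be turned into an escaped JSON string so the column can be declared as fields String:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.node.ObjectNode
val mapper = new ObjectMapper()
// Rewrite the nested "fields" object as an escaped JSON string
def stringifyFields(messageJson: String): String = mapper.readTree(messageJson) match {
  case obj: ObjectNode if obj.has("fields") && obj.get("fields").isObject =>
    obj.put("fields", mapper.writeValueAsString(obj.get("fields")))
    mapper.writeValueAsString(obj)
  case _ => messageJson
}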

How can I validate JSON schema in Spark 2.X?

Using Spark streaming (written in Scala) to read messages from Kafka.
The messages are all Strings in JSON format.
I define the expected schema in a local variable expectedSchema,
then parse the Strings in the RDD as JSON:
spark.sqlContext.read.schema(schema).json(rdd.toDS())
The problem: Spark will process all the records/rows as long as they have some of the fields I try to read, even if the actual JSON format (i.e. schema) of the input row (String) doesn't match my expectedSchema.
Assume the expected schema looks like this (in JSON): {"a": 1, "b": 2, "c": 3}
and input row looks like this: {"a": 1, "c": 3}
Spark will process the input without failing.
I tried using the solution described here: How do I apply schema with nullable = false to json reading
but assert(readJson.schema == expectedSchema) never fails, even when I deliberately send input rows with the wrong JSON schema.
Is there a way for me to verify that the actual schema of a given input row matches my expected schema?
Is there a way for me to insert a null value to "fill" fields missing from a "corrupt" schema row?
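One possible workaround, sketched below (not a definitive answer): since Spark fills missing fields with null rather than failing, parse with the expected schema and then split rows on whether any expected field came back null. ds stands for the Dataset[String] built from the RDD in the question.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
// Expected schema from the question: all three fields should be present
val expectedSchema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", IntegerType),
  StructField("c", IntegerType)
))
val parsed = spark.read.schema(expectedSchema).json(ds) // ds: Dataset[String]
// Rows like {"a": 1, "c": 3} end up with b = null, so flag them explicitly
val missingField = expectedSchema.fieldNames.map(n => col(n).isNull).reduce(_ || _)
val invalidRows  = parsed.filter(missingField)
val validRows    = parsed.filter(!missingField)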

How to insert multiple JSON records into HBase using NiFi?

Please tell me how to insert multiple JSON records into HBase using NiFi.
The attached screenshots show the PutHBaseJson and PutHBaseCell output when we try to insert more than one id/object.
This is the file which I have tried with PutHBaseCell:
{"id" : "1334134","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"},
{"id" : "412","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}
[Image of the PutHBaseCell processor configuration]
PutHBaseJson expects each flow file to contain one JSON document, which becomes a row in HBase. The row id can be specified in the processor using expression language, or it can come from one of the fields in the JSON. The other field/value pairs in the JSON become the columns/values of the row in HBase.
If you want to use PutHBaseJson, you just need to split up your data in NiFi before it reaches this processor. There are many ways to do this: SplitJson, SplitText, SplitContent, ExecuteScript, or a custom processor.
Alternatively, there is a PutHBaseRecord processor which can use a record reader to read records from a flow file and send them all to HBase. In your case you would need a JSON record reader. The data also has to be in a format that is understood by the record reader, and I believe for JSON it would need to be an array of documents.
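If you go the PutHBaseRecord route with a JSON record reader, the sample data would (under that assumption) need to be reshaped into a single JSON array rather than comma-separated documents, e.g.:
[
  {"id" : "1334134", "name" : "Apparel Fabric", "path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"},
  {"id" : "412", "name" : "Apparel Fabric", "path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}
]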

Avro Schema: force to interpret value (map, array) as string

I want to convert JSON to Avro via NiFi. Unfortunately the JSON has complex types as values that I want to see as a simple string!
JSON:
"FLAGS" : {"FLAG" : ["STORED","ACTIVE"]}
How can I tell AVRO to simply store "{"FLAG" : ["STORED","ACTIVE"]}" or "[1,2,3,"X"]" as a string?
Thank you sincerely!
The JSON to Avro conversion performed in NiFi's ConvertJSONToAvro processor does not really do transformation in the same step. There is a very limited ability to transform based on the Avro schema, mostly omitting input data fields in the output. But it won't coerce a complex structure to a string.
Instead, you should do a JSON-to-JSON transformation first, then convert your summarized JSON to Avro. I think what you are looking for is a structure like this:
{
"FLAGS": "{\"FLAG\":[\"STORED\",\"ACTIVE\"]}"
}
NiFi's JoltTransformJSON and ExecuteScript processors are great for this. If your records are simple enough, maybe even a combination of EvaluateJsonPath (extracting $.FLAGS into a flags attribute) and ReplaceText (rewriting the content as { "FLAGS": "${flags:escapeJson()}" }) would do.