How can I validate a JSON schema in Spark 2.x?

I am using Spark Streaming (written in Scala) to read messages from Kafka.
The messages are all Strings in JSON format.
I define the expected schema in a local variable expectedSchema
and then parse the Strings in the RDD to JSON:
spark.sqlContext.read.schema(expectedSchema).json(rdd.toDS())
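For reference, a fuller sketch of that setup (the field names and types in expectedSchema are illustrative, matching the sample below; rdd is assumed to be the RDD[String] coming out of the Kafka stream):

import org.apache.spark.sql.types._
import spark.implicits._

// Expected shape of every incoming message, e.g. {"a": 1, "b": 2, "c": 3}
val expectedSchema = StructType(Seq(
  StructField("a", IntegerType, nullable = false),
  StructField("b", IntegerType, nullable = false),
  StructField("c", IntegerType, nullable = false)
))

// Parse the JSON strings of the current micro-batch with the expected schema
val parsed = spark.sqlContext.read
  .schema(expectedSchema)
  .json(rdd.toDS())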
The problem: Spark will process all the records/rows as long as they have some of the fields I try to read, even if the actual JSON format (i.e. schema) of the input row (String) doesn't match my expectedSchema.
Assume the expected schema looks like this (in JSON): {"a": 1, "b": 2, "c": 3}
and an input row looks like this: {"a": 1, "c": 3}
Spark will process the input without failing.
I tried using the solution described here: How do I apply schema with nullable = false to json reading
but assert(readJson.schema == expectedSchema) never fails, even when I deliberately send input rows with a wrong JSON schema.
Is there a way for me to verify that the actual schema of a given input row matches my expected schema?
Is there a way for me to insert a null value to "fill" the fields missing from a row with a "corrupt" schema?
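For what it's worth, one sketch of a check along those lines, assuming the expectedSchema and parsed DataFrame above: when a field is missing from the input JSON, Spark fills it with null once the schema is applied (which already covers the second question), so rows whose expected fields came back null can be flagged or filtered. Note this cannot distinguish a missing field from an explicit null value:

import org.apache.spark.sql.functions._

// A row is suspect if any field of the expected schema parsed to null
val missingAnyField = expectedSchema.fieldNames
  .map(col(_).isNull)
  .reduce(_ || _)

val flagged = parsed.withColumn("schema_mismatch", missingAnyField)

val validRows    = flagged.filter(!col("schema_mismatch"))
val mismatchRows = flagged.filter(col("schema_mismatch"))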

Related

Efficient way to parse a file with different JSON schemas in Spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (though the schema of each type is known and consistent) in Spark, in order to split it by "type" and store it in Parquet:
{"type":1, "data":{"data_of_type1" : 1}}
{"type":2, "data":{"data_of_type2" : "can be any type"}}
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}
I also want to reduce the I/O because I am dealing with huge volumes, so I don't want to do a first split (by type) and then process each type independently...
Current idea (not working):
1) Load the JSON and parse only the type ("data" is loaded as a string).
2) Attach to each row the corresponding schema (a DDL as a string in a new column).
3) Try to parse "data" with the DDL from the previous column (using from_json).
=> This throws: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
Do you have any idea whether this is possible?
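The error is saying that from_json only accepts a literal schema per call, not one taken from another column. Since the per-type schemas are known, one possible sketch (column and field names are taken from the sample above, everything else is illustrative) is to keep "data" as a raw string, apply from_json once per known type, and pick the parsed variant by "type" when writing out:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Read the lines, keeping "data" as the raw JSON string
val raw = spark.read
  .schema(StructType(Seq(
    StructField("type", IntegerType),
    StructField("data", StringType))))
  .json("input.json") // path is illustrative

// One literal schema per known type
val type1Schema = StructType(Seq(StructField("data_of_type1", IntegerType)))
val type2Schema = StructType(Seq(StructField("data_of_type2", StringType)))
val type3Schema = StructType(Seq(
  StructField("data_of_type3", StringType),
  StructField("anotherone", IntegerType)))

// from_json is applied once per type; only the matching variant is kept later
val typed = raw
  .withColumn("data1", from_json(col("data"), type1Schema))
  .withColumn("data2", from_json(col("data"), type2Schema))
  .withColumn("data3", from_json(col("data"), type3Schema))

// Write each type to its own Parquet location (repeat for types 2 and 3;
// consider caching typed if the input would otherwise be re-read per write)
typed.filter(col("type") === 1)
  .select(col("type"), col("data1").alias("data"))
  .write.parquet("out/type=1")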

ClickHouse/Kafka: reading a JSON object type into a field

I have this kind of data in a Kafka Topic:
{..., fields: { "a": "aval", "b": "bval" } }
If I create a Kafka Engine table, I get an error when using a field definition like this:
fields String
because it (correctly) doesn't recognize it as a String:
2018.07.09 17:09:54.362061 [ 27 ] <Error> void DB::StorageKafka::streamThread(): Code: 26, e.displayText() = DB::Exception: Cannot parse JSON string: expected opening quote: (while read the value of key fields): (at row 1)
As ClickHouse does not currently have a Map or JSONObject type, what would be the best way to work around this, given that I don't know the names of the inner fields in advance ("a" or "b" in the example, so I cannot see Nested structures helping)?
Apparently, at the moment ClickHouse does not support complex JSON parsing.
From this answer on the ClickHouse GitHub:
ClickHouse uses a quick and dirty JSON parser, which does not know how to read complex deep structures. So it can't skip that field, as it does not know where that nested structure ends.
Sorry. :/
So you should preprocess your JSON with some external tool, or you can contribute to ClickHouse and improve the JSON parser.
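For example, a small preprocessing step in Scala (Jackson is just one option here and purely illustrative) could re-serialize the nested fields object into a plain JSON-encoded string, so that the column can be declared as fields String in ClickHouse:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.node.ObjectNode

val mapper = new ObjectMapper()

// {"fields": {"a": "aval", "b": "bval"}} becomes {"fields": "{\"a\":\"aval\",\"b\":\"bval\"}"}
def stringifyFields(message: String): String = {
  val root = mapper.readTree(message).asInstanceOf[ObjectNode]
  val fields = root.get("fields")
  if (fields != null && fields.isObject) {
    root.put("fields", mapper.writeValueAsString(fields))
  }
  mapper.writeValueAsString(root)
}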

MySQL: check the varchar length of JSON?

Is there any way we can find the varchar length of JSON inputs?
For example, SELECT JSON_LENGTH('[1, 2, {"a": 3}]'); gives 3 as output.
But is there any way to find the number of characters in this JSON, i.e. the length of the string '[1, 2, {"a": 3}]' itself? Is there any function that can return this?
We need this to avoid storing very big JSON objects and to keep it as a safety check before storing any JSON in the database.

NiFi: flow with KafkaConsumer to write as JSON

Currently I am stuck on the following problem:
I am reading messages from a Kafka topic using KafkaConsumer. The messages are strings and have the following format:
{ "a" : "b", "a1" : "b1", "c2" : "c3" }
They are saved within the payload of the FlowFile.
I want to convert that string into JSON, or ideally into CSV, but I can't figure out how to do it.
I am new to NiFi and researched as much as possible, but the answers I found were about conversions from JSON to Avro or similar, never from string to JSON or Avro.
I also found out that the Kafka message is in the payload of the FlowFile, not in the attributes, so I have no clue how to get my hands on it, since the examples always involve the attributes.
So in short: can I convert the payload of a FlowFile, which is a string, to JSON/CSV with some built-in processor?
If your message is in the FlowFile, the following sequence could help:
1) Use AttributesToJSON to convert the payload message to JSON.
2) Use EvaluateJsonPath to extract the payload message, in your case the Kafka message. Then you can pass the extracted messages on for CSV generation.
This post can help with converting JSON to CSV: Convert Json To CSV
I ended up doing this:
ConsumeKafka gives me the string:
{ "a" : "b", "a1" : "b1" }
EvaluateJsonPath creates attributes by adding properties
a -> $.a //results in attribute named a with value b
a1 -> $.a1 //results in attribute named a1 with value b1
ReplaceText takes the attributes from EvaluateJsonPath to form a single CSV-formatted line:
Replacement value -> ${'a'},${'a1'}
This results in a single line, but with NO NEW LINE:
b,b1
To add the newline, appending \n, '\n', or "\n" did not work.
What worked was pressing Shift+Enter while typing in the Replacement value field, which creates an empty new line.

Avro Schema: force to interpret value (map, array) as string

I want to convert JSON to Avro via NiFi. Unfortunately, the JSON has complex types as values that I want to treat as a simple string!
JSON:
"FLAGS" : {"FLAG" : ["STORED","ACTIVE"]}
How can I tell Avro to simply store "{"FLAG" : ["STORED","ACTIVE"]}" or "[1,2,3,"X"]" as a string?
Thank you sincerely!
The JSON-to-Avro conversion performed by NiFi's ConvertJSONToAvro processor does not really do transformation in the same step. There is a very limited ability to transform based on the Avro schema, mostly omitting input data fields from the output. But it won't coerce a complex structure into a string.
Instead, you should do a JSON-to-JSON transformation first, then convert your summarized JSON to Avro. I think what you are looking for is a structure like this:
{
"FLAGS": "{\"FLAG\":[\"STORED\",\"ACTIVE\"]}"
}
NiFi's JoltTransformJSON and ExecuteScript processors are great for this. If your records are simple enough, maybe even a combination of EvaluateJsonPath ($.FLAGS) and ReplaceText ({ "FLAGS": "${flags:escapeJson()}" }) would do.