How can I convert JSON to Avro in Apache NiFi?
E.g. JSON obtained using the GetTwitter processor.
Previous versions seem to have supported a ConvertJSONToAvro processor.
To me, it looks like the ConvertRecord processor should be used nowadays:
i.e. using record-oriented processing to read the JSON with a JSON tree reader and write it out as Avro.
But where / how do I specify the schema? Especially for a schema as complex as the one obtained from Twitter. Does NiFi somehow guess the right schema automatically?
edit
In fact, something rather obvious happens:
ConvertRecord Failed to process StandardFlowFileRecord will route to failure: ${schema.name} did not provide appropriate Schema Name
i.e. ConvertRecord succeeds in parsing the JSON, but fails when trying to apply the Avro writer. So how can I get an Avro representation of the tweets?
You should be able to infer the schema and have that translated automatically to Avro using the modern record processors. Abdelkrim Hadjidj has a great write up about it, but to summarize:
Modern method
Use the Schema Inference capability in the JsonPathReader or JsonTreeReader implementation you're using in ConvertRecord. This will allow it to infer the schema, and then pass that along to the AvroRecordSetWriter via the schema.name and avro.schema attributes. There is also a schema inference cache (use the provided volatile implementation unless you have other requirements) to improve performance.
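If you want to sanity-check outside NiFi what such an inferred schema roughly looks like, here is a minimal Python sketch using the fastavro library; the tweet fields shown are only a small, illustrative subset of the real Twitter payload, not what NiFi will actually infer:

```python
# Minimal sketch: hand-write an Avro schema similar to what inference would
# produce for a trimmed-down tweet, then check a sample record against it.
from fastavro import parse_schema
from fastavro.validation import validate

tweet_schema = {
    "type": "record",
    "name": "Tweet",
    "namespace": "example.twitter",
    "fields": [
        {"name": "id_str", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "user", "type": {
            "type": "record",
            "name": "User",
            "fields": [
                {"name": "screen_name", "type": "string"},
                {"name": "followers_count", "type": "int"},
            ],
        }},
        # Fields that may be absent should be nullable unions, which is also
        # what schema inference tends to produce for optional JSON fields.
        {"name": "in_reply_to_status_id_str",
         "type": ["null", "string"], "default": None},
    ],
}

record = {
    "id_str": "1234567890",
    "text": "hello avro",
    "user": {"screen_name": "someone", "followers_count": 42},
    "in_reply_to_status_id_str": None,
}

parsed = parse_schema(tweet_schema)
validate(record, parsed)  # raises ValidationError if the record does not match
print("record conforms to the schema")
```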
Old method
Use the InferAvroSchema processor to parse the incoming data and generate an Avro schema.
Note: this processor is no longer included with default builds of NiFi due to space restrictions, but you can manually build the nifi-kite-nar and then deploy it into the $NIFI_HOME/extensions/ directory to load that functionality.
Related
I created a pipeline that handles a single JSON file (a vector of 5890 elements, each one a record) and sends it via Kafka in Avro format. The producer works fine, but when I read the topic with a consumer I get one FlowFile (an Avro file) per record: 5890 Avro files. How can I merge several records into a single Avro file?
I simply use PublishKafkaRecord_0_10 1.5.0 (JsonTreeReader 1.5.0 and AvroRecordSetWriter 1.5.0) and ConsumeKafka_0_10 1.5.0.
Firstly, NiFi 1.5.0 is from January 2018. Please consider upgrading as this is terribly out of date. NiFi 1.15.3 is the latest as of today.
Secondly, the *Kafka_0_10 processors are geared at very old versions of Kafka - are you really using v0.10 of Kafka? You have the following processors for later Kafka versions:
*Kafka_1.0 for Kafka 1.0+
*Kafka_2.0 for Kafka 2.0+
*Kafka_2.6 for Kafka 2.6+.
It would be useful if you provide examples of your input and desired output and what you are actually trying to achieve.
If you are looking to consume those messages in NiFi and you want a single FlowFile with many messages, you should use ConsumeKafkaRecord rather than ConsumeKafka. This will let you control how many records you'd like to see per 'file'.
If your consumer is not NiFi, then either they need to merge on their end, or you need to bundle all your records into one larger message when producing. However, this is not really the point of Kafka as it's not geared towards large messages/files.
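If the merging has to happen outside NiFi, here is a rough Python sketch (assuming the fastavro library, and that each small Avro file is a normal Avro container with an embedded schema) that concatenates many single-record Avro files into one container file:

```python
# Sketch: read many small Avro container files that share one schema and
# rewrite all their records into a single Avro file.
from fastavro import reader, writer

def merge_avro_files(paths, out_path):
    records, schema = [], None
    for path in paths:
        with open(path, "rb") as fh:
            avro_reader = reader(fh)
            schema = avro_reader.writer_schema   # assume all files share one schema
            records.extend(avro_reader)          # collect every record
    with open(out_path, "wb") as out:
        writer(out, schema, records)             # one container, many records

# merge_avro_files(["msg-0001.avro", "msg-0002.avro"], "merged.avro")
```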
I want to use apache avro schema's for data serialization and deserialization.
I want to use it with json encoding.
I want to put several of these serialized objects, using different schemas, into the same "source" (a Kafka topic).
When I read them back, I need to be able to resolve the right schema for each data entry.
But the serialized data doesn't carry any schema information, and testing all possible schemas for compatibility (a kind of duck-typing approach) would be unclean and error-prone (for data that fits multiple schemas it would be unclear which one to take).
I'm currently thinking about programmatically putting the namespace and object name inside the JSON data (sketched below). But such a solution would not be part of the Avro standard, and it would open a new error scenario where the wrong schema namespace and/or object name could be put inside the data.
I'm wondering if there is a better way, or whether there is a general flaw in my approach.
Background: I want to use this for Kafka messages but don't want to use the Schema Registry (I don't want a new single point of failure). I also still want KSQL support, which is only available for the JSON format or for Avro with the Schema Registry.
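A rough sketch of the envelope idea described above, in Python with the fastavro library; the schema names and the SCHEMAS dict are made up for illustration, and the payload itself stays JSON-encoded:

```python
# Sketch: wrap each JSON-encoded payload in an envelope that carries the full
# Avro schema name, and resolve the schema from a local dict when reading back.
import json
from fastavro import parse_schema
from fastavro.validation import validate

USER_SCHEMA = parse_schema({
    "type": "record", "name": "User", "namespace": "example",
    "fields": [{"name": "name", "type": "string"}],
})
ORDER_SCHEMA = parse_schema({
    "type": "record", "name": "Order", "namespace": "example",
    "fields": [{"name": "order_id", "type": "long"}],
})

# Local "registry": full schema name -> schema, shipped with the application.
SCHEMAS = {"example.User": USER_SCHEMA, "example.Order": ORDER_SCHEMA}

def serialize(schema_name, payload):
    validate(payload, SCHEMAS[schema_name])      # fail fast on a bad payload
    return json.dumps({"schema": schema_name, "payload": payload}).encode("utf-8")

def deserialize(message_bytes):
    envelope = json.loads(message_bytes)
    schema = SCHEMAS[envelope["schema"]]         # KeyError -> unknown schema
    validate(envelope["payload"], schema)
    return envelope["payload"]

msg = serialize("example.User", {"name": "alice"})
print(deserialize(msg))
```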
I am using Python 3 for functional testing of a bunch of REST endpoints.
But I cannot figure out the best way to validate the JSON response (verifying the types, and the required, missing, and additional fields).
I thought of the options below:
1. Writing custom code and validating the response while converting the data into Python class objects.
2. Validating using JSON Schema.
Option 1 would be difficult to maintain and would need separate functions for all the data models.
Option 2: I like it, but I don't want to write the schema for each endpoint in a separate file/object. Is there a way to put them in a single object, like a Swagger YAML file? That would be easier to maintain.
I would like to know which option is best, and whether there are other, better options / libraries available.
I've been through the same process, but validating REST requests and responses with Java. In the end I went with JSON Schema (there's an equivalent Python implementation at https://pypi.python.org/pypi/jsonschema) because it was simple and powerful, and hand-crafting the validation for anything but a trivial payload soon became a nightmare. Also, reading a JSON Schema file is easier than reasoning about a long list of validation statements.
It's true you need to define the schema in a separate file, but this proved to be no big deal. And, if your endpoints share some common features you can modularise your schemas and reuse common parts. There's a good tutorial at Understanding JSON Schema.
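For example, here is a minimal sketch of option 2 with the jsonschema library, keeping all schemas in a single dict (which could just as well be loaded from one YAML/JSON file, much like the paths section of a Swagger spec); the endpoint names and schemas are made up:

```python
# Sketch: one dict of JSON Schemas keyed by endpoint, validated with jsonschema.
from jsonschema import validate

SCHEMAS = {
    "GET /users/{id}": {
        "type": "object",
        "required": ["id", "name"],
        "additionalProperties": False,       # flag unexpected extra fields
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "email": {"type": ["string", "null"]},
        },
    },
    "GET /orders/{id}": {
        "type": "object",
        "required": ["order_id", "total"],
        "properties": {
            "order_id": {"type": "integer"},
            "total": {"type": "number"},
        },
    },
}

def assert_valid(endpoint, response_json):
    """Raise jsonschema.ValidationError if the response does not match."""
    validate(instance=response_json, schema=SCHEMAS[endpoint])

# In a test:
assert_valid("GET /users/{id}", {"id": 1, "name": "alice", "email": None})
```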
I have run into a lot of trouble serializing/deserializing Scala data types to/from JSON objects and then storing/loading them to/from MongoDB in BSON form.
1st question: why does the Play Framework use JSON while MongoDB uses BSON?
2nd question: If I am not wrong, JavaScript does not have readers and writers for serializing/deserializing BSON from MongoDB. How can this be? JavaScript can seamlessly handle JSON, but for BSON I expect it would need some sort of readers and writers.
3rd question: (I read somewhere that) Salat and ReactiveMongo use different mechanisms to talk to MongoDB. Why?
JSON is a widely used format for transferring data these days, so it's good to have it out of the box in a web framework. That is why Play has it.
Mongo uses it for the same reason: it is a good idea to store data in the same format the user queries and saves it in. So why does Mongo use BSON rather than JSON? BSON is essentially JSON with additional metadata on every value: the data length and the data type. The reason is that when you scan through a lot of data (as a DB query does), with plain JSON you have to read an entire object to get to the next one; if you know the length of the data you can skip it.
So you just don't need any BSON readers in JS (they may exist somewhere, but they are rarely used), because BSON is a format for use inside the database.
You can read this article for more information.
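If you want to see the difference for yourself, here is a tiny Python illustration (assuming pymongo is installed, which ships the bson package):

```python
# Sketch: encode the same document as JSON text and as BSON bytes.
# BSON prefixes the document (and its values) with length and type
# information, which is what lets MongoDB skip values without parsing them.
import json
import bson  # provided by pymongo

doc = {"name": "alice", "age": 30}

as_json = json.dumps(doc).encode("utf-8")
as_bson = bson.encode(doc)            # pymongo >= 3.9

print(len(as_json), as_json)          # plain text, no length/type metadata
print(len(as_bson), as_bson[:4])      # first 4 bytes = total document length
```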
I have a use case where I need to validate JSON objects against a schema that can change in real time.
Let me explain my requirements:
1. I persist JSON objects (in MongoDB).
2. Before persisting, I MUST validate the data types of some of the fields of the JSON objects (mentioned in #1) against a schema.
3. I persist the schema in MongoDB.
4. I always validate the JSON objects against the latest schema available in the DB (so I don't think it matters much that the schema can change in real time; for me it is more or less static).
5. I am using a J2EE stack (Spring Framework).
Can anyone guide me here?
Another way of doing it is to use an external library https://github.com/fge/json-schema-validator to do the work for you. The one I proposed supports draft 4 of JSON Schema.
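The library above is Java, but the flow is the same in any stack: load the latest schema, validate, then persist. A Python-flavoured sketch (using pymongo and jsonschema, with made-up collection and field names) just to make the steps concrete:

```python
# Sketch of the validate-before-persist flow. In a Spring stack the fge
# json-schema-validator plays the role that jsonschema plays here. Assumes
# schemas live in a "schemas" collection with an increasing "version" field.
from jsonschema import validate, ValidationError
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

def save_if_valid(document):
    # 1. Load the latest schema from MongoDB.
    latest = db.schemas.find_one(sort=[("version", DESCENDING)])
    # 2. Validate the incoming JSON object against it.
    try:
        validate(instance=document, schema=latest["schema"])
    except ValidationError as err:
        raise ValueError(f"document rejected: {err.message}")
    # 3. Persist only documents that pass validation.
    db.documents.insert_one(document)
```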
The IBM DataPower appliance has JSON Schema validation support. This will allow you to offload validation to an appliance that is designed for it, along with routing of data within the enterprise.