Processing json data from kafka using structured streaming - json

I want to convert incoming JSON data from Kafka into a dataframe.
I am using structured streaming with Scala 2.12
Most people add a hard coded schema, but if the json can have additional fields, it requires changing the code base every-time, which is tedious.
One approach is to write it into a file and infer it with but I rather avoid doing that.
Is there any other way to approach this problem?
Edit: Found a way to turn a json string into a dataframe but cant extract it from the stream source, it is possible to extract it?

One way is to store the schema itself in the message headers (not in the key or value).
Though, this increases message size, it will be easy to parse the JSON value without the need for any external resource like a file or a schema registry.
New messages can have new schemas while at the same time old messages can still be processed using their old schema itself, because the schema is within the message itself.
Alternatively, you can version the schemas and include an id for every schema in the message headers (or) a magic byte in the key or value and infer the schema from there.
This approach is followed by Confluent Schema registry. It allows you to basically go through different versions of same schema and see how your schema has evolved over time.

Read the data as string and then convert it to map[string,String], this way you can process the any json without even knowing its schema

based on JavaTechnical answer , the best approach would be to use a schema registry and
avro data instead of json, there is no going around hardcoding a schema (for now).
include your schema name and id as a header and use them to read the schema from the schema registry.
use the from_avro fucntion to turn that data into a df!

Related

Kafka stream append to JSON as event enrichment

I have a producer that writes a json file to the topic to be read by a kafka consumer stream. Its simple key-value pair.
I want to stream the topic and enrich the event by adding/concatenating more JSON key-value rows and publish to another topic.
None of the values or keys have anything in common by the way.
I am probably overthinking this, but how would I get around this logic?
I suppose you want to decode JSON message at the consumer side.
If you are not concerned about schema and but just want to deal with JSON as a Map, you can use Jackson library to read the JSON string as a Map<String,Object>. For this you can add the fields that you want, convert it back to a JSON string and push it to the new topic.
If you want to have a schema, you need to store the information as to which class it is mapping to or the JSON schema or some id that maps to this, then the following could work.
Store the schema info in headers
For example, you can store the JSON schema or Java class name in the headers of the message while producing and write a deserializer to extract that information from the headers and decode it.
The Deserializer#deserialize() has the Headers argument.
default T deserialize(java.lang.String topic,
Headers headers,
byte[] data)
and you can do something like..
objectMapper.readValue(data,
new Class.forName(
new String(headers.lastHeader("classname").value()
))
Use schema registry
Apart from these, there is also a schema registry from Confluent which can maintain different versions of the schema. You would need to run another process for that, though. If you are going to use this, you may want to look at the subject naming strategy and set it to RecordNameStrategy since you have multiple schemas in the same topic.

Json file that would be used in Elasticsearch

I want to know, if the Json files that would be used in Elasticsearch should have a predefined structure. Or can any Json document can be uploaded?
I've seen some Json documents that before each record there's such:
{"index":{"_index":"plos","_type":"article","_id":0}}
{"id":"10.1371/journal.pone.0007737","title":"Phospholipase C-β4 Is Essential for the Progression of the Normal Sleep Sequence and Ultradian Body Temperature Rhythms in Mice"}
Theoretically you can upload any JSON document. However, be mindful that Elasticsearch can create/change the index mapping based on your create/update actions. So if you send a JSON that includes a previously unknown field? Congratulations, your index mapping now contains a new field! In this same way the data type of a field might also be affected by introducing a document with data of a different type. So, my advice is be very careful in constructing your requests to avoid surprises.
Fyi, the syntax you posted looks like a bulk request (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). Those do have some demands on the syntax to clarify what you want to do to which documents. "Index" call sending one document is very unrestricted though.

Is there a way to use Playframework Json macro generated format for partial entity validation?

My program stack is ReactiveMongo 0.11.0, Scala 2.11.6, Play 2.4.2.
I'm adding PATCH functionality support to my Controllers. I want it to be type safe, so that PATCH would not mess the data in Mongo.
Current dirty solution of doing this, is
Reading object from Mongo first,
Performing JsObject.deepMerge with provided patch,
Checking that value can still be deserialized to target type.
Serializing merged object back to JsObject, and check, that patch contains only fields that are present in merged Json (So that there is no trash added to the stored object)
Call actual $set on mongo
This is obviously not perfect, but works fine. I would write macros to generate appropriate format generalization, but it might take too much time, which I currently lack of.
Is there a way to use Playframework Json macro generated format for partial entity validation like this?
Or any other solution, that can be easily integrated in Playframework for that matters.
With the help of #julien-richard-foy made a small library, to do exactly what I wanted.
https://github.com/clemble/scala-validator
Need to add some documentation, and I'll publish it to repository.

Importing a json file into Cassandra

Hi is it possible to import any random json file into cassandra.
The json file is not exported from sstable2json. The json file is from a different website and needs to be imported into cassandra. Please could anyone advise whether this is possible
JSON support won't be introduced until Cassandra 3.0 (see CASSANDRA-7970) and in this case you still need to define a schema for your json data to map to. You do have some other options:
Use maps which sort of map to JSON. Maps can be indexed as of Cassandra 2.1 (CASSANDRA-4511) There is also a good Stack Exchange post about this.
You mention 'any random json file'. You could just have a string column that contains the raw JSON, but then you lose any query-ability of that data.
Come up with some kind of schema for your JSON data and map it to a CQL table and write some code that parses the JSON and writes it to the CQL table mapping to that data. This doesn't sound like an option for you since you want to be able to import any random JSON file.
If you are looking to only do json document storage, you might want to look at more document-oriented solutions instead of a column-oriented solution like cassandra.

Javascript in place of json input step

I am loading data from a mongodb collection to a mysql table through Kettle transformation.
First I extract them using MongodbInput and then I use json input step.
But since json input step has very low performance, I wanted to replace it with a
javacript script.
I am a beginner in Javascript and even though i tried somethings, the kettle javascript script is not recognizing any keywords.
can anyone give me sample code to convert Json data to different columns using javascript?
To solve your problem you need to see three aspects:
Reading from MongoDB
Reading from JSON
Reading from (probably) String
Reading from MongoDB Except if you changed the interface, MongoDB returns not JSON but BSON files (~binary JSON). You need to see the MongoDB documentation about reading and writing BSON: probably something like BSON.to() and BSON.from() but I don't know it by heart.
Reading from JSON Once you have your BSON in JSON format, you can read it using JSON.stringify() which returns a String.
Reading from (probably) String If you want to use the capabilities of JSON (why else would you use JSON?), you also want to use JSON.parse() which returns a JSON object.
My experience is that to send a JSON object from one step to the other, using a String is not a bad idea, i.e. at the end of a JavaScript step, you write your JSON object to a String and at the beginning of the next JavaScript step (can be further down the stream) you parse it back to JSON to work with it.
I hope this answers your question.
PS: writing JavaScript steps requires you to learn JavaScript. You don't have to be a master, but the basics are required. There is no way around it.
you could use the json input step to get the values of this json and put in common rows