Does Spark streaming process every JSON "event" individually when reading from Kafka?

I want to use Spark Streaming to read messages in JSON format from a single Kafka topic; however, not all of the events share the same schema. If possible, what's the best way to check each event's schema and process it accordingly?
Is it possible to group events with similar schemas together in memory and then process each group in bulk?

I'm afraid you can't. You need to decode the JSON message somehow to identify its schema, and that has to happen in your Spark code. However, you can try populating the Kafka message key with a different value per schema and having the messages partitioned per key, which Spark can then exploit.
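To make the keying idea concrete, here is a minimal Structured Streaming sketch in Scala. The broker address, the topic name "events", the key values "order"/"click" and both schemas are assumptions made up for illustration; the only point is that once the producer tags each message with a per-schema key, each subset can be parsed with the schema that matches it.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object MixedSchemaStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mixed-schema-kafka").getOrCreate()
    import spark.implicits._

    // Hypothetical schemas for two kinds of events sharing the same topic.
    val orderSchema = new StructType().add("orderId", StringType).add("amount", DoubleType)
    val clickSchema = new StructType().add("userId", StringType).add("page", StringType)

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                        // assumed topic name
      .load()
      .select($"key".cast("string").as("key"), $"value".cast("string").as("json"))

    // The producer sets the Kafka key to a per-schema discriminator,
    // so each branch only parses the messages it understands.
    val orders = raw.filter($"key" === "order")
      .select(from_json($"json", orderSchema).as("e")).select("e.*")
    val clicks = raw.filter($"key" === "click")
      .select(from_json($"json", clickSchema).as("e")).select("e.*")

    orders.writeStream.format("console").start()
    clicks.writeStream.format("console").start()
    spark.streams.awaitAnyTermination()
  }
}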

Formats like Parquet and Avro are good for exactly this reason, since the schema is stored with the data. If you absolutely must use JSON, then you can do as you said and group by key while casting to the object you want. If you are using large JSON objects you will see a performance hit, since the entire JSON "file" must be parsed before any object resolution can take place.
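If the Kafka key can't carry the schema, a discriminator inside the payload can serve the same purpose. The following self-contained Scala sketch uses a made-up "type" field and toy records to show the group-by-then-cast idea on a single micro-batch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object GroupBySchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("group-by-schema").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy input standing in for one micro-batch of mixed-schema JSON strings.
    val events = Seq(
      """{"type":"order","orderId":"o-1","amount":9.99}""",
      """{"type":"click","userId":"u-7","page":"/home"}"""
    ).toDF("json")

    // Pull out the discriminator ("type" is a hypothetical field) before the full parse,
    // then treat each schema group as its own DataFrame.
    val tagged = events.withColumn("kind", get_json_object($"json", "$.type"))

    val orderSchema = new StructType().add("orderId", StringType).add("amount", DoubleType)
    val orders = tagged.filter($"kind" === "order")
      .select(from_json($"json", orderSchema).as("e")).select("e.*")
    orders.show()
  }
}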

Related

Variable schema and Hive integration using Kafka

I've been searching for an answer but haven't found any similar issue or thread that could help.
The problem is that I have a Kafka topic which receives data from a topic on another Kafka cluster. The data is a continuous flow of various JSON documents, each having its own schema; only a few fields are common.
I need the data from all of them to be ingested into a single Hive table. I thought of creating a table with only one column to store the whole JSON content as a raw string, but I ultimately failed to integrate it with Hive (I was only able to move the data to HDFS, whereas I'd rather have a table receiving data directly from Kafka, since it's a continuous flow).
Unfortunately, I'm not able to alter the original topic in any way. Does anyone have an idea how to deal with this?

avro schema with json encoding - how to determine schema back from serialized data

I want to use Apache Avro schemas for data serialization and deserialization.
I want to use them with JSON encoding.
I want to put several of these serialized objects, using different schemas, into the same "source" (a Kafka topic).
When I read the data back, I need to be able to resolve the right schema for the current entry.
But the serialized data doesn't contain any schema information. Testing all possible schemas for compatibility (a kind of duck-typing approach) would be pretty unclean and error-prone (for data that fits multiple schemas, it would be unclear which one to pick).
I'm currently thinking about programmatically putting the namespace and object name inside the JSON data. But such a solution would not belong to the Avro standard, and it would open a new error scenario where it's possible to put the wrong schema namespace and/or object name inside the data.
I'm wondering whether there is a better way, or whether there is a general flaw in my approach.
Background: I want to use this for Kafka messages, but I don't want to use the Schema Registry (I don't want a new single point of failure). I also still want to have KSQL support, which is only available for the JSON format or for Avro with the Schema Registry.
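For what it's worth, the "carry the schema name with the data" idea from the question could look roughly like the sketch below (Scala, using the standard Avro and Jackson APIs). It wraps the Avro-JSON payload in an envelope instead of injecting the name into the record itself; the field names "schema" and "payload" and the schemasByName lookup map are assumptions of this sketch, not anything defined by the Avro standard.

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import java.io.ByteArrayOutputStream

object SchemaEnvelope {
  // Serialize a record as Avro JSON and wrap it in an envelope that names its schema.
  def toEnvelope(record: GenericRecord, schema: Schema): String = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().jsonEncoder(schema, out)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    s"""{"schema":"${schema.getFullName}","payload":${out.toString("UTF-8")}}"""
  }

  // Resolve the writer schema from the envelope; schemasByName is an
  // application-maintained map from full schema name to Schema.
  def fromEnvelope(envelope: String, schemasByName: Map[String, Schema]): GenericRecord = {
    val node = new ObjectMapper().readTree(envelope)
    val schema = schemasByName(node.get("schema").asText())
    val decoder = DecoderFactory.get().jsonDecoder(schema, node.get("payload").toString)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}

This keeps the wrong-name risk the question mentions, but at least confines it to a single producer-side helper.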

In Apache Nifi, is it possible to drop some objects based on a condition

In Apache NiFi, I am reading some logs from S3 which are JSON objects in text files. I split them with the SplitText processor, and now I want to filter out some objects based on the attribute 'source=es_logs'. Is there a processor for this?
Thanks for the help!
You should use the RouteOnAttribute processor. This can be a boolean match routing to SUCCESS or FAILURE relationships, or an n-many match routing to an arbitrary number of relationships.
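As a rough illustration (assuming source has already been promoted to a flowfile attribute, e.g. extracted from the JSON beforehand), a dynamic property on RouteOnAttribute could use the Expression Language like this, with matching flowfiles routed to a relationship named after the property:

es_logs = ${source:equals('es_logs')}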

Why Play Framework uses JSON while MongoDB uses BSON

I've run into many troubles serializing/deserializing Scala data types to/from JSON objects and then storing/retrieving them in MongoDB in BSON form.
1st question: why does Play Framework use JSON while MongoDB uses BSON?
2nd question: if I'm not wrong, JavaScript does not have readers and writers for serializing/deserializing BSON from MongoDB. How can that be? JavaScript handles JSON seamlessly, but for BSON I would expect it to need some sort of readers and writers.
3rd question: (I read somewhere that) Salat and ReactiveMongo use different mechanisms to talk to MongoDB. Why?
JSON is a widely used format for transferring data these days, so it's handy to have it supported out of the box in a web framework. That is why Play has it.
Mongo stores data for much the same reason: it's a good idea to store data in roughly the same format users query and save it in. So why does Mongo use BSON rather than JSON? BSON is essentially JSON with additional properties attached to every value: its length and its type. The reason is that when you scan a lot of data (as a DB query does), with plain JSON you must read an entire object before you can move on to the next one; if you know the length of each value, you can skip it.
So you don't really need any BSON readers in JS (they exist, but are rarely used) because BSON is a format for internal DB use.
You can read this article for more information.
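To make the length-and-type point concrete, here is a hand-assembled sketch (written out in Scala, not produced by any driver) of how BSON lays out the tiny document {"a": 1} according to the public BSON spec; the per-element type byte and the length prefix are what let a reader skip fields without parsing them.

// BSON encoding of {"a": 1}, 12 bytes in total.
val bsonOfA1: Array[Byte] = Array[Byte](
  0x0C, 0x00, 0x00, 0x00, // total document length: 12 (little-endian int32)
  0x10,                   // element type 0x10 = 32-bit integer
  0x61, 0x00,             // field name "a" as a NUL-terminated C string
  0x01, 0x00, 0x00, 0x00, // the value 1 (little-endian int32)
  0x00                    // end-of-document marker
)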

Distributed processing of JSON in Hadoop

I want to process a ~300 GB JSON file in Hadoop. As far as I understand, a JSON file consists of a single string with data nested in it. Now if I want to parse the JSON string using Google's GSON, won't Hadoop have to put the entire load on a single node, since the JSON isn't logically divisible for it?
How do I partition the file (I can work out the partitions logically by looking at the data) if I want it to be processed in parallel on different nodes? Do I have to break the file up before I load it onto HDFS? Is it absolutely necessary that the JSON is parsed by one machine (or node) at least once?
Assuming you can logically split the JSON into separate components, you can accomplish this just by writing your own InputFormat.
Conceptually, you can think of each of the logically divisible JSON components as one "line" of data, where each component contains the minimal amount of information that can be acted on independently.
Then you will need to make a class that extends FileInputFormat and returns each of these JSON components.
public class JSONInputFormat extends FileInputFormat<Text, JSONComponent> { ... }
If you can logically divide your giant JSON into parts, do so, and save these parts as separate lines in a file (or as records in a sequence file). Then, if you feed this new file to Hadoop MapReduce, the mappers will be able to process the records in parallel.
So yes, the JSON should be parsed by one machine at least once. This preprocessing phase doesn't need to be performed in Hadoop; a simple script can do the work. Use a streaming API to avoid loading a lot of data into memory.
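As a sketch of such a preprocessing script, the following Scala program uses Jackson's streaming parser to walk a huge file and write one element per line; the file names and the assumption that the input is a single top-level JSON array are illustrative only.

import com.fasterxml.jackson.core.{JsonFactory, JsonToken}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import java.io.{File, PrintWriter}

object SplitJsonArray {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    val parser = new JsonFactory().createParser(new File("big.json")) // assumed input path
    val out = new PrintWriter("big.jsonl")                            // assumed output path
    require(parser.nextToken() == JsonToken.START_ARRAY, "expects a top-level JSON array")
    // Read one element at a time so the whole file never sits in memory.
    while (parser.nextToken() == JsonToken.START_OBJECT) {
      val node: JsonNode = mapper.readTree(parser)
      out.println(node.toString) // one complete JSON object per line
    }
    out.close()
    parser.close()
  }
}

The resulting one-object-per-line file splits cleanly with Hadoop's default TextInputFormat.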
You might find this JSON SerDe useful. It allows Hive to read and write in JSON format. If it works for you, it will be much more convenient to process your JSON data with Hive, since you won't have to worry about a custom InputFormat that reads your JSON data and creates splits for you.