How to parse JSON data from a Kafka server with Spark Streaming?

I managed to connect Spark Streaming to my Kafka server, which holds data in JSON format. I want to parse this data so that I can use the groupBy function as explained here: Can Apache Spark merge several similar lines into one line?
In fact, in that link the JSON data is imported from a file, which is clearly easier to handle. I didn't find anything similar for a Kafka server.
Do you have any idea about this?
Thanks and regards

It's really hard to understand what you're asking, because without code we can't see where you are now. Maybe this general guidance is what you need.
Your StreamingContext can be given a foreachRDD block in which you'll get an RDD. Then you can call sqlContext.read.json(inputRDD) and you will have a DataFrame, which you can process however you like.
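A minimal PySpark sketch of that approach, assuming the DStream API with the spark-streaming-kafka package; the topic name, broker address, and the "someKey" column are placeholders for your setup, and the same calls exist in the Scala API:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka package

sc = SparkContext(appName="KafkaJsonParsing")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Topic name and broker address are placeholders for your setup
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"})

def process(time, rdd):
    if rdd.isEmpty():
        return
    # Kafka records arrive as (key, value) pairs; the value is the JSON string
    df = sqlContext.read.json(rdd.map(lambda kv: kv[1]))
    # From here it is a plain DataFrame, so groupBy works as in the linked answer
    # ("someKey" is a made-up column name)
    df.groupBy("someKey").count().show()

stream.foreachRDD(process)

ssc.start()
ssc.awaitTermination()
```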

Related

How to send data from Spark to my Angular8 project

The technologies I am using to fetch data from my MySQL database are Spark 2.4.4 and Scala. I want to display that data in my Angular8 project. Any help on how to do it? I could not find any documentation regarding this.
I am not sure this is a Scala/Spark question; it sounds more like a system-design question for your project.
One solution is to have your Angular8 app read directly from MySQL. There are tons of tutorials online.
Another solution is to use Spark/Scala to read the data and dump it to a CSV/JSON file somewhere, then have Angular8 read that file. The pro is that you can do some transformation before displaying the data; the con is that there is latency between the transformation and the display. After reading the flat file into JSON, it's up to you how to render that data on the user's screen. A sketch of the dump step follows below.
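A minimal sketch of that second option, written in PySpark for consistency with the earlier example (the Scala DataFrame API offers the same read/write calls); the connection details, table name, and output path are placeholders, and the MySQL JDBC driver must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySqlToJson").getOrCreate()

# Connection details and table name are placeholders for your setup
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb")
      .option("dbtable", "my_table")
      .option("user", "dbuser")
      .option("password", "dbpassword")
      .load())

# Any transformation you want to apply goes here, before the dump

# Write JSON files that the Angular8 app (or a small API in front of it) can read
df.write.mode("overwrite").json("/data/exports/my_table_json")

spark.stop()
```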

What is kafka connect Json Schema good for?

I wonder what the benefit is of adding the JSON schema to your message, given that Kafka Connect supports doing so?
Schemas are an important part of data pipelines. Kafka Connect supports embedding the schema in the JSON message itself, or you can use another option (Avro, Protobuf). If you don't have a schema, you make life more difficult for consumers of the data, and some will insist on one; for example, the JDBC Sink connector requires a schema and will fail if there isn't one.
So to answer your question: if you don't want to use Avro or Protobuf (and if you like having large messages with lots of redundant repeating data ;-) then you can use the Kafka Connect JSON schema format.
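For illustration, a sketch of what a message produced by the JsonConverter with schemas enabled looks like; the record name, fields, and values are made up. The schema block is repeated in every single message, which is where the redundancy comes from:

```json
{
  "schema": {
    "type": "struct",
    "name": "example.User",
    "optional": false,
    "fields": [
      {"field": "id",   "type": "int64",  "optional": false},
      {"field": "name", "type": "string", "optional": true}
    ]
  },
  "payload": {"id": 42, "name": "Alice"}
}
```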

Does Spark streaming process every JSON "event" individually when reading from Kafka?

I want to use Spark Streaming to read messages in JSON format from a single Kafka topic; however, not all the events have the same schema. If possible, what's the best way to check each event's schema and process it accordingly?
Is it possible to group several bunches of similar-schema events in memory and then process each group as a bulk?
I'm afraid you can't do that directly. You need to somehow decode the JSON message to identify its schema, and that has to happen in your Spark code. However, you can try to populate the Kafka message key with a different value per schema and have Spark assign partitions per key.
Object formats like Parquet and Avro are good for this reason, since the schema is available in the header. If you absolutely must use JSON, then you can do as you said and use group-by-key while casting to the object you want. If you are using large JSON objects, you will see a performance hit, since the entire JSON "file" must be parsed before any object resolution can take place.
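A sketch of the "group similar-schema events and process each group as a bulk" idea, assuming each event carries a made-up "eventType" field acting as the schema discriminator; the sample messages and schemas are invented for illustration:

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("MixedSchemaEvents").getOrCreate()
sc = spark.sparkContext

# Hypothetical raw messages as they might arrive from the topic; each one
# carries an "eventType" field that acts as the schema discriminator
raw = sc.parallelize([
    '{"eventType": "click", "url": "/home", "userId": 1}',
    '{"eventType": "purchase", "amount": 250, "userId": 2}',
])

# One expected schema per event type (all field names here are made up)
schemas = {
    "click": StructType([StructField("url", StringType()),
                         StructField("userId", LongType())]),
    "purchase": StructType([StructField("amount", LongType()),
                            StructField("userId", LongType())]),
}

# Split the batch into one bunch per schema and parse each bunch as a bulk
for event_type, schema in schemas.items():
    subset = raw.filter(lambda s, t=event_type: json.loads(s).get("eventType") == t)
    df = spark.read.schema(schema).json(subset)
    df.show()
```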

Apache Spark-Get All Field Names From Nested Arbitrary JSON Files

I have run into a somewhat perplexing issue that has plagued me for several months now. I am trying to create an Avro schema (a schema-enforced format for serializing arbitrary data, basically, as I understand it) to convert some complex JSON files (arbitrary and nested) eventually to Parquet in a pipeline.
I am wondering if there is a way to get the superset of field names I need for this use case while staying in Apache Spark, instead of Hadoop MR, in a reasonable fashion?
I think Apache Arrow, which is under development, might eventually help avoid this by treating JSON as a first-class citizen, but it is still a ways off.
Any guidance would be sincerely appreciated!
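One way to approximate this while staying in Spark, offered as a sketch only: let read.json infer a schema that merges the fields seen across all the files, then walk the resulting nested StructType to collect every field name. The input path is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, ArrayType

spark = SparkSession.builder.appName("JsonFieldSuperset").getOrCreate()

# Path is a placeholder; read.json infers a schema that is the union of the
# fields observed across all the (arbitrary, nested) JSON files
df = spark.read.json("/data/raw_json/*.json")

def field_names(schema, prefix=""):
    """Recursively collect dotted field names from a possibly nested schema."""
    names = []
    for field in schema.fields:
        full = prefix + field.name
        names.append(full)
        dtype = field.dataType
        # Descend into structs and arrays of structs
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            names.extend(field_names(dtype, full + "."))
    return names

print(field_names(df.schema))
```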

Creating mysql tables from json file in django

I have exported data from Firebase into one JSON file, and I want to create MySQL tables from this JSON file. Is there any way to do so?
Try MySQL Workbench; it looks like it provides a JSON format for import.
https://dev.mysql.com/doc/workbench/en/wb-admin-export-import.html
I guess you could convert your JSON into CSV and import that into MySQL as described here: Importing JSON into Mysql
However, I am not really sure how that is connected to Django (since you mentioned it in your question's title). It might be helpful if you described in more detail what you want to do.
You can store data with a very dynamic schema in Firebase. MySQL is a traditional RDBMS, so you have to analyze the structure of your dataset and decide which parts you want to convert to relational format.
There's good news, however: there are several packages which support JSON fields on models: https://www.djangopackages.com/grids/g/json-fields/
There's a MySQL-specific one as well: https://django-mysql.readthedocs.org/en/latest/model_fields/json_field.html
Such JSON data is queryable and updatable, so if your dataset has some parts with a varying schema, you can store those in JSON format.
Starting from MySQL 5.7, JSON data is even natively supported by the database. Read more here:
https://django-mysql.readthedocs.org/en/latest/model_fields/json_field.html
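A minimal sketch of that approach, assuming the JSONField from the django-mysql package mentioned above; the model name, field names, and file path are made up for illustration:

```python
# models.py: a hypothetical model that keeps the stable part of the Firebase
# export in regular columns and the varying part in a JSON field
from django.db import models
from django_mysql.models import JSONField  # from the django-mysql package


class FirebaseRecord(models.Model):
    # Fields that are always present go into normal relational columns
    key = models.CharField(max_length=255, unique=True)
    imported_at = models.DateTimeField(auto_now_add=True)

    # Everything with a varying schema stays as queryable, updatable JSON
    payload = JSONField(default=dict)


# Loading the exported file could then look like this (path is a placeholder):
#   import json
#   with open("firebase_export.json") as f:
#       data = json.load(f)
#   FirebaseRecord.objects.bulk_create(
#       FirebaseRecord(key=k, payload=v) for k, v in data.items()
#   )
```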