Variable schema and Hive integration using Kafka - JSON

I've been searching for an answer but haven't found any similar issue or thread that could help.
The problem is that I have a Kafka topic which receives data forwarded from a topic on another Kafka cluster. The data is a continuous flow of various JSON documents, each with its own schema - only a few fields are common.
I need the data from all of them to be ingested into a single Hive table. I thought of creating a table with only one column to store the whole JSON content as a raw string, but I ultimately failed to integrate it with Hive (I was only able to move the data to HDFS, whereas I'd rather have a table receiving data directly from Kafka, since it's a continuous flow).
Unfortunately, I'm not able to alter the original topic in any way. Does anyone have an idea how to deal with this?
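Not an answer from the thread, just one possible sketch: Hive 3's Kafka storage handler can expose a topic directly as an external table, so you could declare only the handful of fields common to every message and let the default JsonSerDe ignore keys it does not recognise. The broker, topic, column names and PyHive connection details below are placeholders.

```python
# Sketch only: map the Kafka topic as an external Hive table via the Kafka
# storage handler, declaring just the fields shared by all JSON messages.
from pyhive import hive  # assumes PyHive and HiveServer2 are available

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mixed_json_events (
  event_id   STRING,
  event_time TIMESTAMP
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  'kafka.topic' = 'mixed-json-topic',
  'kafka.bootstrap.servers' = 'broker1:9092'
)
"""

conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()
cursor.execute(ddl)   # the table now reads live data straight from Kafka
cursor.close()
conn.close()
```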

Related

BigQuery to GCS JSON

I wanted to be able to store BigQuery results as JSON files in Google Cloud Storage. I could not find an out-of-the-box way of doing this, so what I had to do was:
1. Run the query against BigQuery and store the results in a permanent table. I use a random GUID to name the permanent table.
2. Read the data from BigQuery, convert it to JSON in my server-side code and upload the JSON data to GCS.
3. Delete the permanent table.
4. Return the URL of the JSON file in GCS to the front-end application.
While this works, there are some issues with it.
A. I do not believe I am benefiting from BigQuery's caching, since I am using my own permanent tables. Can someone confirm this?
B. Step 2 will be a performance bottleneck. Pulling data out of GCP just to convert it to JSON and re-upload it into GCP feels wrong. A better approach would be a cloud-native serverless function, or some other GCP data-workflow service, triggered upon creation of a new table in the dataset. What do you think is the best way to achieve this step?
C. Is there really no way to do this without using permanent tables?
Any help appreciated. Thanks.
With a persistent table, you are able to leverage BigQuery's data export to write the table in JSON format to GCS. Exporting is free, compared with reading the whole table from your server side.
There is in fact a way to avoid creating a permanent table, because every query result is already a temporary table. If you go to "Job Information" you can find the full name of that temporary table, which can then be exported as JSON to GCS. However, this is considerably more complicated than just creating a persistent table and deleting it afterwards.
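A minimal sketch of that export path with the google-cloud-bigquery Python client, following the answer above; the query, project/dataset and bucket names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Every query already writes its result to a temporary destination table,
# so there is no need to create and name a permanent table yourself.
query_job = client.query("SELECT name, value FROM `my_project.my_dataset.my_table`")
query_job.result()                  # wait for the query to finish
temp_table = query_job.destination  # reference to the temporary result table

# Export the result table straight to GCS as newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(
    temp_table,
    "gs://my-bucket/results-*.json",
    job_config=extract_config,
).result()                          # wait for the export to finish
```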

Merging dataset results in an Azure Data Factory pipeline

I am reading a JSON-formatted blob from Azure Storage. I am then using one of the values in that JSON to query a database to get more information. What I need to do is take the JSON from the blob, add the fields from the database to it, then write that combined JSON to another Azure Storage. I cannot, however, figure out how to combine the two pieces of information.
I have tried custom mapping in the copy activity for the pipeline. I have tried parameterized datasets, etc. Nothing seems to provide the results I'm looking for.
Is there a way to accomplish this using native activities and parameters (i.e. not by writing a simple utility and executing it as a custom activity)?
For this I would recommend creating a custom U-SQL job to do what you want: first look up both pieces of data you need, then do the combination in the U-SQL job and copy the results to Azure Storage.
If you are not familiar with U-SQL, this can help you:
https://saveenr.gitbooks.io/usql-tutorial/content/
These will also help you work with JSON in your job:
https://www.taygan.co/blog/2018/01/09/azure-data-lake-series-working-with-json-part-2
https://www.taygan.co/blog/2018/03/02/azure-data-lake-series-working-with-json-part-3
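If you end up doing the combination outside ADF instead (for example in a small script or an Azure Function), the merge itself is straightforward. A minimal sketch, assuming the azure-storage-blob package; the connection string, container and blob names, the "id" key and the database lookup are all placeholders:

```python
import json
from azure.storage.blob import BlobServiceClient

def lookup_extra_fields(key):
    # Hypothetical database lookup; replace with your own query logic.
    return {"customer_name": "...", "region": "..."}

service = BlobServiceClient.from_connection_string("<storage-connection-string>")

# 1. Read the source JSON blob.
source = service.get_blob_client(container="input", blob="event.json")
record = json.loads(source.download_blob().readall())

# 2. Enrich it with the fields returned by the database lookup.
record.update(lookup_extra_fields(record["id"]))

# 3. Write the combined JSON to the target container.
target = service.get_blob_client(container="output", blob="event-enriched.json")
target.upload_blob(json.dumps(record), overwrite=True)
```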

DB Design - Sending JSON to REST API which persists the data in MySQL database

So this is more of a conceptual question. There might be some fundamental concepts which I don't understand clearly so please point out any mistakes in my understanding.
I am tasked with designing a framework, and part of it is that I have a MySQL DB and a REST API which acts as the Data Access Layer. Now, the user should be able to parse various kinds of data (JSON, CSV, XML, text, source code, etc.) and send it to the REST API, which persists the data to the DB.
Question 1: Should I specify that all data sent to the REST API must be in JSON format, no matter what was parsed? This will ensure (to the best of my understanding) language independence and give the REST API a common format to deal with.
Question 2: When it comes to a data model, what should I specify? Is it a one-model-fits-all sort of thing, or is the data model subject to change based on the incoming data?
Question 3: When I think of a relational data model, foreign keys come to mind, which create the relations. Now, it might happen that some data does not contain any relation at all. For something like customer/order data the relation is easy to identify, but what if the data has no relation at all? How does the relational model fit in then?
Any help/suggestion is greatly appreciated. Thank you!
EDIT:
First off, the data can be both structured (say XML) and unstructured (say two text files). I want the DAL to be able to handle and persist whatever data that comes in (that's why I thought of a REST interface in front of the DB).
Secondly, I also just recently thought about MongoDB as an option and have been looking into it (I have never used NoSQL DBs before). It kind of makes sense to use it if the incoming data sent to the REST API is JSON. From what I understood, I can create a collection in Mongo. Does that make more sense than using a relational DB?
Finally, as to what I want to do with the data: I have a tool which performs a sort of difference analysis (think git diff) on it. Say I send two XML files; the tool retrieves them from the DB, performs the difference analysis, and stores the result back in the DB.
Based on these requirements, what would be the optimum way to go about it?
The answer to this will depend on what sort of data it is. Are all these different data types using different notations for the same data? If so, then storing it in normalised database tables is the way to go. If it's just arbitrary strings that happen to have some form of encoding, then it's probably best to store them raw.
Again, it depends on what you want to do with it afterwards. Are you analysing the data and reporting on it? Are you reading one format and converting to another? Is it all some form of key-value pairs in one notation or another?
No way to answer this further without understanding what you are trying to achieve.
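For the "store it raw" option, here is a minimal sketch of a single-table design, assuming SQLAlchemy and MySQL 5.7+ (for the native JSON column type); the table, columns and connection string are illustrative only:

```python
from sqlalchemy import Column, Integer, String, JSON, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Document(Base):
    __tablename__ = "documents"

    id = Column(Integer, primary_key=True)
    # What kind of payload this is: "json", "csv", "xml", "text", "source", ...
    format = Column(String(32), nullable=False)
    # The payload itself, wrapped into JSON by the REST layer before insertion.
    payload = Column(JSON, nullable=False)

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
Base.metadata.create_all(engine)

# Example: persist an XML document sent to the API as a raw string.
with Session(engine) as session:
    session.add(Document(format="xml", payload={"raw": "<order id='1'/>"}))
    session.commit()
```

A diff tool could then fetch two rows by id, compare their payload fields, and write the result back as another row.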

Does Spark streaming process every JSON "event" individually when reading from Kafka?

I want to use Spark Streaming to read messages in JSON format from a single Kafka topic; however, not all the events have the same schema. If possible, what's the best way to check each event's schema and process it accordingly?
Is it possible to group events in memory into several groups, each made up of events with a similar schema, and then process each group as a bulk?
I'm afraid you can't. You need to decode the JSON message somehow to identify its schema, and this has to be done in your Spark code. However, you can try to populate the Kafka message key with a different value per schema and have Spark assign partitions per key.
Object formats like Parquet and Avro are good for this reason, since the schema is available in the header. If you absolutely must use JSON, then you can do as you said and use group-by-key while casting to the object you want. If you are using large JSON objects, you will see a performance hit, since the entire JSON "file" must be parsed before any object resolution can take place.
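A minimal PySpark Structured Streaming sketch of the route-by-schema idea; the broker, topic, discriminator field ("event_type") and the two schemas below are assumptions, not taken from the thread:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("mixed-json-demo").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "mixed-json-topic")
       .load()
       .selectExpr("CAST(value AS STRING) AS json"))

# Peek at a discriminator field to decide which schema applies to each event.
tagged = raw.withColumn("event_type", get_json_object(col("json"), "$.event_type"))

order_schema = StructType([StructField("event_type", StringType()),
                           StructField("order_id", IntegerType())])
click_schema = StructType([StructField("event_type", StringType()),
                           StructField("url", StringType())])

orders = (tagged.filter(col("event_type") == "order")
          .select(from_json(col("json"), order_schema).alias("e")).select("e.*"))
clicks = (tagged.filter(col("event_type") == "click")
          .select(from_json(col("json"), click_schema).alias("e")).select("e.*"))

# Each branch can then be written to its own sink (console used here for demo).
orders.writeStream.format("console").start()
clicks.writeStream.format("console").start().awaitTermination()
```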

Creating mysql tables from json file in django

I have exported data from Firebase into one JSON file, and I want to create MySQL tables from this JSON file. Is there any way to do so?
Try MySQL Workbench. It looks like it provides a JSON format for import.
https://dev.mysql.com/doc/workbench/en/wb-admin-export-import.html
I guess you could convert your JSON into CSV and import that into MySQL, as described here: Importing JSON into Mysql
However, I am not really sure how that relates to Django (since you mentioned it in your question's title). It might be helpful if you described in more detail what you want to do.
With Firebase you can store data with a very dynamic schema. MySQL is a traditional RDBMS, so you have to analyze the structure of your dataset and decide which parts you want to convert to a relational format.
There is good news, however: there are several packages that support JSON fields on models: https://www.djangopackages.com/grids/g/json-fields/
There's also a MySQL-specific one: https://django-mysql.readthedocs.org/en/latest/model_fields/json_field.html
Such JSON data is queryable and updatable. So if your dataset has some varying schema parts, you can store those in JSON format.
Starting from MySQL 5.7, JSON data is even natively supported by the DB. Read more here:
https://django-mysql.readthedocs.org/en/latest/model_fields/json_field.html
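A minimal sketch using django-mysql's JSONField (MySQL 5.7+); the model, field names, app name and export file name are illustrative only:

```python
# models.py
from django.db import models
from django_mysql.models import JSONField

class FirebaseRecord(models.Model):
    # Promote the Firebase key to a real, indexed column...
    key = models.CharField(max_length=255, unique=True)
    # ...and keep the varying part of each record in a JSON column.
    payload = JSONField(default=dict)
```

Loading the export could then look like this, assuming the file is a single JSON object keyed by record id:

```python
import json
from myapp.models import FirebaseRecord

with open("firebase_export.json") as f:
    data = json.load(f)

FirebaseRecord.objects.bulk_create(
    FirebaseRecord(key=key, payload=value) for key, value in data.items()
)
```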