From Kafka JSON message to Snowflake table

I am trying to implement a Snowflake Sink connector so that I can load messages coming into a Kafka topic directly into an appropriate Snowflake table. So far, I could only get to the point of loading the raw JSON into a table with two columns (RECORD_METADATA and RECORD_CONTENT). My goal is to load the JSON messages directly into an appropriate table by flattening them. I have the structure of what the table should be, so I could create that table and load directly into it. But I need a way for the load process to flatten the messages.
I have been looking online and through the documentation, but haven't found a clear way to do this.
Is it possible or do I have to first load the raw json and then do transformations to get the table that I want?
Thanks

You have to load the raw JSON first, then you can do transformations.
Each Kafka message is passed to Snowflake in JSON format or Avro format. The Kafka connector stores that formatted information in a single column of type VARIANT. The data is not parsed, and the data is not split into multiple columns in the Snowflake table.
For more information you can read here
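As one illustration (not necessarily your exact setup): once the connector has landed the raw messages, the flattening can be done in Snowflake SQL against the VARIANT column. The table and field names below (KAFKA_RAW, ORDERS_FLAT, order_id, customer.name, amount) are hypothetical placeholders.
-- Sketch: flatten the raw VARIANT payload into a typed table
CREATE OR REPLACE TABLE ORDERS_FLAT AS
SELECT
    RECORD_CONTENT:order_id::NUMBER      AS ORDER_ID,
    RECORD_CONTENT:customer.name::STRING AS CUSTOMER_NAME,
    RECORD_CONTENT:amount::FLOAT         AS AMOUNT,
    RECORD_METADATA:offset::NUMBER       AS KAFKA_OFFSET
FROM KAFKA_RAW;
-- For nested arrays, LATERAL FLATTEN(input => RECORD_CONTENT:items) gives one row per array element.
If this has to run continuously rather than as a one-off, a stream on the raw table plus a scheduled task is one way to keep the flattened table up to date.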

Related

What is the best way to store JSON data in PostgreSQL?

I have some JSON data which I am getting from a particular API. I am using PostgreSQL as the database. What is the best way to store the JSON data? Using a row/column format, or saving the complete JSON data in a single field of jsonb type?
From my experience with Django using PostgreSQL: I usually store the raw JSON in a single field.
Next, I parse it according to my needs.
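As a minimal sketch of the single-field approach (table and key names are illustrative):
-- store the raw API payload in a jsonb column
CREATE TABLE api_events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);
-- parse on demand; ->> extracts a field as text
SELECT payload->>'user_id' AS user_id
FROM api_events
WHERE payload->>'status' = 'active';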

Need to parse protocol-buffer data

I have never used protocol buffers before. I am getting data in protocol buffer format and need to parse it, then query the data from a MySQL database to show in a view using Node.js.
Can anyone help me with any links or a solution to retrieve and parse data from protocol buffers?

Insert JSON into Hadoop

I have a lot of data (JSON string) per day (around 150-200B).
I want to insert the JSON to Hadoop, what is the best way to do it (I need a fast insert and a fast query on JSON fields)?
Do I need to use Hive and create an Avro schema for my JSON? Or do I need to insert the JSON as a string into a specific column?
If you want to make the data available in Hive to perform mostly aggregations on top of it, I would suggest one of the following methods using Spark.
If you have multi-line JSON files:
val df = spark.read.json(sc.wholeTextFiles("hdfs://your/hdfs/path/*.json").values)
df.write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")
If you have single-line JSON files:
val df = spark.read.json("hdfs://your/hdfs/path/*.json")
df.write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")
Spark will automatically infer the table schema for you. If you are using the Cloudera distribution you will be able to read the data using Impala (depending on your Cloudera version, it may not support complex structures).
I want to insert the JSON to Hadoop
You just put it in HDFS... Since you have data over a time period, you'll want to create partitions for Hive to read
jsondata/dt=20180619/foo.json
jsondata/dt=20180620/bar.json
Do I need to use hive and create Avro scheme to my JSON?
Nope. Not sure where you got mixed up between Avro and JSON. Now, if you could convert the JSON into Avro with a defined schema, that would help improve Hive queries, since querying structured binary is better than parsing JSON text.
do I need to insert the JSON as a string to a specific column?
Not recommended. You could, but then you cannot query it via Hive's JSON SerDe support.
Don't forget that with the above structure you'll need PARTITIONED BY (dt STRING). And in order for partitions to be created on the table for existing files, you'll need to manually (and daily) run an MSCK REPAIR TABLE command.
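A rough sketch of the corresponding table definition (the column names foo/bar and the location are placeholders, and it assumes the hive-hcatalog JsonSerDe is available in your Hive installation):
CREATE EXTERNAL TABLE jsondata (
    foo STRING,
    bar BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/path/to/jsondata';
-- register partitions for files already sitting under .../dt=YYYYMMDD/
MSCK REPAIR TABLE jsondata;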
I have JSON as a string (from Kafka)
Don't use Spark for that (at least, don't reinvent the wheel). My suggestion would be to use Confluent's HDFS Kafka Connect sink, which comes with Hive table creation support.
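As a rough sketch (not a complete config), a standalone worker properties file for that HDFS sink could look like the following; the topic, hostnames and database name are placeholders, and depending on your connector version and message format you may also need converter and format settings:
name=hdfs-hive-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=your_topic
hdfs.url=hdfs://your-namenode:8020
flush.size=1000
hive.integration=true
hive.metastore.uris=thrift://your-metastore:9083
hive.database=yourhivedb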

Connect to a Cassandra NoSQL DB and get the response as JSON

In one of our projects, we have to get the data from Cassandra tables and return it in JSON format in the response. What are the possible ways to do this? Sometimes we need to get the data from more than one Cassandra table. What are the possible ways available for that?
Especially, what are the ways to connect to Cassandra?
You can query your data and retrieve a JSON string with the following type of queries:
SELECT JSON keyspace_name, durable_writes FROM system_schema.keyspaces ;
This will return a JSON string that maps the keys (column names) to the corresponding values.
See the doc here: http://cassandra.apache.org/doc/latest/cql/json.html
Then you could re-insert the JSON string into Cassandra, if that's what you want.
If you need to do that at scale, or as a streaming job, you would want to look at using Spark on top of Cassandra: load your Cassandra data into Spark, use Spark to transform it into a JSON string, and re-insert it into Cassandra or another DB.
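For the re-insert mentioned above, CQL also accepts JSON directly; a minimal sketch against a hypothetical table whose columns match the JSON keys:
INSERT INTO my_keyspace.my_table JSON '{"id": 1, "name": "example"}';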

DataFrames reading JSON files with changing schema

I am currently reading JSON files which have a variable schema in each file. We are using the following logic to read the JSON: first we read a base schema file which has all the fields, and then read the actual data. We use this approach because the schema is inferred from the first file read, but we do not get all the fields in the first file itself. So we are just tricking the code into understanding the schema first and then reading the actual data.
val rdd = sc.textFile("baseSchemaWithAllColumns.json").union(sc.textFile("pathToActualFile.json"))
val df = sqlContext.read.json(rdd)
// Create a DataFrame, then save it as a temp table and query it
I know the above is just a workaround and we need a cleaner solution to accept JSON files with varying schemas.
I understand that there are two other ways to determine the schema, as mentioned here.
However, for that it looks like we need to parse the JSON and map each field to the data received.
There seems to be an option for Parquet schema merging, but that looks like it applies mostly when reading into the DataFrame - or am I missing something here?
What is the best way to read JSON files with a changing schema and work with Spark SQL for querying?
Can I just read the JSON files as they are, save them as a temp table, and then use mergeSchema=true while querying?