Insert JSON into Hadoop - json

I have a lot of data (JSON strings) per day (around 150-200B).
I want to insert the JSON into Hadoop. What is the best way to do it? I need fast inserts and fast queries on JSON fields.
Do I need to use Hive and create an Avro schema for my JSON? Or do I need to insert the JSON as a string into a specific column?

If you want to make the data available in Hive, mostly to run aggregations on top of it, I would suggest one of the following methods using Spark.
If you have multi-line JSON files:
// wholeTextFiles reads each file as one string, so a JSON document may span several lines
val df = spark.read.json(sc.wholeTextFiles("hdfs://your/hdfs/path/*.json").values)
df.write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")
If you have single-line JSON files (one JSON document per line):
val df = spark.read.json("hdfs://your/hdfs/path/*.json")
df.write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")
Spark will automatically infer the table schema for you. If you are using the Cloudera distribution, you will be able to read the data using Impala (depending on your Cloudera version, it may not support complex structures).

I want to insert the JSON into Hadoop
You just put it in HDFS... Since you have data over a time period, you'll want to create partitions for Hive to read:
jsondata/dt=20180619/foo.json
jsondata/dt=20180620/bar.json
Do I need to use Hive and create an Avro schema for my JSON?
Nope. Not sure where you got mixed up between Avro and JSON. Now, if you could convert the JSON into defined Avro with a schema, then that would help improve Hive queries since querying structured binary is better than parsing JSON text.
Do I need to insert the JSON as a string into a specific column?
Not recommended. You could, but then you cannot query it via Hive's JSON SerDe support.
Don't forget that with the above structure you'll need PARTITIONED BY (dt STRING). And for partitions to be created on the table for existing files, you'll need to manually (and daily) run an MSCK REPAIR TABLE command.
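As a rough illustration of the layout above, here is a minimal Python sketch that builds the dt=YYYYMMDD partition path for a given day and the repair statement to run afterwards. The helper names and the table name are hypothetical; only the path layout and the MSCK REPAIR TABLE command come from the answer.

```python
from datetime import date


def partition_path(base_dir: str, day: date, filename: str) -> str:
    """Build a Hive-style partition path such as jsondata/dt=20180619/foo.json."""
    return f"{base_dir}/dt={day.strftime('%Y%m%d')}/{filename}"


def repair_statement(table: str) -> str:
    """HiveQL that registers partition directories already present in HDFS.

    The table name is not escaped here; pass trusted identifiers only.
    """
    return f"MSCK REPAIR TABLE {table}"


if __name__ == "__main__":
    # "jsondata" as a table name is an assumption made for this example.
    print(partition_path("jsondata", date(2018, 6, 19), "foo.json"))
    print(repair_statement("jsondata"))
```

The generated path is where the day's file would be uploaded in HDFS; the statement would then be run once per day, after the new directory exists.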
I have JSON as a string (from Kafka)
Don't use Spark for that (at least, don't reinvent the wheel). My suggestion would be to use Confluent's HDFS Kafka Connect, which comes with Hive table creation support.

Related

From Kafka json message to Snowflake table

I am trying to implement a Snowflake Sink connector so that I can load messages arriving on a Kafka topic directly into an appropriate Snowflake table. So far, I could only get to the point of loading the raw JSON into a table with two columns (RECORD_METADATA and RECORD_CONTENT). My goal is to load the JSON messages directly into an appropriate table by flattening them. I have the structure of what the table should be, so I could create a table and load directly into that. But I need a way for the load process to flatten the messages.
I have been looking online and through the documentation, but haven't found a clear way to do this.
Is it possible or do I have to first load the raw json and then do transformations to get the table that I want?
Thanks
You have to load the raw JSON first; then you can do transformations.
Each Kafka message is passed to Snowflake in JSON format or Avro format. The Kafka connector stores that formatted information in a single column of type VARIANT. The data is not parsed, and the data is not split into multiple columns in the Snowflake table.
For more information, see the Snowflake documentation for the Kafka connector.
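The transformation step after the raw load could look roughly like the sketch below, which generates a SELECT that extracts typed columns from the RECORD_CONTENT VARIANT column using Snowflake's path (:) and cast (::) notation. The function name, the table name raw_events and the field names are assumptions made for illustration, and the sketch only handles top-level fields.

```python
def flatten_select(source_table: str, columns: dict) -> str:
    """Build a SELECT that pulls typed columns out of RECORD_CONTENT.

    `columns` maps a top-level JSON field name to the Snowflake type it
    should be cast to. Identifiers are not escaped, so pass trusted names only.
    """
    parts = [
        f"RECORD_CONTENT:{field}::{sql_type} AS {field}"
        for field, sql_type in columns.items()
    ]
    return f"SELECT {', '.join(parts)} FROM {source_table}"


if __name__ == "__main__":
    # Hypothetical raw table and fields.
    print(flatten_select("raw_events", {"id": "NUMBER", "name": "STRING"}))
```

The resulting query could feed an INSERT INTO the target table or back a view over the raw table, depending on whether the flattened data needs to be materialized.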

load json files in google cloud storage into big query table

I am trying to do it with the client library for Python.
The problem I am facing is that the timestamps in the JSON files are in Unix epoch format, and BigQuery can't detect that:
according to documentation:
so I wonder what to do?
I thought about changing the JSON format manually before loading it into the BigQuery table.
Or maybe there is an automatic conversion on the BigQuery side?
I searched across the internet and could not find anything useful yet.
Thanks in advance for any support.
You have 2 solutions
Either you update the format before the BigQuery integration
Or you update the format after the BigQuery integration
Before
Before means updating your JSON (manually or by script), or updating it in the process that loads the JSON into BigQuery (like Dataflow).
I personally don't like this; file handling is never fun or efficient.
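For reference, the "before" option could be scripted along these lines: a minimal Python sketch that rewrites one line of newline-delimited JSON so the epoch value becomes a timestamp string. It assumes the value is in whole seconds (not milliseconds) and that the field is named created; both are assumptions, as is the helper name.

```python
import json
from datetime import datetime, timezone


def convert_epoch_field(line: str, field: str) -> str:
    """Replace an epoch-seconds value with a 'YYYY-MM-DD HH:MM:SS UTC' string."""
    record = json.loads(line)
    ts = datetime.fromtimestamp(record[field], tz=timezone.utc)
    record[field] = ts.strftime("%Y-%m-%d %H:%M:%S UTC")
    return json.dumps(record)


if __name__ == "__main__":
    # "created" is a hypothetical field name.
    print(convert_epoch_field('{"id": 1, "created": 1529366400}', "created"))
```

Applied to every line of a file before upload, this leaves the other fields untouched and only changes the timestamp representation.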
After
In this case, you let BigQuery load your JSON file into a temporary table, with the UNIX timestamp loaded as a Number or a String. Then, run a query on this temporary table, convert the field into the correct timestamp format, and insert the data into the final table.
This way is smoother and easier (a simple SQL query to write). However, it implies the cost of reading all the loaded data (and then writing it).
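The "after" query might be generated as in the following sketch, which uses BigQuery's TIMESTAMP_SECONDS function to convert the staged value. It assumes the staging column was loaded as an integer number of seconds (a string column would need a CAST to INT64 first); the function name, table names and column names are all hypothetical.

```python
def conversion_query(staging_table, final_table, ts_field, other_fields):
    """Build an INSERT ... SELECT that converts epoch seconds to TIMESTAMP.

    Assumes `ts_field` is an INT64 column holding seconds since the epoch.
    Identifiers are not escaped, so pass trusted names only.
    """
    cols = ", ".join(other_fields)
    return (
        f"INSERT INTO {final_table} ({cols}, {ts_field}) "
        f"SELECT {cols}, TIMESTAMP_SECONDS({ts_field}) AS {ts_field} "
        f"FROM {staging_table}"
    )


if __name__ == "__main__":
    print(conversion_query("mydataset.staging", "mydataset.events",
                           "created", ["id", "name"]))
```

The resulting string would then be submitted as a regular query job, for example through the Python client library mentioned in the question.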

What is the better way to store json data in postgresql?

I have some JSON data which I am getting from a particular API. I am using PostgreSQL as the database. What is the best way to store the JSON data: in row/column format, or by saving the complete JSON in a single field of type jsonb?
From my experience with Django and PostgreSQL: I usually store the raw JSON in a single field.
Next, I parse it according to my needs.

Connect Cassandra NoSQL DB and get the response as JSON response

In one of our projects, we have to get data from Cassandra tables and return it in JSON format in the response. What are the possible ways to do this? Sometimes we need to get the data from more than one Cassandra table. What are the possible ways to do that,
and especially, what are the ways to connect to Cassandra?
You can query your data and retrieve a JSON string with the following type of queries:
SELECT JSON keyspace_name, durable_writes FROM system_schema.keyspaces ;
This will return a JSON string that maps each key (column name) to its corresponding value.
See the doc here: http://cassandra.apache.org/doc/latest/cql/json.html
Then you could re-insert the json string in Cassandra, if that's what you want.
If you need to do that at scale, or as a streaming job, you would want to look at using Spark on top of Cassandra: load your Cassandra data into Spark, use Spark to transform it into a JSON string, and re-insert it into Cassandra or another database.
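Since the question mentions reading from more than one table, here is a minimal Python sketch of how the JSON strings returned by several SELECT JSON queries could be combined into one response body. It works on plain strings and does not connect to Cassandra; the function name and the sample table names are assumptions.

```python
import json


def merge_json_rows(rows_by_table: dict) -> str:
    """Combine JSON strings returned by SELECT JSON from several tables.

    `rows_by_table` maps a table name to the list of JSON strings that
    a SELECT JSON query on that table returned (one string per row).
    """
    response = {
        table: [json.loads(row) for row in rows]
        for table, rows in rows_by_table.items()
    }
    return json.dumps(response)


if __name__ == "__main__":
    # Sample rows shaped like the output of the query above.
    rows = {
        "keyspaces": ['{"keyspace_name": "system", "durable_writes": true}'],
        "other_table": [],
    }
    print(merge_json_rows(rows))
```

In a real service the row strings would come from whichever Cassandra driver the application uses, with one query per table.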

hadoop mongodb connector read data but outputting as mysql data

Is it possible to read MongoDB data with the Hadoop connector but save the output as a MySQL table? I want to read some data from a MongoDB collection with Hadoop, process it with Hadoop, and output it not back into MongoDB but into MySQL.
I have used it to fetch data from MongoDB as input and store the result at a different MongoDB address. For that, you need to specify something like:
MongoConfigUtil.setInputURI(discussConf,"mongodb://ipaddress1/Database.Collection");
MongoConfigUtil.setOutputURI(discussConf,"mongodb://ipaddress2/Database.Collection");
For MongoDB to MySQL,
my suggestion is that you write normal Java code to insert whatever data you need into MySQL. That code can go in the map or reduce function.
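To make the insert step concrete, here is a minimal sketch, written in Python rather than the Java the answer has in mind, of turning one processed record into a parameterized INSERT statement. The %s placeholders follow the convention of common Python MySQL drivers; in Java the equivalent would be a JDBC PreparedStatement with ? placeholders. The function, table and column names are hypothetical.

```python
def insert_statement(table: str, document: dict):
    """Build a parameterized INSERT and its parameter tuple from one record.

    Column names come from the record's keys and are not escaped, so this
    is only safe when the keys are trusted; values are passed as parameters.
    """
    columns = list(document.keys())
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, tuple(document[c] for c in columns)


if __name__ == "__main__":
    sql, params = insert_statement("discussions", {"id": 7, "title": "hello"})
    print(sql)
    print(params)
```

The statement and parameters would be handed to the database driver's execute call inside the map or reduce step, ideally in batches rather than one round trip per record.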