Saving streaming JSON data as Parquet in S3

I have a Kinesis stream producing JSON and want to use Storm to write it to S3 in Parquet format. This approach requires a JSON --> Avro --> Parquet conversion during stream processing. It also means I have to deal with schema evolution myself and keep updating the Avro schema and the Java classes generated from the .avsc files.
Another option is to write the JSON directly to S3 and use Spark to convert the stored files to Parquet. Spark can take care of schema evolution in this case.
I would like to hear the pros and cons of both approaches. Also, is there a better approach that can deal with schema evolution in a JSON --> Avro --> Parquet conversion pipeline?
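For the second option, here is a minimal sketch of the Spark side (assuming Spark 2.x or later, an s3a-configured filesystem, and hypothetical bucket paths). Spark infers the schema from the raw JSON, and reading the Parquet output with mergeSchema reconciles the schemas written by successive batches, which is how additive schema evolution is handled here:

```scala
import org.apache.spark.sql.SparkSession

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-parquet")
      .getOrCreate()

    // Spark infers the schema from the JSON it reads, so new fields
    // added by producers simply show up as new nullable columns.
    val json = spark.read.json("s3a://my-bucket/raw-json/dt=2019-01-01/")

    // Append Parquet files; each batch is written with its own schema.
    json.write.mode("append").parquet("s3a://my-bucket/parquet/")

    // On read, mergeSchema reconciles the schemas of all Parquet files,
    // so columns added over time appear in the merged schema.
    val merged = spark.read.option("mergeSchema", "true")
      .parquet("s3a://my-bucket/parquet/")
    merged.printSchema()

    spark.stop()
  }
}
```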

Related

Storing plain JSON in HDFS to be used in MongoDB

I am fetching JSON data from different APIs. I want to store it in HDFS and then use it in MongoDB.
Do I need to convert it to Avro, SequenceFile, Parquet, etc., or can I simply store it as plain JSON and load it into the database later?
I know that if I convert it to another format it will be better distributed and compressed, but how would I then be able to upload an Avro file to MongoDB? MongoDB only accepts JSON. Should I add another step to read the Avro back and convert it to JSON?
How large is the data you're fetching? If it's less than 128MB (with or without compression) per file, it really shouldn't be in HDFS.
To answer the question: the format doesn't really matter. You can use Spark SQL to read any Hadoop format (or JSON) and load it into Mongo (and vice versa).
Or you can write the data first to Kafka, then use a process such as Kafka Connect to write to both HDFS and Mongo at the same time.
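Going back to the Spark SQL route, a minimal sketch (assuming the MongoDB Spark Connector 10.x is on the classpath; the format name and option keys differ in older connector versions, and the paths and URIs here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object HdfsJsonToMongo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-json-to-mongo")
      // Connection URI key used by the MongoDB Spark Connector 10.x.
      .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
      .getOrCreate()

    // Plain JSON stored in HDFS; Spark infers the schema.
    val df = spark.read.json("hdfs:///data/api-dumps/*.json")

    // The connector converts DataFrame rows to BSON documents,
    // so no manual Avro/JSON conversion step is needed.
    df.write
      .format("mongodb")
      .mode("append")
      .option("database", "mydb")
      .option("collection", "api_data")
      .save()

    spark.stop()
  }
}
```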

Convert JSON to Avro in Apache NiFi

How can I convert JSON to Avro in Apache NiFi?
I.e. JSON obtained via the GetTwitter processor.
Previous versions seem to have supported a ConvertJSONToAvro processor.
It looks like nowadays the ConvertRecord processor should be used instead:
that is, record-oriented processing that reads the JSON with a JSON tree reader and writes it back out as Avro.
But where and how do I specify the schema, especially for a schema as complex as the one coming from Twitter? Is NiFi somehow guessing the right schema automatically?
Edit
In fact, something rather obvious happens:
ConvertRecord Failed to process StandardFlowFileRecord will route to failure: ${schema.name} did not provide appropriate Schema Name
That is, ConvertRecord succeeds in parsing the JSON, but it fails when the Avro writer is applied. So how can I get an Avro representation of the tweets?
You should be able to infer the schema and have it translated automatically to Avro using the modern record processors. Abdelkrim Hadjidj has a great write-up about it, but to summarize:
Modern method
Use the Schema Inference capability in the JsonPathReader or JsonTreeReader implementation you're using in ConvertRecord. This will allow it to infer the schema, and then pass that along to the AvroRecordSetWriter via the schema.name and avro.schema attributes. There is also a schema inference cache (use the provided volatile implementation unless you have other requirements) to improve performance.
Old method
Use the InferAvroSchema processor to parse the incoming data and generate an Avro schema.
Note: this processor is no longer included with default builds of NiFi due to space restrictions, but you can manually build the nifi-kite-nar and then deploy it into the $NIFI_HOME/extensions/ directory to load that functionality.
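If the goal is simply an Avro copy of the tweets and doing the conversion outside NiFi is acceptable, Spark's built-in Avro support (Spark 2.4+; earlier versions need the external spark-avro package) can infer the schema from the JSON and derive the Avro schema from it, with no hand-written .avsc. A minimal sketch with hypothetical paths:

```scala
import org.apache.spark.sql.SparkSession

object TweetsJsonToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tweets-json-to-avro")
      .getOrCreate()

    // One tweet JSON object per line; Spark infers a (possibly very wide) schema.
    val tweets = spark.read.json("/data/tweets/*.json")
    tweets.printSchema()

    // Built-in Avro source in Spark 2.4+ ("avro"); the Avro schema is
    // derived from the inferred DataFrame schema.
    tweets.write.format("avro").save("/data/tweets-avro/")

    spark.stop()
  }
}
```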

Merits of JSON vs CSV file format while writing to HDFS for downstream applications

We are in the process of extracting source data (xls) and ingesting it into HDFS. Is it better to write these files in CSV or JSON format? We are contemplating choosing one of them, but before making the call we would like to understand the merits and demerits of each.
Factors we are trying to figure out are:
Performance (Data Volume is 2-5 GB)
Loading vs Reading Data
How easily metadata (structure) information can be extracted from either format (see the sketch after this list).
The ingested data will be consumed by other applications that support both JSON and CSV.
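On the metadata point, one concrete difference is how much structure each format carries in the data itself. A minimal Spark sketch (hypothetical paths; Spark is used here purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

object CompareSchemas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compare-schemas").getOrCreate()

    // JSON carries field names and nesting in the data itself,
    // so Spark recovers a typed, possibly nested schema directly.
    spark.read.json("/data/export.json").printSchema()

    // CSV is flat; column names come only from the header row and types
    // only from an explicit inference pass (or a user-supplied schema).
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/export.csv")
      .printSchema()

    spark.stop()
  }
}
```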

How does Spark SQL parse JSON?

I am currently working with Spark SQL and am considering using data contained within JSON datasets. I am aware of the .jsonFile() method in Spark SQL.
I have two questions:
What is the general strategy used by Spark SQL .jsonFile() to parse/decode a JSON dataset?
What are some other general strategies for parsing/decoding JSON datasets?
(An example of the kind of answer I'm looking for: the JSON file is read into an ETL pipeline and transformed into a predefined data structure.)
Thanks.
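For context, .jsonFile() comes from the old SQLContext API; the modern entry point is spark.read.json, which reads the JSON records and merges the fields it sees into an inferred schema, while supplying an explicit schema skips that inference pass. A minimal sketch with a hypothetical people.json:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkSqlJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql-json").getOrCreate()

    // Schema inference: Spark reads the JSON records and merges
    // the observed fields into a single schema.
    val inferred = spark.read.json("people.json")
    inferred.printSchema()

    // Supplying an explicit schema skips the inference pass entirely.
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", LongType)
    ))
    val typed = spark.read.schema(schema).json("people.json")
    typed.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```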

Why does Play Framework use JSON while MongoDB uses BSON?

I have run into a lot of trouble serializing/deserializing Scala data types to/from JSON objects and then storing/retrieving them in MongoDB in BSON form.
1st question: why does Play Framework use JSON while MongoDB uses BSON?
2nd question: If I am not wrong, JavaScript does not have readers and writers for serializing/deserializing BSON from MongoDB. How can that be? JavaScript handles JSON seamlessly, but for BSON I would expect it to need some sort of readers and writers.
3rd question: (I read somewhere that) Salat and ReactiveMongo use different mechanisms to talk to MongoDB. Why is that?
JSON is a widely used format for transferring data these days, so it is good to have it out of the box in a web framework. That is why Play has it.
MongoDB uses it for the same reason: it is a good idea to store data in the same format in which users query and save it. So why does MongoDB use BSON instead of JSON? BSON is essentially JSON with additional metadata attached to every value: its length and its type. The reason is that when you scan a lot of data (as database queries do), plain JSON forces you to read an entire object before you can move on to the next one; knowing the length of each value lets you skip past it.
So you generally do not need BSON readers in JavaScript (they exist, but are rarely used), because BSON is a format intended for use inside the database.
You can read this article for more information.
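On the original serialization pain: on the Play side, the JSON mapping for a Scala type is usually defined with play-json's macros, while the BSON mapping for MongoDB is defined separately by the driver (e.g. ReactiveMongo's handlers), which is why the two can feel duplicated. A minimal play-json sketch with a hypothetical User type:

```scala
import play.api.libs.json._

// Hypothetical Scala data type to serialize.
case class User(name: String, age: Int)

object PlayJsonExample extends App {
  // Macro-generated Reads/Writes for the case class.
  implicit val userFormat: OFormat[User] = Json.format[User]

  // Scala value -> JSON (what Play works with in the web layer).
  val js: JsValue = Json.toJson(User("Alice", 30))
  println(Json.stringify(js)) // {"name":"Alice","age":30}

  // JSON -> Scala value; validation errors surface as JsError.
  Json.parse("""{"name":"Bob","age":25}""").validate[User] match {
    case JsSuccess(user, _) => println(user)
    case JsError(errors)    => println(errors)
  }
}
```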