Storing plain JSON in HDFS to be used in MongoDB - json

I am fetching JSON data from different APIs. I want to store it in HDFS and then use it in MongoDB.
Do I need to convert it to Avro, SequenceFile, Parquet, etc., or can I simply store it as plain JSON and load it into the database later?
I know that if I convert the data to another format it will be distributed and compressed better, but how would I then upload an Avro file to MongoDB? MongoDB only accepts JSON. Would I need another step to read the Avro files and convert them back to JSON?

How large is the data you're fetching? If it's less than 128 MB per file (the default HDFS block size), with or without compression, it really shouldn't be in HDFS.
To answer the question: the format doesn't really matter. You can use Spark SQL to read any Hadoop format (or JSON) and load it into Mongo (and vice versa).
Or you can write the data first to Kafka, then use a process such as Kafka Connect to write to both HDFS and Mongo at the same time.

Related

import large JSON file into mongo

Due to some migrations and a change of server, I have to move my Mongo database from the old server to the new one. The data also needs to be transferred, but there is a lot of it now: each file is almost 4 GB, and in total I have almost 20 files.
My problem is that when I upload to the new collections, I get a "tostring" error. I read that MongoDB has a 16 MB limit for importing a file.
How can I import JSON file into MongoDB? Thank you in advance.
If you read the documentation for mongoexport it says:
Avoid using mongoimport and mongoexport for full instance production
backups. They do not reliably preserve all rich BSON data types,
because JSON can only represent a subset of the types supported by
BSON. Use mongodump and mongorestore as described in MongoDB Backup
Methods for this kind of functionality.
Rather than using mongoexport to create a json file and then mongoimport to reimport it, you should use mongodump and mongorestore.
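The type-loss point in the quoted documentation can be seen without MongoDB at all. The sketch below is only an analogy using Python's standard json module, not the behaviour of mongoexport itself: the record and the helper names (to_plain_json, round_trip) are made up for illustration. It shows that a date and a binary value, both of which BSON can store natively, come back as plain strings after a trip through plain JSON.

```python
import json
from datetime import datetime, timezone

# A hypothetical record with two value types that plain JSON has no syntax for.
record = {
    "name": "sensor-1",
    "created": datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),  # a date
    "payload": b"\x00\x01\x02",                                      # raw bytes
}


def to_plain_json(doc):
    """Serialize to plain JSON, falling back to str() for unsupported types."""
    return json.dumps(doc, default=str)


def round_trip(doc):
    """Write to JSON text and read it back, like an export/import cycle."""
    return json.loads(to_plain_json(doc))


restored = round_trip(record)

# Both rich values come back as ordinary strings; the original types are gone.
print(type(record["created"]).__name__, "->", type(restored["created"]).__name__)
print(type(record["payload"]).__name__, "->", type(restored["payload"]).__name__)
```

A binary dump and restore avoids this because the values never pass through a text format that lacks those types.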

Why does Play Framework use JSON while MongoDB uses BSON?

I have run into a lot of trouble serializing/deserializing Scala data types to/from JSON objects and then storing them in, and reading them from, MongoDB in BSON form.
1st question: why does Play Framework use JSON while MongoDB uses BSON?
2nd question: If I am not wrong, JavaScript does not have readers and writers for serializing/deserializing BSON from MongoDB. How can this be? JavaScript can handle JSON seamlessly, but for BSON I would expect it to need some sort of readers and writers.
3rd question: (I read this somewhere) why do Salat and ReactiveMongo use different mechanisms to talk to MongoDB?
JSON is a widely used format for transferring data these days, so it is good to have it out of the box in a web framework. That is the reason Play has it.
The same reasoning applies to Mongo: it is a good idea to store data in the same shape in which users query and save it. So why does Mongo use BSON rather than JSON? BSON is a binary encoding of JSON-like documents that adds a type tag to every value and a length prefix to documents, strings, and other variable-size values. The reason is scanning: when you go through a lot of data (as a database query does), with JSON you must read an entire object to find where the next one begins. If the length is known up front, the reader can skip over it.
So you generally do not need a BSON reader in your JavaScript code (such libraries exist but are rarely used directly), because BSON is a format for use inside the database.
You can read this article for more information.
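The length-prefix argument in this answer can be made concrete with a few bytes. The sketch below hand-encodes tiny BSON documents containing a single int32 field, following the layout in the BSON specification (int32 total size, then elements, then a trailing NUL byte), and then jumps to the n-th document in a concatenated stream by reading only the size prefixes. The helper names encode_int32_doc and nth_document are invented for this example; real applications would use the BSON library shipped with a MongoDB driver rather than encoding by hand.

```python
import struct


def encode_int32_doc(name, value):
    """Encode a one-field BSON document {name: <int32 value>} by hand."""
    # Element: type byte 0x10 (int32), NUL-terminated field name, little-endian int32.
    element = b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)
    # Document: int32 total size (counting these 4 bytes), elements, trailing NUL.
    total = 4 + len(element) + 1
    return struct.pack("<i", total) + element + b"\x00"


def nth_document(buffer, n):
    """Return the raw bytes of the n-th document, skipping earlier ones unparsed."""
    offset = 0
    for _ in range(n):
        (size,) = struct.unpack_from("<i", buffer, offset)
        offset += size  # jump over the whole document using only its length prefix
    (size,) = struct.unpack_from("<i", buffer, offset)
    return buffer[offset:offset + size]


# Five documents {"n": 0} .. {"n": 4} written back to back.
stream = b"".join(encode_int32_doc("n", i) for i in range(5))

third = nth_document(stream, 2)
# The value sits after the 4-byte size, the 1-byte type tag and the 2-byte name "n\0".
(value,) = struct.unpack_from("<i", third, 4 + 1 + 2)
print(value)  # prints 2
```

With plain JSON text there is no such prefix, so finding the third object means scanning every character of the first two.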

Can you store JSON fields on Redshift?

Does Redshift support JSON fields, like PostgreSQL's json data type? If so, what do I do to use it?
You can store JSON in Amazon Redshift, within a normal text field.
There are functions available to extract data from JSON fields, but it is not an effective way to store data since it doesn't leverage the full capabilities of Redshift's column-based architecture.
See: Amazon Redshift documentation - JSON Functions
UPDATE:
Redshift now supports a column data type called SUPER, which allows storing JSON and also querying over it.
Added a link to video that further explains the new option:
https://www.youtube.com/watch?v=PR15TVZDgy4

Streaming JSON data saving as Parquet in S3

I have a Kinesis stream producing JSON and want to use Storm to write it to S3 in Parquet format. This approach requires a JSON --> Avro --> Parquet conversion during stream processing. I also need to deal with schema evolution in this approach, and keep updating the Avro schema and the Java classes generated from the avsc files.
Another option is to write the JSON directly to S3 and use Spark to convert the stored files to Parquet. Spark can take care of schema evolution in this case.
I would like to know the pros and cons of both approaches. Also, is there a better approach that can deal with schema evolution in a JSON --> Avro --> Parquet conversion pipeline?

Distributed processing of JSON in Hadoop

I want to process a ~300 GB JSON file in Hadoop. As far as I understand, a JSON file consists of a single string with data nested in it. If I parse that string using Google's Gson, won't Hadoop have to put the entire load on a single node, since the JSON is not logically divisible for it?
How do I partition the file (I can make out the partitions logically by looking at the data) so that it is processed in parallel on different nodes? Do I have to break up the file before I load it onto HDFS? Is it absolutely necessary that the JSON be parsed by one machine (or node) at least once?
Assuming you can logically split the JSON into separate components, you can accomplish this by writing your own InputFormat.
Conceptually, you can think of each logically divisible JSON component as one "line" of data, where each component contains the minimal amount of information that can be acted on independently.
You will then need to write a class extending FileInputFormat that returns each of these JSON components.
public class JSONInputFormat extends FileInputFormat<Text, JSONComponent> {...}
If you can logically divide your giant JSON into parts, do it, and save these parts as separate lines in a file (or as records in a sequence file). If you then feed this new file to Hadoop MapReduce, the mappers will be able to process the records in parallel.
So, yes, the JSON has to be parsed by one machine at least once. This preprocessing phase doesn't need to be performed in Hadoop; a simple script can do the work. Use a streaming API to avoid loading a lot of data into memory.
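As a rough illustration of such a preprocessing script, here is a minimal sketch using only Python's standard library. It assumes the giant file is a single top-level JSON array whose individual elements fit comfortably in memory; the function names (iter_array_items, array_to_ndjson) are invented for this example. It reads the input in chunks, decodes one element at a time with json.JSONDecoder.raw_decode, and writes each element on its own line so that Hadoop can split the result by line. It is not tuned for speed, it tolerates a trailing comma, and a malformed file is only reported once the rest of the input has been read.

```python
import io
import json


def iter_array_items(fp, chunk_size=65536):
    """Yield each element of a top-level JSON array, reading fp in chunks."""
    decoder = json.JSONDecoder()
    buf, eof, opened = "", False, False
    while True:
        buf = buf.lstrip()
        if not opened:
            if buf:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, opened = buf[1:], True
                continue
        elif buf.startswith("]"):
            return  # end of the array
        elif buf:
            try:
                item, end = decoder.raw_decode(buf)
                rest = buf[end:].lstrip()
            except json.JSONDecodeError:
                rest = ""  # element is incomplete so far (or malformed)
            # Accept the element only once its closing "," or "]" has been seen,
            # so a number cut in half by a chunk boundary is never emitted early.
            if rest[:1] in (",", "]"):
                yield item
                buf = rest[1:] if rest[0] == "," else rest
                continue
        # Nothing usable in the buffer yet: fetch more input.
        if eof:
            raise ValueError("truncated or malformed JSON array")
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


def array_to_ndjson(src, dst):
    """Write every array element from src as one JSON line to dst."""
    count = 0
    for item in iter_array_items(src):
        dst.write(json.dumps(item) + "\n")
        count += 1
    return count


# Small in-memory demonstration; real use would pass open text files instead.
src = io.StringIO('[{"id": 1, "tags": ["a", "b"]}, {"id": 2}, {"id": 3}]')
dst = io.StringIO()
array_to_ndjson(src, dst)
print(dst.getvalue(), end="")
```

The output file has one self-contained JSON record per line, which is the shape the default line-based input handling in MapReduce, Hive, or Spark can split across nodes.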
You might find this JSON SerDe useful. It allows Hive to read and write JSON. If it works for you, it will be a lot more convenient to process your JSON data with Hive, as you don't have to worry about a custom InputFormat that reads your JSON data and creates splits for you.