I am currently working with Spark SQL and am considering using data contained within JSON datasets. I am aware of the .jsonFile() method in Spark SQL.
I have two questions:
What is the general strategy used by Spark SQL .jsonFile() to parse/decode a JSON dataset?
What are some other general strategies for parsing/decoding JSON datasets?
(An example of the kind of answer I'm looking for: the JSON file is read into an ETL pipeline and transformed into a predefined data structure.)
Thanks.
I created my project in Spring 4 MVC + Hibernate with MongoDB. Now I have to convert it to Hibernate with MySQL. My problem is that I have many collections in MongoDB, in BSON and JSON format. How can I convert those files into MySQL table format? Is that possible?
MongoDB is a non-relational database, while MySQL is relational. The key difference is that the non-relational database contains documents (JSON objects) which can have a hierarchical structure, whereas the relational database expects the objects to be normalised and broken down into tables. It is therefore not possible to simply convert the BSON data from MongoDB into something which MySQL will understand. You will need to write some code that reads the data from MongoDB and then writes it into MySQL.
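To make the document/relational mismatch concrete, here is a minimal Python sketch (the document shape and table columns are entirely hypothetical) that splits one embedded document into rows for a parent table and a child table:

```python
import json

# A hypothetical MongoDB document: a customer with embedded orders.
doc = json.loads("""
{
  "_id": "c1",
  "name": "Alice",
  "orders": [
    {"sku": "A100", "qty": 2},
    {"sku": "B200", "qty": 1}
  ]
}
""")

def flatten(doc):
    """Split one hierarchical document into rows for two relational tables:
    a customers table and an orders table keyed by customer_id."""
    customer_row = {"id": doc["_id"], "name": doc["name"]}
    order_rows = [
        {"customer_id": doc["_id"], "sku": o["sku"], "qty": o["qty"]}
        for o in doc.get("orders", [])
    ]
    return customer_row, order_rows

customer, orders = flatten(doc)
```

The embedded array becomes a separate child table with a foreign key back to the parent; that normalisation step is exactly what the migration code has to encode for each collection.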
The documents in your MongoDB collections represent serialised forms of some classes (POJOs, domain objects etc.) in your project. Presumably, you read this data from MongoDB, deserialise it into its class form and use it in your project, e.g. display it to end users, use it in calculations, generate reports from it etc.
Now you'd prefer to host that data in MySQL, so you'd like to migrate it from MongoDB to MySQL, but since the persistent formats are radically different you are wondering how to do that.
Here are two options:
1. Use your application code to read the data from MongoDB, deserialise it into your classes and then write that data into MySQL using JDBC, an ORM mapping layer etc.
2. Use mongoexport to export the data from MongoDB (in JSON format) and then write some kind of adapter which is capable of mapping this data into the desired format for your MySQL data model.
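A sketch of the second option's adapter, in Python: mongoexport emits one JSON document per line, and the adapter maps each line to a statement for the target schema. The table name, columns and sample documents below are made up, and real code should use parameterised queries rather than string formatting.

```python
import json

# mongoexport emits one JSON document per line; these two lines are made up.
export_lines = [
    '{"_id": "u1", "email": "a@example.com"}',
    '{"_id": "u2", "email": "b@example.com"}',
]

def to_insert(line, table="users"):
    """Map one exported document to an INSERT statement for a (hypothetical)
    MySQL table. Use parameterised queries in real code to avoid SQL
    injection; plain string formatting is shown only for readability."""
    doc = json.loads(line)
    cols = ", ".join(doc.keys())
    vals = ", ".join("'{}'".format(v) for v in doc.values())
    return "INSERT INTO {} ({}) VALUES ({});".format(table, cols, vals)

statements = [to_insert(line) for line in export_lines]
```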
The non-functionals (especially for the read and write aspects) will differ between these approaches, but fundamentally both are quite similar; they both (1) read from MongoDB; (2) map the document data to the relational model; (3) write the mapped data into MySQL. The trickiest aspect of this flow is step (2), and since only you understand your data and your relational model, there is no tool which can magically do this for you. How would a third-party tool be sufficiently aware of your document model and your relational model to perform this transformation for you?
You could investigate a MongoDB JDBC driver, or use something like Apache Drill to facilitate JDBC queries against your MongoDB. Since these can return a java.sql.ResultSet, you would be dealing with a result format which is better suited to writing to MySQL, but it's likely that this still wouldn't match your target relational model, and hence you'd still need some form of transformation code.
I have a huge flat JSON string which has some 1000+ fields. I want to restructure the JSON into a nested/hierarchical structure based on certain business logic, without doing a lot of object-to-JSON or JSON-to-object conversions, so that performance is not affected.
What are the ways to achieve this in scala?
Thanks in advance!
I suggest you have a look at the JSON transformers provided by the play-json library. They allow you to manipulate JSON (moving fields, creating nested objects) without doing any object mapping.
Check this out: https://www.playframework.com/documentation/2.5.x/ScalaJsonTransformers
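play-json transformers are Scala, but the underlying idea they provide, regrouping flat fields into nested objects while staying on the JSON tree and never binding to classes, can be sketched in a few lines of Python. The field names and prefixes below are made up for illustration:

```python
import json

# A made-up flat document with prefixed field names.
flat = json.loads('{"user_name": "Ada", "user_age": 36, "addr_city": "London"}')

def nest(flat, prefixes=("user", "addr")):
    """Group flat keys like 'user_name' under nested objects by prefix,
    operating on plain dicts throughout -- no object mapping in between."""
    out = {}
    for key, value in flat.items():
        for p in prefixes:
            if key.startswith(p + "_"):
                out.setdefault(p, {})[key[len(p) + 1:]] = value
                break
        else:
            out[key] = value  # keys without a known prefix stay at the top level
    return out

nested = nest(flat)
```

With a transformer-style approach the 1000+ fields are moved in one pass over the tree, which is what avoids the repeated object-to-JSON round trips the question is worried about.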
I managed to connect Spark Streaming to my Kafka server, in which I have data in JSON format. I want to parse these data in order to use groupBy, as explained here: Can Apache Spark merge several similar lines into one line?
In fact, in that link the JSON data is imported from a file, which is clearly easier to handle. I didn't find something similar for a Kafka server.
Do you have any ideas about it?
Thanks and regards
It's really hard to understand what you're asking because we can't see where you are now without code. Maybe this general guidance is what you need.
Your StreamingContext can be given a foreachRDD block where you'll get an RDD. Then you can call sqlContext.read.json(inputRDD) and you will have a DataFrame which you can process however you like.
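The shape of the work done inside that foreachRDD block can be sketched in plain Python without Spark: each micro-batch is a collection of JSON strings from Kafka, and you parse each message and aggregate by key. The field names and sample messages below are made up, and a real job would use sqlContext.read.json plus DataFrame groupBy instead of this hand-rolled loop:

```python
import json
from collections import defaultdict

# A stand-in for one micro-batch of Kafka messages (field names are made up).
batch = [
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": 5}',
    '{"user": "a", "amount": 7}',
]

def parse_and_group(lines):
    """Parse each JSON message and sum amounts per user -- the same shape
    of work a read.json(...).groupBy(...).sum(...) would do per RDD."""
    totals = defaultdict(int)
    for line in lines:
        record = json.loads(line)
        totals[record["user"]] += record["amount"]
    return dict(totals)

result = parse_and_group(batch)
```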
I have exported data from Firebase into one JSON file. I want to create MySQL tables using this JSON file. Is there any way to do so?
Try MySQL Workbench. It appears to support importing JSON format.
https://dev.mysql.com/doc/workbench/en/wb-admin-export-import.html
I guess you could convert your JSON into CSV and import that into MySQL as described here: Importing JSON into Mysql
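The JSON-to-CSV step itself is a few lines of Python with the standard library; the column names and records below are hypothetical, and a uniform flat structure is assumed:

```python
import csv
import io
import json

# A made-up export: a flat JSON array of uniform records.
rows = json.loads('[{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]')

# Write the records out as CSV, ready for MySQL's LOAD DATA or an import wizard.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

Note this only works directly when the JSON is a flat array of records; nested Firebase data would need flattening first.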
However, I am not really sure how that would be connected to Django (since you mentioned it in your question's title). It might be helpful if you described in more detail what you want to do.
You can store data with a very dynamic schema in Firebase. MySQL is a traditional RDBMS; you have to analyse the structure of your dataset and decide which parts you want to convert to relational format.
There's good news, however: there are several packages which support JSON fields on models: https://www.djangopackages.com/grids/g/json-fields/
There's one MySQL specific also: https://django-mysql.readthedocs.org/en/latest/model_fields/json_field.html
Such JSON data is queryable and updatable. So if your dataset has some varying schema parts, you can store those in JSON format.
Starting from MySQL 5.7, JSON data is even natively supported by the DB. Read more here:
https://django-mysql.readthedocs.org/en/latest/model_fields/json_field.html
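As a rough illustration of the native support (the table and column names here are hypothetical), a JSON column can be created, populated and queried by path in MySQL 5.7+:

```sql
-- Hypothetical table with a native JSON column (MySQL 5.7+).
CREATE TABLE events (
  id INT PRIMARY KEY AUTO_INCREMENT,
  payload JSON
);

INSERT INTO events (payload) VALUES ('{"type": "signup", "plan": "free"}');

-- Extract a field by JSON path.
SELECT payload->>'$.type' FROM events;
```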
I have a Kinesis stream producing JSON and want to use Storm to write it to S3 in Parquet format. This approach will require a JSON --> Avro --> Parquet conversion during stream processing. I also need to deal with schema evolution in this approach and keep updating the Avro schema and the avsc-generated Java classes.
Another option is to write the JSON directly to S3 and use Spark to convert the stored files to Parquet. Spark can take care of schema evolution in this case.
I would like to hear the pros and cons of both approaches. Also, is there any other, better approach that can deal with schema evolution in the JSON --> Avro --> Parquet conversion pipeline?