I am currently trying to store my deep learning models from TensorFlow and Keras in a graph database called ArangoDB. Like most document databases, ArangoDB stores its data as JSON. I may be willing to switch to HDFS, but either way TensorFlow and Keras insist on using the binary HDF5 format from The HDF Group to store their weights.
How can I convert these to JSON using Python so they can be stored in the DB, and how can I convert them back so they can be loaded again in TensorFlow?
https://machinelearningmastery.com/save-load-keras-deep-learning-models/ shows how you can save a model's architecture as JSON with Keras. Once you have the model in JSON form, you can persist it.
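To round-trip the weights as well (not just the architecture), you can pull them out as NumPy arrays and serialize them as nested lists. A minimal sketch; the model here is a throwaway example and the document keys ("architecture", "weights") are just illustrative names:

```python
import json
import numpy as np
from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import Dense

# Any Keras model works here; a tiny one keeps the example short.
model = Sequential([Dense(4, activation="relu", input_shape=(8,)), Dense(1)])

# Serialize: architecture as JSON, weights as nested Python lists.
doc = {
    "architecture": json.loads(model.to_json()),            # layer/config graph
    "weights": [w.tolist() for w in model.get_weights()],   # numpy arrays -> lists
}
json_blob = json.dumps(doc)  # this string can be stored as an ArangoDB document

# Deserialize: rebuild the architecture, then restore the weights.
loaded = json.loads(json_blob)
restored = model_from_json(json.dumps(loaded["architecture"]))
restored.set_weights([np.array(w) for w in loaded["weights"]])
```

Keep in mind that JSON is a verbose encoding for large weight tensors; for big models it may be more practical to store the HDF5 file elsewhere and keep only metadata in the graph database.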
The following does not store a model but stores TFX artifacts in ArangoDB.
https://github.com/arangoml/arangopipe/blob/master/arangopipe/tests/TFX/tfx_metadata_integration.ipynb
Related
I am fetching JSON data from different APIs. I want to store it in HDFS and then use it in MongoDB.
Do I need to convert it to Avro, SequenceFile, Parquet, etc., or can I simply store it as plain JSON and load it into the database later?
I know that if I convert it to another format it will be better distributed and compressed, but how would I then load an Avro file into MongoDB? MongoDB only accepts JSON. Would I need another step to read the Avro and convert it back to JSON?
How large is the data you're fetching? If it's less than 128MB (with or without compression) per file, it really shouldn't be in HDFS.
To answer the question, format doesn't really matter. You can use SparkSQL to read any Hadoop format (or JSON) to load into Mongo (and vice versa).
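For example, a minimal PySpark sketch that reads JSON (from HDFS or anywhere else) and writes it to MongoDB. It assumes the MongoDB Spark connector is on the classpath (3.x naming shown; newer connector versions use the "mongodb" format name and different config keys), and the paths, URI, and database/collection names are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the session was started with the MongoDB Spark connector available,
# e.g. --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
spark = (SparkSession.builder
         .appName("json-to-mongo")
         .config("spark.mongodb.output.uri", "mongodb://localhost:27017/mydb.mycoll")
         .getOrCreate())

# Spark infers the schema from the JSON records.
df = spark.read.json("hdfs:///data/api_dumps/*.json")

# Write the DataFrame to MongoDB; the reverse direction works with spark.read.
(df.write
   .format("mongo")   # "mongodb" in newer connector versions
   .mode("append")
   .save())
```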
Or you can write the data first to Kafka, then use a process such as Kafka Connect to write to both HDFS and Mongo at the same time.
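As a sketch of the Kafka Connect route: you register one sink connector per destination against Connect's REST API, so both HDFS and MongoDB consume the same topic. This assumes the Confluent HDFS sink and the MongoDB sink connector plugins are installed, and every name and URL below is a placeholder:

```python
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST endpoint

hdfs_sink = {
    "name": "api-json-to-hdfs",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "api-json",
        "hdfs.url": "hdfs://namenode:8020",
        "flush.size": "10000",
    },
}

mongo_sink = {
    "name": "api-json-to-mongo",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "api-json",
        "connection.uri": "mongodb://localhost:27017",
        "database": "mydb",
        "collection": "api_json",
    },
}

# Both sinks read the same topic, so HDFS and MongoDB stay in sync.
for connector in (hdfs_sink, mongo_sink):
    resp = requests.post(CONNECT_URL, json=connector)
    resp.raise_for_status()
```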
We are planning to migrate our DB to the Azure Cosmos DB graph API, and we are using this bulk import tool (linked below). Nowhere does it mention the JSON input format.
What is the JSON format for bulk import into an Azure Cosmos graph DB?
https://github.com/Azure-Samples/azure-cosmosdb-graph-bulkexecutor-dotnet-getting-started
Appreciate any help.
You actually don't need to build Gremlin queries to insert your edges. In Cosmos DB, everything is stored as a JSON document (even the vertices and edges in a graph collection).
The format of the required JSON isn't officially published and can change at any time, but it can be discovered through inspection of the SDKs.
I wrote about it here a while ago and it is still valid today.
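For illustration only, the underlying documents look roughly like the following, sketched as Python dicts. These field names reflect what the SDKs have been observed to emit; they are not an official contract and may differ across versions, so treat this as an assumption rather than a spec:

```python
# Rough shape of a graph vertex and edge as raw Cosmos DB documents.
vertex = {
    "id": "person-1",
    "label": "person",
    "name": [{"id": "name-prop-1", "_value": "Alice"}],  # properties stored as arrays
    "pk": "person-1",          # whatever your partition key property is named
}

edge = {
    "id": "knows-1",
    "label": "knows",
    "_isEdge": True,
    "_vertexId": "person-1",   # source vertex id
    "_vertexLabel": "person",
    "_sink": "person-2",       # target vertex id
    "_sinkLabel": "person",
    "pk": "person-1",
}
```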
I am using a Spark cluster, and I want to write a string to a file and save it on the master node, using Scala.
I looked at this topic and tried some of the suggestions, but I can't find the saved file.
You would just need to call collect on the RDD to bring the collection back to the driver. Then you can write it to the file system on the driver.
That said, for performance it is usually better to write the output in a parallelized fashion rather than collecting everything to the driver.
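The question is in Scala, but the same idea in a short PySpark sketch (the paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-to-driver").getOrCreate()
rdd = spark.sparkContext.parallelize(["line one", "line two", "line three"])

# collect() pulls the whole RDD back to the driver, so this only works
# when the data comfortably fits in driver memory.
lines = rdd.collect()
with open("/tmp/output.txt", "w") as f:   # local path on the driver node
    f.write("\n".join(lines))

# The parallel alternative: each partition writes its own part file.
rdd.saveAsTextFile("hdfs:///tmp/output_parallel")
```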
I am currently working with Spark SQL and am considering using data contained within JSON datasets. I am aware of the .jsonFile() method in Spark SQL.
I have two questions:
What is the general strategy used by Spark SQL .jsonFile() to parse/decode a JSON dataset?
What are some other general strategies to parsing/decoding JSON datasets?
(An example of the kind of answer I'm looking for: the JSON file is read into an ETL pipeline and transformed into a predefined data structure.)
Thanks.
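For reference, the .jsonFile() call has since been superseded by spark.read.json, which infers a schema from the data and then parses each record into rows of that schema. A minimal sketch with a placeholder path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-schema-inference").getOrCreate()

# Spark scans the JSON records (by default all of them) to infer a schema,
# then decodes each line into a Row of that schema.
df = spark.read.json("hdfs:///data/events/*.json")
df.printSchema()   # shows the inferred struct types
df.show(5)

# Supplying an explicit schema skips the inference pass and is faster:
# df = spark.read.schema(my_schema).json("hdfs:///data/events/*.json")
```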
I have a Kinesis stream producing JSON and want to use Storm to write it to S3 in Parquet format. This approach requires converting JSON --> Avro --> Parquet during stream processing. It also means I have to deal with schema evolution myself, continually updating the Avro schema and the Java classes generated from the .avsc files.
Another option is to write the JSON directly to S3 and use Spark to convert the stored files to Parquet. Spark can take care of schema evolution in that case.
I would like to hear the pros and cons of both approaches. Also, is there a better approach that can deal with schema evolution in a JSON --> Avro --> Parquet conversion pipeline?
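For the second option, a minimal PySpark sketch of the JSON-in-S3 to Parquet conversion; the bucket names and paths are placeholders, and mergeSchema is the option usually involved when reading Parquet files written under an evolving schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read the raw JSON dumped by the stream; Spark infers the schema,
# so new optional fields simply show up as new nullable columns.
df = spark.read.json("s3a://my-bucket/raw-json/dt=2019-01-01/")

# Rewrite as Parquet, partitioned however makes sense for downstream queries.
df.write.mode("append").parquet("s3a://my-bucket/parquet/events/")

# When reading Parquet written across schema versions, mergeSchema
# reconciles the differing file footers into one unified schema.
merged = spark.read.option("mergeSchema", "true").parquet("s3a://my-bucket/parquet/events/")
merged.printSchema()
```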