How to store JSON efficiently?

I'm working on a project which needs to store its configuration in a JSON database.
The problem is how to store that database efficiently, meaning:
don't rewrite the whole JSON tree to a file on every modification
manage concurrent read and write access
all of this without using a server external to the project (which is itself a server)

Take a peek at MongoDB, which uses BSON (binary JSON) to store data. http://www.mongodb.org/display/DOCS/BSON
http://www.mongodb.org/display/DOCS/Inserting
Edit 2021:
Today I would rather recommend PostgreSQL to store JSON:
https://info.crunchydata.com/blog/using-postgresql-for-json-storage
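As a rough sketch of that approach (assuming PostgreSQL with a jsonb column and the psycopg2 driver; the table name and connection string are made up for illustration):

# Sketch: store a JSON configuration document in a PostgreSQL jsonb column.
# Table name and connection string are placeholders.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS config (id serial PRIMARY KEY, data jsonb)")
    cur.execute("INSERT INTO config (data) VALUES (%s)", [Json({"test": {"option": True}})])
    # query a nested field with the jsonb operators
    cur.execute("SELECT data->'test'->>'option' FROM config")
    print(cur.fetchone())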

I had an idea that fits my needs:
For the in-memory configuration, I use a JSON tree (with the jansson library).
When I need to save the configuration, I compute the XPath-like path of each element in the JSON tree, use it as a key, and store the key/value pairs in a Berkeley DB database.
For example :
{
  "test": {
    "option": true,
    "options": [ 1, 2 ]
  }
}
Will give the following key/value pairs :
Key | Value
-----------------+-----------
/test/option | true
/test/options[1] | 1
/test/options[2] | 2
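For illustration, a minimal Python sketch of that flattening step (the actual code uses jansson in C; the helper name flatten is made up) could look like this:

# Walk a parsed JSON tree and emit (path, value) pairs that can be stored
# as keys and values in Berkeley DB or any other key/value store.
def flatten(node, path=""):
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node, start=1):  # 1-based, as in the example above
            yield from flatten(value, f"{path}[{index}]")
    else:
        yield path, node

config = {"test": {"option": True, "options": [1, 2]}}
for key, value in flatten(config):
    print(key, value)   # /test/option True, /test/options[1] 1, /test/options[2] 2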

Related

Entity Framework Queries For Complicated JSON Documents (npgsql)

I am handling legacy (old) JSON files that we are now uploading to a database that was built using code-first EF Core (with the JSON elements saved as a jsonb field in a postgresql db, represented as JsonDocument properties in the EF classes). We want to be able to query these massive documents against any of the JSON's many properties. I've been very interested in the excellent docs here https://www.npgsql.org/efcore/mapping/json.html?tabs=data-annotations%2Cpoco, but the problem in our case is that our JSON has incredibly complicated hierarchies.
According to the npgsql/EF doc above, a way to do this for "shallow" json hierarchies would be something like:
myDbContext.MyClass
.Where(e => e.JsonDocumentField.RootElement.GetProperty("FieldToSearch").GetString() == "SearchTerm")
.ToList();
But that only works if FieldToSearch is directly under the root of the JsonDocument. If the doc is structured like, say,
{"A": {
...
"B": {
...
"C": {
...
"FieldToSearch":
<snip>
Then the above query won't work. There is an alternative to map our JSON to an actual POCO model, but this JSON structure (a) may change and (b) is truly massive and would result in some ridiculously complicated objects.
Right now, I'm building SQL strings from field configurations, in which I save the path strings needed to find the fields I want using PostgreSQL's JSON query operators.
Example:
"(JSONDocumentField->'A'->'B'->'C'->>'FieldToSearch')"
and then running that SQL against the DB using
myDbContext.MyClass.FromSqlRaw(sql).ToList();
This is hacky and I'd much rather do it in a method call. Is there a way to force JsonDocument's GetProperty call to drill down into the hierarchy to find the first/any instance of the property name in question (or another method I'm not aware of)?
Thanks!

Move JSON Data from DocumentDB (or CosmosDB) to Azure Data Lake

I have a lot of JSON documents (millions) in Cosmos DB (formerly called Document DB) and I want to move them into Azure Data Lake for cold storage.
I found this document https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.documentclient.readdocumentfeedasync?view=azure-dotnet but it doesn't have any samples to start with.
How should I proceed? Any code samples are highly appreciated.
Thanks.
I suggest using Azure Data Factory to implement your requirement.
Please refer to this doc about how to export JSON documents from Cosmos DB and this doc about how to import data into ADL.
Hope it helps you.
Update Answer:
Please refer to this: Azure Cosmos DB as source; you can create a query in the pipeline.
Yeah, the change feed will do the trick.
You have two options. The first one (which is probably what you want in this case) is to use it via the SDK.
Microsoft has a detailed page on how to do that, including code examples, here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#rest-apis
The second one is the Change Feed Library, which allows you to have a service running at all times, listening for changes and processing them based on your needs. More details, with code examples of the change feed library, are here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#change-feed-processor
(Both links point to the same page, just different sections; it contains a link to a Microsoft GitHub repo with code examples.)
Keep in mind you will still be charged for this in terms of RU/s, but from what I've seen it is relatively low (or at least lower than what you'd pay if you read the collections themselves).
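For the SDK option, a rough Python sketch (assuming the azure-cosmos v4 package; endpoint, key, database and container names are placeholders) might look like this:

# Read the Cosmos DB change feed and dump each document as a JSON line,
# ready to be uploaded to Azure Data Lake afterwards (e.g. with azcopy).
import json
from azure.cosmos import CosmosClient

client = CosmosClient("<your-endpoint>", credential="<your-key>")
container = client.get_database_client("<yourdatabase>").get_container_client("<yourcollection>")

with open("export.jsonl", "w") as out:
    # is_start_from_beginning=True replays the feed from the start of the collection
    for doc in container.query_items_change_feed(is_start_from_beginning=True):
        out.write(json.dumps(doc) + "\n")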
You could also read the change feed via Spark. The following Python code example generates Parquet files, partitioned by load date, for the changed data. It works in an Azure Databricks notebook on a daily schedule:
# Get DB secrets
endpoint = dbutils.preview.secret.get(scope = "cosmosdb", key = "endpoint")
masterkey = dbutils.preview.secret.get(scope = "cosmosdb", key = "masterkey")
# database & collection
database = "<yourdatabase>"
collection = "<yourcollection>"
# Configs
dbConfig = {
    "Endpoint" : endpoint,
    "Masterkey" : masterkey,
    "Database" : database,
    "Collection" : collection,
    "ReadChangeFeed" : "True",
    "ChangeFeedQueryName" : database + collection + " ",
    "ChangeFeedStartFromTheBeginning" : "False",
    "ChangeFeedUseNextToken" : "True",
    "RollingChangeFeed" : "False",
    "ChangeFeedCheckpointLocation" : "/tmp/changefeedcheckpointlocation",
    "SamplingRatio" : "1.0"
}
# Connect via the Spark connector to create a Spark DataFrame
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**dbConfig).load()
# set partition to current date
import datetime
from pyspark.sql.functions import lit
partition_day = datetime.date.today()
partition_datetime = datetime.datetime.now().isoformat()
# new dataframe with ingest date (= partition key)
df_part = df.withColumn("ingest_date", lit(partition_day))
# write partitioned Parquet files
df_part.write.partitionBy('ingest_date').mode('append').parquet('dir')
You could also use a Logic App with a timer trigger. This would be a no-code solution:
Query for documents
Loop through documents
Add to Data Lake
The advantage is that you can apply any rules before sending the data to Data Lake.

JSON - MongoDB versioning

I am trying to use JSON for application configuration. I need some of the objects in the JSON to be created dynamically (e.g., looked up from a SQL database). It also needs to store the version history of the JSON file, since I want to be able to switch back and forth between old and new configuration versions.
My initial thought was to put the JSON in MongoDB and use placeholders for the dynamic parts of the JSON object. Can someone give guidance on whether my thinking here is correct? (I am thinking of using JSON.NET to serialize/deserialize the JSON objects.) Thanks in advance.
Edit:
Ex: let's assume we have 2 environments: env1 (v1.0.0.0) and env2 (v1.0.0.1).
**Env1**
{
id: env1 specific process id
processname: process_name_specific_to_env1
host: env1_specific_host_ip
...
threads: 10(this is common across environments for V1.0.0.0 release)
}
**Env2**
{
id: env2 specific process id
processname: process_name_specific_to_env2
host: env2_specific_host_ip
...
threads: 10(this is common across environments for V1.0.0.1 release)
queue_size:15 (this is common across environments for V1.0.0.1 release)
}
What I want to store is a common JSON file per version. The idea is that if I want to upgrade an environment, say env1 from 1.0.0.0 to 1.0.0.1, I should be able to take the v1.0.0.1 JSON config, fill in the env-specific data from SQL, and generate a new JSON file. This way, when moving environments from one release to another, I don't have to redo the configuration.
ex: 1.0.0.0 JSON file
{
id: will be dynamically filled in from SQL
processname: will be dynamically filled in from SQL
host: will be dynamically filled in from SQL
...
threads: 10 (this is common across environments for the V1.0.0.0 release)
}
=> generate a new file for any environment when requested.
Hope I am being clear on what I am trying to achieve.
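To make the workflow concrete, a rough Python sketch of that merge step (field names and values are placeholders; the env-specific values would come from SQL) might be:

# Take the common per-version template and overlay the env-specific values
# to produce the final configuration for one environment.
version_template = {"threads": 10}               # common to all envs for this release
env_values = {                                   # per-environment values, e.g. looked up from SQL
    "id": "env1-process-id",
    "processname": "process_name_specific_to_env1",
    "host": "env1_specific_host_ip",
}
config = {**version_template, **env_values}      # merged config for env1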
As you said, you need some way to include the SQL part dynamically; that means manual joins in your application. Simple IDs referring to the other table should be enough; you don't need to invent a placeholder mechanism.
Choose which one is better for you:
MongoDB to SQL reference
MongoDB
{
"configParamA": "123", // ID of SQL row
"configParamB": "456", // another SQL ID
"configVersion": "2014-11-09"
}
SQL to MongoDB reference
MongoDB
{
"configVersion": "2014-11-09"
}
SQL
Just add a column with the configuration id, which is used in MongoDB, to every associated configuration row.
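As a rough illustration of the first option (the MongoDB-to-SQL reference), assuming the pymongo driver and a plain SQL table, with collection, table, and column names made up for the example:

# Load a versioned config document from MongoDB, then resolve the SQL-backed
# parameters with a manual join. Collection and table names are placeholders.
import sqlite3
from pymongo import MongoClient

config = MongoClient("mongodb://localhost:27017").mydb.configs.find_one({"configVersion": "2014-11-09"})

sql = sqlite3.connect("app.db")
row = sql.execute("SELECT value FROM config_params WHERE id = ?", (config["configParamA"],)).fetchone()
config["configParamA"] = row[0]  # replace the reference with the actual value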

How do I create a huge JSON file

I want to create a large file containing a big list of records from a database.
This file is used by another process.
When using XML, I don't have to load everything into memory and can just use XML::Writer.
When using JSON, we normally create a Perl data structure and use the to_json function to dump the results.
This means that I have to load everything into memory.
Is there a way to avoid it?
Is JSON suitable for large files?
Just use JSON::Streaming::Writer
Description
Most JSON libraries work in terms of in-memory data structures. In Perl, JSON
serializers often expect to be provided with a HASH or ARRAY ref containing
all of the data you want to serialize.
This library allows you to generate syntactically-correct JSON without first
assembling your complete data structure in memory. This allows large structures
to be returned without requiring those structures to be memory-resident, and
also allows parts of the output to be made available to a streaming-capable
JSON parser while the rest of the output is being generated, which may
improve performance of JSON-based network protocols.
Synopsis
my $jsonw = JSON::Streaming::Writer->for_stream($fh);
$jsonw->start_object();
$jsonw->add_simple_property("someName" => "someValue");
$jsonw->add_simple_property("someNumber" => 5);
$jsonw->start_property("someObject");
$jsonw->start_object();
$jsonw->add_simple_property("someOtherName" => "someOtherValue");
$jsonw->add_simple_property("someOtherNumber" => 6);
$jsonw->end_object();
$jsonw->end_property();
$jsonw->start_property("someArray");
$jsonw->start_array();
$jsonw->add_simple_item("anotherStringValue");
$jsonw->add_simple_item(10);
$jsonw->start_object();
# No items; this object is empty
$jsonw->end_object();
$jsonw->end_array();
$jsonw->end_property();
$jsonw->end_object();
Furthermore, there is also JSON::Streaming::Reader :)

Persisting JSON array data

I've been working with AngularJS and JSON for a while now, and I am currently writing a simple todo app that uses the following array to store its todos:
$scope.todos = [
    // todo 1
    {
        title: 'Personal',
        status: 'todo',
        // categories for todo 1
        categories: [
            {
                title: 'Shopping',
                status: 'doing',
                // items for category 1, todo 1
                items: [
                    {
                        title: 'Buy bacon',
                        status: 'complete',
                    },
                    {
                        title: 'Buy tuna',
                        status: 'doing',
                    },
                ], // /items
            },
        ], // /categories
    },
]; // /todos
So far, so good. Now what I am not sure about is how to actually store this data permanently. If I use my application to add or modify a todo, it's all nice and good until I close the browser window, and then it's all back to the default values (obviously).
Until now, I have always been working with MySQL databases to store relational data. But I was wondering if there is a better way to store this JSON data?
I was thinking of creating a simple PHP page which saves the whole array to a text file. But that would mean rewriting the whole file every time I make even the tiniest change to the data.
I've heard there are databases available that allow you to store this type of data, but I don't know where to start. Any pointers would be much appreciated.
Nothing keeps you from saving this in a relational database like MySQL: you could have entities like Todo, Category, and Item, then serialize them into JSON and serve them RESTfully.
I think what you are looking for is a NoSQL database. These can store JSON documents natively and store chunks of data instead of just rows like traditional relational databases.
Two popular NoSQL databases are
MongoDB
RethinkDB
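As a rough illustration of the NoSQL route (assuming MongoDB and the pymongo driver; database, collection, and document names are made up), the whole todos array can be stored and updated as a single document:

# Persist the todos array as one MongoDB document and update it in place.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").todoapp
todos = [{"title": "Personal", "status": "todo", "categories": []}]

# upsert the whole list under a fixed document id
db.todos.replace_one({"_id": "my-todos"}, {"_id": "my-todos", "items": todos}, upsert=True)

# read it back, e.g. to serve it to the Angular front end
doc = db.todos.find_one({"_id": "my-todos"})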
I would suggest going with a framework like Restangular to define your relations; you will then be able to use all kinds of NoSQL databases which have a RESTful JSON API, such as CouchDB or MongoDB.
It uses promises, which is nice, future-proof, and modern; it also supports all the HTTP methods you might need, and it has a lot more features than that. Take a look at the repo's readme.
Here is also a demo which uses MongoLab, a MongoDB-flavored cloud service.
Hope it helps.