JSON - MongoDB versioning

I am trying to use JSON for application configuration. I need some of the objects in the JSON to be created dynamically (e.g. a lookup from a SQL database). I also need to store the version history of the JSON file, since I want to be able to switch back and forth between old and new configuration versions.
My initial thought was to put the JSON in MongoDB and use placeholders for the dynamic parts of the JSON object. Can someone give guidance on whether my thinking here is correct? (I am planning to use JSON.NET to serialize/deserialize the JSON objects.) Thanks in advance.
Edit:
Example: let's assume we have two environments: env1 (v1.0.0.0) and env2 (v1.0.0.1).
**Env1**
{
id: env1 specific process id
processname: process_name_specific_to_env1
host: env1_specific_host_ip
...
threads: 10 (this is common across environments for the v1.0.0.0 release)
}
**Env2**
{
id: env2 specific process id
processname: process_name_specific_to_env2
host: env2_specific_host_ip
...
threads: 10 (this is common across environments for the v1.0.0.1 release)
queue_size: 15 (this is common across environments for the v1.0.0.1 release)
}
What I want to store is a common JSON file per version. The idea is that if I want to upgrade an environment, let's say env1, to 1.0.0.1 (from 1.0.0.0), I should be able to take the v1.0.0.1 JSON config, fill in the environment-specific data from SQL, and generate a new JSON file. This way, when moving environments from one release to another, I do not have to redo the configuration.
Example: the 1.0.0.0 JSON file
{
id: will be dynamically filled in from SQL
processname: will be dynamically filled in from SQL
host: will be dynamically filled in from SQL
...
threads: 10 (this is common across environments for the v1.0.0.0 release)
}
=> generate a new file for any environment when requested.
I hope I am being clear about what I am trying to achieve.

As you said, you need some way to include the SQL part dynamically; that means manual joins in your application. Simple IDs referring to the other table should be enough; you don't need to invent a placeholder mechanism.
Choose whichever works better for you:
MongoDB to SQL reference
MongoDB
{
"configParamA": "123", // ID of SQL row
"configParamB": "456", // another SQL ID
"configVersion": "2014-11-09"
}
SQL to MongoDB reference
MongoDB
{
"configVersion": "2014-11-09"
}
SQL
Just add a column holding the configuration ID used in MongoDB to every associated configuration row.
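To make the first option ("MongoDB to SQL reference") concrete, here is a rough JSON.NET sketch of turning a versioned template into an environment-specific config by resolving the referenced SQL rows. The LoadSqlValue helper and the parameter names are placeholders, not a prescribed schema:
using System.IO;
using Newtonsoft.Json.Linq;
// Placeholder: look up the referenced SQL row and return the value to substitute,
// e.g. SELECT value FROM config_values WHERE id = @id.
string LoadSqlValue(string sqlId) => "value-for-" + sqlId;
// Versioned template as stored in MongoDB (config references SQL IDs).
var template = JObject.Parse(@"{
    ""configParamA"": ""123"",
    ""configParamB"": ""456"",
    ""configVersion"": ""2014-11-09"",
    ""threads"": 10
}");
// Resolve each referenced ID into the environment-specific value from SQL,
// leaving version-wide settings such as threads untouched.
foreach (var param in new[] { "configParamA", "configParamB" })
    template[param] = LoadSqlValue((string)template[param]);
File.WriteAllText("env1-config.json", template.ToString());
With this, upgrading env1 from 1.0.0.0 to 1.0.0.1 is just "load the 1.0.0.1 template, resolve the references, write the new file", which matches the workflow described in the question.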

Related

Entity Framework Queries For Complicated JSON Documents (npgsql)

I am handling legacy JSON files that we are now uploading to a database that was built using code-first EF Core (with the JSON elements saved as a jsonb field in a PostgreSQL DB, represented as JsonDocument properties in the EF classes). We want to be able to query these massive documents against any of the JSON's many properties. I've been very interested in the excellent docs here https://www.npgsql.org/efcore/mapping/json.html?tabs=data-annotations%2Cpoco, but the problem in our case is that our JSON has incredibly complicated hierarchies.
According to the npgsql/EF doc above, a way to do this for "shallow" JSON hierarchies would be something like:
myDbContext.MyClass
.Where(e => e.JsonDocumentField.RootElement.GetProperty("FieldToSearch").GetString() == "SearchTerm")
.ToList();
But that only works if the field is directly under the root of the JsonDocument. If the doc is structured like, say,
{"A": {
...
"B": {
...
"C": {
...
"FieldToSearch":
<snip>
Then the above query won't work. There is an alternative of mapping our JSON to an actual POCO model, but this JSON structure (a) may change and (b) is truly massive and would result in some ridiculously complicated objects.
Right now, I'm building SQL strings from field configurations, in which I save strings that locate the fields I want using Postgres's JSON query operators.
Example:
"(JSONDocumentField->'A'->'B'->'C'->>'FieldToSearch')"
and then running that SQL against the DB using
myDbContext.MyClass.FromSqlRaw(sql).ToList();
This is hacky and I'd much rather do it in a method call. Is there a way to force JsonDocument's GetProperty call to drill down into the hierarchy to find the first/any instance of the property name in question (or another method I'm not aware of)?
Thanks!
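If the path to the field is known in advance (A → B → C in the example), one option that avoids raw SQL is chaining GetProperty calls in the LINQ query; the Npgsql provider documented at the link above is generally able to translate such chains into the corresponding -> / ->> operators. This is only a sketch under that assumption (class, property, and field names are taken from the question):
var results = myDbContext.MyClass
    .Where(e => e.JsonDocumentField.RootElement
        .GetProperty("A")
        .GetProperty("B")
        .GetProperty("C")
        .GetProperty("FieldToSearch")
        .GetString() == "SearchTerm")
    .ToList();
For a property name at an unknown depth there is, as far as I know, no GetProperty-based equivalent, so the raw-SQL route (ideally parameterized rather than string-concatenated) remains the fallback.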

Are My Data Better Suited To A CSV Import, Rather Than a JSON Import?

I am trying to force myself to use MongoDB, using the excuse of the "convenience" of it being able to accept JSON data. Of course, it's not as simple as that (it never is!).
At the moment, for this use case, I think I should revert to a traditional CSV import, and possibly a traditional RDBMS (e.g. MariaDB or MySQL). Am I wrong?
I found a possible solution in CSV DATA import to nestable json data, which seems to be a lot of faffing around.
The problem:
I am pulling some data from an online database, which returns data in blocks like this (actually it's all on one line, but I have broken it up to improve readability):
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
According to the command python -mjson.tool this is valid JSON.
But this command barfs
mongoimport --jsonArray --db=bitfinexLendingHistory --collection=fUSD --file=test.json
with
2019-12-31T12:23:42.934+0100 connected to: localhost
2019-12-31T12:23:42.935+0100 Failed: error unmarshaling bytes on document #3: JSON decoder out of sync - data changing underfoot?
2019-12-31T12:23:42.935+0100 imported 0 documents
The named DB and collection already exist.
$ mongo
> use bitfinexLendingHistory
switched to db bitfinexLendingHistory
> db.getCollectionNames()
[ "fUSD" ]
>
I realise that, at this stage, I have no <whatever the mongoDB equivalent of a column header is called in this case> defined, but I suspect the problem above is independent of that.
By wrapping the data above as shown below, I managed to get the data imported.
{
"arf":
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
}
The next step is to determine whether that is what I want and, if so, work out how to query it.
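For what it's worth, the reason --jsonArray fails on the original file is that each element of the outer array is itself an array, and MongoDB documents must be objects; the wrapped version imports because the whole thing becomes a single object/document. If the goal is one document per row, a small conversion step can name the columns and emit newline-delimited JSON that mongoimport accepts without --jsonArray. A sketch using JSON.NET (the field names are guesses, purely illustrative):
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
// Hypothetical column names; replace with whatever the five values actually mean.
var fieldNames = new[] { "id", "timestamp", "rate", "amount", "period" };
var rows = JArray.Parse(File.ReadAllText("test.json"));
using var writer = new StreamWriter("test.ndjson");
foreach (var row in rows)
{
    var doc = new JObject();
    for (var i = 0; i < fieldNames.Length; i++)
        doc[fieldNames[i]] = row[i];
    writer.WriteLine(doc.ToString(Formatting.None));
}
// Then: mongoimport --db=bitfinexLendingHistory --collection=fUSD --file=test.ndjson
// (no --jsonArray needed once each line is its own document)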

Move JSON Data from DocumentDB (or CosmosDB) to Azure Data Lake

I have a lot of JSON documents (millions of them) in Cosmos DB (earlier called DocumentDB) and I want to move them into Azure Data Lake for cold storage.
I found this document https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.documentclient.readdocumentfeedasync?view=azure-dotnet but it doesn't have any samples to start with.
How should I proceed? Any code samples are highly appreciated.
Thanks.
I suggest using Azure Data Factory to implement your requirement.
Please refer to this doc about how to export JSON documents from Cosmos DB and this doc about how to import data into ADL.
Hope it helps you.
Update Answer:
Please refer to this: Azure Cosmos DB as source; you can create a query in the pipeline.
Yeah the change feed will do the trick.
You have two options. The first one (which is probably what you want in this case) is to use it via the SDK.
Microsoft has a detailed page on how to do this, including code examples, here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#rest-apis
The second one is the Change Feed Library which allows you to have a service running at all times listening for changes and processing them based on your needs. More details with code examples of the change feed library here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#change-feed-processor
(Both links point to the same page, just different sections; it contains a link to a Microsoft GitHub repo with code examples.)
Keep in mind you will still be charged for this in terms of RU/s, but from what I've seen it is relatively low (or at least lower than what you'd pay if you start reading the collections themselves).
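If a one-off export is enough, rather than continuously processing changes, the SDK method linked in the question (DocumentClient.ReadDocumentFeedAsync) can page through the whole collection with a continuation token. A rough sketch with the older Microsoft.Azure.DocumentDB SDK; the endpoint, key, database/collection names, and the WriteToDataLakeAsync helper are placeholders:
using Microsoft.Azure.Documents.Client;
var client = new DocumentClient(new Uri("https://<account>.documents.azure.com:443/"), "<key>");
var collectionUri = UriFactory.CreateDocumentCollectionUri("<database>", "<collection>");
string continuation = null;
do
{
    // Read one page of documents (MaxItemCount per round trip).
    var feed = await client.ReadDocumentFeedAsync(collectionUri, new FeedOptions
    {
        MaxItemCount = 500,
        RequestContinuation = continuation
    });
    continuation = feed.ResponseContinuation;
    foreach (var document in feed)
    {
        // Placeholder: push each JSON document to Azure Data Lake.
        await WriteToDataLakeAsync(document.ToString());
    }
} while (!string.IsNullOrEmpty(continuation));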
You could also read the change feed via Spark. The following Python code example generates Parquet files, partitioned by load date, for the changed data. It works in an Azure Databricks notebook on a daily schedule:
# Get DB secrets
endpoint = dbutils.preview.secret.get(scope = "cosmosdb", key = "endpoint")
masterkey = dbutils.preview.secret.get(scope = "cosmosdb", key = "masterkey")
# database & collection
database = "<yourdatabase>"
collection = "<yourcollection>"
# Configs
dbConfig = {
"Endpoint" : endpoint,
"Masterkey" : masterkey,
"Database" : database,
"Collection" : collection,
"ReadChangeFeed" : "True",
"ChangeFeedQueryName" : database + collection + " ",
"ChangeFeedStartFromTheBeginning" : "False",
"ChangeFeedUseNextToken" : "True",
"RollingChangeFeed" : "False",
"ChangeFeedCheckpointLocation" : "/tmp/changefeedcheckpointlocation",
"SamplingRatio" : "1.0"
}
# Connect via Spark connector to create Spark DataFrame
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**dbConfig).load()
# set partition to current date
import datetime
from pyspark.sql.functions import lit
partition_day= datetime.date.today()
partition_datetime=datetime.datetime.now().isoformat()
# new dataframe with ingest date (=partition key)
df_part= df.withColumn("ingest_date", lit(partition_day))
# write partitioned parquet files
df_part.write.partitionBy('ingest_date').mode('append').parquet('dir')
You could also use a Logic App. A timer trigger could be used. This would be a no-code solution:
Query for documents
Loop through documents
Add to Data Lake
The advantage is that you can apply any rules before sending to Data Lake.

How to combine multiple MySQL databases using D2RQ?

I have four different MySQL databases that I need to convert into Linked Data and then run queries on the aggregated data. I have generated the D2RQ maps separately and then manually copied them together into a single file. I have read up on customizing the maps but am finding it hard to do so in my case because:
The ontology classes do not correspond to table names. In fact, most classes are column headers.
When I open the combined mapping in Protege, it generates only 3 classes (ClassMap, Database, and PropertyBridge) and lists all the column headers as instances of these.
If I import this file into my ontology, everything becomes annotation.
Please suggest an efficient way to generate a single graph that is formed by mapping these databases to my ontology.
Here is an example. I am using the EEM ontology to refine the mapping file generated by D2RQ. This is a section from the mapping file:
map:scan_event_scanDate a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanDate;
d2rq:propertyDefinitionLabel "scan_event scanDate";
d2rq:column "scan_event.scanDate";
# Manually added
d2rq:datatype xsd:int;
.
map:scan_event_scanTime a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanTime;
d2rq:propertyDefinitionLabel "scan_event scanTime";
d2rq:column "scan_event.scanTime";
# Manually added
d2rq:datatype xsd:time;
.
The ontology I am interested in has the following:
Data property: eventOccurredAt
Domain: EPCISevent
Range: datetime
Now, how should I modify the mapping file so that the date and time are two different relationships?
I think the best way to generate a single graph of your 4 databases is to convert them one by one to a Jena Model using D2RQ, and then use the Union method to create a global model.
For your D2RQ mapping file, you should carefully read The mapping language; it's not normal to have classes corresponding to columns.
If you give an example of your table structure, I can give you an illustration of a mapping file.
Good luck

How to store JSON efficiently?

I'm working on a project which needs to store its configuration in a JSON database.
The problem is how to store that database efficiently, by which I mean:
don't rewrite the whole JSON tree in a file for each modification
manage multiple read/write accesses at the same time
all of this without using a server external to the project (which is itself a server)
Take a peek at MongoDB, which uses BSON (binary JSON) to store data. http://www.mongodb.org/display/DOCS/BSON
http://www.mongodb.org/display/DOCS/Inserting
Edit 2021:
Today I would rather recommend using PostgreSQL to store JSON:
https://info.crunchydata.com/blog/using-postgresql-for-json-storage
I had an idea which fits my needs:
For the in-memory configuration, I use a JSON tree (with the jansson library)
When I need to save the configuration, I retrieve the XPath of each element in the JSON tree, use it as a key, and store the key/value pair in a BerkeleyDB database
For example:
{"test": {
    "option": true,
    "options": [ 1, 2 ]
}}
Will give the following key/value pairs:
Key | Value
-----------------+-----------
/test/option | true
/test/options[1] | 1
/test/options[2] | 2
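For illustration, the flattening step could look like this with JSON.NET (the library mentioned in the first question); the jansson version in C would follow the same recursion. Note that JToken.Path uses dot/bracket notation and 0-based array indices rather than the slash-style keys above:
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;
// Flatten a JSON document into (path, value) pairs suitable for a key/value store.
IEnumerable<KeyValuePair<string, string>> Flatten(JToken token)
{
    if (token is JValue leaf)
    {
        // Leaf node: emit its path (e.g. "test.options[0]") and its scalar value.
        yield return new KeyValuePair<string, string>(leaf.Path, leaf.ToString());
    }
    else
    {
        // Object, array, or property: recurse into the children.
        foreach (var child in token.Children())
            foreach (var pair in Flatten(child))
                yield return pair;
    }
}
var doc = JObject.Parse(@"{""test"": {""option"": true, ""options"": [1, 2]}}");
foreach (var pair in Flatten(doc))
    Console.WriteLine($"{pair.Key} = {pair.Value}");
// test.option = True
// test.options[0] = 1
// test.options[1] = 2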