Move JSON Data from DocumentDB (or CosmosDB) to Azure Data Lake

I have a lot of JSON documents (millions) in Cosmos DB (formerly called Document DB) and I want to move them into Azure Data Lake for cold storage.
I found this document https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.documentclient.readdocumentfeedasync?view=azure-dotnet but it doesn't have any samples to start with.
How should I proceed? Any code samples would be highly appreciated.
Thanks.

I suggest using Azure Data Factory to implement your requirement.
Please refer to this doc about how to export JSON documents from Cosmos DB, and this doc about how to import data into ADL.
Hope it helps you.
Update Answer:
Please refer to this: Azure Cosmos DB as source. You can create a query in the pipeline.

Yeah, the change feed will do the trick.
You have two options. The first one (which is probably what you want in this case) is to use it via the SDK.
Microsoft has a detailed page on how to do this, including code examples, here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#rest-apis
The second one is the Change Feed Library, which allows you to have a service running at all times, listening for changes and processing them based on your needs. More details with code examples of the change feed library here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#change-feed-processor
(Both pages (which are really the same page, just different sections) contain a link to a Microsoft GitHub repo with code examples.)
Keep in mind you will still be charged for this in terms of RU/s, but from what I've seen it is relatively low (or at least lower than what you'd pay if you start reading the collections themselves).
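Whichever option you pick, the consumption pattern is the same: pull pages of changed documents, persist each page to Data Lake, and carry the continuation token forward so the next run resumes where the last one stopped. A minimal sketch of that loop in plain Python; the in-memory feed and `read_page` are stand-ins (assumptions, not SDK calls), and the real version would call the Cosmos DB SDK and upload each batch to ADL:

```python
import json

# In-memory stand-in for the change feed: pages of documents plus the
# continuation token that points at the next page (None = caught up).
FEED_PAGES = [
    ([{"id": "1"}, {"id": "2"}], "token-1"),
    ([{"id": "3"}], None),
]

def read_page(continuation):
    """Return (documents, next_continuation) for the given token."""
    index = 0 if continuation is None else int(continuation.split("-")[1])
    return FEED_PAGES[index]

def drain_change_feed(sink, continuation=None):
    """Pull pages until the feed reports no further changes."""
    while True:
        docs, continuation = read_page(continuation)
        sink.append(json.dumps(docs))  # stand-in for an ADL file upload
        if continuation is None:
            return continuation

batches = []
drain_change_feed(batches)
print(len(batches))  # 2 batches "uploaded"
```

In a real job you would also persist the last continuation token (e.g. alongside the output files), so a crash mid-run only replays the current page.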

You could also read the change feed via Spark. The following Python code example generates Parquet files, partitioned by ingest date, for the changed data. It works in an Azure Databricks notebook on a daily schedule:
# Get DB secrets
endpoint = dbutils.preview.secret.get(scope = "cosmosdb", key = "endpoint")
masterkey = dbutils.preview.secret.get(scope = "cosmosdb", key = "masterkey")
# Database & collection
database = "<yourdatabase>"
collection = "<yourcollection>"
# Configs
dbConfig = {
    "Endpoint": endpoint,
    "Masterkey": masterkey,
    "Database": database,
    "Collection": collection,
    "ReadChangeFeed": "True",
    "ChangeFeedQueryName": database + collection + " ",
    "ChangeFeedStartFromTheBeginning": "False",
    "ChangeFeedUseNextToken": "True",
    "RollingChangeFeed": "False",
    "ChangeFeedCheckpointLocation": "/tmp/changefeedcheckpointlocation",
    "SamplingRatio": "1.0"
}
# Connect via the Spark connector to create a Spark DataFrame
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**dbConfig).load()
# Set the partition to the current date
import datetime
from pyspark.sql.functions import lit
partition_day = datetime.date.today()
partition_datetime = datetime.datetime.now().isoformat()
# New DataFrame with ingest date (= partition key)
df_part = df.withColumn("ingest_date", lit(partition_day))
# Write Parquet files partitioned by ingest date
df_part.write.partitionBy('ingest_date').mode('append').parquet('dir')

You could also use a Logic App. A Timer trigger could be used. This would be a no-code solution
Query for documents
Loop though documents
Add to Data Lake
The advantage is that you can apply any rules before sending to Data Lake

Related

Can Pyarrow non-legacy parquet datasets read and write to Azure Blob? (legacy system and Dask are able to)

Is it possible to read a parquet dataset from Azure Blob using the new non-legacy dataset API?
I can read and write to blob storage with the old system, where fs is an fsspec filesystem:
pq.write_to_dataset(table=table.replace_schema_metadata(),
                    root_path=path,
                    partition_cols=[
                        'year',
                        'month',
                    ],
                    filesystem=fs,
                    version='2.0',
                    flavor='spark',
                    )
With Dask, I am able to read the data using storage options:
ddf = dd.read_parquet(path='abfs://analytics/iag-cargo/zendesk/ticket-metric-events',
                      storage_options={
                          'account_name': base.login,
                          'account_key': base.password,
                      })
But when I try using
import pyarrow.dataset as ds
dataset = ds.dataset()
Or
dataset = pq.ParquetDataset(path_or_paths=path, filesystem=fs, use_legacy_dataset=False)
I run into errors about invalid filesystem URIs. I tried every combination I could think of, and tried to figure out how Dask and the legacy system can read and write files while the new one can't.
I'd like to test the row filtering and non-Hive partitioning.

Explicitly providing schema in a form of a json in Spark / Mongodb integration

When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, here is sample code showing how one can provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection which has a constant document structure. I can provide a sample JSON, but creating a sample object manually would be impossible (30k properties in a document, 1.5 MB average size). Is there a way for Spark to infer the schema just from that very JSON, and to circumvent the MongoDB connector's initial sampling, which is quite costly?
Spark is able to infer the schema, especially from sources that have one, such as MongoDB. For instance, for an RDBMS it executes a simple query returning nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling, it will read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others don't), the schema deduced by Spark may be incomplete, and in the end you may miss some data.
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
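If even a reduced samplingRatio is too expensive, another option (an assumption on my side, not something from the connector docs) is to skip inference entirely: derive the schema once from your sample JSON and pass it to the reader explicitly, since a user-supplied schema suppresses the sampling pass. Spark's `StructType.fromJson` accepts a JSON schema description, and for a flat document you can generate that description from the sample with plain Python:

```python
import json

# Map Python JSON types to Spark SQL type names (flat documents only;
# nested objects and arrays would need a recursive version).
TYPE_MAP = {str: "string", bool: "boolean", int: "long", float: "double"}

def spark_schema_json(sample_doc):
    """Build a StructType-compatible JSON dict from one sample document."""
    fields = []
    for name, value in sample_doc.items():
        data_type = TYPE_MAP.get(type(value), "string")
        fields.append({"name": name, "type": data_type,
                       "nullable": True, "metadata": {}})
    return {"type": "struct", "fields": fields}

sample = json.loads('{"name": "Bilbo", "age": 111, "retired": true}')
schema = spark_schema_json(sample)
print(schema["fields"][1]["type"])  # long
```

The resulting dict can then be loaded with `StructType.fromJson(schema)` and passed to the reader via `.schema(...)`, so the connector never has to sample your 30k-property documents.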

Recommended ways to load large csv to RDB like mysql

Aim: Build a small ETL framework to take a huge CSV and dump it into an RDB (say MySQL).
The current approach we are thinking about is to load the CSV into a DataFrame using Spark, persist it, and later use a framework like Apache Sqoop to load it into MySQL.
Need recommendations on which format to persist in, and on the approach itself.
Edit:
The CSV will have around 50 million rows with 50-100 columns.
Since our tasks involve lots of transformations before dumping into the RDB, we thought using Spark was a good idea.
Spark SQL supports writing to an RDB directly. You can load your huge CSV as a DataFrame, transform it, and call the API below to save it to the database.
Please refer to this API:
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
def saveTable(df: DataFrame,
              url: String,
              table: String,
              properties: Properties): Unit
Saves the RDD to the database in a single transaction.
Example Code:
val url: String = "jdbc:oracle:thin:@your_domain:1521/dbname"
val driver: String = "oracle.jdbc.OracleDriver"
val props = new java.util.Properties()
props.setProperty("user", "username")
props.setProperty("password", "userpassword")
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(dataFrame, url, "table_name", props)
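If Spark turns out to be heavier than needed, the same idea (stream the CSV, transform each row, and insert in batches inside one transaction) can be sketched with nothing but the standard library. The snippet below uses sqlite3 as a stand-in for MySQL, and the table and column names are invented for illustration; with MySQL you would swap in a driver such as mysql-connector and keep the `executemany` batching:

```python
import csv
import io
import sqlite3

# Stand-in CSV source; in practice this would be an open file handle.
raw = io.StringIO("id,name,amount\n1,alpha,10.5\n2,beta,20.0\n3,gamma,30.25\n")

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE payments (id INTEGER, name TEXT, amount REAL)")

def load_csv(conn, handle, batch_size=2):
    """Stream the CSV and insert rows in batches inside one transaction."""
    reader = csv.DictReader(handle)
    batch = []
    with conn:  # commits once on success, rolls back on error
        for row in reader:
            # Transformation step goes here (type casts, cleanups, ...).
            batch.append((int(row["id"]), row["name"], float(row["amount"])))
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", batch)
                batch.clear()
        if batch:  # flush the final partial batch
            conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", batch)

load_csv(conn, raw)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone()
print(total)  # (3, 60.75)
```

For 50 million rows you would raise the batch size to a few thousand; the point of the sketch is that batching plus a single transaction avoids both per-row commits and holding the whole file in memory.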

JSON - MongoDB versioning

I am trying to use JSON for application configuration. I need some of the objects in the JSON to be dynamically created (e.g. a lookup from a SQL database). It also needs to store the version history of the JSON file, since I want to go back and forth between old and new configuration versions.
My initial thought was to put the JSON in MongoDB and use placeholders for the dynamic parts of the JSON object. Can someone give guidance on whether my thinking here is correct? (I am thinking of using JSON.NET to serialize/deserialize the JSON object.) Thanks in advance.
Edit:
Ex: let's assume we have 2 environments: env1 (v1.0.0.0) and env2 (v1.0.0.1).
**Env1**
{
  id: env1-specific process id
  processname: process_name_specific_to_env1
  host: env1_specific_host_ip
  ...
  threads: 10 (common across environments for the v1.0.0.0 release)
}
**Env2**
{
  id: env2-specific process id
  processname: process_name_specific_to_env2
  host: env2_specific_host_ip
  ...
  threads: 10 (common across environments for the v1.0.0.1 release)
  queue_size: 15 (common across environments for the v1.0.0.1 release)
}
What I want to store is a common JSON file PER version. The idea is that if I want to upgrade env1 from v1.0.0.0 to v1.0.0.1, I should be able to take the v1.0.0.1 JSON config, fill in the env-specific data from SQL, and generate a new JSON. This way, when moving environments from one release to another, I do not have to redo the configuration.
ex: 1.0.0.0 JSON file
{
  id: will be dynamically filled in from SQL
  processname: will be dynamically filled in from SQL
  host: will be dynamically filled in from SQL
  ...
  threads: 10 (common across environments for the v1.0.0.0 release)
}
=> generate a new file for any environment when requested.
Hope I am being clear on what I am trying to achieve.
As you said, you need some way to include the SQL part dynamically, which means manual joins in your application. Simple IDs referring to the other table should be enough; you don't need to invent a placeholder mechanism.
Choose which one is better for you:
MongoDB to SQL reference
MongoDB
{
  "configParamA": "123", // ID of SQL row
  "configParamB": "456", // another SQL ID
  "configVersion": "2014-11-09"
}
SQL to MongoDB reference
MongoDB
{
  "configVersion": "2014-11-09"
}
SQL
Just add a column with the configuration ID (the one used in MongoDB) to every associated configuration row.
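If you do keep the placeholder idea from the question, the generation step (take the version's template, fill the env-specific placeholders from SQL) is small enough to sketch in Python. The template keys, the `{{sql:...}}` placeholder syntax, and the dict standing in for the SQL lookup are all illustrative assumptions, not an existing convention:

```python
import json

# Versioned template: placeholders mark the env-specific values,
# plain values are common to every environment for this version.
template_v1001 = {
    "id": "{{sql:process_id}}",
    "processname": "{{sql:process_name}}",
    "host": "{{sql:host_ip}}",
    "threads": 10,
    "queue_size": 15,
}

# Stand-in for the SQL lookup of environment-specific values.
ENV_DB = {
    "env1": {"process_id": "p-100", "process_name": "ingest-env1",
             "host_ip": "10.0.0.1"},
}

def render(template, env):
    """Produce a concrete config by resolving placeholders for one env."""
    values = ENV_DB[env]
    def fill(value):
        if isinstance(value, str) and value.startswith("{{sql:") \
                and value.endswith("}}"):
            return values[value[6:-2]]  # strip "{{sql:" and "}}"
        return value
    return {key: fill(value) for key, value in template.items()}

config = render(template_v1001, "env1")
print(json.dumps(config))
```

Upgrading env1 to a new release is then just rendering the new version's template with the same env values, which is exactly the "generate a new file for any environment when requested" step.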

How to store JSON efficiently?

I'm working on a project which needs to store its configuration in a JSON database.
The problem is how to store that database efficiently, I mean:
don't rewrite the whole JSON tree to a file on each modification
manage multiple concurrent reads and writes
all of this without using a server external to the project (which is itself a server)
Take a peek at MongoDB, which uses BSON (binary JSON) to store data. http://www.mongodb.org/display/DOCS/BSON
http://www.mongodb.org/display/DOCS/Inserting
Edit 2021:
Today I would rather recommend using PostgreSQL to store JSON:
https://info.crunchydata.com/blog/using-postgresql-for-json-storage
I had an idea which fits my needs:
For the in-memory configuration, I use a JSON tree (with the Jansson library).
When I need to save the configuration, I retrieve the XPath of each element in the JSON tree, use it as a key, and store the key/value pairs in a Berkeley DB database.
For example:
{'test': {
    'option': true,
    'options': [ 1, 2 ]
}}
Will give the following key/value pairs:
Key | Value
-----------------+-----------
/test/option | true
/test/options[1] | 1
/test/options[2] | 2
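The flattening step (JSON tree to path/value pairs) can be sketched in Python, independent of Jansson and Berkeley DB; the path syntax below just mirrors the table above:

```python
def flatten(node, prefix=""):
    """Yield (xpath-like key, scalar value) pairs for a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}/{key}")
    elif isinstance(node, list):
        # 1-based indices, matching the /test/options[1] convention above
        for index, value in enumerate(node, start=1):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node

tree = {"test": {"option": True, "options": [1, 2]}}
pairs = dict(flatten(tree))
print(pairs)  # {'/test/option': True, '/test/options[1]': 1, '/test/options[2]': 2}
```

Each pair then becomes one Berkeley DB record, so changing a single option touches only its own key instead of rewriting the whole JSON file.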