I have a product for defining and configuring business workflows. Part of this product is a form builder that enables users to set up different forms.
All of this form data is backed by MongoDB in the following structure:
- form_schemas
{
  "_id" : "",
  "name" : "",
  "account_id" : "",
  "fields" : [
    {
      "label" : "Enter Email",
      "name" : "email",
      "type" : "text",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    },
    {
      "label" : "Select DOB",
      "name" : "dob",
      "type" : "date",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    }
    ...
  ]
}
- form_datas
{
  "workflow_template_id" : "",
  "account_id" : "",
  "data" : {
    "email" : "xyx#gmail.com",
    "dob" : "2001-04-05"
  },
  "created_at" : "",
  "updated_at" : ""
}
As seen above, the forms can differ across businesses.
I am now looking at a data pipeline to transport the data to Google BigQuery at periodic intervals for analysis.
On the BQ side, I maintain a separate table for each workflow.
I have a current working solution written entirely on Google Cloud Functions.
A Google Cloud Scheduler job runs at periodic intervals and invokes the different Cloud Functions.
At a high level, each Cloud Function does the following (a rough sketch follows this list):
Iterate over each schema
Read the data from MongoDB for that schema since the last run (as a cursor)
For each row of data, run the custom transformation logic (this includes transforming nested data types like grids, lookups, etc.)
Stream each row of transformed data as NDJSON to Google Cloud Storage
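Here is a minimal sketch of that loop, assuming pymongo and google-cloud-storage; the database/collection names, the checkpoint handling, and the transform() body are placeholders rather than the actual implementation:

import json
from datetime import datetime, timezone

from pymongo import MongoClient
from google.cloud import storage


def transform(schema, doc):
    # Placeholder for the custom per-field transformation (grids, lookups, ...).
    return {field["name"]: doc.get("data", {}).get(field["name"]) for field in schema["fields"]}


def export_since_last_run(mongo_uri, bucket_name, last_run_at):
    db = MongoClient(mongo_uri)["forms"]              # hypothetical database name
    bucket = storage.Client().bucket(bucket_name)

    for schema in db["form_schemas"].find({}):
        cursor = db["form_datas"].find({
            "workflow_template_id": schema["_id"],
            "updated_at": {"$gte": last_run_at},
        })
        # Transform each document and buffer it as one NDJSON line.
        lines = [json.dumps(transform(schema, doc)) for doc in cursor]
        if lines:
            run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
            blob = bucket.blob(f"exports/{schema['_id']}/{run_ts}.ndjson")
            blob.upload_from_string("\n".join(lines), content_type="application/x-ndjson")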
The above solution provides me with:
Complete control over the transformation
Simple deployment
However, since it all runs on Cloud Functions, I am bound by the limit of 9 minutes per run.
This essentially imposes a lot of pagination requirements, especially if there is a need to regenerate the complete data from the beginning of time.
While the above solution works fine for now, I am looking at other serverless options like Google Dataflow. Since I am just starting with Dataflow/Apache Beam, I was wondering:
If I were to write a pipeline on Beam, should I go with the same approach of
Extract(Row by Row) -> Transform -> Load (GCS) -> Load (BQ)
or
Extract (entire data as JSON) -> Load to GCS -> Transform (Beam) -> Load to GCS -> Load to BQ
Let me know if there is a better option for the entire data processing.
Typically, this sort of process writes raw data to GCS and then transforms it into BigQuery. This is done so that when you discover defects in the transform (which are inevitable) and the requirements change (also inevitable), you can replay the data with the new code.
Ideally, the steps prior to the transform are automated by a Change Data Capture (CDC) tool. There are plenty of CDC tools, but Debezium is taking over as it is reliable and free. There is a Debezium connector to get data from MongoDB, and there are examples of how to put Debezium CDC into BigQuery.
If you are going to write the code that puts data into GCS, I would recommend considering Apache Parquet rather than NDJSON as the format. Performance and cost will be better, and I find a format with data types easier to work with.
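For the Beam question specifically, here is a minimal Apache Beam (Python SDK) sketch of the "raw data in GCS, transform into BigQuery" shape described above; the bucket path, the table name, and the transform_row() body are placeholders, not your actual pipeline:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def transform_row(doc):
    # Placeholder for the custom transformation (grids, lookups, nested types, ...).
    data = doc.get("data", {})
    return {"email": data.get("email"), "dob": data.get("dob")}


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadRawNdjson" >> beam.io.ReadFromText("gs://<raw-bucket>/form_datas/*.ndjson")
            | "ParseJson" >> beam.Map(json.loads)
            | "Transform" >> beam.Map(transform_row)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "<project>:<dataset>.<workflow_table>",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()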
Related
I've been trying to create an ADF pipeline to move data from one of our databases into an Azure storage folder, but I can't seem to get the transform to work correctly.
I'm using a Copy Data task and have the source and sink set up as datasets. Data is flowing from one to the other; it's just the format that's bugging me.
In our database we have a single field that contains a JSON object. This needs to be mapped into the sink object, but it doesn't have a column name; it is simply the base object.
So for example the source looks like this
and my output needs to look like this
[
  {
    "ID": 123,
    "Field":"Hello World",
    "AnotherField":"Foo"
  },
  {
    "ID": 456,
    "Field":"Don't Panic",
    "AnotherField":"Bar"
  }
]
However, the Copy Data task only seems to accept direct Source -> Sink mapping, and it also treats the SQL Server field as VARCHAR (which I suppose it is). As a result, I'm getting this out the other side:
[
  {
    "Json": "{\"ID\": 123,\"Field\":\"Hello World\",\"AnotherField\":\"Foo\"}"
  },
  {
    "Json": "{\"ID\": 456,\"Field\":\"Don't Panic\",\"AnotherField\":\"Bar\"}"
  }
]
I've tried using the internal #json() parse function on the source field but this causes errors in the pipeline. I also can't get the sink to just map directly as an object inside the output array.
I have a feeling I just shouldn't be using Copy Data, or that Copy Data doesn't support the level of transformation I'm trying to do. Can anybody set me on the right path?
Using a JSON dataset as a source in your data flow allows you to set five additional settings. These settings can be found under the JSON settings accordion in the Source Options tab. For the Document Form setting, you can select one of the Single document, Document per line, or Array of documents types.
Select the Document form as Array of documents.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/format-json
I have a JSON file with a dynamic schema in Azure Cosmos DB (Mongo API). I want to read this file, convert it into a structured SQL table, and store it in Azure SQL Data Warehouse. How do I achieve this?
I have already tried reading this unstructured data from Azure Data Factory using a Copy Activity, but it seems ADF cannot read unstructured data.
Sample data from my Cosmos DB is:
{
  "name" : "Dren",
  "details" : [
    {
      "name" : "Vinod",
      "relation" : "Father",
      "age" : 40,
      "country" : "India",
      "ph1" : "+91-9492918762",
      "ph2" : "+91-8769187451"
    },
    {
      "name" : "Den",
      "relation" : "Brother",
      "age" : 10,
      "country" : "India"
    },
    {
      "name" : "Vinita",
      "relation" : "Mother",
      "age" : 40,
      "country" : "India",
      "ph1" : "+91-9103842782"
    }
  ]
}
I expect NULL values for those columns whose value does not exist in the JSON file.
As you have noticed, Data Factory doesn't manipulate unstructured data. Relequestual has correctly suggested that an outside data mapper will be required as Azure Data Warehouse does not offer JSON manipulation either. There are a couple ways to do this from Data Factory. Both involve calling another service to handle the mapping for you.
1) Have the pipeline call an Azure Function to do the work. The pipeline wouldn't be able to pass data in and out of the function; the function would need to read from Cosmos and write to Azure DW on its own. In between the two you can do your mapping in whatever language you write the function in (a rough sketch of that flattening logic follows these options). The upside of this is that the functions are fairly simple to write, but your ability to scale will be somewhat limited by how much data your function can process within a few minutes.
2) Do an interim hop in and out of Azure Data Lake. You would copy the data into a storage account (there are a few options that work with Data Lake Analytics), call a U-SQL job, and then load the results into Azure DW. The downside of this is that you are adding extra reads/writes to the storage account. However, it does let you scale as much as you need to based on your volume. It also uses a SQL-like language, if that is your preference.
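As an illustration of the flattening/mapping step itself, here is a minimal Python sketch based on the sample document above; the column names (including the extra parent_name) are hypothetical, and this is not the actual Azure Function code:

COLUMNS = ["name", "relation", "age", "country", "ph1", "ph2"]


def flatten(document):
    """Turn one Cosmos DB document into flat rows, using None (NULL) for missing keys."""
    rows = []
    for detail in document.get("details", []):
        row = {col: detail.get(col) for col in COLUMNS}
        row["parent_name"] = document.get("name")   # hypothetical column for the top-level name
        rows.append(row)
    return rows


sample = {
    "name": "Dren",
    "details": [
        {"name": "Vinod", "relation": "Father", "age": 40, "country": "India",
         "ph1": "+91-9492918762", "ph2": "+91-8769187451"},
        {"name": "Den", "relation": "Brother", "age": 10, "country": "India"},
    ],
}

for row in flatten(sample):
    print(row)   # missing columns such as ph1/ph2 come back as None, i.e. NULL in SQL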
I need help regarding a Kafka topic that I would like to put into HDFS in Parquet format (with a daily partitioner).
I have a lot of data in a Kafka topic which is basically JSON data like this:
{"title":"Die Hard","year":1988,"cast":["Bruce Willis","Alan Rickman","Bonnie Bedelia","William Atherton","Paul Gleason","Reginald VelJohnson","Alexander Godunov"],"genres":["Action"]}
{"title":"Toy Story","year":1995,"cast":["Tim Allen","Tom Hanks","(voices)"],"genres":["Animated"]}
{"title":"Jurassic Park","year":1993,"cast":["Sam Neill","Laura Dern","Jeff Goldblum","Richard Attenborough"],"genres":["Adventure"]}
{"title":"The Lord of the Rings: The Fellowship of the Ring","year":2001,"cast":["Elijah Wood","Ian McKellen","Liv Tyler","Sean Astin","Viggo Mortensen","Orlando Bloom","Sean Bean","Hugo Weaving","Ian Holm"],"genres":["Fantasy ยป]}
{"title":"The Matrix","year":1999,"cast":["Keanu Reeves","Laurence Fishburne","Carrie-Anne Moss","Hugo Weaving","Joe Pantoliano"],"genres":["Science Fiction"]}
The topic's name is: test
And I would like to put that data into my HDFS cluster in Parquet format.
But I am struggling with the sink connector configuration.
I use the Confluent hdfs-sink-connector for that.
Here is what I have managed to do so far:
{
  "name": "hdfs-sink",
  "config": {
    "name": "hdfs-sink",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "hdfs.url": "hdfs://hdfs-IP:8020",
    "hadoop.home": "/user/test-user/TEST",
    "flush.size": "3",
    "locale": "fr-fr",
    "timezone": "UTC",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.DailyPartitioner",
    "consumer.auto.offset.reset": "earliest",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "value.converter.schemas.enable": "true"
  }
}
Some explanation of why I configured the connector like that:
A lot of this data populates my topic every day
The final goal is to have one Parquet file per day in HDFS for this topic
I understand that I may have to use the Schema Registry to format the data as Parquet, but I don't know how to do that. Is it even necessary?
Can you please help me on that?
Thank you
I have not personally used the ParquetFormat, but your data must have a schema, which means one of the following:
Your data is produced using Confluent Avro serializer
Your data is produced as Protobuf and you get the Protobuf converter added to your Connect workers
You use Kafka Connect's special JSON format that includes a schema within your records.
Basically, it cannot be "plain JSON". You currently have "value.converter.schemas.enable": "true", and I'm guessing your connector isn't working because your records are not in any of the above formats.
Without a schema, the JSON converter cannot possibly know what "columns" Parquet needs to write (a rough sketch of the schema-ful JSON envelope follows).
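To make the third option concrete, here is a rough illustration of that envelope, shown from a kafka-python producer; only two fields of the movie sample are included, the broker address is a placeholder, and the exact schema encoding should be treated as an approximation:

import json
from kafka import KafkaProducer

# One record in Kafka Connect's schema-ful JSON format: a "schema" block describing
# the fields plus a "payload" block with the actual values.
record = {
    "schema": {
        "type": "struct",
        "optional": False,
        "fields": [
            {"field": "title", "type": "string", "optional": False},
            {"field": "year",  "type": "int32",  "optional": True},
        ],
    },
    "payload": {"title": "Die Hard", "year": 1988},
}

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                        # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("test", record)
producer.flush()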
And the DailyPartitioner does not create one file per day, only a directory per day. You will get one file per flush.size, and there is also a configuration for scheduled rotate intervals for flushing files. In addition, there will be one file per Kafka partition.
Also, "consumer.auto.offset.reset": "earliest" only works in the connect-distributed.properties file, not on a per-connector basis, AFAIK.
Since I haven't personally used the ParquetFormat, that's all the advice I can give, but I have used other tools like NiFi for similar goals, which will allow you to not change your existing Kafka producer code.
Alternatively, use JSONFormat instead; however, Hive integration will not work automatically, and the tables must be pre-defined (which will require you to have a schema for your topic anyway).
And another option is to just configure Hive to read from Kafka directly.
I have a lot of JSON files (millions) in Cosmos DB (earlier called Document DB) and I want to move them into Azure Data Lake for cold storage.
I found this document https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.documentclient.readdocumentfeedasync?view=azure-dotnet but it doesn't have any samples to start with.
How should I proceed? Any code samples are highly appreciated.
Thanks.
I suggest using Azure Data Factory to implement your requirement.
Please refer to this doc about how to export JSON documents from Cosmos DB and this doc about how to import data into ADL.
Hope it helps you.
Updated answer:
Please refer to this: Azure Cosmos DB as source. You could create a query in the pipeline.
Yeah, the change feed will do the trick.
You have two options. The first one (which is probably what you want in this case) is to use it via the SDK.
Microsoft has a detailed page on how to do this, including code examples, here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#rest-apis
The second one is the Change Feed Library, which allows you to have a service running at all times, listening for changes and processing them based on your needs. More details with code examples of the change feed library here: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed#change-feed-processor
(Both pages (which are really the same page, just different sections) contain a link to a Microsoft GitHub repo with code examples.)
Keep in mind you will still be charged for using this in terms of RU/s, but from what I've seen it is relatively low (or at least lower than what you'd pay if you started reading the collections themselves).
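If you go the SDK route from Python, a rough sketch of reading the change feed and landing the documents in Data Lake (Gen2) might look like the following; this assumes the azure-cosmos and azure-storage-file-datalake packages, and the account names, keys, and paths are all placeholders:

import json

from azure.cosmos import CosmosClient
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to Cosmos DB and to the Data Lake file system (placeholder credentials).
cosmos = CosmosClient("https://<account>.documents.azure.com:443/", credential="<cosmos-key>")
container = cosmos.get_database_client("<database>").get_container_client("<collection>")

lake = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",
    credential="<storage-key>",
)
fs = lake.get_file_system_client("coldstore")            # placeholder file system name

# Read whatever is currently in the change feed, starting from the beginning.
docs = list(container.query_items_change_feed(is_start_from_beginning=True))
if docs:
    file = fs.get_file_client("cosmos-export/batch-0001.json")   # placeholder path
    file.upload_data(json.dumps(docs), overwrite=True)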
You could also read the change feed via Spark. The following Python code example generates Parquet files, partitioned by load date, for changed data. It works in an Azure Databricks notebook on a daily schedule:
# Get DB secrets
endpoint = dbutils.preview.secret.get(scope = "cosmosdb", key = "endpoint")
masterkey = dbutils.preview.secret.get(scope = "cosmosdb", key = "masterkey")
# database & collection
database = "<yourdatabase>"
collection = "<yourcollection"
# Configs
dbConfig = {
"Endpoint" : endpoint,
"Masterkey" : masterkey,
"Database" : database,
"Collection" : collection,
"ReadChangeFeed" : "True",
"ChangeFeedQueryName" : database + collection + " ",
"ChangeFeedStartFromTheBeginning" : "False",
"ChangeFeedUseNextToken" : "True",
"RollingChangeFeed" : "False",
"ChangeFeedCheckpointLocation" : "/tmp/changefeedcheckpointlocation",
"SamplingRatio" : "1.0"
}
# Connect via Spark connector to create Spark DataFrame
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**dbConfig).load()
# set partition to current date
import datetime
from pyspark.sql.functions import lit
partition_day= datetime.date.today()
partition_datetime=datetime.datetime.now().isoformat()
# new dataframe with ingest date (=partition key)
df_part= df.withColumn("ingest_date", lit(partition_day))
# write parquet file
df_part.write.partitionBy('ingest_date').mode('append').parquet('dir')
You could also use a Logic App with a Timer trigger. This would be a no-code solution:
Query for documents
Loop through documents
Add to Data Lake
The advantage is that you can apply any rules before sending to Data Lake.
I need to make a file that contains a hierarchical dataset. The dataset in question is a file-system listing (directory names, file names/sizes in each directory, sub-directories, ...).
My first instinct was to use JSON and flatten the hierarchy using paths so the parser doesn't have to recurse so much. As seen in the example below, each entry is a path ("/", "/child01", "/child01/gchild01", ...) and its files.
{
  "entries":
  [
    {
      "path":"/",
      "files":
      [
        {"name":"File1", "size":1024},
        {"name":"File2", "size":1024}
      ]
    },
    {
      "path":"/child01",
      "files":
      [
        {"name":"File1", "size":1024},
        {"name":"File2", "size":1024}
      ]
    },
    {
      "path":"/child01/gchild01",
      "files":
      [
        {"name":"File1", "size":1024},
        {"name":"File2", "size":1024}
      ]
    },
    {
      "path":"/child02",
      "files":
      [
        {"name":"File1", "size":1024},
        {"name":"File2", "size":1024}
      ]
    }
  ]
}
Then I thought that repeating the keys over and over ("name", "size") for each file kind of sucks. So I found this article about how to use JSON as if it were a database: http://peter.michaux.ca/articles/json-db-a-compressed-json-format
Using that technique I'd have a JSON table like "Entry" with columns "Id", "ParentId", "EntryType", "Name", "FileSize", where "EntryType" would be 0 for a directory and 1 for a file (a rough sketch of this follows).
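For illustration, the compressed table could look roughly like this (column names stored once, rows as arrays); this is a sketch, not necessarily the exact encoding from the linked article:

entry_table = {
    "columns": ["Id", "ParentId", "EntryType", "Name", "FileSize"],
    "rows": [
        [1, None, 0, "/",       None],   # EntryType 0 = directory
        [2, 1,    1, "File1",   1024],   # EntryType 1 = file
        [3, 1,    1, "File2",   1024],
        [4, 1,    0, "child01", None],
        [5, 4,    1, "File1",   1024],
        [6, 4,    1, "File2",   1024],
    ],
}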
So, at this point, I'm wondering if SQLite would be a better choice. I'm thinking that the file size would be a LOT smaller than a JSON file, but the difference might be negligible if I use the compressed JSON-DB format from the article. Besides size, are there any other advantages that you can think of?
I think a JavaScript object as the data source, loaded as a file stream into the browser and then used in JavaScript logic in the browser, would consume the least time and have good performance, but only up to a limited hierarchy size.
Also, not storing the hierarchy anywhere else and keeping it only as a JSON file limits your data source's use in your project to client-side technologies, or forces conversions to other technologies.
If you are building a pure JavaScript-based application (an HTML/JS/CSS-only app), then you could keep it as a JSON object alone and limit your hierarchy sizes; you could split bigger hierarchies into multiple files linking JSON objects.
If you will have server-side code like PHP in your project, then considering manageability of code and scaling, you should ideally store the data in a SQLite DB and, at runtime, create your JSON hierarchies for limited levels as AJAX loads from your page.
If this is the only data your application stores then you can do something really simple like just store the data in an easy to parse/read text file like this:
File1:1024
File2:1024
child01
    File1:1024
    File2:1024
    gchild01
        File1:1024
        File2:1024
child02
    File1:1024
    File2:1024
Files get Name:Size lines and directories get just their name; indentation gives the structure (a small parser sketch follows below). For something slightly more standard but just as easy to read, use YAML.
http://www.yaml.org/
Both can benefit from decreased file size (but decreased user readability) by gzipping the file.
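As a rough sketch of how little code the indented format needs, here is one way to parse it into nested dicts in Python; the 4-space indent and the Name:Size convention are assumptions based on the example above:

def parse_listing(text, indent=4):
    """Parse the indented listing into nested {"files": {...}, "dirs": {...}} dicts."""
    root = {"files": {}, "dirs": {}}
    stack = [root]                       # stack[i] is the directory at depth i
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // indent
        entry = line.strip()
        stack = stack[: depth + 1]       # pop back to the current depth
        if ":" in entry:                 # file lines look like Name:Size
            name, size = entry.split(":", 1)
            stack[-1]["files"][name] = int(size)
        else:                            # directory lines are just a name
            child = {"files": {}, "dirs": {}}
            stack[-1]["dirs"][entry] = child
            stack.append(child)
    return root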
And if you have more data to store, then use SQLite. SQLite is great.
Don't use JSON for data persistence. It's wasteful.
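If you do go the SQLite route, a minimal sketch using Python's built-in sqlite3 module and the "Entry" table layout from the question (Id, ParentId, EntryType, Name, FileSize) might look like this; the file name and sample rows are just for illustration:

import sqlite3

conn = sqlite3.connect("listing.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS Entry (
        Id        INTEGER PRIMARY KEY,
        ParentId  INTEGER REFERENCES Entry(Id),
        EntryType INTEGER NOT NULL,      -- 0 = directory, 1 = file
        Name      TEXT NOT NULL,
        FileSize  INTEGER
    )
""")

# Insert the root directory and two files under it.
root_id = conn.execute(
    "INSERT INTO Entry (ParentId, EntryType, Name) VALUES (NULL, 0, '/')"
).lastrowid
conn.execute(
    "INSERT INTO Entry (ParentId, EntryType, Name, FileSize) VALUES (?, 1, 'File1', 1024)", (root_id,)
)
conn.execute(
    "INSERT INTO Entry (ParentId, EntryType, Name, FileSize) VALUES (?, 1, 'File2', 1024)", (root_id,)
)
conn.commit()

# Reconstruct one directory level when needed.
for name, size in conn.execute(
    "SELECT Name, FileSize FROM Entry WHERE ParentId = ? AND EntryType = 1", (root_id,)
):
    print(name, size)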