Apache Beam for data validation, GCP - CSV

I have a requirement to validate incoming CSV data against metadata that comes from a BigQuery table. I am running Dataflow locally for this, but I cannot figure out how to implement the logic with an Apache Beam transform.
Note: I have seen some code where a ParDo is given a side input. Is this the right approach?
Validation required: before loading the data into a BigQuery table I have to check it against the metadata; only if it passes validation do I insert the data into the BQ table.
Sample code:
"Process Data into TableRows" >> beam.ParDo(ProcessDF(),
beam.pvalue.AsList(metadata_for_validation))
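For reference, here is a fuller sketch of that side-input approach. It assumes the metadata table exposes column_name and data_type fields; the table names, file path, parsing logic and validation rule are all placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ValidateAgainstMetadata(beam.DoFn):
    def process(self, row, metadata):
        # metadata arrives as a side-input list of dicts read from BigQuery,
        # e.g. [{'column_name': 'id', 'data_type': 'INT64'}, ...] (assumed schema).
        expected_columns = {m['column_name'] for m in metadata}
        if expected_columns.issubset(row.keys()):      # placeholder validation rule
            yield row
        else:
            yield beam.pvalue.TaggedOutput('invalid', row)

with beam.Pipeline(options=PipelineOptions()) as p:
    metadata = p | 'ReadMetadata' >> beam.io.ReadFromBigQuery(
        query='SELECT column_name, data_type FROM `project.dataset.metadata`',
        use_standard_sql=True)

    rows = (p
            | 'ReadCSV' >> beam.io.ReadFromText('gs://bucket/input.csv', skip_header_lines=1)
            | 'Parse' >> beam.Map(lambda line: dict(zip(['id', 'name'], line.split(',')))))

    validated = rows | 'Validate' >> beam.ParDo(
        ValidateAgainstMetadata(),
        beam.pvalue.AsList(metadata)).with_outputs('invalid', main='valid')

    validated.valid | 'WriteToBQ' >> beam.io.WriteToBigQuery(
        'project:dataset.target_table',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

Rows that fail the check come out on the 'invalid' tag, so they can be logged or written to a dead-letter table instead of being silently dropped.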

Related

Azure Data Factory Copy Data from REST API to CSV - JSON object value isn't output

The values for data.CustomerNumber do not get written to the file; the other elements do.
CustomerNumber is Pascal case because it is a custom value in the system I am pulling the data out of.
I have tried the map complex values option and the advanced editor, and have deleted everything and started over. I originally was trying to save this to an Azure SQL table but gave up and just tried to write to a file. I have been able to do this type of mapping in other pipelines, so I am really at a loss as to why it doesn't work in this case.
The actual json content for the mapping

Configuring a linked service between zoho (CRM) and Azure Data Factory

I am trying to configure a linked service in Azure Data Factory (ADF) in order to load the ZOHO data to my SQL database. I am using a REST API linked service for this which successfully connects to ZOHO. I am mainly struggling to select the proper relative URL. Currently I use the following settings:
Base URL: https://www.zohoapis.eu/crm/v2/
Relative URL: /orgxxxxxxxxxx
When I try to preview the data, this results in the following error:
Error occurred when deserializing source JSON file ''. Check if the data is in valid JSON object format.
Unexpected character encountered while parsing value: <. Path '', line 0, position 0.
Activity ID: 8d32386a-eee0-4d2a-920a-5c70dc15ef06
Does anyone know what I have to do to make this work such that I can load all required Zoho tables into ADF?
As per the official Microsoft documentation:
The Zoho connector is supported for the following activities:
1. Copy activity with supported source/sink matrix
2. Lookup activity
The REST API connector does not support Zoho.
You are expecting a JSON file as the source, but your sink is Azure SQL Database. You need to change your approach.
You can directly use the Zoho linked service in ADF to fetch the data from Zoho (CRM).
You can copy data from Zoho to any supported sink data store. This connector supports Xero access token authentication and OAuth 2.0 authentication.
You need to use a Copy activity to fetch the data from Zoho (CRM) and store it in an Azure Storage account. Once the data is received, you need to denormalize the JSON file to store it in Azure SQL Database. Use a Data Flow activity to denormalize the data with the flatten transformation.
Use the flatten transformation in mapping data flow to take array values inside hierarchical structures such as JSON and unroll them into individual rows. This process is known as denormalization.
Learn more about the flatten transformation in mapping data flows here.
Once flattening is done, you can store the data in Azure SQL Database.
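To make the denormalization idea concrete, here is a plain-Python illustration of what the flatten transformation does conceptually. This is not ADF syntax, and the field names such as Leads and Full_Name are hypothetical:

# One Zoho-style record with a nested array (field names are made up).
record = {
    "org": "orgxxxxxxxxxx",
    "Leads": [
        {"Full_Name": "Alice", "Email": "alice@example.com"},
        {"Full_Name": "Bob", "Email": "bob@example.com"},
    ],
}

# Flattening/denormalizing: each array element becomes its own flat row,
# which is the shape a tabular sink such as Azure SQL Database expects.
rows = [{"org": record["org"], **lead} for lead in record["Leads"]]
# rows -> [{'org': ..., 'Full_Name': 'Alice', ...}, {'org': ..., 'Full_Name': 'Bob', ...}]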

AWS Glue: Import JSON from Data Lake (S3) with mixed data

I'm currently struggling to understand how to create a data catalog of our data lake (=Source).
Background:
We have an event-driven architecture and started to store all events produced by our application to a data lake (S3 Bucket). Before the events are stored we sanitize them (remove sensitive information) and add an envelope around each event with some general data:
event origin (which application generated the event)
event type (what kind of event was generated)
timestamp (when was the event generated)
...
With Kinesis Streams and Firehose, we batch those events together and store them as a JSON file in an S3 bucket. The bucket is structured like this:
/////
In there, we store the batched events with the envelope as JSON files. That means one JSON file contains multiple events:
{
    "origin": "hummingbird",
    "type": "AuthenticationFailed",
    "timestamp": "2019-06-30T18:24:13.868Z",
    "correlation_id": "2ff0c077-542d-4307-a58b-d6afe34ba748",
    "data": {
        ...
    }
}
{
    "origin": "hummingbird",
    "type": "PostingCreated",
    "timestamp": "2019-06-30T18:24:13.868Z",
    "correlation_id": "xxxx",
    "data": {
        ...
    }
}
The data object contains specific data of the events.
Now I thought I could use AWS Glue to hook into the raw data and use ETL jobs to aggregate the event data. As I understand it, I need a data catalog for my source data, and that is where I'm struggling, since each JSON file contains different events batched together. The standard crawler can technically handle this, but it creates nonsense schemas based on every JSON file.
What I wanted to achieve:
Parse through the data lake to filter out events that I'm interested in
Use the events that I'm interested in and do some transformation/aggregation/calculation with it
Store results into our current Analytics RDS or wherever (enough for our purposes right now)
Parse through newly arrived events on a daily basis and insert/append/update them in our analytics RDS
My Questions I have:
What's the best way to use glue with our data lake?
Are there ways to use crawlers with custom classifiers and some sort of filter together with our data lake?
Do I need to transform the data beforehand to actually be able to use AWS Glue?
Let me give it a try.
Parse through the data lake to filter out events that I'm interested in
Use the events that I'm interested in and do some transformation/aggregation/calculation with it
--> You can flatten the JSON for each event, then export it into a different S3 bucket. Refer to the sample Python code here: https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
--> Use Glue to crawl your new bucket and generate a new table schema; then in Athena you should be able to see it and do your filtering/querying/aggregation on top of the table. Once you're happy with the transformed data, you can further import it into Redshift or RDS.
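A rough Glue (PySpark) sketch of that flatten-and-export step, using the Relationalize transform described in the linked blog post; the catalog database/table names and bucket paths are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw, enveloped events via the Glue Data Catalog (assumed names).
events = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="events")

# Relationalize unnests the nested "data" struct into flat columns;
# the result is a collection of DynamicFrames keyed by path.
flattened = Relationalize.apply(
    frame=events,
    staging_path="s3://my-temp-bucket/staging/",
    name="root")

# Export the flattened root frame to a new bucket for crawling/Athena.
glue_context.write_dynamic_frame.from_options(
    frame=flattened.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://my-flattened-bucket/events/"},
    format="json")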
Store results into our current Analytics RDS or wherever (enough for our purposes right now)
--> From the Glue Catalog above, add a Redshift/RDS connection, then use PySpark (some basic knowledge of working with data frames is needed) to load the data into Redshift or RDS.
https://www.mssqltips.com/sqlservertip/5952/read-enrich-and-transform-data-with-aws-glue-service/
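Continuing the sketch above, the load into Redshift could look roughly like this; the Glue connection name, target table and temp directory are assumptions:

# Write the flattened DynamicFrame into Redshift via a pre-defined Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=flattened.select("root"),
    catalog_connection="analytics-redshift",        # assumed Glue connection name
    connection_options={"dbtable": "events_flat", "database": "analytics"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift/")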
Parse through newly arrived events on a daily basis and insert/append/update them in our analytics RDS
--> You can schedule your Glue crawler to discover new data from the new bucket.
Alternatively, Lambda is also a good option for this. You can use S3 object creation (in the new bucket with the flattened JSON) to trigger a Lambda that pre-processes, runs the ETL, and then inserts into Redshift/RDS (using a JDBC driver).
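A minimal sketch of that Lambda alternative, assuming the flattened files are newline-delimited JSON and that psycopg2 (Redshift speaks the PostgreSQL protocol) is bundled with the function; connection details and the target table are placeholders:

import json
import boto3
import psycopg2   # assumed to be packaged with the function

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by object creation in the flattened-JSON bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]

        conn = psycopg2.connect(host="analytics-host", dbname="analytics",
                                user="etl_user", password="...")   # placeholders
        with conn, conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    "INSERT INTO events_flat (origin, type, ts) VALUES (%s, %s, %s)",
                    (row.get("origin"), row.get("type"), row.get("timestamp")))
        conn.close()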

How to copy MySQL data to Azure Blob using Python in Azure Data Factory

I have followed this link https://learn.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-python and created the pipeline to copy data from one Azure blob to another blob.
Then I tried to extend this code to copy data from a MySQL table and store it in an Azure blob.
I'm creating the linked services by using:
ls_mysql = AzureMySqlLinkedService(connection_string=mysql_string)
and creating dataset like this:
ds_azure_mysql = AzureMySqlTableDataset(ds_ls,table_name=table1)
And I'm getting the following exception:
"Errors: 'AzureMySqlTable' is not a supported data set type for Azure Storage linked service. Supported types are 'AzureBlob' and 'AzureTable'"
Can you help identify the correct method to use here?
Any API reference related to a MySQL-to-blob copy would also really help.
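For context, here is roughly the pattern being followed, extended from the quickstart (a sketch only; the names, and the older SDK call style used in that quickstart, are assumptions). The error message suggests the dataset is being created against the Azure Storage linked service reference, whereas an AzureMySqlTableDataset needs to reference the MySQL linked service:

from azure.mgmt.datafactory.models import (
    AzureMySqlLinkedService, AzureMySqlTableDataset, AzureBlobDataset,
    LinkedServiceReference)

# Register the MySQL linked service alongside the existing storage one.
ls_mysql = AzureMySqlLinkedService(connection_string=mysql_string)
adf_client.linked_services.create_or_update(rg_name, df_name, 'MySqlLinkedService', ls_mysql)

mysql_ls_ref = LinkedServiceReference(reference_name='MySqlLinkedService')
blob_ls_ref = LinkedServiceReference(reference_name='AzureStorageLinkedService')

# Source dataset references the MySQL linked service (not the storage one) ...
ds_azure_mysql = AzureMySqlTableDataset(mysql_ls_ref, table_name=table1)
adf_client.datasets.create_or_update(rg_name, df_name, 'MySqlSourceDataset', ds_azure_mysql)

# ... and the sink dataset references the blob storage linked service.
ds_sink_blob = AzureBlobDataset(blob_ls_ref, folder_path='output')
adf_client.datasets.create_or_update(rg_name, df_name, 'BlobSinkDataset', ds_sink_blob)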

Google Cloud Functions: loading GCS JSON files into BigQuery with non-standard keys

I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.
I wrote a Google Cloud Function that takes the JSON file and loads it up to BigQuery. The function works fine with sample JSON files - the problem is the legacy system is generating a JSON with a non-standard key:
{
    "id": 12345,
    "#address": "XXXXXX"
    ...
}
Of course the "#address" key throws everything off and the cloud function errors out ...
Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could deactivate the autodetect and provide my own mapping, but the online documentation does not cover this situation.
I am contemplating the option of:
Loading the file into memory as a string variable
Replacing #address with address
Converting the newline-delimited JSON into a list of dictionaries
Using a BigQuery streaming insert to insert the rows into BQ
But I'm afraid this will take a lot longer, the file size may exceed the 2 GB maximum for functions, I'd have to deal with Unicode when loading the file into a variable, etc. etc. etc. (a rough sketch of what I mean is below).
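A sketch of that in-function approach, assuming the google-cloud-storage and google-cloud-bigquery client libraries; the project, dataset and table names are placeholders:

import json
from google.cloud import bigquery, storage

def load_file(event, context):
    # Triggered when the legacy system drops a file into the bucket.
    blob = storage.Client().bucket(event['bucket']).blob(event['name'])
    text = blob.download_as_text()                      # whole file held in memory

    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        record['address'] = record.pop('#address', None)   # rename the offending key
        rows.append(record)

    errors = bigquery.Client().insert_rows_json('my-project.my_dataset.my_table', rows)
    if errors:
        raise RuntimeError(errors)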
What other options do I have?
And no, I cannot modify the legacy system to rename the "#address" field :(
Thanks!
I'm going to assume the error that you are getting is something like this:
Errors: query: Invalid field name "#address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
This is an error message on the BigQuery side, because cols/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.
Here's one way of doing it, which is completely serverless:
Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. The function simply passes the name of the file to process to the pipeline.
In said Cloud Dataflow pipeline, read the JSON file in a ParDo and, using a JSON parsing library (e.g. Jackson if you are using Java), get rid of the "#" before creating your output TableRow object (see the sketch after these steps).
Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
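Here is a minimal sketch of steps 3 and 4 using the Beam Python SDK (the same idea works in Java with Jackson, as described above); the bucket, project and table names are placeholders:

import json
import apache_beam as beam

class StripHashFromKeys(beam.DoFn):
    def process(self, line):
        record = json.loads(line)
        # Drop the leading '#' so field names satisfy BigQuery's naming rules.
        yield {key.lstrip('#'): value for key, value in record.items()}

with beam.Pipeline() as p:
    (p
     | 'ReadNDJSON' >> beam.io.ReadFromText('gs://my-bucket/path/to/file.json')
     | 'CleanKeys' >> beam.ParDo(StripHashFromKeys())
     | 'WriteToBQ' >> beam.io.WriteToBigQuery('my-project:my_dataset.my_table'))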
To sum up, you'll need the following in the conga line:
File > GCS > Cloud Function > Dataflow (template) > BigQuery
The advantages of this:
Event driven
Scalable
Serverless/no-ops
You get monitoring and alerting out of the box with Stackdriver
Minimal code
See:
Reading nested JSON in Google Dataflow / Apache Beam
https://cloud.google.com/dataflow/docs/templates/overview
https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/
Disclosure: the last link is to a blog post written by one of the engineers I work with.