Azure Table storage to Azure Table storage using Azure Data Factory with a DateTime parameterized query - parameter-passing

I have an Azure Table storage where a few records are added every day (usually 3-5). There are days when no records are added, so the volume is very low. Here is the structure of the table, with the pre-defined PartitionKey, RowKey, and Timestamp columns:
I need to query this table from the Azure Data Factory for the previous day's records. So for the example data shown below, I could be querying for 2019-04-25 data on 2019-04-26. The reason being, one day's staleness does not make a difference and that way, I don't have to worry about 'Watermarks' etc. I simply query for the data for the previous day and copy it to an identical Azure Table in Azure Data Lake Storage Gen 2.
I know that I need to specify a parameterized query based on the 'Timestamp' column for the previous day but do not know how to specify it.
Please advise.

You could set a query in the copy activity's Table storage source. For your needs, it should look like:
time gt datetime'2019-04-25T00:00:00' and time le datetime'2019-04-26T00:00:00'
My sample data as below:
Preview data as below:
Please see some examples in this doc: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-table-storage#azuretablesourcequery-examples.
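In ADF itself, the previous-day window is usually built with a dynamic-content expression in the source query (for example by combining utcnow(), adddays() and formatDateTime()). As a minimal sketch of the same filter outside ADF, assuming the azure-data-tables package, a placeholder connection string, and a table named SourceTable:
from datetime import datetime, timedelta, timezone
from azure.data.tables import TableClient  # pip install azure-data-tables

# Previous day's window: yesterday 00:00:00 UTC (inclusive) to today 00:00:00 UTC (exclusive)
today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
yesterday = today - timedelta(days=1)

# Same shape as the source query above, filtering on the built-in Timestamp column
query_filter = (
    f"Timestamp ge datetime'{yesterday:%Y-%m-%dT%H:%M:%S}Z' "
    f"and Timestamp lt datetime'{today:%Y-%m-%dT%H:%M:%S}Z'"
)

client = TableClient.from_connection_string("<connection-string>", table_name="SourceTable")
for entity in client.query_entities(query_filter):
    print(entity["PartitionKey"], entity["RowKey"])
Using ge on the start and lt on the end of the window captures exactly one calendar day without double-counting the midnight boundary.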

Related

Debezium Ordering of CDC Events between multiple tables

Hi, I am new to Debezium.
I am planning to do a real-time database (MySQL and Mongo) integration using Debezium.
From each database I need to sync the data to a destination database (MySQL and Mongo).
From MySQL and Mongo I need only X tables and Y collections, respectively, from each database.
In those X tables and Y collections I need only a specific set of data, based on a condition for each table and collection.
The condition is not straightforward. For example, for the MySQL database I need to take records by joining the table I want to capture CDC from with one specific table, and capture only the matching records.
That is my requirement; I have some questions about it.
As per the Debezium documentation, a topic is created for each table.
For each topic, if the CDC data is pushed into a single partition of that topic, the order is guaranteed. So when working with multiple tables, the CDC data is pushed to each table's topic partition. In this case, how can I achieve ordering of events across multiple tables?
I mean I need the exact order of events as they were performed in the MySQL binlog.
Is the order guaranteed when working with multiple tables? I need to apply the changes to the destination database in the same order they happened in the source database's binlog.
If I want the data based on a MySQL or Mongo join condition with another table or collection, how can I achieve this with Debezium?
These are my two main questions.
Please help me on this.
Thanks
Mike
I posted this question 15 days ago and so far have not received any answer, so I ran the test myself and found the answer for the ordering. I did the testing with the Debezium MySQL connector.
I tested with a transaction which has 7 operations:
1. tableA insert
2. tableA update
3. tableB insert
4. tableA update
5. tableB insert
6. tableA update
7. tableC insert
I went with the default connector configuration and observed the changes 5 times; I could see the events were not in the order in which I performed the operations.
So I made a configuration change to set the default topic partition count to 1.
Then I could observe that the data within each topic was ordered, but the overall order across topics was still not there. When I routed everything to a single topic with a single partition, the transaction order was guaranteed.
This is for a single partition:
"topic.creation.default.replication.factor": 1,
"topic.creation.default.partitions": 1,
This is for a single topic:
"transforms":"dropPrefix",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex": "(.*)([^0-9])(.*)",
"transforms.dropPrefix.replacement": "yourtopic",
Note: I tested this with Kafka version 2.7.0; I was facing issues with 2.6.0.
I have also created a library in Golang which may be useful for extracting the needed data:
https://github.com/ahamedmulaffer/cdc-formatter
Hope it helps someone.
Thanks
Mike

Replicating data from MySQL to BigQuery using GCP Data Fusion - Getting issue with 'Date' datatype

I wanted to replicate MySQL tables hosted on GCP Compute Engine to Google BigQuery.
I referred to this document: https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery.
So I decided to use GCP Data Fusion for the job.
Everything works fine and the data is replicated in BigQuery.
I was testing support for different datatypes in this replication.
That is where I ran into an issue with the replication pipeline:
Whenever I include a 'DATE' datatype column in the Data Fusion replication, the data for the whole table (the one containing the 'DATE' column) doesn't show up in BigQuery.
It creates the table with the same schema as the source, with the 'Date' datatype also present in BigQuery, and I have used the same date format as supported by BigQuery.
I have also gone through the Data Fusion logs. They show the pipeline is loading the data into BigQuery perfectly fine, and it also catches the new rows added to the MySQL table in the source MySQL DB, with inserts and updates as well.
But somehow the rows are not getting into BigQuery.
Has anyone used Data Fusion replication with a 'DATE' column datatype?
Is this an issue with BigQuery or with Data Fusion?
Do I need to provide any manual setting in BigQuery?
Can anyone please provide input on this?
I'll mark this issue as resolved.
The problem was with Data Fusion; the latest version, 6.4.1, now supports the Datetime datatype when replicating into BigQuery.
I'm receiving correct date and datetime data now.
Thank you for all the help :)
I used the following schema, which has a DATE field in it:
create table tutorials_tbl(tutorial_id INT NOT NULL AUTO_INCREMENT, tutorial_title VARCHAR(100) NOT NULL, tutorial_author VARCHAR(40) NOT NULL, submission_date DATE, PRIMARY KEY ( tutorial_id ));
When I run the replication pipeline, I see the BQ table is created with the following schema:
I also see the events in the table:
Can you please share the input table schema? You can also check the Job History and Query History tabs under the BQ table to see if there are any errors.

AWS DMS - Microsecond precision for CDC on MYSQL as source EndPoint

I am using AWS DMS to migrate data from MySQL as the source endpoint to S3 as the target endpoint.
I want to track updates from the source, so during configuration I enabled the TimestampColumnName property (column name: event_timestamp).
In the result (listed below), I am getting the timestamp of the records/events but NOT with microsecond precision.
I want microsecond precision to build sequencing logic on top of it.
I have investigated the properties of the source endpoint as well as the target, but I am not getting the desired result. Here is the sample output:
Can somebody take a look and suggest if I am missing any property?
The output format for my files in S3 is Parquet.
Unfortunately, the DATETIME column added by the AWS DMS S3 TimestampColumnName setting for a change data capture (CDC) load from a MySQL source will have only second precision.
That is because the transaction timestamp in the MySQL binary log has only seconds.
The simplest solution is to add a new column to the MySQL table - a timestamp with microsecond precision whose value is set automatically on insert and/or update - and use this column as event_timestamp:
ts TIMESTAMP(6) DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6)
Also, check that in the AWS DMS S3 endpoint settings ParquetTimestampInMillisecond is False (or not present/unset; false is the default).
The AWS DMS S3 TimestampColumnName setting adds a timestamp column to the output.
In a 'static' read (full load) it will generate the current timestamp:
For a full load, each row of this timestamp column contains a timestamp for when the data was transferred from the source to the target by DMS.
For CDC it will read transaction time from database transaction log:
For a change data capture (CDC) load, each row of the timestamp column contains the timestamp for the commit of that row in the source database.
And its precision will be that of the timestamp in the database transaction log:
...the rounding of the precision depends on the commit timestamp supported by DMS for the source database.
CDC mode is essentially replication. The source database should be configured appropriately to write such a transaction log; the database writes transaction info to this log along with the transaction/commit timestamp.
In the case of MySQL this is the binary log, and the MySQL binlog timestamp is only 32 bits - just seconds.
Also, this transaction timestamp may not always be in line with the actual order of transactions or the order in which changes were actually committed (link 1, link 2).
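If you want to confirm the ParquetTimestampInMillisecond setting mentioned above programmatically, a minimal boto3 sketch (the region and endpoint ARN are placeholders) might look like this:
import boto3

dms = boto3.client("dms", region_name="us-east-1")  # assumption: region

endpoint_arn = "arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE"  # placeholder ARN
resp = dms.describe_endpoints(Filters=[{"Name": "endpoint-arn", "Values": [endpoint_arn]}])
s3_settings = resp["Endpoints"][0].get("S3Settings", {})

# False (or absent) means Parquet TIMESTAMP values are not coerced to millisecond precision
print("ParquetTimestampInMillisecond:", s3_settings.get("ParquetTimestampInMillisecond", False))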
This question is over a year old but I faced the same/similar issue and thought I'd explain how I solved it in case it could help others.
I have tables in RDS and am using DMS to migrate them from RDS to S3. In the DMS task settings, I enabled the timestamp column and the Parquet file format. I want to use the CDC files stored in S3 to upsert into my data lake, so I needed to deduplicate the rows by getting the latest action applied to each record in the RDS table. But just like the problem you faced, I noticed that the timestamp column did not have high enough precision, so selecting rows with the max timestamp did not work, as it would return multiple rows. So I added a row_number column ordered by the timestamp column, partitioned by id, and selected MAX(row_number). This gave me the latest action from the CDC rows that was applied to my table.
table.withColumn("row_number", row_number().over(Window.partitionBy("table_id").orderBy("dms_timestamp")))
The above is PySpark code, as that's the framework I'm using to process my Parquet files, but you can do the same in SQL. I noticed that when the records are ordered by the timestamp column, they maintain their original order even if the timestamps are the same.
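For completeness, here is a fuller sketch of the same deduplication idea (the S3 path is a placeholder; table_id and dms_timestamp are the column names used above). It orders descending and keeps row_number == 1, which is equivalent to taking MAX(row_number) over an ascending order:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dms-cdc-dedup").getOrCreate()

# Placeholder path to the CDC Parquet files DMS wrote to S3
table = spark.read.parquet("s3://my-bucket/cdc/my_table/")

# Latest change per record: newest dms_timestamp first within each table_id
w = Window.partitionBy("table_id").orderBy(col("dms_timestamp").desc())

latest = (
    table.withColumn("row_number", row_number().over(w))
         .filter(col("row_number") == 1)
         .drop("row_number")
)

latest.show()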
Hope this helps with the sequencing logic you were trying to implement.

MYSQL Change Data Capture(CDC) - Azure Services (Azure data factory)

I want to perform ETL operations on the data tables of a MySQL database and store the data in Azure Data Warehouse. I do not have an updated-date column to identify modified records over time. How do I know which records were modified? Does MySQL support CDC?
Is it possible to read the MySQL binlogs (binary logs) using Azure services (Azure Data Factory)?
If you can put together a single-statement query that returns what you want, using whatever functions and joins are available to you, then you can put that into the sqlReaderQuery part of ADF.
Otherwise you might be able to use a stored procedure activity (sorry, I am not as familiar with MySQL as I am with ADF).
Do you have any column which is an increasing integer? If so, you can still use a Lookup activity + Copy activity + Stored Procedure activity to get an incremental load. More details here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
ADF does not have built-in support for CDC yet. You can do that through a custom activity in ADF with your own code.
In MySQL you have the option to add a timestamp column which updates automatically on row-level updates. CDC is not available, but to see the difference you can compare MAX(updatedate) in MySQL versus (>=) your own MAX(ETLDate) to get all the modified records, as sketched below.
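As a rough illustration of that comparison (the table and column names my_table and updatedate are assumptions, and mysql-connector-python is used here; in ADF the same filter would go into the source query):
from datetime import datetime
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="mysql-host", user="etl", password="***", database="mydb"  # placeholders
)
cur = conn.cursor(dictionary=True)

# Watermark from the previous ETL run (normally read from a control table)
last_etl_date = datetime(2019, 4, 25, 0, 0, 0)

# Rows touched since the last load; relies on updatedate auto-updating on row changes
cur.execute("SELECT * FROM my_table WHERE updatedate >= %s", (last_etl_date,))
changed_rows = cur.fetchall()

cur.close()
conn.close()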

AWS Data pipeline - how to use it for incremental RDS data updates?

I have a situation where I am using Data Pipeline to import data from CSV files stored in S3. For the initial data load, Data Pipeline executes well.
Now I need to keep this database up to date and synced with the on-premises DB. This means there will be a set of CSV files coming into S3 containing updates to existing records, new records, or deletions. I need those to be applied to RDS through Data Pipeline.
Question: is Data Pipeline designed for such a purpose, or is it just meant for one-off data loads? If it can be used for incremental updates, how do I go about it?
Any help is much appreciated!
Yes, you need to do an update and insert (aka upsert).
If you have a table with keys key_a and key_b and other columns col_c and col_d, you can use the following SQL:
insert into TABLENAME (key_a, key_b, col_c, col_d) values (?,?,?,?) ON DUPLICATE KEY UPDATE col_c=values(col_c), col_d=values(col_d)
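As a rough sketch of running that same upsert outside Data Pipeline (TABLENAME, the key/column names, and the sample rows come from the example above and are placeholders; mysql-connector-python uses %s placeholders instead of the JDBC-style ?):
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="rds-host", user="app", password="***", database="mydb"  # placeholders
)
cur = conn.cursor()

rows = [
    (1, "A", "value-c1", "value-d1"),
    (2, "B", "value-c2", "value-d2"),
]

# Inserts new rows; on a duplicate key (assuming key_a, key_b form the unique key)
# it updates col_c and col_d instead
cur.executemany(
    "INSERT INTO TABLENAME (key_a, key_b, col_c, col_d) VALUES (%s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE col_c = VALUES(col_c), col_d = VALUES(col_d)",
    rows,
)
conn.commit()
cur.close()
conn.close()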
Kindly refer to the aws documentation: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-incrementalcopyrdstos3.html
There is a predefined template for MySQL RDS incremental upload; I have personally tried incremental uploads from MySQL, SQL Server, and Redshift.
You can start by using the MySQL template and edit it in Architect view to get insight into the new/additional fields it uses, and likewise create a Data Pipeline for other RDS databases as well.
Internally, the incremental template requires you to provide the change column, which essentially needs to be a date column; this change column is then used in the SQL script, which looks like:
select * from #{table} where #{myRDSTableLastModifiedCol} >= '#{format(#scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' and #{myRDSTableLastModifiedCol} <= '#{format(#scheduledEndTime, 'YYYY-MM-dd HH-mm-ss')}'
scheduledStartTime and scheduledEndTime are Data Pipeline expressions whose values depend on your schedule.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-expressions.html
The schedule type is timeseries, so the SQL is executed at the schedule end time to guarantee that there is no data loss.
And yes, deleted data can't be tracked through Data Pipeline; Data Pipeline also would not help if the datetime column is not in your table, in which case I would prefer loading the full table.
I hope I have covered pretty much everything I know :)
Regards,
Varun R