I am using AWS DMS to migrate data from MySQL (source endpoint) to S3 (target endpoint).
I want to track updates from the source, so during configuration I enabled the TimestampColumnName property (column name: event_timestamp).
In the result (listed below), I am getting the timestamp of the records/events, but NOT with microsecond precision.
I want microsecond precision so I can build sequencing logic on top of it.
I have investigated the properties of both the source and target endpoints but am not getting the desired result. Here is the sample output:
Can somebody take a look and suggest if I am missing any property?
The output format for my files in S3 is Parquet.
Unfortunately, the DATETIME column added by the AWS DMS S3 TimestampColumnName setting for a change data capture (CDC) load from a MySQL source will have only second precision.
This is because the transaction timestamp in the MySQL binary log only has seconds.
The simplest solution is to add a new column to the MySQL table: a timestamp with microsecond precision whose value is set automatically on insert and/or update, and use this column as event_timestamp:
ts TIMESTAMP(6) DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6)
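For example, adding such a column to an existing table could look like the sketch below (the table name orders is just a placeholder for your own table; the column definition follows the line above):

-- hypothetical example: microsecond-precision column maintained automatically by MySQL
ALTER TABLE orders
  ADD COLUMN ts TIMESTAMP(6)
    NOT NULL
    DEFAULT CURRENT_TIMESTAMP(6)
    ON UPDATE CURRENT_TIMESTAMP(6);

Note that existing rows will be backfilled with the time of the ALTER, so the microsecond ordering is only meaningful for changes made after the column is added.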
Also, check that in the AWS DMS S3 target settings ParquetTimestampInMillisecond is false (or not present/unset; false is the default).
The AWS DMS S3 TimestampColumnName setting adds a timestamp column to the output.
For a 'static' read (full load), it generates the current timestamp:
For a full load, each row of this timestamp column contains a timestamp for when the data was transferred from the source to the target by DMS.
For CDC, it reads the transaction time from the database transaction log:
For a change data capture (CDC) load, each row of the timestamp column contains the timestamp for the commit of that row in the source database.
And its precision will be that of the timestamp in the database transaction log:
...the rounding of the precision depends on the commit timestamp supported by DMS for the source database.
CDC mode is essentially replication. The source database must be configured appropriately to write such a transaction log; the database writes transaction info to this log along with the transaction/commit timestamp.
In the case of MySQL this is the binary log, and the MySQL binlog timestamp is only 32 bits: just seconds.
Also, this transaction timestamp may not always match the actual order of transactions or the order in which changes were actually committed (link 1, link 2).
This question is over a year old but I faced the same/similar issue and thought I'd explain how I solved it in case it could help others.
I have tables in RDS and am using DMS to migrate them from RDS to S3. In the DMS task settings, I enabled the timestamp column and Parquet file format. I want to use the CDC files that get stored in S3 to upsert into my data lake, so I needed to deduplicate the rows by getting the latest action applied to a specific record in the RDS table. But just like the problem you faced, I noticed that the timestamp column did not have high precision, so selecting rows with the max timestamp did not work: it would return multiple rows. So I added a new row_number column, partitioned by id and ordered by the timestamp column, and selected MAX(row_number). This gave me the latest action from the CDC rows that was applied to my table.
table.withColumn("row_number", row_number().over(Window.partitionBy("table_id").orderBy("dms_timestamp")))
The above is PySpark code, as that's the framework I'm using to process my Parquet files, but you can do the same in SQL. I noticed that when the records are ordered by the timestamp column, they maintain their original order even if the timestamps are the same.
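For reference, a rough SQL equivalent of that deduplication is sketched below; cdc_rows, table_id, and dms_timestamp mirror the names in the PySpark snippet and are assumptions about your schema. Ordering descending and keeping row number 1 is equivalent to ordering ascending and taking MAX(row_number):

-- keep only the latest CDC action per record id, based on the DMS timestamp column
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY table_id
                              ORDER BY dms_timestamp DESC) AS rn
    FROM cdc_rows t
) ranked
WHERE rn = 1;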
Hope this helps with the sequencing logic you were trying to implement.
Related
I have an Azure Table storage table where a few records are added every day (usually 3-5). There are days when no records are added, so the volume is very low. Here is the structure of the table, with the predefined PartitionKey, RowKey, and Timestamp columns:
I need to query this table from Azure Data Factory for the previous day's records. For the example data shown below, I could be querying for 2019-04-25 data on 2019-04-26. The reason is that one day's staleness does not make a difference, and that way I don't have to worry about 'watermarks', etc. I simply query the data for the previous day and copy it to an identical Azure table in Azure Data Lake Storage Gen2.
I know that I need to specify a parameterized query based on the 'Timestamp' column for the previous day but do not know how to specify it.
Please advise.
You could set a query in the copy activity's Table storage source. For your needs, it should be something like:
time ge datetime'2019-04-25T00:00:00' and time lt datetime'2019-04-26T00:00:00'
My sample data as below:
Preview data as below:
Please see some examples in this doc: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-table-storage#azuretablesourcequery-examples.
I want to perform ETL operations on the data tables of a MySQL database and store the data in Azure data warehouse. I do not have an updated-date column to identify records modified over a period. How do I come to know which records were modified? Does MySQL support CDC?
Is it possible to read the MySQL binlogs (binary logs) using Azure services (Azure Data Factory)?
If you can put together a single-statement query that returns what you want, using whatever functions and joins are available to you, then you can put it into the sqlReaderQuery part of the ADF copy activity.
Otherwise you might be able to use a stored procedure activity (sorry, I'm not as familiar with MySQL as I am with ADF).
Do you have any column that is an increasing integer? If so, you can still use a Lookup activity + Copy activity + Stored Procedure activity to get an incremental load. More details are here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
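The pattern in that tutorial boils down to a watermark query along the lines of the sketch below (source_table, id_column, and the literal watermark values are placeholders; in the tutorial the old and new watermarks are supplied by Lookup activities):

-- hypothetical incremental-load query: pull only the rows added since the last run
SELECT *
FROM source_table
WHERE id_column > 1000    -- old watermark stored from the previous run
  AND id_column <= 2000;  -- new watermark captured at the start of this run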
ADF does not have built-in support for CDC yet. You can do that through a custom activity in ADF with your own code.
In MySQL you have the option to add a timestamp column that updates automatically at row level on every update. CDC is not available, but to see the difference you can compare MAX(updatedate) in MySQL against (>=) your own MAX(ETLDate) to get all the modified records.
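A sketch of that comparison (src_table, modified_time, and etl_audit.etl_date are hypothetical names, not your actual schema):

-- hypothetical incremental extract: fetch every row modified since the last ETL run
SELECT *
FROM src_table
WHERE modified_time >= (SELECT MAX(etl_date) FROM etl_audit);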
I am working with an HBase database and using Apache Phoenix to access HBase with normal SQL queries.
I have two columns in a table that hold the current UTC timestamp, one as VARCHAR and one as DATE. After loading some data, when I query HBase back I get strange results for the event timestamp column, which is of DATE type.
Event UTC (Date): 2017-01-13 16:36:59.0
Event UTC (varchar): 2017-01-13 21:36:59
The above two values should be identical, but for every record the Event UTC (Date) column gives me the wrong result when queried back, i.e. it is exactly 5 hours behind.
I don't know where this problem is coming from. I am not saving any timezone info, and I am aware that java.util/java.sql Timestamp doesn't store any timezone info, but I am really confused by the result set data when running a query. Please help me resolve this issue.
Most likely it is because of the client's local time zone.
From the official docs:
Timestamp type:
the internal representation is based on a number of milliseconds since the epoch (which is based on a time in GMT), while java.sql.Timestamp will format timestamps based on the client's local time zone.
I have a situation where I am using Data Pipeline to import data from CSV files stored in S3. For the initial data load, Data Pipeline executes fine.
Now I need to keep this database up to date and synced with the on-premise DB. This means there will be sets of CSV files coming into S3 that contain updates to existing records, new records, or deletions. I need those applied to RDS through Data Pipeline.
Question: Is Data Pipeline designed for such a purpose, or is it just meant for one-off data loads? If it can be used for incremental updates, how do I go about it?
Any help is much appreciated!
Yes, you need to do an update and insert (aka upsert).
If you have a table with keys key_a, key_b and other columns col_c, col_d, you can use the following SQL:
insert into TABLENAME (key_a, key_b, col_c, col_d) values (?,?,?,?) ON DUPLICATE KEY UPDATE col_c=values(col_c), col_d=values(col_d)
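For ON DUPLICATE KEY UPDATE to detect an existing row, the key columns must be covered by a primary or unique key. A minimal sketch of such a table, reusing the placeholder names from the statement above (the column types are assumptions):

-- the upsert above relies on a primary (or unique) key over key_a, key_b
CREATE TABLE TABLENAME (
    key_a INT NOT NULL,
    key_b INT NOT NULL,
    col_c VARCHAR(255),
    col_d VARCHAR(255),
    PRIMARY KEY (key_a, key_b)
);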
Kindly refer to the AWS documentation: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-incrementalcopyrdstos3.html
There is a predefined template for MySQL RDS incremental upload; I personally have tried incremental uploads from MySQL, SQL Server, and Redshift.
You can start with the MySQL template and edit it in Architect view to get insight into the new/additional fields it uses, and likewise create data pipelines for other RDS databases as well.
Internally, the incremental copy requires you to provide a change column, which essentially needs to be a date column; this change column is then used in the SQL script, which looks like:
select * from #{table} where #{myRDSTableLastModifiedCol} >= '#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' and #{myRDSTableLastModifiedCol} <= '#{format(@scheduledEndTime, 'YYYY-MM-dd HH-mm-ss')}'
@scheduledStartTime and @scheduledEndTime are Data Pipeline expressions whose values depend upon your schedule.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-expressions.html
The schedule type is timeseries, so the SQL is executed at the end of the scheduled period to guarantee that there is no data loss.
And yes, deleted data can't be tracked through Data Pipeline; Data Pipeline also would not help if the datetime column is not present in your table, in which case I would prefer loading the full table.
I hope I have covered pretty much everything I know :)
Regards,
Varun R
I have two data environments: 1) a data source, and 2) a production database powering a website. These two environments are in two different timezones.
I am updating my production database incrementally using:
1. mysqldump - for syncing newly added records
2. SQLyog SJA - for syncing updated records.
I have a column named modified_time (modified_time timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP) in each table to store the last modified time.
While syncing this data between the two timezones, I am not able to change the timezone.
I wanted to know how I can convert from the source timezone to the target timezone while syncing.
This is not possible at the DB level, and even if it were possible it would be inefficient. I would say deal with it in your application; it's simple: all the data is in a different timezone, so you just need to shift it by a constant to get your time.
Again, if the source data is using UTC (which is recommended) then you don't have any issue at all.
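To illustrate the 'shift by a constant' idea, here is a hedged sketch of applying a fixed offset when the application reads the synced data back (synced_table, its columns, and the 5-hour offset are assumptions about your setup):

-- hypothetical read query: shift the synced timestamp by a fixed offset into the target timezone
SELECT id,
       modified_time,
       DATE_ADD(modified_time, INTERVAL 5 HOUR) AS modified_time_local
FROM synced_table;

Keep in mind that a fixed offset ignores daylight-saving changes, which is one more reason storing UTC at the source, as recommended above, is the cleaner fix.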