I want to perform an ETL operation on the data tables of a MySQL database and store the data in an Azure data warehouse. I do not have an updated-date column to identify records that have been modified over time. How do I know which records have been modified? Does the MySQL database support CDC?
Is it possible to read the MySQL binlogs (binary logs) using Azure services (Azure Data Factory)?
If you can put together a single-statement query that will return what you want, using whatever functions and joins are available to you, then you can put that into the sqlReaderQuery part of ADF.
Otherwise you might be able to use a stored procedure activity (sorry, I'm not as familiar with MySQL as I am with ADF).
Do you have any column that is an increasing integer? If so, you can still use a Lookup activity + Copy activity + Stored Procedure activity to get an incremental load. More details are here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
ADF does not have built-in support for CDC yet. You can do that through a custom activity in ADF with your own code.
In MySQL you have the option to add a timestamp column that is updated automatically whenever the row is updated. CDC is not available, but to see the difference you can compare MAX(updatedate) on the MySQL side against (>=) your own MAX(ETLDate) to get all the modified records.
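A minimal sketch of that approach, assuming a hypothetical orders table and a last-ETL-date value kept on the warehouse side:

-- MySQL: add a last-modified column that is maintained automatically on every update.
ALTER TABLE orders
    ADD COLUMN updatedate TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP;

-- Incremental extract (e.g. for sqlReaderQuery): pull everything modified since the last load.
-- Substitute @last_etl_date with the MAX(ETLDate) you stored after the previous run.
SELECT *
FROM orders
WHERE updatedate >= @last_etl_date;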
I have an Azure Table storage where a few records are added every day (usually 3-5). There are days when no records are added, so the volume is very low. Here is the structure of the table with the pre-defined PartitionKey, RowKey, and Timestamp columns:
I need to query this table from Azure Data Factory for the previous day's records. So for the example data shown below, I could be querying for the 2019-04-25 data on 2019-04-26. The reason is that one day's staleness does not make a difference, and that way I don't have to worry about 'watermarks' etc. I simply query for the previous day's data and copy it to an identical Azure Table in Azure Data Lake Storage Gen 2.
I know that I need to specify a parameterized query based on the 'Timestamp' column for the previous day but do not know how to specify it.
Please advise.
You could set the query in the copy activity's Table storage source. For your needs, it should be something like:
time gt datetime'2019-04-25T00:00:00' and time le datetime'2019-04-26T00:00:00'
My sample data is as below:
Preview of the data is as below:
Please see some examples in this doc: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-table-storage#azuretablesourcequery-examples.
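If you would rather not hard-code the dates, a possible sketch of a rolling version of the same filter, assuming you filter on the built-in Timestamp property and build the query string with ADF's date expression functions (verify the exact syntax against the doc above):

Timestamp ge datetime'@{formatDateTime(adddays(utcnow(), -1), 'yyyy-MM-dd')}T00:00:00Z' and Timestamp lt datetime'@{formatDateTime(utcnow(), 'yyyy-MM-dd')}T00:00:00Z'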
I have a situation where I am using AWS Data Pipeline to import data from CSV files stored in S3. For the initial data load, Data Pipeline executes fine.
Now I need to keep this database up to date and synced with the on-premises DB. This means there will be a set of CSV files coming to S3 containing updates to existing records, new records, or deletions. I need those to be applied to RDS through Data Pipeline.
Question: is Data Pipeline designed for such a purpose, or is it just meant for one-off data loads? If it can be used for incremental updates, how do I go about it?
Any help is much appreciated!
Yes, you need to do an update and insert (aka upsert).
If you have a table with keys key_a, key_b and other columns col_c, col_d, you can use the following SQL:
INSERT INTO TABLENAME (key_a, key_b, col_c, col_d)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE col_c = VALUES(col_c), col_d = VALUES(col_d)
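For ON DUPLICATE KEY UPDATE to kick in, the target table needs a primary key (or unique index) covering the key columns. A minimal sketch, with hypothetical column types:

CREATE TABLE TABLENAME (
    key_a INT NOT NULL,
    key_b INT NOT NULL,
    col_c VARCHAR(255),
    col_d VARCHAR(255),
    PRIMARY KEY (key_a, key_b)  -- the duplicate-key check is driven by this key
);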
Kindly refer to the AWS documentation: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-incrementalcopyrdstos3.html
There is a predefined template for MySQL RDS incremental upload; I personally have tried incremental uploads from MySQL, SQL Server and Redshift.
You can start with the MySQL template and edit it in Architect view to get an insight into the new/additional fields it uses, and likewise create a Data Pipeline for other RDS databases as well.
Internally, the incremental copy requires you to provide a change column, which essentially needs to be a date column; this change column is then used in the SQL script, which looks like:
select * from #{table} where #{myRDSTableLastModifiedCol} >= '#{format(#scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' and #{myRDSTableLastModifiedCol} <= '#{format(#scheduledEndTime, 'YYYY-MM-dd HH-mm-ss')}'
scheduledStartTime and scheduledEndTime are Data Pipeline expressions whose values depend on your schedule.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-expressions.html
The schedule type is timeseries, so the SQL is executed at the end of the scheduled period to guarantee that there is no data loss.
And yes, deleted data can't be tracked through Data Pipeline; Data Pipeline also won't help if the datetime column is not present in your table, in which case I would prefer loading the full table.
I hope I have covered pretty much everything I know :)
Regards,
Varun R
My team and I want to use Elasticsearch on our project; however, we have a requirement that we don't want to use a local instance of MySQL for each node. We want to use a remote MySQL server to store the data that the Elasticsearch services are querying.
So the idea is that each time a new item is added on an ES node, it is not added to a local instance but to a remote MySQL server (we are thinking of Amazon RDS). And for a search query on any index, we want the ES node to query the remote database (on the RDS instance).
We tried to use river-jdbc in its two flavours: river (for pulling data) and feeder (for putting data on the RDS instance). But we were not able to make this work with river-jdbc.
Has anyone tried something similar? Or can anyone link to a blog where this was done?
I appreciate any help
Thanks in advance
We use a similar approach. We use an Oracle database as the primary datastore.
We use PL/SQL to flatten/convert data. For the initial load we add data (records) to a "oneshot" table. Updates of the data are flattened/converted and result in records in an "update" table. The oneshot and update tables are mapped to a single index in Elasticsearch.
Initial load of ES:
[Oracle DB]--->flatten data (pl sql)-->[records to animal_oneshot_river table, records to user_oneshot_river table]
The data will be pulled by the river to, for example, http://localhost:9200/zoo/animal and http://localhost:9200/zoo/user.
Updates
[Software]---->Change data--->[Oracle DB]--->flatten data (pl sql)-->[records to animal_update_river table, records to user_update_river table]
The update tables also contain the type of change (insert, update or delete).
The river will poll the update_river tables for updates and mutate the data in Elasticsearch (we use a pull model). The records are deleted after processing by the river.
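A minimal sketch of what such an update table could look like (Oracle DDL; the column names here are hypothetical, not the ones we actually use):

-- Flattened update table polled by the river; change_type tells the river whether
-- to index, update or delete the corresponding Elasticsearch document.
CREATE TABLE animal_update_river (
    doc_id      VARCHAR2(64)  NOT NULL,   -- Elasticsearch document id
    change_type VARCHAR2(10)  NOT NULL,   -- 'insert', 'update' or 'delete'
    payload     CLOB,                     -- flattened document body for insert/update
    created_at  TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL
);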
Data changes made in Elasticsearch won't be sent back to Oracle. All changes to the primary datastore are made by our own business logic software.
We also write data to _spare tables (animal_oneshot_river_spare) because that makes it possible to reload Elasticsearch without downtime and without synchronisation issues (we switch aliases after reloading Elasticsearch).
I'm getting data from an MSSQL DB ("A") and inserting it into a MySQL DB ("B") using the date created in the MSSQL DB. I'm doing it with simple logic, but there has to be a faster and more efficient way of doing this. Below is the sequence of steps involved:
Create one connection for MSSQL DB and one connection for MySQL DB.
Grab all of the data from A that meets the date range criterion provided.
Check to see which of the data obtained are not present in B.
Insert these new data into B.
As you can imagine, step 2 is basically a loop, which can easily max out the time limit on the server, and I feel like there must be a way of doing this much faster, ideally as part of the first query. Can anyone point me in the right direction? Can you make "one" connection to both of the DBs and do something like the below?
SELECT * FROM A.some_table_in_A.some_column WHERE
"it doesn't exist in" B.some_table_in_B.some_column
A linked server might suit this:
A linked server allows for access to distributed, heterogeneous queries against OLE DB data sources. After a linked server is created, distributed queries can be run against this server, and queries can join tables from more than one data source. If the linked server is defined as an instance of SQL Server, remote stored procedures can be executed.
Check out this HOWTO as well
If I understand your question right, you're just trying to move things in the MSSQL DB into the MySQL DB. I'm also assuming there is some sort of filter criteria you're using to do the migration. If this is correct, you might try using a stored procedure in MSSQL that can do the querying of the MySQL database with a distributed query. You can then use that stored procedure to do the loops or checks on the database side and the front end server will only need to make one connection.
If the MySQL database has a primary key defined, you can at least skip step 3 ("Check to see which of the data obtained are not present in B"). Use INSERT IGNORE INTO... and it will attempt to insert all the records, silently skipping over ones where a record with the primary key already exists.
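To give an idea of how the linked-server suggestion and the duplicate check can be combined in one set-based statement, here is a rough sketch. All object names and the date range are hypothetical, and it assumes a MySQL linked server named MYSQL_LINK has already been configured on the MSSQL side:

-- Run on MSSQL: push one date range to MySQL, skipping rows whose id already exists there.
DECLARE @start datetime = '2019-04-01', @end datetime = '2019-05-01';

INSERT INTO OPENQUERY(MYSQL_LINK, 'SELECT id, name, created_at FROM b_db.some_table')
SELECT a.id, a.name, a.created_at
FROM dbo.some_table AS a
WHERE a.created_at >= @start
  AND a.created_at <  @end
  AND NOT EXISTS (
        SELECT 1
        FROM OPENQUERY(MYSQL_LINK, 'SELECT id FROM b_db.some_table') AS b
        WHERE b.id = a.id
      );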
I have a reporting database and have to transfer data from it to another server, where we run some other reports or functions on the data. What is the best way to transfer data periodically, say monthly or bi-weekly? I can use SSIS, but is there any way I can put a WHERE clause on which rows should be extracted from the source database? For example, I only want to extract data for the current month. Please do let me know.
Thanks,
Vivek
For scheduling periodic extractions, I'd leave that to SQL Agent.
As for restricting the results by some condition, that's an easy thing. Instead of this (and you should always use SQL Command or SQL Command From Variable over Table Name/Table Name From Variable as they are faster)
Add a parameter. If you're using the OLE DB connection manager, your placeholder for a variable is ?. For ADO.NET it will be @parameterName.
Now, wire the filter up by clicking the Parameters... button. With OLE DB, it's ordinal position starting at 0. If you wanted to use the same parameter twice, you will have to list it each time or use the ADO.NET connection manager.
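For the "current month only" requirement from the question, a minimal sketch of such a parameterized source query, assuming an OLE DB source and a hypothetical ModifiedDate column (parameter 0 = first day of the current month, parameter 1 = first day of the next month, both supplied as SSIS variables):

SELECT *
FROM dbo.SourceTable
WHERE ModifiedDate >= ?
  AND ModifiedDate < ?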
The biggest question you will have to answer is how to identify which row(s) need to go. The possibilities are endless: query the target database and find the most recent modified date or the highest key value for a table. You could create a local table that tracks what's been sent and query that. You could perform an incremental load / ETL instrumentation to identify new/updated/unchanged rows, etc.
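For the "local table that tracks what's been sent" option, a minimal sketch (table and column names are hypothetical):

-- Tracks the high-water mark per source table; the package reads this before
-- extracting and updates it after a successful load.
CREATE TABLE dbo.ExtractLog (
    TableName     sysname      NOT NULL PRIMARY KEY,
    LastExtracted datetime2(0) NOT NULL
);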