About staging tables and merge - SSIS

I'm really new to the BI world, and some concepts are still unclear to me.
I'm reading some articles and books about this, but they are full of diagrams and flows that don't say much about how the process works in practice.
About the staging tables and the extraction process.
I know that the tables in the staging area need to be deleted after the flow has been executed.
Considering this, imagine a flow with an initial full extraction to the target database. Then, using a merge/CDC step, I need to identify what was updated in the source tables. Here is my doubt: how can I know what was updated, since my tables are in the target and the data in staging has been deleted?
Do I need to bring the data from the target tables back to the staging area and then do the merge?

Change Data Capture (CDC) is usually done on the source system, either with an explicit change-tracking field (a simple boolean flag or a timestamp) or automatically by the underlying database management system.
If you have a timestamp field in your data, you first do your initial load to staging, record the maximum timestamp retrieved, and then on the next run you only retrieve records whose timestamp is greater than your recorded value. This is the preferred way to do it if there's no real CDC functionality on the source system.
Using a boolean field is trickier, as every insert and update on the source system must set it to true, and after your extraction you'll have to reset it to false.
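As a minimal sketch of that timestamp (high-water-mark) approach, written in T-SQL since the question mentions SSIS, and assuming a source table src.Orders with a LastModified column, a staging table stg.Orders, and a control table etl.ExtractLog that remembers the last extracted value (all of these names are illustrative, not from the question):

-- 1. Read the high-water mark recorded by the previous run.
DECLARE @LastExtract DATETIME2;
SELECT @LastExtract = LastExtractedAt FROM etl.ExtractLog WHERE TableName = 'Orders';

-- 2. Pull only the rows changed since then into staging.
INSERT INTO stg.Orders (OrderId, CustomerId, Amount, LastModified)
SELECT OrderId, CustomerId, Amount, LastModified
FROM src.Orders
WHERE LastModified > @LastExtract;

-- 3. Record the new high-water mark for the next run.
UPDATE etl.ExtractLog
SET LastExtractedAt = (SELECT MAX(LastModified) FROM stg.Orders)
WHERE TableName = 'Orders';

The staging copy can then be merged into the target and truncated afterwards; only the recorded timestamp needs to survive between runs.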

Related

MySQL backup pieces of the database from a server

I'm writing the back-end for a web app in Spring and it uses a MySQL database on an AWS RDS instance to keep track of user data. Right now the SQL tables are separated by user groups (just a value in a column), so different groups have different access to data. Whenever a person using the app does a certain operation, we want to back up their part of the database, which can be viewed later, or replace their data in the current branch if they want.
The only way I can figure out how to do this is to create separate copies of every table for each backup and keep another table to keep track of what all the names of the tables are. This feels very inelegant and labor intensive.
So far all operations I do on the database are SQL queries from the server, and I would like to stay consistent with that.
Is there a nice way to do what I need?
Why would you want a separate table for each backup? You could have a single table that mirrors the main table but has a few additional fields to record some metadata about the change, for example the person making it, a timestamp, and the type of change (update or delete). Whenever a change is made, simply copy the old value over to this table and you will then have a complete history of the state of the record over time. You can still enforce the group-based access by keeping that column.
As for doing all this with queries, you will need some for viewing or restoring these archived changes, but the simplest way of maintaining the archived records is surely to create TRIGGERS on the main tables. If you add BEFORE UPDATE and BEFORE DELETE triggers, these can copy the old version of each record over to the archive (and also add the metadata at the same time) each time a record is updated or deleted.
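As a rough sketch, assuming a main table users_data(id, group_id, payload) and an archive table users_data_archive with the same columns plus changed_by, changed_at, and change_type (all names invented for illustration), the triggers could look like this:

DELIMITER //
CREATE TRIGGER users_data_before_update
BEFORE UPDATE ON users_data
FOR EACH ROW
BEGIN
    -- copy the old version of the row plus some metadata before it is overwritten
    INSERT INTO users_data_archive (id, group_id, payload, changed_by, changed_at, change_type)
    VALUES (OLD.id, OLD.group_id, OLD.payload, CURRENT_USER(), NOW(), 'UPDATE');
END//
CREATE TRIGGER users_data_before_delete
BEFORE DELETE ON users_data
FOR EACH ROW
BEGIN
    -- same idea for deletes
    INSERT INTO users_data_archive (id, group_id, payload, changed_by, changed_at, change_type)
    VALUES (OLD.id, OLD.group_id, OLD.payload, CURRENT_USER(), NOW(), 'DELETE');
END//
DELIMITER ;

Viewing or restoring a user's history then stays plain SQL: select from users_data_archive filtered by group_id and changed_at.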

Logging of data changes in MySQL tables using ADO.NET

Is there any workaround to get the latest change in a MySQL database using ADO.NET?
That is: which table and which column changed, the operation performed, and the old and new values, both for single-table and multi-table changes. I want to log the changes in my own new table.
There are several ways change tracking can be implemented for MySQL:
Triggers: you can add a DB trigger for insert/update/delete that creates an entry in the audit log.
Application logic: add code to your data layer to track changes. The implementation depends heavily on that layer; if you use an ADO.NET DataAdapter, the RowUpdating event is suitable for this purpose.
You also have the following alternatives for storing the audit log in the MySQL database:
Use one table for the whole audit log, with columns like id, table, operation, new_value (string), old_value (string); a rough sketch of this follows below. This approach has several drawbacks: the table grows very fast (as it holds the change history for all tables), it keeps values as strings, it stores redundant data duplicated between old/new pairs, and calculating the changeset takes some resources on every insert/update.
Use a 'mirror' table (say, with a '_log' suffix) for each table that has change tracking enabled. On insert/update you execute an additional insert into the mirror table; as a result you get record 'snapshots' on every save, and from these snapshots it is possible to work out what changed and when. The performance overhead on insert/update is minimal and you don't need to determine which values actually changed, but the mirror table will contain a lot of redundant data, since a full row copy is saved even if only one column changed.
Use a hybrid solution where record 'snapshots' are saved temporarily and then processed in the background to store the differences in an optimal way without affecting application performance.
There is no single best solution for all cases; everything depends on the concrete application requirements: how many inserts/updates are performed, how the audit log is used, and so on.
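To make the first storage option concrete, here is a hedged sketch of what that single audit-log table could look like in MySQL (column names are illustrative):

CREATE TABLE change_log (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    table_name VARCHAR(64) NOT NULL,                         -- which table was touched
    operation  VARCHAR(10) NOT NULL,                         -- 'INSERT', 'UPDATE' or 'DELETE'
    old_value  TEXT NULL,                                    -- serialized old row, NULL for inserts
    new_value  TEXT NULL,                                    -- serialized new row, NULL for deletes
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP  -- when the change happened
);

Whether the rows are written by triggers or from the RowUpdating handler, the table shape stays the same; only the writer changes.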

Best way for incremental load in SSIS

I am getting 600,000 rows daily from my source and I need to load them into the SQL Server destination as an incremental load.
Now, as the destination table size is likely to increase day by day, which would be the best approach for the incremental load? I have a few options in mind:
Lookup Task
Merge Join
SCD
etc..
Please suggest the option that will perform best for an incremental load.
Look at Andy Leonard's excellent Stairway to Integration Services series or Todd McDermid's videos on how to use the free SSIS Dimension Merge SCD component. Both will address how to do it right far better than I could enumerate in this box.
Merge join is a huge performance problem as it requires sorting of all records upfront and should not be used for this.
We process many multimillion-record files daily and generally place them in a staging table and do a hash compare against the data in our change-data-tracking tables to see if the data is different from what is in prod, and then only load the rows that are new or different. Because we do the comparison outside of our production database, we have very little impact on prod: instead of checking millions of records against prod, we are only dealing with the 247 that it actually needs. In fact, for our busiest server, all this processing happens on a separate server except for the last step that goes to prod.
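A hedged sketch of that hash-compare step, assuming a staging table stg.Customer and a reference table etl.CustomerHash that stores a hash of each row already loaded (the names, columns, and HASHBYTES inputs are illustrative, not the poster's actual schema):

-- Pick out only the staged rows that are new or whose hash differs from the stored one.
SELECT s.CustomerId, s.Name, s.Email
FROM stg.Customer AS s
LEFT JOIN etl.CustomerHash AS h
       ON h.CustomerId = s.CustomerId
WHERE h.CustomerId IS NULL                          -- brand-new row
   OR h.RowHash <> HASHBYTES('SHA2_256',
          CONCAT(s.Name, '|', s.Email));            -- changed row

Only the rows this query returns ever need to touch the production server.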
If you only need to insert them, it doesn't actually matter.
If you need something like 'if exists, update, else insert', I suggest creating an OLE DB Source where you query your 600,000 rows and check whether they exist with a Lookup task against the existing data source. Since the existing data source is (or tends to be) HUGE, be careful with the way you configure the caching mode. I would go with partial cache, with some memory limit, ordered by the ID you are looking up (this detail is very important because of the way the caching works).
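A set-based alternative that is not part of the original answer: once the 600,000 rows have landed in a staging table, the 'if exists, update, else insert' step can be a single T-SQL MERGE run after the data flow (dbo.Target and stg.Source are illustrative names):

MERGE dbo.Target AS t
USING stg.Source AS s
    ON t.Id = s.Id
WHEN MATCHED AND (t.Name <> s.Name OR t.Amount <> s.Amount) THEN
    -- row exists but something changed: update it
    UPDATE SET t.Name = s.Name,
               t.Amount = s.Amount
WHEN NOT MATCHED BY TARGET THEN
    -- row is new: insert it
    INSERT (Id, Name, Amount)
    VALUES (s.Id, s.Name, s.Amount);

This trades the Lookup's row-by-row caching concerns for one server-side join, which is often the simpler choice at this volume.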

Can I mirror changes in a database with homebrew code?

I have several MySQL databases and tables that need to be "listened to". I need to know what data changes and send the changes to remote servers that have local mirrors of the database.
How can I mirror changes in the MySQL databases? I was thinking of setting up MySQL triggers that write all changes to another table. This table has the database name, table name, and all of the columns. I'd then write custom code to transfer the changes and apply them periodically on the remote mirrors. Will this accomplish my need?
Your plan is 100% correct.
That extra table is called an "audit" or "history" table (there are subtle distinctions but you shouldn't much care - but you now have the "official" terms which you can use to do further research).
If the main table has columns A, B, C, then the audit table would have those plus three more: A, B, C, Operation, Changed_By, Change_DateTime (names are subject to your tastes and coding standards).
The "Operation" column stores whether the change was an insert, a delete, or the old or new value of an update (frequently it's 3 characters wide, with the operations stored as "INS"/"DEL"/"U_D"/"U_I", but there are other approaches).
The data in the audit table is populated via a trigger on the main table.
Then make sure there's an index on Change_DateTime column.
And to find a list of changes, you keep track of when you last polled, and then simply do
SELECT * FROM Table_Audit WHERE Change_DateTime > 'LAST_POLL_TIME'
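As a minimal sketch (using the answer's placeholder columns A, B, C; adapt the types to your real schema), the audit table and index could be created like this:

CREATE TABLE Table_Audit (
    A VARCHAR(255),
    B VARCHAR(255),
    C VARCHAR(255),
    Operation       CHAR(3)     NOT NULL,   -- 'INS', 'DEL', 'U_D' or 'U_I'
    Changed_By      VARCHAR(64) NOT NULL,
    Change_DateTime DATETIME    NOT NULL
);

CREATE INDEX idx_table_audit_change_datetime
    ON Table_Audit (Change_DateTime);

The triggers on the main table insert into Table_Audit, and with that index the polling query above becomes an index range scan rather than a full table scan.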
You can tell MySQL to create an incremental backup from a specific point in time. The data contains only the changes to the database since that time.
You have to turn on binary logging and then you can use the mysqlbinlog command to export the changes since a given timestamp. See the Point-in-Time (Incremental) Recovery section of the manual as well as the documentation for mysqlbinlog. Specifically, you will want the --start-datetime parameter.
Once you have the exported log in text format, you can execute it on another database instance.
As soon as you step outside the mechanisms of the DBMS to accomplish an inherently DB oriented task like mirroring, you've violated most of the properties of a DB that distinguish it from an ordinary file.
In particular, the mechanism you propose violates the atomicity, consistency, isolation, and durability that MySQL is built to ensure. For example, incomplete playback of the log on the mirrors will leave your mirrors in a state inconsistent with the parent DB. What you propose can only approximate mirroring, thus you should prefer DBMS intrinsic mechanisms unless you don't care if the mirrors accurately reflect the state of the parent.

MySQL table modified timestamp

I have a test server that uses data from a test database. When I'm done testing, it gets moved to the live database.
The problem is, I have other projects that rely on the data now in production, so I have to run a script that grabs the data from the tables I need, deletes the data in the test DB and inserts the data from the live DB.
I have been trying to figure out a way to improve this model. The problem isn't so much in the migration, since the data only gets updated once or twice a week (without any action on my part). The problem is having the migration take place only when it needs to. I would like to have my migration script include a quick check against the live tables and the test tables and, if need be, make the move. If there haven't been updates, the script quits.
This way, I can include the update script in my other scripts and not have to worry if the data is in sync.
I can't use timestamps. For one, I have no control over the tables on the live side once they go live, and also because it seems a bit silly to bulk up the tables just for convenience.
I tried doing a "SHOW TABLE STATUS FROM livedb", but because the tables are all InnoDB there is no "Update Time"; plus, it appears that the "Create Time" was this morning, leading me to believe that the database is backed up and re-created daily.
Is there any other property in the table that would show which of the two is newer? A "Newest Row Date" perhaps?
In short: Make the development-live updating first-class in your application. Instead of depending on the database engine to supply you with the necessary information to enable you to make a decision (to update or not to update ... that is the question), just implement it as part of your application. Otherwise, you're trying to fit a round peg into a square hole.
Without knowing what your data model is, and without understanding at all what your synchronization model is, you have a few options:
Compare primary keys between the live database and the test database. When the test IDs are ahead of the live IDs, do an update.
Use timestamps in a table to determine if it needs to be updated
Use the md5 hash of a database table and modification date (UTC) to determine if a table has changed.
Long story short: Database synchronization is very hard. Implement a solution which is specific to your application. There is no "generic" solution which will work ideally.
If you have an autoincrement in your tables, you could compare the maximum autoincrement values to see if they're different.
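A hedged sketch of that comparison, assuming both schemas (livedb and testdb) are reachable from the same connection and that the table has an auto-increment column named id (table and column names are made up):

SELECT
    (SELECT MAX(id) FROM livedb.orders) AS live_max_id,
    (SELECT MAX(id) FROM testdb.orders) AS test_max_id,
    (SELECT MAX(id) FROM livedb.orders) <> (SELECT MAX(id) FROM testdb.orders) AS needs_sync;

If needs_sync comes back as 1, the migration script runs; otherwise it quits. Note that this only detects inserts, not updates to existing rows, which is why the hash/checksum option above may also be worth considering.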
But which version of MySQL are you using?
Rather than rolling your own, you could use a preexisting solution for keeping databases in sync. I've heard good things about SQLYog's SJA (see here). I've never used it myself, but I've been very impressed with their other programs.