Update the destination table - SSIS

I want to pull data from a source into a destination table. How can I insert rows that are not already in the table and update rows that already exist?

You could use a Lookup against the target to find the existing records: on a match, update; otherwise insert into the target.
Another approach is to use the MERGE statement.
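As a rough sketch of the MERGE approach (the table and column names here are just placeholders, and MERGE needs SQL Server 2008 or later):

MERGE dbo.TargetTable AS tgt
USING dbo.SourceTable AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED THEN
    -- row already exists in the target: update it
    UPDATE SET tgt.Col1 = src.Col1, tgt.Col2 = src.Col2
WHEN NOT MATCHED BY TARGET THEN
    -- row is missing from the target: insert it
    INSERT (BusinessKey, Col1, Col2)
    VALUES (src.BusinessKey, src.Col1, src.Col2);

On SQL Server 2005 you would need a separate UPDATE plus INSERT instead.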
thanks
prav

Use a Slowly Changing Dimension transform; see http://msdn.microsoft.com/en-us/library/ms141715.aspx

I would recommend CozyRoc's TableDifference component. I have used the predecessor from SQLBI.EU and it's very good.
I also recommend that instead of using a Command component to run individual updates against the rows detected as changed, you stream the updates to a staging table and then use a single UPDATE statement in an Execute SQL task to perform the update.
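A minimal sketch of that set-based update, assuming a hypothetical staging table dbo.Staging_Updates keyed the same way as the target:

-- the data flow writes the changed rows to dbo.Staging_Updates first,
-- then a single Execute SQL task runs:
UPDATE tgt
SET    tgt.Col1 = stg.Col1,
       tgt.Col2 = stg.Col2
FROM   dbo.TargetTable AS tgt
JOIN   dbo.Staging_Updates AS stg
       ON tgt.BusinessKey = stg.BusinessKey;

TRUNCATE TABLE dbo.Staging_Updates;

One set-based UPDATE like this is generally far cheaper than firing an individual UPDATE per row from the data flow.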

I found this webcast very helpful in learning some different methods of doing "upserts" with SSIS. You can download the samples referenced in the webcast and see working examples of exactly what you need. MSDN Architecture Webcast: Using SQL Server 2005 Integration Services to Populate a Kimball Method Data Warehouse (Level 200)


Sybase to MySQL automatic export

I have two databases: Sybase and MySQL. I need to export records to MySQL when they are inserted in Sybase, or export them on some scheduled event.
I've tried the output statement, but it cannot be used in triggers or procedures.
Any suggestions for solving this problem?
(Disclaimer: I've done similar things before, but by no means would I consider the answer below state of the art - just one possible approach. Google around for something like 'cross-database replication' or 'cross-RDBMS replication' to see who's done this before.)
I would first of all see if you can't get an ETL tool to do the job without too much work. There are free open-source ones, and even something like Microsoft SSIS might work with non-MS databases.
If not, I would split this into several steps:
1. Find an appropriate Sybase output command that exports a subset of rows from one or more tables. By subset I mean you need to be able to add a WHERE clause, not just do a full table dump.
2. Use an appropriate MySQL import script/command to load the data you got out of step #1. You may need to cycle back and forth between the two until you have something that works manually.
3. Write a Sybase trigger to insert lookup keys into a to-export table. You want to store at least the table name and the source Sybase row's keys for each inserted row. Use generic column names like key1_char, key2_char rather than the actual column names; that makes it easier to extend to other source tables as needed. Keep trigger processing as light as possible. (What about updates, by the way?) A sketch of such a trigger follows after this list.
4. Write a scheduled batch on the Sybase side to run step #1 for the rows flagged in #3.
5. Write a scheduled batch on the MySQL side to import, via #2, the results of #4. Or kick it off from #4.
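For step #3, a minimal sketch of such a trigger in Sybase T-SQL (orders, order_id and to_export are made-up names; store whichever key columns your tables actually have):

create trigger trg_orders_export
on orders
for insert
as
    -- record just the table name and key of each new row;
    -- the export batch does the heavy lifting later
    insert into to_export (table_name, key1_char, flagged_at)
    select 'orders', convert(varchar(40), i.order_id), getdate()
    from inserted i

The export batch in step #4 can then join to_export back to the source table to build its WHERE clause.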
Another approach is to do the #3 flagging bit as needed, but use it to drive one scheduled batch that SELECTs data from Sybase and INSERTs it into MySQL directly.
You'll have to pick up the data from Sybase's SELECT and bind it manually to the INSERT on the MySQL side, but you probably get finer control over what's going on and you don't have to juggle two batches. That's what I think a clever ETL tool would already be doing on your behalf. Any half-clever scripting language like PHP, Python or Ruby ought to handle it easily. This is especially important if you have things like surrogate/auto-generated keys.
Keep in mind that in both cases you'll have to either delete the to-export rows that you've successfully inserted or flag them as done.

Pentaho Kettle - How to produce update query based on result set?

I came across an insert query generator in Pentaho Spoon that writes input data to a text file in the form of a set of SQL statements.
I wonder if there is a similar method that generates UPDATE queries based on the input.
Well, if you need to update a table based on some key columns compared to your stream, you may use the Insert/Update step.
The downside is that it won't generate the statements in a file; it will execute the updates or inserts based on that comparison, and that's all.
Can you give more details about your scenario? We may work things out together.
Why do you need a file with UPDATE statements?
Can't we connect to the database and run the updates right away?
Sure, use the "Dynamic SQL Row" step.

Multiple Pentaho Transformations 'Variables?'

I am using Pentaho Data Integration Software.
I am currently running a Pentaho job as an ETL. I ETL data from multiple places and put it into a single database table. The schemas for all of the places I ETL from are exactly the same. So, other than the database connections and a single 'variable' that stores where the data came from, the transformation in Pentaho is exactly the same for each one. So I have a job that runs each of these transformations.
The problem comes in when I want to make a change: I need to change 6 transformations every time. What I want to do is somehow set something like a variable in Pentaho that tells it to run a single transformation 6 times, with different database connections and perhaps a single variable.
Is this possible?
Thanks in advance.
If I have understood your question correctly, you need to loop multiple transformations using a single KTR file (assuming there is only one database type).
PDI provides a step called "Copy Rows to Result", where you can store the credentials of your database in multiple rows, and for every run of the job it will use a different connection and run the transformation multiple times (6 in your case).
Note: I have assumed that you are using only one database type, e.g. MySQL, but with different credentials.
Hope this helps :) I would be happy to provide you with sample code in case you need it.
Well, why don't you use a job that will pass the host/user/password as variables? That way your whole data flow will be generic.
Hope this answer will lead you into the right direction!

Using existing data with Liquibase?

When using Liquibase, is there any way to use existing data to generate some of the data that is to be inserted?
For example, say I'd want to update a row with id 5, but I don't know up front that the id will be 5, as this is linked to another table from which I will actually be getting the id. Is there any way for me to tell Liquibase to get the id from a SELECT query?
I'm guessing this isn't really possible as I get the feeling Liquibase is really designed for a very structured non-dynamic approach, but it doesn't hurt to ask.
Thanks.
You cannot use the built-in changes to insert data based on existing data, but you can use the <sql> tag with insert statements that contain nested selects.
For example:
<changeSet>
<sql>insert into person (name, manager_id) values ('Fred', (select id from person where name='Ted'))</sql>
</changeSet>
Note: the SQL (and support for insert+select) depends on database vendor.
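The same trick should work for updates driven by existing data, for example (department here is just a hypothetical lookup table, and again the SQL is vendor-dependent):
<changeSet>
<sql>update person set department_id = (select id from department where name='Sales') where name='Fred'</sql>
</changeSet>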
It is possible to write your own custom refactoring class to generate SQL. The functionality is designed to support the generation of static SQL based on the changeset's parameters.
So... it is feasible to obtain a connection to the database, but the health warning attached to this approach is that the generated SQL is dynamic (your data could change) and tightly tied to your database instance.
An example of the problems this will cause is an inability to generate a SQL upgrade script for a DBA to run against a production database.
I've been thinking about this use-case for some time. I still don't know if liquibase is the best solution for this data management problem or whether it needs to be combined with an additional tool like dbunit.

How to version control data stored in MySQL

I'm trying to use a simple MySQL database but tweak it so that every field is backed up to an indefinite number of versions. The best way I can illustrate this is by replacing each and every field of every table with a stack of all the values that field has ever had (each of these values should be timestamped). I guess it's kind of like having customized version control for all my data.
Any ideas on how to do this?
The usual method for "tracking any changes" to a table is to add insert/update/delete trigger procedures on the table and have those records saved in a history table.
For example, if your main data table is "ItemInfo" then you would also have an ItemInfo_History table that got a copy of the new record every time anything changed (via the triggers).
This keeps the performance of your primary table consistent, yet gives you access to the history of any changes if you need it.
Here are some examples, they are for SQL Server but they demonstrate the logic:
My Repository table
My Repository History table
My Repository Insert trigger procedure
My Repository Update trigger procedure
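In MySQL, a minimal sketch of the same pattern might look like this (item_info and its columns are placeholders; you would add similar triggers for insert and delete):

-- history copy of the main table, plus a timestamp and the type of change
create table item_info_history (
  id int,
  name varchar(100),
  price decimal(10,2),
  change_type varchar(10),
  changed_at timestamp default current_timestamp
);

-- after every update, copy the new row values into the history table
create trigger item_info_after_update
after update on item_info
for each row
  insert into item_info_history (id, name, price, change_type)
  values (new.id, new.name, new.price, 'update');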
Hmm, what you're talking about sounds similar to a Slowly Changing Dimension.
Be aware that version control on arbitrary database structures is officially a rather Hard Problem. :-)
A simple solution would be to add a version/revision field to the tables, and whenever a record is updated, instead of updating it in place, insert a copy with the changes applied and the version number incremented. Then when selecting, always choose the record with the latest version. That's roughly how most such schemes are implemented (e.g. Wikimedia does it pretty much this exact way).
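A rough sketch of that versioned-row scheme (the article table and its columns are just illustrative):

create table article (
  id int not null,
  revision int not null,
  body text,
  updated_at timestamp default current_timestamp,
  primary key (id, revision)
);

-- an "update" of row 5 is really an insert of a new revision
insert into article (id, revision, body)
select id, max(revision) + 1, 'the new text'
from article
where id = 5
group by id;

-- reads always pick the latest revision
select *
from article
where id = 5
order by revision desc
limit 1;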
Maybe a tool can do that for you. Have a look at nextep designer:
https://github.com/christophefondacci/nextep-designer
With this IDE you will be able to take snapshots of your database structure and data and put them under version control. After this you can compute the differences between any two versions and generate the appropriate SQL to insert/update/delete your data.
Maybe this is an alternative way to achieve what you wanted.