I have a lot of databases (100+); each one has the same structure but a different connection.
I'm using Kettle to run a transformation against each of those databases in order to build a data warehouse.
How can I automate running the same transformation with different connections?
I already tried this: Pass DB Connection parameters to a Kettle a.k.a PDI table Input step dynamically from Excel, but it only accepts one row from the CSV.
Should I create a loop, or am I going to need to write a script?
Any help would be appreciated.
(Sorry for my English.)
You can do it with a loop.
But do not fret, it's not hard to do with Pentaho.
First of all, you will use a JOB to create your loop:
START --> Transform_that_holds_parameters --> Transform_to_run_in_a_loop
As you can guess, your transformation that runs the same way on each DB is the last one in this flow. But we need to set two Advanced flags on that job entry:
Execute for every input row?
Copy previous result to parameters?
Then we need to build our Transform_that_holds_parameters with the following structure:
Some_sort_of_input --> copy_rows_to_result
Here you will have to grab all connection parameters from somewhere, be it an Excel file or a table in another database. But once you grab this data, be sure to have 1 row for each database you want to run your transformation in. Ok?
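For example, if you keep the credentials in a small table in a metadata database (the table and column names here are only an assumption; adjust them to your own setup), the Some_sort_of_input could be a Table Input step with a query like this, producing one row per target database:
SELECT host, port, db_name, db_user, db_password
FROM etl_connections
WHERE active = 1;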
Connect that to the 'Copy rows to result' step; this step sends the data back to our JOB, and if you remember, our next transformation is set to 'Execute for every input row' and 'Copy previous result to parameters'.
Now, take note of the column names going into the last step of that transformation; you will need them in the next step.
Get back to our JOB and go to the properties of Transform_to_run_in_a_loop, open the Parameters tab, and fill in the 'Parameter' and 'Stream column name' columns with the columns we just copied to the result.
Inside your transformation, you will need to define the same parameters with exactly the same names, and use these parameters in your connection settings.
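For example, assuming you named the parameters HOST, PORT, DBNAME, DBUSER and DBPASS (mapped to the stream columns from the query above), the database connection dialog of that transformation would reference them like this:
Host Name:     ${HOST}
Port Number:   ${PORT}
Database Name: ${DBNAME}
User Name:     ${DBUSER}
Password:      ${DBPASS}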
Done, now you will have the first transformation setting all parameters and the second one running for each database config you have.
Imagine that you want to save in a variable the number of rows that were updated or deleted in a table.
These are the steps that I did:
First, in the Control Flow, I created a Data Flow Task.
Then, in the Data Flow, I created a source (in my case an Excel file), created two variables to count those rows (countDeleted and countUpdated), connected the variables to two Row Count transformations, and then connected my destination (OLE DB).
Now, in the Control Flow, what do I do?
Create an Execute SQL Task? Or a Script Task? What is the best way to do it? What code should I use?
Thanks for your help.
PS: I only have 4 weeks of SSIS experience, sorry for my noobieness :)
An OLE DB destination only inserts. It can't UPDATE or DELETE.
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS, you will eventually find advice to use the OLE DB Command to perform row-by-row deletes and inserts.
In my opinion this is to be avoided. It does not scale (it works fine for small recordsets, then fails for large recordsets), and it is difficult to maintain parameter mappings in the OLE DB Command. Although you should try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data, and use @@ROWCOUNT to capture the records updated.
For example:
Your existing described dataflow can be used to load into a table called StagingTable
Before your dataflow you should run an Execute SQL Task (This is in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working: repeatedly running your package clears the staging table, then loads Excel into it without creating duplicates.
This in itself is a challenge as Excel is a terrible data interchange format.
Once you have that working, you add an Execute SQL Task to the end that runs some SQL that deletes the records you want and captures the count. For example:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT @@ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save
yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable([Table],Operation,Type)
SELECT 'MyFinalTable','Delete', @@ROWCOUNT;
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do it in the database. Although it does depend on the person who will eventually have to maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward, code-based approach, as opposed to having to dig around in property pages and events and other places inside SSIS packages.
I assume that you're using an Execute SQL Task for the updates and deletes? As @Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can find the number of rows updated, inserted, or deleted in a table through an Execute SQL Task by using the ExecValueVariable property of this task. Set the variable that will hold the row count to this property and it will return the number of affected rows. Note that it will only return the number of rows impacted by the last statement in the Execute SQL Task, regardless of whether batches (i.e. GO separators) are in the component.
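As a quick illustration of that caveat (the table names below are hypothetical): if the Execute SQL Task ran both of these statements, the execution value would only reflect the DELETE, because it is the last statement.
UPDATE dbo.MyFinalTable SET Processed = 1 WHERE Processed = 0; -- this count is not reported
DELETE FROM dbo.StagingTable;                                  -- only this count is returned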
I am new to PDI (coming from SSIS) and I am having some trouble handling variables.
I would like to perform this:
I would like to save the result of a SQL SELECT query into a variable.
For that reason I have created one job and two transformations, given that in Pentaho every step of a transformation is executed in parallel.
The first transformation is in charge of setting the variable, and the second transformation is going to use this result as an input.
But in the first transformation I am having trouble setting the variable: I do not understand where I have to instantiate this variable in order to use the 'Set Variables' step. And then, how do I get this result in the next transformation?
If anyone knows about this, or if you could recommend any link with a good example, I'd really appreciate it.
This can indeed be confusing for SSIS users. In PDI, you don't create a recordset variable as you do in SSIS. Simply creating a job creates one for you. Each job has two different types of "Results". One for recordset rows and one for filenames.
These variables are not directly accessible; they are just part of the job. There are steps that interact with them directly. For example, under the "Job" branch when you're building a transform, there are a Get rows from result step and a Copy rows to result step. They work directly with the job's row results.
Be aware that you must manually manage the metadata for the results. This is a pain, but overall I find PDI's method of doing this more intuitive and easier than SSIS's, although I find SSIS more flexible in this regard.
There are also Get files from result and Set files in result. These interact with the job's built in file results. This is simply a list of every file touched by any step configured in the job. On the job tab there are tasks that deal with it directly such as Process result filenames, Add filenames to result and Delete filenames from results. These tasks operate on the built in file results list for the job and provide an easy way to, say, archive all the files loaded by the transform you just ran.
Be aware when using these steps that they record EVERY file touched by EVERY step in the job. If you look through most of the steps in transformations (data flows) that deal with files, there's usually an "Add files to result" checkbox that is checked by default. If you uncheck this, it will not add the file names to the job's file results. You can also delete specific files from the file results with the Delete filenames from result step.
From your Job, start a Transformation:
Then promote the transformation's variable to a global variable in your job and use it:
I am using Pentaho Data Integration Software.
I am currently running a Pentaho job as an ETL. I ETL data from multiple places and put it into a single database table. The schema for all of the places I ETL from is exactly the same. So, other than the database connection and a single 'variable' that stores where the data came from, the transformation in Pentaho is exactly the same for each one. So I have a job that runs each of these transformations.
The problem comes when I want to make a change: I need to change 6 transformations every time. What I want to do is somehow set something like a variable in Pentaho that tells it to run a single transformation 6 times, with different database connections and perhaps a single variable.
Is this possible?
Thanks in advance.
If I have understood your question correctly, you need to loop multiple transformations using a single KTR file (assuming there is only one database type).
PDI provides you with a step called "Copy rows to result", where you can store the credentials of your databases in multiple rows; for every run of the job, it will use a different connection and run the transformation multiple times (6 in your case).
Note: I have assumed that you have only one database type (e.g. MySQL) but with different credentials.
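For example, the input feeding "Copy rows to result" could be as simple as a Data Grid or CSV with one row per target database (the values below are hypothetical):
host,port,database,username,password
db01.example.com,3306,sales,etl_user,secret1
db02.example.com,3306,sales,etl_user,secret2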
Hope this helps :) I would be happy to provide you sample code in case you need it.
Well, why don't you use a job that will pass the host/user/password as variables? That way your whole data flow will be generic.
Hope this answer will lead you into the right direction!
Can somebody please help me transfer around 15 tables from one database to another database? At present I can do this one by one using a Data Flow Task, but then I need to do this task 15 times, which is very time consuming.
Why don't you just use a task? Maybe tasks->export is what you're looking for.
Otherwise you'll need to create separate blocks for each table or:
Create a variable of type object
Script Task: add all the table names to your list (see the SQL sketch after this list for one way to build the list without a Script Task).
Iterate over this object variable with a Foreach Loop Container.
Inside the loop, create a source from a variable. In this variable, specify the connection dynamically depending on the current loop value.
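If all 15 tables live in the source database, one alternative to the Script Task is an Execute SQL Task configured to return a full result set into the object variable, using a query like the following (a sketch; the schema filter is an assumption, adjust it to select the 15 tables you need):
SELECT name AS TableName
FROM sys.tables
WHERE SCHEMA_NAME(schema_id) = 'dbo';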
You can use an SSIS package: select the Transfer SQL Server Objects task from the SSIS toolbox. In the task, specify the source and destination servers and databases. Set CopyAllObjects to false. Under ObjectsToCopy, set CopyAllTables to true, or leave it false and pick the tables you want to copy from the list.
I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a Derived Column object due to some character set issues (there might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer: I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange. I took the Lookup object out to see whether it was causing it, and it seems that it is. Any reason why it would run this entire flow twice with the addition of the Lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no match output, and then only connect the match found connector to the "Navigator Staging Manager Funds".
I believe that answers what you've asked but I wonder if you're expressing the right desire? My assumption is the lookup would go against the existing destination and so the lookup returns the id 10 for a row. All of the out of the box destinations in SSIS only perform inserts, so that row that found a match would now get doubled. As you are looking for existing rows, that usually implies you'd want to perform an update to an existing row. If that's the case, there is a specially designed transformation, the OLE DB Command. It is the component that allows for updates. There is a performance problem with that component, it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and then write all of your changed rows (updates) into a second staging-type table. After the data flow is complete, then use an Execute SQL Task to perform a set based update statement.
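To make that last pattern concrete, here is a minimal sketch of the set-based update (the table and column names are assumptions; the data flow would have written the changed rows to the staging table first):
UPDATE tgt
SET    tgt.SomeColumn    = stg.SomeColumn,
       tgt.AnotherColumn = stg.AnotherColumn
FROM   dbo.DestinationTable AS tgt
INNER JOIN dbo.UpdateStaging AS stg
        ON stg.BusinessKey = tgt.BusinessKey;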
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.