I am new to PDI (coming from SSIS) and I am having trouble handling variables.
I would like to perform this:
I would like to save the result of a SQL SELECT query into a variable.
For that reason I have created one job and two transformations, given that in Pentaho every step within a transformation is executed in parallel.
The first transformation is in charge of setting the variable, and the second transformation uses this result as an input.
But in the first transformation I am having trouble setting the variable. I do not understand where I have to instantiate this variable to implement the "set season variable" step, nor how to get the result in the next transformation.
If anyone knows about this, or could recommend a link with a good example, I'd really appreciate it.
This can indeed be confusing for SSIS users. In PDI, you don't create a recordset variable as you do in SSIS. Simply creating a job creates one for you. Each job has two different types of "Results". One for recordset rows and one for filenames.
These variables are not directly accessible; they are just part of the job. There are steps that interact with them directly. For example under the "Job" branch when you're creating a transform, there is a Get rows from results step and a Copy rows to results step. They work directly with the job's row results.
Be aware that you must manually manage the metadata for the results. This is a pain, but overall I find PDI's method of doing this more intuitive and easier than SSIS's, although I find SSIS more flexible in this regard.
There are also Get files from result and Set files in result. These interact with the job's built-in file results, which is simply a list of every file touched by any step configured in the job. On the job tab there are tasks that deal with it directly, such as Process result filenames, Add filenames to result and Delete filenames from result. These tasks operate on the built-in file results list for the job and provide an easy way to, say, archive all the files loaded by the transformation you just ran.
Be aware when using these steps that they record EVERY file touched by EVERY step in the job. If you look through most of the steps in transformations (data flows) that deal with files, there's usually an "Add files to results" checkbox that is checked by default. If you uncheck it, the step will not add the file names to the job's file results. You can also delete specific files from the file results with the Delete filenames from result step.
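As a mental model only (this is plain Python, not PDI's API), the job's built-in results can be pictured as one container holding a row list and a file list. The names JobResult, copy_rows_to_result and get_rows_from_result are hypothetical and merely mirror the step names mentioned above; the "season" value is made up.

```python
from dataclasses import dataclass, field


@dataclass
class JobResult:
    """Hypothetical stand-in for the results a PDI job carries around."""
    rows: list = field(default_factory=list)       # row results
    filenames: list = field(default_factory=list)  # file results


def copy_rows_to_result(result, rows):
    """Mimics 'Copy rows to result': a transformation publishes its rows."""
    result.rows = [dict(r) for r in rows]


def get_rows_from_result(result):
    """Mimics 'Get rows from result': a later transformation reads them back."""
    return [dict(r) for r in result.rows]


job_result = JobResult()

# Transformation 1: pretend this single row came from the SQL select.
copy_rows_to_result(job_result, [{"season": "2014"}])

# Transformation 2: in the real step the field name and type have to be
# declared again by hand (the metadata caveat above); here we just read the key.
season = get_rows_from_result(job_result)[0]["season"]
print(season)
```

The point of the sketch is that nothing is declared up front: the container exists as soon as the job does, and the two transformations only share data through it.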
From your Job, start a Transformation:
Overload transformation variable into global variable in your job and use it:
Related
I have a lot of databases (100+); each one has the same structure but a different connection.
I'm using Kettle to run a transformation in the different databases in order to create a data-warehouse.
How can I automate running the same transformation with different connections?
I already tried the approach from "Pass DB Connection parameters to a Kettle a.k.a PDI table Input step dynamically from Excel", but it only accepts a single row in the CSV.
Should I create a loop, or will I need to write a script?
Any help would be appreciated.
(Sorry for my English)
You can do it with a loop.
Do not fret, though: that is not hard to do in Pentaho.
First of all, you will use a JOB to create your loop:
START --> Transform_that_holds_parameters --> Transform_to_run_in_a_loop
As you can guess, your transformation that runs equally on each DB is the last one on this flow. But we need to set two Advanced flags on that Job Entry:
Execute for every input row?
Copy previous result to parameters?
Then we need to build our Transform_that_holds_parameters with the following structure:
Some_sort_of_input --> copy_rows_to_result
Here you will have to grab all the connection parameters from somewhere, be it an Excel file or a table in another database. But once you grab this data, be sure to have 1 row for each database you want to run your transformation in. Ok?
Connect that to the 'Copy rows to result' step. This step sends the data back to our JOB and, if you remember, our next transformation is set to 'Execute for every input row' and 'Copy previous result to parameters'.
Now, take note of the column names going into the last step of that transformation; you will need them in the next step.
Go back to our JOB, open the properties of Transform_to_run_in_a_loop, open the parameters and fill in the columns 'parameter' and 'stream column name' with the columns we just copied to the Result.
Inside your transformation, you will need to define the same parameters with exactly the same names, and use these parameters in your connection settings.
Done, now you will have the first transformation setting all parameters and the second one running for each database config you have.
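The behaviour of the two flags can be sketched in plain Python. This is only a conceptual model of what the job does, not PDI code; the column names, parameter names and host values are made up for illustration.

```python
# One row per database, as sent back by 'Copy rows to result'.
# Column names and values are hypothetical.
result_rows = [
    {"db_host": "host-a", "db_name": "sales_a"},
    {"db_host": "host-b", "db_name": "sales_b"},
]

# 'parameter' -> 'stream column name', as filled in on the job entry.
parameter_mapping = {"DB_HOST": "db_host", "DB_NAME": "db_name"}


def run_transformation(params):
    """Stand-in for Transform_to_run_in_a_loop; returns the connection it would use."""
    return f"{params['DB_HOST']}/{params['DB_NAME']}"


def execute_for_every_input_row(rows, mapping):
    runs = []
    for row in rows:  # 'Execute for every input row?'
        # 'Copy previous result to parameters?': map columns to named parameters.
        params = {name: row[column] for name, column in mapping.items()}
        runs.append(run_transformation(params))
    return runs


connections_used = execute_for_every_input_row(result_rows, parameter_mapping)
print(connections_used)
```

Each result row becomes one run of the transformation, and the mapping is why the parameter names inside the transformation have to match exactly.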
What is the main use of the recordset destination in SSIS? I heard that it is in-memory, so is the data held in the variable in a raw format? Can someone explain a real project use of the Recordset destination?
A recordset destination can be used for just about anything you can think of. One common use I hear about is consuming the recordset in a foreach loop. Say you want to export several "categories" from a transaction table. Perhaps you get a recordset of the categories that exist and then call a new dataflow to export each category as its own file. Or perhaps date ranges, months, etc.
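The category example can be sketched in Python as a two-stage pattern: first collect the list of categories, then loop over it and run one export per item. This only imitates the control flow; the sample rows, the export_category helper and the file names are hypothetical.

```python
import csv
import io

# Made-up transaction rows standing in for the transaction table.
transactions = [
    {"category": "books", "amount": 12.5},
    {"category": "games", "amount": 40.0},
    {"category": "books", "amount": 7.0},
]

# Stage 1: the "recordset" of categories that exist.
categories = sorted({t["category"] for t in transactions})


def export_category(category):
    """Stage 2 body: the per-category data flow, writing CSV text in memory."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["category", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(t for t in transactions if t["category"] == category)
    return buffer.getvalue()


# The foreach loop: one output "file" per category in the recordset.
exports = {f"{category}.csv": export_category(category) for category in categories}
print(sorted(exports))
```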
One way I use it is in a script task, to perform an action on the data that SSIS cannot do natively. I was using a script component, but this particular task ran into a concurrency issue. By dumping to a recordset I was able to use the recordset in a script task and do the logic in a way that avoided that issue.
Another script task use is to build and send HTML emails.
I suppose a use for it might be when you have 1 data flow to get 1 record set then do a bunch of non dataflow tasks and then use that as a source in another data flow task, but that is not something I have ever done.
I am using Pentaho Data Integration Software.
I am currently running a Pentaho Job as an ETL. I ETL data from multiple places and put it into a single database table. The schemas of all the places I ETL from are exactly the same. So, other than the database connections and a single 'variable' that stores where the data came from, the transformation in Pentaho is exactly the same for each one. So I have a job that runs each of these transformations.
The problem comes when I want to make a change: I need to change 6 transformations every time. What I want to do is somehow set something like a variable in Pentaho that tells it to run a single transformation 6 times, with different database connections and perhaps a single variable.
Is this possible?
Thanks in advance.
If I have understood your question correctly, you need to run the same transformation in a loop using a single KTR file (assuming there is only one database type).
PDI provides a step called "Copy rows to result", where you can store the credentials of your databases in multiple rows. The job then runs the transformation once per row, each time with a different connection (6 times in your case).
Note: I have assumed that you have only one database type (e.g. MySQL) but with different credentials.
Hope this helps :) I would be happy to provide you sample code in case you need it.
Well, why don't you use a job that will pass the host/user/password as variables? That way your whole data flow will be generic.
Hope this answer points you in the right direction!
I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a derived column object due to some character set issues (might be better methods, but it works)
Now "Step 4" is where I'm running into trouble: I'd only like to insert the values that have a corresponding match in my database. For instance, the XML has about 6000 records, and maybe 10 of those need to be matched and inserted. I'd rather insert only those than insert all 6000 and do the compare after the fact (which I could also do, but I was hoping there'd be another method). I was thinking that I might be able to run a SQL insert command within the OLE DB DESTINATION object where the ID value in the file matches, but I'm not 100% clear on how, or whether it's even possible. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer: I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange. I took the lookup object out to see whether it was causing this, and that seems to be the case. Is there any reason why it would run the entire flow twice with the addition of the lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no match output, and then connect only the match output to the "Navigator Staging Manager Funds" destination.
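What the Lookup does with its two outputs can be pictured with a small Python sketch. It only illustrates the routing logic, not SSIS itself; the IDs and fund names are made up.

```python
# IDs already present in the database (the Lookup's reference set); made-up values.
existing_ids = {101, 205, 333}

# Rows parsed from the XML file; made-up values.
incoming_rows = [
    {"id": 101, "name": "Fund A"},
    {"id": 150, "name": "Fund B"},
    {"id": 333, "name": "Fund C"},
]

match_output, no_match_output = [], []
for row in incoming_rows:
    # Route each row to exactly one of the two outputs.
    if row["id"] in existing_ids:
        match_output.append(row)
    else:
        no_match_output.append(row)

# Only match_output would be connected to the destination.
print([row["id"] for row in match_output])
```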
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is that the lookup would go against the existing destination, so the lookup returns, say, the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so a row that found a match would now get doubled. Since you are looking for existing rows, that usually implies you'd want to perform an update to an existing row. If that's the case, there is a specially designed transformation, the OLE DB Command, which is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and write all of your changed rows (updates) into a second staging-type table. After the data flow is complete, use an Execute SQL Task to perform a set-based update statement.
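The staging-table pattern can be sketched end to end with Python's built-in sqlite3 module. This is a minimal illustration under assumptions: the funds and funds_updates tables and their contents are hypothetical, and the UPDATE uses SQLite syntax, which differs from what you would write for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE funds (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE funds_updates (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO funds VALUES (10, 'old name');
""")

# Rows arriving from the data flow; made-up values.
incoming = [(10, "new name"), (11, "brand new")]
existing = {row[0] for row in conn.execute("SELECT id FROM funds")}

# Data flow: new rows go to the destination, changed rows to the staging table.
inserts = [row for row in incoming if row[0] not in existing]
updates = [row for row in incoming if row[0] in existing]
conn.executemany("INSERT INTO funds VALUES (?, ?)", inserts)
conn.executemany("INSERT INTO funds_updates VALUES (?, ?)", updates)

# After the data flow: one set-based update instead of one statement per row.
conn.execute("""
    UPDATE funds
    SET name = (SELECT u.name FROM funds_updates AS u WHERE u.id = funds.id)
    WHERE id IN (SELECT id FROM funds_updates)
""")
conn.commit()

final_rows = conn.execute("SELECT id, name FROM funds ORDER BY id").fetchall()
print(final_rows)
```

The single UPDATE plays the role of the Execute SQL Task; the row with id 10 is updated in place rather than doubled.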
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.
I need to query three different databases and dump them into csv files. It's the same procedure for the three databases. The only difference is the database and the name of the csv file. Can I do this without cutting and pasting? Is there a way to pass parameters to the data flow task?
Thanks!
Your flat file and db connection managers could have the connection string based on a package scoped variable.
Then use a foreach looping container to call your dataflow task. Configure the looping container with a foreach item enumerator and add the appropriate names to the collection.
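In Python terms, the idea looks roughly like the loop below: the enumerator supplies one (database, file name) pair per iteration, and both connection strings are derived from those values. This is only an analogy; the server name, database names and paths are made up, and the helper functions are hypothetical.

```python
# Items of the foreach item enumerator: (database, csv file name). Made-up values.
items = [
    ("SalesDb", "sales.csv"),
    ("HrDb", "hr.csv"),
    ("OpsDb", "ops.csv"),
]


def build_connection_strings(database, csv_name):
    """Mimics connection managers whose connection strings depend on variables."""
    db_conn = f"Data Source=myserver;Initial Catalog={database};Integrated Security=SSPI;"
    file_conn = f"exports/{csv_name}"
    return db_conn, file_conn


def data_flow_task(db_conn, file_conn):
    """Stand-in for the single data flow; reports what it would export."""
    return f"export {db_conn} -> {file_conn}"


# The loop container: the same data flow runs once per item.
log = [data_flow_task(*build_connection_strings(db, name)) for db, name in items]
for line in log:
    print(line)
```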
santiiiii's explanation covers the use case of downloading the data in one package execution. If you need to get the data at different times, you can use a conditional expression in a variable that gives you different file names and database connections based on the value supplied for the variable. You can then set the value of the variable in the SQL Server Agent Job, in the Set Values tab. This can give you more flexibility, but santiiiii's solution is definitely best if you want to process all three files at the same time.