Define the same output table across multiple transformations - MySQL

I have 6 different input datasets. I want to run ETL over all 6 datasets so they all get transformed to the same output table (same columns and types).
I am using Pentaho (Spoon) to do this.
Is there a way I can define an output table schema to be used by all these transformations in Pentaho? I am using MySQL as my output database.
Thanks in advance.

Sounds like you need the Select Values step. Put one of those on the last hop of each dataset's path and make the metadata for the paths all look EXACTLY the same. Then you can connect the output from each Select Values step into a Table Output. All the rows from each set will be mixed together in no particular order.
This can be more challenging than it looks. Spoon will throw errors if any field isn't exactly identical to the corresponding field in every other dataset. You'll have to find some way to get all the metadata from the datasets to be the same.
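If it helps to make the target concrete, the shared schema effectively lives in the MySQL table that the single Table Output step writes to, and each Select Values step's Meta-data tab has to emit exactly those fields. A minimal sketch of such a table, where every column name and type is a placeholder rather than anything from the question:

    -- Hypothetical shared output table; the real columns and types
    -- depend on what the six datasets have in common.
    CREATE TABLE IF NOT EXISTS unified_output (
        source_name VARCHAR(64)   NOT NULL,  -- which of the 6 inputs the row came from
        record_id   BIGINT        NOT NULL,
        event_date  DATE          NULL,
        amount      DECIMAL(18,6) NULL,
        description VARCHAR(255)  NULL
    );

With the table fixed, each transformation's Select Values step only has to rename, reorder and retype its fields to match it.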

Related

Pentaho Import unique records into database

I am quite new to Pentaho Spoon and I would like to import records from a CSV file into a database table. However, only unique records should be imported into the database table. That is why I need to compare EACH record with all records of the database table in order to determine if the record should be imported or not.
So far, I tried out the suggested CRUD-pattern which looks like this:
As you can see in the picture, I merge the Excel input and the table input (ignore the cast steps; I needed to cast a value because they differed in float format: the database format was #.000000 and the CSV float format was #.0).
After the merge join, I compare the flag (which is given by the Merge Rows (diff) step): if the compared records are new, I import them into the database table; if they are changed, I update the record; and if they are deleted or identical, I simply do nothing. So far, so good.
But here is the problem: if I shuffle the records of the CSV input file and run the transformation again, all the records are imported anew and consequently there are duplicates in my database table (which I wanted to avoid). To emphasize again: the right way to solve this is for each row of the CSV input file to be compared with ALL entries in the database table.
How can I realize this? Any suggestions? Thank you so much in advance!!
The Merge Rows (diff) step expects its inputs to be sorted. Normally, you would have been warned of this by a pop-up.
Put a Sort rows step on the output flow of the Excel Input, before it reaches the Merge Rows (diff).
You should do the same between the Table Input and the Merge Rows (diff). Of course, you may think you could do it in the SQL statement of the Table Input.
However, there is a beginner trap here. You have 3 other steps (Output Rows, Update and Delete) which operate on the same table, and these steps may lock it. As all the steps in Kettle run concurrently, you do not know which step will fire first, so the table may be locked and the Table Input may never be able to read even the first record. This is known in jargon as an auto-lock, and the way to avoid it is to put a Sort rows step in between as a buffer.
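For what it's worth, the Table Input query the answer alludes to would simply add an ORDER BY on the comparison keys; the table and column names below are assumptions, and as explained above a Sort rows step is still the safer buffer:

    -- Hypothetical Table Input query; the key column(s) must match the
    -- sort keys used on the CSV side of the Merge Rows (diff).
    SELECT id, amount, description
    FROM   target_table
    ORDER  BY id;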
You can use the 'Dimension lookup/update' control which provides the same functionality which you are trying to achieve.
Thanks,
Nilesh

SSIS - Reuse OLE DB source when matching Fact against lookup table twice

I am pretty new to SSIS and BI in general, so first of all sorry if this is a newbie question.
I have my source data for the fact table in a CSV, so I want to match the IDs against the surrogate keys in lookup tables.
The data structure in the csv is like this
... userId, OriginStationId, DestinyStationId,..
What I am trying to accomplish is to match the data against my lookup table. So what I am doing is:
Reading the lookup data using an OLE DB Source
Reading my CSV file
Sorting both inputs by the same field
Doing a left join by Id, in order to get the SK
This way, if there is no match (i.e. the surrogate key can't be found), I can redirect that row to a rejected CSV and handle it later.
Something like this:
(sorry for the Spanish!)
I am doing this for each dimension, so I can handle each one with different error codes.
Since OriginStationId and DestinyStationId are two values from the same dimension (they both match against the same lookup table), I wanted to know if there is a way to avoid reading the data from that table twice (I mean, not using two OLE DB sources to read the same table twice).
I tried adding a second output to the Sort but I am not allowed to. The same goes for adding another output from the OLE DB Source.
I see there is a "cache option"; is that the best way to go? (Although it would imply creating another OLE DB source anyway, right?)
The third option I thought of was joining by the two fields, but since there is only one field in the lookup table (the same field), I get an error when I try to map both columns from my CSV against the same column in my lookup table:
There are columns missing with the sort order 2 to 2
What is the best way to go about this?
Or am I thinking about this incorrectly?
If something was not clear, let me know and I'll update my question.
Any time you wish you could have multiple outputs from a component that only allows one, all you have to do is follow that component with the Multicast component, whose sole purpose is to split a Data Flow stream into multiple outputs.
Gonzalo, I have just used this article on how to derive columns for building a data warehouse:- How to Populate a Fact Table using SSIS (part 1).
Using this I built a simple package that reads a CSV file with two columns that are used to derive separate values from the same CodeTable. The CodeTable has two fields Id and Description.
The Data Flow has two "Lookup" tasks. The first one joins the attribute Lookup1 against the Description to derive its Id. The second joins the attribute Lookup2 against the Description to derive a different Id.
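In set-based terms, those two Lookup tasks amount to joining the same code table twice under different aliases. A rough sketch, where csv_staging and its Lookup1/Lookup2 columns stand in for the CSV input (CodeTable with Id and Description is from the answer above):

    -- Sketch only: two aliases of the same CodeTable resolve the two
    -- descriptions coming from the CSV into two separate Ids.
    SELECT s.*,
           ct1.Id AS CodeTableFirstId,
           ct2.Id AS CodeTableSecondId
    FROM   csv_staging AS s
    LEFT JOIN CodeTable AS ct1 ON ct1.Description = s.Lookup1
    LEFT JOIN CodeTable AS ct2 ON ct2.Description = s.Lookup2;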
Here is the Data Flow:-
Note the "Data Conversion" was required to convert the string attributes from the CSV file into "Unicode string [DT_WSTR]" so they could be joined to the nvarchar(50) description attribute in the table.
Here is the Data Conversion:-
Here is the first Lookup (the second one joins "Copy of Lookup2" to the Description):-
Here is the Data Viewer output with the two derived Ids, CodeTableFirstId and CodeTableSecondId:-
Hopefully I understand your problem and this is of use to you.
Cheers John

How to write 10 different query results with different columns and numbers of columns to a text file in MSSQL

I am doing a project to generate data extracts on a daily basis. I have ten different queries with different columns, and the number of columns also differs between them. The database is MS SQL Server 2008 R2, and I tried an SSIS package to accomplish this. I used a data source component, then a Sort, then fed the result of the Sort into a Merge and then to a text file. But I am getting an error when combining the results, saying the columns are different or something. Can anyone suggest a solution, or is there any other way to accomplish this?
thanks,
Sivajith
Can you please provide the error message? The Merge component can merge data flows with different numbers of columns by letting you select the input columns.
First, create a template .csv file which contains all the columns from the queries (i.e. if you have columns A, B, C in the first query, B, E, F in the second, B, X, Y in the third and so on, make sure your template file has A, B, C, E, F, X, Y).
Make 10 tasks (one for each query). As the source, use a SQL command and write your query. As the destination, use the template file created above. Make sure you uncheck "Overwrite data".
Use the same destination for all the queries.
This should do the trick. I am not sure I completely understood your question since it's a little bit vague.
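One way to picture what the template file buys you is the set-based equivalent: each query is padded out to the full column list before the results are stacked. The column letters come from the example above; the source names are made up:

    -- Each SELECT is padded with NULLs so all three produce the same
    -- column list (A, B, C, E, F, X, Y), just like the template file.
    SELECT A, B, C, NULL AS E, NULL AS F, NULL AS X, NULL AS Y FROM Query1Source
    UNION ALL
    SELECT NULL, B, NULL, E, F, NULL, NULL FROM Query2Source
    UNION ALL
    SELECT NULL, B, NULL, NULL, NULL, X, Y FROM Query3Source;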
Here is a reference that may help you a bit more:
SQL Server : export query as a .txt file
You will have to make sure you have a proper connection to the SQL Server and then run this as a PowerShell script or a .bat file. This can be scheduled to run daily as well.

How to get the input dataset from a Lookup transformation in SSIS

I have a flat file input dataset, and I need to use a Lookup transformation to compare 3 columns of the input against the reference dataset used in the Lookup transformation.
After the lookup operation is performed, I need to get the input columns as well, along with the reference dataset columns.
I googled but couldn't find any method to get the input dataset from the Lookup. The only solution left is a Merge Join, but I don't want to use a Merge Join.
The input columns are automatically added to the output pipelines of a Lookup transformation. You'll be able to reference them in subsequent transformations.
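Conceptually, the Lookup behaves much like a join that keeps every input column and appends only the reference columns you tick in the Lookup editor (a left join, if non-matches are ignored or redirected). A sketch with assumed table and column names:

    -- f.* carries the flat-file columns through unchanged; only the
    -- selected reference column (surrogate_key) is appended.
    SELECT f.*,
           r.surrogate_key
    FROM   flat_file_staging AS f
    LEFT JOIN reference_table AS r
           ON  r.col1 = f.col1
           AND r.col2 = f.col2
           AND r.col3 = f.col3;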

How do I merge two results in SSIS?

I have two Excel sources: the first gives me a date value and the second gives me a price value from an Excel sheet.
Now I need to insert these two values into one table. Please tell me how I can do this.
I have used a Merge Join, but it gives me an error that the input must be sorted, which I can't do since it is an Excel file.
Well, personally, I would put each Excel file into its own staging table. Then I would use a SQL query that joins the two tables as the source for my insert into the production tables.
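A sketch of that staging-table approach, assuming each Excel sheet has been loaded into its own table and that some common key (a hypothetical row_id here) links a date to its price; the question doesn't say what that key would be:

    -- Hypothetical staging tables loaded from the two Excel sources.
    INSERT INTO target_table (price_date, price)
    SELECT d.price_date,
           p.price
    FROM   staging_dates  AS d
    JOIN   staging_prices AS p
           ON p.row_id = d.row_id;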
After you get the input from each source, you have to sort it prior to merging it.
You can sort the input from an Excel source, or from any source, because the sort is performed with the data in memory. It's an element in the SSIS Toolbox.
Check this:
http://msdn.microsoft.com/en-us/library/ms137653.aspx
I'm pretty sure you can define a sort on an Excel source.