Performing Merge-Join Using Derived Column as Join Key - ssis

I have created a derived column in my data flow that is the simple concatenation of two columns. I have done this in two separate data sources. I then want to perform a merge join with my newly derived column as the outer join key. However, it doesn't seem possible to accomplish this. Does anyone have experience with something like this?
The issue stems from the fact that I am unable to set a "Sort Key Position" on my newly created column, as this is specified at the source. It is not possible to set this at the Derived Column transformation.

You would need to add a Sort component between the Derived Column and the Merge Join, sorting on the column(s) introduced in the Derived Column components.
While Excel is definitely going to need the Derived Column + Sort to make this work, I have not run into a situation where I could express an idea in the SSIS expression language that I could not also express in T-SQL. If you can push the logic into the source query, it will simplify your package as well as speed up execution.
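For a relational source, a sketch of that approach (table and column names hypothetical): build the join key in the source query and sort on it there, then mark the source output as sorted (IsSorted = True) and set SortKeyPosition = 1 on JoinKey in the Advanced Editor.
SELECT
    S.ColumnA
    , S.ColumnB
    , S.ColumnA + S.ColumnB AS JoinKey -- the concatenation, moved out of the Derived Column
FROM
    dbo.SourceTable AS S
ORDER BY
    JoinKey;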
Also, it's been my experience that a Lookup component is most often the tool people want rather than a Merge Join. If I'm augmenting an existing row, Lookup. If I need one row to be able to generate zero to many rows, then a Merge Join may be appropriate.
I know you had a lookup question earlier. Excel can act as a Lookup source if you use the Cache Connection Manager: see Excel Source as Lookup Transformation Connection.

Related

Both inputs of the transformation must contain at least one sorted column, and those columns must have matching metadata ssis

I have created an SSIS package and I used Merge Join to join a Dimension with the result of another Merge Join and I got the following error:
Both inputs of the transformation must contain at least one sorted column, and those columns must have matching metadata
I found that the issue was related to the data types of the two sorted columns; I just made a conversion to turn both of them into INT and everything works fine.
The message is pretty clear. SSIS merge operations require that the data being compared is sorted, so comparisons are faster.
Make sure that you are retrieving ordered data from your database using an ORDER BY clause (if on SQL), then mark the source output as sorted via its IsSorted property and give each key column its corresponding SortKeyPosition.
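For example, a sketch of a source query that both aligns the key's data type and returns ordered data (names hypothetical):
SELECT
    CAST(S.SourceKey AS INT) AS JoinKey
    , S.OtherColumn
FROM
    dbo.SomeTable AS S
ORDER BY
    JoinKey;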
If you can't have the data ordered at the source, you can add a Sort operation in SSIS which will sort the merging columns (before the actual merge). You will have to do this on both flows before the merge. Please be advised that this component will block the data flow until all rows are sorted.
The Merge error message will go away once you join both data flows with sorted columns.

Columns changing to Null after passing through a Merge Join Transformation

I have a project where I am migrating data from a DB2 table over to a SQL table.
I pull the data in, then pass the dataset through a Script Transformation to hash each row (to be used later for detecting any updates that need to be made). After the Script Transformation, I sort both datasets and pass them into a Merge Join Transformation.
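(For context, the per-row hash the script produces is conceptually the same as computing one in the SQL source query; a sketch with hypothetical column names, using HASHBYTES:)
SELECT
    CustomerId
    , Amount
    , OrderDate
    , HASHBYTES('SHA2_256',
        CONCAT(CustomerId, '|', Amount, '|', OrderDate)) AS SqlHash -- hypothetical columns
FROM
    dbo.SqlTable;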
Here's the problem that I'm running into.
After passing the sorted datasets into the Merge Join, the resulting left outer join dataset that emerges contains a NULL value for each record in the dataset... and I don't know why.
Here is a picture of my Merge Join Transformation Editor:
I enabled a data viewer earlier in my project to verify that hashes are being generated for both the Host table and the SQL table. Everything works fine up until after it passes through the Merge Join Transformation.
I have a similar project that does the exact same thing with a different table. Both are modeled and designed the same, yet this is the only one having this particular hiccup with the SQL Hash column.
Does anyone have any thoughts that could help me troubleshoot this?
My apologies again. After doing some digging, apparently EVERY single row that emerged from the Merge Join was a new row and I was searching for the new rows by the wrong column name. Logic error on my part.

SSIS - Reuse Ole DB source when matching Fact against lookup table twice

I am pretty new to SSIS and BI in general, so first of all sorry if this is a newbie question.
I have my source data for the fact table in a csv, so I want to match the ids against the surrogate keys in lookup tables.
The data structure in the csv is like this
... userId, OriginStationId, DestinyStationId,..
What I am trying to accomplish is to match the data against my lookup table. So what I am doing is
Reading Lookup data using OLE DB Source
Reading my csv file
Sorting both inputs by the same field
Doing a left join by Id, in order to get the SK
This way, if there is no match (aka can't find the surrogate key) I can redirect that to a rejected csv and handle it later.
something like this:
(sorry for the Spanish!)
I am doing this for each dimension, so I can handle each one with different error codes.
Since OriginStationId and DestinyStationId are two values from the same dimension (they both match against the same lookup table), I wanted to know if there's a way to avoid reading the data from the table twice (I mean, not using two OLE DB sources to read the same table).
I tried adding a second output to the Sort, but I am not allowed to. The same goes for adding another output from the OLE DB Source.
I see there's a "cache option"; is that the best way to go? (Although it would imply creating another OLE DB source anyway... right?)
The third option I thought of was joining by the two fields, but since there is only one field in the lookup table (the same field), I get an error when I try to map both columns from my csv against the same column in my lookup table:
There are columns missing with the sort order 2 to 2
What is the best way to go for this? Or am I thinking about this incorrectly?
If something was not clear let me know and I'll update my question
Any time you wish you could have multiple outputs from a component that only allows one, all you have to do is follow that component with the Multicast component, whose sole purpose is to split a Data Flow stream into multiple outputs.
Gonzalo
I have just used this article on deriving columns when building a data warehouse:- How to Populate a Fact Table using SSIS (part 1).
Using this I built a simple package that reads a CSV file with two columns that are used to derive separate values from the same CodeTable. The CodeTable has two fields Id and Description.
The Data Flow has two "Lookup" tasks. The first one joins the attribute Lookup1 against the Description to derive its Id. The second joins the attribute Lookup2 against the Description to derive a different Id.
Here is the Data Flow:-
Note the "Data Conversion" was required to convert the string attributes from the CSV file into "Unicode string [DT_WSTR]" so they could be joined to the nvarchar(50) description attribute in the table.
Here is the Data Conversion:-
Here is the first Lookup (the second one joins "Copy of Lookup2" to the Description):-
Here is the Data Viewer output with the two derived Ids CodeTableFirstId and CodeTableSecondId:-
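For comparison, the set-based equivalent of those two Lookups is a single query that joins the CodeTable twice (the staging table name is hypothetical):
SELECT
    S.Lookup1
    , S.Lookup2
    , CT1.Id AS CodeTableFirstId
    , CT2.Id AS CodeTableSecondId
FROM
    staging.CsvRows AS S
    LEFT JOIN dbo.CodeTable AS CT1 ON CT1.Description = S.Lookup1
    LEFT JOIN dbo.CodeTable AS CT2 ON CT2.Description = S.Lookup2;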
Hopefully I understand your problem and this is of use to you.
Cheers John

SSIS row field to be sum of lookup

I have an SSIS data flow in an SSIS 2012 project.
For every row, I need to calculate, in the best way possible, a sum from another table based on some criteria.
It would be something like a lookup but returning an aggregate on the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a Script Task or a stored procedure?
Example:
One data flow has a field named LOT.
I need to get the SUM(quantity) from tableb where dataflow.LOT = tableb.lot
and write this back to a flow field.
You just need to use the Lookup component. Instead of selecting tableb, write the query, thus
SELECT
B.Lot -- for matching
, SUM(B.quantity) AS TotalQuantity -- for data flow injection
FROM
tableb AS B
GROUP BY
B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableB are changing. In the larger-volume case, if it's a problem, then I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of Lots could be pushed to the remote server beforehand to only compute the aggregates for what I need.
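For instance, a sketch of the same lookup query restricted to the current year (the date column is hypothetical):
SELECT
    B.Lot
    , SUM(B.quantity) AS TotalQuantity
FROM
    tableb AS B
WHERE
    B.TransactionDate >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1) -- hypothetical date column
GROUP BY
    B.Lot;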
If TableB is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform two lookups against the source while Partial would cache the values it has seen. Which is better depends on memory pressure, etc.

SSIS 2008. Transferring data from one table to another ONLY if the data is not duplicated

I'm going to do my best to try to explain this. I currently have a data flow task with an OLE DB Source transferring data from a table in one database to a table in another database. It works fine, but the issue I'm having is that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the data flow task where you transfer the data, you can insert a Lookup transformation. In the Lookup, you can specify a data source (a table or a query, whatever serves you best). When you choose the data source, you can go to the Columns view and create a mapping where you connect the CustomerID, Date and Amount of both tables.
In the General view, you can configure what happens with matched/non-matched rows. Simply take the no-match output and direct it to the DB destination.
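(For comparison, the set-based equivalent of this no-match insert, with hypothetical table and column names:)
INSERT INTO dbo.TargetOrders (CustomerId, OrderDate, Amount)
SELECT
    S.CustomerId
    , S.OrderDate
    , S.Amount
FROM
    SourceDb.dbo.SourceOrders AS S
WHERE
    NOT EXISTS (SELECT 1
                FROM dbo.TargetOrders AS T
                WHERE T.CustomerId = S.CustomerId
                  AND T.OrderDate = S.OrderDate
                  AND T.Amount = S.Amount);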
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table:
SELECT
    CO.CustomerId
    , CO.OrderId
FROM
    dbo.CustomerOrder AS CO;
If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation: you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
Inside your data flow, add a Lookup transformation with that query. You don't specify whether 2005, 2008 or 2012 is your SSIS version, and they have different behaviours associated with the Lookup transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database, so those are the rows that are new. 2005 assumes every row is going to match, or it errors. You will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries", and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on the defined fields and take an action depending on whether they match.
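A minimal sketch, assuming hypothetical table and column names (and that CustomerId plus date identifies a row):
MERGE dbo.TargetOrders AS T
USING SourceDb.dbo.SourceOrders AS S
    ON T.CustomerId = S.CustomerId
    AND T.OrderDate = S.OrderDate
WHEN MATCHED AND T.Amount <> S.Amount THEN
    UPDATE SET T.Amount = S.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, OrderDate, Amount)
    VALUES (S.CustomerId, S.OrderDate, S.Amount);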