Columns changing to NULL after passing through a Merge Join Transformation - SSIS

I have a project where I am migrating data from a DB2 table over to a SQL table.
I pull the data in, then pass the dataset through a Script Transformation to hash each row in the dataset (to be used for comparison later in detecting any updates that need to be made). After the Script Transformation, I sort both datasets and pass them into a Merge Join Transformation.
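(For reference, on the SQL Server side the same kind of row hash could also be computed in the source query rather than in a Script Transformation. This is only a hedged sketch with hypothetical table and column names; HASHBYTES with SHA2_256 and CONCAT assume SQL Server 2012 or later.)
SELECT
Col1
, Col2
, Col3
, HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2, '|', Col3)) AS RowHash -- hash used later for update detection
FROM dbo.SourceTable; -- hypothetical table and columns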
Here's the problem that I'm running into.
After passing the sorted datasets into the Merge Join, the resulting left outer join dataset that emerges contains a NULL value for every record in the dataset, and I don't know why.
Here is a picture of my Merge Join Transformation Editor:
I enabled a data viewer earlier in my project to verify that hashes are being generated for both the Host table and the SQL table. Everything works fine up until after it passes through the Merge Join Transformation.
I have a similar project that does the exact same thing with a different table. Both are modeled and designed the same, except this one is the only one that is having this particular hiccup with the SQL Hash column.
Does anyone have any thoughts that could help me troubleshoot this?

My apologies again. After doing some digging, it turns out that every single row that emerged from the Merge Join was a new row, and I was searching for the new rows by the wrong column name. Logic error on my part.

Related

Performing Merge-Join Using Derived Column as Join Key

I have created a derived column in my data flow that is a simple concatenation of two columns. I have done this to two separate data sources. I then want to perform a merge join with my newly derived column as the outer join key. However, it doesn't seem to be possible to accomplish this. Does anyone have experience with something like this?
The issue stems from the fact that I am unable to set a "Sort Key Position" on my newly created column, as this is specified at the source; it is not possible to set it at the Derived Column transformation.
You would need to add a Sort component between the Derived Column and the Merge Join, sorting on the column(s) introduced in the Derived Column component.
While an Excel source is definitely going to need the Derived Column + Sort to make this work, I have not run into a situation where I could express an idea in the SSIS expression language that I could not also express in T-SQL. If you can push the logic into the source query, it will simplify your package as well as speed up execution.
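For a relational source, a minimal sketch of that approach (hypothetical table and column names) could look like this; the concatenation and ORDER BY happen in the source query, so the source output can be marked as sorted instead of adding Derived Column and Sort components:
SELECT
ColA
, ColB
, CONCAT(ColA, ColB) AS JoinKey -- does the work of the Derived Column
FROM dbo.SourceTable -- hypothetical table and columns
ORDER BY CONCAT(ColA, ColB); -- then set IsSorted on the output and SortKeyPosition 1 on JoinKey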
Also, it's been my experience that a Lookup component is most often the tool people want compared to a Merge Join. If I'm augmenting an existing row, Lookup. If I need one row to be able to generate zero to many rows, then a Merge Join may be appropriate.
I know you had a lookup question earlier. Excel can act as a Lookup source if you use the Cache Connection Manager (see "Excel Source as Lookup Transformation Connection").

Both inputs of the transformation must contain at least one sorted column, and those columns must have matching metadata - SSIS

I have created an SSIS package and I used Merge Join to join a Dimension with the result of another Merge Join and I got the following error:
Both inputs of the transformation must contain at least one sorted column, and those columns must have matching metadata
I found that the issue was related to the data types of the two sorted columns; I simply added a conversion to make both of them INT and everything now works fine.
The message is pretty clear. SSIS merge operations require that the data being compared is sorted, so that comparisons are faster.
Make sure that you are retrieving ordered data from your database using an ORDER BY clause (if the source is SQL), then set the IsSorted property on the source output and give each sort column its corresponding SortKeyPosition.
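As a hedged example (the table and column names are assumptions), the source query for each input could look like this, casting the join key so both inputs end up with matching INT metadata:
SELECT
CAST(SourceKey AS INT) AS JoinKey -- cast so both inputs have matching INT metadata
, SomeAttribute
FROM dbo.SomeDimension -- hypothetical table and columns
ORDER BY CAST(SourceKey AS INT); -- then mark the output IsSorted and set SortKeyPosition 1 on JoinKey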
If you can't have the data ordered at the source, you can add a Sort transformation in SSIS to sort the merge columns (before the actual merge). You will have to do this on both flows before the merge. Please be advised that this component will block the data flow until all rows are sorted.
The Merge error message will go away once you join both data flows with sorted columns.

Pentaho Import unique records into database

I am quite new to Pentaho Spoon and I would like to import the records of a CSV file into a database table. However, only unique records should be imported into the database table. That is why I need to compare EACH record with all records of the database table in order to determine whether the record should be imported or not.
So far, I tried out the suggested CRUD-pattern which looks like this:
As you can see in the picture, I merge the Excel input and the table input (ignore the cast steps; I needed to cast a value because the two sources differed in float format: the database format was #.000000 and the CSV float format was #.0).
After the merge join, I compare the flag (which is set by the Merge Rows (diff) step): if the compared records are new, I import them into the database table; if they are changed, I update the record; and if they are deleted or identical, I simply do nothing. So far, so good.
But here is the problem: if I shuffle the records of the CSV input file and run the transformation again, all the records are imported anew and consequently there are duplicates in my database table (which I wanted to avoid). To emphasize again: the right way to solve this is that each row of the CSV input file is compared with ALL entries in the database table.
How can I realize this? Any suggestions? Thank you so much in advance!!
The Merge Rows (diff) step expects its inputs to be sorted. Normally you would have been warned of this by a pop-up.
Put a Sort rows step on the output flow of the Excel Input, before it reaches the Merge Rows (diff).
You should do the same between the Table Input and the Merge Rows (diff). Of course, you may think you could do it in the SQL statement of the Table Input.
However, there is a beginner trap here. You have three other steps (Output Rows, Update and Delete) which operate on the same table, and these steps may lock it. Because in Kettle all the steps run concurrently, you do not know which step will fire first, and the table may be locked so that the Table Input can never read even the first record. This is known in the jargon as an auto-lock, and the way to solve it is to put a Sort rows step in as a buffer.
You can use the 'Dimension lookup/update' step, which provides the same functionality that you are trying to achieve.

SSIS row field to be sum of lookup

I have a data flow in an SSIS 2012 project.
For every row, I need to calculate, in the best way possible, a sum from another table based on some criteria.
It would be something like a lookup, but returning an aggregate of the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a script task or stored procedure?
Example:
One data flow has a field named LOT.
I need to get SUM(quantity) from tableb where dataflow.LOT = tableb.lot
and write this back to a field in the flow.
You just need to use the Lookup component. Instead of selecting tableb directly, write the query, thus:
SELECT
B.Lot -- for matching
, SUM(B.quantity) AS TotalQuantity -- for data flow injection
FROM
tableb AS B
GROUP BY
B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableB are changing. In the larger-volume case, if it's a problem, then I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of Lots could be pushed onto the remote server beforehand to only compute the aggregates for what I need.
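For instance, a hedged variation of the lookup query that limits the aggregate to the current year (OrderDate is an assumed column, and DATEFROMPARTS assumes SQL Server 2012 or later):
SELECT
B.Lot
, SUM(B.quantity) AS TotalQuantity
FROM
tableb AS B
WHERE
B.OrderDate >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1) -- only the current year's rows
GROUP BY
B.Lot;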
If TableB is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform two lookups against the source while Partial would cache the values it has seen. Which to choose probably depends on memory pressure, etc.

SSIS 2008. Transferring data from one table to another ONLY if the data is not duplicated

I'm going to do my best to try to explain this. I currently have a data flow task with an OLE DB Source transferring data from a table in one database to a table in another database. It works fine, but the issue I'm having is that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on date '11/30/2012' is seen in that table multiple times. How do I make it so that only unique data is transferred over to that destination table?
In the data flow task where you transfer the data, you can insert a Lookup transformation. In the Lookup, you can specify a data source (a table or a query, whichever serves you best). Once you have chosen the data source, you can go to the Columns view and create a mapping where you connect the CustomerID, Date and Amount of both tables.
In the General view, you can configure what happens with matched/non-matched rows. Simply take the no-match output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table:
SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder CO
If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation: you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
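A hedged sketch of that filter, assuming a hypothetical OrderDate column on dbo.CustomerOrder (the date expression works on SQL Server 2008 and later):
SELECT CO.CustomerId, CO.OrderId
FROM dbo.CustomerOrder CO
WHERE CO.OrderDate >= DATEADD(YEAR, DATEDIFF(YEAR, 0, GETDATE()), 0); -- start of the current year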
Inside your data flow, add a Lookup transformation with that query. You don't specify 2005, 2008 or 2012 as your SSIS version, and they have different behaviours associated with the Lookup transformation. Generally speaking, what you are looking to do is identify the unmatched rows; by definition, unmatched means they don't exist in the target database, so those are the rows that are new. 2005 assumes every row is going to match or it errors, so you will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries", and there you'll want "Redirect rows to no match output".
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields, and take an action if matched or not.
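A minimal sketch of that approach, using assumed table and column names (the source rows are staged into a hypothetical staging.CustomerSales table first):
MERGE dbo.CustomerSales AS tgt -- hypothetical destination table
USING staging.CustomerSales AS src -- hypothetical staged copy of the source rows
ON tgt.CustomerID = src.CustomerID
AND tgt.SaleDate = src.SaleDate
WHEN MATCHED AND tgt.Amount <> src.Amount THEN
UPDATE SET tgt.Amount = src.Amount -- existing row whose value changed
WHEN NOT MATCHED BY TARGET THEN
INSERT (CustomerID, SaleDate, Amount)
VALUES (src.CustomerID, src.SaleDate, src.Amount); -- new row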