I'm working on an SSIS package, adding update functionality (updating rows by using a staging table). To do this, I use a lookup and conditional split where I compare all the columns.
For some reason, some of the data throws false positives and marks rows as changed, when they have not. I've isolated this down to a single string column (zip code).
The column comes straight from a lookup. The source data column is varchar(9), the destination (i.e. source of second value) is char(9). In SSIS, both columns come through as DT_STR,9,1252
If I start with an empty table, and run the package twice, the second time about 20% of the rows come up as changed, even though they haven't. The following sql joins the existing rows to the "updated" rows in the staging table and compares their zips:
SELECT a.key_DestinationZip, b.key_DestinationZip,
CASE WHEN a.key_DestinationZip = b.key_DestinationZip then 1 else 0 end
FROM [dbo].[sta_Sales] as a
join [dbo].[fact_Sales] as b
on a.key_FullSalesNumber = b.key_FullSalesNumber
with results similar to
78735 78735 1
38138 38138 1
Your source data is varchar(9) and your lookup data is char(9). I believe, but have not tested, this is resulting in |65401| and |65401 | (4 spaces there and pipes for delineation only) in your data.
The data coming from your source system is going to be affected by the ANSI_PADDING setting when it was loaded. By default, SSIS isn't going to pad out the string.
Therefore, in your lookup, you will want to either pad the source data to 9 characters or trim the lookup's key.
And unrelated to this, but you might want to store the postal code separate from the zip+4 data. The later is more likely to change than the former when/if you ever run your data through an address validation service.
That looks to me like the problem is your data has two zip codes.
Related
I am building an SSIS for SQL Server 2014 package and currently trying to get the most recent record from 2 different sources using datetime columns between the two sources and implementing a method to accomplish that. So far I am using a Lookup Task on thirdpartyid to match the records that I need to eventually compare and using a Merge Join to bring them together and eventually have a staging table that has the most recent record.I have a previous data task, not shown that already inserts records that are not in AD1 into a staging table so at this point these records are a one to one match. Both sources look like this with the exact same datetime columns just different dates and some information having null values as there is no history of it.
Sample output
This is my data flow task so far. I am really new to SSIS so any ideas or suggestions would be greatly appreciated.
Given that there is a 1:1 match between your two sources, I would structure this as an Source (V1) -> Lookup (AD1)
Define your lookup based on thirdpartyid and retrieve all the AD1 columns. You'll likely end up with data flow look like like name and name_ad1, etc
I would then add a Derived Column that identifies if the dates are different (assuming in that situation, you need to take action)
IsNull(LastUpdated_AD1) || LastUpdated > LastUpdated_AD1
That expression would create a boolean if the column in AD1 is null or if the V1 last updated column is greater than the AD1 version.
You'd likely then add a Conditional Split into your Data Flow and base it on the value of our new column and then route the changed data into your mechanism for handling updates (OLE DB Command or preferably, an OLE DB Destination + an Execute SQL Task post Data Flow to perform an batch update)
The comment asks
should it all be one expression? like IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1 || IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
You can do it like that but when you get a weird result and someone asks how you got that value, long expressions make it hard to debug. I'd likely have two derived columns components in the data flow. The first would create a has changed column for each set of conditions
HasChangedAssignment
(IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1)
HasChangedRoom
IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
etc
And then in the final derived column, you create the HasChanged column
HasChangedAssignment || HasChangedRoom || HasChangedAdNauseum
Using a pattern based approach like this makes it much easier to build, troubleshoot and/or or make small changes that can have a big impact to the correctness, maintainability and performance of your packages.
I have a file like as seen below: Just Ex:
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip header information and load flatfile to DB
Like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary (not fixed no. of rows) from file to file.
Any help..Thanks in advance.
The generic SSIS components cannot meet this requirement. You need to code for this e.g. in an SSIS Script task.
I would code that script to read through the file looking for that header row ID Name Address, and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component.
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate this with numbers - not great if you have no idea how many rows might be coming in or if the IDs are actually unique each time, but if you're only loading in a few rows each time and they always start at 1, you could chuck the numbers 1 - 100 into a table and throw away and rows you load which don't match when doing a lookup onto this table.
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can just pick up the columns which you know will have data you care about in (i.e. columns 1 through 3), you can do any conversions you need to do in SQL rather than in SSIS, and you can make sure your columns have sensible names to be used in SSIS.
I need to prevent loading data into my fact table if any of the incoming data has a [DateId] that already exists in the fact table. The field [DateId] is an integer value.
The Lookup action in SSIS allows you to fail on non-matches, but I actually need a failure if any match is found. How can I get the package to fail when there's a match?
If you just want non-matches to flow through the lookup, just use the "Lookup No Match Output" to connect to the next component in your data flow.
Since the Lookup Match Output isn't hooked up to anything, all that data will just "stop" there. This is the equivalent of the SQL pattern LEFT JOIN WHERE --some left column-- IS NULL.
Use either a merge-join (with a conditional split) or a lookup with a nomatch output (without hooking the match to anything).
No answer after 8 years. I had the same issue today and I can't think of anything else than this workaround.
Context:
Writing from excel to SQL Server. Need to fail if the date is already present. This is also the parent in the hierarchy of data flow tasks in the control flow so wanted the child transformations to not execute if the parent fails.
Workaround:
Created a dummy table with one row and 2 columns for date and description. Entered the value "1900-01-02 12:34:56.000" in date and put a description in the other column.
Passed the Lookup Match Output to another lookup which looks up the dates in excel to this dummy table and will fail eventually.
At least I'm able to fail this and avoid writing duplicate data to 6 other tables that are written as a part of the 6 other data flow tasks which would otherwise execute if the parent task does not fail.
I'm going to do my best to try to explain this. I currently have a data flow task that has an OLE DB Source transferring data from a table from a different database to a table to another database. It works fine but the issue I'm having is the fact that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the dataflow task, where you transfer the data, you can insert a Lookup transformation. In the lookup, you can specify a data source (table or query, what serves you best). When you chose the data source, you can go to the Columns view and create a mapping, where you connect the CustomerID, Date and Amount of both tables.
In the general view, you can configure, what happens with matched/non matched row. Simply take the not matched output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder CO If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation-you want SSIS to run fast, don't bring back extraneous columns or rows it will never need.
Inside your dataflow, add a Lookup Transformation with that query. You don't specify 2005, 2008 or 2012 as your SSIS version and they have different behaviours associated with the Lookup Transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database so those are the rows that are new. 2005 assumes every row is going to match or it errors. You will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries" and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields, and take an action if matched or not.
I have an SSIS data flow that uses a lookup. Sometimes the value to be looked up (in my stream, not in the lookup table) is null.
The MSDN Docs say:
consider using full caching, which supports lookup operations on null values.
I am using Full Caching (that is the default).
But when I run I get this error on my null rows:
Row yielded no match during lookup
If I change the result to ignore no-matches then it works fine. But that ignores all no-matches. I just want to allow nulls through (as null). Any other no-match should fail the component.
What am I doing wrong? How can I get nulls to write as nulls, but not ignore any other errors.
(NOTE: I have double checked my look up table. It has ALL the values that are in my source table. It just does not have NULL as a value (because it is weird to have a look up value for null.)
I know this is a late answer, but for anyone searching on this like I was, I found this to be the simplest answer:
In the lookup connection, use a SQL query to retrieve your data and add UNION SELECT NULL, NULL to the bottom.
For example:
SELECT CarId, CarName FROM Cars
UNION SELECT NULL, NULL
Preview will show an additional row of CarId = Null and CarName = Null that will be available in the lookpup.
I had never noticed that line in BOL about the full cache mode. The answer to your problem is what you've already stated about having NULL in the reference set of data. A quick proof would be these two sample data flows.
In this example, I generate 4 rows of data: 1, 2, 3, and NULL (Column_isNull = true) and hit an in memory table that has all the values and perform a lookup between Column in the data flow and c1 defined in the in-memory table. It blows up as you described
I then added one more value to the lookup table, NULL and voila, the full-cache lookup is able to match NULL to NULL.
Takeaway
To make a NULL input value match in a lookup component, the reference data set must have a corresponding NULL value available as well as have the cache mode set to FULL.
To work around this issue within SSIS, there is an alternative approach to a previous SO answer which can be applied.
Within the Lookup Transformation, you can redirect rows on error and then pass them to another destination which can simply be the same intended destination table within your database.
Therefore your destination table within the database will still receive all the rows (477 in screen shot below).
This approach therefore avoids the need to put dummy NULL values into your lookup table within your database, with the trade offs being:
An extra step within your SSIS package.
Error rows (non-NULL non-matches in this case) will always be loaded into the destination table. To help identify these rouge records, you can export the destination table into a txt file and then diff with the input source file to see any differences.
You can use a Conditional Split to break the data into two sets, one where the given value is null and the rest. Only perform the lookup on the output that has non-null values and then combine the results back with the data set that contains the null values using a Union All. You can still trap for unmatched values on the lookup and not worry about having to add a null entry to your lookup table