Capturing the max value in a data flow - SSIS

I'm just getting back into SSIS after several years of not using it. Here is what I need to do.
1) Read a value from a table and store it in a variable.
2) Create a data flow that retrieves some number of rows having a value greater than the value retrieved in #1.
3) Store the rows retrieved in #2 in another table.
4) Determine the maximum value of a particular column from the rows read in step #2 and update the table referenced in #1.
The first three steps are easy, straightforward, and working. However, I'm not certain of the best way to accomplish #4.

"Best" is always subjective, but the most straightforward mechanism would be to add a Multicast component prior to your destination.
The Multicast will allow all the data flowing through the pipeline to show up in more than one stream. This is all done through pointers to the actual data buffers and doesn't result in physical copies of the data being strewn about.
From the Multicast, connect it to an Aggregate component and perform a MAX operation on whatever column you're using.
You know that you will only have one row coming from this aggregate, so I'd use an OLE DB Command component to update table #1. Something like
UPDATE ETLStatus
SET MaxValue = ?
WHERE PackageName = ?;
And then you'd map the column names in like
MaxValue => Parameter_0
PackageName => Parameter_1
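For completeness, the step-1 read that seeds the variable could be an Execute SQL Task returning a single-row result set mapped to your variable; a minimal sketch, assuming the same ETLStatus table and columns as the UPDATE above:
SELECT MaxValue
FROM ETLStatus
WHERE PackageName = ?;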

Related

SSIS Getting the most up to date record using datetime columns

I am building an SSIS package for SQL Server 2014 and am currently trying to get the most recent record from 2 different sources using datetime columns shared between the two sources, and to implement a method to accomplish that. So far I am using a Lookup Task on thirdpartyid to match the records that I need to eventually compare, and a Merge Join to bring them together, so that I eventually have a staging table holding the most recent record. I have a previous data flow task, not shown, that already inserts records that are not in AD1 into a staging table, so at this point these records are a one-to-one match. Both sources have the exact same datetime columns, just different dates, and some information has null values as there is no history of it.
Sample output
This is my data flow task so far. I am really new to SSIS so any ideas or suggestions would be greatly appreciated.
Given that there is a 1:1 match between your two sources, I would structure this as a Source (V1) -> Lookup (AD1).
Define your lookup based on thirdpartyid and retrieve all the AD1 columns. You'll likely end up with data flow columns like name and name_ad1, etc.
I would then add a Derived Column that identifies whether the dates are different (assuming that, in that situation, you need to take action):
IsNull(LastUpdated_AD1) || LastUpdated > LastUpdated_AD1
That expression yields true if the column in AD1 is null or if the V1 LastUpdated value is greater than the AD1 version.
You'd likely then add a Conditional Split to your Data Flow, base it on the value of the new column, and route the changed data into your mechanism for handling updates (an OLE DB Command or, preferably, an OLE DB Destination plus an Execute SQL Task after the Data Flow to perform a batch update; see the sketch at the end of this answer).
The comment asks
should it all be one expression? like IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1 || IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
You can do it like that, but when you get a weird result and someone asks how you got that value, long expressions make it hard to debug. I'd likely have two Derived Column components in the data flow. The first would create a "has changed" column for each set of conditions:
HasChangedAssignment
(IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1)
HasChangedRoom
IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
etc
And then in the final derived column, you create the HasChanged column
HasChangedAssignment || HasChangedRoom || HasChangedAdNauseum
Using a pattern-based approach like this makes it much easier to build, troubleshoot, and/or make small changes that can have a big impact on the correctness, maintainability, and performance of your packages.
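As for the batch update mentioned above, a rough sketch of the post-Data Flow Execute SQL Task, assuming the changed rows were landed in a staging table (the stgV1 and AD1 table names here are illustrative; the key and date columns come from the question):
UPDATE T
SET T.LastUpdated = S.LastUpdated
FROM dbo.AD1 AS T
INNER JOIN dbo.stgV1 AS S
ON S.thirdpartyid = T.thirdpartyid;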

Deduplication of records without sorting in a mainframe sequential dataset with already sorted data

This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.
The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes represents the key and the remainder of each row represents the rest of the record's data. There are records in which the same key is repeated though the remaining data is different in each record. The records are already sorted on 'KEYn'.
KEY1aaaaaa
KEY1bbbbbb
KEY2cccccc
KEY3xxxxxx
KEY3yyyyyy
KEY3zzzzzz
KEY3wwwwww
KEY4uuuuuu
KEY5hhhhhh
KEY5ffffff
My requirement is to pick up the first record of each key and drop the remaining 'duplicates'. So the output file for the above input should look like this:
KEY1aaaaaa
KEY2cccccc
KEY3xxxxxx
KEY4uuuuuu
KEY5hhhhhh
Since the data is already sorted, I don't want to use the SORT utility with SUM FIELDS=NONE or ICETOOL with the SELECT - FIRST operand, since both of these will actually end up re-sorting the data on the deduplication key (KEYn). Also, the actual dataset I am referring to is huge (1.6 billion records, AVGRLEN 900, VB), and a job actually ran out of sort work space trying to sort it in one go.
My question is: is there any option available in JCL-based utilities to do this deduplication without re-sorting and without using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.
Try this (untested):
* VB input: bytes 1-4 of each record are the RDW, so the 4-byte key is at position 5
OPTION COPY
* Prefix each record with a seqnum that restarts at 1 whenever the key changes
INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
* Keep only the first record of each key, then strip the seqnum back out
OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)

NiFi - QueryDatabaseTable processor. How to query rows which are modified?

I am working on a NiFi data flow where my use case is to fetch MySQL table data and put it into HDFS / the local file system.
I have built a data flow pipeline where I used QueryDatabaseTable processor -> ConvertRecord -> PutFile processor.
My table schema: id, name, city, Created_date
I am able to receive files in the destination even when I am inserting new records into the table.
But when I am updating existing rows, the processor is not fetching those records; it looks like it has some limitation.
My question is: how do I handle this scenario, either with some other processor or by updating some property?
Please, someone help.
@Bryan Bende
QueryDatabaseTable Processor needs to be informed which columns it can use to identify new data.
A serial id or a created timestamp is not sufficient to detect updates.
From the documentation:
Maximum-value Columns:
A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running. Using multiple columns implies an order to the column list, and each column's values are expected to increase more slowly than the previous columns' values. Thus, using multiple columns implies a hierarchical structure of columns, which is usually used for partitioning tables. This processor can be used to retrieve only those rows that have been added/updated since the last retrieval. Note that some JDBC types such as bit/boolean are not conducive to maintaining maximum value, so columns of these types should not be listed in this property, and will result in error(s) during processing. If no columns are provided, all rows from the table will be considered, which could have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.
Judging by the table schema, there is no SQL way of telling whether data was updated.
There are many ways to solve this. In your case, the easiest thing to do might be to rename the created column to modified and set it to now() on updates, or to work with a second timestamp column.
So, for instance, the new column added would be
| stamp_updated | timestamp | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
In the processor, you then use the stamp_updated column to identify new data.
Don't forget to set Maximum-value Columns to those columns.
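A minimal sketch of adding such a column in MySQL (the table name mytable is illustrative):
ALTER TABLE mytable
ADD COLUMN stamp_updated TIMESTAMP
DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP;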
So what I am basically saying is:
If you cannot tell in SQL yourself that a record is new, NiFi cannot either.

SSIS row field to be sum of lookup

I have a data flow in an SSIS 2012 project.
For every row, I need to calculate, in the best way possible, a field as a sum over another table based on some criteria.
It would be something like a lookup, but returning an aggregate of the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a Script Task or a stored procedure?
Example:
My data flow has a field named LOT.
I need to get the SUM(quantity) from tableb where dataflow.LOT = tableb.lot
and write this back to a data flow field.
You just need to use the Lookup component. Instead of selecting tableb, write a query, thus:
SELECT
B.Lot -- for matching
, SUM(B.quantity) AS TotalQuantity -- for data flow injection
FROM
tableb AS B
GROUP BY
B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableB are changing. In the larger-volume case, if it's a problem, then I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of Lots could be pushed to the remote server beforehand to compute the aggregates only for what I need.
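For instance, a hedged sketch of restricting the lookup query to the current year, assuming a hypothetical OrderDate column on tableb (DATEFROMPARTS needs SQL Server 2012+):
SELECT
B.Lot
, SUM(B.quantity) AS TotalQuantity
FROM
tableb AS B
WHERE
B.OrderDate >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1)
GROUP BY
B.Lot;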
If tableB is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform 2 lookups against the source while Partial would cache the values it has seen. Which is preferable depends on memory pressure, etc.

SSIS 2008. Transferring data from one table to another ONLY if the data is not duplicated

I'm going to do my best to try to explain this. I currently have a data flow task with an OLE DB Source transferring data from a table in one database to a table in another database. It works fine, but the issue I'm having is that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the data flow task where you transfer the data, you can insert a Lookup transformation. In the Lookup, you can specify a data source (a table or a query, whatever serves you best). Once you've chosen the data source, you can go to the Columns view and create a mapping, where you connect the CustomerID, Date, and Amount of both tables.
In the General view, you can configure what happens with matched/non-matched rows. Simply take the no-match output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table:
SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder CO
If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation: you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
Inside your dataflow, add a Lookup Transformation with that query. You don't specify 2005, 2008 or 2012 as your SSIS version and they have different behaviours associated with the Lookup Transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database so those are the rows that are new. 2005 assumes every row is going to match or it errors. You will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries" and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields and take an action depending on whether or not they match.
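A minimal sketch, assuming illustrative source/target table names and that the CustomerID, Date, and Amount columns from the question form the match key:
MERGE dbo.TargetTable AS T
USING dbo.SourceTable AS S
ON T.CustomerID = S.CustomerID
AND T.[Date] = S.[Date]
AND T.Amount = S.Amount
WHEN NOT MATCHED BY TARGET THEN
INSERT (CustomerID, [Date], Amount)
VALUES (S.CustomerID, S.[Date], S.Amount);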