Grouping in SSIS

So, I have a bunch of data that I'm trying to import using SSIS. The problem I'm having is that some of the data is outdated, so I want to import only the most recent data. I have a key that indicates which set of data each row belongs to, and I only want to import the most recent row per key.
What is the best way to do this in SSIS?
My only thought would be to use two Sort transforms. The first would sort by date. The second would sort by my key and eliminate duplicate rows. This would only work if the second sort is guaranteed to preserve the previous order. Does anyone know if this holds true? Or does the second sort completely eliminate the order the first sort put into place?

I don't think you can rely on the sort order being preserved. You can sort by multiple keys in a single Sort transform, then send the flow through a Script Component at that point to do the filtering by simply comparing each row to the previous one.
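If the source is a relational database, another option is to skip the in-flow dedup entirely and have the source query return only the latest row per key. A minimal T-SQL sketch, assuming a hypothetical table SourceData with GroupKey, RecordDate, and Payload columns:

-- ROW_NUMBER restarts at 1 for each key, newest date first,
-- so rn = 1 picks out the most recent row per key
SELECT GroupKey, RecordDate, Payload
FROM (
    SELECT GroupKey, RecordDate, Payload,
           ROW_NUMBER() OVER (PARTITION BY GroupKey
                              ORDER BY RecordDate DESC) AS rn
    FROM SourceData
) AS ranked
WHERE rn = 1;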

I usually split (Multicast) my dataset: one path aggregates down to the value I want to keep, and the other is merged back with the first.
For example, say I have a history of positions by employee (Employee, Date, Position).
I split the dataset to retrieve the last history date per employee (group by Employee, take the max Date) and sort that output by Employee => 1.Employee + 1.last_date.
Then I merge my two datasets on 1.Employee = 2.Employee AND 1.last_date = 2.Date.
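Expressed as SQL, in case it makes the pattern clearer (EmployeeHistory and its column names are assumed for illustration):

-- Aggregate to the last history date per employee, then join back
-- to pick up the full row, mirroring the Multicast + Merge Join flow
SELECT h.Employee, h.[Date], h.Position
FROM EmployeeHistory AS h
INNER JOIN (
    SELECT Employee, MAX([Date]) AS last_date
    FROM EmployeeHistory
    GROUP BY Employee
) AS latest
    ON latest.Employee = h.Employee
   AND latest.last_date = h.[Date];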

Related

Merge or delete semi-duplicates in an array

I need to either merge or delete semi-duplicates in a Dataverse table. I looked into duplicate detection there, but I couldn't figure it out. I want to point out that that might be the solution to this and might be easier than what I'm asking for here. If I could keep the duplicates from writing to the table in the first place, that would be better.
In Power Automate, I create an array from the aforementioned Dataverse table. In particular, I'm looking at objects with duplicate Order Number values. Below is an example portion of the array. These three objects have the same Order Number but different Schedule Dates.
I will have this array sorted by Schedule Date, because I want to keep the record with the earliest Schedule Date. Using Power Automate, how could I merge these records or remove all but the first one?

How to perform a range lookup with a Cache Connection Manager in SSIS

Is there a way to perform a date range lookup using a cache connection manager in SSIS? Or something similar that is very performant.
The scenario I have is as follows.
I have a row in a table that has a date; let's call it BusinessDate. I need to perform a lookup against a dimension table to see whether the BusinessDate falls between the StartDate and EndDate of the dimension.
The problem is that the table I'm reading from has millions of records, my dimension (lookup table) has a few thousand records, and the lookup takes a very long time.
Please help...
Nope, the Lookup with a Cache Connection Manager is a strict equals match. You might be able to finagle it with a Lookup against an OLE DB source using a Partial/None cache mode and custom queries.
So, what can you do?
You can modify the way you populate your Lookup Cache. Assuming your data looks something like
MyKey|StartDate |EndDate   |MyVal
1    |2021-01-01|2021-02-01|A
1    |2021-02-01|9999-12-31|B
Instead of just loading as is, explode out your dimension.
MyKey|TheDate   |MyVal
1    |2021-01-01|A
1    |2021-01-02|A
1    |2021-01-03|A
1    |2021-01-04|A
1    |2021-01-05|A
...
1    |2021-02-01|B
1    |2021-02-02|B
...
You might not want to build your lookup all the way out to the year 9999; know your data and, say, go five years into the future, while still picking up the real end dates.
Now your lookup usage is a supported case - a strict equals match.
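A hedged T-SQL sketch of that explode step, using a recursive CTE; dbo.MyDimension and the five-year cap are assumptions:

-- Cap open-ended ranges (EndDate = 9999-12-31) at five years out; adjust to your data
WITH bounded AS (
    SELECT MyKey, MyVal, StartDate,
           CASE WHEN EndDate > DATEADD(year, 5, GETDATE())
                THEN CAST(DATEADD(year, 5, GETDATE()) AS date)
                ELSE EndDate
           END AS EndDate
    FROM dbo.MyDimension
),
exploded AS (
    -- One row per key per day in [StartDate, EndDate)
    SELECT MyKey, MyVal, StartDate AS TheDate, EndDate
    FROM bounded
    UNION ALL
    SELECT MyKey, MyVal, DATEADD(day, 1, TheDate), EndDate
    FROM exploded
    WHERE DATEADD(day, 1, TheDate) < EndDate
)
SELECT MyKey, TheDate, MyVal
FROM exploded
OPTION (MAXRECURSION 0);

Feed the result of this query into the Cache Transform instead of the raw dimension rows.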
Otherwise, the pattern of a Merge Join is how people handle range joins in a data flow. I'm going to reproduce Matt Masson's article from the MSDN blogs below, because the original is dead.
Lookup Pattern: Range Lookups
Performing range lookups (i.e. finding a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially common for historical loads and late-arriving fact situations, where you're using Type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.
This blog post outlines three separate approaches for doing range lookups in SSIS:
Using the Lookup Transform
Merge Join + Conditional Split
Script Component
All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.
One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).
Lookup Transform
The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).
We can use the following query to have the lookup transform perform our range lookup:
select [ProductKey], [ProductAlternateKey],
       [StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
  and [StartDate] <= ?
  and ([EndDate] is null or [EndDate] > ?)
On the Query Parameters page, we map parameter 0 -> ProductNumber, and parameters 1 and 2 -> OrderDate.
This approach is effective and easy to set up, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.
Merge Join and Conditional Split
This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.
For example, a row coming in from our source would contain an OrderDate and a ProductNumber.
From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate, and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.
We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):
StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))
This approach is a little more complicated, but performs a lot better than using the Lookup Transform.
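If the fact rows happen to be staged in SQL Server anyway, the whole Merge Join + Conditional Split chain collapses into a single range join; a sketch, where stg.FactOrders is a made-up staging table carrying the OrderDate and ProductNumber columns described above:

-- Same range predicate as the Conditional Split expression, as join conditions
SELECT f.OrderDate, f.ProductNumber, d.ProductKey
FROM stg.FactOrders AS f
INNER JOIN dbo.DimProduct AS d
    ON d.ProductAlternateKey = f.ProductNumber
   AND d.StartDate <= f.OrderDate
   AND (d.EndDate IS NULL OR f.OrderDate < d.EndDate);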
Script component
Not reproduced here
Conclusion
Not reproduced here

Deduplication of records without sorting in a mainframe sequential dataset with already sorted data

This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.
The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes represents the key and the remainder of each row represents the rest of the record's data. There are records in which the same key is repeated though the remaining data is different in each record. The records are already sorted on 'KEYn'.
KEY1aaaaaa
KEY1bbbbbb
KEY2cccccc
KEY3xxxxxx
KEY3yyyyyy
KEY3zzzzzz
KEY3wwwwww
KEY4uuuuuu
KEY5hhhhhh
KEY5ffffff
My requirement is to pick up the first record for each key and drop the remaining 'duplicates'. So the output file for the above input should look like this:
KEY1aaaaaa
KEY2cccccc
KEY3xxxxxx
KEY4uuuuuu
KEY5hhhhhh
Since the data is already sorted, I don't want to use the SORT utility with SUM FIELDS=NONE, or ICETOOL with the SELECT - FIRST operand, since both of these will actually end up re-sorting the data on the deduplication key (KEYn). Also, the actual dataset I am referring to is huge (1.6 billion records, AVGRLEN 900, VB) and a job actually ran out of sort work space trying to sort it in one go.
My query is: Is there any option available in JCL based utilities to do this deduplication without resorting and using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.
Try this (untested):
* VB input: the data starts after the 4-byte RDW, so the key is at position 5, length 4
OPTION COPY
* Tag each record with a 3-byte sequence number that restarts whenever the key changes
INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
* Keep only the first record of each key (sequence number 1), then strip the number back out
OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)

SSIS Aggregate transformation

I am working on a very large SSIS package with many transformations.
I need to do an AGGREGATE that groups a field and also counts it.
The problem I am having is that the AGGREGATE is fed from a MULTICAST. I tried doing a SORT from the MULTICAST and then an AGGREGATE, but I lose all of the other columns, and I need them.
I tried adding another SORT coming from the MULTICAST so that I can keep all of the columns and have all transformations going into a MERGE, but the package gets hung up on the SORT coming from the MULTICAST.
The MULTICAST is also routed into a CONDITIONAL SPLIT, in which one of the splits has an AGGREGATE that groups a field and counts it, and then goes into the above MERGE.
SORT 1 sorts by CUSTOMER_ID and SORT 2 sorts by CUSTOMER_ID_SYSTEM.
AGGREGATE 1 groups by CUSTOMER_ID with a count distinct of CUSTOMER_ID, and AGGREGATE 2 groups by CUSTOMER_ID_SYSTEM with a count distinct of CUSTOMER_ID_SYSTEM.
Basically, what I am trying to accomplish with the AGGREGATEs is: if the COUNTS from the first AGGREGATE equal the COUNTS from the second AGGREGATE, those rows should go down a separate path from the rows where the COUNTS don't match.
Any suggestions on the best way to do this without the package taking a long time to process? Right now the package does not get past the SORTs.
The fastest way to handle this is to send it to a destination table with covering index(es) at the point of the multicast, and then do the aggregating and comparison logic in a stored procedure. Then if more dataflow processing is needed, start a new dataflow with that table as the source.
Sorting and aggregating in an SSIS dataflow is always going to be slow and not recommended for lots of rows. Nothing you can do about it but avoid it.
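A rough T-SQL sketch of what that stored procedure might do, under one reading of the count comparison; stg.Customers and the routing tables are made-up names:

-- Land the multicast output in stg.Customers first, then compare the distinct counts
DECLARE @id_count INT =
    (SELECT COUNT(DISTINCT CUSTOMER_ID) FROM stg.Customers);
DECLARE @system_count INT =
    (SELECT COUNT(DISTINCT CUSTOMER_ID_SYSTEM) FROM stg.Customers);

IF @id_count = @system_count
    INSERT INTO stg.CustomersMatched SELECT * FROM stg.Customers;
ELSE
    INSERT INTO stg.CustomersNotMatched SELECT * FROM stg.Customers;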

Merge data from two sources into one destination without duplicates

I have data from two different source locations that need to be combined into one. I am assuming I would want to do this with a merge or a merge join, but I am unsure of what exactly I need to do.
Table 1 has the same fields as Table 2 but the data is different which is why I would like to combine them into one destination table. I am trying to do this with SSIS, but I have never had to merge data before.
The other issue that I have is that some of the data is duplicated between the two. How would I keep only one of the duplicated records?
Instead of making an entirely new table, which will need to be updated again every time Table 1 or 2 changes, you could use a combination of views and UNIONs. In other words, create a view that is the result of a UNION query between your two tables. To get rid of duplicates, you can group by whatever column uniquely identifies each record.
Here is a UNION query using Group By to remove duplicates:
SELECT
    MAX(ID) AS ID,
    NAME,
    MAX(going) AS going
FROM
(
    SELECT
        ID::VARCHAR,
        NAME,
        going
    FROM
        facebook_events
    UNION
    SELECT
        ID::VARCHAR,
        NAME,
        going
    FROM
        events
) AS merged_events
GROUP BY
    NAME
(Postgres not SSIS, but same concept)
Instead of Merge and Sort, use Union All + Sort, because the Merge transform needs two sorted inputs and performance will suffer.
1) Give Source1 & Source2 as inputs to a Union All transformation.
2) Feed the output of the Union All transformation into a Sort transformation and check "Remove rows with duplicate sort values".
This sounds like a pretty classic merge. Create your source and destination connections. Put in a Data Flow task. Put both sources into the Data Flow. Make sure the sources are both sorted and connect them to a Merge. You can either add in a Sort transformation between the connection and the Merge or sort them using a query when you pull them in. It's easier to do it with a query if that's possible in your situation. Put a Sort transformation after the Merge and check the "Remove rows with duplicate sort values" box. That will take care of any duplicates you have. Connect the Sort transformation to the data destination.
You can do this without SSIS, too.
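For example, if both sources land in (or are reachable from) the same SQL Server instance, a single INSERT with UNION handles the merge and the dedup at once; table and column names here are placeholders:

-- UNION (unlike UNION ALL) removes rows that are exact duplicates across the two sources
INSERT INTO dbo.Destination (ID, Name, SomeValue)
SELECT ID, Name, SomeValue FROM dbo.Source1
UNION
SELECT ID, Name, SomeValue FROM dbo.Source2;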