I am working on a very large SSIS package with many transformations.
I need to do an AGGREGATE that groups a field and also counts the field.
The problem I am having is that the AGGREGATE is fed from a MULTICAST. I tried doing a SORT from the MULTICAST and then an AGGREGATE, but I lose all of the other columns and I need them.
I tried adding another SORT coming from the MULTICAST so that I could keep all of the columns and feed all of the transformations into a MERGE, but the package gets hung up on the SORT coming from the MULTICAST.
The MULTICAST is also routed into a CONDITIONAL SPLIT, in which one of the splits has an AGGREGATE that groups a field and also counts it, and that output goes into the above MERGE.
SORT 1 sorts by CUSTOMER_ID and SORT 2 sorts by CUSTOMER_ID_SYSTEM.
Aggregate 1 groups by CUSTOMER_ID and does a count distinct of CUSTOMER_ID; Aggregate 2 groups by CUSTOMER_ID_SYSTEM and does a count distinct of CUSTOMER_ID_SYSTEM.
Basically, what I am trying to accomplish with the AGGREGATEs is this: if the COUNTS from the first AGGREGATE equal the COUNTS from the second AGGREGATE, then those rows go down a separate path from the rows whose COUNTS don't match.
Any suggestions on the best way to do this without the package taking a long time to process? Right now the package does not get past the SORTS.
The fastest way to handle this is to send it to a destination table with covering index(es) at the point of the multicast, and then do the aggregating and comparison logic in a stored procedure. Then if more dataflow processing is needed, start a new dataflow with that table as the source.
Sorting and aggregating in an SSIS data flow is always going to be slow and is not recommended for large numbers of rows. There is nothing you can do about it except avoid it.
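As a rough sketch of the stored-procedure side of that approach, assuming the multicast output is landed in a staging table called dbo.CustomerStage and routed on to dbo.CustomerMatched / dbo.CustomerUnmatched (all three names are made up for illustration, not the poster's actual schema):

-- Staging and target table names below are assumptions.
DECLARE @cnt_id INT, @cnt_id_system INT;

-- Compute the two distinct counts once, over the staged rows.
SELECT
    @cnt_id        = COUNT(DISTINCT CUSTOMER_ID),
    @cnt_id_system = COUNT(DISTINCT CUSTOMER_ID_SYSTEM)
FROM dbo.CustomerStage;

IF @cnt_id = @cnt_id_system
BEGIN
    -- Counts agree: these rows go down the "matched" path.
    INSERT INTO dbo.CustomerMatched (CUSTOMER_ID, CUSTOMER_ID_SYSTEM)
    SELECT CUSTOMER_ID, CUSTOMER_ID_SYSTEM FROM dbo.CustomerStage;
END
ELSE
BEGIN
    -- Counts differ: route the rows to the "unmatched" path instead.
    INSERT INTO dbo.CustomerUnmatched (CUSTOMER_ID, CUSTOMER_ID_SYSTEM)
    SELECT CUSTOMER_ID, CUSTOMER_ID_SYSTEM FROM dbo.CustomerStage;
END

The covering indexes on the staging table would go on CUSTOMER_ID and CUSTOMER_ID_SYSTEM so that the two COUNT(DISTINCT ...) scans stay cheap.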
Related
I have four datasets that each retrieve information about a different thing (a unique set of fields for each one), but they can be joined using a field they share. I need to get them all into a tablix that will have four rows, one for each dataset, keyed on the linking field. How do I do that?
Currently I can only put in values from one dataset.
Often the best idea would be to create a query that joins the datasets in the SQL. If that is not possible, you can look into using the Lookup function to find information from other datasets in your report. The related LookupSet function can retrieve sets of values and may be useful as well.
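If the single-query route is open to you, a minimal sketch of the join; the dataset and column names here are purely illustrative:

-- Dataset1..Dataset4, LinkField and the Value columns are placeholder names.
SELECT
    d1.LinkField,
    d1.ValueA,
    d2.ValueB,
    d3.ValueC,
    d4.ValueD
FROM Dataset1 AS d1
JOIN Dataset2 AS d2 ON d2.LinkField = d1.LinkField
JOIN Dataset3 AS d3 ON d3.LinkField = d1.LinkField
JOIN Dataset4 AS d4 ON d4.LinkField = d1.LinkField;

That gives the report a single dataset to bind the tablix to, so no Lookup calls are needed.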
I have a data flow in an SSIS 2012 project.
For every row, I need to calculate, in the best way possible, a field that is a sum from another table based on some criteria.
It would be something like a lookup but returning an aggregate on the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a script task or a stored procedure?
Example:
One data flow has a field named LOT.
I need to get SUM(quantity) from tableb where dataflow.LOT = tableb.lot
and write this back to a data flow field.
You just need to use the Lookup component. Instead of selecting tableb, write the query, thus:
SELECT
    B.Lot -- for matching
    , SUM(B.quantity) AS TotalQuantity -- for data flow injection
FROM
    tableb AS B
GROUP BY
    B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableb are changing. In the larger-volume case, if it's a problem, then I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of Lots could be pushed to the remote server beforehand so that the query only computes the aggregates for what I need.
If tableb is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform two lookups against the source while Partial would cache the values it has seen. Which is better probably depends on memory pressure, etc.
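Picking up the current-year idea above, the lookup query could be trimmed before it is cached. B.TransactionDate is an assumed column name, not something from the original post, and this assumes the source is SQL Server 2012 or later:

-- TransactionDate is an assumed column; everything else mirrors the query above.
SELECT
    B.Lot
    , SUM(B.quantity) AS TotalQuantity
FROM
    tableb AS B
WHERE
    B.TransactionDate >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1) -- current year only
GROUP BY
    B.Lot;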
I have data from two different source locations that needs to be combined into one. I am assuming I would want to do this with a Merge or a Merge Join, but I am unsure of exactly what I need to do.
Table 1 has the same fields as Table 2, but the data is different, which is why I would like to combine them into one destination table. I am trying to do this with SSIS, but I have never had to merge data before.
The other issue that I have is that some of the data is duplicated between the two. How would I keep only one of the duplicated records?
Instead of making an entirely new table that will need to be updated again every time Table 1 or 2 changes, you could use a combination of views and UNIONs. In other words, create a view that is the result of a UNION query between your two tables. To get rid of duplicates you could group by whatever column uniquely identifies each record.
Here is a UNION query using Group By to remove duplicates:
SELECT
    MAX(ID) AS ID,
    NAME,
    MAX(going) AS going
FROM
    (
        SELECT
            ID::VARCHAR,
            NAME,
            going
        FROM
            facebook_events
        UNION
        SELECT
            ID::VARCHAR,
            NAME,
            going
        FROM
            events
    ) AS merged_events
GROUP BY
    NAME
(Postgres not SSIS, but same concept)
Instead of Merge and Sort, use Union All and then Sort, because the Merge transform needs two sorted inputs and performance will be decreased.
1) Give Source1 & Source2 as inputs to the Union All transformation.
2) Give the output of the Union All transformation to a Sort transformation and check the option to remove rows with duplicate sort values.
This sounds like a pretty classic merge. Create your source and destination connections. Put in a Data Flow task. Put both sources into the Data Flow. Make sure the sources are both sorted and connect them to a Merge. You can either add in a Sort transformation between the connection and the Merge or sort them using a query when you pull them in. It's easier to do it with a query if that's possible in your situation. Put a Sort transformation after the Merge and check the "Remove rows with duplicate sort values" box. That will take care of any duplicates you have. Connect the Sort transformation to the data destination.
You can do this without SSIS, too.
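For instance, a query-side version of the same combine-and-deduplicate idea; Table1, Table2, ID and the value columns are placeholder names, and the assumption is that ID is what identifies a duplicate:

-- Placeholder names throughout; ID is assumed to be the duplicate key.
SELECT ID, Col1, Col2
FROM (
    SELECT ID, Col1, Col2,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS rn -- arbitrary tie-break
    FROM (
        SELECT ID, Col1, Col2 FROM Table1
        UNION ALL
        SELECT ID, Col1, Col2 FROM Table2
    ) AS combined
) AS ranked
WHERE rn = 1; -- keep one row per ID, dropping the duplicates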
So, I have a bunch of data that I'm trying to import using SSIS. The problem I'm having is that some of the data is outdated, so I want to import only the most recent data. I have a key that indicates which set of data each row belongs to, and I only want to import the most recent row per key.
What is the best way to do this in SSIS?
My only thought would be to use two Sort transformations. The first would sort by date. The second would sort by my key and eliminate duplicate rows. This would only work if the sort is guaranteed to maintain the previous order. Does anyone know if this holds true? Or does the second sort completely eliminate the order the first sort put in place?
I don't think you can rely on the sort order. You can sort by multiple keys in a single sort - perhaps sending it through a script task at that point to do the filtering by simply comparing it to the previous row.
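If the source happens to be a relational table, another option is to do the filtering in the source query instead of in the data flow. SourceTable, KeyCol, LoadDate and Payload are made-up names for this sketch:

-- Assumed names: SourceTable, KeyCol, LoadDate, Payload.
SELECT KeyCol, LoadDate, Payload
FROM (
    SELECT KeyCol, LoadDate, Payload,
           ROW_NUMBER() OVER (PARTITION BY KeyCol ORDER BY LoadDate DESC) AS rn
    FROM SourceTable
) AS ranked
WHERE rn = 1; -- only the most recent row per key survives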
I usually split (multicast) my dataset: one path aggregates the value I want to keep, and the other path is merged back with the first one.
For example, I have a history of positions by employee (Employee, Date, Position).
I split my dataset to retrieve the last history date by employee (aggregate by Employee with MAX(Date)) and I sort it by Employee => 1.Employee + 1.last_date
I merge my two datasets => 1.Employee = 2.Employee AND 1.last_date = 2.date
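The same pattern written as a query, just to make the logic concrete; EmployeeHistory is a stand-in name for the history table in the example:

-- EmployeeHistory(Employee, [Date], Position) mirrors the example above.
SELECT h.Employee, h.[Date], h.Position
FROM EmployeeHistory AS h
JOIN (
    -- last history date per employee (the "aggregate" path)
    SELECT Employee, MAX([Date]) AS last_date
    FROM EmployeeHistory
    GROUP BY Employee
) AS latest
    ON latest.Employee = h.Employee
   AND latest.last_date = h.[Date];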
This is quite a strange problem; I wasn't quite sure how to title it. The issue I have is that some data rows in an SSIS task need to be modified depending on other rows.
Name   Location   IsMultiple
Bob    England
Jim    Wales
John   Scotland
Jane   England
A simplified dataset, with some names, their locations, and a column 'IsMultiple' that needs to be updated to show which rows share locations (Bob's and Jane's rows would be flagged 'True' in the example above).
In my situation there is much more complex logic involved, so solutions using SQL would not be suitable.
My initial thought was to use an asynchronous Script Component: take in all the data rows, parse them, and then output them all after the very last row has been input. The only way I could think of doing this was to call row creation in the PostExecute phase, which did not work.
Is there a better way to go about this?
A couple of options come to mind for SSIS solutions. With both options you would need the data sorted by location. If you can do this in your SQL source, that would be best. Otherwise, you have the Sort component.
With sorted data as your input you can use a Script component that compares the values of adjacent rows to see if multiple locations exist.
Another option would be to split your data path into two. Do this by adding a Multicast component. The first path would be your main path that you currently have. In the second path, add an Aggregate transformation after the Multicast component. Edit the Aggregate and select Location with a Group by operation. Select (*) with a Count all operation. The output will be rows with counts by location.
After the Aggregate, add a Merge Join component and select your first and second data paths as its inputs. Your join keys should be the Location column from each path. All the columns from path 1 should be passed through as outputs, and include the count from path 2 as an output as well.
In a Derived Column transformation, set the IsMultiple column with an expression that says "if the count is greater than 1 then true, else false".
If possible, I might recommend doing it with pure SQL in a SQL task on your control flow prior to your data flow. A simple UPDATE query where you GROUP BY location and use HAVING COUNT(*) > 1 should be able to do this. But if this is a simplified version, that may not be feasible.
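Something along these lines, with dbo.People as a stand-in for the real table name (and 'True' assumed to be how the flag is stored):

-- dbo.People is an assumed table; Location and IsMultiple come from the example data.
UPDATE p
SET    p.IsMultiple = 'True'
FROM   dbo.People AS p
JOIN (
    SELECT Location
    FROM dbo.People
    GROUP BY Location
    HAVING COUNT(*) > 1 -- locations shared by more than one row
) AS dups
    ON dups.Location = p.Location;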
If the data isn't available until after the data flow is done you could place the SQL task after your data flow on your control flow.