I have an SSIS data flow in an SSIS 2012 project.
For every row, I need to calculate a field as the sum of values from another table, based on some criteria, in the best way possible.
It would be something like a lookup but returning an aggregate on the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a script task or stored procedure?
Example:
One data flow has a field named LOT.
I need to get SUM(quantity) from tableb where dataflow.LOT = tableb.lot
and write this back to a field in the flow.
You just need to use the Lookup Component. Instead of selecting tableb in the drop-down, write a query, thus:
SELECT
B.Lot -- for matching
, SUM(B.quantity) AS TotalQuantity -- for data flow injection
FROM
tableb AS B
GROUP BY
B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableB are changing. In the larger-volume case, if it's a problem, then I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of Lots could be pushed to the remote server beforehand so it only computes the aggregates for what I need.
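For example, a sketch of the "current year only" idea; DateReceived is a made-up column name, so substitute whatever date column tableb actually has:

SELECT
    B.Lot
,   SUM(B.quantity) AS TotalQuantity
FROM
    tableb AS B
WHERE
    B.DateReceived >= DATEADD(YEAR, DATEDIFF(YEAR, 0, GETDATE()), 0) -- first day of the current year
GROUP BY
    B.Lot;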
If TableB is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform 2 lookups against the source while Partial would cache the values it has already seen. Partial is probably the better choice, but it depends on memory pressure, etc.
Is there a way to perform a date range lookup using a cache connection manager in SSIS? Or something similar that is very performant.
The scenario I have is as follows.
I have a row in a table that has a date; let's call it BusinessDate. I need to perform a lookup on a dimension table to see if the BusinessDate is between the StartDate and EndDate of the dimension.
The problem is, the table I'm reading from has millions of records and my dimension (lookup table) has a few thousand records, and it takes very long.
Please help...
Nope, the Lookup with a cache connection manager is a strict equals. You might be able to finagle it with a lookup against an OLE DB source with a Partial/None cache model and custom queries.
So, what can you do?
You can modify the way you populate your Lookup Cache. Assuming your data looks something like
MyKey|StartDate|EndDate|MyVal
1 |2021-01-01|2021-02-01|A
1 |2021-02-01|9999-12-31|B
Instead of just loading as is, explode out your dimension.
MyKey|TheDate|MyVal
1 |2021-01-01|A
1 |2021-01-02|A
1 |2021-01-03|A
1 |2021-01-04|A
1 |2021-01-05|A
...
1 |2021-02-01|B
1 |2021-02-02|B
...
You might not want to build your lookup all the way out to year 9999, but know your data and, say, go 5 years into the future as well as picking up the end date.
Now your lookup usage is a supported case - strict equals.
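If you go the explode route, here is a minimal T-SQL sketch of the idea, assuming the dimension sits in a table I'll call dbo.MyDim shaped like the sample above (MyKey, StartDate, EndDate exclusive, MyVal), capped five years out:

-- Explode each StartDate/EndDate range into one row per day.
DECLARE @Horizon date = DATEADD(YEAR, 5, CAST(GETDATE() AS date));

WITH Dates AS
(
    -- anchor: one row per dimension row, starting at StartDate
    SELECT D.MyKey, D.StartDate AS TheDate, D.EndDate, D.MyVal
    FROM dbo.MyDim AS D

    UNION ALL

    -- recurse: add a day until the (exclusive) EndDate or the horizon is reached
    SELECT D.MyKey, DATEADD(DAY, 1, D.TheDate), D.EndDate, D.MyVal
    FROM Dates AS D
    WHERE DATEADD(DAY, 1, D.TheDate) < D.EndDate
      AND DATEADD(DAY, 1, D.TheDate) <= @Horizon
)
SELECT MyKey, TheDate, MyVal
FROM Dates
OPTION (MAXRECURSION 0);

Load that result into the Cache Connection Manager and the Lookup is back to a plain equality match on MyKey plus the date.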
Otherwise, the pattern of a Merge Join is how people handle range joins in a data flow. I'm going to reproduce Matt Masson's article from the MSDN blogs because the original is dead.
Lookup Pattern: Range Lookups
Performing range lookups (i.e. finding a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially common for historical loads and late-arriving fact situations, where you're using Type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.
This blog post outlines three separate approaches for doing range lookups in SSIS:
Using the Lookup Transform
Merge Join + Conditional Split
Script Component
All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.
One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).
Lookup Transform
The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).
We can use the following query to have the lookup transform perform our range lookup:
select [ProductKey], [ProductAlternateKey],
[StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
and [StartDate] <= ?
and (
[EndDate] is null or
[EndDate] > ?
)
On the query parameters page, we map 0 -> ProductNumber, 1 and 2 -> OrderDate.
This approach is effective and easy to setup, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.
Merge Join and Conditional Split
This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.
For example, a row coming in from our source would contain an OrderDate and a ProductNumber.
From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.
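The Merge Join requires both inputs sorted on the join key, so the dimension source query would be something along these lines (a sketch, not the article's exact query), with IsSorted and SortKeyPosition set on the source output in the Advanced Editor, and the fact side sorted on ProductNumber the same way:

SELECT
    ProductKey
,   ProductAlternateKey
,   StartDate
,   EndDate
FROM
    dbo.DimProduct
ORDER BY
    ProductAlternateKey;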
We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):
StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))
This approach is a little more complicated, but performs a lot better than using the Lookup Transform.
Script component
Not reproduced here
Conclusion
Not reproduced here
Is there a way to get the last run date from the cache refresh page for a SSRS report and have that displayed on the report? I'd like the user to know when the data was last refreshed.
You can query the ReportServer database directly to accomplish this:
SELECT MAX(els.TimeEnd) AS LastCacheRefresh
FROM dbo.ExecutionLogStorage AS els
INNER JOIN dbo.Catalog AS cat ON els.ReportID = cat.ItemID
WHERE els.RequestType = 2 --Refresh Cache
AND els.Status = 'rsSuccess'
AND cat.Name = 'MyReport'
Also FYI, Microsoft does not support querying the ReportServer database directly, which means fields/schema could change in later versions of SSRS.
Sorry to contribute to an old thread, but I wasn't finding a lot of hits on this question, so I figured I'd add this:
If you include a calculated field in your source query, it is evaluated when the query is run and recorded along with the rest of your data. For example:
select
field1
, getdate() as QueryDateTime
from
SQLServerTable
and you can then use that value as Min/Max/First in your report (by definition it is the same on every record).
This has to be done by the server dishing up the data, not as a calculated field in SSRS, because those are evaluated at run time, the same as the Now() expression or the global execution time variable.
One downside of course is that you're recording that data and storing it, then having to retrieve a bunch of redundant data when pulling it, so it's not really efficient from a purist I/O perspective. I suspect the cost of one column of a single date value is not too much to worry about in most cases.
I am using SSIS to load data from flat files to a SQL table. The flat files contain both new and updated rows. Each time the process is run, the updated rows will affect a small subset of the SQL table, specified by a 'period' column (e.g. one procedure may only affect periods 3, 4, and 5).
I am using a Lookup transformation to separate new rows (Lookup No Match Output) from existing rows (Lookup Match Output). Since both the reference set and the data set being loaded are extremely large, I would like to use partial caching for the lookup. Is it somehow possible to modify the partial caching query to only include rows from the period numbers included in the flat files?
For example, my reference table may contain data from periods 1-10, but my flat files being loaded may only have data from periods 3-5. Therefore, I only want to cache data from periods 3-5, since I already know periods 1-2 and 6-10 will never produce a match.
Instead of using the table selector in the drop down, which you should never do unless you need every column from every row, write your query to only pull back the columns you need for either matching or augmenting the existing data. In your case, you're going to need to add a filter which is a bit persnickety.
The best approach I've found is to write the lookup query in a variable of type String. In it, I build the query and apply the needed filter. Below, I defined two variables: one an Int that serves as my filter, and the other the query itself, which uses it.
The expression on my SourceQuery Variable is
"SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
WHERE D.rn <= " + (DT_WSTR, 10) #[User::MaxID]
My Data Flow looks like this: I have my source, it hits a Lookup, and based on the match results rows go to one of two buckets. My source query just generates the numbers 1 to 10, and the lookup is a query that generates the even numbers from 2 to 20.
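The source query isn't shown here, but it would be something like this (the same pattern as the lookup query, without the doubling):

SELECT TOP 10
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
FROM
    sys.all_columns AS SA;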
During design time, the lookup query looks like
SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
A normal run would result in a 50/50 split between the buckets
The goal, of course, is to make the lookup query take a parameter like one of the source components, but you'd quickly discover that
SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
WHERE D.rn > ?
doesn't fly. Instead, you have to go back out to the Control Flow, select the Data Flow, right-click and select Properties. In the Properties window for your data flow, go to Expressions and click the ellipsis (...)
There will be a property named after your Lookup component. Assign the Variable that uses the expression to make it all dynamic and, voila, with a MaxID of 6 I only find 3 matches.
A final note, the Partial Cache may or may not be what you're looking for. That's an actual lookup setting that controls how it balances the cost of lookup data versus caching it locally. A full cache will drop all the specified columns for the applicable range into memory which is why you only want to specify the columns you need. If you can get it down to a few skinny columns and even if it's millions of rows, you probably aren't going to feel pain.
Contrived example:
Glean your period minimum and maximum at runtime and store them in two variables, PeriodMinimum and PeriodMaximum (I'm assuming it's a range; I'll discuss alternatives at the end).
Add them as Derived Columns to your Source flow.
In the Lookup Editor, under the Advanced tab, use a custom query (Contrived example): SELECT lookup, value FROM reference where period between ? and ?
Click the Parameters button and use your input columns appropriately.
If instead of a range you want to be able to randomly choose periods (3, 6, and 10) you'll have to do something a bit more contrived ...
Create multiple variables, Period1, 2, 3 ... n and set the default to -1 or some value which is not a valid period.
Populate these variables as needed with the periods you do want to filter on.
In the custom query, use SELECT lookup, value FROM reference where period = ? or period = ? or period = ?, ...
Set each parameter using your input columns.
Anyway, in general, use the Custom query with parameters when you want a dynamic lookup query based on runtime data.
I'm going to do my best to try to explain this. I currently have a data flow task that has an OLE DB Source transferring data from a table from a different database to a table to another database. It works fine but the issue I'm having is the fact that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the data flow task where you transfer the data, you can insert a Lookup transformation. In the Lookup, you can specify a data source (a table or a query, whatever serves you best). When you choose the data source, you can go to the Columns view and create a mapping where you connect the CustomerID, Date and Amount of both tables.
In the General view, you can configure what happens with matched/non-matched rows. Simply take the no-match output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table: SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder AS CO. If you know the process only transfers data from the current year, add a filter to that query to restrict the number of rows returned. The reason for this is memory conservation: you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
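A sketch of that filter, assuming an OrderDate column exists on the table (swap in whatever your date column really is):

SELECT
    CO.CustomerId
,   CO.OrderId
FROM
    dbo.CustomerOrder AS CO
WHERE
    YEAR(CO.OrderDate) = YEAR(GETDATE()); -- current year only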
Inside your data flow, add a Lookup Transformation with that query. You don't say whether 2005, 2008 or 2012 is your SSIS version, and they have different behaviours associated with the Lookup Transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database, so those are the rows that are new. 2005 assumes every row is going to match or it errors; you will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries", and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields, and take an action if matched or not.
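A sketch of what that might look like for this scenario; the staging table name dbo.CustomerOrderStaging and the Amount column are made up for illustration, with CustomerId plus OrderDate as the matching key:

MERGE dbo.CustomerOrder AS tgt
USING dbo.CustomerOrderStaging AS src
    ON  tgt.CustomerId = src.CustomerId
    AND tgt.OrderDate = src.OrderDate
WHEN MATCHED AND tgt.Amount <> src.Amount THEN
    UPDATE SET tgt.Amount = src.Amount         -- existing row, new value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, OrderDate, Amount)     -- brand new row
    VALUES (src.CustomerId, src.OrderDate, src.Amount);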
This is quite a strange problem; I wasn't quite sure how to title it. The issue I have is that some data rows in an SSIS task need to be modified depending on other rows.
Name Location IsMultiple
Bob England
Jim Wales
John Scotland
Jane England
A simplified dataset, with some names, their locations, and a column 'IsMultiple' which needs to be updated to show which rows share locations. (Bob and Jane's rows would be flagged 'True' in the example above.)
In my situation there is much more complex logic involved, so solutions using sql would not be suitable.
My initial thought was to use an asynchronous script task, take in all the data rows, parse them, and then output them all after the very last row has been input. The only way I could think of doing this was to call row creation in the PostExecute phase, which did not work.
Is there a better way to go about this?
A couple of options come to mind for SSIS solutions. With both options you would need the data sorted by location. If you can do this in your SQL source, that would be best. Otherwise, you have the Sort component.
With sorted data as your input you can use a Script component that compares the values of adjacent rows to see if multiple locations exist.
Another option would be to split your data path into two. Do this by adding a Multicast component. The first path is the main path you currently have. In the second path, add an Aggregate transformation after the Multicast component. Edit the Aggregate, select Location as a Group by operation, and select (*) as a Count all. The output will be rows with counts by location.
After the Aggregate, add a Merge Join component and select your first and second data paths as inputs. Your join key should be the Location column from each path. All the columns from path 1 should be outputs; also include the count from path 2 as an output.
In a Derived Column, modify the IsMultiple column with an expression that says "if the count is greater than 1 then true, else false".
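The expression itself would be something like the following (a sketch; RowCountByLocation is whatever you named the Count all column in the Aggregate). It evaluates to a DT_BOOL, so cast it or wrap it in a conditional if IsMultiple is a string column.

[RowCountByLocation] > 1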
If possible, I might recommend doing it with pure SQL in a SQL task on your control flow prior to your data flow. A simple UPDATE query where you GROUP BY location and do a HAVING COUNT for everything greater than 1 should be able to do this. But if this is a simplified version this may not be feasible.
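A sketch of that UPDATE, assuming the rows land in a table shaped like the sample above (I'll call it dbo.People with Name, Location, IsMultiple columns):

UPDATE P
SET    P.IsMultiple = 'True'   -- or 1, depending on the column's type
FROM   dbo.People AS P
WHERE  P.Location IN
       (
           SELECT Location
           FROM   dbo.People
           GROUP BY Location
           HAVING COUNT(*) > 1
       );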
If the data isn't available until after the data flow is done you could place the SQL task after your data flow on your control flow.