Conditional split based on array variable - sql-server-2008

I need something like a T-SQL IN statement to filter records in a conditional split based on an array variable (or something similar).
I need to have a list of items that a column can be filtered on.

As Filip has indicated, there is no IN operator in the expression language. I did come up with some options though as I thought this sounded like an interesting problem.
My long analysis is on my blog: Filter list in SSIS
Conditional split
If you can transform your list of values into a delimited string, then you can use FINDSTRING with the current value to determine whether it's in the list. This provided the best throughput in my testing scenario:
(FINDSTRING(@[User::MyListStr], [MyColumn], 1)) > 0
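One thing to keep in mind is that FINDSTRING is a substring search, so a value like "ab" would also match a list entry "abc". Wrapping both the list and the value in the delimiter avoids that. The Python sketch below only illustrates the logic; the function name, the pipe delimiter, and the sample values are assumptions for illustration, not part of the package.

```python
def in_delimited_list(value: str, items: list[str], delimiter: str = "|") -> bool:
    """Substring-based membership test, similar in spirit to FINDSTRING(list, value, 1) > 0."""
    # Wrap the list and the value in delimiters so only whole entries match.
    haystack = delimiter + delimiter.join(items) + delimiter
    needle = delimiter + value + delimiter
    return haystack.find(needle) >= 0


codes = ["abc", "def", "ghi"]  # hypothetical filter list
print(in_delimited_list("def", codes))  # True: whole entry
print(in_delimited_list("ab", codes))   # False: only part of an entry
```

In expression terms this would probably mean storing the variable as "|abc|def|ghi|" and searching for "|" + [MyColumn] + "|" instead of the bare column value.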
Script task
I had assumed using a List in a script task to determine membership would provide the best performance, but I was wrong.
Row.IsInList = MyListObj.Contains(Row.MyColumn);
Lookup/Cache Connection Manager
The third approach I came up with was dumping the list into a Cache Connection Manager and then using that in a Lookup transformation. I thought this was the easiest to conceptualize and maintain, but the performance was lacking.
Conclusion
For this problem domain, the FINDSTRING approach was the most efficient, by a considerable margin. The other three approaches consistently averaged a throughput of within 7 rows per millisecond of each other. I did find it interesting that the standard deviation of the FINDSTRING approach fluctuated so much. While this box is older and slower, there was not a considerable amount of activity going on during the package executions.

There is no IN operator among the SSIS expression operators, nor anything similar, so you can't do this with built-in expressions in the built-in Conditional Split. But you can do one of the following:
Use a Script Transformation to check whether the column's value is in the array variable and add a flag column with value 1 if it is and 0 if not; then use a Conditional Split on that flag, or
Better, put the values in a database table and then use a Lookup or Merge Join to check whether the row exists.
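To make the flag-then-split idea concrete, here is a minimal Python sketch of the logic only. The function name, the IsInList flag name, and the sample rows are hypothetical; in a real package the flagging would happen in the Script Transformation and the routing in the Conditional Split.

```python
def split_by_membership(rows, column, allowed):
    """Tag each row with an in-list flag, then route rows on that flag."""
    allowed_set = set(allowed)  # set lookup is O(1) on average
    matched, unmatched = [], []
    for row in rows:
        # Step 1: add the flag column (1 if the value is in the list, else 0).
        flagged = dict(row, IsInList=1 if row[column] in allowed_set else 0)
        # Step 2: split on the flag.
        (matched if flagged["IsInList"] else unmatched).append(flagged)
    return matched, unmatched


rows = [{"MyColumn": "A"}, {"MyColumn": "B"}, {"MyColumn": "C"}]  # sample data
matched, unmatched = split_by_membership(rows, "MyColumn", ["A", "C"])
print([r["MyColumn"] for r in matched])    # ['A', 'C']
print([r["MyColumn"] for r in unmatched])  # ['B']
```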

Related

Aggregate Transformation vs Sort (remove Duplicate) in SSIS

I'm trying to populate dimension tables on a regular basis and I've thought of two ways of getting distinct values for my dimension:
Using an Aggregate transformation and then using the "Group by" operation.
Using a Sort transformation while removing duplicates.
I'm not sure which one is better (more efficient), or which one is adopted more widely in the industry.
I tried to perform some tests using dummy data, but I can't quite get a solid answer.
P.S. Using SELECT DISTINCT from the source is not an option here.
My first choice would always be to correct this in my source query if possible. I realise that isn't always an option, but for the sake of completeness for future readers: I would first check whether I had a problem in my source query that was creating duplicates. Whenever a DISTINCT seems necessary, I first see whether there's actually a problem with the query that needs resolving.
My second choice would be a DISTINCT - if it were possible - because this is one of those cases where it will probably be quicker to resolve in SQL than in SSIS; but I realise that's not an option for you.
From that point, you're getting into a situation where you might need to try out the remaining options. Aside from using an Aggregate or Sort in SSIS, you could also dump the results into a staging table, and then have a separate data flow which does use a DISTINCT in its source query. Aggregate and Sort are both blocking transformations in SSIS, so using a staging table might end up being faster, but which is fastest for you will depend on a number of factors, including the nature of your data and of your infrastructure. You might also want to keep in mind what else is running in parallel if you use the SSIS options, as they can be memory-hungry.
If your data is (or can be) sorted in your source or source query, then there's also a clever idea in the link below for creating "semi-blocking" versions of Aggregate and Sort using script tasks:
http://social.technet.microsoft.com/wiki/contents/articles/30703.ssis-implementing-a-faster-distinct-sort-or-aggregate-transformation.aspx
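The reason sorted input helps is that duplicates end up next to each other, so each row only has to be compared with the previous one and can be passed downstream straight away instead of waiting for the whole set. A minimal Python sketch of that idea follows; it is not the code from the linked article, and the function name and sample rows are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def distinct_sorted(rows: Iterable[Tuple]) -> Iterator[Tuple]:
    """Yield each distinct row once, assuming the input is already sorted on the key."""
    sentinel = object()  # marks "no previous row yet"
    previous = sentinel
    for row in rows:
        # With sorted input, a duplicate can only be the row immediately before.
        if previous is sentinel or row != previous:
            yield row
        previous = row


sorted_rows = [("Bike",), ("Bike",), ("Car",), ("Truck",), ("Truck",)]
print(list(distinct_sorted(sorted_rows)))  # [('Bike',), ('Car',), ('Truck',)]
```

Only one previous row is held in memory, which is why this kind of approach is less memory-hungry than a full Sort or Aggregate.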

SSRS and Comparison Operators on Numeric Portion of varchar

Each returned transaction I am to report on is stored with a return reason code and a description of the return reason code. I built a tablix with two columns - one for return codes and another for descriptions. This works just peachy. The report owner is upset that a long list of codes will split pages - sigh. I was told to display them side-by-side.
I am new to T-SQL and SSRS and its idiosyncrasies. I have minimal support from our DBAs. Two tables, filtered to display codes that meet a criterion, sounds simple enough.
My research:
MSDN's support network, Operators in Expressions page, and various help topics. I also found SO posts regarding split functions in t-sql and similar as well as one specifically asking about comparison and varchar. I found sites with helpful information like ResultData and Network Steve. I haven't found what I think I'm looking for.
My problem:
The return reason code is a varchar that always consists of the letter 'R' and two numeric digits (R00 to R99). It appears I can't run a comparison operator on an entire varchar that is alphanumeric; it doesn't recognize IIF((Fields!... <= R17),True,False). Additionally, the company will not allow the warehouse or its functions to be edited, so I cannot create my own.
My solution ideas:
Add each Rnn code to the tablix filter, individually. This means ~50 filters per tablix and seems a sloppy or inefficient way of handling this
Separate the varchar string into its alpha and numeric components and compare the latter using standard operators. This sounds like the cleanest method, but I'm unsure how to accomplish this in an expression or within SSRS
Forgo the two-table idea and create one table with four columns (code, description, code, description). This still leaves me with how to set a limit on the number of rows that can be created before 'spilling over' to the other side
I appreciate being pointed to any resources or any offered input to the issue and my (not so?)logical approach to it.
You can achieve your second option as follows:
CInt(Fields!ReturnCode.Value.Substring(1,2))
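The expression takes two characters starting at zero-based position 1 (skipping the leading letter) and converts them to an integer, which can then be compared numerically. Here is the same logic as a small Python sketch. The function names are hypothetical, and the threshold of 17 is only borrowed from the R17 example in the question.

```python
def code_number(return_code: str) -> int:
    """Return the numeric part of a code such as 'R17' (same idea as Substring(1, 2))."""
    return int(return_code[1:3])


def split_codes(codes, threshold=17):
    """Partition codes into those at or below the threshold and those above it."""
    low = [c for c in codes if code_number(c) <= threshold]
    high = [c for c in codes if code_number(c) > threshold]
    return low, high


low, high = split_codes(["R01", "R17", "R18", "R99"])
print(low)   # ['R01', 'R17']
print(high)  # ['R18', 'R99']
```

In the report, each of the two tables could presumably be filtered on the CInt(...) expression with a numeric comparison (<= 17 for one table, > 17 for the other) rather than listing every code.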

SSAS calculated measure: Access relational database

I recently asked a question about many-to-many relationships and how they can be used to calculate intersections that got answered pretty fine. Now, there is another nice-to-have requirement for our cube to extend that to more data. The general question remains: How many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement that using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware: the cube is reaching sizes close to 0.5 TB, and queries take several minutes to complete.
Now I would like to try another option: Can I access our relational database in a calculated measure? It seems I can, using UDFs as described in this article. I could write a function in C# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example Microsoft provides only includes a small deterministic string function as the UDF.
Here are my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the currentMember of each hierarchy; however, this does not work with Excel-style subselects. It contains either a single member or the AllMember.
Another option is to get the MDX query using the DMV SELECT * FROM $System.DISCOVER_SESSIONS. That view has a column containing the last MDX query for a given session. However, so as not to overwrite your own last query, you need to open a new connection rather than use the current one. The session id can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
The second approach is OK for our use case. It does not allow you to handle axes, since the UDF has cell scope but you don't know which cell you are in. If any of you knows anything about that last bit, please tell me. Thanks!

Statistical Process Control Charts in SQL Server 2008 R2

I'm hoping you can point me in the right direction.
I'm trying to generate a control chart (http://en.wikipedia.org/wiki/Control_chart) using SQL Server 2008. Creating a basic control chart is easy enough. I'd just calculate the mean and standard deviations and then plot them.
The complex bit (for me at least) is that I would like the chart to reset the mean and the control limits when a step change is identified.
Currently I'm only interested in a really simple method of identifying a step change, 5 points appearing consecutively above or below the mean. There are more complex ways of identifying them (http://en.wikipedia.org/wiki/Western_Electric_rules) but I just want to get this off the ground first.
The process I have sort of worked out is:
Aggregate and order by month and year, apply row numbers.
Calculate overall mean
Identify if each data item is higher, lower or the same as the mean, tag with +1, -1 or 0.
Identify when there are 5 consecutive data items which are above or below the mean (currently using a cursor).
Recalculate the mean if 5 points are above or 5 points are below the mean.
Repeat until end of table.
Is this sort of process possible in SQL Server? It feels like I maybe need a recursive UDF, but recursion is a bit beyond me!
A nudge in the right direction would be much appreciated!
Cheers
Ok, I ended up just using WHILE loops to iterate through. I won't post full code but the steps were:
Set up a user defined table data type in order to pass data into a stored procedure parameter.
Wrote an accompanying stored procedure that uses row numbers and WHILE loops to iterate over each data value in the input table, then uses the current row number to do set-based processing on a subset of the input data (checking whether the following 5 points are above/below the mean, and recalculating the mean and standard deviations when this flag is tripped).
Outputs table with original values, row numbers, months, mean values, upper control limit and lower control limit.
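For anyone wanting to see the logic outside of T-SQL, the steps above can be sketched roughly as follows in Python. This is not the stored procedure. It assumes that the mean and standard deviation for a segment are computed over all points from the segment start onward, that limits are mean plus or minus 3 standard deviations, and that a run which begins at the segment start is ignored so the loop always moves forward. All names are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(values, run_length=5, sigmas=3.0):
    """Return (value, mean, lcl, ucl) per point, resetting limits at step changes."""
    results = []
    start = 0
    while start < len(values):
        segment = values[start:]
        centre = mean(segment)
        spread = pstdev(segment)
        # Look for the first run of `run_length` points on one side of the mean.
        run_start, run_sign, step_at = start, 0, None
        for i in range(start, len(values)):
            sign = (values[i] > centre) - (values[i] < centre)  # +1, -1 or 0
            if sign == 0 or sign != run_sign:
                run_start, run_sign = i, sign  # a new run begins here
            if sign != 0 and i - run_start + 1 == run_length and run_start > start:
                step_at = run_start
                break
        # Points before the step keep the current limits; the rest are redone.
        end = step_at if step_at is not None else len(values)
        for v in values[start:end]:
            results.append((v, centre, centre - sigmas * spread, centre + sigmas * spread))
        start = end
    return results


data = [10.0] * 5 + [20.0] * 5  # made-up series with one obvious step
for row in control_limits(data):
    print(row)
```

With the sample series, the first five points are reported against the overall mean of 15 and the last five against a recalculated mean of 20.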
I've also got one up and running that works based on full Nelson rules and will also state which test the data has failed.
Currently it's only been used by me as I develop it further so I've set up an Excel sheet with some VBA to dynamically construct a SQL string which it passes to a pivot table as the command text. That way you can repeatedly ping the USP with different data sets and also change a few of the other parameters on how the procedure runs (such as adjusting control limits and the like).
Ultimately I want to be able to pass the resulting data to Business Objects reports and dashboards that we're working on.

SSRS - How can I create a custom aggregate function?

I'm creating a report that has an unusual BoxPlot chart. I need to calculate the values for "Low Box" and "High Box" using all of the data for the certain column. The methodology for calculating these values is not that complicated, but I can not disclose it.
Basically I want to create a custom aggregate function. I understand how to create a VB function, but how do I make it take in a series of data instead of a single value? I know there is a Max function already, but for the sake of example, how would one implement a Max function?
Thanks for your help.
"can not disclose it." implies high value, which implies that you are using a recent version of SSRS, so this link should be of value for you. (The blog article also includes how you might implement this in 2005, but doesn't focus on it.)
Essentially, create a custom function that gets called for every row of the data, taking in values from that row. That method, or another related method, can return your aggregate. SSRS 2008 includes Group Variables, which should provide a convenient place to store that.
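Since the question asked about Max as an example, here is the shape of that pattern as a minimal Python sketch: one method is called per row and updates stored state, and a second method returns the aggregate once all rows have been seen. The class and method names are hypothetical, and in a report the equivalent would be VB custom code rather than Python.

```python
class RunningMax:
    """Accumulates a maximum one row at a time, like a per-row custom function."""

    def __init__(self):
        self._max = None  # no rows seen yet

    def add(self, value):
        """Called once per row; updates the state and echoes the value back."""
        if value is not None and (self._max is None or value > self._max):
            self._max = value
        return value

    def result(self):
        """Called after all rows have been processed; returns the aggregate."""
        return self._max


agg = RunningMax()
for v in [3, 9, 4]:
    agg.add(v)
print(agg.result())  # 9
```

The same structure should work for any aggregate that can be built up row by row, as long as the state is reset for each group.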
Another approach, but much harder I think, would be to implement a custom data provider wrapping your query.