Statistical Process Control Charts in SQL Server 2008 R2

I'm hoping you can point me in the right direction.
I'm trying to generate a control chart (http://en.wikipedia.org/wiki/Control_chart) using SQL Server 2008. Creating a basic control chart is easy enough. I'd just calculate the mean and standard deviations and then plot them.
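For the basic chart, something like this is enough (a minimal sketch; the table and column names are made up, and the conventional 3-sigma limits are assumed):

SELECT
    AVG(MeasureValue) AS MeanValue,
    AVG(MeasureValue) + 3 * STDEV(MeasureValue) AS UpperControlLimit,
    AVG(MeasureValue) - 3 * STDEV(MeasureValue) AS LowerControlLimit
FROM dbo.MonthlyData;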
The complex bit (for me at least) is that I would like the chart to reset the mean and the control limits when a step change is identified.
Currently I'm only interested in a really simple method of identifying a step change, 5 points appearing consecutively above or below the mean. There are more complex ways of identifying them (http://en.wikipedia.org/wiki/Western_Electric_rules) but I just want to get this off the ground first.
The process I have sort of worked out is:
Aggregate and order by month and year, apply row numbers.
Calculate overall mean
Identify if each data item is higher, lower or the same as the mean, tag with +1, -1 or 0.
Identify when there are 5 consecutive data items which are above or below the mean (currently using a cursor).
Recalculate the mean if 5 points are above or 5 points are below the mean.
Repeat until end of table.
Is this sort of process possible in SQL Server? It feels like I may need a recursive UDF, but recursion is a bit beyond me!
A nudge in the right direction would be much appreciated!
Cheers

Ok, I ended up just using WHILE loops to iterate through. I won't post full code but the steps were:
Set up a user-defined table type in order to pass data into a stored procedure parameter.
Wrote an accompanying stored procedure that uses row numbers and WHILE loops to iterate over each data value in the input table, using the current row number to do set-based processing on a subset of the input data (checking whether the following 5 points are above/below the mean, and recalculating the mean and standard deviations when that flag is tripped).
The procedure outputs a table with the original values, row numbers, months, mean values, and the upper and lower control limits.
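Not the full code, but a minimal sketch of the core loop (table and column names are illustrative; the real procedure also computes the control limits and takes the table type mentioned above as a parameter):

DECLARE @data TABLE (rn INT, val FLOAT, mean FLOAT);

INSERT INTO @data (rn, val)
SELECT ROW_NUMBER() OVER (ORDER BY YearNo, MonthNo), MeasureValue
FROM dbo.MonthlyData;

DECLARE @start INT = 1, @rn INT = 1, @max INT, @mean FLOAT;
SELECT @max = MAX(rn) FROM @data;

WHILE @rn <= @max
BEGIN
    -- Mean of the current segment so far
    SELECT @mean = AVG(val) FROM @data WHERE rn BETWEEN @start AND @rn;

    -- Step change: the next 5 points all sit on one side of the mean
    IF EXISTS (SELECT 1 FROM @data
               WHERE rn BETWEEN @rn + 1 AND @rn + 5
               HAVING COUNT(*) = 5 AND (MIN(val) > @mean OR MAX(val) < @mean))
    BEGIN
        -- Close the current segment and start a new one
        UPDATE @data SET mean = @mean WHERE rn BETWEEN @start AND @rn;
        SET @start = @rn + 1;
    END

    SET @rn = @rn + 1;
END

-- Close the final segment
UPDATE @data SET mean = @mean WHERE rn >= @start;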
I've also got one up and running that works on the full Nelson rules and will also state which test the data has failed.
Currently it's only been used by me as I develop it further, so I've set up an Excel sheet with some VBA that dynamically constructs a SQL string and passes it to a pivot table as the command text. That way you can repeatedly ping the USP with different data sets and also change a few of the other parameters that control how the procedure runs (such as adjusting the control limits and the like).
Ultimately I want to be able to pass the resulting data to Business Objects reports and dashboards that we're working on.

Related

Anylogic: How to create an objective function using values of two datasets (for optimization experiment)?

In my AnyLogic model I have a population of agents (4 terminals) where trucks arrive, are served, and depart. The terminals have two parameters (numberOfGates and servicetime) which influence the departures per hour of trucks leaving the terminals. Now I want to tune these two parameters so that the number of departures per hour is closest to reality (I know the actual departures per hour). I already have two datasets within each terminal agent: one with the number of departures per hour that I simulate, and one with the observedDepartures from the data.
I already compare these two datasets in plots for every terminal.
Now I want to create an optimization experiment to tune the numberOfGates and servicetime of the terminals so that the departures dataset is closest to the observedDepartures dataset. Does anyone know the easiest way to create an objective function for this optimization experiment?
When I add a variable diff that is updated every hour by abs(departures - observedDepartures) and put root.diff in the optimization experiment, it gives me the error 'eq(null) is not allowed. Use isNull() instead' in a line that reads the database for the observedDepartures (see last picture). It works when I run the simulation normally; it only gives this error when running the optimization experiment (I don't know why).
You can use the sum of the absolute differences for each replication. That is, create a variable that logs the |difference| for each hour; call it diff. Then, in the optimization experiment, minimize the sum of that variable. In fact, this is close to a typical regression model's objective, except that regression uses a more complex objective function, minimizing the sum of the squared differences.
A Calibration experiment already does (in a more mathematically correct way) what you are trying to do, using the in-built difference function to calculate the 'area between two curves' (which is what the optimisation is trying to minimise). You don't need to calculate differences or anything yourself. (There are two variants of the function to compare either two Data Sets (your case) or a Data Set and a Table Function (useful if your empirical data is not at the same time points as your synthetic simulated data).)
In your case it (the objective function) will need to be a sum of the differences between the empirical and simulated datasets for the 4 terminals (or possibly a weighted sum if the fit for some terminals is considered more important than for others).
So your objective is something like
difference(root.terminals(0).departures, root.terminals(0).observedDepartures)
+ difference(root.terminals(1).departures, root.terminals(1).observedDepartures)
+ difference(root.terminals(2).departures, root.terminals(2).observedDepartures)
+ difference(root.terminals(3).departures, root.terminals(3).observedDepartures)
(It would be better to calculate this for an arbitrary population of terminals in a function but this is the 'raw shape' of the code.)
A Calibration experiment is actually just a wizard which creates an Optimization experiment set up in a particular way (with a UI and all settings/code already created for you), so you can just use that objective in your existing Optimization experiment (but it won't have a built-in useful UI like a Calibration experiment). This also means you can still set this up in the Personal Learning Edition too (which doesn't have the Calibration experiment).

SSRS IIF efficiency

I am using the following expression to tier the sales figures.
=sum(iif(Fields!InitialValue.Value>=500000 and Fields!InitialValue.Value<1000000,Fields!InitialValue.Value,nothing))
Basically, I just change the greater than and less than values for each cell. We have 4 tiers.
From what I understand, the IIF statement will go through each line and evaluate it before returning anything.
I am also averaging the size of each new account, so I have 8 cells that evaluate the data each time. I will also need to add how many accounts are in each tier, which means 12 passes at the same data. It takes some time to generate this report.
Is this the most efficient method?
Thanks in advance for all your help!
From what I can tell, there are two ways you could increase efficiency here. The way I would do it is to add a column to your query that labels each row by tier; that way, by the time the data gets to SSRS it is already set and never needs to be evaluated. My theory is that SSRS is not as smart as a query optimizer. Another way, which may or may not speed things up, is to add a calculated field to your dataset that essentially does the same thing. I believe SSRS would then calculate it once and that is it.
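For example, something along these lines in the dataset query (column names are made up to match the expression above, and only the 500,000-1,000,000 band comes from your IIF; the other tier boundaries are illustrative):

SELECT
    AccountId,
    InitialValue,
    CASE
        WHEN InitialValue >= 1000000 THEN 'Tier 4'
        WHEN InitialValue >= 500000  THEN 'Tier 3'
        WHEN InitialValue >= 100000  THEN 'Tier 2'
        ELSE 'Tier 1'
    END AS Tier
FROM dbo.NewAccounts;

SSRS can then group on Tier and use plain Sum(), Avg() and Count(), giving one pass over the data instead of 12.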

Partial Data Set in WEBI 4.0

When I run a query in Web Intelligence, I only get a part of the data.
But I want to get all the data.
The resulting data set I am retrieving from the database is quite large (10 million rows). However, I do not want to have 10 million rows in my reports, but to summarize it so that the report has at most 50 rows.
Why am I getting only a partial data set as a result of WEBI query?
(I also noticed that in the bottom right corner there is an exclamation mark indicating that I am working with a partial data set, and when I click refresh I still get the partial data set.)
BTW, I know I can see the SQL query when I build it using the query editor, but can I see the corresponding query when I make a certain report? If yes, how?
UPDATE: I have tried editing the 'Limit size of result set to:' option in the Query Options in the Business Layer, first setting the value to 9 999 999 and then unchecking the option entirely. However, I am still getting the partial result.
UPDATE: I have checked the number of rows in the resulting set - it is 9.6 million. Now it's even more confusing why I'm not getting all the rows (the max number of rows was set to 9 999 999).
SELECT
    I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Description_TXT,
    COUNT(I_ATA_MV_FinanceTreasury.VWD_Party_A.Party_KEY)
FROM
    I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A
    RIGHT OUTER JOIN I_ATA_MV_FinanceTreasury.VWD_Party_A
        ON (I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Segment_Value_KEY = I_ATA_MV_FinanceTreasury.VWD_Party_A.Segment_Value_KEY)
GROUP BY 1
The "Limit size of result set" setting is a little misleading. You can choose an amount lower than the associated setting in the universe, but not higher. That is, if the universe is set to a limit of 5,000, you can set your report to a limit lower than 5,000, but you can't increase it.
Does your query include any measures? If not, and your query is set to retrieve duplicate rows, you will get an un-aggregated result.
If you're comfortable reading SQL, take a look at the report's generated SQL, and that might give you a clue as to what's going on. It's possible that there is a measure in the query that does not have an aggregate function (as it should).
While this may be a little off-topic, I personally would advise against loading that much data into a Web Intelligence document, especially if you're going to aggregate it to 50 rows in your report.
These are not the kind of data volumes WebI was designed to handle (regardless whether it will or not). Ideally, you should push down as much of the aggregation as possible to your database (which is much better equipped to handle such volumes) and return only the data you really need.
Have a look at this link, which contains some best practices. For example, slide 13 specifies that:
50,000 rows per document is a reasonable number
What you need to do is to add a measure to your query and make sure that this measure uses an aggregate database function (e.g. SUM()). This will cause WebI to create a SQL statement with GROUP BY.
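For example, once the query contains such a measure, the generated SQL takes roughly this shape (names shortened and the amount column invented for illustration):

SELECT
    Segment_Value_A.Description_TXT,
    SUM(Party_A.Amount_NBR)
FROM
    Segment_Value_A
    RIGHT OUTER JOIN Party_A
        ON Segment_Value_A.Segment_Value_KEY = Party_A.Segment_Value_KEY
GROUP BY
    Segment_Value_A.Description_TXT

so the database aggregates the millions of detail rows and WebI only receives the summarized result.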
Another alternative is to disable the option Retrieve duplicate rows. You can set this option by opening the data provider's properties.

SSAS calculated measure: Access relational database

I recently asked a question about many-to-many relationships and how they can be used to calculate intersections, which was answered pretty well. Now, there is another nice-to-have requirement for our cube: to extend that to more data. The general question remains: how many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement that using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware: the cube is reaching sizes close to 0.5 TB, and queries take several minutes to complete.
Now I would like to try another option: can I access our relational database in a calculated measure? It seems I can, using UDFs as described in this article. I could write a function in C# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example Microsoft provides only includes a small deterministic string function as the UDF.
Here my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the currentMember of each hierarchy; however, this does not work with Excel-style subselects. It either contains a single member or the AllMember.
Another option is to get the MDX query using the DMV query SELECT * FROM $System.DISCOVER_SESSIONS. That view has a column which contains the last MDX query for a given session. However, in order not to overwrite your own last query, you must not use the current connection but open a new one. The session ID can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
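A sketch of the lookup (the column holding the last command is SESSION_LAST_COMMAND; run this on a second connection so it does not read back your own DMV query):

SELECT SESSION_ID, SESSION_LAST_COMMAND
FROM $System.DISCOVER_SESSIONS
WHERE SESSION_ID = '<id obtained from Context.CurrentConnection.SessionID>'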
The second approach is OK for our use case. It does not allow you to handle axes, since the UDF has cell scope but you don't know which cell you are in. If any of you knows anything about that last bit, please tell me. Thanks!

SSRS - How can I create a custom aggregate function?

I'm creating a report that has an unusual BoxPlot chart. I need to calculate the values for "Low Box" and "High Box" using all of the data for a certain column. The methodology for calculating these values is not that complicated, but I cannot disclose it.
Basically, I want to create a custom aggregate function. I understand how to create a VB function, but how do I make it take in a series of data instead of a single value? I know there is a Max function already but, for the sake of example, how would one implement a Max function?
Thanks for your help.
"can not disclose it." implies high value, which implies that you are using a recent version of SSRS, so this link should be of value for you. (The blog article also includes how you might implement this in 2005, but doesn't focus on it.)
Essentially, create a custom function that gets called for every row of the data, taking in values from that row. That method, or another related method, can then return your aggregate. SSRS 2008 includes Group Variables, which provide a convenient place to store that.
Another approach, but much harder I think, would be to implement a custom data provider wrapping your query.