AnyLogic: How to create an objective function using values of two datasets (for an optimization experiment)?

In my AnyLogic model I have a population of agents (4 terminals) where trucks arrive, are served, and depart. The terminals have two parameters (numberOfGates and servicetime) which influence the number of truck departures per hour. Now I want to tune these two parameters so that the simulated departures per hour are closest to reality (I know the actual departures per hour). I already have two datasets within each terminal agent: one with the number of departures per hour that I simulate, and one with the observedDepartures from the data.
I already compare these two datasets in plots for every terminal.
Now I want to create an optimization experiment to tune the numberOfGates and servicetime of the terminals so that the departures dataset is closest to the observedDepartures dataset. Does anyone know the easiest way to create an objective function for this optimization experiment?
When I add a variable diff that is updated every hour by abs(departures - observedDepartures) and put root.diff in the optimization experiment, it gives me the error "eq(null) is not allowed. Use isNull() instead" in a line that reads the database for the observedDepartures. It works when I run the simulation normally; it only gives this error when running the optimization experiment (I don't know why).

You can use the sum of the absolute differences for each replication. That is, create a variable, call it diff, that accumulates |departures - observedDepartures| for each hour. Then, in the optimization experiment, minimize the value of that variable. This is close to a typical regression objective; regression uses a slightly different function, minimizing the sum of the squared differences.
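A minimal sketch of that accumulator, assuming a cyclic event with a one-hour recurrence inside each terminal agent and a double variable diff on Main (reached here as main.diff); getYValue() and size() are standard AnyLogic DataSet methods, but the rest of the wiring is illustrative:
// executed once per model hour by a cyclic event inside each terminal agent
int i = departures.size() - 1;                 // index of the latest hourly sample
if (i >= 0 && i < observedDepartures.size())   // guard against unequal lengths
    main.diff += abs(departures.getYValue(i) - observedDepartures.getYValue(i));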

A Calibration experiment already does (in a more mathematically correct way) what you are trying to do, using the in-built difference function to calculate the 'area between two curves' (which is what the optimisation is trying to minimise). You don't need to calculate differences or anything yourself. (There are two variants of the function to compare either two Data Sets (your case) or a Data Set and a Table Function (useful if your empirical data is not at the same time points as your synthetic simulated data).)
In your case it (the objective function) will need to be a sum of the differences between the empirical and simulated datasets for the 4 terminals (or possibly a weighted sum if the fit for some terminals is considered more important than for others).
So your objective is something like
difference(root.terminals(0).departures, root.terminals(0).observedDepartures)
+ difference(root.terminals(1).departures, root.terminals(1).observedDepartures)
+ difference(root.terminals(2).departures, root.terminals(2).observedDepartures)
+ difference(root.terminals(3).departures, root.terminals(3).observedDepartures)
(It would be better to calculate this over an arbitrary population of terminals in a function, as sketched below, but this is the 'raw shape' of the code.)
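A minimal sketch of such a function, say totalDifference() defined on Main and returning double; the agent type name Terminal and the population name terminals are assumptions:
// sums the built-in difference() metric over the whole population
double total = 0;
for (Terminal t : terminals)
    total += difference(t.departures, t.observedDepartures);
return total;
The optimization experiment's objective would then simply be root.totalDifference().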
A Calibration experiment is actually just a wizard which creates an Optimization experiment set up in a particular way (with a UI and all settings/code already created for you), so you can use that objective in your existing Optimization experiment; it just won't have the Calibration experiment's built-in UI. This also means you can set this up in the Personal Learning Edition (which doesn't have the Calibration experiment).

Can I find price floors and ceilings with CUDA?

Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction, but the change is less than a reversal threshold (3 units), do nothing. If the change is greater than the reversal threshold, start a new column (X or O).
Sequential vs Parallel
Sequentially, this is pretty straightforward. I keep a variable each for the floor and the ceiling. If the current price breaks through the floor or ceiling, or changes by more than the reversal threshold, I can take the appropriate action.
My question is, is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite difference algorithms from NVIDIA. These produce local maxima/minima but not the reversal points. Small fluctuations produce numerous relative maxima/minima, but most of them are trivial because the change is not greater than the reversal size.
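For reference, a minimal sequential sketch of the rules above (the 1-unit box size, the sample prices, and the output format are illustrative assumptions):
#include <cstdio>
#include <vector>

int main() {
    const double unit = 1.0, reversalThreshold = 3.0;  // 3-unit reversal, as stated
    std::vector<double> price = {10, 9, 8, 7, 8, 9, 10, 11, 12, 9};

    double floorLvl = price[0], ceilLvl = price[0];
    int direction = 0;                 // -1 falling column, +1 rising column, 0 unset

    for (double p : price) {
        if (direction <= 0 && p <= floorLvl - unit) {
            while (p <= floorLvl - unit) {            // O per floor broken through
                floorLvl -= unit;
                std::printf("O at %.1f\n", floorLvl);
            }
            direction = -1; ceilLvl = floorLvl;
        } else if (direction >= 0 && p >= ceilLvl + unit) {
            while (p >= ceilLvl + unit) {             // X per ceiling broken through
                ceilLvl += unit;
                std::printf("X at %.1f\n", ceilLvl);
            }
            direction = +1; floorLvl = ceilLvl;
        } else if (direction == -1 && p - floorLvl >= reversalThreshold) {
            std::printf("reversal: start a new X column\n");
            direction = +1; ceilLvl = floorLvl + unit;
        } else if (direction == +1 && ceilLvl - p >= reversalThreshold) {
            std::printf("reversal: start a new O column\n");
            direction = -1; floorLvl = ceilLvl - unit;
        }
    }
    return 0;
}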
My question is, is there a way to find these reversal points in parallel?
One possible approach:
use thrust::unique to remove periods where the price is numerically constant
use thrust::adjacent_difference to produce 1st difference data
use thrust::adjacent_difference on the 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
use these points of change in sign of slope to identify separate regions of data - build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
use thrust::exclusive_scan_by_key on the 1st difference data, to produce the net change of the run
Wherever the net change of the run exceeds a threshold, flag it as a "reversal" (a sketch of these steps follows below).
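A condensed sketch of the middle steps above, assuming the thrust::unique pass has already removed flat periods (so there are no zero first differences) and using an inclusive scan for the running net change; all calls are standard Thrust, but the structure is illustrative rather than a drop-in solution:
#include <thrust/device_vector.h>
#include <thrust/adjacent_difference.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <vector>

// 1 where the slope sign flips relative to the previous step, else 0
struct SignFlip {
    __host__ __device__ int operator()(float cur, float prev) const {
        return (cur > 0.f) != (prev > 0.f) ? 1 : 0;
    }
};

int main() {
    std::vector<float> h = {10, 11, 12, 11, 10, 9, 10, 11, 12, 13};
    thrust::device_vector<float> price(h.begin(), h.end());
    int n = price.size();

    // 1st difference (d[0] is just price[0]; ignored below)
    thrust::device_vector<float> d(n);
    thrust::adjacent_difference(price.begin(), price.end(), d.begin());

    // mark sign changes, then prefix-sum them into a run-id key vector
    thrust::device_vector<int> flip(n, 0), key(n);
    thrust::transform(d.begin() + 2, d.end(), d.begin() + 1,
                      flip.begin() + 2, SignFlip());
    thrust::inclusive_scan(flip.begin(), flip.end(), key.begin());

    // running net change within each run (equal keys = one run)
    thrust::device_vector<float> net(n, 0.f);
    thrust::inclusive_scan_by_key(key.begin() + 1, key.end(),
                                  d.begin() + 1, net.begin() + 1);

    // any i with |net[i]| >= threshold is then a candidate reversal point
    return 0;
}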
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.

Is standard deviation (STDDEV) the right function for the job?

We wrote a monitoring system. The monitor is made of agents. Each agent runs on a different server and monitors that specific server's resources (RAM, CPU, SQL Server status, replication status, free disk space, internet access, specific business metrics, etc.).
The agents report every measure they take to a central database where these "observations" are stored.
For example, every few seconds an agent will store in the central database a specific business metric called unprocessed_files with its corresponding value:
(unprocessed_files, 41)
That value is constantly being written to our DB (among many others, as explained above).
We are now implementing a client application, a screen, that displays the status of everything we monitor. So, how can we calculate what is a "normal" value and what is a wrong value?
For example, we know that if our servers are working correctly, unprocessed_files should always be close to 0, but maybe (we don't know yet) 45 is also an acceptable value.
So the question is, should we use the Standard Deviation in order to know what the acceptable range of values is?
ACCEPTABLE_RANGE = AVG(value) +- STDDEV(value) ?
We would like to notify with a red color when something is not going well.
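For concreteness, the proposed check would look something like this (MySQL syntax; SQL Server spells the function STDEV, and the observations table with metric/value columns is an assumed schema):
SELECT AVG(value) - STDDEV(value) AS lower_bound,
       AVG(value) + STDDEV(value) AS upper_bound
FROM   observations
WHERE  metric = 'unprocessed_files';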
For your backlog (unprocessed_files) metric, using a standard deviation to decide when to sound an alarm (turn something red) is going to drive you crazy with false alarms.
Why? Most of the time your backlog will be zero, so the standard deviation will also be very close to zero. Standard deviation tells you how much your metric varies. Therefore, whenever you get a nonzero backlog, it will be outside the avg + stdev range.
For a backlog, you may want to turn stuff yellow when the value is > 1 and red when the value is > 10.
If you have a "how long did it take" metric, standard deviation might be a valid way to identify alarm conditions. For example, you might have a web request that usually takes about half a second, but typically varies from 0.25 to 0.8 second. If they suddenly start taking 2.5 seconds, then you know something has gone wrong.
Standard deviation is a measurement that makes most sense for a normal distribution (bell curve distribution). When you handle your measurements as if they fit a bell curve, you're implicitly making the assumption that each measurement is entirely independent of the others. That assumption works poorly for typical metrics of a computing system (backlog, transaction time, load average, etc). So, using stdev is OK, but not great. You'll probably struggle to make sense of stdev numbers: that's because they don't actually make much sense.
You'd be better off, as #duffymo suggested, looking at the 95th percentile (the worst-performing operations). But MySQL doesn't compute those kinds of distributions natively. PostgreSQL does, as do Oracle Standard Edition and higher.
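In PostgreSQL, that is a one-liner with an ordered-set aggregate (same assumed schema as above; request_seconds is a hypothetical metric name):
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95
FROM   observations
WHERE  metric = 'request_seconds';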
How do you determine an out-of-bounds metric? It depends on the metric and on what you're trying to do. If it's a backlog measurement and it grows from minute to minute, you have a problem to investigate. If it's a transaction time and it's far longer than average (avg + 3 x stdev, for example), you have a problem. The open source monitoring system Nagios has worked this out for various kinds of metrics.
Read a book by N. N. Taleb called "The Black Swan" if you want to know how assuming the real world fits normal distributions can crash the global economy.
Standard deviation is just a way of characterizing how much a set of values spreads away from its average (i.e. mean). In a sense, it's an "average deviation from average", though a little more complicated than that. It is true that values which differ from the mean by many times the standard deviation tend to be rare, but that doesn't mean the standard deviation is a good benchmark for identifying anomalous values that might indicate something is wrong.
For one thing, if you set your acceptable range at the average plus or minus one standard deviation, you're probably going to get very frequent results outside that range! You could use the average plus or minus two standard deviations, or three, or however many you want, to push the number of notifications/error conditions as low as you like, but there's no telling whether any of this actually helps you identify error conditions.
I think your main problem is not statistics. Your problem is that you don't know what kinds of results actually indicate an error. So before you program in any acceptable range, just let the system run for a while and collect some calibration data showing what kinds of values you see when it's running normally, and what kinds of values you see when it's not running normally. Make sure you have some way to tell which are which. Once you have a good amount of data for both conditions, you can analyze it (start with a simple histogram) and see what kinds of values are characteristic of normal operation and what kinds are characteristics of error conditions. Then you can set your acceptable range based on that.
If you want to get fancy, there is a statistical technique called likelihood ratio testing that can help you evaluate just how likely it is that your system is working properly. But I think it's probably overkill. Monitoring systems don't need to be super-precise about this stuff; just show a cautionary notice whenever the readings start to seem abnormal.

How would I use DynamoDB to move this usage from my MySQL DB to NoSQL?

I'm currently experiencing issues with a service I've developed that relies heavily on large payload reads from the DB (500 rows). I'm seeing huge throughput, in the range of 35,000+ requests per minute at up to 500 rows per request going through the DB, and it is not handling the scaling at all.
The data in question is retrieved primarily on a latitude/longitude WHERE statement that checks whether the latitude and longitude of the row fall between a minimum latitude/longitude coordinate and a maximum latitude/longitude coordinate. This is effectively checking whether the row in question is within the bounding box created by the min/max passed into the WHERE.
For reference, this is the WHERE portion of the query we rely on:
s.latitude > {minimumLatitude} AND
s.longitude > {minimumLongitude} AND
s.latitude < {maximumLatitude} AND
s.longitude < {maximumLongitude}
So, with that said: MySQL is handling this, but I'm presently on RDS having to rely heavily on an r3.8XL master and three r3.8XL read replicas just to get the throughput capacity I need to prevent the application from slowing down and driving the CPU to 100% usage.
Obviously, with how heavy the payload is and how frequently it's queried, this data needs to be moved into a more fitting service, something like ElastiCache or DynamoDB.
I've been leaning towards DynamoDB, but my only option there seems to be using Scan, as there is no useful primary key I can associate with my data to reduce the result set, since the query relies on calculating whether the latitude/longitude of a point is within a bounding box. DynamoDB's filters on attributes would work great, as they support the basic conditions needed; however, on a table that would be 250,000+ rows and growing by nearly 200,000 a day or more, that would be unusably expensive.
Another option to reduce the result set was to use a map-binning technique: associate a map region with the data, reduce on that as the primary key in Dynamo, and then filter further on the latitude/longitude attributes. This wouldn't be ideal, though; we'd prefer to get data within specific bounds and not have excess redundant data passed back, since the min/max lat/lng can overlap multiple bins and would then pull data from bins of which the majority may not be needed.
At this point I'm continuously having to deploy read replicas to keep the service up and it's definitely not ideal. Any help would be greatly appreciated.
You seem to be overlooking what seems like it would be the obvious first thing to try... indexing the data using an index structure suited to the nature of the data... in MySQL.
B-trees are of limited help since you still have to examine all possible matches in one dimension after eliminating impossible matches in the other.
Aside: Assuming you already have an index on (lat,long), you will probably be able to gain some short-term performance improvement by adding a second index with the columns reversed (long,lat). Try this on one of your replicas¹ and see if it helps. If you have no indexes at all, then of course that is your first problem.
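In SQL, that aside amounts to something like this (using the table alias s from the query above as a table name; the index name is illustrative):
ALTER TABLE s ADD INDEX idx_long_lat (longitude, latitude);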
Now, the actual solution. This requires MySQL 5.7 because before then, the feature works with MyISAM but not with InnoDB. RDS doesn't like it at all if you try to use MyISAM.
This is effectively checking whether the row in question is within the bounding box created by the min/max passed into the WHERE.
What you need is an R-Tree index. These indexes actually store the points (or lines, polygons, etc.) in an order that understands and preserves their proximity in more than one dimension... proximate points are closer in the index and minimum bounding rectangles ("bounding box") are easily and quickly identified.
The MySQL spatial extensions support this type of index.
There's even an MBRContains() function that compares the points in the index to the points in the query, using the R-Tree to find all the points contained in the MBR you're searching. Unlike the usual optimization rule that you should not use column names as function arguments in the WHERE clause (to avoid triggering a table scan), this function is an exception: the optimizer does not actually evaluate the function against every row but uses the meaning of the expression to evaluate it against the index.
There's a bit of a learning curve needed in order to understand the design of the spatial extensions but once you understand the principles, it falls into place nicely and the performance will exceed your expectations. You'll want a single column of type GEOMETRY and you'll want to store lat and long together in that one indexed column as a POINT.
To safely test this without disruption, make a replica, then detach it from your master, promoting it to become its own independent master, and upgrade it to 5.7 if necessary. Create a new table with the same structure plus a GEOMETRY column and a SPATIAL KEY, then populate it with INSERT ... SELECT.
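A rough sketch of that setup on MySQL 5.7 with InnoDB; the table name, column names, and the literal bounding-box coordinates are all illustrative:
CREATE TABLE s_spatial (
  id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  pt  GEOMETRY NOT NULL,                -- stores POINT(longitude, latitude)
  SPATIAL KEY (pt)
) ENGINE=InnoDB;

INSERT INTO s_spatial (id, pt)
SELECT id, POINT(longitude, latitude) FROM s;

-- bounding-box query; the optimizer resolves MBRContains() via the R-Tree
SELECT id
FROM   s_spatial
WHERE  MBRContains(
         ST_GeomFromText('POLYGON((-74.1 40.6, -73.7 40.6, -73.7 40.9,
                                    -74.1 40.9, -74.1 40.6))'),
         pt);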
Note that the DynamoDB Scan operation is a very "expensive" one. On a table I was testing against just yesterday, a single scan consistently cost 112 read units each time it was run, regardless of the number of records, presumably because a scan always reads 1 MB of data: that's 256 blocks of 4 KB (the definition of a read unit), but not with strong consistency (so, half the cost). 1 MB ÷ 4 KB ÷ 2 = 128, which I assume is close enough to 112 to explain that number.
¹ It's a valid, supported operation to add an index to a MySQL replica but not the master, even in RDS. You need to temporarily make the replica writable by creating a new parameter group identical to the existing one and flipping read_only to 0 in that group. Associate the replica with the new parameter group, wait for the state to change from applying to in-sync, then log in to the replica and add the index. Put the original parameter group back when done.

SSAS calculated measure: Access relational database

I recently asked a question about many-to-many relationships and how they can be used to calculate intersections, which was answered well. Now there is another nice-to-have requirement for our cube: extend that to more data. The general question remains: how many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement this using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware; the cube reaches sizes close to 0.5 TB, and queries take several minutes to complete.
Now I would like to try another option: can I access our relational database in a calculated measure? It seems I can, using UDFs as described in this article. I could write a function in C# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example Microsoft provides only includes a small deterministic string function as the UDF.
Here are my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the CurrentMember of each hierarchy; however, this does not work with Excel-style subselects. It either contains a single member or the All member.
Another option is to get the MDX query using the DMV SELECT * FROM $System.DISCOVER_SESSIONS. There is a column in that view which contains the last MDX query for a given session. However, in order not to overwrite your own last query, you must not use the current connection but open a new one. The session ID can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
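A rough sketch of that second approach, assuming the DMV's SESSION_LAST_COMMAND column and that the server-side assembly is allowed to open its own AdomdClient connection (connection string and error handling are illustrative):
using Microsoft.AnalysisServices.AdomdServer;            // server-side Context
using Client = Microsoft.AnalysisServices.AdomdClient;   // separate client connection

public static class MdxInspector
{
    public static string GetCurrentMdx()
    {
        // the session whose last command we want is our own session
        string sessionId = Context.CurrentConnection.SessionID;

        // a new connection, so this DMV query doesn't become the "last command"
        using (var conn = new Client.AdomdConnection("Data Source=localhost"))
        {
            conn.Open();
            var cmd = conn.CreateCommand();
            cmd.CommandText =
                "SELECT SESSION_LAST_COMMAND FROM $System.DISCOVER_SESSIONS " +
                "WHERE SESSION_ID = '" + sessionId + "'";
            using (var rdr = cmd.ExecuteReader())
                return rdr.Read() ? rdr.GetString(0) : null;
        }
    }
}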
The second approach is OK for our use case. It does not allow you to handle axes, since the UDF has cell scope but you don't know which cell you are in. If any of you knows anything about that last bit, please tell me. Thanks!

Statistical Process Control Charts in SQL Server 2008 R2

I'm hoping you can point me in the right direction.
I'm trying to generate a control chart (http://en.wikipedia.org/wiki/Control_chart) using SQL Server 2008. Creating a basic control chart is easy enough. I'd just calculate the mean and standard deviations and then plot them.
The complex bit (for me at least) is that I would like the chart to reset the mean and the control limits when a step change is identified.
Currently I'm only interested in a really simple method of identifying a step change: 5 points appearing consecutively above or below the mean. There are more complex ways of identifying one (http://en.wikipedia.org/wiki/Western_Electric_rules), but I just want to get this off the ground first.
The process I have sort of worked out is:
Aggregate and order by month and year, apply row numbers.
Calculate overall mean
Identify if each data item is higher, lower or the same as the mean, tag with +1, -1 or 0.
Identify when there are 5 consecutive data items above or below the mean (currently using a cursor).
Recalculate the mean if 5 points are above or 5 points are below the mean.
Repeat until end of table.
Is this sort of process possible in SQL Server? It feels like I may need a recursive UDF, but recursion is a bit beyond me!
A nudge in the right direction would be much appreciated!
Cheers
OK, I ended up just using WHILE loops to iterate through. I won't post the full code, but the steps were (a rough sketch follows the list):
Set up a user-defined table type in order to pass data into a stored procedure parameter.
Wrote an accompanying stored procedure that uses row numbers and WHILE loops to iterate along each data value in the input table, then uses the current row number to do set-based processing on a subset of the input data (to check whether the following 5 points are above/below the mean, and to recalculate the mean and standard deviation when this flag is tripped).
The procedure outputs a table with the original values, row numbers, months, mean values, and the upper and lower control limits.
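A much-reduced sketch of that shape (SQL Server 2008 R2 syntax). It assumes the table type carries pre-computed row numbers and reads the reset rule as "5 consecutive points on one side of the current segment mean"; all names are illustrative, and the real procedure also emits the output table described above:
CREATE TYPE dbo.MetricPoint AS TABLE (
    RowNum INT   NOT NULL PRIMARY KEY,   -- pre-assigned, ordered by year/month
    Value  FLOAT NOT NULL
);
GO
CREATE PROCEDURE dbo.BuildControlChart
    @Points dbo.MetricPoint READONLY
AS
BEGIN
    DECLARE @i INT = 1, @n INT, @segStart INT = 1, @run INT = 0, @side INT = 0,
            @s INT, @mean FLOAT, @stdev FLOAT;
    SELECT @n = MAX(RowNum) FROM @Points;

    WHILE @i <= @n
    BEGIN
        -- mean and stdev of the current segment up to this point
        SELECT @mean = AVG(Value), @stdev = STDEV(Value)
        FROM @Points WHERE RowNum BETWEEN @segStart AND @i;

        -- track runs of consecutive points on one side of the mean
        SELECT @s = SIGN(Value - @mean) FROM @Points WHERE RowNum = @i;
        IF @s <> 0 AND @s = @side
            SET @run += 1;
        ELSE
            BEGIN SET @side = @s; SET @run = 1; END

        IF @run >= 5
        BEGIN
            -- step change detected: restart the segment at the run's start
            SET @segStart = @i - 4;
            SET @run = 0; SET @side = 0;
        END

        SET @i += 1;
    END
    -- ... emit one row per point with its segment mean and mean +/- 3 * stdev
END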
I've also got a version up and running that works based on the full Nelson rules and will also state which test the data has failed.
Currently it's only been used by me as I develop it further, so I've set up an Excel sheet with some VBA to dynamically construct a SQL string, which it passes to a pivot table as the command text. That way you can repeatedly ping the stored procedure with different data sets and also change a few of the other parameters controlling how the procedure runs (such as adjusting the control limits and the like).
Ultimately I want to be able to pass the resulting data to Business Objects reports and dashboards that we're working on.