How to manage multiple simulations that continually create large datasets

My work involves mathematical modelling and running computer simulations in fluid mechanics. I have a mathematical model that has, say, 5 parameters. Each of them has some range defined by us, and we would like to study how this model performs within these ranges.
We make a computer code, and start running simulations.
Very soon, I have an extremely large dataset, and it becomes increasingly difficult to keep track of which simulation I ran when...
...and if they are running on different computers, it is even more difficult to manage.
One simulation takes about 3-4 days to finish, so by the time one finishes, we have to go back through our lab notes to see what made us run that simulation in the first place.
The problem is compounded when the number of parameters is very large, obviously.
I want something that tracks all of this: an app, website, tool, code, software, anything that can tabulate all of these parameters, maybe record dates, keep track of re-runs, and just show a 'status board' of all my simulations.
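Even something quite small would already help; as a rough sketch (the table layout and parameter names below are made up for illustration, not our actual model), each run could be registered in a little SQLite file with its parameters, host, start date, and status:

```python
# Minimal sketch of a simulation "status board": one row per run,
# with parameters stored as JSON so the schema survives parameter changes.
import sqlite3, socket, json
from datetime import datetime, timezone

def open_db(path="runs.sqlite"):
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS runs (
                       id INTEGER PRIMARY KEY,
                       started TEXT, host TEXT,
                       params TEXT, status TEXT, note TEXT)""")
    return con

def register_run(con, params, note=""):
    cur = con.execute(
        "INSERT INTO runs (started, host, params, status, note) VALUES (?,?,?,?,?)",
        (datetime.now(timezone.utc).isoformat(), socket.gethostname(),
         json.dumps(params, sort_keys=True), "running", note))
    con.commit()
    return cur.lastrowid

def finish_run(con, run_id, status="done"):
    con.execute("UPDATE runs SET status=? WHERE id=?", (status, run_id))
    con.commit()

# usage: call register_run() right before launching the solver
con = open_db()
run_id = register_run(con, {"reynolds": 500, "mesh": "coarse"}, note="baseline")
# ... run the simulation ...
finish_run(con, run_id)
```

Querying or exporting that table then gives the status board, and the same file can be shared between machines.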

Related

Scheduling Optimization for multi-step growth modeling

I recently had a Gaussian Process machine learning program built for my production department. This GP system has built a massive MySQL database that provides growth durations for each of the organisms we grow (lab environment) and the predicted yield for each of those combinations of growth steps.
I would like to build an optimization program in python (preferably) to assist me in scheduling what organisms to grow, when to grow them, and for how long at each step.
Here is some background:
4 steps to the process
Plate step (organism is plated; growth is started)
Seed step (organism transferred from plate to seed phase)
Incubation step (organism is transferred from seed to incubation phase)
Harvest step (organism is harvested; yield collected)
There are multiple organisms (>50) that are grown per year. Each has their own numerical ID
There is finite space to grow organisms at the incubation step
There is infinite space to grow organisms at the plate and seed step.
Multiple 'lots' of the same organism are typically grown at a time. A lot is predefined by the number of containers being used at the incubation step.
Different organisms have very different maximum yields. Some yield 2000 grams max and others 600 g max.
The MySQL server has every combination of # of days at each step for each organism and the predicted yield for that combination. This data is what needs to be used for optimization.
The massive challenge we run into is scheduling which organisms to grow when. With the GP process, we know the theoretical maximums (and they work!) but it's hard putting it into practice due to constraints (see below).
Here would be my constraints:
Only one organism can be harvested per day.
No steps can be started on weekends. Organisms can grow over the weekend, but we can't start a new step on a weekend
If multiple 'lots' are being grown of the same mold, the plate and seed start dates should be the same for every 'lot'.
- What this typically looks like in practice is:
- plate and seed steps start on the same day
- next, incubation steps start day-after-day for as many lots as are being made
- finally, harvests occur in the same pattern (day-after-day)
- Therefore, what you typically get is identical # of days in the plate phase, identical # of incubation days, and differing # of seed days.
Objective Function: I don't know how to articulate this perfectly, but very broadly we need to maximize the yields for each organism. However, there needs to be a time balance too as the space to grow the organisms is finite and the time we have to grow them is finite as well.
I have created a metric known as lot*weeks that tries to capture that. It is a measure of the number of weeks (at the incubation phase) needed to grow the expected annual demand of a specific organism, based on the predicted yield from the SQL server. Therefore, a potential objective function would be to minimize the lot*weeks for each organism.
This is obviously more of a broad ask for help. I don't have a specific request. If this is not appropriate for this forum, I can take my question elsewhere. I feel comfortable with the scope of the project and can figure out how to write the code over time but I need assistance with what tools to use and what's possible.
I've seen that pyomo may be helpful but I also wanted to check here first. Thank you
I've tried looking into using Pyomo but stopped due to the complexity and didn't want to learn all of it if it wasn't appropriate for the problem.
Edit: This was too broad, I apologize. I've created another post with more concrete examples. Thank you for all that helped.
This is really too broad of a question for this forum, and it may likely get closed. That said...
You have a framework here that you could develop an optimization in. The database part is irrelevant. For an effective optimization model, what you really need is a known relationship between the variables and the outcomes, for instance, days in incubation ==> size of harvest or such, which it sounds like you have.
This isn't an entry-level model you are describing. Do you have any resources to help? A local university that might need grad student projects in the field, or such?
As you develop this, you should start small and focus the model on the key issues here... if they aren't known, then perhaps that is the place to start. For instance, perhaps the key issue is management of planting times vis-a-vis the weekends (that is one model). Or perhaps the key issue is the management of the limited space for growth, and the inability to start steps on the weekend just kinda works itself out (that is another model, for space management). Try one that seems to address key management questions. Start very small and see if you can get something working as a proof of concept. If this is your first foray into linear programming, you will need help. You might also start with an introductory textbook on LP.
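To make "start very small" concrete, here is a rough Pyomo sketch of one such sub-problem: pick one incubation duration per organism to maximize predicted yield under a shared incubation-space budget. The organism IDs, durations, yields, capacity, and the CBC solver are placeholders for illustration, not your actual data; the real (days -> predicted yield) values would come from your MySQL table.

```python
# Tiny proof-of-concept MILP in Pyomo: one binary "pick" per
# (organism, incubation-days) combination.
from pyomo.environ import (ConcreteModel, Var, Constraint, Objective,
                           Binary, maximize, SolverFactory)

predicted_yield = {                 # (organism, incubation days) -> grams (made up)
    ("org_07", 10): 1800, ("org_07", 14): 2000,
    ("org_12", 10): 450,  ("org_12", 14): 600,
}
capacity_days = 26                  # hypothetical incubation-space budget

m = ConcreteModel()
m.pick = Var(list(predicted_yield), domain=Binary)

# exactly one duration chosen per organism
organisms = sorted({org for org, _ in predicted_yield})
m.one_duration = Constraint(
    organisms,
    rule=lambda m, org: sum(m.pick[o, d] for o, d in predicted_yield if o == org) == 1)

# shared incubation capacity
m.capacity = Constraint(
    expr=sum(d * m.pick[o, d] for o, d in predicted_yield) <= capacity_days)

m.total_yield = Objective(
    expr=sum(y * m.pick[k] for k, y in predicted_yield.items()), sense=maximize)

SolverFactory("cbc").solve(m)       # any MILP solver installed locally works
print([k for k in predicted_yield if m.pick[k].value > 0.5])
```

Once something like this solves, the weekend rules, harvest-per-day limit, and lot alignment can be layered in as additional constraints one at a time.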

OpenAI Gym stepping in an externally controlled environment

I have a simulation that ticks the time every 5 seconds. I want to use OpenAI Gym and its Baselines algorithms to perform learning in this environment. For that I'd like to adapt the simulation by writing some adapter code that corresponds to the OpenAI Env API. But there is a problem: in the OpenAI setting, the flow of control is defined by the agent. But in my world, the environment steps independently of the agent. If the agent doesn't decide, or is not fast enough, the world just keeps going without it. How would one achieve this reversal of who triggers the next step?
In short: an OpenAI Env gets stepped by the agent. My environment gives my agent about 2-3 seconds to decide and then just tells it what's new, again offering the choice to act or not.
As an example: my environment is rather similar to a real-world stock trading market. The agent gets 24 chances to buy/sell products at a certain limit price to accumulate a certain volume by a target time, and at time step 24 the reward is given to the agent and the slot is completed. The reward is based on the average price paid per item in comparison to the average price paid by all market participants.
At any given moment, 24 slots are traded in parallel (a 24x parallel trading of futures). I believe for this I need to create 24 environments, which leads me to believe A3C would be a good choice.
After re-reading the question, it seems like OpenAI gym is not a great fit for what you’re trying to do. It is designed for running rapid experiments, which cannot be done efficiently if you are waiting on live events to occur. If you have no historical data and can only train on incoming live data, there is no point to using OpenAI gym. You can write your own code to represent the environment from that data, and that would be easier than trying to morph it into another framework, although OpenAI gym’s API does provide a good model for how your environment should work.
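That said, if you do want to keep the classic Gym step/reset shape, one rough way to bridge an externally ticking world is to have the real simulation push its ticks into a queue and let step() block until the next tick arrives, so a late action simply applies to a later tick. This is only a sketch under assumptions; the spaces, queue plumbing, and tick contents below are invented, not your simulation's API:

```python
# Sketch: adapter from an externally ticking simulation to the classic
# (pre-0.26) Gym API, where step() returns (obs, reward, done, info).
import queue
import gym
from gym import spaces
import numpy as np

class ExternallyTickedEnv(gym.Env):
    def __init__(self, tick_queue, action_sink):
        # tick_queue: filled by the real simulation every ~5 s with
        # (observation, reward, done); action_sink: where actions are sent.
        self.tick_queue = tick_queue
        self.action_sink = action_sink
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,))
        self.action_space = spaces.Discrete(3)   # e.g. buy / hold / sell

    def reset(self):
        obs, _, _ = self.tick_queue.get()        # block until the first tick
        return obs

    def step(self, action):
        self.action_sink.put(action)             # may arrive "late"; the world moves on regardless
        obs, reward, done = self.tick_queue.get()  # block until the next tick
        return obs, reward, done, {}
```

The agent still "drives" the loop from Gym's point of view, but in wall-clock terms it is the simulation's tick rate that sets the pace, which is why training on live ticks alone will be very slow.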

Comparing speeds of different tasks in different languages

If I wanted to test out the speeds it takes for certain tasks to be done, would it matter what language I did the test in? We can consider this to be any job a programmer might want to perform. Simple jobs such as sorting, or more complicated jobs which involve the signing and verification of files.
We all know that certain languages run faster than others, which means the results will depend on the language and on how its compiler/runtime is optimised. But these will obviously all be different.
So is it best to rely upon a language with less abstraction, such as C, or is it OK to test out jobs and tasks in more high-level languages and trust that they are implemented well enough not to worry about any possible inefficiencies? I hope my question is clear.
It doesn't matter which real-world language you use for your tests... if you design the tests correctly. There is always overhead from the language, but there is also always overhead from the operating system, thread scheduler, I/O, RAM, or even the speed of electricity, which depends on the current temperature, etc.
But to compare anything, you don't want to measure how many nanoseconds it took to do one assignment statement. Instead, you want to measure how many hours it took to do billions of assignment statements. Then all of the mentioned overheads are negligible.
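For example, a rough sketch of that idea in Python using the standard-library timeit module (the workload itself is just a placeholder):

```python
# Time a large batch of operations with a monotonic timer so per-call and
# interpreter overheads average out; take the best of several repeats to
# damp one-off OS/scheduler noise.
import timeit

def workload():
    data = list(range(10_000))
    data.sort()

best = min(timeit.repeat(workload, number=1_000, repeat=5))
print(f"{best:.3f} s for 1,000 runs (best of 5)")
```

The same principle applies in any language: measure enough repetitions that the thing you care about dominates the noise.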

Stress test cases for web application

What other stress test cases are there besides finding the maximum number of users that can log into the web application before performance degrades and it eventually crashes?
This question is hard to answer thoroughly since it's too broad.
Anyway, many stress tests depend on the type and execution flow of your workload. There's an entire subject (taught as a graduate course) dedicated to queueing theory and resource optimization. Most of it can be summarized as follows:
If you have a resource (be it a GPU, CPU, memory bank, mechanical or solid-state disk, etc.), it can serve a certain number of users/requests per second and takes an X amount of time to complete one unit of work. Make sure you don't exceed its limits.
Some systems can also be studied with a probabilistic approach (Little's Law is one of the most fundamental rules in these cases)
There are a lot of reasons for load/performance testing, many of which may not be important to your project goals. For example:
- What is the performance of a system at a given load? (load test)
- How many users can the system handle and still meet a specific set of performance goals? (load test)
- How does the performance of a system change over time under a certain load? (soak test)
- When will the system crash under increasing load? (stress test)
- How does the system respond to hardware or environment failures? (stress test)
I've got a post on some common motivations for performance testing that may be helpful.
You should also check out your web analytics data and see what people are actually doing.
It's not enough to simply simulate X number of users logging in. Find the scenarios that represent the most common user activities (anywhere from 2 to 20 scenarios).
Also, make sure you're not just hitting your cache on reads. Add some randomness / diversity in the requests.
I've seen stress tests where all the users were requesting the same data which won't give you real world results.
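As a rough illustration of both points (concurrent users plus request diversity), here is a small Python sketch; the endpoint, query terms, and user counts are placeholders, and a real tool (JMeter, Locust, k6, etc.) would be the better choice for anything serious:

```python
# Simulate several concurrent "users", each issuing varied requests so
# responses don't all come straight from a cache.
import random
import concurrent.futures
import urllib.request

BASE_URL = "http://localhost:8000/search?q="   # hypothetical endpoint
TERMS = ["alpha", "beta", "gamma", "delta"]

def one_user(n_requests=20):
    ok = 0
    for _ in range(n_requests):
        term = random.choice(TERMS)            # diversity defeats naive caching
        try:
            with urllib.request.urlopen(BASE_URL + term, timeout=5) as r:
                ok += r.status == 200
        except OSError:
            pass                               # count timeouts/errors as failures
    return ok

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(lambda _: one_user(), range(50)))
print(f"{sum(results)} successful requests out of {50 * 20}")
```

Whatever harness you use, the scenario mix and the randomness matter more than the raw number of simulated users.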

On what is game time based? Real time or frames?

I'm designing a game for the first time, but I wonder on what game time is based. Is it based on the clock or does it rely on frames? (Note: I'm not sure if 'game time' is the right word here, correct me if it isn't)
To be more clear, imagine these scenarios:
Computer 1 is fast, up to 60fps
Computer 2 is slow, not more than 30fps
On both computers the same game is played, in which a character walks at the same speed.
If game time is based on frames, the character would move twice as fast on computer 1. On the other hand, if game time were based on actual time, computer 1 would show twice as many frames, but the character would move just as fast as on computer 2.
My question is, what is the best way to deal with game time and what are advantages and disadvantages?
In general, commercial games have two things running - a "simulation" loop and a "rendering" loop. These need to be decoupled as much as possible.
You want to fix your simulation time-step so the simulation runs at some fixed rate (greater than or equal to your maximum frame rate). Complex physics doesn't like variable time steps. I'm surprised no one has mentioned this, but fixed time steps versus variable time steps are a big deal if you have any kind of interesting physics. Here's a good link:
http://gafferongames.com/game-physics/fix-your-timestep/
Then your rendering loop can run as fast as possible, and render the output of the current simulation step.
So, referring to your example:
You would run your simulation at 60fps, that is a 16.67 ms time step. Computer A would render at 60fps, i.e. it would render every simulation frame. Computer B would render every second simulation frame. Thus the character would move the same distance in the same time, but not as smoothly.
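For illustration, here is a rough sketch of that decoupled, fixed-timestep loop in the spirit of the linked article; the update/render functions and the two-second cutoff are placeholders, not anyone's engine code:

```python
# Fixed-timestep simulation with a free-running render step: real elapsed
# time is accumulated, and the simulation advances in exact DT increments.
import time

DT = 1.0 / 60.0              # fixed simulation step of 16.67 ms

def update(dt):
    """Advance physics/game state by exactly dt seconds (placeholder)."""
    pass

def render(alpha):
    """Draw the state; alpha in [0, 1) can interpolate between steps (placeholder)."""
    pass

previous = time.perf_counter()
accumulator = 0.0
simulated = 0.0

while simulated < 2.0:       # run the demo for two simulated seconds
    now = time.perf_counter()
    accumulator += now - previous
    previous = now

    # step the simulation zero or more times depending on elapsed real time;
    # a slow machine simply runs more simulation steps per rendered frame
    while accumulator >= DT:
        update(DT)
        simulated += DT
        accumulator -= DT

    # render as often as the machine allows, independent of the simulation rate
    render(accumulator / DT)
```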
Really old games used a frame count. It quickly became obvious that this was a poor idea, since machines keep getting newer and thus the games run faster.
Thus, base it on the system clock. Generally this is done by knowing how long the last frame took, and using that number to know how much 'real time' to advance through this frame.
It should rely on the system clock, not on the number of frames. You've made your own case for this.
The FPS is simply how many frames the computer can render per second.
The game time is YOUR game time. You define it. It is handled in what is often called the "game loop". The frame rendering is a part of the game loop. Also look into finite state machines (FSMs) as they relate to game programming.
I highly suggest you read a couple of books on game programming. The question you are asking is what those books explain in the first chapters.
For the users of each to have the same experience you'll want to use actual time, otherwise different users will have advantages/disadvantages depending on their hardware.
Games should all use the clock, not the frames, to provide the same gameplay whatever the platform. It is obvious when you look at MMO or online shooter games: no player should be faster than others.
It depends on what you're processing, what part of the game is in question.
For example, animations, physics and AI need to be framerate-independent to function properly. If you have an FPS-dependent animation or physics thread, then the physics system will slow down, or the character will move slower on slower systems and incredibly fast on very fast systems. Not good.
For some other elements, like scripting and rendering, you obviously need it to be per-frame and so, framerate-dependent. You would want to process each script and render each object once per frame, regardless of the time difference between frames.
The game must rely on the system clock, since you don't want your game to be played through in no time on decent computers!
Games typically use the highest-resolution timer available, like QueryPerformanceCounter on Windows, to time things. Old games used to use frames, but after you could literally run faster in Quake by changing your FPS, we learned not to do that anymore.