I recently had a Gaussian Process (GP) machine learning program built for my production department. The GP system has populated a massive MySQL database that lists, for each organism we grow (in a lab environment), every combination of growth-step durations and the predicted yield for that combination.
I would like to build an optimization program, preferably in Python, to help me schedule which organisms to grow, when to grow them, and for how long at each step.
Here is some background:
4 steps to the process
Plate step (organism is plated; growth is started)
Seed step (organism transferred from plate to seed phase)
Incubation step (organism is transferred from seed to incubation phase)
Harvest step (organism is harvested; yield collected)
There are multiple organisms (>50) that are grown per year. Each has their own numerical ID
There is finite space to grow organisms at the incubation step
There is infinite space to grow organisms at the plate and seed step.
Multiple 'lots' of the same organism are typically grown at a time. A lot is predefined by the number of containers being used at the incubation step.
Different organisms have very different maximum yields. Some yield 2000 grams max and others 600 g max.
The MySQL server has every combination of number of days at each step for each organism, along with the predicted yield for that combination. This is the data that needs to be used for the optimization.
The massive challenge we run into is scheduling which organisms to grow when. With the GP process, we know the theoretical maximums (and they work!), but it's hard to put them into practice because of our constraints (see below).
Here would be my constraints:
Only one organism can be harvested per day.
No steps can be started on weekends. Organisms can grow over the weekend, but we can't start a new step on a weekend
If multiple 'lots' are being grown of the same mold, the plate and seed start dates should be the same for every 'lot'.
- What this typically looks like in practice is:
- plate and seed steps start on the same day
- next, incubation steps start day-after-day for as many lots as being made
- finally, harvests occur in the same pattern (day-after-day)
- Therefore, what you typically get is identical # of days in the plate phase, identical # of incubation days, and differing # of seed days.
Objective Function: I don't know how to articulate this perfectly, but very broadly we need to maximize the yields for each organism. However, there needs to be a time balance too as the space to grow the organisms is finite and the time we have to grow them is finite as well.
I have created a metric known as lot*weeks that tries to capture that. It is a measure of the number of weeks (at the incubation phase) needed to grow the expected annual demand of a specific organism, based upon the predicted yield from the SQL server. Therefore, a potential objective function would be to minimize the lot_weeks for each organism.
This is obviously more of a broad ask for help. I don't have a specific request. If this is not appropriate for this forum, I can take my question elsewhere. I feel comfortable with the scope of the project and can figure out how to write the code over time but I need assistance with what tools to use and what's possible.
I've seen that Pyomo may be helpful, but I wanted to check here first. Thank you.
I looked into Pyomo but stopped because of its complexity; I didn't want to learn all of it if it wasn't appropriate for the problem.
Edit: This was too broad, I apologize. I've created another post with more concrete examples. Thank you for all that helped.
This is really too broad a question for this forum, and it will likely get closed. That said...
You have a framework here that you could develop an optimization in. The database part is irrelevant. For an effective optimization model, what you really need is a known relationship between the variables and the outcomes, for instance, days in incubation ==> size of harvest or such. Which it sounds like you have.
This isn't an entry-level model you are describing. Do you have any resources to help? Is there a local university that might need grad student projects in this field, or something similar?
As you develop this, you should start small and focus the model on the key issues here... if they aren't known, then perhaps that is the place to start. For instance, perhaps the key issue is management of planting times vis-a-vis the weekends (that is one model). Or perhaps the key issue is the management of the limited space for growth and the inability to achieve steps on the weekend just kinda works itself out. (That is another model for space management.) Try one that seems to address key management questions. Start very small and see if you can get something working as a proof of concept. If this is your first foray into linear programming, you will need help. You might also start with an introductory textbook on LP.
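To make "start very small" concrete, here is a minimal sketch of a proof of concept for one organism and one lot. It is a brute-force enumeration, not an LP, and everything in it is an assumption for illustration: the `YIELDS` table is made up (real rows would come from your database), the function names are hypothetical, and I have assumed that harvest counts as a step that cannot start on a weekend. It ignores incubation space and the one-harvest-per-day rule entirely.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical yield table for ONE organism:
# (plate_days, seed_days, incubation_days) -> predicted grams.
YIELDS = {
    (3, 2, 7): 540.0,
    (3, 3, 7): 580.0,
    (4, 2, 7): 560.0,
    (4, 3, 6): 600.0,
    (4, 3, 7): 610.0,
}


def is_weekday(d):
    """Monday..Friday are 0..4 in date.weekday()."""
    return d.weekday() < 5


def step_dates(plate_start, durations):
    """Return the start dates of plate, seed, incubation and harvest."""
    plate_days, seed_days, incubation_days = durations
    seed_start = plate_start + timedelta(days=plate_days)
    incubation_start = seed_start + timedelta(days=seed_days)
    harvest = incubation_start + timedelta(days=incubation_days)
    return plate_start, seed_start, incubation_start, harvest


def best_schedule(candidate_starts, yields):
    """Try every (start date, duration combo) and keep the highest yield
    whose four step dates all fall on weekdays."""
    best = None
    for start, (durations, grams) in product(candidate_starts, yields.items()):
        dates = step_dates(start, durations)
        if not all(is_weekday(d) for d in dates):
            continue  # some step would start on a weekend
        if best is None or grams > best[0]:
            best = (grams, durations, dates)
    return best


# Candidate plate starts: Monday 2024-01-01 through Friday 2024-01-05.
starts = [date(2024, 1, 1) + timedelta(days=i) for i in range(5)]
result = best_schedule(starts, YIELDS)
print(result)
```

Once something like this works, the natural next step is to add one more constraint at a time (space at incubation, one harvest per day) and see at what point enumeration stops being practical and a proper solver becomes necessary.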
My work is related to mathematical modelling and running computer simulations in fluid mechanics. I have a mathematical model that has, say, 5 parameters. Each of them has a range defined by us, and we would like to study how the model performs within these ranges.
We make a computer code, and start running simulations.
Very soon, I have an extremely large dataset, and it becomes increasingly difficult to keep track of which simulation I ran when...
...and if they are running on different computers, it is even more difficult to manage.
One simulation takes about 3-4 days to finish, so by the time one finishes, we have to track our lab notes to see what made us run that simulation in the first place.
The problem is compounded when the number of parameters is very large, obviously.
I want something that tracks all of this. An app, website, tool, code, software, anything that can tabulate all of these parameters. Maybe record dates, keep track of re-runs, and just show the 'status-board' of all my simulations.
I see that Google Cloud may terminate preemptible instances at any time, but have any unofficial, independent studies been reported, showing "preempt rates" (number of VMs preempted per hour), perhaps sampled in several different regions?
Given how little information I'm finding (as with similar questions), even anecdotes such as: "Looking back the past 6 months, I generally see 3% - 5% instances preempt per hour in uswest1" would be useful (I presume this can be monitored similarly to instance count metrics in AWS).
Clients occasionally want to shove their existing, non-fault-tolerant code in the cloud for "cheap" (despite best practices), and without an expected rate of failure, they're often blindsided by the cheapness of preemptible instances. I'd therefore like to share some typical experiences from the GCP community, even if people's experiences vary, to help convey safe expectations.
Regarding “unofficial, independent studies”, “even anecdotes such as:” and “Clients occasionally want to shove their existing, non-fault-tolerant code in the cloud for "cheap"”: it ought to be said that no architect or sysadmin in their right mind would place production workloads with a defined SLA into an execution environment without an SLA. Hence the topic is rather speculative.
For those who are keen, Google provides an expected preemption rate:
For reference, we've observed from historical data that the average
preemption rate varies between 5% and 15% per day per project, on a
seven-day average, occasionally spiking higher depending on time and
zone. Keep in mind that this is an observation only: Preemptible
instances have no guarantees or SLAs for preemption rates or
preemption distributions.
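To turn that quoted daily range into something clients can reason about, here is a rough back-of-the-envelope sketch. It rests on assumptions of mine, not Google's: that the per-day rate can be read as a per-instance probability, and that preemptions are independent with a constant hazard over the day, which the quote explicitly does not guarantee. The function name is made up for illustration.

```python
def survival_probability(daily_rate, hours):
    """P(no preemption within `hours`), ASSUMING a constant hazard such that
    the probability of surviving a full 24 h equals 1 - daily_rate."""
    return (1.0 - daily_rate) ** (hours / 24.0)


# Low and high ends of the range quoted above.
for rate in (0.05, 0.15):
    for hours in (1, 6, 24):
        p = survival_probability(rate, hours)
        print(f"daily rate {rate:.0%}, job length {hours:>2} h: survival ~ {p:.3f}")
```

Treat the numbers as an order-of-magnitude guide only; the quote itself warns that rates spike depending on time and zone.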
Besides that there is an interesting edutainment approach to the task of "how to make inapplicable applicable".
I am having trouble understanding why a target network is necessary in DQN. I’m reading the paper “Human-level control through deep reinforcement learning”.
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an “optimal” probability distribution over state-action pairs that will maximize its long-term discounted reward over a sequence of timesteps.
The Q-learning is updated using the bellman equation, and a single step of the q-learning update is given by
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
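For concreteness, here is a minimal tabular sketch of that single-step update in its usual form, with a max over the actions available in the next state. The function name, the dictionary-based table and the toy state and action labels are all made up for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one Q-learning step to the table `q` and return the new value."""
    # Value of the best action in the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
new_value = q_learning_update(q, "s0", "right", 1.0, "s1", ["left", "right"])
print(new_value)
```

Note that only the single entry for ("s0", "right") changes; every other entry in the table is left exactly as it was.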
I can understand that the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate datasets provided to learn the probability distribution.
This is where I fail.
Let me break the paragraph from the paper down here for discussion
The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution: I understand this part. Periodic changes to the Q-network may lead to instability and changes in the distribution, for example if we end up always taking a left turn.
and the correlations between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$: the target is the reward plus gamma times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary, a target network is required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
But I do not understand how it solves the problem.
So, in summary, a target network is required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better" as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to make actions.
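Here is a minimal sketch of that idea, assuming a tiny linear Q-function in plain Python rather than a deep network. The function names, the `sync_every` parameter and the toy transitions are made up for illustration; this is not the paper's implementation. The only point is that the bootstrap value comes from a frozen copy of the weights that is refreshed every `sync_every` steps.

```python
import copy


def q_values(weights, state):
    """Linear Q-function: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def train(transitions, n_actions, n_features, sync_every=50,
          alpha=0.01, gamma=0.9):
    online = [[0.0] * n_features for _ in range(n_actions)]
    target = copy.deepcopy(online)  # frozen copy used only for targets
    syncs = 0
    for step, (s, a, r, s_next, done) in enumerate(transitions, start=1):
        # Bootstrap from the FROZEN weights, so this update cannot
        # immediately shift the value it is being regressed toward.
        bootstrap = 0.0 if done else max(q_values(target, s_next))
        td_error = r + gamma * bootstrap - q_values(online, s)[a]
        # Semi-gradient step on the online weights of the taken action.
        online[a] = [w + alpha * td_error * x for w, x in zip(online[a], s)]
        if step % sync_every == 0:
            target = copy.deepcopy(online)  # periodic hard update
            syncs += 1
    return online, target, syncs


# Toy data: the same transition repeated 120 times.
demo = [([1.0, 0.0], 0, 1.0, [0.0, 1.0], False)] * 120
online_w, target_w, n_syncs = train(demo, n_actions=2, n_features=2)
print(n_syncs, online_w, target_w)
```

Setting `sync_every=1` recovers the "retrain after every move" behaviour, because the target weights then track the online weights at every step.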
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.
I have a simulation that ticks the time every 5 seconds. I want to use OpenAI Gym and its Baselines algorithms to perform learning in this environment. For that I'd like to adapt the simulation by writing some adapter code that conforms to the OpenAI Env API. But there is a problem: in the OpenAI setting, the flow of control is defined by the agent, whereas in my world the environment steps independently of the agent. If the agent doesn't decide or is not fast enough, the world just keeps going without it. How would one achieve this reversal of who triggers the next step?
In short: the OpenAI Env gets stepped by the agent. My environment gives my agent about 2-3 seconds to decide and then just tells it what's new, again offering it the choice to act or not.
As an example: My environment is rather similar to a real world stock trading market. The agent gets 24 chances to buy / sell products for a certain limit price to accumulate a certain volume for that target time and at time step 24, the reward is given to the agent and the slot is completed. The reward is based on the average price paid per item in comparison to the average price by all market participants.
At any given moment, 24 slots are traded in parallel (a 24x parallel trading of futures). I believe for this I need to create 24 environments which leads me to believe A3C would be a good choice.
After re-reading the question, it seems like OpenAI gym is not a great fit for what you’re trying to do. It is designed for running rapid experiments, which cannot be done efficiently if you are waiting on live events to occur. If you have no historical data and can only train on incoming live data, there is no point to using OpenAI gym. You can write your own code to represent the environment from that data, and that would be easier than trying to morph it into another framework, although OpenAI gym’s API does provide a good model for how your environment should work.
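As a rough sketch of what "your own code, modelled on the gym API" could look like if you do have recorded data: the class below replays a list of prices for one slot and exposes `reset()` and `step()`. Everything here is hypothetical (the class name, the observation tuple, the reward formula), and it does not import gym at all. The one design choice worth noting is that "the agent did not decide in time" is simply modelled as the action 0, so the agent can still drive the loop during training.

```python
class ReplayedSlotEnv:
    """Gym-style environment that replays recorded prices for one slot."""

    def __init__(self, prices, target_volume):
        self.prices = list(prices)          # one recorded price per time step
        self.target_volume = target_volume  # units to accumulate in the slot

    def reset(self):
        self.t = 0
        self.bought = 0
        self.spent = 0.0
        return self._observation()

    def _observation(self):
        remaining = self.target_volume - self.bought
        return (self.t, self.prices[self.t], remaining)

    def step(self, action):
        # `action` = units to buy at the current price; 0 means "no decision".
        qty = max(0, min(action, self.target_volume - self.bought))
        self.spent += qty * self.prices[self.t]
        self.bought += qty
        self.t += 1
        done = self.t == len(self.prices)
        reward = 0.0
        if done:
            # Reward only at the end of the slot: market average minus the
            # agent's average price paid (positive means cheaper than market).
            market_avg = sum(self.prices) / len(self.prices)
            paid_avg = self.spent / self.bought if self.bought else market_avg
            reward = market_avg - paid_avg
        obs = None if done else self._observation()
        return obs, reward, done, {}


env = ReplayedSlotEnv(prices=[10.0, 8.0, 12.0], target_volume=2)
obs = env.reset()
for action in (0, 2, 0):  # skip, buy 2 units at price 8, skip
    obs, reward, done, info = env.step(action)
print(reward, done)
```

For 24 parallel slots you would create 24 such objects, each with its own price history.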
I have found the keras-rl/examples/cem_cartpole.py example and would like to understand it, but I can't find any documentation.
What does the line
memory = EpisodeParameterMemory(limit=1000, window_length=1)
do? What is the limit and what is the window_length? Which effect does increasing either / both parameters have?
EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
Regarding your questions: The limit parameter simply specifies how many entries the memory can hold. After exceeding this limit, older entries will be replaced by newer ones.
The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL and mostly there as a simple baseline). Typically, however, the window_length parameter controls how many observations are concatenated to form a "state". This may be necessary if the environment is not fully observable (think of it as transforming a POMDP into an MDP, or at least approximately). DQN on Atari uses this since a single frame is clearly not enough to infer the velocity of a ball with a FF network, for example.
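To illustrate what a window of observations means in general, here is a small standalone sketch. It is not Keras-RL's implementation: the class name is made up, and padding the window by repeating the first observation at reset is my own assumption, chosen for simplicity.

```python
from collections import deque


class WindowedObservations:
    """Keep the last `window_length` observations and expose them as a state."""

    def __init__(self, window_length):
        self.window = deque(maxlen=window_length)

    def reset(self, first_observation):
        # ASSUMPTION: pad the window by repeating the first observation.
        self.window.clear()
        for _ in range(self.window.maxlen):
            self.window.append(first_observation)
        return list(self.window)

    def push(self, observation):
        # Appending to a full deque silently drops the oldest observation.
        self.window.append(observation)
        return list(self.window)


frames = WindowedObservations(window_length=3)
state = frames.reset(0)
for obs in (1, 2, 3):
    state = frames.push(obs)
print(state)
```

With a window of 1 the state is just the latest observation, which is why the parameter makes no difference for fully observable problems like CartPole.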
Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception). It should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).
A little late to the party, but I feel like the answer doesn't really answer the question.
I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):
We’ll be using experience replay
memory for training our DQN. It stores the transitions that the agent
observes, allowing us to reuse this data later. By sampling from it
randomly, the transitions that build up a batch are decorrelated. It
has been shown that this greatly stabilizes and improves the DQN
training procedure.
Basically you observe and save all of your state transitions so that you can train your network on them later on (instead of having to make observations from the environment all the time).
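A minimal sketch of such a memory, assuming nothing beyond the standard library: the class and field names are my own, and the `limit` argument simply mirrors the idea that older entries are replaced by newer ones once the memory is full.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, limit):
        # A deque with maxlen drops the oldest entry once the limit is hit.
        self.memory = deque(maxlen=limit)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Sampling without replacement breaks up the temporal ordering,
        # which is what decorrelates the transitions within a batch.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(limit=3)
for i in range(5):
    buffer.push(i, 0, 1.0, i + 1, False)  # state i -> state i + 1
batch = buffer.sample(2)
print(len(buffer), batch)
```

After five pushes with a limit of three, only the transitions starting from states 2, 3 and 4 remain available for sampling.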