Do I need to retrain a reinforcement learning model from scratch each time I want to use it in practice? - reinforcement-learning

This seems like it should be obvious, but I can't find resources on it anywhere. I am building a reinforcement learning model with OpenAI's gym-anytrading environment and stable-baselines3. There are a ton of online tutorials and documentation for training and evaluating the model, but almost nothing on actually using it in practice.
E.g., I want the model to constantly look at today's data and predict what action I should take to lock in tomorrow's profits.
Reinforcement learning algorithms all seem to have a model.predict() method, but you have to pass the environment, which is just more historical data. What if I want it to use today's data to predict tomorrow's values? Do I just include data up to today in the test set and retrain the model from scratch each time I want it to make a prediction?
E.g., the original training data ranges from 2014-01-01 to today (i.e., 2023-02-12) and is run through the whole training and testing process. Then tomorrow do I start from scratch and train/test on 2014-01-01 to today (now 2023-02-13), then the next day on 2014-01-01 to 2023-02-14, and so on? How do I actually make real-time predictions with a reinforcement learning model, as opposed to continually evaluating how it would have performed on past data?
Thanks.

This is a very good and practical question. I assume you want to use all the historical data to train your RL agent in stable-baselines3 and then apply the trained agent to predict tomorrow's action. The short answer is NO, you don't need to retrain your agent from scratch every day.
First you need to understand the difference between the learning (training) procedure and the prediction procedure.
In the learning or training process:
1. Initialize your RL agent's policy or value network.
2. Input the observation for day 2014-01-01 to your RL agent.
3. Your agent makes a decision based on the observation.
4. Compute the observation and reward/profit for day 2014-01-02 and send them back to your agent.
5. Depending on the RL algorithm you use, your agent either updates its policy or value network based on this observation-reward pair immediately, or saves the pair into a buffer and only updates the network after a certain number of days (e.g., 30 days, 180 days).
6. Repeat steps 2-5 until you reach the last day of your data set (e.g., 2023-02-12); a stable-baselines3 sketch of this loop follows.
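In stable-baselines3, this entire loop runs inside a single call to learn(). A minimal training sketch, assuming the stocks-v0 environment from gym-anytrading and a pandas DataFrame df with a 'Close' column covering 2014-01-01 up to today (the environment id and parameters follow gym-anytrading's documented usage):

```python
import gym                  # with stable-baselines3 >= 2.0, use gymnasium instead
import gym_anytrading       # registers the 'stocks-v0' environment
from stable_baselines3 import PPO

# df: pandas DataFrame of daily prices (must contain a 'Close' column),
# e.g. loaded from CSV and covering 2014-01-01 up to today.
window_size = 10
env = gym.make("stocks-v0", df=df, window_size=window_size,
               frame_bound=(window_size, len(df)))

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)   # steps 2-5 above are repeated inside learn()
model.save("ppo_trading")              # keep the trained agent for later use
```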
In the prediction process (which uses only steps 2 and 3 from the training process):
Input the observation for the day you care about (e.g., today's data) to your RL agent.
Your agent makes a decision based on the observation.
That's it; a minimal prediction sketch follows.
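For example, a "today's data in, tomorrow's action out" sketch, assuming the agent was trained on gym-anytrading's stocks-v0, whose observation is just the last window_size rows of (price, price difference) features; if your environment builds features differently, reproduce that pipeline instead:

```python
import numpy as np
from stable_baselines3 import PPO

model = PPO.load("ppo_trading")           # the agent saved after training

window_size = 10                          # must match the training environment
prices = df["Close"].to_numpy()           # df now contains data up to *today*
diff = np.insert(np.diff(prices), 0, 0)
signal_features = np.column_stack((prices, diff))

obs = signal_features[-window_size:]      # today's observation window
action, _ = model.predict(obs, deterministic=True)
print("suggested action for tomorrow:", action)   # 0 = Sell, 1 = Buy in gym-anytrading
```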
You can repeatedly train your model on the historical data until you are satisfied with its performance during training. In this process, after each pass through the entire history, you can save the model and load the saved model as the initialization for the next pass.
Once you have a good model, you don't need to keep training it on the new data that arrives after 2023-02-12; it is still valid.
You may feel that new data is generated every day and that the most recent data is the most valuable. In that case, you can periodically update your existing model with the new data using the following procedure:
1. Load your existing (already trained) RL agent model.
2. Input the observation for the first day of your most recent new data to your RL agent.
3. Your agent makes a decision based on the observation.
4. Compute the observation and reward/profit for the second day of your new data and send them back to your agent.
5. Depending on the RL algorithm you use, your agent either updates its policy or value network based on this observation-reward pair immediately, or saves the pair into a buffer and only updates the network after a certain number of days (e.g., 30 days).
6. Repeat steps 2-5 until you reach the last day of your new data (a sketch of this update pass follows).
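A sketch of such a periodic update pass with stable-baselines3, assuming the same stocks-v0 setup as above; new_df is a DataFrame holding only the recent days you want to learn from, and the timestep counts are placeholders:

```python
import gym
import gym_anytrading
from stable_baselines3 import PPO

window_size = 10
new_env = gym.make("stocks-v0", df=new_df, window_size=window_size,
                   frame_bound=(window_size, len(new_df)))

model = PPO.load("ppo_trading", env=new_env)   # start from the trained weights
model.learn(total_timesteps=10_000,            # a short update pass on the new data
            reset_num_timesteps=False)         # keep the previous training counter
model.save("ppo_trading")                      # overwrite with the updated agent
```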

Related

Deep Learning for Acoustic Emission of concrete fracture specimens: regression of on-set time and classification of type of failure

How can I use deep learning for both regression and classification tasks?
I am facing a problem with acoustic emission from fracture of concrete specimens. The objective is to automatically find the on-set time instant (the time at the beginning of the acoustic emission) and the slope up to the peak value to determine the kind of fracture (mode I or mode II, based on the rise angle RA).
I have tried a region-based CNN working on images of the signals (fine-tuning Faster R-CNN using PyTorch), but unfortunately the results have not been outstanding so far.
I would like to work with sequences (time series) of amplitude data sampled at a certain frequency, but each has a different length. How can I deal with this problem?
Can I build a 1D-CNN that performs a sort of anomaly detection based on the supervised points that I can mark manually on the training examples?
I have a certain number of recordings, sampled at 100 Hz, which I would like to use to train the model. In anomaly-detection examples like "Timeseries anomaly detection using an Autoencoder", they use a single time series and slide a window one time step at a time to obtain about 3700 samples to train their neural network. Instead, I have a number of different recordings (time series), each with its own on-set time instant and a different overall length in seconds. How can I manage that?
I actually need the time instant of the beginning of the signal and the maximum point in order to define the rise angle and classify the type of fracture. Can I perform classification directly with a CNN simultaneously with regression of the on-set time instant?
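For illustration, a two-headed 1D-CNN along these lines might look like the rough sketch below (PyTorch assumed; the layer sizes are placeholders, and the adaptive pooling is just one way of coping with recordings of different lengths):

```python
import torch
import torch.nn as nn

class TwoHeadAECNN(nn.Module):
    """1D-CNN with a regression head (on-set time) and a classification head
    (mode I vs. mode II). Layer sizes are arbitrary placeholders."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),   # flattened size independent of input length
            nn.Flatten(),
        )
        self.onset_head = nn.Linear(32 * 8, 1)         # regression: on-set time
        self.mode_head = nn.Linear(32 * 8, n_classes)  # classification: fracture mode

    def forward(self, x):              # x: (batch, 1, n_samples)
        z = self.features(x)
        return self.onset_head(z), self.mode_head(z)

# Training would combine both losses, e.g.:
# onset_pred, mode_logits = model(batch)
# loss = mse_loss(onset_pred.squeeze(1), onset_true) + cross_entropy(mode_logits, mode_true)
```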
Thank you in advance!
I finally solved it, thanks to the fundamental suggestion by @JonNordby, using a Sound Event Detection method. We adopted and adapted the code from GitHub user YashNita.
I labelled the data according to the following image:
Then, I adopted the method of extracting features by computing the spectrogram of the input signals:
And finally, we were able to get a more precise recognition output from the Seismic Event Detection, which is directly connected to Acoustic Emission event detection, obtaining the following result:
For the moment, only the event recognition phase has been done, but it would be simple to adapt it to also classify mode I vs. mode II cracking.
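For reference, the feature-extraction step amounts to something like the following minimal sketch (not the exact code from the adapted repository; SciPy is assumed, with fs set to the 100 Hz sampling rate mentioned in the question):

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(x, fs=100, nperseg=64, noverlap=48):
    """Turn a raw 1D recording into log-scaled spectrogram features."""
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return f, t, np.log(Sxx + 1e-10)   # small epsilon avoids log(0)
```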

How to do Transfer Learning with LSTM for time series forecasting?

I am working on a project about time-series forecasting using LSTM layers. The dataset used for training and testing the model was collected from 443 people who wore a sensor that samples a physical variable (1 variable/measurement) every 5 minutes; for each patient there are around 5000 records/readings.
Although I can train and test my model under different scenarios, I am having trouble finding information about how to apply transfer learning in such an architecture. I mean, I understand I can use inductive transfer learning by copying the weight matrices from the general model onto a secondary model (for an unknown person), and then re-train this model with that person's data and evaluate the result.
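For illustration, the weight-copying approach described above might look roughly like the following sketch (Keras assumed; the layer sizes and window length are placeholder choices):

```python
from tensorflow import keras

n_steps = 288        # placeholder window length, e.g. one day of 5-minute samples

def build_model():
    return keras.Sequential([
        keras.layers.LSTM(64, input_shape=(n_steps, 1)),
        keras.layers.Dense(1),
    ])

general = build_model()                       # trained on the 443-person dataset
# general.fit(X_all, y_all, ...)

personal = build_model()                      # model for the new, unseen person
personal.set_weights(general.get_weights())   # inductive transfer: copy the weights

personal.layers[0].trainable = False          # optionally freeze the LSTM layer
personal.compile(optimizer="adam", loss="mse")
# personal.fit(X_person, y_person, ...)        # fine-tune on the person's own data
```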
But I would like to know if somebody knows other ways to apply transfer learning to this type of architecture, or where to find information about it, since there aren't many scientific papers discussing it; most talk about NLP and other types of applications, but not time series.
Cheers X )

Multiagent (not deep) reinforcement learning? Modeling the problem

I have N agents/users accessing a single wireless channel, and at each time step only one agent can access the channel and receive a reward.
Each user has a buffer that can store B packets, and I assume the buffer is infinite.
Each user n gets an observation from the environment telling it whether its packet in time slot t was a success or a failure (collision). If more than one user accesses the channel, they receive a penalty.
This feedback from the channel is the same for all users, since there is only one channel. The reward is -B_n (the negative of the number of packets in the buffer). Each user wants to maximize its own reward and tries to empty its buffer.
Packets arrive at each user following a Poisson process with an average of $\lambda$ packets per time slot.
Each user has a history of the previous 10 time slots that it uses as input to the DQN, which outputs the probability of taking action A_n: stay silent or transmit. The history consists of (A_n, F, B_n) tuples.
Each user is unaware of the actions and buffer status of the other users.
I am trying to model my problem with multi-agent reinforcement learning, and so far I have tried DQN, but the results are more or less the same as a random scheme. Could it be that users don't have enough contextual information to learn the behaviour of the other users? Or could there be another reason?
I would like to know how I can model my environment, since the state (in the RL sense) is static and the environment doesn't change. The only thing that changes is each user's history at each time slot. So I am not sure whether it is a partially observable MDP, or whether it should be modelled as a multi-agent single-arm bandit problem, which I am not sure is correct either.
The second concern is that I have tried DQN and it has not worked, and I would like to know whether such a problem can be tackled with tabular Q-learning. I have not seen multi-agent works in which anyone has used tabular QL. Any insights would be helpful.
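For illustration, a tabular variant could use independent Q-learners, one per user, roughly as in the sketch below (the state is the discretised history tuple of the last 10 slots; all names and hyperparameters are placeholders):

```python
import random
from collections import defaultdict

N_AGENTS = 4
ACTIONS = (0, 1)                       # 0 = stay silent, 1 = transmit
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1     # placeholder hyperparameters

# One Q-table per agent: Q[agent][state] -> [value of silent, value of transmit]
Q = [defaultdict(lambda: [0.0, 0.0]) for _ in range(N_AGENTS)]

def choose_action(agent, state):
    """Epsilon-greedy action selection; `state` is a hashable history tuple."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[agent][state][a])

def update(agent, state, action, reward, next_state):
    """Standard Q-learning update, done independently by each agent."""
    best_next = max(Q[agent][next_state])
    Q[agent][state][action] += ALPHA * (reward + GAMMA * best_next
                                        - Q[agent][state][action])
```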
Your problem can be modeled as a Decentralized POMDP (see an overview here).
Summarizing this approach: you consider a multi-agent system where each agent models its own policy, and then you try to build a joint policy from these individual ones. Of course, the complexity grows as the number of agents, states, and actions increases, so there are several approaches, mainly based on heuristics, to prune branches of the joint policy tree that are not "good" in comparison with others. A well-known example of this approach is precisely about routing packets, where it is possible to define a discrete action/state space.
But be aware that even for tiny systems, the complexity often becomes infeasible!

Fitting step in Deep Q Network

I am confused about why the DQN with experience replay algorithm performs a gradient descent step for every step in a given episode. This fits on only one step, right? That would make it extremely slow. Why not update after each episode ends, or every time the model is cloned?
In the original paper, the authors push one sample to the experience replay buffer at each step and randomly sample 32 transitions to train the model in minibatch fashion. The samples taken from interacting with the environment are not fed to the model directly. To increase the speed of training, the authors store a sample every step but update the model only every four steps.
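As a rough sketch of that schedule (not the paper's actual code: env, q_net, target_net, epsilon_greedy, and optimize are hypothetical placeholders, and PyTorch-style modules are assumed for the target copy):

```python
import random

def dqn_training_loop(env, q_net, target_net, epsilon_greedy, optimize,
                      total_steps, batch_size=32, train_every=4, clone_every=10_000):
    """Store a transition every environment step, do one gradient step on a
    random minibatch every `train_every` steps, and clone the target network
    every `clone_every` parameter updates."""
    replay_buffer, num_updates = [], 0
    obs = env.reset()
    for step in range(total_steps):
        action = epsilon_greedy(q_net, obs)
        next_obs, reward, done, _info = env.step(action)
        replay_buffer.append((obs, action, reward, next_obs, done))   # every step

        if step % train_every == 0 and len(replay_buffer) >= batch_size:
            batch = random.sample(replay_buffer, batch_size)   # 32 random transitions
            optimize(q_net, target_net, batch)                 # one gradient step
            num_updates += 1
            if num_updates % clone_every == 0:
                target_net.load_state_dict(q_net.state_dict()) # clone Q -> Q'

        obs = env.reset() if done else next_obs
```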
With OpenAI's Baselines project, this single-process method can master easy games like Atari Pong (Pong-v4) in about 2.5 hours on a single GPU. Of course, training in this single-process way leaves the resources of a multi-core, multi-GPU (or single-GPU) system underutilised, so newer publications have decoupled action selection from model optimisation. They use multiple "actors" to interact with environments simultaneously and a single GPU "learner" to optimise the model, or multiple learners with multiple models on several GPUs. The multi-actor/single-learner setup is described in DeepMind's Ape-X DQN (Distributed Prioritized Experience Replay, D. Horgan et al., 2018), and the multi-actor/multi-learner setup in Accelerated Methods for Deep Reinforcement Learning (Stooke and Abbeel, 2018). When using multiple learners, parameter sharing across processes becomes essential. An older trail is described in DeepMind's parallel DQN (Massively Parallel Methods for Deep Reinforcement Learning, Nair et al., 2015), which was proposed in the period between DQN and A3C. However, that work was performed entirely on CPUs and used massive resources, and its results can easily be outperformed by PAAC's batched action-selection-on-GPU method.
You can't optimise only at the end of each episode, because the episode length isn't fixed: a better model usually produces longer episodes. The amount of learning per environment step would then decrease as the model starts to perform a little better, and the learning progress would become unstable.
We also don't train the model only when the target network is cloned, because the target network was introduced to stabilise training by keeping an older set of parameters. If you updated only at the moment of cloning, the target network's parameters would always be identical to the model's, and this causes instability: with the same parameters, one model update will also increase the value of the next state.
DeepMind's 2015 Nature paper states:
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the target y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q' and use Q' for generating the Q-learning targets y_j for the following C updates to Q.
This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using the older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
Human-level control through deep reinforcement learning, Mnih et al., 2015

OpenAI Gym stepping in an externally controlled environment

I have a simulation that ticks the time every 5 seconds. I want to use OpenAI Gym and its Baselines algorithms to perform learning in this environment. For that, I'd like to adapt the simulation by writing some adapter code that conforms to the OpenAI Env API. But there is a problem: in the OpenAI setting, the flow of control is defined by the agent. In my world, however, the environment steps independently of the agent. If the agent doesn't decide, or isn't fast enough, the world just keeps going without it. How would one achieve this reversal of who triggers the next step?
In short: an OpenAI Env gets stepped by the agent. My environment gives my agent about 2-3 seconds to decide and then just tells it what's new, again offering it the choice to act or not.
As an example: my environment is rather similar to a real-world stock trading market. The agent gets 24 chances to buy/sell products at a certain limit price to accumulate a certain volume by the target time, and at time step 24 the reward is given to the agent and the slot is completed. The reward is based on the average price paid per item compared to the average price paid by all market participants.
At any given moment, 24 slots are traded in parallel (a 24x parallel trading of futures). I believe I need to create 24 environments for this, which leads me to believe A3C would be a good choice.
After re-reading the question, it seems like OpenAI gym is not a great fit for what you’re trying to do. It is designed for running rapid experiments, which cannot be done efficiently if you are waiting on live events to occur. If you have no historical data and can only train on incoming live data, there is no point to using OpenAI gym. You can write your own code to represent the environment from that data, and that would be easier than trying to morph it into another framework, although OpenAI gym’s API does provide a good model for how your environment should work.
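If you keep the gym-style interface only as a model, one way to invert the control is to let step() publish the agent's action and then block until the simulation delivers its next tick. A rough sketch (queue-based; all names and the timeout are hypothetical placeholders for however your simulation pushes updates):

```python
import queue

class ExternallyTickedEnv:
    """Gym-like adapter for a world that ticks on its own every ~5 seconds."""

    def __init__(self, tick_queue, action_sink, decision_window=2.5):
        self.tick_queue = tick_queue          # filled by the simulation each tick
        self.action_sink = action_sink        # where the agent's choices are sent
        self.decision_window = decision_window

    def reset(self):
        tick = self.tick_queue.get()          # wait for the first world state
        return tick["observation"]

    def step(self, action):
        # Publish the decision; if it arrives too late, the world simply ignores it.
        self.action_sink.put(action)
        try:
            tick = self.tick_queue.get(timeout=self.decision_window + 5.0)
        except queue.Empty:
            raise RuntimeError("simulation stopped ticking")
        return tick["observation"], tick["reward"], tick["done"], {}
```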