Are learning and cumulative reward good metrics to evaluate an RL model? - reinforcement-learning

I am new to reinforcement learning.
I have a problem here that I am using DQN on. I have plotted a cumulative reward curve while learning and taking actions. After 100 episodes it shows a lot of fluctuations and does not tell me whether the model has learnt anything.
However, instead of relying on the cumulative reward during learning, I ran the model through the whole simulation with the learning method disabled after each episode, and this showed me that the model is actually learning well. The downside is that it extended the program runtime by quite a bit.
In addition, I have to extract the best model along the way, because the final model sometimes performs badly.
Any advice or explanation for this?

Try using the average return; it's usually a good metric for telling whether the agent is improving or not.
If you're using tf_agents you can do something like this:
...
checkpoint_dir = os.path.join('./', 'checkpoint')
train_checkpointer = common.Checkpointer(
    ckpt_dir=checkpoint_dir,
    max_to_keep=1,
    agent=agent,
    policy=agent.policy,
    replay_buffer=replay_buffer,
    global_step=train_step
)
policy_dir = os.path.join('./', 'policy')
tf_policy_saver = policy_saver.PolicySaver(agent.policy)

def train_agent(n_iterations):
    best_AverageReturn = 0
    time_step = None
    policy_state = agent.collect_policy.get_initial_state(tf_env.batch_size)
    iterator = iter(dataset)
    for iteration in range(n_iterations):
        # collect experience and train on a sampled batch
        time_step, policy_state = collect_driver.run(time_step, policy_state)
        trajectories, buffer_info = next(iterator)
        train_loss = agent.train(trajectories)
        if iteration % 10 == 0:
            print("\r{} loss:{:.5f}".format(iteration, train_loss.loss.numpy()), end="")
        # checkpoint only when the average return improves
        if iteration % 1000 == 0 and averageReturnMetric.result() > best_AverageReturn:
            best_AverageReturn = averageReturnMetric.result()
            train_checkpointer.save(train_step)
            tf_policy_saver.save(policy_dir)
Every 1000 steps, the train function evaluates the average return and creates a checkpoint if there is any improvement.
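If you also want to evaluate the way the question describes (running the policy with learning disabled), a minimal sketch in the same tf_agents style could look like the following; it assumes the tf_env and agent from above and simply averages undiscounted episode returns over a few rollouts of the deployed policy:

def compute_avg_return(environment, policy, num_episodes=10):
    # roll out the policy without training and average the episode returns
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    return total_return / num_episodes

avg_return = compute_avg_return(tf_env, agent.policy, num_episodes=10)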

Related

Modify Random Forest Regression to predict multiple values in the future using past data

I am using Random Forest Regression on power vs time data from an experiment that was performed for a certain time duration. Using that data, I want to predict the trend of power in the future, using time as an input. The code that has been implemented is shown below.
import pandas as pd

# Loading the excel dataset
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Cleaned total data.xlsx',
                   header=None, names=["active_power", "current", "voltage"],
                   usecols="A:C", skiprows=[i for i in range(1)])
df = df.dropna()
The data set consists of approximately 30 hours of power vs time values.
Next, a Random Forest regressor is fitted on the training data. The R2 score achieved on test data is 0.87.
import numpy as np
import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Creating X and y
X = np.array(series[['time_h']]).reshape(-1, 1)
y = np.array(series['active_power'])

# Splitting dataset into training and testing
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.15, random_state=1)

# Creating Random Forest model and fitting it on training data
forest = RandomForestRegressor(n_estimators=128, criterion='mse', random_state=1, n_jobs=-1)
forest_fit = forest.fit(X_train2, y_train2)

# Saving the model and checking the R2 score on test data
filename = 'random_forest.sav'
joblib.dump(forest, filename)
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test2, y_test2)
print(result)
For future prediction, an array of time values covering 400 hours has been created to use as input to the model, as the power needs to be predicted for that duration.
# Creating a time array for future which will be used as input for future predictions
future_time2 = np.arange(len(series)*15)
future_time2 = future_time2*0.25/360
columns = ['time_hour']
dataframe = pd.DataFrame(data = future_time2, columns = columns)
future_times = dataframe[41006:].to_numpy()
future_times
When predictions are made for the future, the model only outputs a constant value over the entire 400-hour duration. The output prediction is as below.
# Predicting power for future
future_pred = loaded_model.predict(future_times)
future_pred
Could someone please suggest why the model predicts the same value for the entire duration, and how to modify the code so that I get a trend of predictions with reasonable values rather than a single value?
Thank you.

Why can't my LSTM autoencoder model detect outliers?

I am trying to build an LSTM autoencoder for anomaly detection, but the model does not seem to work on my data.
Here is the normal data that I use for training.
And here is the abnormal data that I use for validation.
If the model works, its loss should be high at samples #200000~#500000.
Unfortunately, here is the result when I feed the validation data to the model:
in the abnormal interval, the loss is still low.
Here is my code for training the model.
I would greatly appreciate any suggestions you can kindly give me.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras import optimizers
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping

# scale both datasets with statistics fitted on the healthy data only
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(healthy_data)
data_scaled = scaler.transform(healthy_data)
data_broken_scaled = scaler.transform(broken_data)

timesteps = 32
data = data_scaled
dim = 1
data.shape = (-1, timesteps, dim)
data_broken_scaled.shape = (-1, timesteps, dim)  # reshape validation data to match the model input

lr = 0.0001
nadam = optimizers.Nadam(lr=lr)

model = Sequential()
model.add(LSTM(50, input_shape=(timesteps, dim), return_sequences=True))
model.add(Dense(dim))
model.compile(loss='mae', optimizer=nadam, metrics=['mse'])

EStop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=150,
                      verbose=2, mode='auto', restore_best_weights=True)
history = model.fit(data, data, validation_data=(data, data), epochs=3000,
                    batch_size=72, verbose=2, shuffle=False, callbacks=[EStop]).history

# reconstruction error per window on the abnormal data
pred_broken = model.predict(data_broken_scaled)
loss_broken = np.mean(np.abs(pred_broken - data_broken_scaled), axis=1)
fig, ax = plt.subplots(figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')
ax.plot(range(0, len(loss_broken)), loss_broken, '-', color='red', animated=True, linewidth=1)
I think it may be better to use a Fourier transform to detect anomalies in the frequency domain.
I mean, convert the training and test data to the frequency domain via a Fourier transform. I also recommend using time windows.
Most models perform better when they have enough good training data. Your anomaly points should be labeled 1 (in a preprocessing step of the data), and anomalies should show up as high-frequency content within your time windows.
In short, I think your training data may not be sufficiently representative of the anomalies right now.
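To make the frequency-domain suggestion concrete, here is a minimal sketch (my illustration, not the original poster's code; healthy_data and broken_data are the arrays from the question, and window_size/step are arbitrary choices): slide a window over the signal, take each window's FFT magnitude spectrum, and score windows by how far their spectrum is from the healthy average.

import numpy as np

def fft_window_features(signal, window_size=256, step=128):
    # slide a window over the 1-D signal and take each window's magnitude spectrum
    windows = [signal[i:i + window_size]
               for i in range(0, len(signal) - window_size + 1, step)]
    return np.abs(np.fft.rfft(np.asarray(windows), axis=1))

def anomaly_scores(train_signal, test_signal):
    # baseline: mean spectrum of the healthy windows; score: distance from it
    mean_spec = fft_window_features(train_signal).mean(axis=0)
    test_spec = fft_window_features(test_signal)
    return np.linalg.norm(test_spec - mean_spec, axis=1)

scores = anomaly_scores(np.ravel(healthy_data), np.ravel(broken_data))

High scores would then mark candidate anomaly windows.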

Implementing WNGrad in PyTorch?

I'm trying to implement the WNGrad (technically WN-Adam, Algorithm 4 in the paper) optimizer (WNGrad) in PyTorch. I've never implemented an optimizer in PyTorch before, so I don't know if I've done it correctly (I started from the Adam implementation). The optimizer does not make much progress and breaks down in the way I would expect (the bj values can only increase monotonically, which happens quickly, so no progress is made), but I'm guessing I have a bug. Standard optimizers (Adam, SGD) work fine on the same model I'm trying to optimize.
Does this implementation look correct?
import torch
from torch.optim import Optimizer

class WNAdam(Optimizer):
    """Implements the WNAdam algorithm.

    It has been proposed in `WNGrad: Learn the Learning Rate in Gradient Descent`_.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 0.1)
        beta1 (float, optional): exponential smoothing coefficient for the gradient.
            When beta1=0 this implements WNGrad.

    .. _WNGrad\: Learn the Learning Rate in Gradient Descent:
        https://arxiv.org/abs/1803.02865
    """

    def __init__(self, params, lr=0.1, beta1=0.9):
        if not 0.0 <= beta1 < 1.0:
            raise ValueError("Invalid beta1 parameter: {}".format(beta1))
        defaults = dict(lr=lr, beta1=beta1)
        super().__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Learning rate adjustment
                    state['bj'] = 1.0
                exp_avg = state['exp_avg']
                beta1 = group['beta1']
                state['step'] += 1
                # .item() keeps bj a Python float rather than a 0-dim tensor
                state['bj'] += (group['lr'] ** 2) / state['bj'] * grad.pow(2).sum().item()
                # update exponential moving average of the gradient
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                bias_correction = 1 - beta1 ** state['step']
                p.data.sub_(exp_avg, alpha=group['lr'] / state['bj'] / bias_correction)
        return loss
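For reference, a custom optimizer like this plugs into the standard PyTorch training loop; a minimal sketch with placeholder model and data (not from the original post):

import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)
optimizer = WNAdam(model.parameters(), lr=0.1, beta1=0.9)
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()   # populates p.grad, which step() reads
    optimizer.step()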
The paper's author has an open-sourced implementation on GitHub.
The WNGrad paper states that it's inspired by batch (and weight) normalization. You should use the L2 norm with respect to the weight dimensions (don't sum over everything), as shown in this algorithm.
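As I read that suggestion (my interpretation, not a verified fix), it means keeping bj per weight vector rather than one scalar per tensor: for a weight matrix, update one bj per output row from the squared L2 norm of that row's gradient. A standalone sketch of just that update:

import torch

lr = 0.1
grad = torch.randn(4, 3)                            # pretend gradient of a 4x3 weight matrix
bj = torch.ones(4, 1)                               # one bj per output row, not one scalar
row_norm_sq = grad.pow(2).sum(dim=1, keepdim=True)  # squared L2 norm along the weight dimension
bj = bj + (lr ** 2) / bj * row_norm_sq              # per-row version of the bj update
print(bj.shape)                                     # torch.Size([4, 1])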

Explanation behind actor-critic algorithm in pytorch example?

PyTorch provides a good example of using actor-critic to play Cartpole in the OpenAI gym environment.
I'm confused about several of their equations in the code snippet found at https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py#L67-L79:
saved_actions = model.saved_actions
value_loss = 0
rewards = []
for r in model.rewards[::-1]:
    R = r + args.gamma * R
    rewards.insert(0, R)
rewards = torch.Tensor(rewards)
rewards = (rewards - rewards.mean()) / (rewards.std() + np.finfo(np.float32).eps)
for (action, value), r in zip(saved_actions, rewards):
    action.reinforce(r - value.data.squeeze())
    value_loss += F.smooth_l1_loss(value, Variable(torch.Tensor([r])))
optimizer.zero_grad()
final_nodes = [value_loss] + list(map(lambda p: p.action, saved_actions))
gradients = [torch.ones(1)] + [None] * len(saved_actions)
autograd.backward(final_nodes, gradients)
optimizer.step()
What do r and value mean in this case? Why do they run REINFORCE on the action space with the reward equal to r - value? And why do they try to set the value so that it matches r?
Thanks for your help!
First, the rewards are collected for a while, along with the state:action pairs that produced them.
Then r - value is the difference between the actual reward and the expected reward.
That difference is used to adjust the expected value of that action from that state.
So if in state "middle" the expected reward for action "jump" was 10 and the actual reward was only 2, then the AI was off by -8 (2 - 10). Reinforce means "adjust expectations". If we adjust them by half, the new expected reward is 10 - (8 * .5), or 6, meaning the AI really thought it would get 10 for that, but now it's less confident and thinks 6 is a better guess. And if the AI is not off by much, say by 2, then 10 - (2 * .5) = 9, so it adjusts by a smaller amount.
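As a tiny numeric sketch of that description (the 0.5 adjustment rate is illustrative, not from the PyTorch example):

expected = 10.0                  # critic's estimate for ("middle", "jump")
actual = 2.0                     # observed reward
advantage = actual - expected    # 2 - 10 = -8: the critic was too optimistic
step = 0.5                       # illustrative adjustment rate
expected = expected + step * advantage
print(expected)                  # 6.0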

Solving Kepler's Equation computationally

I'm trying to solve Kepler's Equation as a step towards finding the true anomaly of an orbiting body given time. It turns out, though, that Kepler's equation is difficult to solve, and the Wikipedia page describes the process using calculus. Well, I don't know calculus, but I understand that solving the equation involves an iterative process that produces closer and closer approximations to the correct answer.
I can't see from looking at the math how to do this computationally, so I was hoping someone with a better maths background could help me out. How can I solve this beast computationally?
FWIW, I'm using F# -- and I can calculate the other elements necessary for this equation, it's just this part I'm having trouble with.
I'm also open to methods which approximate the true anomaly given time, periapsis distance, and eccentricity.
This paper:
A Practical Method for Solving the Kepler Equation http://murison.alpheratz.net/dynamics/twobody/KeplerIterations_summary.pdf
shows how to solve Kepler's equation using an iterative computing method. It should be fairly straightforward to translate it to the language of your choice.
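To make the iteration concrete, here is a minimal Python sketch of Newton's method applied to Kepler's equation M = E - e*sin(E) (a generic textbook scheme, not the paper's exact method), followed by the conversion to true anomaly:

import math

def eccentric_anomaly(M, e, tol=1e-12, max_iter=100):
    # solve M = E - e*sin(E) for E with Newton's method
    E = M if e < 0.8 else math.pi           # common starting guess
    for _ in range(max_iter):
        f = E - e * math.sin(E) - M         # residual of Kepler's equation
        if abs(f) < tol:
            break
        E -= f / (1.0 - e * math.cos(E))    # Newton step
    return E

def true_anomaly(E, e):
    # convert eccentric anomaly to true anomaly
    return 2.0 * math.atan2(math.sqrt(1.0 + e) * math.sin(E / 2.0),
                            math.sqrt(1.0 - e) * math.cos(E / 2.0))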
You might also find this interesting. It's an OCaml program, part of which claims to contain a Kepler Equation solver. Since F# is in the ML family of languages (as is OCaml), this might provide a good starting point.
I wanted to drop a reply in here in case this page gets found by anyone else looking for similar materials.
The following was written as an "expression" in Adobe's After Effects software, so it's JavaScript-ish, although I have a Python version for a different app (Cinema 4D). The idea is the same: execute Newton's method iteratively until some arbitrary precision is reached.
Please note that I'm not posting this code as exemplary or meaningfully efficient in any way, just posting code we produced on a deadline to accomplish a specific task (namely, move a planet around a focus according to Kepler's laws, and do so accurately). We don't write code for a living, and so we're also not posting this for critique. Quick & dirty is what meets deadlines.
In After Effects, any "expression" code is executed once per frame of the animation. This restricts what one can do when implementing many algorithms, due to the inability to address global data easily (other algorithms for Keplerian motion use iteratively updated velocity vectors, an approach we couldn't use). The result the code leaves behind is the [x, y] position of the object at that instant in time (internally, this is the frame number), and the code is intended to be attached to the position element of an object layer on the timeline.
This code evolved out of material found at http://www.jgiesen.de/kepler/kepler.html, and is offered here for the next guy.
pi = Math.PI;

function EccAnom(ec, am, dp, _maxiter) {
    // ec=eccentricity, am=mean anomaly,
    // dp=number of decimal places
    pi = Math.PI;
    i = 0;
    delta = Math.pow(10, -dp);
    var E, F;
    // some attempt to optimize the initial guess
    if (ec < 0.8) {
        E = am;
    } else {
        E = am + Math.sin(am);
    }
    F = E - ec * Math.sin(E) - am;
    // Newton's method: iterate until the residual F is within the precision target
    while ((Math.abs(F) > delta) && (i < _maxiter)) {
        E = E - F / (1.0 - (ec * Math.cos(E)));
        F = E - ec * Math.sin(E) - am;
        i = i + 1;
    }
    return Math.round(E * Math.pow(10, dp)) / Math.pow(10, dp);
}

function TrueAnom(ec, E, dp) {
    S = Math.sin(E);
    C = Math.cos(E);
    fak = Math.sqrt(1.0 - ec * ec); // ec*ec, not ec^2: ^ is bitwise XOR in JavaScript
    phi = 2.0 * Math.atan(Math.sqrt((1.0 + ec) / (1.0 - ec)) * Math.tan(E / 2.0));
    return Math.round(phi * Math.pow(10, dp)) / Math.pow(10, dp);
}

function MeanAnom(time, _period) {
    curr_frame = timeToFrames(time);
    if (curr_frame <= _period) {
        frames_done = curr_frame;
        if (frames_done < 1) frames_done = 1;
    } else {
        frames_done = curr_frame % _period;
    }
    _fractime = (frames_done * 1.0) / _period;
    mean_temp = (2.0 * Math.PI) * (-1.0 * _fractime);
    return mean_temp;
}

//==============================
// a=semimajor axis, ec=eccentricity, E=eccentric anomaly
// delta = delta digits to exit, period = per., in frames
//----------------------------------------------------------
_eccen = 0.9;
_delta = 14;
_maxiter = 1000;
_period = 300;
_semi_a = 70.0;
_semi_b = _semi_a * Math.sqrt(1.0 - _eccen * _eccen); // was _eccen^2, which is XOR in JS
_meananom = MeanAnom(time, _period);
_eccentricanomaly = EccAnom(_eccen, _meananom, _delta, _maxiter);
_trueanomaly = TrueAnom(_eccen, _eccentricanomaly, _delta);
r = _semi_a * (1.0 - _eccen * _eccen) / (1.0 + (_eccen * Math.cos(_trueanomaly)));
x = r * Math.cos(_trueanomaly);
y = r * Math.sin(_trueanomaly);
_foc = _semi_a * _eccen;
[1460 + x + _foc, 540 + y];
You could check this out, implemented in C# by Carl Johansen:
Represents a body in elliptical orbit about a massive central body
Here is a comment from the code:
True Anomaly in this context is the angle between the body and the sun. For elliptical orbits, it's a bit tricky. The percentage of the period completed is still a key input, but we also need to apply Kepler's equation (based on the eccentricity) to ensure that we sweep out equal areas in equal times. This equation is transcendental (ie can't be solved algebraically) so we either have to use an approximating equation or solve by a numeric method. My implementation uses Newton-Raphson iteration to get an excellent approximate answer (usually in 2 or 3 iterations).