Trading algorithm - actions in Q-learning/DQN - reinforcement-learning

The following was completed using MATLAB.
I am trying to build a trading algorithm using deep Q-learning. I have taken a year's worth of daily stock prices and am using that as the training set.
My state space is [money, stock, price], where
money is the amount of cash I have,
stock is the number of shares I hold, and
price is the price of the stock at that time step.
The issue I am having is with the actions; looking online, people only seem to use three actions, { buy | sell | hold }.
My reward function is the difference between the portfolio value in the current time step and the previous time step.
But with just three actions, I am unsure how I would choose to buy, let's say, 67 shares at the current price.
I am using a neural network to approximate the Q-values. It has three inputs
[money, stock, price] and 202 outputs, i.e. I can sell between 0 and 100 shares, hold, or buy between 1 and 100 shares.
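To make the encoding concrete, this is the mapping I have in mind (sketched here in Python for readability; the helper name is only illustrative, and it mirrors the a2 = a - 101 line in the MATLAB code below):

# Map a Q-network output index (1..202, as in the MATLAB code below) to a signed
# trade size: negative = sell that many shares, 0 = hold, positive = buy that many.
def index_to_trade(action_index):
    return action_index - 101

# e.g. index 34 -> sell 67 shares, index 101 -> hold, index 168 -> buy 67 shares
print(index_to_trade(34), index_to_trade(101), index_to_trade(168))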
Can anyone shed some light on how I can reduce this to 3 actions?
My code is:
% p is the stock price
% sp is the stock price at the next time interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidden_layers = 1;
actions       = 202;

net = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...
             [hidden_layers, actions],                     ...
             {'tansig','purelin'},                         ...
             'trainlm' );
net = init( net );
net.trainParam.showWindow = false;

% neural network training parameters -----------------------------------
net.trainParam.lr     = 0.01;
net.trainParam.mc     = 0.1;
net.trainParam.epochs = 100;

% parameters for Q-learning ---------------------------------------------
epsilon        = 0.8;
gamma          = 0.95;
max_episodes   = 1000;
max_iterations = length( p ) - 1;
reset          = false;
initial_money  = 1000;
initial_stock  = 0;

% these will be where I save the outputs
save_s        = zeros( max_iterations, max_episodes );
save_pt       = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a        = zeros( max_iterations, max_episodes );

% construct the initial state --------------------------------------------
% a = randi( [1 3], 1, 1 );
s = [initial_money; initial_stock; p( 1, 1 )];

% construct initial Q matrix ---------------------------------------------
Qs       = zeros( 1, actions );
Qs_prime = zeros( 1, actions );
for i = 1:max_episodes
    for j = 1:max_iterations % max_iterations --------------
        Qs = net( s );

        %% here we will choose an action based on an epsilon-greedy strategy
        if ( rand() <= epsilon )
            [Qs_value, a] = max( Qs );
        else
            a = randi( [1 actions], 1, 1 );
        end

        a2 = a - 101;                     % map the action index 1..202 to a signed trade size
        save_a( j, i ) = a2;

        sp = p( j+1, 1 );
        pt = s( 1 ) + s( 2 ) * p( j, 1 ); % current portfolio value
        save_pt( j, i ) = pt;

        [s_prime, reward] = simulateStock( s, a2, pt, sp );

        Qs_prime = net( s_prime );
        Q_target = reward + gamma * max( Qs_prime );
        save_Q_target( j, i ) = Q_target;

        Targets      = Qs;
        Targets( a ) = Q_target;
        net = train( net, s, Targets(:) );   % update the Q-network towards the computed target

        save_s( j, i ) = s( 1 );
        s = s_prime;
    end
    epsilon = epsilon * 0.99;
    reset   = false;
    s = [initial_money; initial_stock; p( 1, 1 )];
end
% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
    money = s(1);
    stock = s(2);
    price = s(3);

    money = money - a * price;
    money = max( money, 0 );

    stock = s(2) + a;
    stock = max( stock, 0 );

    s_prime = [money; stock; sp];
    reward  = ( money + stock * sp ) - pt;   % portfolio value at the next price minus the previous value
end

Actions: ill-defined ( unless there is an ultimate reason for such a flattened, decaffeinated & knowingly short-cut model )
You may be right that using just a { buy | hold | sell } range of actions is a frequent habit in academic papers, where authors sometimes decide to illustrate their academic efforts at improving learning / statistical methods and opt to pick an exemplary application in a trading domain. The pity is that this can be done in academic papers, but not in the reality of trading.
Why?
Even with an elementary view of trading, the problem is much more complex. As a brief reference, there are more than five principal domains of such a model-space. Given that trading is to be modelled, one cannot remain without a fully described strategy --
Tru-Strategy := { SelectPOLICY,
                  DetectPOLICY,
                  ActPOLICY,
                  AllocatePOLICY,
                  TerminatePOLICY
                  }
Any simplification, however motivated, that opts to omit any single one of these five principal domains becomes anything but a true Trading Strategy.
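Purely as an illustration of that decomposition ( a sketch, not a production design; the one-line glosses are just my reading of the five names above ):

# Sketch only: a trading strategy expressed as an explicit composition of the five
# policy domains named above. Every callable is a placeholder for a fully specified,
# domain-aware policy, not an implementation.
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class TruStrategy:
    select_policy:    Callable    # what to even look at
    detect_policy:    Callable    # when a tradeable situation appears
    act_policy:       Callable    # how to enter ( direction, order type, timing )
    allocate_policy:  Callable    # how much capital / how many shares to commit
    terminate_policy: Callable    # when and how to exit or abandon the position

    def step(self, market_state, portfolio) -> Optional[Tuple]:
        universe = self.select_policy(market_state)
        signal   = self.detect_policy(universe, market_state)
        if signal is None:
            return None
        action    = self.act_policy(signal, market_state)
        size      = self.allocate_policy(action, portfolio)
        exit_rule = self.terminate_policy(action, market_state)
        return action, size, exit_rule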
One can easily figure out what comes out of just training ( and, worse, of later harnessing in real trades ) an ill-defined model that is not coherent with reality.
Sure, it can reach ( and, unless the minimiser's criterion function is itself ill-formulated, it will reach ) some mathematical function's minimum, but that does not ensure that reality will immediately change its so-far natural behaviour and start to "obey" the ill-defined model and "dance" according to such oversimplified or otherwise skewed ( ill-modelled ) opinions about reality.
Rewards: ill-defined ( if not giving a reason for ignoring the fact of delayed rewards )
If in doubt about what this means, try to follow an example:
Today, the Strategy-Model decides to A:Buy(AAPL,67).
Tomorrow, AAPL goes down by some 0.1%, and thus the immediate reward ( as proposed above ) is negative, punishing that decision. The Model is stimulated not to do it ( do not buy AAPL ).
The point is that, after some period of time, AAPL rises much higher, producing a much higher reward compared to the initial fluctuation in the D2D Close. That is known, but the proposed Strategy-Model's Q-fun simply, as a matter of principle, erroneously does not reflect it at all.
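To see the numbers ( an illustrative sketch with a made-up price path, not real AAPL data ) -- the one-step portfolio-delta reward punishes the Buy on day one, while even a plain discounted multi-step return over the same path rewards it:

# Illustrative only: hypothetical daily closes after the Buy(AAPL,67) decision.
gamma  = 0.95
prices = [100.0, 99.9, 100.4, 101.8, 103.5]    # made-up price path
qty    = 67

# one-step reward, as proposed in the question: next-day change of the position value
one_step = qty * (prices[1] - prices[0])        # negative -> "do not buy AAPL"

# discounted multi-step return over the whole path: the later rise dominates the dip
multi_step = sum(gamma**k * qty * (prices[k + 1] - prices[k])
                 for k in range(len(prices) - 1))

print(one_step, multi_step)                     # approx. -6.7 vs. +207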
Beware WYTIWYG -- What You Train Is What You Get ...
This means an as-is Model could be trained to act according to the stimuli defined this way, but its actual behaviour will favour NOTHING but such extremely naive intraday "quasi-scalping" shots, with limited ( if any ) support from the actual Market State & Market Dynamics as they are available in many industry-wide accepted quantitative models.
So, sure, one can train a reality-blind model that was kept blind & deaf ( ignoring the reality of the Problem Domain ), but for what sake?
Epilogue:
There is nothing like a "Data Science", even when MarCom & HR beat their drums & whistles, as they indeed do a lot nowadays.
Why?
Exactly because of the rationale observed above. Having data-points is nothing. Sure, it is better than standing clueless in front of the customer without a single observation of reality, but the data-points do not save the game.
It is the domain knowledge that starts to make some sense of the data-points, not the data-points per se.
If still in doubt: if one has a few terabytes of numbers, there is no Data Science to tell you what the data-points represent.
On the other hand, if one knows, from the domain-specific context, that these data-points ought to be temperature readings, there is still no Data-Science God to tell you whether they are all ( just by coincidence ) in [K] or in [°C] ( given they are all positive readings >= 0.00001 ).

Related

Issues with Q-learning and neural networks

I'm just starting out learning Q-learning, and I've been okay with using the tabular method to get some decent results. One game I found quite fun to apply Q-learning to was Blackjack, which seemed like a perfect MDP-type problem.
I've been wanting to extend this to using a neural network as a function approximator, but I'm not having any luck at all. The approach is to calculate the expected value for every action in a given state and then pick the best one, with a small chance of picking something random (epsilon-greedy). Nothing converges, it learns silly Q-values, and it can't even figure out how to play when the only card in the deck is a 5.
I am genuinely stuck after spending hours on this and tuning hyperparameters and everything else I can think of. I feel like I must have made a fundamental error with Q-learning that I can't see. My code is below:
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np
import random
import pandas as pd
import sklearn
import math
import itertools
import tensorflow as tf
from matplotlib import pyplot as plt
############################ START BLACKJACK CLASS ############################
class Blackjack(gym.Env):
    """Simple Blackjack environment"""

    def __init__(self, natural=False):
        self.action_space = spaces.Discrete(2)
        self._seed()
        # Start the first game
        self.prevState = self.reset()

    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return seed

    # Returns a tuple of the form (str, int) where str is "H" or "S" depending on if it's a
    # Soft or Hard hand and int is the sum total of the cards in hand
    # Example output: ("H", 15)
    def getTotal(cards):
        running_total = 0
        softs = 0
        for c in cards:
            running_total += c
            if c == 11:
                softs += 1
            if running_total > 21 and softs > 0:
                softs -= 1
                running_total -= 10
        return "H" if softs == 0 else "S", running_total

    def drawCard():
        # Draw a random card from the deck with replacement. 11 is ACE
        # I've set it to always draw a 5. In theory this should be very easy to learn and
        # the only possible states, and their correct Q values, should be:
        # Q[10_5, stand] = -1    Q[10_5, hit] = 0
        # Q[15_5, stand] = -1    Q[15_5, hit] = 0
        # Q[20_5, stand] = 0     Q[20_5, hit] = -1
        # The network can't even learn this!
        return 5
        return random.choice([5,6])
        return random.choice([2,3,4,5,6,7,8,9,10,10,10,10,11])

    def isBlackjack(cards):
        return sum(cards) == 21 and len(cards) == 2

    def getState(self):
        # Defines the state of the current game
        pstate, ptotal = Blackjack.getTotal(self.player)
        dstate, dtotal = Blackjack.getTotal(self.dealer)
        return "{}_{}".format("BJ" if Blackjack.isBlackjack(self.player) else pstate + str(ptotal), dtotal)

    def reset(self):
        # Resets the game - Dealer is dealt 1 card, player is dealt 2 cards
        # The player and dealer are represented by an array of numbers, which are the cards they were
        # dealt in order
        self.soft = "H"
        self.dealer = [Blackjack.drawCard()]
        self.player = [Blackjack.drawCard() for _ in range(2)]
        pstate, ptotal = Blackjack.getTotal(self.player)
        dstate, dtotal = Blackjack.getTotal(self.dealer)
        # Returns the current state of the game
        return self.getState()

    def step(self, action):
        assert self.action_space.contains(action)
        # Action should be 0 or 1.
        # If standing, the dealer will draw all cards until they are >= 17. This will end the episode
        # If hitting, a new card will be added to the player, if over 21, reward is -1 and episode ends

        # Stand
        if action == 0:
            pstate, ptotal = Blackjack.getTotal(self.player)
            dstate, dtotal = Blackjack.getTotal(self.dealer)
            while dtotal < 17:
                self.dealer.append(Blackjack.drawCard())
                dstate, dtotal = Blackjack.getTotal(self.dealer)
            # if player won with blackjack
            if Blackjack.isBlackjack(self.player) and not Blackjack.isBlackjack(self.dealer):
                rw = 1.5
            # if dealer bust or if the player has a higher number than dealer
            elif dtotal > 21 or (dtotal <= 21 and ptotal > dtotal and ptotal <= 21):
                rw = 1
            # if there's a draw
            elif dtotal == ptotal:
                rw = 0
            # player loses in all other situations
            else:
                rw = -1
            state = self.getState()
            # Returns (current_state, reward, boolean_true_if_episode_ended, empty_dict)
            return state, rw, True, {}
        # Hit
        else:
            # Player draws another card
            self.player.append(Blackjack.drawCard())
            # Calc new total for player
            pstate, ptotal = Blackjack.getTotal(self.player)
            state = self.getState()
            # Player went bust and episode is over
            if ptotal > 21:
                return state, -1, True, {}
            # Player is still in the game, but no observed reward yet
            else:
                return state, 0, False, {}
############################ END BLACKJACK CLASS ############################
# Converts a player's or dealer's hand into an array of 10 counts
# that keep track of how many of each card are held. The card is identified
# through its index:
# Index:  0  1  2  3  4  5  6  7  8  9
# Card:   2  3  4  5  6  7  8  9  T  A
def cardsToX(cards):
    ans = [0] * 12
    for c in cards:
        ans[c] += 1
    ans = ans[2:12]
    return ans

# Easy way to convert Q values into weighted decision probabilities via softmax.
# This is useful if we probabilistically choose actions based on their values rather
# than always choosing the max.
# eg  Q[s,0] = -1
#     Q[s,1] = -2
#     softmax([-1,-2]) = [0.731, 0.269] --> 73% chance of standing, 27% chance of hitting
def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
plt.ion()
# Define number of Neurons per layer
K = 20   # Layer 1
L = 10   # Layer 2
M = 5    # Layer 3
N_IN = 20 # 10 unique cards for player, and 10 for dealer = 20 total inputs
N_OUT = 2
SDEV = 0.000001
# Input / Output place holders
X = tf.placeholder(tf.float32, [None, N_IN])
X = tf.reshape(X, [-1, N_IN])
# This will be the observed reward + decay_factor * max(Q[s+1, 0], Q[s+1, 1]).
# This should be an estimate of the 'correct' Q-value, with the only caveat being that
# the Q-value of the next state is a biased estimate of the true value.
Q_TARGET = tf.placeholder(tf.float32, [None, N_OUT])
# LAYER 1
W1 = tf.Variable(tf.random_normal([N_IN, K], stddev = SDEV))
B1 = tf.Variable(tf.random_normal([K], stddev = SDEV))
# LAYER 2
W2 = tf.Variable(tf.random_normal([K, L], stddev = SDEV))
B2 = tf.Variable(tf.random_normal([L], stddev = SDEV))
# LAYER 3
W3 = tf.Variable(tf.random_normal([L, M], stddev = SDEV))
B3 = tf.Variable(tf.random_normal([M], stddev = SDEV))
# LAYER 4
W4 = tf.Variable(tf.random_normal([M, N_OUT], stddev = SDEV))
B4 = tf.Variable(tf.random_normal([N_OUT], stddev = SDEV))
H1 = tf.nn.relu(tf.matmul(X, W1) + B1)
H2 = tf.nn.relu(tf.matmul(H1, W2) + B2)
H3 = tf.nn.relu(tf.matmul(H2, W3) + B3)
# The predicted Q value, as determined by our network (function approximator)
# outputs expected reward for standing and hitting in the form [stand, hit] given the
# current game state
Q_PREDICT = (tf.matmul(H3, W4) + B4)
# Is this correct? The Q_TARGET should be a combination of the real reward and the discounted
# future rewards of the future state as predicted by the network. Q_TARGET - Q_PREDICT should be
# the error in prediction, which we want to minimise. Does this loss function work to help the network
# converge to the true Q values with sufficient training?
loss_func = tf.reduce_sum(tf.square(Q_TARGET - Q_PREDICT))
# These are some placeholder values to enable manually set decayed learning rates. For now, use
# the same learning rate all the time.
LR_START = 0.001
#LR_END = 0.000002
#LR_DECAY = 0.999
# Optimizer
LEARNING_RATE = tf.Variable(LR_START, trainable=False)
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)#(LEARNING_RATE)
train_step = optimizer.minimize(loss_func)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Initialise the game environment
game = Blackjack()
# Number of episodes (games) to play
num_eps = 10000000
# probability of picking a random action. This decays over time
epsilon = 0.1
# discount factor. For blackjack, future rewards are equally important as immediate rewards.
discount = 1.0
all_rewards = [] # Holds all observed rewards. The rolling mean of rewards should improve as the network learns
all_Qs = [] # Holds all predicted Q values. Useful as a sanity check once the network is trained
all_losses = [] # Holds all the (Q_TARGET - Q_PREDICTED) values. The rolling mean of this should decrease
hands = [] # Holds a summary of all hands played. (game_state, Q[stand], Q[hit], action_taken)
# boolean switch to use the highest action value instead of a stochastic decision via softmax on Q-values
use_argmax = True
# Begin generating episodes
for ep in range(num_eps):
    game.reset()

    # Keep looping until the episode is over
    while True:
        # x is the array of 20 numbers. The player cards, and the dealer cards.
        x = cardsToX(game.player) + cardsToX(game.dealer)

        # Q1 refers to the predicted Q-values before any action was taken
        Q1 = sess.run(Q_PREDICT, feed_dict = {X : np.reshape( np.array(x), (-1, N_IN) )})
        all_Qs.append(Q1)

        if use_argmax:
            # action is selected to be the one with the highest Q-value
            act = np.argmax(Q1)
        else:
            # action is a weighted selection based on predicted Q_values
            act = np.random.choice(range(N_OUT), p = softmax(Q1)[0])
        if random.random() < epsilon:
            # action is selected randomly
            act = random.randint(0, N_OUT-1)

        # Get game state before action is taken
        game_state = game.getState()
        # Take action! Observe new state, reward, and if the game is over
        game_state_new, reward, done, _ = game.step(act)
        hands.append( (game_state, Q1[0][0], Q1[0][1], act, reward) )

        # Store the new state vector to feed into our network.
        # x2 corresponds to the x vector observed in state s+1
        x2 = cardsToX(game.player) + cardsToX(game.dealer)

        # Q2 refers to the predicted Q-values in the new s+1 state. This is used for the Q-learning update.
        Q2 = sess.run(Q_PREDICT, feed_dict = {X : np.reshape( np.array(x2), (-1, N_IN) )})

        # Store the maximum Q-value in this new state. This should be the expected reward from this new state
        maxQ2 = np.max(Q2)

        # targetQ is the same as our predicted one initially. The index of the action we took will be
        # updated to be [observed reward] + [discount_factor] * max(Q[s+1])
        targetQ = np.copy(Q1)

        # If the game is done, then there is no future state
        if done:
            targetQ[0, act] = reward
            all_rewards.append(reward)
        else:
            targetQ[0, act] = reward + discount * maxQ2

        # Perform one gradient descent update, filling the placeholder value for Q_TARGET with targetQ.
        # The returned loss is the difference between the predicted Q-values and the targetQ we just calculated
        loss, _, _ = sess.run([loss_func, Q_PREDICT, train_step],
                              feed_dict = {X : np.reshape( np.array(x), (-1, N_IN) ),
                                           Q_TARGET : targetQ}
                              )
        all_losses.append(loss)

        # Every 1000 episodes, show how the Q-values moved after the gradient descent update
        if ep % 1000 == 0 and ep > 0:
            Q_NEW = sess.run(Q_PREDICT, feed_dict = {X : np.reshape( np.array(x), (-1, N_IN) ),
                                                     Q_TARGET : targetQ})
            #print(game_state, targetQ[0], Q1[0], (Q_NEW-Q1)[0], loss, ep, epsilon, act)

            rolling_window = 1000
            rolling_mean = np.mean( all_rewards[-rolling_window:] )
            rolling_loss = np.mean( all_losses[-rolling_window:] )
            print("Rolling mean reward: {:<10.4f}, Rolling loss: {:<10.4f}".format(rolling_mean, rolling_loss))

        if done:
            # Reduce chance of random action as we train the model.
            epsilon = 2/((ep/500) + 10)
            epsilon = max(0.02, epsilon)

            # rolling mean of rewards should increase over time!
            if ep % 1000 == 0 and ep > 0:
                pass  # Show the rolling mean of all losses. This should decrease over time!
                #plt.plot(pd.rolling_mean(pd.Series(all_losses), 5000))
                #plt.pause(0.02)
                #plt.show()
            break

print(cardsToX(game.player))
print(game.dealer)
Any ideas? I'm stuck :(

Writing Fibonacci Sequence Elegantly Python

I am trying to improve my programming skills by writing functions in multiple ways; this teaches me new ways of writing code and also helps me understand other people's styles. Below is a function that calculates the sum of all even numbers in a Fibonacci sequence up to a maximum value. Do you have any recommendations on writing this algorithm differently, maybe more compactly or more pythonically?
def calcFibonacciSumOfEvenOnly():
    MAX_VALUE = 4000000
    sumOfEven = 0
    prev = 1
    curr = 2
    while curr <= MAX_VALUE:
        if curr % 2 == 0:
            sumOfEven += curr
        temp = curr
        curr += prev
        prev = temp
    return sumOfEven
I do not want to write this function recursively since I know it takes up a lot of memory even though it is quite simple to write.
You can use a generator to produce the even numbers of a Fibonacci sequence up to the given max value, and then obtain the sum of the generated numbers:
def even_fibs_up_to(m):
    a, b = 0, 1
    while a <= m:
        if a % 2 == 0:
            yield a
        a, b = b, a + b
So that:
print(sum(even_fibs_up_to(50)))
would output: 44 (0 + 2 + 8 + 34 = 44)
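As a further variation (an aside of mine, relying on the fact that every third Fibonacci number is even), the parity check can be dropped entirely by stepping the recurrence three terms at a time:

def even_fibs_up_to_v2(m):
    # a is always an even Fibonacci number; b is the Fibonacci number that follows it
    a, b = 0, 1
    while a <= m:
        yield a
        # jump three steps at once: F(n+3) = F(n) + 2*F(n+1), F(n+4) = 2*F(n) + 3*F(n+1)
        a, b = a + 2 * b, 2 * a + 3 * b

print(sum(even_fibs_up_to_v2(50)))        # 44, same as above
print(sum(even_fibs_up_to_v2(4000000)))   # the original MAX_VALUE from the question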

Calculating power for repeated measures in gpower

I'm trying to calculate power for my repeated measures design in GPower. I'm confident I have the right result for my design as a single measure ANOVA:
2 factors, 3 levels each:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = .4
α err prob = 0.05
Power (1-β err prob) = .8
Numerator df = 2
Number of groups = 6
Output: Noncentrality parameter λ = 10.240000
Critical F = 3.155932
Denominator df = 58
Total sample size = 64
Actual power = 0.803690
Then here's my set-up for repeated measures:
F tests - ANOVA: Repeated measures, within-between interaction
Analysis: A priori: Compute required sample size
Input: Effect size f = .4
α err prob = 0.05
Power (1-β err prob) = .8
Number of groups = 6
Repetitions = 4
Corr among rep measures = 0.5
Nonsphericity correction ε = 1
Output: Noncentrality parameter λ = 30.720000
Critical F = 1.855810
Numerator df = 15.000000
Denominator df = 54.000000
Total sample size = 24
Actual power = 0.917180
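For what it's worth, here is a quick arithmetic check (my own reading of how GPower combines these inputs, verified only against the outputs above, not against its documentation) that reproduces both noncentrality parameters:

f = 0.4

# Fixed effects ANOVA (first design): lambda = f^2 * total_sample_size
print(f**2 * 64)                    # ~10.24, matches the lambda above

# Repeated measures, within-between interaction (second design):
# lambda = f^2 * total_sample_size * repetitions / (1 - corr_among_rep_measures)
N, m, rho = 24, 4, 0.5
print(f**2 * N * m / (1 - rho))     # ~30.72, matches the lambda above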
My questions are: are my group numbers the same for repeated measures as for the single-measure design?
And are a .5 correlation and a nonsphericity correction of 1 standard values, or are these parameters derived from the design?
Thanks!

Inequality constrained convex optimization in Matlab

I want to solve the following problem:
minimize    E[T]
subject to  λi * pi - μi <= 0                        for all i = 1, ..., n
            ( λ0 + Σ λi * (1 - pi) ) - μ0 <= 0
            pi - 1 <= 0                              for all i = 1, ..., n
            pi >= 0                                  for all i = 1, ..., n
where
E[T] =   ( λ0 + Σ λi * (1 - pi) )
       / ( ( λ0 + Σ λi ) * μ0 - ( λ0 + Σ λi * (1 - pi) ) )
       + Σ [ ( pi * λi ) / ( ( λ0 + Σ λi ) * ( μi - pi * λi ) ) ]
and all sums Σ run from i = 1 to n.
Here is what we know about the parameters: n = 2, λ0 = 0, μ0 = 1, λ1 is a free parameter, λ2 = 2, μ1 = μ2 = 2.
This problem can be handled as an inequality-constrained minimization problem.
I know that λ1 goes from 0 to 3, and what I want to get are p1 and p2, which are both between 0 and 1.
How should I choose the starting points? And can this problem be solved in Matlab at all?
I tried to use fmincon with the interior-point algorithm in Matlab, but I don't really know how a linearly increasing parameter can appear in the nonlinear constraints.
If you can offer suggestions, or other functions that can handle this problem properly, I would be pleased.
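In case it helps to see the structure, below is a minimal sketch of the same formulation in Python with scipy.optimize.minimize (SLSQP), sweeping the free parameter λ1 over a few values in its range; the variable names follow the problem statement above, and the starting point p0 = [0.5, 0.5] is just a neutral interior guess, not a recommendation. The same objective / constraint split maps directly onto fmincon's fun and nonlcon arguments.

# Minimal sketch (not a definitive solution) of the stated problem, using SciPy's
# SLSQP solver as a rough analogue of fmincon's constrained solvers.
import numpy as np
from scipy.optimize import minimize

n    = 2
lam0 = 0.0
mu0  = 1.0
lam2 = 2.0
mu   = np.array([2.0, 2.0])            # mu1 = mu2 = 2

def solve_for(lam1):
    lam = np.array([lam1, lam2])       # lambda_1 .. lambda_n

    def ET(p):
        # E[T] exactly as written in the problem statement above
        a_term = lam0 + np.sum(lam * (1 - p))
        total  = lam0 + np.sum(lam)
        first  = a_term / (total * mu0 - a_term)
        second = np.sum((p * lam) / (total * (mu - p * lam)))
        return first + second

    cons = [
        # lambda_i * p_i - mu_i <= 0                 ->  mu_i - lambda_i * p_i >= 0
        {'type': 'ineq', 'fun': lambda p: mu - lam * p},
        # (lambda_0 + sum lambda_i (1 - p_i)) - mu_0 <= 0
        {'type': 'ineq', 'fun': lambda p: mu0 - (lam0 + np.sum(lam * (1 - p)))},
    ]
    bounds = [(0.0, 1.0)] * n          # 0 <= p_i <= 1
    p0     = np.full(n, 0.5)           # neutral interior starting point

    return minimize(ET, p0, method='SLSQP', bounds=bounds, constraints=cons)

# sweep a few interior values of the free parameter lambda_1 in (0, 3);
# the endpoints can make the feasible set degenerate
for lam1 in (0.5, 1.0, 1.5, 2.0, 2.5):
    res = solve_for(lam1)
    print(lam1, res.x, res.fun, res.success)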

Data set too large to load into memory for processing

I have a large, rapidly growing data set of around 4 million rows. In order to define and exclude the outliers (for statistics / analytics usage) I need the algorithm to consider every entry in this data set. However, this is too much data to load into memory and my system chokes. I'm currently using this to collect and process the data:
@scoreInnerFences = innerFence Post.where( :source => 1 ).
                               order( :score ).
                               pluck( :score )
I don't think the typical divide-and-conquer method will work, because every entry has to be considered to keep my outlier calculation accurate. How can this be achieved efficiently?
innerFence identifies the lower quartile and upper quartile of the data set, then uses those findings to calculate the outliers. Here is the (yet to be refactored, non-DRY) code for this:
def q1(s)
  q = s.length / 4
  if s.length % 2 == 0
    return ( s[ q ] + s[ q - 1 ] ) / 2
  else
    return s[ q ]
  end
end

def q2(s)
  q = s.length / 4
  if s.length % 2 == 0
    return ( s[ q * 3 ] + s[ (q * 3) - 1 ] ) / 2
  else
    return s[ q * 3 ]
  end
end

def innerFence(s)
  q1 = q1(s)
  q2 = q2(s)
  iq = (q2 - q1) * 3
  if1 = q1 - iq
  if2 = q2 + iq
  return [if1, if2]
end
This is not the best way, but it is an easy way:
Do several queries. First, count the number of scores:
q = Post.where( :source => 1 ).count
then you do your calculations
then you fetch the scores
q1 = Post.where( :source => 1 ).
          reverse_order(:score).
          select("avg(score) as score").
          offset(q).limit((q%2)+1)

q2 = Post.where( :source => 1 ).
          reverse_order(:score).
          select("avg(score) as score").
          offset(q*3).limit((q%2)+1)
The code is probably wrong but I'm sure you get the idea.
For large datasets, I sometimes drop down below ActiveRecord. It's a memory hog, even, I imagine, when using pluck. Of course it's less portable, but sometimes it's worth it.
scores = Post.connection.execute('select score from posts where score > 1 order by score').map(&:first)
I don't know if that will help enough for 4 million records. If not, maybe look at a stored procedure?