Soft Actor Critic - Losses are not converging - reinforcement-learning

I'm trying to implement the soft actor-critic (SAC) algorithm on financial data (stock prices), and I'm having trouble with the losses: no matter what combination of hyperparameters I try, they do not converge, which in turn leads to poor reward returns. It seems like the agent is not learning at all.
My question is: could this be related to the data itself (the nature of the data), or is it something related to the logic of the code?
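For reference, here is roughly what my loss computation is trying to do, as a minimal sketch (PyTorch; the actor/critic module names, the actor.sample() method, and the batch layout are placeholders, not my actual code). Checking the implementation against these standard SAC targets is what I am doing before blaming the data:

```python
# Minimal sketch of the standard SAC losses (PyTorch; names are placeholders).
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, q1, q2, q1_targ, q2_targ, alpha=0.2, gamma=0.99):
    s, a, r, s2, done = batch  # state, action, reward, next state, done flag

    # Critic target: soft Bellman backup using the minimum of two target critics.
    with torch.no_grad():
        a2, logp_a2 = actor.sample(s2)  # next action and its log-probability
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        target = r + gamma * (1.0 - done) * (q_next - alpha * logp_a2)

    critic_loss = F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)

    # Actor loss: maximize the entropy-regularized Q-value (minimize its negative).
    a_new, logp_new = actor.sample(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    actor_loss = (alpha * logp_new - q_new).mean()

    return critic_loss, actor_loss
```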

Related

Variable actions in Deep Reinforcement Learning

I'm trying to teach an AI the combat mechanics of a system similar to Darkest Dungeon.
The goal is for the AI to act well while controlling NPCs with random stats and random skills. This means that in each session the AI's character will have different values for health, stress, accuracy, dodge, etc. The stats of each skill the character has also have random values: the damage, accuracy, effects.
In my current system:
The inputs for the model are:
All the stats that the AI's character has.
All the stats of its allies.
A subset of the stats of each enemy (in Darkest Dungeon you cannot see all of the enemies' stats).
All the stats of each skill.
The outputs for the model are:
Which skill to use (out of 4 options).
Which target (8 options in total: any ally, including self, or any enemy).
I'm using an action mask to disable invalid actions (such as using an offensive skill on an ally or targeting a position that has no character in it).
The main problem I'm having is that what each action does changes heavily depending on the stats of the skill in that index.
Does anyone have insight into which kind of learning I'm looking for? So far I have tried MA-POCA, provided by the Unity ML-Agents package, with no success; the model didn't seem to understand that what each action does depends heavily on the associated skill stats.
Searching for papers on the subject only turned up articles about action spaces of variable size, which I already handle by masking invalid actions.
Note: I'm not limited to training in the Unity environment. The only limitation I have is that the model must be convertible/exportable to ONNX format.
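For concreteness, here is a minimal sketch of the kind of masked, skill-conditioned policy head I have in mind, where each skill slot is scored from its own stat vector by a shared network instead of the policy learning a fixed meaning per index (PyTorch; the shapes, module names, and the 4-skill layout are placeholders, not my actual MA-POCA setup):

```python
# Sketch: score each skill from its stats plus a shared context embedding,
# then mask invalid options before building the action distribution.
import torch
import torch.nn as nn

class SkillConditionedPolicy(nn.Module):
    def __init__(self, context_dim, skill_dim, hidden=64):
        super().__init__()
        self.context = nn.Sequential(nn.Linear(context_dim, hidden), nn.ReLU())
        # One shared scorer applied to every skill slot, so the network learns
        # "what this skill does" from its stats, not from its index.
        self.skill_score = nn.Sequential(
            nn.Linear(hidden + skill_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, context, skill_feats, mask):
        # context: (B, context_dim) character/ally/enemy stats
        # skill_feats: (B, 4, skill_dim) stats of the 4 skills
        # mask: (B, 4) with 1 for usable skills, 0 for invalid ones
        ctx = self.context(context).unsqueeze(1).expand(-1, skill_feats.size(1), -1)
        logits = self.skill_score(torch.cat([ctx, skill_feats], dim=-1)).squeeze(-1)
        logits = logits.masked_fill(mask == 0, float("-inf"))
        return torch.distributions.Categorical(logits=logits)
```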

Deep reinforcement learning when similar observations need totally different actions: how to solve it?

For DRL using neural networks, like DQN: if a task needs totally different actions for similar observations, will the NN show its weakness here? Will two nearby inputs to the NN generate similar outputs? If so, can it not produce the distinct actions the task needs?
For instance:
The agent can choose a discrete action from [A, B, C, D, E], and the observation is the state of a set of plugs given as a binary list, e.g. [0,0,0,0,0,0,0].
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should take action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. These two observations are very close in distance, so the DQN may not easily learn the proper action. How can this be solved?
One more thing:
One-hot encoding is a way to increase the distance between observations, and it is a common and useful technique for many supervised learning tasks. But one-hot encoding also increases the dimensionality heavily.
Will two nearby inputs to the NN generate similar outputs?
Artificial neural networks are, by nature, non-linear function approximators, meaning that for two similar inputs the outputs can be very different.
You can get an intuition from a classic example: two very similar pictures, one with just a small amount of noise added to it, can produce very different predictions from the same model.
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should take action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. These two observations are very close in distance, so the DQN may not easily learn the proper action.
I see no problem with this example: a properly trained NN should be able to map the desired action for both inputs. Furthermore, in your example the input vectors contain binary values, and a single difference between these vectors (i.e. a Hamming distance of 1) is big enough for the neural net to classify them properly.
Also, the non-linearity in neural networks comes from the activation functions. Hope this helps!
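As a sanity check of this claim, here is a tiny sketch (Python/scikit-learn; the toy dataset and hyperparameters are arbitrary) that fits a small MLP mapping the two Hamming-distance-1 observations to different actions:

```python
# Toy demonstration: a small MLP can map two inputs that differ in a single
# bit (Hamming distance 1) to completely different discrete actions.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([
    [1, 1, 1, 1, 1, 1, 1],  # should map to action A
    [1, 1, 1, 1, 1, 1, 0],  # should map to action D
])
y = np.array([0, 3])  # actions [A, B, C, D, E] encoded as 0..4

# Non-linear activations let the network separate nearby inputs.
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    solver="lbfgs", max_iter=1000, random_state=0)
clf.fit(X, y)

print(clf.predict(X))  # expected: [0 3]
```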

Regarding overfitting a model

Suppose I want to learn a dataset, and I pick a deep NN model with some complexity (say 5 layers).
Now, I know that this model is overfitting my data when I notice that, during training, the train loss and validation loss both decrease until N epochs, after which the validation loss starts to go up.
My question is: Is this model good for my data if I stop at the Nth epoch? Or is it an over-complicated model in the first place, regardless of whether I stop training at N epochs? Should I just discard this architecture and hunt for a better one?
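In case it matters, the "stop at the Nth epoch and keep the weights from that point" behaviour I am describing corresponds to early stopping. A minimal sketch of what I mean (assuming a Keras/TensorFlow setup; the model architecture and the x_train/x_val arrays are placeholders for my own data):

```python
# Early stopping sketch (Keras/TensorFlow assumed; model and data are placeholders).
import tensorflow as tf

# A stand-in 5-layer network for the model described above.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(64, activation="relu") for _ in range(4)]
    + [tf.keras.layers.Dense(1)]
)
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # the validation loss that starts rising after N epochs
    patience=5,                 # tolerate a few epochs without improvement
    restore_best_weights=True,  # roll back to the weights from the best (~Nth) epoch
)

# x_train, y_train, x_val, y_val: placeholders for the dataset.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=200,
                    callbacks=[early_stop])
```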

Q-learning for blackjack: reward function?

I am currently learning reinforcement learning and have built a blackjack game.
There is an obvious reward at the end of the game (the payout), but some actions that do not directly lead to rewards (such as hitting on a count of 5) should be encouraged, even if the end result is negative (losing the hand).
My question is: what should the reward be for those actions?
I could hard-code a positive reward (a fraction of the reward for winning the hand) for hits that do not lead to busting, but it feels like I am not approaching the problem correctly.
Also, when I assign a reward for a win (after the hand is over), I update the Q-value corresponding to the last action/state pair, which seems suboptimal, as this action may not have directly led to the win.
Another option I thought of is to assign the same end reward to all of the action/state pairs in the sequence; however, some actions (like hitting on a count below 10) should be encouraged even if the hand is eventually lost.
Note: my end goal is to use deep RL with an LSTM, but I am starting with Q-learning.
I would say start simple and use the rewards the game dictates: if you win, you receive a reward of +1; if you lose, -1.
It seems you'd like to reward some actions based on human knowledge. Instead, maybe start with epsilon-greedy exploration and let the agent discover all actions itself. Experiment with the discount hyperparameter, which determines the importance of future rewards, and see whether the agent comes up with some interesting strategies.
This blog post is about RL and blackjack:
https://towardsdatascience.com/playing-blackjack-using-model-free-reinforcement-learning-in-google-colab-aa2041a2c13d
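To make the suggestion concrete, here is a minimal sketch of a tabular Q-learning update with terminal-only rewards (plain Python; the state representation and helper names are placeholders). The discount factor is what propagates the end-of-hand reward back to earlier hits over many episodes, so intermediate actions need no hand-crafted reward:

```python
# Tabular Q-learning with terminal-only rewards (+1 win, -1 loss, 0 otherwise).
from collections import defaultdict

alpha, gamma = 0.1, 0.95       # learning rate and discount factor
Q = defaultdict(float)         # Q[(state, action)] -> estimated value

def q_update(state, action, reward, next_state, done, legal_actions):
    """One Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    if done:
        target = reward        # terminal payout of the hand
    else:
        target = reward + gamma * max(Q[(next_state, a)] for a in legal_actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```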

Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?

I want to do sentiment analysis using a machine learning (text classification) approach, for example the NLTK Naive Bayes classifier.
The issue is that only a small amount of my data is labeled: for example, 100 articles are labeled positive or negative, and 500 articles are not labeled.
I was thinking I could train the classifier on the labeled data and then try to predict the sentiment of the unlabeled data.
Is it possible?
I am a beginner in machine learning and don't know much about it.
I am using python 3.7.
Thank you in advance.
Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?
Yes. This is basically the definition of supervised learning:
you train on data that has labels so that you can then put the model into production to categorize your data that does not have labels.
(Any book on supervised learning will have code examples.)
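For example, a minimal sketch of that workflow with the NLTK Naive Bayes classifier mentioned in the question (the feature function and variable names are illustrative placeholders):

```python
# Train on labeled articles, then predict labels for the unlabeled ones.
# Requires the 'punkt' tokenizer data: nltk.download('punkt')
import nltk

def word_features(text):
    # Simple bag-of-words features: every token present in the article.
    return {word.lower(): True for word in nltk.word_tokenize(text)}

def train_and_predict(labeled_articles, unlabeled_articles):
    # labeled_articles: list of (article_text, "positive" | "negative") pairs
    # unlabeled_articles: list of article_text strings
    train_set = [(word_features(text), label) for text, label in labeled_articles]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return [classifier.classify(word_features(text)) for text in unlabeled_articles]
```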
I wonder if your question might really be: can I use supervised learning to make a model, assign labels to the other 500 articles, and then do further machine learning on all 600 articles? The answer is still yes, but the quality will fall somewhere between these two extremes:
Assign random labels to the 500: bad results.
Get a domain expert to assign correct labels to those 500: good results.
Your model will fall somewhere between those two extremes. It is useful to know where, so you can tell whether it is worth using the data. You can get an estimate by taking a sample, say 25 records, and having them labeled by a domain expert as well. If all 25 match, there is a reasonable chance your other 475 records have also been given good labels. If, e.g., only 10 of the 25 match, the model is much closer to the random end of the spectrum, and using the other 475 records is probably a bad idea.
("10", "25", etc. are arbitrary examples; choose based on the number of different labels, and your desired confidence in the results.)