I am working on a project where I want to measure the predictive performance of some categorical variables on click-through rate (continuous). However, the categorical variables are highly imbalanced:
packaged_goods: 796
food: 104
person: 61
bagged_packaged_goods: 35
tableware: 18
10 more categorical variables...
How can I best deal with this imbalance when preparing a regression analysis (in terms of the train/test/validation split), and how should I set this up?
I tried k-fold cross validation but this did not help with the imbalance problem...
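For concreteness, a stratified split along the lines I'm considering might look like this (scikit-learn; the file and column names are placeholders):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # placeholder dataset with a "category" column and a "ctr" target

# Split off a test set while keeping the category proportions roughly equal.
# Note: stratification needs at least 2 samples per category, so very rare
# categories may have to be merged into an "other" bucket first.
train_val, test = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=0)

# Split the remainder into train and validation, again stratified.
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["category"], random_state=0)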
I'm investigating the task of training a neural network to predict one future value given a sinusoidal input. So, for example, as seen in the Figure, the input signal is x and the expected output signal is y. The model's output is y^. Doing the regression task is fairly straightforward, and there are a lot of choices for this problem. I'm using a simple recurrent neural network with mean-squared error (MSE) loss between y and y^.
Additionally, suppose I know that the sinusoid is made up of N modalities, e.g., at some points, the wave oscillates at 5 Hz, then 10 Hz, then back to 5 Hz, then up to 15 Hz maybe—i.e., N=3.
In this case, I have ground-truth class labels in a vector k and the model does both regression and classification, additionally outputting a vector k^. An example is shown in the Figure. As this is a multi-class problem with exclusivity, I figured binary cross entropy (BCE) loss should be relevant here.
I'm sure there is a lot of research about combining loss functions, but does just adding MSE and BCE make sense? Scaling one up or down by a factor of 10 doesn't seem to change the learning outcome too much. So I was wondering what is considered the standard approach to problems where there is a joint classification and regression objective.
Additionally, on top of just BCE, I want to penalize k^ for quickly jumping around between classes; for example, if the model guesses one class, I'd like it to stay in that one class and switch only when it's necessary. See how in the Figure, there are fast dark blue blips in k^. I would like the same solid bands as seen in k, and naive BCE loss doesn't account for that.
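To make it concrete, here is a rough sketch of the kind of combined loss I have in mind (PyTorch; I use the multi-class cross-entropy form here since the classes are exclusive, and the weights alpha and beta and the smoothness term are placeholders, not something I've settled on):

import torch
import torch.nn.functional as F

def joint_loss(y_hat, y, k_logits, k, alpha=1.0, beta=0.1):
    # y_hat, y:  (batch, time) regression output and target
    # k_logits:  (batch, time, num_classes) classification logits
    # k:         (batch, time) integer class labels
    mse = F.mse_loss(y_hat, y)
    ce = F.cross_entropy(k_logits.reshape(-1, k_logits.shape[-1]), k.reshape(-1))
    # Penalize rapid class switching: discourage large changes in the
    # predicted class probabilities between consecutive time steps.
    probs = k_logits.softmax(dim=-1)
    switch_penalty = (probs[:, 1:] - probs[:, :-1]).abs().mean()
    return mse + alpha * ce + beta * switch_penalty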
Appreciate any and all advice!
I'm trying to solve the LunarLander continuous environment from OpenAI Gym (solving LunarLanderContinuous-v2 means getting an average reward of 200 over 100 consecutive trials), aiming for the best possible average reward over 100 straight episodes in this environment.
The difficulty is that I treat the lunar lander as subject to uncertainty (motivation: observations in the real physical world are often noisy). Specifically, I add zero-mean Gaussian noise with std = 0.05 to the PositionX and PositionY observations of the lander's location.
I also discretise the LunarLander actions to a finite number of actions instead of the continuous range the environment allows, roughly as in the sketch below.
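For reference, the noise and the discretisation could be implemented roughly like this (a simplified sketch against the old Gym API, not my exact wrappers; the discrete action set is just an example):

import numpy as np
import gym

class NoisyDiscreteLunarLander(gym.Wrapper):
    # Adds Gaussian noise to the position observations and exposes a small
    # discrete action set over the continuous (main, lateral) engine controls.
    ACTIONS = [np.array(a, dtype=np.float32) for a in
               [(-1.0, 0.0), (0.0, 0.0), (0.5, -0.5), (0.5, 0.5), (1.0, 0.0)]]

    def __init__(self, env, noise_std=0.05):
        super().__init__(env)
        self.noise_std = noise_std
        self.action_space = gym.spaces.Discrete(len(self.ACTIONS))

    def _noisy(self, obs):
        obs = obs.copy()
        obs[0] += np.random.normal(0.0, self.noise_std)  # PositionX
        obs[1] += np.random.normal(0.0, self.noise_std)  # PositionY
        return obs

    def reset(self, **kwargs):
        return self._noisy(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(self.ACTIONS[action])
        return self._noisy(obs), reward, done, info

env = NoisyDiscreteLunarLander(gym.make("LunarLanderContinuous-v2"))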
So far I'm using DQN, double-DQN and Duelling DDQN.
My hyperparameters are:
gamma
epsilon start
epsilon end
epsilon decay
learning rate
number of actions (discretisation)
target update
batch size
optimizer
number of episodes
network architecture.
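For a sense of how I organise these, a configuration along these lines (the values here are illustrative placeholders, not my exact settings):

config = {
    "gamma": 0.99,
    "epsilon_start": 1.0,
    "epsilon_end": 0.01,
    "epsilon_decay": 0.995,        # multiplicative decay per episode
    "learning_rate": 1e-4,
    "num_actions": 5,              # discretisation of the continuous actions
    "target_update": 1000,         # steps between target-network syncs
    "batch_size": 64,
    "optimizer": "Adam",
    "num_episodes": 2000,
    "hidden_layers": [128, 128],   # network architecture
}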
I'm having difficulty reaching good, or even mediocre, results.
Does anyone have advice on which hyperparameter changes I should make to improve my results?
Thanks!
I'm trying to replicate DQN scores for Breakout using RLlib. After 5M steps the average reward is 2.0, while the known score for Breakout using DQN is 100+. I'm wondering if this is because of reward clipping, so the reported reward does not correspond to the actual Atari score. In OpenAI Baselines, the actual score is placed in info['r'] while the reward value is the clipped one. Is this the same case for RLlib? Is there any way to see the actual average score while training?
According to the list of trainer parameters, the library will clip Atari rewards by default:
# Whether to clip rewards prior to experience postprocessing. Setting to
# None means clip for Atari only.
"clip_rewards": None,
However, the episode_reward_mean reported on tensorboard should still correspond to the actual, non-clipped scores.
While an average score of 2 is not much at all relative to the benchmarks for Breakout, 5M steps may not be enough for DQN unless you are employing something akin to Rainbow to significantly speed things up. Even then, DQN is notoriously slow to converge, so you may want to check your results with a longer run and/or consider upgrading your DQN configuration.
I've thrown together a quick test, and it looks like reward clipping doesn't have much of an effect on Breakout, at least early in training (unclipped in blue, clipped in orange).
I don't know enough about Breakout's scoring system to comment in detail, but if higher rewards become available later on as performance improves (as opposed to getting the same small reward more frequently, say), we should start seeing the two diverge.
In such cases, we can still normalize the rewards or convert them to logarithmic scale.
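For instance, a sign-preserving log transform (just a sketch of the idea) would be:

import numpy as np

def log_scale_reward(r):
    # Compress the reward magnitude while preserving its sign.
    return np.sign(r) * np.log1p(np.abs(r))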
Here are the configurations I used:
lr: 0.00025
learning_starts: 50000
timesteps_per_iteration: 4
buffer_size: 1000000
train_batch_size: 32
target_network_update_freq: 10000
# (some) rainbow components
n_step: 10
noisy: True
# work-around to remove epsilon-greedy
schedule_max_timesteps: 1
exploration_final_eps: 0
prioritized_replay: True
prioritized_replay_alpha: 0.6
prioritized_replay_beta: 0.4
num_atoms: 51
double_q: False
dueling: False
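For reference, this is roughly how such a config can be passed in (a sketch assuming the ray.tune.run entry point and the BreakoutNoFrameskip-v4 environment id; only a few of the keys above are repeated here):

import ray
from ray import tune

ray.init()
tune.run(
    "DQN",
    stop={"timesteps_total": 10000000},  # placeholder stopping condition
    config={
        "env": "BreakoutNoFrameskip-v4",
        "lr": 0.00025,
        "learning_starts": 50000,
        # ... plus the remaining keys listed above ...
        "double_q": False,
        "dueling": False,
    },
)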
You may be more interested in their rl-experiments repository, where they post results from their own library against the standard benchmarks along with the configurations used; with those you should be able to get even better performance.
I am trying to construct an RNN to predict the probability of a player playing the match, along with the runs scored and wickets taken by that player. I would use an LSTM so that performance in the current match influences the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete (binary): Did the player play?
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "softmax" or "MSE" in the final layer to process the activation "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle the mix of continuous and discrete outputs in the loss function?
(Though the LSTM output "a" has multiple features and carries information to the next time step, we need multiple features at the output for training against the ground truth.)
You just do it. Without more detail on the software (if any) in use, it is hard to give more detail.
The output of the LSTM unit at every time step is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1. Sigmoid (did the player play).
2. I'd experiment with this a bit. Maybe several sigmoids, one per possible wicket count in an innings, or a ReLU.
3, 4. Linear (squaring is also an option, or a ReLU).
For training purposes your loss function is the sum of your 4 individual losses.
If they were all MSE you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you'd calculate them separately and sum.
You can still concatenate them afterwards to have an output vector.
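As a rough sketch (Keras-style, since no framework was specified; layer sizes and names are arbitrary placeholders), the four heads and the summed loss could look like:

from tensorflow.keras import layers, Model

num_features = 10  # placeholder: input features per time step (venue, teams, etc.)

inp = layers.Input(shape=(None, num_features))
h = layers.LSTM(64, return_sequences=True)(inp)

played  = layers.Dense(1, activation="sigmoid", name="played")(h)   # did the player play
wickets = layers.Dense(1, activation="relu",    name="wickets")(h)  # non-negative count
runs    = layers.Dense(1, activation="linear",  name="runs")(h)
balls   = layers.Dense(1, activation="linear",  name="balls")(h)

model = Model(inp, [played, wickets, runs, balls])
model.compile(
    optimizer="adam",
    loss={"played": "binary_crossentropy",
          "wickets": "mse",
          "runs": "mse",
          "balls": "mse"},
)
# Keras optimises the (weighted) sum of the four individual losses.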
How exactly do kur test and kur evaluate differ?
These are the differences we see from the console:
(dlnd-tf-lab) ->kur evaluate mnist.yml
Evaluating: 100%|████████████████████████████| 10000/10000 [00:04<00:00, 2417.95samples/s]
LABEL CORRECT TOTAL ACCURACY
0 949 980 96.8%
1 1096 1135 96.6%
2 861 1032 83.4%
3 868 1010 85.9%
4 929 982 94.6%
5 761 892 85.3%
6 849 958 88.6%
7 935 1028 91.0%
8 828 974 85.0%
9 859 1009 85.1%
ALL 8935 10000 89.3%
Focus on one: /Users/Natsume/Downloads/kur/examples
(dlnd-tf-lab) ->kur test mnist.yml
Testing, loss=0.458: 100%|█████████████████████| 3200/3200 [00:01<00:00, 2427.42samples/s]
Without digging into the source code behind kur test and kur evaluate, how can we understand exactly how they differ?
@ajsyp, the developer of Kur (a deep learning library), provided the following answer, which I found very helpful.
kur test is used when you know what the "correct answer" is, and you
simply want to see how well your model performs on a held-out sample.
kur evaluate is pure inference: it is for generating results from
your trained model.
Typically in machine learning you split your available data into 3
sets: training, validation, and testing (people sometimes call these
different things, just so you're aware). For a particular model
architecture / selection of model hyperparameters, you train on the
training set, and use the validation set to measure how well the model
performs (is it learning correctly? is it overtraining? etc). But you
usually want to compare many different model hyperparameters: maybe
you tweak the number of layers, or their size, for example.
So how do you select the "best" model? The most naive thing to do is
to pick the model with the lowest validation loss. But then you run
the risk of optimizing/tweaking your model to work well on the
validation set.
So the test set comes into play: you use the test set as a very final,
end of the day, test of how well each of your models is performing.
It's very important to hide that test set for as long as possible,
otherwise you have no impartial way of knowing how good your model is
or how it might compare to other models.
kur test is intended to be used to run a test set through the model
to calculate loss (and run any applicable hooks).
But now let's say you have a trained model, say an image recognition
model, and now you want to actually use it! You get some new data (you
probably don't even have "truth" labels for them, just the raw
images), and you want the model to classify the images. That's what
kur evaluate is for: it takes a trained model and uses it "in
production mode," where you don't have/need truth values.