How to model Multi Discrete action space with 720 possible combinatorial actions? - deep-learning

I have a scheduling problem where the state/observation is an image of 125x100 pixels. The action is a sequence of three sub-actions the agent can take:
MultiDiscrete [1 to 20, 0 to 5, 0 to 5]. These give a total of 20 * 6 * 6 = 720 possible actions.
I am currently using a DQN algorithm to train the agent. At every 'step' only the Q-value of one of these 720 actions is updated, which makes the learning signal very sparse. I trained for 100,000 iterations but it didn't converge.
How can I train the agent with DQN in this situation? How much does the training time increase with such a large action space?
Is there an alternative algorithm that works better in these scenarios? In future versions of the problem the action space may grow even further.
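For reference, the 20 * 6 * 6 = 720 count corresponds to treating every combination as one discrete action, so a standard DQN head would output 720 Q-values. Below is a minimal sketch of that mapping, assuming NumPy; the names NVEC, flat_to_multi and multi_to_flat are made up for illustration and are not part of any RL library.

```python
import numpy as np

# Sizes of the three sub-actions from the question: 20 x 6 x 6 = 720.
NVEC = (20, 6, 6)
N_ACTIONS = int(np.prod(NVEC))


def flat_to_multi(index):
    """Map a flat index in [0, 720) to the (a1, a2, a3) triple."""
    a1, a2, a3 = np.unravel_index(index, NVEC)
    # The first sub-action runs from 1 to 20, the other two from 0 to 5.
    return int(a1) + 1, int(a2), int(a3)


def multi_to_flat(a1, a2, a3):
    """Inverse mapping: (a1, a2, a3) back to a flat index in [0, 720)."""
    return int(np.ravel_multi_index((a1 - 1, a2, a3), NVEC))


print(N_ACTIONS)
print(flat_to_multi(multi_to_flat(20, 5, 5)))
```

With this mapping the network's argmax over 720 outputs can be decoded back into the three sub-actions before being passed to the environment.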

Related

Ways to prevent underfitting and overfitting when using data augmentation to train a transposed CNN

I'm training a CNN (one using a series of ConvTranspose2d layers in PyTorch) that uses input data from JSON to constitute an image. Unlike natural language, the input data can be in any order, as it contains info about various sprites in a scene.
In my first attempts to train the model, I didn't change the order of the input data (meaning, on each epoch, each sprite was represented in the same place in the input data). The model learned for about 10 epochs, but then there started to be divergence between the training loss (which continued to go down) and the test loss. So classic overfitting.
I tried to solve this by doing a form of data augmentation where the output data (in this case an image) stayed the same but I shuffled the order of the input data. As I have around 400 sprites, the maximum shuffling is 400!, so theoretically this can vastly expand the amount of training data. For example, instead of 100k JSON documents corresponding to 100K images, by shuffling the order of sprites in the input data, you have 400!*100000 training data points. In practice of course this amount of data is impractical, so I went with around 2m data points for an initial test. The issue I ran into here was that the model was not learning at all - after getting to a certain loss very quickly (after the first few mini-batches), it didn't learn at all for around 4 epochs. So classic underfitting.
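To make the shuffling concrete, here is a minimal sketch of the augmentation described above. It assumes each JSON document has already been encoded as an array with one row per sprite; the feature width of 4 and the name shuffle_sprites are hypothetical. Drawing a fresh permutation each time a sample is loaded is one way to avoid materializing millions of shuffled copies up front.

```python
import numpy as np


def shuffle_sprites(sprites, rng):
    """Return a copy of `sprites` with its rows (one per sprite) in random order."""
    perm = rng.permutation(len(sprites))
    return sprites[perm]


rng = np.random.default_rng(0)

# Hypothetical encoding: 400 sprites, 4 features each.
sprites = np.arange(400 * 4, dtype=np.float32).reshape(400, 4)

# The target image is left untouched; only the input order changes.
shuffled = shuffle_sprites(sprites, rng)
print(shuffled.shape)
```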
Like Goldilocks, I'd like to find the "just right" point between the initial overfitting and the subsequent underfitting. I'm wondering what other strategies I could try out. One idea I had was letting the model train on a predetermined order of sprites (the overfitting case) and then, once overfitting starts (i.e. two straight epochs with divergence between the test and training loss), shuffling the data. I can also play with changing the model, although it can only be so big because of hardware constraints and the fact that inference needs to happen in under 20ms.
Are there any papers or techniques that are recommended in this scenario where data augmentation can lead to vastly more data points but results in a model ceasing to learn? Thanks in advance for any tips!

Data augmentation stops overfitting by preventing learning entirely?

I am training a network to classify psychosis (binary classification as either healthy or psychosis) given an MRI scan of a subject. My dataset is 500 items, where I am using 350 for training and 150 for validation. Around 44% of the dataset is healthy, and ~56% has psychosis.
When I train the network without data augmentation, the training loss begins decreasing immediately while validation loss never changes. The red line in the accuracy graph below is the dominant class percentage (56%).
When I re-train using data augmentation 80% of the time (random affine, blur, noise, flip), overfitting is prevented, but now nothing is learned at all.
So I suppose my question is: What are some ideas for how to get the validation accuracy to increase? i.e. get the network to learn things without overfitting...

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The documentation of MALLET mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET furthermore provides an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that the model quality decreases with the --num-iterations set to a too high value?
On a personal project, averaged over 10-fold cross-validation, increasing the number of iterations from 100 to 1000 did not impact the average accuracy (measured as Mean Reciprocal Rank) for a downstream task. However, within the individual cross-validation splits the performance changed significantly, although the random seed was fixed and all other parameters were kept the same. What part of background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
The 1000 iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1000000 iterations, and fully half the token assignments never changed from the 1000 iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.
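One way to answer that question is to tabulate MRR per fold and per iteration count instead of looking only at the overall mean. The following is a minimal sketch; the ranks are hypothetical numbers made up for illustration, not results from the project above.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first relevant item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for illustration only: results[fold][iterations] -> ranks
results = {
    0: {100: [1, 2, 4], 1000: [1, 2, 4]},
    1: {100: [2, 5, 10], 1000: [1, 10, 10]},
}

# MRR for every (fold, iteration count) pair.
per_fold = {
    fold: {iters: mean_reciprocal_rank(r) for iters, r in by_iters.items()}
    for fold, by_iters in results.items()
}

for fold, scores in per_fold.items():
    print(fold, {iters: round(mrr, 3) for iters, mrr in scores.items()})
```

If each row is stable across iteration counts but the rows differ from each other, the folds simply differ in difficulty; if the values within a row move, the iteration count itself matters for that fold.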

Neural Network : Epoch and Batch Size

I am trying to train a neural network to classify words into different categories.
I notice two things:
When I use a smaller batch_size (like 8, 16, 32), the loss does not decrease steadily but fluctuates erratically. When I use a larger batch_size (like 128, 256), the loss goes down, but very slowly.
More importantly, when I use a larger EPOCH value, my model does a good job at reducing the loss. However I'm using a really large value (EPOCHS = 10000).
Question:
How to get the optimal EPOCH and batch_size values?
There is no way to decide on these values based on some rules. Unfortunately, the best choices depend on the problem and the task. However, I can give you some insights.
When you train a network, you calculate a gradient that would reduce the loss. To do that, you need to backpropagate the loss. Ideally, you would compute the loss over all of the samples in your data, because the resulting gradient then reflects every sample. In practice, this is usually too expensive, because every single update requires a forward pass over all your samples. That case corresponds to batch_size = N, where N is the total number of data points you have.
Therefore, we use a small batch_size as an approximation. Instead of considering all the samples, we compute the gradient on a small subset of them, at the cost of losing some information about the true gradient.
Rule of thumb:
Smaller batch sizes give noisy gradients, but they converge faster because you have more updates per epoch. If your batch size is 1, you will have N updates per epoch. If it is N, you will only have 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient but converge more slowly.
That is the reason why for smaller batch sizes, you observe varying losses because the gradient is noisy. And for larger batch sizes, your gradient is informative but you need a lot of epochs since you update less frequently.
The ideal batch size is one that is large enough to give you informative gradients but small enough that you can still train the network efficiently. In practice, you can only find it by experimenting.
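The update counts in the rule of thumb above are simple arithmetic. Here is a minimal sketch, where the dataset size of 10,000 is a hypothetical number and updates_per_epoch is just an illustrative helper:

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the data (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


n = 10_000  # hypothetical dataset size
for batch_size in (1, 8, 32, 128, 256, n):
    print(batch_size, updates_per_epoch(n, batch_size))
```

This is why a run with batch_size = 256 needs many more epochs than one with batch_size = 8 to perform the same number of updates.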

Estimating the training time in convolutional neural network

I want to know whether it is possible to estimate the training time of a convolutional neural network, given parameters like depth, filter, size of input, etc.
For instance, I am working on a 3D convolutional neural network whose structure is like:
a (20x20x20) convolutional layer with stride of 1 and 8 filters
a (20x20x20) max-pooling layer with stride of 20
a fully connected layer mapping to 8 nodes
a fully connected layer mapping to 1 output
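As a rough starting point, the multiply-accumulate (MAC) count of the convolutional layer in a list like the one above can be sketched as follows. The input side length of 100 and the single input channel are hypothetical, since they are not stated here, and the count covers only the forward pass of one sample with no padding.

```python
def conv3d_macs(input_side, kernel_side, in_channels, out_channels, stride=1):
    """Forward-pass multiply-accumulates of one 3D conv layer on a cubic input, no padding."""
    out_side = (input_side - kernel_side) // stride + 1
    outputs = out_side ** 3 * out_channels            # number of output values
    macs_per_output = kernel_side ** 3 * in_channels  # work needed for each output value
    return outputs * macs_per_output


# Hypothetical input: a 100x100x100 volume with 1 channel.
macs = conv3d_macs(input_side=100, kernel_side=20, in_channels=1, out_channels=8)
print(f"{macs:.2e} MACs per sample (forward pass only)")
```

Multiplying such a count by the number of samples and epochs, and dividing by the throughput the hardware actually sustains, gives only an order-of-magnitude estimate; timing a handful of batches and extrapolating is usually more reliable.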
I am running 100 epochs and printing the loss (mean squared error) every 10 epochs. It has now run for 24 hours and no loss has been printed (I suppose it has not finished 10 epochs yet). By the way, I am not using a GPU.
Is it possible to estimate the training time with a formula or something like that? Is it related to time complexity or to my hardware? I also found the following paper; will it give me some information?
https://ai2-s2-pdfs.s3.amazonaws.com/140f/467566e799f32831db6913d84ccdbdcac0b2.pdf
Thanks in advance.