Epochs and Iterations in Deeplearning4j - deep-learning

I recently started learning Deeplearning4j and I fail to understand how the concept of epochs and iterations is actually implemented.
In the online documentation it says:
an epoch is a complete pass through a given dataset ...
Not to be confused with an iteration, which is simply one
update of the neural net model’s parameters.
I ran a training using a MultipleEpochsIterator, but for the first run I set 1 epoch, miniBatchSize = 1 and a dataset of 1000 samples, so I expected the training to finish after 1 epoch and 1000 iterations, but after more than 100.000 iterations it was still running.
int nEpochs = 1;
int miniBatchSize = 1;
MyDataSetFetcher fetcher = new MyDataSetFetcher(xDataDir, tDataDir, xSamples, tSamples);
//The same batch size set here was set in the model
BaseDatasetIterator baseIterator = new BaseDatasetIterator(miniBatchSize, sampleSize, fetcher);
MultipleEpochsIterator iterator = new MultipleEpochsIterator(nEpochs, baseIterator);
model.fit(iterator)
Then I did more tests changing the batch size, but that didn't change the frequency of the log lines printed by the IterationListener. I mean that I thought that if I increase the batch size to 100 then with 1000 samples I would have just 10 updates of the parameters an therefore just 10 iterations, but the logs and the timestamp intervals are more or less the same.
BTW. There is a similar question, but the answer does not actually answer my question, I would like to understand better the actual details:
Deeplearning4j: Iterations, Epochs, and ScoreIterationListener

None of this will matter after 1.x (which is already out in alpha) - we got rid of iterations long ago.
Originally it was meant to be shortcut syntax so folks wouldn't have to write for loops.
Just focus on for loops with epochs now.

Related

Lower batch size in the last iteration of first training epoch than the other iteration

I'm trying to train an deep neural network model, the output dimensions of each iteration in one epoch is like [64,1600,8] (64 is the batch size). But in the last iteration of first epoch, this output changed to [54,1600,8] and faced with dimension error. Why in the last iteration batch size had changed??
Additionally, if I change the batch size to 32 the last iteration's output is [22,1600,8].
I think that the output of the last iteration must be same as the other iteration.
The last iteration batch size changed because you did not have enough data to completely fill the batch. If you have a batch size of 10, for example, and you have 101 entries total in your data, then you will have 10 batches of 10 and 1 batch of 1.
The solution is to either drop the batch if it is not the correct size, or to adapt your model so that it will detect the size of the batch and change accordingly, instead of having the batch size hard-coded in to your model parameters.
Seeing that you are using pytorch, I'll add to the answer by Richard by saying that pytorch DataLoaders have the functionality built-in to drop the last (incomplete) batch. Checking the documentation, you can specify drop_last=True while instantiating the DataLoader.

Slow prediction speed for translation model opus-mt-en-ro

I'm using the model Helsinki-NLP/opus-mt-en-ro from huggingface.
To produce output, I'm using the following code:
inputs = tokenizer(
questions,
max_length=max_input_length,
truncation=True,
return_tensors='pt',
padding=True).to('cuda')
translation = model.generate(**inputs)
For small inputs (i.e. the number of sentences in questions), it works fine. However, when the number of sentences increases (e.g., batch size = 128), it is very slow.
I have a dataset of 100K examples and I have to produce the output. How to make it faster? (I already checked the usage of GPU and it varies between 25% and 70%).
Update: Following the comment of dennlinger, here is the additional information:
Average question length: Around 30 tokens
Definition of slowness: With a batch of 128 questions, it takes around 25 seconds. So given my dataset of 100K examples, it will take more than 5 hours. I'm using GPU Nvidia V100 (16GB) (hence to('cuda') in the code). I cannot increase the batch size because it results in out of memory error.
I didn't try different parameters, but I know by default, the number of beams equals 1.

Stable Baselines - PPO Iterate through the data frame for learning

PPO model doesn't iterate through the whole dataframe .. its basically repeating the first step many times (10,000 in this example) ?
In this case, DF's shape is (5476, 28) and each step's obs shape is: (60, 28).. I dont see that its iterating through the whole DF.
# df shape - (5476, 28)
env = MyRLEnv(df)
model = PPO("MlpPolicy", env, verbose=4)
model.learn(total_timesteps=10000)
MyRLEnv:
self.action_space = spaces.Discrete(4)
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(60, 28) , dtype=np.float64)
Thanks!
I also got stuck few days ago on something similar to this , but after inspecting deeply I found that the learn method actually runs the environment for n times , now this n is equal to total_timesteps/size_of_df , this in your case would be nearly 10000/5476times which is almost equal to 1.8 , so this 1.8 means , the algorithm would reset the environment at the beginning, then run the step method for the entire dataframe and reset the environment again and run the step method for only 80% of the data in dataframe. So, when the PPO Algorithm stops you see only 80% of the dataframe is being ran .
The Actor Critic Algorithms runs the environment numerous number of times to improve it's efficiency, so that is the reason it is usually suggested that in order to get better results we should keep the value of total_timesteps fairly high , so that it can run it on the same data for quite some times to learn better.
Example:
Say my total_timesteps = 10000 and len(df) = 5000,
then in that case it would run for, n = total_timesteps/len(df) = 2 Full scans of the entire dataframe .

Testing from an LMDB file in Caffe

I am wondering how to go about setting up ONLY a test phase in Caffe for an LMDB file. I have already trained my model, everything seems good, my loss has decreased, and the output I am getting on images loaded in one by one also seem good.
Now I would like to see how my model performs on a separate LMDB test set, but seem to be unable to do so successfully. It would not be ideal for me to do a loop by loading images one at a time since my loss function is already defined in caffe and this would require me to redefine it.
this is what I have so far, but the results of this dont make sense; when I compare the loss I have from the train set to the loss I get from this, they don't match (orders of magnitude apart). Does anyone have any idea what my problem could be?
caffe.set_device(0)
caffe.set_mode_gpu()
net = caffe.Net('/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/deploy_JP.prototxt','/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/snapshot_iter_10000.caffemodel',caffe.TEST)
solver = None # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.SGDSolver('/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/lenet_auto_solverJP_test.prototxt')
niter = 100
test_loss = zeros(niter)
count = 0
for it in range(niter):
solver.test_nets[0].forward() # SGD by Caffe
# store the test loss
test_loss[count] = solver.test_nets[0].blobs['loss']
print(solver.test_nets[0].blobs['loss'].data)
count = count+1
See my answer here. Do not forget to subtract the mean, otherwise you'll get low accuracy. The link to the code, posted above, takes care of that.

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's how the data magnitude relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less then 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperate and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradiant (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.