I am trying to train a model on two different operating systems (ubuntu:18.04 and macOS 11.6.5) and get the same result. I use pytorch_lightning.seed_everything as well as
Trainer(deterministic=True, ...)
Both models are initialized identically, so the seeds are working correctly, and both train on the CPU.
Training with data that has nice continuous values, I get identical models at the end. However, if I use data that has many one-hot features, I get similar models on both operating systems, but as the epochs go up they slowly diverge, probably because small differences in floating-point precision accumulate.
Does anyone have an idea of what could cause this issue, or how to fix it?
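For reference, the relevant part of my setup looks roughly like this (a minimal sketch; MyModel and MyDataModule are placeholders for my actual classes):

# Minimal sketch of the training setup; MyModel / MyDataModule are placeholders.
import pytorch_lightning as pl

pl.seed_everything(42, workers=True)   # seeds Python, NumPy and torch RNGs

trainer = pl.Trainer(
    accelerator="cpu",      # both runs train on the CPU
    deterministic=True,     # request deterministic algorithms where available
    max_epochs=50,
)
trainer.fit(MyModel(), datamodule=MyDataModule())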
Problem Description
I have constructed a fully connected model as shown by the code below:
auto* model = new Sequential({
    new Linear(3072, 120),
    new ReLU(120),
    new Linear(120, 32),
    new ReLU(32),
    new Linear(32, 10),
    new Softmax(10)
});
OptimizerInfo* info = new OPTIMIZER_SGD( /*lr=*/ 0.003);
model->construct(info);
model->randInit();
model->setLoss(crossEntropyLoss);
The model works as expected on the MNIST dataset (with the first Linear layer's input set to 784): the loss decreases as more and more batches are trained. The graph looks like this:
The model performance on MNIST
However, when I switched to the CIFAR-10 dataset, the loss does not decrease over time; it fluctuates around 2.30, with each element of the output vector approaching 0.10. This behavior continued for all the following epochs:
The model performance on CIFAR-10
I have worked on the problem for a while and checked several docs and papers; the performance of a layered fully connected model on CIFAR-10 should be around 0.8 accuracy with CE loss below 0.5, so something clearly went wrong here.
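For reference, 2.30 is exactly the cross-entropy of a completely uninformative (uniform) prediction over 10 classes, which matches the outputs of ~0.10 I am seeing:

# Cross-entropy of a uniform prediction over the 10 CIFAR-10 classes.
import math
print(-math.log(1.0 / 10))   # 2.302585..., i.e. the value the loss is stuck at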
Things I have checked:
Both datasets (MNIST and CIFAR-10) are read into the model and normalized to [0,1] properly, and there are no misalignments between labels and their corresponding training samples.
No NaN or Inf values appear anywhere in the model or the loss.
Although the network is based on an engine I built without any external dependencies, I have compared the results of all functions (GEMM, ReLU, Softmax, ...) against PyTorch with the same inputs, and I found no algorithmic errors in my project (see the sketch after this list).
Adjusting hyperparameters (learning rate, batch size, different optimizers, etc.) makes no difference in the model's performance.
The model runs in FP32 and uses CUDA for computation.
For any reasonable network configuration, there are no memory leaks or other memory issues.
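Roughly, the PyTorch comparison mentioned above looks like the sketch below; the file names and shapes are placeholders for arrays dumped from my engine:

# Compare the engine's first Linear+ReLU and the final Softmax against PyTorch,
# using arrays dumped from the engine (file names are placeholders).
import numpy as np
import torch
import torch.nn.functional as F

x = torch.from_numpy(np.load("input.npy"))            # (batch, 3072)
w = torch.from_numpy(np.load("linear1_weight.npy"))   # (120, 3072)
b = torch.from_numpy(np.load("linear1_bias.npy"))     # (120,)
engine_out = torch.from_numpy(np.load("engine_relu1_out.npy"))

torch_out = F.relu(F.linear(x, w, b))
print("max abs diff:", (torch_out - engine_out).abs().max().item())

logits = torch.from_numpy(np.load("engine_logits.npy"))
engine_softmax = torch.from_numpy(np.load("engine_softmax_out.npy"))
print(torch.allclose(F.softmax(logits, dim=1), engine_softmax, atol=1e-5))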
Does anyone have an idea of what is happening?
My project link: https://github.com/SeanEngine/SEANN_2
Plz help :3
I am working on this implementation of a GAN:
https://github.com/rishikksh20/hifigan-denoiser/blob/master/train.py
You can see there are a total of three neural networks, plus the optimizers defined using torch.optim.AdamW.
It seems that, when loading both the networks and the optimizers using load_state_dict, the losses start from a much higher value than where they left off.
Judging from some similar issues here on Stack Overflow, this may be due to the learning rates or other parameters not being stored. However, the learning rate here is fixed, and the loading operations all seem to be executed correctly. Can you spot the bug in the code?
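For context, this is the pattern I would expect the resume logic to follow; the snippet below is a generic, self-contained sketch with stand-in modules, not the actual networks from train.py:

# Generic sketch of saving/restoring model *and* optimizer state in PyTorch.
# The modules here are stand-ins, not the actual HiFi-GAN networks.
import torch
import torch.nn as nn

generator = nn.Linear(8, 8)
discriminator = nn.Linear(8, 1)
optim_g = torch.optim.AdamW(generator.parameters(), lr=2e-4)
optim_d = torch.optim.AdamW(discriminator.parameters(), lr=2e-4)

# ... train for a while, then save everything needed to resume ...
torch.save({
    "generator": generator.state_dict(),
    "discriminator": discriminator.state_dict(),
    "optim_g": optim_g.state_dict(),   # includes AdamW moment estimates
    "optim_d": optim_d.state_dict(),
    "epoch": 10,
}, "checkpoint.pt")

# ... later, in a fresh process, restore all of it before continuing ...
ckpt = torch.load("checkpoint.pt", map_location="cpu")
generator.load_state_dict(ckpt["generator"])
discriminator.load_state_dict(ckpt["discriminator"])
optim_g.load_state_dict(ckpt["optim_g"])
optim_d.load_state_dict(ckpt["optim_d"])
start_epoch = ckpt["epoch"] + 1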
I'm trying to train a model by purchasing multiple virtual machines from a cloud service provider (3-4 instances, probably on AWS). I would like to load the model onto each VM, run the training process, and then update the models on each VM. The problem is that once each model has made its forward/backward pass, I don't know how to accumulate the gradients across the VMs, and if I could, I'm not sure whether I should sum the gradients or average them. I've been using DataParallel on single multi-GPU VMs, so I haven't had to keep track of multiple sets of gradients before this point. I'm unsure whether there is a PyTorch package that could help with GPU data parallelism across multiple VMs. I've seen PyTorch Lightning, but it isn't clear which modules to use to communicate between VMs. I'm very new to machine learning, and this is my first training process beyond a single machine, so any advice and tips would be appreciated, including packages or architectural ideas.
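The closest thing I have found so far is torch.distributed with DistributedDataParallel, which (as I understand it) all-reduces and averages the gradients across processes during backward(). A sketch of what I think the multi-VM setup would look like is below (MyModel and my_dataset are placeholders), but I'm not sure this is the right approach:

# Sketch of multi-node data parallelism with DistributedDataParallel (DDP).
# DDP averages gradients across all processes during backward(), so every
# VM applies the same averaged gradient in optimizer.step().
# MyModel / my_dataset are placeholders; launch with torchrun on each VM.
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")   # needs MASTER_ADDR/PORT, RANK, WORLD_SIZE set
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    sampler = DistributedSampler(my_dataset)  # each VM/process sees a different shard
    loader = DataLoader(my_dataset, batch_size=64, sampler=sampler)

    for epoch in range(10):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = F.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()        # gradients are all-reduced (averaged) here
            optimizer.step()

if __name__ == "__main__":
    main()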
I am using MXNet to fine-tune a ResNet model on the Caltech-256 dataset, following this example:
https://mxnet.incubator.apache.org/how_to/finetune.html
I am primarily doing it for a POC to test distributed training (which I'll later use in my actual project).
First I ran this example on a single machine with 2 GPUs for 8 epochs. It took around 20 minutes and the final validation accuracy was 0.809072.
Then I ran it on 2 machines (identical, each with 2 GPUs) in a distributed setting and partitioned the training data in half between the two machines (using num_parts and part_index).
The 8 epochs took only 10 minutes, but the final validation accuracy was only 0.772847 (the higher of the two workers). Even when I used 16 epochs, I was only able to reach 0.797006.
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs (or more) to achieve the same accuracy, then what's the advantage? Maybe I am missing something.
I can post my code and run command if required.
Thanks
EDIT
Some more info to help with the answer:
MXNet version: 0.11.0
Topology: 2 workers (each on a separate machine)
Code: https://gist.github.com/reactivefuture/2a1f9dcd3b27c0fe8215b4e3d25056ce
Command to start:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
I have used a hacky way to do partitioning (using IP addresses) since I couldn't get kv.num_workers and kv.rank to work.
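For reference, this is roughly the partitioning I was trying to get working before falling back to the IP-based hack (a sketch based on the finetune example; the record file path is the one from that tutorial):

# Shard the training data across workers via the distributed kv-store.
import mxnet as mx

kv = mx.kvstore.create('dist_sync')

train_iter = mx.io.ImageRecordIter(
    path_imgrec='caltech-256-60-train.rec',   # record file from the finetune example
    data_shape=(3, 224, 224),
    batch_size=128,
    shuffle=True,
    num_parts=kv.num_workers,                 # total number of workers
    part_index=kv.rank)                       # this worker's shard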
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs (or more) to achieve the same accuracy, then what's the advantage?
No, it is not normal. Distributed training should indeed be used to speed up the training process, not to slow it down. However, there are many ways to get it wrong.
Based on the data provided, it looks like the workers are still running in single-machine ('device') mode, or the kv_store is created incorrectly, so each worker just trains the model by itself. In that case you would expect the validation result after 16 epochs to be close to that of a single machine after 8 epochs (simply because in the cluster you are splitting the data in half). In your case it is 0.797006 vs 0.809072; depending on how many experiments you have executed, these numbers could be treated as equal. I would focus the investigation on how the cluster is bootstrapped.
If you need to dive deeper into how to create a kv_store (or what it is) and how to use it with distributed training, please see this article.
In general, in order to get a better answer, please provide at least the following information in the future:
the version of MXNet;
the topology of the cluster, including:
how many logical workers are used;
how many servers are used (are they on the same machines as the workers)?
how you start the training (ideally with the code);
if it is not possible to provide code, at least specify the type of kv_store;
how you partition the data between workers.
EDIT
Even though the call that starts the training looks correct:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
there is at least one problem in training.py itself. If you look here, it does not respect the kv-store type from the input argument and just uses 'device'. Therefore all workers are training the model separately (and not as a cluster). I believe fixing this one line should help.
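Concretely, the change I have in mind looks roughly like this (a sketch; the variable names are assumptions, not necessarily the ones used in the gist, and I have not run this exact code):

# Create the kv-store from the command-line argument instead of hard-coding 'device',
# and pass it to fit() so the workers actually synchronize through the cluster.
kv = mx.kvstore.create(args.kv_store)      # e.g. 'dist_sync'

mod.fit(train_iter,
        eval_data=val_iter,
        num_epoch=args.num_epochs,
        kvstore=kv,                        # instead of the hard-coded 'device'
        optimizer='sgd',
        optimizer_params={'learning_rate': args.lr})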
I would again advise reading the article to familiarize yourself with how an MXNet cluster works. Such problems can be spotted easily by analyzing the debug logs and observing that no kv-store is created, and therefore the cluster is not training anything (only the stand-alone machines are doing something).
I have a dataset of around 6K chemical formulas which I am preprocessing via Keras' tokenization to perform binary classification. I am currently using a 1D convolutional neural network with dropout and am obtaining an accuracy of 82% and a validation accuracy of 80% after only two epochs. No matter what I try, the model just plateaus there and doesn't seem to improve at all. The exact same accuracies are reached with a vanilla LSTM too, and the losses differ by only 0.04. Both models use an embedding layer, and changing its output dimension has no effect either. What else can I try to improve my accuracies? Does anyone have any ideas?
Based on your description, I believe your model has high bias and low variance (see this link for further details). That means your model is not fitting the data well, i.e., it is underfitting. So I suggest three things:
Train your model a little longer: two epochs are too few to give your model a chance to learn the patterns in the data. Try lowering the learning rate and increasing the number of epochs.
Try a different architecture: you can change the number of convolutions, filters, and layers. You can also use different activation functions and other layers such as max pooling.
Do an error analysis: once training is finished, apply your model to the test set and look at the errors (a sketch is shown below). How many false positives and false negatives do you have? Is your model better at classifying one class than the other? Can you see a pattern in the errors that may be related to your data?
Finally, if none of these suggestions helps, you may also try to increase the number of features, if possible.
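For the error-analysis step, something along these lines is usually enough to get started (a sketch; model, X_test and y_test stand for your own objects, and it assumes a single sigmoid output for the binary task):

# Basic error analysis on the test set: confusion counts and the most
# confident mistakes. model / X_test / y_test are placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_prob = model.predict(X_test).ravel()      # predicted probabilities
y_pred = (y_prob > 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
print(classification_report(y_test, y_pred))

# Look at the most confident mistakes for patterns in the formulas.
wrong = np.where(y_pred != y_test)[0]
most_confident_wrong = wrong[np.argsort(np.abs(y_prob[wrong] - 0.5))[::-1][:10]]
print(most_confident_wrong)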