I have a really sophisticated net which takes up a lot of memory on my GPU. I have found that if I train and test my data (which is the standard case), the memory usage is twice as high as when I only train. Is it really necessary to test my data? Or is it just used for visualisation, i.e. to show me whether my net is overfitting or something like that?
I assume it is necessary, but I do not know the reason. My question is: how do I separate training and testing? I know you can do
test_initialization: false
But if I want to test my net how would I do that afterwards?
Thanks in advance!
If you have a TEST phase in your train.prototxt, you can use the command line to test your network. See this link, where they mention the following command:
# score the learned LeNet model on the validation set as defined in the
# model architecture lenet_train_test.prototxt
caffe test -model examples/mnist/lenet_train_test.prototxt -weights
examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 100
You can edit it to test your network.
There is also a Python tutorial you can follow to load the trained network with a script and use it in the field. That script can be adapted to perform individual forward passes and compare the results with what you expect. I don't expect this to work completely out of the box, so you will have to experiment a bit.
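As a rough starting point, here is a minimal pycaffe sketch of that approach, reusing the LeNet file names from the command above (swap in your own .prototxt and .caffemodel):
import caffe

caffe.set_mode_gpu()
# load the architecture plus the learned weights and run the net in its TEST phase
net = caffe.Net('examples/mnist/lenet_train_test.prototxt',
                'examples/mnist/lenet_iter_10000.caffemodel',
                caffe.TEST)
out = net.forward()   # the TEST-phase data layer supplies the next batch
print(out)            # the output blobs defined in the prototxt, e.g. loss/accuracy
Calling net.forward() in a loop walks through the data batch by batch, which is essentially what the caffe test command does internally.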
Related
I am trying to run code written with PyTorch Lightning on a single machine with multiple GPUs.
The run settings are:
--gpus=1,2,3,4
--strategy=ddp
There is no problem during training. But when we validate the model after one training epoch, it still runs on multiple GPUs (multiple processes), so the validation dataset is split and assigned to different GPUs. When the code then tries to write the prediction file and compute the scores against the gold file, it runs into problems. Source Code
So I just want to shut down DDP when I validate the model and run the validation on local_rank 0 only.
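One common workaround is to keep DDP for the validation forward passes but let only global rank 0 gather the per-GPU shards, write the prediction file, and compute the scores. Below is a minimal sketch of that idea (not the linked source code; the class, the helper name, and the tensor shapes are assumptions):
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 4)        # stand-in for the real model

    def forward(self, x):
        return self.net(x)

    def validation_step(self, batch, batch_idx):
        inputs, _ = batch
        return self(inputs)                      # each process scores its own shard

    def validation_epoch_end(self, outputs):
        preds = torch.cat(outputs)               # predictions from this process only
        gathered = self.all_gather(preds)        # under DDP this adds a leading world-size dim
        if self.trainer.is_global_zero:
            full = gathered.reshape(-1, gathered.shape[-1])
            self.write_predictions_and_score(full)

    def write_predictions_and_score(self, preds):
        # only rank 0 reaches this point: write the predict file here
        # and compare it against the gold file
        pass
Switching DDP off only for the validation loop inside the same run is not straightforward, since the strategy applies to the whole run, which is why a rank-zero guard like this is a common approach.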
I am using MXNet to finetune Resnet model on Caltech 256 dataset from the following example:
https://mxnet.incubator.apache.org/how_to/finetune.html
I am primarily doing it for a POC to test distributed training (which I'll later use in my actual project).
First I ran this example on a single machine with 2 GPUs for 8 epochs. It took around 20 minutes and the final validation accuracy was 0.809072.
Then I ran it on 2 machines (identical, each with 2 GPUs) in a distributed setting, partitioning the training data in half between the two machines (using num_parts and part_index).
8 epochs took only 10 minutes, but the final validation accuracy was only 0.772847 (the higher of the two). Even with 16 epochs, I was only able to reach 0.797006.
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs or more to achieve the same accuracy, then what's the advantage? Maybe I am missing something.
I can post my code and run command if required.
Thanks
EDIT
Some more info to help with the answer:
MXNet version: 0.11.0
Topology: 2 workers (each on a separate machine)
Code: https://gist.github.com/reactivefuture/2a1f9dcd3b27c0fe8215b4e3d25056ce
Command to start:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
I have used a hacky way to do partitioning (using IP addresses) since I couldn't get kv.num_workers and kv.rank to work.
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs or more to achieve the same accuracy, then what's the advantage?
No, it is not normal. Distributed training should indeed speed up the training process, not slow it down. However, there are many ways to do it wrong.
Based on the provided data, it feels like the workers are still running in single-machine ('device') training mode, or perhaps the kv_store is created incorrectly, so each worker just trains the model by itself. In that case, the validation result after 16 epochs should be close to the single machine's result after 8 epochs (simply because in the cluster you are splitting the data in half). In your case it is 0.797006 vs. 0.809072; depending on how many experiments you have executed, these numbers might be treated as equal. I would focus the investigation on how the cluster is bootstrapped.
If you need to dive deeper into how to create a kv_store (or what it is) and how to use it with distributed training, please see this article.
In general, to get a better answer, in the future please provide at least the following information:
what is the version of MXNet?
what is the topology of the cluster, with the following information:
how many logical workers are used;
how many servers are used (are they on the same machines as the workers)?
how you start the training (ideally with the code);
if it is not possible to provide code, at least specify the type of kv_store;
how you partition the data between workers.
EDIT
Even though the call that starts the training looks correct:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
There is at least one problem in training.py itself. If you look here, it does not respect the kv-store type from the input argument and just uses 'device'. Therefore all workers are training the model separately (not as a cluster). I believe fixing this one line should help.
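For illustration, here is a rough sketch of the intended shape of that fix (the argument and iterator names are assumptions, not the actual fine-tuning script): create the kv-store from the command-line value and pass it to fit(), which also lets you partition the data via kv.num_workers and kv.rank instead of the IP-address workaround:
import mxnet as mx

def fit_distributed(sym, args):
    # args.kv_store is the --kv-store value, e.g. 'dist_sync' when launched via launch.py
    kv = mx.kvstore.create(args.kv_store)
    # with a real distributed kv-store, num_workers/rank become usable for partitioning
    train_iter = mx.io.ImageRecordIter(
        path_imgrec=args.data_train, batch_size=args.batch_size,
        data_shape=(3, 224, 224),
        num_parts=kv.num_workers, part_index=kv.rank)
    mod = mx.mod.Module(symbol=sym,
                        context=[mx.gpu(int(i)) for i in args.gpus.split(',')])
    mod.fit(train_iter,
            num_epoch=args.num_epochs,
            kvstore=kv,                          # instead of the hard-coded 'device'
            optimizer='sgd',
            optimizer_params={'learning_rate': args.lr},
            eval_metric='acc')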
I would again advise reading the article to familiarize yourself with how an MXNet cluster works. Such problems can be spotted easily by analyzing the debug logs and observing that no kv-store is created, and therefore the cluster is not training anything (only stand-alone machines are doing something).
I have a pre-trained network with which I would like to test my data. I defined the network architecture in a .prototxt, and my data layer is a custom Python layer that receives a .txt file with the paths of my data and their labels, preprocesses them, and then feeds them to the network.
At the end of the network, I have a custom Python layer that gets the class prediction made by the net and the label (from the first layer) and prints, for example, the accuracy over all batches.
I would like to run the network until all examples have passed through the net.
However, while searching for the command to test a network, I've found:
caffe test -model architecture.prototxt -weights model.caffemodel -gpu 0 -iterations 100
If I don't set the -iterations, it uses the default value (50).
Does any of you know a way to run caffe test without setting the number of iterations?
Thank you very much for your help!
No, Caffe does not have a facility to detect that it has run exactly one epoch (use each input vector exactly once). You could write a validation input routine to do that, but Caffe expects you to supply the quantity. This way, you can generate easily comparable results for a variety of validation data sets. However, I agree that it would be a convenient feature.
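In practice you compute the value yourself: one epoch is the validation-set size divided by the batch size of your TEST-phase data layer, rounded up. A trivial helper sketch (the numbers are only examples):
import math

def test_iterations(num_examples, batch_size):
    # round up; Caffe data layers wrap around, so a partial last batch
    # simply re-reads a few examples from the start of the set
    return int(math.ceil(num_examples / float(batch_size)))

print(test_iterations(10000, 100))   # -> 100, i.e. pass "-iterations 100"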
The lack of this feature might be related to its absence for training and for the interstitial testing done during training.
In training, we tune the hyper-parameters to get the most accurate model for a given application. As it turns out, accuracy depends more closely on TOTAL_NUM (the total number of inputs processed) than on the number of epochs (given a sufficiently large training set).
With a fixed training set, we often graph accuracy (y-axis) against epochs (x-axis), because that gives tractable results as we adjust batch size. However, if we cut the size of the training set in half, the most comparable graph would scale on TOTAL_NUM rather than the epoch number.
Also, by restricting the size of the test set, we avoid long waits for that feedback during training. For instance, in training against the ImageNet data set (1.2M images), I generally test with around 1000 images, typically no more than 5 times per epoch.
I'm working with Caffe and am interested in comparing my training and test errors in order to determine if my network is overfitting or underfitting. However, I can't seem to figure out how to have Caffe report training error. It will show training loss (the value of the loss function computed over the batch), but this is not useful in determining if the network is overfitting/underfitting. Is there a straightforward way to do this?
I'm using the Python interface to Caffe (pycaffe). If I could get access to the raw training set somehow, I could just put batches through with forward passes and evaluate the results. But, I can't seem to figure out how to access more than the currently-processing batch of training data. Is this possible? My data is in a LMDB format.
In the train_val.prototxt file change the source in the TEST phase to point to the training LMDB database (by default it points to the validation LMDB database) and then run this command:
$ ./build/tools/caffe test -model models/bvlc_reference_caffenet/train_val.prototxt -weights models/bvlc_reference_caffenet/<caffenet_train_iter>.caffemodel -gpu 0
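If you would rather stay inside pycaffe (as mentioned in the question), the same idea can be sketched roughly as follows; the snapshot name is the same placeholder as above, and the 'accuracy' key assumes your net defines an Accuracy output layer:
import caffe

caffe.set_mode_gpu()
net = caffe.Net('models/bvlc_reference_caffenet/train_val.prototxt',               # TEST source now points to the training LMDB
                'models/bvlc_reference_caffenet/<caffenet_train_iter>.caffemodel',  # fill in your snapshot
                caffe.TEST)

num_batches = 100          # choose so num_batches * batch_size covers the training set
total = 0.0
for _ in range(num_batches):
    out = net.forward()                # the data layer feeds the next training batch
    total += float(out['accuracy'])
print('training accuracy: %.4f' % (total / num_batches))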
While using caffe as
./build/tools/caffe train --solver=models/Handmade/solver.prototxt
Caffe also gets into "phase: TEST", but I have no test data. I only want to train the parameters on my training data, so I have not defined "phase: TEST" in "train.prototxt", which causes an error. What should I do?
I don't know if you can completely omit the test phase, but it's possible to train your model without needing a separate test set. It's also possible to prevent the solver from ever switching to the test phase.
Reuse your training data for the test phase. You can do so by duplicating your data layer and specifying it for the test phase.
To limit computation to the training phase only, increase the value of test_interval in your solver definition to a number larger than your training set size or, better, larger than max_iter. This prevents the solver from ever switching to the test phase.
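If you would rather not edit the solver by hand, here is a minimal sketch of that second option done programmatically through pycaffe's protobuf bindings (the paths reuse the ones from your command, and it assumes max_iter is set in the solver file):
from caffe.proto import caffe_pb2
from google.protobuf import text_format

solver = caffe_pb2.SolverParameter()
with open('models/Handmade/solver.prototxt') as f:
    text_format.Merge(f.read(), solver)

solver.test_initialization = False          # do not test before the first iteration
solver.test_interval = solver.max_iter + 1  # never reached, so no switch to TEST during training

with open('models/Handmade/solver_notest.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))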
I find it a bit odd to train a model without wanting to know how it does on a separate set of data points.