MXNet distributed training accuracy - deep-learning

I am using MXNet to finetune a ResNet model on the Caltech 256 dataset from the following example:
https://mxnet.incubator.apache.org/how_to/finetune.html
I am primarily doing it for a POC to test distributed training (which I'll later use in my actual project).
First I ran this example on a single machine with 2 GPUs for 8 epochs. It took around 20 minutes and the final validation accuracy was 0.809072.
Then I ran it on 2 machines (identical, each with 2 GPUs) with distributed setting and partitioned the training data in half for these two machines (using num_parts and part_index).
8 epochs took only 10 minutes, but the final validation accuracy was only 0.772847 (the higher of the two workers). Even with 16 epochs, I was only able to reach 0.797006.
So my question is: is this normal? I primarily want to use distributed training to reduce training time, but if it takes twice as many epochs (or more) to achieve the same accuracy, then what's the advantage? Maybe I am missing something.
I can post my code and run command if required.
Thanks
EDIT
Some more info to help with the answer:
MXNet version: 0.11.0
Topology: 2 workers (each on a separate machine)
Code: https://gist.github.com/reactivefuture/2a1f9dcd3b27c0fe8215b4e3d25056ce
Command to start:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
I have used a hacky way to do partitioning (using IP addresses) since I couldn't get kv.num_workers and kv.rank to work.
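For reference, the kvstore-based partitioning I was trying to get working would look roughly like this (the record-file path and batch size are placeholders based on the finetune example):

import mxnet as mx

# Create the kvstore from the --kv-store argument ('dist_sync' in my run)
kv = mx.kvstore.create('dist_sync')

# Each worker then reads only its own shard of the record file
train_iter = mx.io.ImageRecordIter(
    path_imgrec='caltech-256-60-train.rec',  # placeholder path
    data_shape=(3, 224, 224),
    batch_size=16,                           # placeholder batch size
    num_parts=kv.num_workers,                # total number of workers
    part_index=kv.rank)                      # this worker's index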

So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs (or more) to achieve the same accuracy, then what's the advantage?
No, it is not normal. Distributed training should indeed be used to speed up the training process, not to slow it down. However, there are many ways to do it incorrectly.
Based on the data provided, it feels like the workers are still running in single-machine ('device') mode, or maybe the kv_store is created incorrectly, so each worker just trains the model by itself. In that case you would expect the validation result after 16 epochs to be close to the single machine's result after 8 epochs (simply because in the cluster you are splitting the data in half). In your case it is 0.797006 vs 0.809072; depending on how many experiments you have run, these numbers could be treated as equal. I would focus the investigation on how the cluster is bootstrapped.
If you want to dive deeper into how to create a kv_store (and what it is) and use it with distributed training, please see this article.
In general, to get a better answer, in the future please provide at least the following information:
what version of MXNet you are using;
what the topology of the cluster is, including:
how many logical workers are used;
how many servers are used (and whether they are on the same machines as the workers);
how you start the training (ideally with the code);
if it is not possible to provide code, at least the type of kv_store;
how you partition the data between workers.
EDIT
Even though the call that starts training looks correct:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
There is at least one problem in training.py itself. If you look here, it does not respect the type of kv-store from the input argument and just uses 'device'. Therefore all workers are training the model separately (and not as a cluster). I believe fixing this one line should help.
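A minimal sketch of that fix, assuming the script parses the --kv-store flag and trains through a Module (the exact variable names in the gist may differ):

import argparse
import mxnet as mx

# Parse the --kv-store argument that launch.py forwards to the script
parser = argparse.ArgumentParser()
parser.add_argument('--kv-store', type=str, default='device')
args, _ = parser.parse_known_args()

# Build the kvstore from the argument instead of hard-coding 'device' ...
kv = mx.kvstore.create(args.kv_store)  # becomes 'dist_sync' under launch.py

# ... and hand it to Module.fit, e.g. (mod, train_iter, val_iter come from the rest of the script):
# mod.fit(train_iter, eval_data=val_iter, num_epoch=8, kvstore=kv)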
I would again advise reading the article to familiarize yourself with how an MXNet cluster works. Such problems can easily be spotted by analyzing the debug logs and observing that no kv-store is created, and therefore the cluster is not training anything (only stand-alone machines are doing something).

Related

Scheduling Optimization for multi-step growth modeling

I recently had a Gaussian Process machine learning program built for my production department. This GP system has built a massive MySQL database that provides growth durations for each of the organisms we grow (lab environment) and the predicted yield for each of those combinations of growth steps.
I would like to build an optimization program in python (preferably) to assist me in scheduling what organisms to grow, when to grow them, and for how long at each step.
Here is some background:
4 steps to the process
Plate step (organism is plated; growth is started)
Seed step (organism transferred from plate to seed phase)
Incubation step (organism is transferred from seed to incubation phase)
Harvest step (organism is harvested; yield collected)
There are multiple organisms (>50) that are grown per year. Each has their own numerical ID
There is finite space to grow organisms at the incubation step
There is infinite space to grow organisms at the plate and seed step.
Multiple 'lots' of the same organism are typically grown at a time. A lot is predefined by the number of containers being used at the incubation step.
Different organisms have very different maximum yields. Some yield 2000 grams max and others 600 g max.
The MySQL server has every combination of # of days at each step for each organism and the predicted yield for that combination. This data is what needs to be used for optimization.
The massive challenge we run into is scheduling which organisms to grow when. With the GP process, we know the theoretical maximums (and they work!) but it's hard to put them into practice due to constraints (see below).
Here would be my constraints:
Only one organism can be harvested per day.
No steps can be started on weekends. Organisms can grow over the weekend, but we can't start a new step on a weekend
If multiple 'lots' are being grown of the same mold, the plate and seed start dates should be the same for every 'lot'.
- What this typically looks like in practice is:
- plate and seed steps start on the same day
- next, incubation steps start day-after-day for as many lots as being made
- finally, harvests occur in the same pattern (day-after-day)
- Therefore, what you typically get is identical # of days in the plate phase, identical # of incubation days, and differing # of seed days.
Objective Function: I don't know how to articulate this perfectly, but very broadly we need to maximize the yields for each organism. However, there needs to be a time balance too as the space to grow the organisms is finite and the time we have to grow them is finite as well.
I have created a metric known as lot*weeks that tries to capture that. It is a measure of the number of weeks (at the incubation phase) needed to grow the expected annual demand of a specific organism, based on the predicted yield from the SQL server. Therefore, a potential objective function would be to minimize the lot*weeks for each organism.
This is obviously more of a broad ask for help. I don't have a specific request. If this is not appropriate for this forum, I can take my question elsewhere. I feel comfortable with the scope of the project and can figure out how to write the code over time but I need assistance with what tools to use and what's possible.
I've seen that pyomo may be helpful but I also wanted to check here first. Thank you
I've tried looking into using Pyomo but stopped due to the complexity and didn't want to learn all of it if it wasn't appropriate for the problem.
Edit: This was too broad, I apologize. I've created another post with more concrete examples. Thank you for all that helped.
This is really too broad of a question for this forum, and it may likely get closed. That said...
You have a framework here that you could develop an optimization in. The database part is irrelevant. For an effective optimization model, what you really need is a known relationship between the variables and the outcomes, for instance, days in incubation ==> size of harvest or such. Which it sounds like you have.
This isn't an entry level model you are describing. Do you have any resources to help? Local university that might have need for grad student projects in the field or such?
As you develop this, you should start small and focus the model on the key issues here... if they aren't known, then perhaps that is the place to start. For instance, perhaps the key issue is management of planting times vis-a-vis the weekends (that is one model). Or perhaps the key issue is the management of the limited space for growth and the inability to achieve steps on the weekend just kinda works itself out. (That is another model for space management.) Try one that seems to address key management questions. Start very small and see if you can get something working as a proof of concept. If this is your first foray into linear programming, you will need help. You might also start with an introductory textbook on LP.
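To make "start very small" concrete, here is a deliberately tiny Pyomo sketch that picks one (incubation days, predicted yield) combination per organism under a made-up incubation capacity; all the data and numbers are illustrative, and the real model would still need the calendar/weekend and harvest constraints:

from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           Binary, maximize, SolverFactory)

# Made-up data: organism -> {incubation days: predicted yield in grams}
yields = {
    'org1': {7: 1500, 10: 1900, 14: 2000},
    'org2': {7: 400, 10: 550, 14: 600},
}
capacity_days = 20  # made-up total incubation-days available

m = ConcreteModel()
options = [(o, d) for o in yields for d in yields[o]]
m.pick = Var(options, domain=Binary)  # 1 if organism o runs with d incubation days

# Exactly one duration is chosen per organism
def one_choice_rule(m, o):
    return sum(m.pick[o, d] for d in yields[o]) == 1
m.one_choice = Constraint(list(yields), rule=one_choice_rule)

# Total incubation time must fit in the available capacity
m.capacity = Constraint(expr=sum(d * m.pick[o, d] for (o, d) in options) <= capacity_days)

# Maximize total predicted yield
m.obj = Objective(expr=sum(yields[o][d] * m.pick[o, d] for (o, d) in options), sense=maximize)

SolverFactory('glpk').solve(m)  # needs a MILP solver such as GLPK or CBC installed
for (o, d) in options:
    if m.pick[o, d].value and m.pick[o, d].value > 0.5:
        print(o, 'incubate for', d, 'days')

Even a toy like this forces you to pin down what the decision variables, constraints, and objective really are, which is most of the battle.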

Early stopping in Caffe

It seems this related PR is dead now; is there any workaround to use early stopping in Caffe? Maybe using Python on top of Caffe?
The first part is easy to do manually: monitor your validation error and stop when it no longer changes much (changes stay below a threshold). Then take the state with the lowest validation error as the "optimal" network.
The real problem is then to benefit from the full train+val dataset from there. There are two basic strategies:
Retrain your network on train+val for the same number of epochs OR on the same amount of data (i.e. compute the number of minibatches that were used to reach the "optimal" state and set the number of passes so that the same number of minibatches, of the same size, go through the network).
Keep the "optimal" network, then add the validation data and continue training. If you reach the same error rate as before, stop. Otherwise, just fix an a priori number of epochs.
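A rough pycaffe sketch of the manual monitoring described above (the solver path, the 'loss' blob name, and the patience/interval values are all illustrative, and it assumes the solver prototxt defines a TEST net):

import caffe

solver = caffe.SGDSolver('solver.prototxt')  # illustrative path

best_loss = float('inf')
patience, bad_checks = 5, 0
test_interval, test_iters = 500, 100  # validate every 500 train iters over 100 test batches

while bad_checks < patience:
    solver.step(test_interval)  # train for a while

    # Average the validation loss over a few test batches
    val_loss = 0.0
    for _ in range(test_iters):
        solver.test_nets[0].forward()
        val_loss += float(solver.test_nets[0].blobs['loss'].data)
    val_loss /= test_iters

    if val_loss < best_loss:
        best_loss, bad_checks = val_loss, 0
        solver.snapshot()  # keep the best ("optimal") network seen so far
    else:
        bad_checks += 1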
You could apply this patch for early stopping to standard Caffe RC 1.0.0. It adds an optional early_stop_param to the solver. You can specify the test network ID, the number of tries to check for improvement in the test loss, and a "skip" so that not every iteration is tested. Disclosure: I am one of the developers.

caffe: Test needed?

I have a really sophisticated net which takes up a lot of memory on my GPU. I have found out that if I train and test my data (which is the standard case), the memory usage is twice as high as if I only train. Is it really necessary to test my data? Or is it just used for visualisation, i.e. to show me whether my net is overfitting or something like that?
I assume it is necessary, but I do not know the reason. My question is: How to separate training and testing? I know you can do
test_initialization: false
But if I want to test my net how would I do that afterwards?
Thanks in advance!
If you have a TEST phase in your train.prototxt, you can use a command line to test your network. You can see this link, where they mention the following command line:
# score the learned LeNet model on the validation set as defined in the
# model architecture lenet_train_test.prototxt
caffe test -model examples/mnist/lenet_train_test.prototxt -weights examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 100
You can edit it to test your network.
There is also a Python tutorial you can follow to load the trained network with a script and use it in the field. This can be manipulated to perform separate forward passes and compare the results with what you expect. I don't expect this to work completely out of the box, so you will have to try some things out.
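For the Python route, loading the trained weights and running a forward pass yourself looks roughly like this (the deploy prototxt, blob names and the fake input are placeholders you would adapt to your own network and preprocessing):

import caffe
import numpy as np

caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt',              # architecture without data/loss layers
                'lenet_iter_10000.caffemodel',  # trained weights
                caffe.TEST)

# Feed one preprocessed input and read the prediction
image = np.random.rand(1, 1, 28, 28).astype(np.float32)  # stand-in for a real image
net.blobs['data'].data[...] = image
output = net.forward()
print('predicted class:', output['prob'].argmax())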

Testing a network without setting number of iterations

I have a pre-trained network with which I would like to test my data. I defined the network architecture using a .prototxt and my data layer is a custom Python Layer that receives a .txt file with the path of my data and its label, preprocess it and then feed to the network.
At the end of the network, I have a custom Python layer that get the class prediction made by the net and the label (from the first layer) and print, for example, the accuracy regarding all batches.
I would like to run the network until all examples have passed through the net.
However, while searching for the command to test a network, I've found:
caffe test -model architecture.prototxt -weights model.caffemodel -gpu 0 -iterations 100
If I don't set the -iterations, it uses the default value (50).
Does any of you know a way to run caffe test without setting the number of iterations?
Thank you very much for your help!
No, Caffe does not have a facility to detect that it has run exactly one epoch (use each input vector exactly once). You could write a validation input routine to do that, but Caffe expects you to supply the quantity. This way, you can generate easily comparable results for a variety of validation data sets. However, I agree that it would be a convenient feature.
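A small workaround is to compute the iteration count for exactly one epoch yourself and pass it on the command line; a sketch (the file names and batch size are placeholders, and the batch size must match your test-phase data layer):

import math
import subprocess

num_examples = sum(1 for _ in open('test_list.txt'))  # lines in the data layer's .txt file
batch_size = 32                                       # must match the prototxt batch size

iterations = math.ceil(num_examples / batch_size)     # one full pass over the data

subprocess.run(['caffe', 'test',
                '-model', 'architecture.prototxt',
                '-weights', 'model.caffemodel',
                '-gpu', '0',
                '-iterations', str(iterations)])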
The lack of this feature is probably related to the fact that it is also missing for training and for the interstitial testing done during training.
In training, we tune the hyper-parameters to get the most accurate model for a given application. As it turns out, this is more closely dependent on TOTAL_NUM (the total number of training samples processed, i.e. batch size × iterations) than on the number of epochs (given a sufficiently large training set).
With a fixed training set, we often graph accuracy (y-axis) against epochs (x-axis), because that gives tractable results as we adjust batch size. However, if we cut the size of the training set in half, the most comparable graph would scale on TOTAL_NUM rather than the epoch number.
Also, by restricting the size of the test set, we avoid long waits for that feedback during training. For instance, in training against the ImageNet data set (1.2M images), I generally test with around 1000 images, typically no more than 5 times per epoch.

Performing a distributed CUDA/OpenCL based password cracking

Is there a way to perform a distributed (as in a cluster of a connected computers) CUDA/openCL based dictionary attack?
For example, could one computer with an NVIDIA card share the load of the dictionary attack with another coupled computer, and thus utilize a second array of GPUs there?
The idea is to ensure a scalability option for future expansion without needing to replace the whole set of hardware we are using (and let's say cloud is not an option).
This is a simple master / slave work delegation problem. The master work server hands out to any connecting slave process a unit of work. Slaves work on one unit and queue one unit. When they complete a unit, they report back to the server. Work units that are exhaustively checked are used to estimate operations per second. Depending on your setup, I would adjust work units to be somewhere in the 15-60 second range. Anything that doesn't get a response by the 10 minute mark is recycled back into the queue.
For queuing, offer the current list of uncracked hashes, the dictionary range to be checked, and the permutation rules to be applied. The master server should be able to adapt queues per machine and per permutation rule set so that all machines finish their work within a minute or so of each other.
Alternately, coding could be made simpler if each unit of work were the same size. Even then, no machine would be idle longer than the amount of time for the slowest machine to complete one unit of work. Size your work units so that the fastest machine doesn't enter a case of resource starvation (shouldn't complete work faster than five seconds, should always have a second unit queued). Using that method, hopefully your fastest machine and slowest machine aren't different by a factor of more than 100x.
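A bare-bones sketch of the master's bookkeeping for this scheme (in-memory only, no networking; the 10-minute recycling follows the numbers above and everything else is illustrative):

import time
from collections import deque

RECYCLE_AFTER = 600  # seconds before an unanswered work unit goes back into the queue

class Master:
    def __init__(self, work_units):
        self.pending = deque(work_units)  # units nobody is working on yet
        self.in_flight = {}               # unit -> time it was handed out
        self.done = set()

    def get_work(self):
        self._recycle_stale()
        if self.pending:
            unit = self.pending.popleft()
            self.in_flight[unit] = time.time()
            return unit
        return None  # nothing to hand out right now

    def report_done(self, unit, cracked=None):
        self.in_flight.pop(unit, None)
        self.done.add(unit)
        if cracked:
            print('password found in unit', unit, ':', cracked)

    def _recycle_stale(self):
        now = time.time()
        for unit, started in list(self.in_flight.items()):
            if now - started > RECYCLE_AFTER:
                del self.in_flight[unit]
                self.pending.append(unit)

# Example: a work unit is a (dictionary range, rule set) pair
master = Master([('words_0000-0999', 'rules_a'), ('words_1000-1999', 'rules_a')])
unit = master.get_work()
master.report_done(unit)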
It would seem to me that it would be quite easy to write your own service that would do just this.
Super Easy Setup
Let's say you have some GPU enabled program X that takes a hash h as input and a list of dictionary words D, then uses the dictionary words to try and crack the password. With one machine, you simply run X(h,D).
If you have N machines, you split the dictionary into N parts (D_1, D_2, D_3,...,D_N). Then run X(h, D_i) on machine i.
This could easily be done using SSH. The master machine splits the dictionary up, copies it to each of the slave machines using SCP, then connects to the slaves and tells them to run the program.
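A minimal sketch of that SSH/SCP flow (the host names, file paths and the cracking program X are all placeholders):

import subprocess

hosts = ['gpu-node-1', 'gpu-node-2']   # slave machines (placeholders)
target_hash = 'HASH_TO_CRACK'          # placeholder hash
words = open('dictionary.txt').read().splitlines()

# Split the dictionary into one part per machine
chunk = (len(words) + len(hosts) - 1) // len(hosts)
for i, host in enumerate(hosts):
    part_file = 'dict_part_%d.txt' % i
    with open(part_file, 'w') as f:
        f.write('\n'.join(words[i * chunk:(i + 1) * chunk]))

    # Copy the shard over and start the cracking program X remotely
    subprocess.run(['scp', part_file, '%s:/tmp/%s' % (host, part_file)], check=True)
    subprocess.Popen(['ssh', host, './X', target_hash, '/tmp/%s' % part_file])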
Slightly Smarter Setup
When one machine cracks the password, they could easily notify the master that they have completed the task. The master then kills the programs running on the other slaves.