how fast are JMockit unit tests (at scales of thousands of tests)?

Would equivalent unit tests in JMockit give a significant speed up compared to PowerMock?
Background:
I have to get unit test coverage up on a large legacy code base.
We currently have PowerMock unit tests (300+) that take over 15 minutes to run.
PowerMock has been used so far due to lots of static methods calling static methods, ad infinitum.
Expanding further, we estimate needing about 1000+ unit test classes, and we want a sub-10-minute build.
We are simultaneously breaking the dependencies and unit testing with Mockito, which tends to be about 4x faster than the equivalent PowerMock test, but this practice is much harder and slower, and it is perceived as risky because it requires changes to production code.
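To illustrate the shape of the change, here is a rough sketch of a test after one static dependency has been pulled out behind an interface so that plain Mockito can stub it (all class and method names here are invented):

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    public class InvoiceCalculatorTest {

        // Hypothetical seam introduced in production code; this used to be a static utility call.
        interface RateSource {
            double currentRate(String currency);
        }

        // Hypothetical class under test, now taking the collaborator via its constructor.
        static class InvoiceCalculator {
            private final RateSource rates;
            InvoiceCalculator(RateSource rates) { this.rates = rates; }
            double total(double amount, String currency) {
                return amount * rates.currentRate(currency);
            }
        }

        @Test
        public void convertsUsingInjectedRate() {
            RateSource rates = mock(RateSource.class);       // no PowerMock runner or bytecode rewriting needed
            when(rates.currentRate("EUR")).thenReturn(2.0);
            assertEquals(20.0, new InvoiceCalculator(rates).total(10.0, "EUR"), 1e-9);
        }
    }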
Many thanks,
Alex.

Related

Schedule jobs between GPUs for PyTorch Models

I'm trying to build up a system that trains deep models on requests. A user comes to my web site, clicks a button and a training process starts.
However, I have two GPUs and I'm not sure of the best way to queue/handle jobs between them: start a job when at least one GPU is available, and queue the job if no GPU is currently free. I'd like to use one GPU per job request.
Is this something I can do in combination with Celery? I've used this in the past but I'm not sure how to handle this GPU related problem.
Thanks a lot!
Not sure about Celery as I've never used it, but conceptually this is what seems reasonable (and the question is quite open-ended anyway):
create thread(s) responsible solely for receiving requests and distributing tasks to specific GPUs
if any GPU is free, assign the task to it immediately
if both are occupied, estimate how long the task (neural network training) will probably take to finish
add it to the queue of the GPU with the smallest estimated time remaining
Time estimation
The ETA of the current task can be approximated quite well given a fixed number of samples and epochs. If that's not the case (e.g. early stopping), it will be much harder and will need some heuristic.
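As a rough sketch of that approximation (the names here are made up), a simple linear extrapolation over batches would look like:

    class EtaEstimator {
        // Assumes each remaining batch costs about as much time as the batches already processed.
        static double estimateSecondsRemaining(long batchesDone, long totalBatches, double elapsedSeconds) {
            double secondsPerBatch = elapsedSeconds / Math.max(1L, batchesDone);
            return secondsPerBatch * (totalBatches - batchesDone);
        }
    }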
When the GPUs are overloaded (say each has 5 tasks in its queue), what I would do is:
Stop the process currently running on the GPU
Run the new process for a few batches of data to get a rough estimate of how long it might take to finish this task
Add that estimate to the estimates of the other queued tasks
Now, this depends on the traffic. If it's heavy and would interrupt the ongoing process too often, you should simply add new tasks to the GPU queue with the fewest tasks (some heuristic would be needed here as well; you should have an estimate of the likely number of requests by now, and with only 2 GPUs it probably cannot be huge).
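A conceptual sketch of that dispatch policy (not Celery-specific, and all names are invented) could look like the following; actual execution would still need one worker thread per GPU pulling jobs from its queue:

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Queue;

    // A job with a rough time estimate, e.g. derived from samples * epochs as above.
    class TrainingJob {
        final String id;
        final double estimatedSeconds;
        TrainingJob(String id, double estimatedSeconds) {
            this.id = id;
            this.estimatedSeconds = estimatedSeconds;
        }
    }

    // One FIFO queue per GPU, with the total estimated backlog used for tie-breaking.
    class GpuQueue {
        final int gpuIndex;
        final Queue<TrainingJob> pending = new ArrayDeque<>();
        GpuQueue(int gpuIndex) { this.gpuIndex = gpuIndex; }
        boolean isFree() { return pending.isEmpty(); }
        double estimatedBacklogSeconds() {
            return pending.stream().mapToDouble(j -> j.estimatedSeconds).sum();
        }
    }

    // A free GPU gets the job immediately; otherwise the job goes to the GPU
    // whose queued work has the smallest total estimate.
    class Dispatcher {
        private final List<GpuQueue> gpus;
        Dispatcher(List<GpuQueue> gpus) { this.gpus = gpus; }

        void submit(TrainingJob job) {
            GpuQueue target = gpus.stream()
                    .filter(GpuQueue::isFree)
                    .findFirst()
                    .orElseGet(() -> gpus.stream()
                            .min(Comparator.comparingDouble(GpuQueue::estimatedBacklogSeconds))
                            .get());   // assumes at least one GPU is configured
            target.pending.add(job);
        }
    }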

single-class or multi-class object detection for a specific object class?

One thing I have wondered about for a long time is the performance of a CNN-based object detector in single-class versus multi-class settings.
For example, suppose I want to design a pedestrian detector using the famous Faster R-CNN (VGG-16). The official version detects pedestrians with 76.7 AP (PASCAL VOC 07 test) when the training data is PASCAL VOC 07+12 trainval.
I am quite satisfied with those detection results, but what if I revise the framework into a single-class pedestrian detector and the training data contains only pedestrians, so that both the training and testing data are smaller?
I know it will consume less computational power than the original 20-class detector, but I am curious about the detection performance.
Has anybody tried to compare single-class and multi-class detectors on the same class?
Yes, but the results vary quite a bit according to model and application. I've done this with several SVM applications and one CNN. As expected, the single-class version consumed fewer resources in every case.
However, the accuracy results were mixed. One SVM actually did better with single-class training, two were significantly worse, and the other 3-4 were about the same (within the expected error range).
The CNN didn't fare so well; it needed some tweaks to the topology.

Comparing speeds of different tasks in different languages

If I wanted to test how long certain tasks take to complete, would it matter which language I did the test in? Consider this to be any job a programmer might want to perform: simple jobs such as sorting, or more complicated jobs involving the signing and verification of files.
We all know that certain languages run faster than others, so the measured times will depend on the language and on the way its compiler/runtime is optimised, and these will obviously all differ.
So is it best to use a language with less abstraction, such as C, or is it OK to test jobs and tasks in higher-level languages and rely on them being implemented well enough that any possible inefficiencies don't matter? I hope my question is clear.
It doesn't matter which real-world language you use for your tests... if you design the tests correctly. There is always overhead from the language, but there is also always overhead from the operating system, the thread scheduler, I/O, RAM, or even the speed of electricity, which depends on the current temperature, etc.
But to compare anything, you don't want to measure how many nanoseconds it took to do one assignment statement; instead you want to measure how many hours it took to do billions of assignment statements. Then all the overheads mentioned above are negligible.
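A minimal sketch of that idea in Java (the workload and iteration count are arbitrary):

    public class RoughBenchmark {
        public static void main(String[] args) {
            long iterations = 1_000_000_000L;
            long sink = 0;                       // keep a result so the loop isn't trivially removable
            long start = System.nanoTime();
            for (long i = 0; i < iterations; i++) {
                sink += i;                       // stand-in for the operation being measured
            }
            long elapsed = System.nanoTime() - start;
            System.out.println("sink=" + sink
                    + " total seconds=" + elapsed / 1e9
                    + " ns/op=" + (double) elapsed / iterations);
        }
    }

For anything serious, a dedicated harness such as JMH also takes care of warm-up and dead-code elimination, but the principle is the same: make the measured work large relative to the noise.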

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU, but I'm not sure how to compare the performance fairly. For example, if I compare an old CPU with a new GPU, I will see an immense speedup.
Another question: how can I compare my CUDA program with another CUDA program reported in a paper (they run on different GPUs and I cannot access the source code)?
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per-core figure helps normalize large differences between, say, a 200-core and a 2000-core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
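As a tiny illustration of that bookkeeping, with entirely made-up numbers, the kernel-only figure and the end-to-end figure can differ a lot once transfers are included:

    public class ThroughputReport {
        public static void main(String[] args) {
            double hostToDeviceSeconds = 0.40;   // hypothetical transfer in
            double kernelSeconds       = 0.25;   // hypothetical compute time
            double deviceToHostSeconds = 0.35;   // hypothetical transfer out
            long unitsOfWork           = 10_000_000L;

            double totalSeconds = hostToDeviceSeconds + kernelSeconds + deviceToHostSeconds;
            System.out.println("kernel-only units/s: " + unitsOfWork / kernelSeconds);
            System.out.println("end-to-end units/s:  " + unitsOfWork / totalSeconds);
        }
    }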
The speed of a CUDA program on different GPUs depends on many characteristics of the GPU, such as memory bandwidth, core clock speed, number of cores, and the number of threads/registers/amount of shared memory available, so it is difficult to compare performance across different GPUs.

How to prepare state for several JUnit tests only once

I need to test a program that first preprocesses some data and then computes several different results using this preprocessed data -- it makes sense to write a separate test for each computation.
Official JUnit policy seems to be that I should run the preprocessing before each computation test.
How can I set up my tests so that the preparation runs only once (it's quite slow) before the remaining tests?
Use the @BeforeClass annotation on the method that should run once before all the test methods; in JUnit 4 that method must be public static.
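A minimal JUnit 4 sketch (the fixture and test names here are invented); in JUnit 5 the equivalent annotation is @BeforeAll:

    import org.junit.BeforeClass;
    import org.junit.Test;
    import static org.junit.Assert.assertNotNull;

    public class PreprocessedDataTest {

        private static Object preprocessedData;   // shared by all tests in this class

        @BeforeClass
        public static void prepareOnce() {
            preprocessedData = "result of the slow preprocessing step";   // stand-in for the real work
        }

        @Test
        public void firstComputation() {
            assertNotNull(preprocessedData);
        }

        @Test
        public void secondComputation() {
            assertNotNull(preprocessedData);
        }
    }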