I have a cluster with 50 nodes and each node has 8 cores for computation.
If I have a job for which I plan to configure 200 reducers, what would be a good computational resource allocation strategy for better performance?
I mean, is it better to allocate 50 nodes with 4 cores each, or 25 nodes with 8 cores each? Which one is better in which case?
To answer your question, it depends on a few things. The 50 nodes are going to be better in general, in my opinion:
If you are reading a lot of data off disk, 50 nodes will be better because you will parallelize the loading off disk 2x.
If you are computing and processing over a lot of data, 50 nodes will be better, because performance doesn't scale 1:1 with the number of cores per node (i.e., 2x as many cores is not quite 2x as fast), whereas adding more nodes does scale close to 1:1.
Hadoop has to run things like the TaskTracker and DataNode processes on those nodes, as well as the OS layer stuff. Those "take up" cores, as well.
However, if your main concern is the network, here are a few downsides of having 50 nodes:
50 nodes will likely span two racks. Are they on a flat network, or do you have to deal with inter-rack communication? You'll have to set up Hadoop accordingly;
A network switch supporting 50 nodes is going to be more expensive than one that supports 25;
The shuffle between the map and reduce phases will give the switch a bit more work on the 50-node cluster, but roughly the same amount of data will pass through the network either way.
Even with these network concerns, I think you'll find that 50 nodes is better, because the value of a node is not just its number of cores. You mostly have to consider how many disks you have.
It is hard to say; usually "the more, the better".
More machines are also better for tolerating failures.
Hadoop is usually fine with commodity hardware, so you can pick the 50 servers with 4 cores each.
But I would pick the 8-core machines if they have superior hardware, e.g. higher CPU frequency, DDR3 RAM or 10k rpm disks.
I want to know how Caffe utilizes multiple GPUs so that I can decide whether to upgrade to a new, more powerful card or just buy another of the same card and run them in SLI.
For example, am I better off buying one Titan X (12 GB), or two GTX 1080s (8 GB each)?
If I go SLI with the 1080s, will my effective memory be doubled? I mean, can I run a network that takes 12 GB or more of VRAM using them, or am I left with only 8 GB?
Again, how is memory utilized in such scenarios?
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970.)
For example, am I better off buying one Titan X (12 GB), or two GTX 1080s (8 GB each)? If I go SLI with the 1080s, will my effective memory be doubled? I mean, can I run a network that takes 12 GB or more of VRAM using them, or am I left with only 8 GB?
No, the effective memory size with two 8 GB GPUs is still 8 GB, but the effective batch size is doubled, which leads to more stable/faster training.
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970.)
I think you will be limited by the lower card and may have problems with drivers, so I don't recommend trying this configuration.
Also, from the documentation:
Current implementation has a "soft" assumption that the devices being used are homogeneous. In practice, any devices of the same general class should work together, but performance and total size is limited by the smallest device being used. e.g. if you combine a TitanX and a GTX980, performance will be limited by the 980. Mixing vastly different levels of boards, e.g. Kepler and Fermi, is not supported.
Summing up: with a GPU that has lots of RAM you can train deeper models; with multiple GPUs you can train a single model faster, and you can also train a separate model per GPU. I would choose a single GPU with more memory (Titan X), because deep networks nowadays are more RAM-bound (e.g. ResNet-152 or some semantic segmentation networks), and more memory gives you the opportunity to run deeper networks with a larger batch size. Otherwise, if you have a task that fits on a single GPU (GTX 1080), you can buy 2 or 4 of them just to make things faster.
Also here is some info about multi GPU support in Caffe:
The current implementation uses a tree reduction strategy. e.g. if there are 4 GPUs in the system, 0:1, 2:3 will exchange gradients, then 0:2 (top of the tree) will exchange gradients, 0 will calculate updated model, 0->2, and then 0->1, 2->3.
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
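To make the pairing order concrete, here is a toy, host-only C++ sketch of that tree reduction for 4 GPUs. The gradient buffers and the plain averaged-SGD step are invented purely for illustration; this is not Caffe's actual implementation, just the exchange order it describes.

    #include <cstdio>
    #include <vector>

    int main()
    {
        const int nGpus = 4, len = 4;
        const float lr = 0.01f;

        // Hypothetical per-GPU gradient buffers (all ones, just for illustration).
        std::vector<std::vector<float>> grad(nGpus, std::vector<float>(len, 1.0f));
        std::vector<float> model(len, 0.0f);

        // Step 1: leaf pairs exchange gradients -- 0:1 and 2:3.
        for (int i = 0; i < len; ++i) {
            grad[0][i] += grad[1][i];
            grad[2][i] += grad[3][i];
        }

        // Step 2: top of the tree -- 0:2. GPU 0 now holds the sum over all GPUs.
        for (int i = 0; i < len; ++i)
            grad[0][i] += grad[2][i];

        // GPU 0 computes the updated model (a plain averaged-SGD step as a stand-in).
        for (int i = 0; i < len; ++i)
            model[i] -= lr * grad[0][i] / nGpus;

        // Step 3: broadcast back down the tree: 0->2, then 0->1 and 2->3,
        // so every GPU starts the next iteration with the same model.
        printf("model[0] after one update: %f\n", model[0]);
        return 0;
    }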
I don't believe Caffe supports SLI mode. The two GPUs are treated as separate cards.
When you run Caffe and add the '-gpu' flag (assuming you are using the command line), you can specify which GPU to use (-gpu 0 or -gpu 1, for example). You can also specify multiple GPUs (-gpu 0,1,3), including using all GPUs (-gpu all).
When you execute using multiple GPUs, Caffe will run the training across all of the GPUs and then merge the training updates across the models. This effectively doubles (or more, if you have more than 2 GPUs) the batch size for each iteration.
In my case, I started with an NVIDIA GTX 970 (4 GB card) and then upgraded to an NVIDIA GTX Titan X (Maxwell version with 12 GB) because my models were too large to fit in the GTX 970. I can run some of the smaller models across both cards (even though they are not the same) as long as the model fully fits into the 4 GB of the smaller card. Using the standard ImageNet model, I could execute across both cards and cut my training time in half.
If I recall correctly, other frameworks (TensorFlow and maybe Microsoft's CNTK) support splitting a model among different nodes to effectively increase the available GPU memory like what you are describing. Although I haven't personally tried either one, I understand you can define on a per-layer basis where the layer executes.
Patrick
Perhaps a late answer, but Caffe supports GPU parallelism, which means you can indeed fully utilize both GPUs; however, I do recommend getting two GPUs of equal memory size, since I don't think Caffe lets you select the batch size per GPU.
As for how memory is utilized: with multiple GPUs, each GPU gets a batch of the size specified in your train_val.prototxt, so if your batch size is, for example, 16 and you're using 2 GPUs, you'd have an effective batch size of 32.
Finally, I know that for things such as gaming, SLI seems to be much less effective and often much more problematic than having a single, more powerful GPU. So if you are planning on using the GPUs for more than just deep learning, I'd recommend you still go for the Titan X.
When I use the Time Profiler instrument in Instruments, it shows the CPU usage for each core (or logical core) as well as an overall "CPU usage". I'm wondering how that overall CPU usage is calculated from the per-core usage. I tried the data from a specific timestamp, and it is neither the sum nor the average of the cores. Here is a snapshot of the panel.
The CPU usage is neither the sum nor the average. In contrast to OS CPU usage (say, from top), the profiler's CPU usage is generally taken from an actual hardware counter in the processor. This also makes it hardware dependent, meaning its exact meaning on an Intel processor is different from that on an AMD processor. So why are these measurements useful? Because the ratios and values are correct when compared to values taken over the same interval / at the same instant, and the average values are what you expect them to be.
When profiling, look at correlations first over intervals and then between intervals. Afterwards, zoom in on more specific counters, such as cache misses or pipeline stalls.
You might check out the Intel optimization documentation. It's pretty good in my experience. I'll post a reference in the comment section if I can find the time.
PS: The "Core 4" and "Core 5 (logical)" labels above are really not accurate (not your fault). The names imply that the "logical" core is somehow inferior to the non-logical one. When a CPU is executing multiple hardware threads on one core -- what Intel in marketing speak calls hyperthreading -- there is no difference between Core 4 and Core 5: they behave identically on the physical core, meaning they are both "logical".
I have a single machine with 32 cores (2 processors) and 32 GB of RAM. I installed Grid Engine to submit jobs to the queues I created, but it seems jobs are running on all cores.
I wonder if there is a way to limit the cores and RAM for each job. For example, I have two queues, parallel.q and serial.q: I want to allocate 20 GB of RAM and 20 cores to serial.q, with each job using only one core and at most 1 GB of RAM, and 8 GB of RAM plus 8 cores going to a single parallel job, leaving 4 cores and 4 GB of RAM for other usage.
How can I configure my queues or Grid Engine to get this right? I tried to read the manual, but I don't have a clue.
Thanks!
I don't have a problem with parallel jobs. I have some serial jobs that call several different programs, and somehow the system assigns them all available cores. But I don't want all cores to be used by a job; rather, for example, only two cores should be available to each job. (Each job runs several programs sequentially, and the system allocates a core to each program.) By the way, I would like to have some idle cores available at all times for other work, like processing data. Is that possible, or necessary?
In fact, if I understand correctly, you want to partition a single machine into several sub-queues, is that right?
This may be problematic with SGE, because the host configuration lets you set the number of CPUs available on a given node. Then you create your queues and assign different hosts to different queues.
In your case, you should assign the same host to one master queue, and then add subordinate queues that can each use only a given number of slots (MAX_SLOTS).
But if I may ask one question: why should you partition it? If you set up only one queue and configure a parallel environment, then you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and the grid engine takes care of everything. I suggest you set up at least an OpenMP parallel environment, because you probably won't need MPI on a shared-memory machine like yours (it seems a great machine, by the way).
Another thing is that you must be able to configure your runs so that the code you are using can be restricted to a given number of CPUs; this is very important. In practice, you must give the simulation code the same number of CPUs that you requested from SGE. That number is available in the $NSLOTS variable of your qsub script.
I want to know what happens when all threads of a warp read the same 32-bit address in global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card and the programming environment is CUDA 4.0.
Also, can anybody explain the concept of bus utilization? What is the difference between a caching load and a non-caching load? I saw the concepts in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
All threads in a warp accessing the same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happens when all threads of a warp read the same 32-bit address in global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card and the programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory access by threads of a half warp can be coalesced to one transaction for word of size 8-bit, 16-bit, 32-bit, 64-bit or two transactions for 128-bit.
Global memory can be viewed as composing aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1: the k-th thread in a half warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3: coalescing for any pattern of access that fits into a segment size.
So, it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for, in anything >= Compute Capability 1.2. But not for 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
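For what it's worth, here is a minimal CUDA sketch of exactly that access pattern: every thread in every warp loads the same 32-bit word. On compute capability 1.2+ (and on Fermi, where global loads go through the cache hierarchy) this should coalesce into a single transaction per warp whose value is broadcast to all lanes; checking the profiler's memory transaction counters is the way to confirm it rather than taking my word for it.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every thread in every warp reads the same 32-bit word from global memory.
    __global__ void broadcastRead(const int *src, int *dst)
    {
        int v = src[0];   // identical address across the whole warp
        dst[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }

    int main()
    {
        const int n = 256;
        int *src, *dst, h = 42, out[n];

        cudaMalloc(&src, sizeof(int));
        cudaMalloc(&dst, n * sizeof(int));
        cudaMemcpy(src, &h, sizeof(int), cudaMemcpyHostToDevice);

        broadcastRead<<<n / 32, 32>>>(src, dst);   // one warp per block
        cudaMemcpy(out, dst, n * sizeof(int), cudaMemcpyDeviceToHost);

        printf("dst[0]=%d  dst[%d]=%d\n", out[0], n - 1, out[n - 1]);
        cudaFree(src);
        cudaFree(dst);
        return 0;
    }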
Difference between caching and non-caching loads
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
I.e. the slide titled "Differences between CPU & GPUs" - that says that CPUs have huge caches, and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begun to add more and more cache, and/or to convert more and more local memory into cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only provide a short answer.
Basically, a cache is like local memory. Both cache and local memory are closer to the processor or GPU than DRAM main memory, whether that is the GPU's private DRAM or the CPU's system memory. Nvidia calls DRAM main memory global memory. Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you only had one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so they have lower latency. Your slides do not say what it is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads and warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory are more than 5X faster. This higher bandwidth can translate to significantly higher framerates.
Power: you save a lot of power going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop or a tablet PC. Actually, it matters even to a desktop gaming PC, because less power means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program with a cache than with local memory. Very good, expert, ninja programmers are needed to manage local memory properly, copying data in from global memory as needed, and flushing it out. Whereas cache memory is much easier to manage, because you just do a cached load, and the data is put in cache automatically, where it will be accessed faster the next time around.
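To make that concrete, here is a rough CUDA sketch of the two styles for a simple 3-point averaging filter. (In CUDA terms, the on-chip scratchpad the slides call local memory is "shared memory"; the kernels below are illustrative, not tuned.)

    #include <cuda_runtime.h>

    // "Just do a cached load": each thread reads its neighbours straight from
    // global memory and lets the L1/L2 caches catch the reuse automatically.
    __global__ void smoothCached(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }

    // Hand-managed on-chip memory: the block stages its tile (plus a one-element
    // halo on each side) into shared memory once, then serves all reuse from there.
    __global__ void smoothShared(const float *in, float *out, int n)
    {
        extern __shared__ float tile[];                  // blockDim.x + 2 floats
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;

        if (i < n) tile[t] = in[i];
        if (threadIdx.x == 0 && i > 0)                  tile[0] = in[i - 1];
        if (threadIdx.x == blockDim.x - 1 && i < n - 1) tile[t + 1] = in[i + 1];
        __syncthreads();

        if (i > 0 && i < n - 1)
            out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }

A launch would pass the halo'd tile as dynamic shared memory, e.g. smoothShared<<<blocks, threads, (threads + 2) * sizeof(float)>>>(in, out, n); the cached version is launched as usual with no extra bookkeeping, which is exactly the "easier to program" point above.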
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and cache memories. However, in Fermi, the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are just wasteful.")
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all words in a cache line, great. But if a warp only accesses 1 word in a cache line, i.e. one 4-byte word, and then skips the other 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. I.e. more than 96% of the bus bandwidth is wasted. This is low bus utilization.
Whereas the very next slide shows that a non-caching load, such as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-caching loads could be 4X more efficient, 4X faster, than the cached loads.
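Here is a small CUDA sketch of the access pattern behind those numbers: the strided kernel touches one 4-byte word per 128-byte cache line, so with caching loads the whole line crosses the bus for 4 useful bytes, while a non-caching load (on Fermi, if I remember the switch correctly, compiling with nvcc -Xptxas -dlcm=cg turns global loads into 32-byte transactions) wastes "only" 28 of 32 bytes. The contiguous kernel is the well-behaved baseline.

    #include <cuda_runtime.h>

    // One 4-byte word used per 128-byte cache line: with caching loads the whole
    // 128-byte line is moved over the bus but only 4 bytes are consumed (~3%
    // bus utilization).
    __global__ void stridedRead(const int *src, int *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * 32 < n)
            dst[i] = src[i * 32];      // 32 ints = 128 bytes apart
    }

    // Baseline: consecutive threads read consecutive words, so every byte of each
    // 128-byte line fetched for the warp is actually used (full bus utilization).
    __global__ void contiguousRead(const int *src, int *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = src[i];
    }

Launching both over the same array and comparing the profiler's load-transaction and DRAM-read counters makes the wasted bandwidth of the strided version easy to see.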
Then why not use non-caching loads everywhere? Because they are harder to program - it requires expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-caching loads and hand-managed local memory, you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not work well for.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
Hope this helps.
In parallel computing, super-linear speedup is theoretically not possible, but in practice we do see such cases. One reason is the cache effect, but I fail to understand what role it plays. Also, there are other things involved, but what are they? In summary:
How are super-linear speedups possible?
I'm a beginner with respect to parallel computing.
Suppose you have an 8-processor machine, each processor has a 1 MB cache, and your computation uses 6 MB of data.
On 1 processor the computation has to move a lot of data between CPU, cache and RAM. On 8 processors the combined caches (8 MB) hold the whole 6 MB working set, so the computation only has to move data between CPU and cache. This way you can achieve super-linear speedup.
These figures and this analysis have been simplified for exposition for a beginner.
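If it helps, here is a toy back-of-the-envelope model of that effect in C++. The 1 MB-per-processor and 6 MB figures come from the answer above; the relative access costs of cache versus RAM are invented purely for illustration.

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const double dataMB = 6.0, cachePerProcMB = 1.0;
        const double cacheCost = 1.0, ramCost = 10.0;  // assumed relative access costs

        double t1 = 0.0;  // single-processor time, filled in on the first iteration
        for (int p = 1; p <= 8; p *= 2) {
            // Fraction of the working set that fits in the combined caches.
            double hit = std::min(1.0, p * cachePerProcMB / dataMB);
            // Each processor handles 1/p of the data, at a cost weighted by
            // where that data lives (cache vs. RAM).
            double t = (dataMB / p) * (hit * cacheCost + (1.0 - hit) * ramCost);
            if (p == 1) t1 = t;
            printf("p=%d  speedup=%.1f\n", p, t1 / t);
        }
        return 0;
    }

With these made-up costs the 8-processor case comes out far more than 8x faster, purely because the whole 6 MB now fits in the combined caches.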
In short, superlinear speedup is achieved when the total amount of work processors do is strictly less than the total work performed by a single processor.
This can happen in three ways:
The original sequential algorithm was really bad; running the parallel version of the algorithm on one processor will usually do away with the superlinear speedup.
The parallel algorithm uses some kind of search, like a random walk: the more processors that are walking, the less distance has to be walked in total before one of them reaches what you are looking for (see the toy sketch after this list).
Modern processors have faster and slower memories, and they try to keep the data you are using in the fast memory. We can safely assume your amount of data is larger than the amount of fast memory. If you use n processors you have n times the amount of fast memory. More data fits in the fast memory, which makes it possible to spend less time (and thus less total work) on the same task.
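As a toy illustration of the second point, consider searching an array for the first match when the match happens to sit just past the middle. A sequential left-to-right scan pays for half the array, while the worker whose chunk starts right at the match finds it almost immediately. The placement is deliberately contrived to show the effect, and the numbers are hypothetical.

    #include <cstdio>

    int main()
    {
        const int n = 1 << 20;
        const int target = n / 2 + 1;   // hypothetical position of the first match

        // Sequential cost: elements examined before the scan reaches the target.
        const int seqSteps = target + 1;

        for (int p = 2; p <= 8; p *= 2) {
            int chunk = n / p;
            // Each worker scans its own contiguous chunk in lockstep; the first hit
            // happens (target % chunk) + 1 steps into the chunk that contains it.
            int parSteps = target % chunk + 1;
            printf("p=%d  speedup to first hit: %.0fx (linear would be %dx)\n",
                   p, (double)seqSteps / parSteps, p);
        }
        return 0;
    }

Note that the total work here (elements examined by all workers before the first hit) is also far less than what the sequential scan does, which is exactly the "strictly less total work" condition above.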