Does Caffe Scale up to multiple CPU Cores? - deep-learning

I wish to run Caffe on a 32-core machine.
Does Caffe scale up to the available number of cores to utilize them best?
Even though there are 32 cores, can I make Caffe use only a selected number of them?

Generally, Caffe doesn't support multiple CPUs/cores in its own source code, but it relies on BLAS routines.
Thus the answers to your questions are the following:
Yes, but only through the BLAS configuration, i.e. your BLAS version should be compiled with multithreading support (see related discussions: here or here - at the second link you can also find some modifications for Caffe itself).
Also through BLAS: if it was compiled with OpenMP support, you can set OMP_NUM_THREADS to the desired value (see the sketch below).
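For illustration, here is a minimal standalone C++ sketch (not Caffe code) of the kind of GEMM call that Caffe's convolution and inner-product layers boil down to. It assumes an OpenBLAS build compiled with OpenMP; the openblas_set_num_threads()/openblas_get_num_threads() helpers and the build command are OpenBLAS-specific assumptions and will differ for MKL or ATLAS:

// Hypothetical sketch: a single SGEMM run on however many threads the BLAS
// library was told to use. With an OpenMP-enabled OpenBLAS, either the call
// below or the OMP_NUM_THREADS environment variable caps the cores used.
// Assumed build: g++ gemm_threads.cpp -lopenblas -fopenmp -o gemm_threads
#include <cblas.h>
#include <cstdio>
#include <vector>

int main() {
    // OpenBLAS-specific helpers; MKL/ATLAS builds expose different knobs.
    openblas_set_num_threads(4);                 // same effect as OMP_NUM_THREADS=4
    printf("BLAS threads: %d\n", openblas_get_num_threads());

    const int n = 512;
    std::vector<float> a(n * n, 1.0f), b(n * n, 1.0f), c(n * n, 0.0f);
    // C = 1.0 * A * B + 0.0 * C; the threading happens inside the library.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a.data(), n, b.data(), n, 0.0f, c.data(), n);
    printf("c[0] = %.0f (expected %d)\n", c[0], n);
    return 0;
}

Running the same binary with OMP_NUM_THREADS=8 (and the set call removed) is what the answer refers to: the multithreading lives entirely inside the BLAS library, not in Caffe itself.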

Caffe itself does not, but you can use Intel Caffe, which is optimized for CPUs and supports multi-node training:
https://github.com/intel/caffe/wiki/Multinode-guide

Related

How to make Intel GPU available for processing through pytorch?

I'm using a laptop which has Intel Corporation HD Graphics 520.
Does anyone know how to set it up for deep learning, specifically PyTorch? I have seen that if you have NVIDIA graphics you can install CUDA, but what do you do when you have an Intel GPU?
PyTorch doesn't support anything other than NVIDIA CUDA and, lately, AMD ROCm.
Intel's support for PyTorch mentioned in the other answers is exclusive to the Xeon line of processors, and it is not that scalable with regard to GPUs either.
Intel's oneAPI (formerly known as oneDNN), however, has support for a wide range of hardware, including Intel's integrated graphics, but at the moment the full support is not yet implemented in PyTorch (as of 10/29/2020, i.e. PyTorch 1.7).
But you still have other options. For inference you have a couple of options.
DirectML is one of them: basically you convert your model to ONNX and then use the DirectML provider to run it on the GPU (which in this case uses DirectX 12 and works only on Windows for now!).
Your other options are OpenVINO and TVM, both of which support multiple platforms including Linux, Windows, and Mac.
OpenVINO and TVM consume ONNX models, so you first need to convert your model to ONNX format and then use them.
Lately (as of 2023), IREE (the Intermediate Representation Execution Environment), via torch-mlir in this case, can be used as well.
Intel provides optimized libraries for Deep and Machine Learning if you are using one of their later processors. A starting point would be this post, which is about getting started with Intel optimization of PyTorch. They provide more information about this in their AI workshops.

Using atomic arithmetic operations in CUDA Unified Memory multi-GPU or multi-processor

I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays and sometimes they need to be updated atomically.
The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms.
Question: cuda atomicAdd example fails to yield correct output
I have 4 Tesla K20s, if you need this information, and all of them update parts of those arrays, which must be done atomically.
I would appreciate any help/recommendations.
To summarize comments into an answer:
You can perform this sort of address-space-wide atomic operation using atomicAdd_system.
However, you can only do this on compute capability 6.x or newer devices (7.2 or newer if using Tegra).
Specifically, this means you have to compile for the correct compute capability, such as -arch=sm_60 or similar.
You state in the question that you are using Tesla K20 cards -- these are compute capability 3.5 and do not support any of the system-wide atomic functions.
As always, this information is neatly summarized in the relevant section of the Programming Guide.
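As a hedged illustration only (not the asker's code), here is a minimal sketch of atomicAdd_system on a managed counter updated from every visible GPU. It assumes Pascal-or-newer devices (compute capability 6.0+) with concurrent managed access, so it would not run on the K20s mentioned in the question:

// Minimal sketch: one counter in Unified Memory, bumped atomically from all GPUs.
// Assumed build: nvcc -arch=sm_60 system_atomic.cu -o system_atomic
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bump(int *counter, int iters) {
    for (int i = 0; i < iters; ++i)
        atomicAdd_system(counter, 1);   // system-wide scope: coherent across GPUs and CPU
}

int main() {
    int *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {    // launch one grid per device
        cudaSetDevice(d);
        bump<<<32, 256>>>(counter, 10);
    }
    for (int d = 0; d < ndev; ++d) {    // wait for every device to finish
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    printf("counter = %d (expected %d)\n", *counter, ndev * 32 * 256 * 10);
    cudaFree(counter);
    return 0;
}

On compute capability 3.5 parts such as the K20, a plain atomicAdd() is only guaranteed to be atomic with respect to the GPU issuing it, so each GPU would have to own a disjoint region of the arrays instead.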

Can pylearn2 and Theano run on AMD GPU based platform?

I'd like to use pylearn2, Theano, and scikit-neuralnetwork to build neural network models. But my friend told me that all these modules can only run on NVIDIA GPU based platforms (because they import the pycuda module). I only have an AMD GPU (an R9 270, with an AMD FX-8300 CPU), and I wish to take advantage of the AMD GPU to speed up computing. Can I use all of the modules mentioned above? Or are there any substitutes I can use to build neural network models? Thanks!
Currently, Theano only supports nvidia GPUs. There is a partial implementation of an OpenCL backend that would support AMD GPUs but it is incomplete and unsupported.
scikit-neuralnetwork builds on PyLearn2 and PyLearn2 builds on Theano so none of those packages can operate on AMD GPUs.
Torch appears to already have some OpenCL support. Caffe's OpenCL support appears to be under development.

Multi GPU vs GPU cluster

I am using CUDA programming for effective and fast computation, and during my study I found that multi-GPU machines and GPU clusters are other means for even more effective calculation, but I am confused about these two terms.
What is the actual difference between the two in terms of CUDA programming?
I assume you mean a single PC with multiple GPUs versus many PCs with one GPU each (a cluster).
If that is the case, for a multi-GPU PC you can use the CUDA runtime directly: each GPU is addressed with cudaSetDevice(), and where the hardware supports it you can enable peer-to-peer access between devices for faster GPU-to-GPU transfers (an SLI bridge is a graphics feature and is not what CUDA relies on).
If you want to use a cluster with GPUs, you may use CUDA-aware MPI, which combines the MPI standard with the CUDA library. I suggest you check this blog post: https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
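For the cluster case, here is a hedged sketch of what CUDA-aware MPI looks like in practice. The one-rank-per-GPU layout, the file name, and the build/run command are assumptions, and the MPI library must actually be built with CUDA support for device pointers to be legal in MPI calls:

// Hedged sketch: two or more MPI ranks, one GPU each; with a CUDA-aware MPI
// build, device pointers can be passed straight to MPI_Send/MPI_Recv.
// Assumed build/run: nvcc -ccbin mpicxx pingpong.cu -o pingpong && mpirun -np 2 ./pingpong
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndev = 0;                        // bind each rank to one GPU on its node
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);

    const int n = 1 << 20;
    float *buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    if (size >= 2) {                     // rank 0 sends its device buffer to rank 1
        if (rank == 0)
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    if (rank == 0) printf("transfer issued across %d ranks\n", size);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}

The same source also covers the single-PC multi-GPU case if you simply launch several ranks on one node; the only difference is whether the MPI transfer crosses a network or stays on the local PCIe/NVLink fabric.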

differences between virtual and real architecture of cuda

I am trying to understand the differences between the virtual and real architectures of CUDA, and how the different configurations affect the performance of the program, e.g.
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
...
The following explanation is given in the NVCC manual:
GPU compilation is performed via an intermediate representation, PTX
([...]), which can
be considered as assembly for a virtual GPU architecture. Contrary to an actual graphics
processor, such a virtual GPU is defined entirely by the set of capabilities, or features,
that it provides to the application. In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
Hence, a nvcc compilation command always uses two architectures: a compute
architecture to specify the virtual intermediate architecture, plus a real GPU architecture
to specify the intended processor to execute on. For such an nvcc command to be valid,
the real architecture must be an implementation (someway or another) of the virtual
architecture. This is further explained below.
The chosen virtual architecture is more of a statement on the GPU capabilities that
the application requires: using a smallest virtual architecture still allows a widest range
of actual architectures for the second nvcc stage. Conversely, specifying a virtual
architecture that provides features unused by the application unnecessarily restricts the
set of possible GPUs that can be specified in the second nvcc stage.
But I still don't quite get how performance will be affected by different configurations (or maybe they only affect the selection of the physical GPU devices?). In particular, this statement is the most confusing to me:
In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
The NVIDIA CUDA Compiler Driver NVCC User Guide Section on GPU Compilation provides a very thorough description of virtual and physical architecture and how the concepts are used in the build process.
The virtual architecture specifies the feature set that is targeted by the code. The table listed below shows some of the evolution of the virtual architecture. When compiling you should specify the lowest virtual architecture that has a sufficient feature set to enable the program to be executed on the widest range of physical architectures.
Virtual Architecture Feature List (from the User Guide)
compute_10 Basic features
compute_11 + atomic memory operations on global memory
compute_12 + atomic memory operations on shared memory
+ vote instructions
compute_13 + double precision floating point support
compute_20 + Fermi support
compute_30 + Kepler support
The physical architecture specifies the implementation of the GPU. This provides the compiler with the instruction set, instruction latency, instruction throughput, resource sizes, etc. so that the compiler can optimally translate the virtual architecture to binary code.
It is possible to specify multiple virtual and physical architecture pairs to the compiler and have the compiler pack the final PTX and binary code into a single fat binary. At runtime the CUDA driver will choose the best representation for the physical device that is installed. If no suitable binary code is provided in the fatbinary file, the driver can JIT-compile the best available PTX implementation.
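To make the build side concrete, here is a hedged sketch of a trivial kernel together with (in comments) one possible multi-architecture nvcc invocation. The specific compute capabilities are illustrative only, chosen to mirror the flags in the question:

// Sketch: build SASS for two real architectures plus PTX for a virtual
// architecture, so GPUs unknown at build time can JIT-compile the PTX.
// Example invocation (architecture numbers are examples only):
//   nvcc saxpy.cu -o saxpy \
//        -gencode arch=compute_20,code=sm_20 \
//        -gencode arch=compute_20,code=sm_21 \
//        -gencode arch=compute_30,code=compute_30
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    // If the fat binary contains neither matching SASS nor loadable PTX for the
    // installed GPU, the launch fails with a "no kernel image is available" error.
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x);
    cudaFree(y);
    return 0;
}

The code=compute_30 entry is what embeds the PTX itself into the fat binary for later JIT compilation, while the code=sm_XX entries embed pre-built device binaries.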
"Virtual architecture" code will get compiled by a just-in-time compiler before being loaded on the device. AFAIK, it is the same compiler as the one NVCC invokes when building "physical architecture" code offline - so I don't know if there will be any differences in the resulting application performance.
Basically, every generation of CUDA hardware is binary-incompatible with the previous generation - imagine the next generation of Intel processors sporting the ARM instruction set. This way, virtual architectures provide an intermediate representation of the CUDA application that can be compiled for compatible hardware. Every hardware generation introduces new features (e.g. atomics, CUDA Dynamic Parallelism) that require new instructions - that's why you need new virtual architectures.
Basically, if you want to use CDP you should compile for SM 3.5. You can compile to a device binary that contains assembly code for a specific CUDA device generation, or you can compile to PTX code that can be compiled into device assembly for any device generation that provides these features.
The virtual architecture specifies what capabilities a GPU has and the real architecture specifies how it does it.
I can't think of any specific examples off hand. A (probably not correct) example may be a virtual GPU specifying the number of cores a card has, so code is generated targeting that number of cores, whereas the real card may have a few more for redundancy (or a few less due to manufacturing errors) and some methods of mapping to the cores that are actually in use, which can be placed on top of the more generic code generated in the first step.
You can think of the PTX code sort of like assembly code, which targets a certain architecture, which can then be compiled to machine code for a specific processor. Targeting the assembly code for the right kind of processor will, in general, generate better machine code.
Well, usually what NVIDIA writes as documentation causes people (including myself) to become more confused (maybe just me!).
You are concerned with performance; basically what this says is don't be (probably), but you should be. Basically the GPU architecture is like nature: they run something on it and something happens, then they try to explain it, and then they feed it to you.
In the end you should probably run some tests and see which configuration gives the best result.
The virtual architecture is designed to let you think freely: use as many threads as you want, assign virtually any number of threads and blocks, it doesn't matter, it will be translated to PTX and the device will run it.
The only problem is that if you assign more than 1024 threads to a single block, the kernel launch fails because the device (the real architecture) doesn't support it, and if you don't check for errors you just get 0s back in your output.
Or, for example, if your device only supports compute capability 1.2, you can still declare double-precision variables in your code, but they will be demoted to single precision because the device simply can't run real double-precision math, so the results will not be what you expect.
Performance-wise, you have to know that the 32 threads of a warp should access neighbouring memory locations (coalesced access); otherwise the accesses are split into multiple transactions and effectively serialized, and so on.
So I hope you get the point by now. It is a relatively new field, and the GPU is a really sophisticated piece of hardware; everybody is trying to make the best of it, but it's a game of testing plus a little knowledge of the actual architecture behind CUDA. I suggest you search for GPU architecture and see how virtual threads and thread blocks are actually implemented.
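As a small, assumed illustration of the coalescing point above (not tied to any particular answer's code), the two kernels below perform the same copy, but the second forces each warp to touch scattered locations:

// Sketch: coalesced vs. scattered global loads. Consecutive threads of a warp
// reading consecutive floats can be serviced by a few memory transactions; the
// strided pattern spreads one warp's reads over many separate cache lines.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // a warp reads 32 adjacent floats
}

__global__ void copy_scattered(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];      // a warp reads 32 widely separated floats
}

int main() {
    const int n = 1 << 22;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copy_scattered<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Timing the two launches (for example with nvprof or cudaEvent timers) is exactly the kind of simple test this answer recommends running.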