According to most NVIDIA documentation, CUDA cores are scalar processors and should only execute scalar operations, which get vectorized into 32-component SIMT warps.
But OpenCL has vector types, for example uchar8. It has the same size as ulong (64 bits), which can be processed by a single scalar core. If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
If there are 1024 work items in a block (work group), and each work item processes a uchar8, will this effectively process 1024 x 8 = 8192 uchar values in parallel?
Edit:
My question was if on CUDA architectures specifically (independently of OpenCL), there are some vector instructions available in "scalar" cores. Because if the core is already capable of handling a 32-bit type, it would be reasonable if it can also handle addition of a 32-bit uchar4 for example, especially since vector operations are often used in computer graphics.
CUDA has "built-in" (i.e. predefined) vector types up to a size of 4 for 4-byte quantities (e.g. int4) and up to a size of 2 for 8-byte quantities (e.g. double2). A CUDA thread has a maximum read/write transaction size of 16 bytes, so these particular size choices tend to line up with that maximum.
These are exposed as typical structures, so you can reference for example .x to access just the first element of a vector type.
Unlike OpenCL, CUDA does not provide built-in operations ("overloads") for basic arithmetic e.g. +, -, etc. for element-wise operations on these vector types. There's no particular reason you couldn't provide such overloads yourself. Likewise, if you wanted a uchar8 you could easily provide a structure definition for such, as well as any desired operator overloads. These could probably be implemented just as you would expect for ordinary C++ code.
Probably an underlying question is, then, what is the difference in implementation between CUDA and OpenCL in this regard? If I operate on a uchar8, e.g.
uchar8 v1 = {...};
uchar8 v2 = {...};
uchar8 r = v1 + v2;
what will the difference be in terms of machine performance (or low-level code generation) between OpenCL and CUDA?
Probably not much, for a CUDA-capable GPU. A CUDA core (i.e. the underlying ALU) does not have direct native support for such an operation on a uchar8, and furthermore, if you write your own C++ compliant overload, you're probably going to use C++ semantics for this which will inherently be serial:
r.x = v1.x + v2.x;
r.y = v1.y + v2.y;
...
So this will decompose into a sequence of operations performed on the CUDA core (or in the appropriate integer unit within the CUDA SM). Since the NVIDIA GPU hardware doesn't provide any direct support for an 8-way uchar add within a single core/clock/instruction, there's really no way OpenCL (as implemented on a NVIDIA GPU) could be much different. At a low level, the underlying machine code is going to be a sequence of operations, not a single instruction.
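The serial, C++-style overload described above can be sketched as follows. This is a host-only illustration, not CUDA's or OpenCL's implementation: the type name my_uchar8 and its array member s are hypothetical (CUDA has no built-in uchar8), and in real CUDA code the operator would additionally be marked __host__ __device__.

```cpp
#include <cstdint>

// Hypothetical user-defined 8-wide byte vector (assumption: not a CUDA type).
struct my_uchar8 {
    std::uint8_t s[8];
};

// Element-wise add with unsigned wraparound. Each lane is a separate scalar
// operation, which is why this decomposes into a sequence of instructions.
inline my_uchar8 operator+(const my_uchar8& a, const my_uchar8& b) {
    my_uchar8 r{};
    for (int i = 0; i < 8; ++i) {
        r.s[i] = static_cast<std::uint8_t>(a.s[i] + b.s[i]);
    }
    return r;
}
```

With this in place, r = v1 + v2 compiles for the hypothetical type, but a compiler is still free to emit one add per lane.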
As an aside, CUDA (or PTX, or CUDA intrinsics) does provide for a limited amount of vector operations within a single core/thread/instruction. Some examples of this are:
a limited set of "native" "video" SIMD instructions. These instructions are per-thread, so if used, they allow for "native" support of up to 4x32 = 128 (8-bit) operands per warp, although the operands must be properly packed into 32-bit registers. You can access these from C++ directly via a set of built-in intrinsics. (A CUDA warp is a set of 32 threads, and is the fundamental unit of lockstep parallel execution and scheduling on a CUDA capable GPU.)
a vector (SIMD) multiply-accumulate operation, which is not directly translatable to a single particular elementwise operation overload, the so-called int8 dp2a and dp4a instructions. int8 here is somewhat misleading. It does not refer to an int8 vector type but rather a packed arrangement of 4 8-bit integer quantities in a single 32-bit word/register. Again, these are accessible via intrinsics.
16-bit floating point is natively supported via the half2 vector type on GPUs of compute capability 5.3 and higher, for certain operations.
The new Volta Tensor Core is something vaguely like a SIMD-per-thread operation, but it operates (warp-wide) on a set of 16x16 input matrices producing a 16x16 matrix result.
Even with a smart OpenCL compiler that could map certain vector operations into the various operations "natively" supported by the hardware, it would not be complete coverage. There is no operational support for an 8-wide vector (e.g. uchar8) on a single core/thread, in a single instruction, to pick one example. So some serialization would be necessary. In practice, I don't think the OpenCL compiler from NVIDIA is that smart, so my expectation is that you would find such per-thread vector operations fully serialized, if you studied the machine code.
In CUDA, you could provide your own overload for certain operations and vector types, that could be represented approximately in a single instruction. For example a uchar4 add could be performed "natively" with the __vadd4() intrinsic (perhaps included in your implementation of an operator overload.) Likewise, if you are writing your own operator overload, I don't think it would be difficult to perform a uchar8 elementwise vector add using two __vadd4() instructions.
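To make the "two packed 4-byte adds" idea concrete, here is a portable host-side sketch. It does not call __vadd4(); instead the hypothetical helper packed_add4 emulates a wrapping per-byte add on a 32-bit word with ordinary integer operations, so the example can be compiled without a GPU. In device code the intrinsic would presumably take its place. The type packed_uchar8 is also an assumption made for illustration.

```cpp
#include <cstdint>

// Emulates a wrapping add on four packed 8-bit lanes (stand-in for __vadd4).
inline std::uint32_t packed_add4(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7f7f7f7fu;
    const std::uint32_t high = 0x80808080u;
    // Add the low 7 bits of every byte; carries stay inside their own byte.
    const std::uint32_t sum = (a & low7) + (b & low7);
    // Fold in the top bit of every byte without letting its carry escape.
    return sum ^ ((a ^ b) & high);
}

// Hypothetical 8-byte vector stored as two packed 32-bit words.
struct packed_uchar8 {
    std::uint32_t lo;
    std::uint32_t hi;
};

// An 8-wide byte add expressed as two 4-wide packed adds.
inline packed_uchar8 operator+(packed_uchar8 a, packed_uchar8 b) {
    return { packed_add4(a.lo, b.lo), packed_add4(a.hi, b.hi) };
}
```

The design point is the data layout: the packed add only pays off if the eight bytes already live in two 32-bit registers, which a struct of eight separate members does not guarantee.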
If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
AFAIK it'll always be on a single core (instructions from a single kernel / work item don't cross cores, except special instructions like barriers), but it may be more than one instruction. This depends on whether your hardware supports operations on uchar8 natively. If it does not, then the uchar8 will be broken up into as many pieces as required, and each piece will be processed with a separate instruction.
OpenCL is very "generic" in the sense that it supports many different vector type/size combos, but real-world hardware usually only implements some vector type/size combinations. You can query OpenCL devices for the "preferred vector width" (for example CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR via clGetDeviceInfo), which should tell you what's the most efficient for that hardware.
I have implemented a kernel that processes data where the input comes from a cudaTextureObject_t. To increase the throughput of my method, I call this kernel with N different stream objects. Therefore, I create N texture objects that are then passed to the different kernel calls.
This works perfectly well on GPUs with Kepler architecture. However, now I want to use this method also on a GPU with Fermi architecture, where no cudaTextureObject_t is available.
My question is as follows: Is there a way to make an abstraction based on texture references, or do I have to completely rewrite my code for the older architecture?
You will have to re-write your code. It isn't possible to encapsulate a texture reference inside a class or structure, nor pass a texture reference to a kernel.
I know that non-POD types can't be passed as parameters to CUDA kernel launches in general.
But where can I find an explanation for this? I mean a reliable source like a book, a CUDA manual, etc.
The entire premise of this question is incorrect. CUDA kernel arguments are not limited to POD types.
You are free to pass any complete type as an argument, either by reference or by value. There is a limit of 256 bytes or 4 KB for the total size of the argument list, depending on which architecture you compile for, but that is the only restriction on kernel arguments. When passing an instance of a class to a CUDA kernel, there are a number of simple restrictions which you must follow, including:
Any pointers in the class instance which the device code will dereference must be valid device pointers
Any member functions in the class which the device code will call must be valid __device__ functions
It is illegal to pass a class containing virtual functions or derived from a virtual base type as a kernel argument
Classes which access namespace anonymous unions are not supported in device code
All of the features and limitations of C++ support in CUDA kernel code are described in the CUDA Programming Guide, a copy of which ships in every version of the CUDA toolkit. All you need to do is read it.
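Some of the restrictions listed above can be checked on the host with standard type traits. The sketch below is only an approximation and deliberately stricter than what CUDA requires: the helper name looks_like_kernel_argument and the two example structs are hypothetical, requiring trivial copyability is my own conservative assumption, and the 4 KB figure is the larger of the two limits mentioned above. Pointer validity and __device__ qualifiers cannot be checked this way.

```cpp
#include <type_traits>

// Hypothetical conservative check for a type passed by value to a kernel.
template <typename T>
constexpr bool looks_like_kernel_argument() {
    return !std::is_polymorphic<T>::value        // rejects virtual functions
        && std::is_trivially_copyable<T>::value  // assumption: bytewise copy
        && sizeof(T) <= 4096;                    // 4 KB argument-list limit
}

// A non-POD class with a member function: accepted by the check.
struct Params {
    float* data;
    int n;
    float scale() const { return 2.0f; }
};

// A class with a virtual function: rejected by the check.
struct WithVirtual {
    virtual void f() {}
    int n;
};
```

A static_assert(looks_like_kernel_argument<Params>(), "...") next to the kernel launch would turn an accidental virtual function into a compile-time error.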
How do I use cudaMemcpy with vectors in C++? My code works fine for arrays, but vectors don't seem to be supported. Any idea how to support vectors in CUDA?
The short answer is that you can't, using just the basic CUDA APIs.
If you want to use STL containers with CUDA, you should look at the Thrust template library, which provides an STL-like interface to the GPU and a number of useful GPU algorithms to operate on data within container types.
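One narrower case may still be worth sketching: if the goal is only to move the contents of a host-side std::vector, its storage is contiguous, so the pointer returned by data() and a byte count can be handed to a raw copy routine. The container itself still cannot be used in device code. In the host-only example below, copy_bytes and upload are hypothetical names, and std::memcpy stands in for cudaMemcpy so the code compiles without the CUDA toolkit; a real program would pass a pointer obtained from cudaMalloc as the destination.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice).
inline void copy_bytes(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// Copies the elements of a host vector into a raw buffer that must hold
// at least host.size() floats.
inline void upload(const std::vector<float>& host, float* raw_buffer) {
    copy_bytes(raw_buffer, host.data(), host.size() * sizeof(float));
}
```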
I'm using texture memory for image filtering in CUDA as:
texture<unsigned char> texMem; // declaration
cudaBindTexture( NULL, texMem,d_inputImage,imageSize); //binding
However I'm not satisfied with the results at the boundary. Is there any other considerations or settings for texture memory tailored for 2D filtering?
I've seen people declare a texture this way:
texture<float> texMem(0,cudaFilterModeLinear);
// what does this do?
Moreover, if anyone can suggest an online guide explaining how to properly set up the texture memory abstraction in CUDA, that would be helpful. Thanks.
You can specify what kind of sampling you want using cudaFilterMode (cudaFilterModePoint for nearest-texel reads or cudaFilterModeLinear for linear interpolation).
You could look at Appendix G from the CUDA_C_Programming_Guide.pdf provided in path/to/cudatoolkit/doc to see this explained in detail.
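As a rough illustration of what linear filtering computes, here is a host-only emulation of a one-dimensional fetch. It follows the interpolation formula as I recall it from the texture-fetching appendix of the programming guide, tex(x) = (1 - a) * T[i] + a * T[i + 1] with i = floor(x - 0.5) and a = frac(x - 0.5). The function names are hypothetical, unnormalized coordinates and clamp addressing at the boundary are assumptions, and the reduced fixed-point precision the hardware uses for the weight is ignored.

```cpp
#include <cmath>
#include <vector>

// Clamp addressing (assumption): out-of-range indices read the edge texel.
inline float fetch_clamped(const std::vector<float>& t, int i) {
    const int n = static_cast<int>(t.size());
    if (i < 0) i = 0;
    if (i > n - 1) i = n - 1;
    return t[i];
}

// Emulates a linearly filtered 1D fetch at unnormalized coordinate x.
inline float sample_linear(const std::vector<float>& t, float x) {
    const float xb = x - 0.5f;           // shift so texel centers sit at i + 0.5
    const float fl = std::floor(xb);
    const int i = static_cast<int>(fl);  // left neighbor index
    const float a = xb - fl;             // interpolation weight in [0, 1)
    return (1.0f - a) * fetch_clamped(t, i) + a * fetch_clamped(t, i + 1);
}
```

Near the boundary the result depends entirely on how out-of-range texels are resolved, which is one plausible source of unsatisfying edge results in a filter.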