I have implemented a kernel that process data where the input comes from an cudaTextureObject_t. To increase the throughput of my method, I call this kernel with N different stream objects. Therefore, I create N texture objects that are then passed to the different kernel calls.
This works perfectly well on GPUs with Kepler architecture. However, now I want to use this method also on a GPU with Fermi architecture, where no cudaTextureObject_t is available.
My question is as follows: Is there a way to make an abstraction based on texture references, or do I have to completely rewrite my code for the older architecture?
You will have to re-write your code. It isn't possible to encapsulate a texture reference inside a class or structure, nor pass a texture reference to a kernel.
Related
I have a buffer (array) on the host that should be resided in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
How can I initialize (populate) those arrays from values that are computed at the run time on the host?
I searched the web for this but there is no concise document documenting this. I would appreciate it if provided examples would be in both OpenCL and CUDA. The example for OpenCL is more important to me than CUDA.
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier which get mapped to constant memory pages at assembly. You could generate some code contain such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64kb. It seems much simpler (to me at least) to just statically declare a 64kb constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at the runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model), 64kb limit.
For cuda, I use driver API and NVRTC and create kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace ##-pattern words with size and element value information in run-time and compile like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So, it is run-time for the host, compile-time for the device. Downside is that the kernel string gets too big, has less readability and connecting classes needs explicitly linking (as if you are compiling a side C++ project) other compilation units. But for simple calculations with only your own implementations (no host-definitions used directly), it is same as runtime API.
Since large strings require extra parsing time, you can cache the ptx intermediate data and also cache the binary generated from ptx. Then you can check if kernel string has changed and needs to be re-compiled.
Are you sure just __constant__ worths the effort? Do you have some benchmark results to show that actually improves performance? (premature optimization is source of all evil). Perhaps your algorithm works with register-tiling and the source of data does not matter?
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
kernel void mykernel(
constant float* fixed_parameters,
global const uint* dynamic_input_data,
global uint* restrict output_data)
{
cl_mem fixed_parameter_buffer = clCreateBuffer(
cl_context,
CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used! It usually doesn't help to use constant memory buffers for streaming input data, this is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that are used by a large proportion of work-items. There is typically a fairly tight size limitation such as 64KiB on it - some implementations may "spill" to global memory if you try to exceed this, which will lose you any performance advantages you would gain from using constant memory.
I'm having trouble wrapping my head around the restrictions on CUDA constant memory.
Why can't we allocate __constant__ memory at runtime? Why do I need to compile in a fixed size variable with near-global scope?
When is constant memory actually loaded, or unloaded? I understand that cudaMemcpytoSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory? Related, is there a cost to binding, and unbinding similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
Where does constant memory reside on the chip?
I'm primarily interested in answers as they relate to Pascal and Volta.
It is probably easiest to answer these six questions in reverse order:
Where does constant memory reside on the chip?
It doesn't. Constant memory is stored in statically reserved physical memory off-chip and accessed via a per-SM cache. When the compiler can identify that a variable is stored in the logical constant memory space, it will emit specific PTX instructions which allow access to that static memory via the constant cache. Note also that there are specific reserved constant memory banks for storing kernel arguments on all currently supported architectures.
Is there a cost to binding, and unbinding similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
No. But there also isn't "binding" or "unbinding" because reservations are performed statically. The only runtime costs are host to device memory transfers and the cost of loading the symbols into the context as part of context establishment.
I understand that cudaMemcpytoSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory?
No. There is only one "allocation" for the entire GPU (although as noted above, there is specific constant memory banks for kernel arguments, so in some sense you could say that there is a per-kernel component of constant memory).
When is constant memory actually loaded, or unloaded?
It depends what you mean by "loaded" and "unloaded". Loading is really a two phase process -- firstly retrieve the symbol and load it into the context (if you use the runtime API this is done automagically) and secondly any user runtime operations to alter the contents of the constant memory via cudaMemcpytoSymbol.
Why do I need to compile in a fixed size variable with near-global scope?
As already noted, constant memory is basically a logical address space in the PTX memory hierarchy which is reflected by a finite size reserved area of the GPU DRAM map and which requires the compiler to emit specific instructions to access uniformly via a dedicated on chip cache or caches. Given its static, compiler analysis driven nature, it makes sense that its implementation in the language would also be primarily static.
Why can't we allocate __constant__ memory at runtime?
Primarily because NVIDIA have chosen not to expose it. But given all the constraints outlined above, I don't think it is an outrageously poor choice. Some of this might well be historic, as constant memory has been part of the CUDA design since the beginning. Almost all of the original features and functionality in the CUDA design map to hardware features which existed for the hardware's first purpose, which was the graphics APIs the GPUs were designed to support. So some of what you are asking about might well be tied to historical features or limitations of either OpenGL or Direct 3D, but I am not familiar enough with either to say for sure.
all tutorials and introductional material for GPGPU/Cuda often use flat arrays, however I'm trying to port a piece of code which uses somewhat more sophisticated objects compared to an array.
I have a 3-dimensional std::vector whose data I want to have on the GPU. Which strategies are there to get this on the GPU?
I can think of 1 for now:
copy the vector's data on the host to a more simplistic structure like an array. However this seems wasteful because 1) I have to copy data and then send to the GPU; and 2) I have to allocate a 3-dimensional array whose dimensions are the max of the the element count in any of the vectors e.g. using a 2D vector
imagine {{1, 2, 3, 4, .. 1000}, {1}}, In the host memory these are roughly ~1001 allocated items, whereas if I were to copy this to a 2 dimensional array, I would have to allocate 1000*1000 elements.
Are there better strategies?
There are many methodologies for refactoring data to suit GPU computation, one of the challenges being copying data between device and host, the other challenge being representation of data (and also algorithm design) on the GPU to yield efficient use of memory bandwidth. I'll highlight 3 general approaches, focusing on ease of copying data between host and device.
Since you mention std::vector, you might take a look at thrust which has vector container representations that are compatible with GPU computing. However thrust won't conveniently handle vectors of vectors AFAIK, which is what I interpret to be your "3D std::vector" nomenclature. So some (non-trivial) refactoring will still be involved. And thrust still doesn't let you use a vector directly in ordinary CUDA device code, although the data they contain is usable.
You could manually refactor the vector of vectors into flat (1D) arrays. You'll need one array for the data elements (length = total number of elements contained in your "3D" std::vector), plus one or more additional (1D) vectors to store the start (and implicitly the end) points of each individual sub-vector. Yes, folks will say this is inefficient because it involves indirection or pointer chasing, but so does the use of vector containers on the host. I would suggest that getting your algorithm working first is more important than worrying about one level of indirection in some aspects of your data access.
as you point out, the "deep-copy" issue with CUDA can be a tedious one. It's pretty new, but you might want to take a look at Unified Memory, which is available on 64-bit windows and linux platforms, under CUDA 6, with a Kepler (cc 3.0) or newer GPU. With C++ especially, UM can be very powerful because we can extend operators like new under the hood and provide almost seamless usage of UM for shared host/device allocations.
Can someone give a clear explanation of how the new and delete keywords would behave if called from __device__ or __global__ code in CUDA 4.2?
Where does the memory get allocated, if its on the device is it local or global?
It terms of context of the problem I am trying to create neural networks on the GPU, I want a linked representation (Like a linked list, but each neuron stores a linked list of connections that hold weights, and pointers to the other neurons), I know I could allocate using cudaMalloc before the kernel launch but I want the kernel to control how and when the networks are created.
Thanks!
C++ new and delete operate on device heap memory. The device allows for a portion of the global (i.e. on-board) memory to be allocated in this fashion. new and delete work in a similar fashion to device malloc and free.
You can adjust the amount of device global memory available for the heap using a runtime API call.
You may also be interested in the C++ new/delete sample code.
CC 2.0 or greater is required for these capabilities.
I'm using Thrust in my current project so I don't have to write a device_vector abstraction or (segmented) scan kernels myself.
So far I have done all of my work using thrust abstractions, but for simple kernels or kernels that don't translate easily to the for_each or transform abstractions I'd prefer at some point to write my own kernels instead.
So my question is: Can I through Thrust (or perhaps CUDA) ask which device is currently being used and what properties it has (max block size, max shared memory, all that stuff)?
If I can't get the current device, is there then some way for me to get thrust to calculate the kernel dimensions if I provide the kernel registers and shared memory requirements?
You can query the current device with CUDA. See the CUDA documentation on device management. Look for cudaGetDevice(), cudaSetDevice(), cudaGetDeviceProperties(), etc.
Thrust has no notion of device management currently. I'm not sure what you mean by "get thrust to calculate the kernel dimensions", but if you are looking to determine grid dimensions for launching your custom kernel, then you need to do that on your own. It can help to query the properties of the kernel with cudaFuncGetAttributes(), which is what Thrust uses.