I am trying to get a better grasp of memory management in cuda. There is Something that is just now occurring to me as a major lack of understanding. How do kernels access values that, as I understand it, should be in host memory.
When vectorAdd() is called, it runs the function on the device. But only the elements are stored on the device memory. the length of the vectors are stored on the host. How is it that the kernel does not exit with an error from trying to access foo.length, something that should be on the host.
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct{
float *elements;
int length;
}vector;
__global__ void vectorAdd(vector foo, vector bar){
int idx = threadIdx.x + blockDim.x * blockId.x.x;
if(idx < foo.length){ //this is the part that I do not understand
foo.elements[idx] += bar.elements[idx];
}
}
int main(void){
vector foo, bar;
foo.length = bar.length = 50;
cudaMalloc(&(foo.elements), sizeof(float)*50);
cudaMalloc(&(bar.elements), sizeof(float)*50);
//these vectors are empty, so adding is just a 0.0 += 0.0
int blocks_per_grid = 10;
int threads_per_block = 5;
vectorAdd<<<blocks_per_grid, threads_per_block>>>(foo, bar);
return 0;
}
In C and C++, a typical mechanism for making arguments available to the body of a function call is pass-by-value. The basic idea is that a separate copy of the arguments are made, for use by the function.
CUDA claims compliance to C++ (subject to various limitations), and it therefore provides a mechanism for pass-by-value. On a kernel call, the CUDA compiler and runtime will make copies of the arguments, for use by the function (kernel). In the case of a kernel call, these copies are placed in a particular area of __constant__ memory which is in the GPU and within GPU memory space, and therefore "accessible" to device code.
So, in your example, the entire structures passed as the arguments for the parameters vector foo, vector bar are copied to GPU device memory (specifically, constant memory) by the CUDA runtime. The CUDA device code is structured in such a way by the compiler to access these arguments as needed directly from constant memory.
Since those structures contain both the elements pointer and the scalar quantity length, both items are accessible in CUDA device code, and the compiler will structure references to them (e.g. foo.length) so as to retrieve the needed quantities from constant memory.
So the kernels are not accessing host memory in your example. The pass-by-value mechanism makes the quantities available to device code, in GPU constant memory.
Related
I'm doing a thrust transform_reduce and need to access a thrust::device_vector from within the functor. I am not iterating on the device_vector. It allows me to declare the functor, passing in the device_vector reference, but won't let me dereference it, either with begin() or operator[].
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include\thrust/detail/function.h(187): warning : calling a host function("thrust::detail::vector_base > ::operator []") from a host device function("thrust::detail::host_device_function ::operator () ") is not allowed
I assume I'll be able to pass in the base pointer and do the pointer math myself, but is there a reason this isn't supported?
Just expanding on what #JaredHoberock has already indicated. I think he will not mind.
A functor usable by thrust must (for the most part) conform to the requirements imposed on any CUDA device code.
Both thrust::host_vector and thrust::device_vector are host code containers used to manipulate host data and device data respectively. A reference to the host code container cannot be used successfully in device code. This means even if you passed a reference to the container successfully, you could not use it (i.e. could not do .push_back(), for example) in device code.
For direct manipulation in device code (such as functors, or kernels) you must extract raw device pointers from thrust and use those directly, with your own pointer arithmetic. And advanced functions (such as .push_back()) will not be available.
There are a variety of ways to extract the raw device pointer corresponding to thrust data, and the following example code demonstrates two possibilities:
$ cat t651.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
__global__ void printkernel(float *data){
printf("data = %f\n", *data);
}
int main(){
thrust::device_vector<float> mydata(5);
thrust::sequence(mydata.begin(), mydata.end());
printkernel<<<1,1>>>(mydata.data().get());
printkernel<<<1,1>>>(thrust::raw_pointer_cast(&mydata[2]));
cudaDeviceSynchronize();
return 0;
}
$ nvcc -o t651 t651.cu
$ ./t651
data = 0.000000
data = 2.000000
$
I have a __constant__ memory array holding information that is needed by many kernels, which are placed in different source files. This constant memory array is defined in the header GlobalParameters.h, which is #included by all files containing kernels that need to access to this array.
I just discovered (look at talonmies' answer) that __constant memory__ is only available in the translation unit where it is defined, unless you turn on separate compilation (with CUDA 5.0 or later).
I still do not get completely what this means for my case.
Assuming that I cannot turn on separate compilation, is there a way for dealing with my needs? Where should I place the definition of my constant memory array? What if I place it in my header, which is #included in many translation units?
Assuming I can turn on separate compilation, should I declare my __constant__ memory array in the header as extern and place the definition inside a source file (e.g. GlobalParameters.cu)?
One way to make constant memory available to translation units other than the one where it is declared, is to call cudaGetSymbolAddress() and make the pointer available to the other functions.
This technique is playing with fire to some degree, because if you use the pointer to write to the memory without appropriate barriers and synchronization, you may run afoul of the lack of coherency between constant memory and global memory.
Also, you may not get the full performance benefits of constant memory if you use this method. That should be less true on SM 2.x and later hardware - disassemble the object code and make sure the compiler is emitting "load uniform" instructions.
The below example assumes the possibility of using separate compilation. In this case, the below example shows how using extern to work with constant memory across different compilation units.
FILE kernel.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "Utilities.cuh"
__constant__ int N_GPU;
__constant__ float a_GPU;
__global__ void printKernel();
int main()
{
const int N = 5;
const float a = 10.466;
gpuErrchk(cudaMemcpyToSymbol(N_GPU, &N, sizeof(int)));
gpuErrchk(cudaMemcpyToSymbol(a_GPU, &a, sizeof(float)));
printKernel << <1, 1 >> > ();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
FILE otherCompilationUnit.cu
#include <stdio.h>
extern __constant__ int N_GPU;
extern __constant__ float a_GPU;
__global__ void printKernel() {
printf("N = %i; a = %f\n", N_GPU, a_GPU);
}
No, without using separate compilation it won't be possible to use the same constant memory, that is declared once, over several .cu files.
In my oppinion there are two ways for a workaround.
First one is to implement all kernels within one .cu file. Therefore you will get the disadvantage that this file will become very large with a bad overview.
A second way would be to declare in every .cu file the constant memory again. Then once with a wrapper copy the values into the constant memory for every single .cu file - like I described in an answer here. Disadvantages would be that you have to ensure that you don't forget to copy the values in single .cu files and you have to check that you won't run in the limitation of total available constant memory.
Yes. The later CUDA doc says:
When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), device, shared, managed and constant variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated shared variable).
Is there any way to deallocate shared memory previosuly allocated inside the same CUDA kernel?
For example, inside the kernel at one point I have defined
__shared__ unsigned char flag;
__shared__ unsigned int values [ BLOCK_DIM ];
Later on inside the code, I need to define an array that with considering previously defined shared memory exceeds the shared memory limit set for a block. How can I do that without dirty works of re-using previously defined shared memory? Or NVCC is smart enough to recognize dependencies along the kernel trace and deallocates it whenever done using shared variables?
My device is GeForce GTX 780 (CC=3.5).
In C/C++, it is not possible to deallocate statically defined arrays.
You may wish to dynamically allocate the amount of shared memory needed for the worst case as follows. Add
extern __shared__ float foo[];
within the kernel function and launch your kernel function as
myKernel<<<numBlocks, numThreads, sh_mem_size>>> (...);
Remember that you can manage multiple arrays by playing with pointers. Have a look at the CUDA C Programming Guide for further details. For example, quoting the Guide
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
By the same concept, you can dynamically change the size of the arrays you are dealing with.
In my application I have some part of the code that works as follows
main.cpp
int main()
{
//First dimension usually small (1-10)
//Second dimension (100 - 1500)
//Third dimension (10000 - 1000000)
vector<vector<vector<double>>> someInfo;
Object someObject(...); //Host class
for (int i = 0; i < N; i++)
someObject.functionA(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
//GPU COMPUTING
computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
//CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So as (I hope) you can see in the code, the function that prepares the GPU is called many times depending on the value of the first dimension.
All the values that I send to constant memory always remain the same and the sizes of the pointers allocated in global memory are always the same (the data is the only one changing).
This is the actual workflow in my code but I'm not getting any speedup when using GPU, I mean the kernel does execute faster but the memory transfers became my problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes to see if there is a way to do only once the copies to constant memory and memory allocations.
Normally, the cuda context begins with the first CUDA call in your application, and ends when the application terminates.
You should be able to do what you have in mind, which is to do the allocations only once (at the beginning of your app) and the corresponding free operations only once (at the end of your app) and populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repetetively, if they are not changing in size.
I wish to ask two questions on performance. I have been unable to create simple code to illustrate.
Question 1: How expensive is non-divergent branching? In my code it seems that it even goes up as to more then the equivalent of 4 non-fma FLOPS. Note that I am speaking of the BRA PTX code whereby the predicate is already calculated
Question 2: I have been reading a lot about performance of shared memory and some articles like a Dr Dobbs article even state that it can be as fast as registers (as far as accessed well). In my code all threads within the warps within the block access the same shared variable. I believe in this case shared memory is accessed in broadcast mode, isn't it? Should it reach the performance of registers in this way? Is there any special things that should be considered to make it work?
EDIT: I have been able to construct some simple code that give more insight for my query
Here it is
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;
__global__ void test()
{
__shared__ int t[1024];
int v=t[0];
bool b=(v==-1);
bool c=(v==-2);
int myValue=0;
for (int i=0;i<800;i++)
{
#if 1
v=i;
#else
v=t[i];
#endif
#if 0
if (b) {
printf("abs");
}
#endif
if (c)
{
printf ("IT HAPPENED");
v=8;
}
myValue+=v;
}
if (myValue==1000)
printf ("IT HAPPENED");
}
int main(int argc, char *argv[])
{
cudaEvent_t event_start,event_stop;
float timestamp;
float4 *data;
// Initialise
cudaDeviceReset();
cudaSetDevice(0);
dim3 threadsPerBlock;
dim3 blocks;
threadsPerBlock.x=32;
threadsPerBlock.y=32;
threadsPerBlock.z=1;
blocks.x=1;
blocks.y=1000;
blocks.z=1;
cudaEventCreate(&event_start);
cudaEventCreate(&event_stop);
cudaEventRecord(event_start, 0);
test<<<blocks,threadsPerBlock,0>>>();
cudaEventRecord(event_stop, 0);
cudaEventSynchronize(event_stop);
cudaEventElapsedTime(×tamp, event_start, event_stop);
printf("Calculated in %f", timestamp);
}
I am running this code on a GTX680.
Now the results are as follows ..
If run as it is it takes 5.44 ms
If I change the first #if conditional to 0 (which will enable reading from shared memory) it will take 6.02ms.. Not much more but still not enough for me
If I enable the second #if conditional (inserts a branch that will never evaluate to true) the it runs in 9.647040ms. The performance reduction is very big. What is the cause and what can be done?
I have also changed slightly the code to make further checks with shared memory
Instead of
__shared__ int t[1024]
I did
__shared__ int2 t[1024]
and wherever I access t[] I just access t[].x. In got a further drop in performance to 10ms..(another 400micro seconds) Why this should happen?
Regards
Daniel
Have you determined if your kernel is compute bound or memory bound? Your first question would be most relevant if your kernel is compute bound, while the second wold be most relevant if your kernel is memory bound. You might be getting results that are confusing or hard to reproduce if you're assuming one, while it is the other.
(1) I don't think the cost of a branch has been published. You might be left to determining that experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
(2) You're right that when you access a single 32-bit value in shared memory from all the threads in a warp, the value is broadcast. But my guess would be that accessing a single value from all threads would have the same cost as accessing any combination of values as long as you don't incur any bank conflicts. So you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published. It is short enough that it is normally easily hidden.
You need to keep in mind that the compiler is highly optimizing. So if you comment out the branch, you also eliminate the evaluation of the conditional, wether or not you leave it in the source code. Thus a difference of four instructions seems very plausible for your example:
load -1,
compare v to it (and store result in b),
test b,
branch,
although I have not compiled your example and looked at the code (which is what you should do - run cuobjdump -sass on your binaries and look at the actual differences in machine code.
Using the only the .x compnent of an int2 changes the layout in shared memory so that you go from bank conflict free access to a 2-way bank conflict, which causes the slight further slowdown in your example. IIRC the latency of a shared memory access is of the order of 30 cycles, which usually is easily hidden by other threads (as Roger has already mentioned).