Calling:
cudaExtent extent = make_cudaExtent( 1920 * sizeof(float), 1080, 10);
chanDesc = cudaCreateChannelDesc ( 32, 0, 0, 0, cudaChannelFormatKindFloat);
err = cudaMalloc3DArray ( &(devYAll[0]), &chanDesc, extent, 0);
is failing with err=cudaErrorInvalidValue. When I use 1024 or smaller as the first argument to the extent, the call to cudaMalloc3DArray succeeds. Is there somehow a limit on the size of the memory that can be allocated with cudaMalloc3DArray?
Yes, there is a limit on extent size - either 2048x2048x2048 or 4096x4096x4096 depending on which hardware you have (from the details in your question I presume you have a Fermi card). But the true source of your problem is your make_cudaExtent call. For cudaMalloc3DArray, the first argument of the extent should be given in elements, not bytes. This is why you are getting an error for first dimensions > 1024: 1024 * sizeof(float) = 4096, which is exactly the limit on Fermi GPUs.
So to allocate a 1920x1080x10 3D array, do this:
cudaExtent extent = make_cudaExtent( 1920, 1080, 10);
chanDesc = cudaCreateChannelDesc ( 32, 0, 0, 0, cudaChannelFormatKindFloat);
err = cudaMalloc3DArray ( &(devYAll[0]), &chanDesc, extent, 0);
In this call, the size of the type is deduced from the channel description, and the extent dimensions are modified as required to meet the pitch/alignment requirements of the hardware.
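The same bytes-versus-elements distinction comes up again when copying data into the array: for cudaMemcpy3D, the extent is interpreted in elements whenever the source or destination is a CUDA array, while the pitch passed to make_cudaPitchedPtr is always in bytes. A minimal sketch, assuming hostData is a 1920 x 1080 x 10 float buffer on the host:
cudaMemcpy3DParms p = {0};
p.srcPtr   = make_cudaPitchedPtr(hostData,             // assumed host buffer
                                 1920 * sizeof(float), // row pitch in bytes
                                 1920, 1080);          // width/height in elements
p.dstArray = devYAll[0];
p.extent   = extent;                  // in elements, because the destination is a CUDA array
p.kind     = cudaMemcpyHostToDevice;
err        = cudaMemcpy3D(&p);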
Here is my code:
int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum,threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks2 = nHeight/threadNum + (nHeight%threadNum == 0 ? 0 : 1);
dim3 dimGrid;
dimGrid.x = blocks1;
dimGrid.y = blocks2;
// dim3 numThreads2(BLOCKDIM);
// dim3 numBlocks2(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1) );
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);
cudaError_t err = cudaGetLastError();
cudasafe(err,"Kernel2");
This is the execution of my second kernel and it is fully independent in terms of data usage. BLOCKDIM is 512, nWidth and nHeight are 512 too, and cudasafe simply prints the corresponding string message of the error code. This section of the code gives a configuration error just after the kernel call.
What might give this error, any idea?
This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case, could also be shared memory, etc. in other cases). When you see a message like this it's a good idea just to print out your actual config parameters before launching the kernel, to see if you've made any mistakes.
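For example (a trivial sketch built from the names in your snippet, not a prescribed fix):
printf("grid = (%u, %u, %u), block = (%u, %u, %u)\n",
       dimGrid.x, dimGrid.y, dimGrid.z,
       dimBlock.x, dimBlock.y, dimBlock.z);
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);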
You said BLOCKDIM = 512. You have threadNum = BLOCKDIM/8 so threadNum = 64. Your threadblock configuration is:
dim3 dimBlock(threadNum,threadNum);
So you are asking to launch blocks of 64 x 64 threads, that is 4096 threads per block. That won't work on any generation of CUDA devices. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.
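One possible adjustment (an assumption on my part, not necessarily how you want to partition the work) is to cap the block edge at 32 so the product stays at the limit:
int threadNum = 32;                  // 32 x 32 = 1024 threads per block, the maximum
dim3 dimBlock(threadNum, threadNum);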
Maximum dimensions are listed in table 14 of the CUDA programming guide, and also available via the deviceQuery CUDA sample code.
Just to add to the previous answers, you can also query the maximum number of threads allowed from within your code, so it can run on other devices without hard-coding the number of threads you will use:
#include <iostream>

cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);   // 'device' is the index of the GPU being queried
std::cout << "using " << properties.multiProcessorCount << " multiprocessors" << std::endl;
std::cout << "max threads per processor: " << properties.maxThreadsPerMultiProcessor << std::endl;
std::cout << "max threads per block: " << properties.maxThreadsPerBlock << std::endl;
I have a simple CUDA kernel that can do vector accumulation by basic reduction. I am scaling it up to be able to handle larger data by splitting it across multiple blocks. However, my assumption about allocating an appropriate amount of shared memory to be used by the kernel is failing with illegal memory access. It goes away when I increase this limit, but I want to know why.
Here is the code that I am talking about:
CORE KERNEL:
__global__ static
void vec_add(int *buffer,
             int numElem,          // The actual number of elements
             int numIntermediates) // The next power of two of numElem
{
    extern __shared__ unsigned int interim[];

    int index = blockDim.x * blockIdx.x + threadIdx.x;

    // Copy global intermediate values into shared memory.
    interim[threadIdx.x] =
        (index < numElem) ? buffer[index] : 0;

    __syncthreads();

    // numIntermediates *must* be a power of two!
    for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        buffer[blockIdx.x] = interim[0];
    }
}
And this is the caller:
void accumulate (int* buffer, int numElem)
{
    unsigned int numReductionThreads =
        nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.

    const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock

    unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;

    while (numReductionThreads > 1) {
        numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
                             numReductionThreads : maxThreadsPerBlock;
        numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;
        reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);

        vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
            (buffer, numElem, numReductionThreads);

        numReductionThreads = nextPowerOfTwo(numReductionBlocks);
    }
}
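The nextPowerOfTwo routine is not shown in the question; a minimal sketch of its assumed behaviour (smallest power of two greater than or equal to n), consistent with the output below:
// Assumed helper: returns the smallest power of two >= n (so 1152 -> 2048, 2 -> 2).
unsigned int nextPowerOfTwo(unsigned int n)
{
    unsigned int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}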
I tried this code with a sample set of 1152 elements on my GPU with the following configuration:
Type: Quadro 600
MaxThreadsPerBlock: 1024
MaxSharedMemory: 48KB
OUTPUT:
Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8
CUDA Error 77: an illegal memory access was encountered
Suspecting that my 'interim' shared memory was causing illegal memory access, I arbitrarily increased the shared memory by two times in the following line:
reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int);
And my kernel started working fine!
What I do not understand is why I had to provide this extra shared memory to make the problem go away (temporarily).
As a further experiment to check this magic number I ran my code with a much larger data-set with 6912 points. This time, even 2X or 4X didn't help me.
Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384
Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128
CUDA Error 77: an illegal memory access was encountered
But the problem again went away when I increased the shared memory size by 8X.
Of course, I cannot be arbitrarily picking this scaling factor for larger and larger data-sets because I will soon run out of the 48KB shared memory limit. So I want to know a legitimate way of fixing my issue.
Thanks to @havogt for pointing out the out-of-index access.
The issue was that I was using the wrong argument as numIntermediates to the vec_add kernel. Passing numReductionThreads (2048 in the first pass) made the reduction loop start at s = 1024, so interim[threadIdx.x + s] indexed past the end of the shared array, which only has numThreadsPerBlock (1024) elements. The intention was for the kernel to operate on exactly as many data points as there are threads in the block.
I fixed it by using numThreadsPerBlock as the argument:
vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
(buffer, numElem, numThreadsPerBlock);
First Question.
The CUDA C Programming Guide says the following:
The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory and 16 KB of L1 cache or as 16 KB of shared memory and 48 KB of L1 cache.
But, device query shows "Total number of registers available per block: 32768".
I use GTX580.(CC is 2.0)
The guide says the default cache size is 16 KB, but 32768 means 32768 * 4 (bytes) = 131072 bytes = 128 KB. Actually, I don't know which is correct.
Second Question.
I set it up like below:
dim3 grid(32, 32); //blocks in a grid
dim3 block(16, 16); //threads in a block
kernel<<<grid,block>>>(...);
Then the number of threads per block is 256 => we need 256*N registers per block.
N means the number of registers needed per thread.
(256*N)*blocks is the number of registers per SM (not bytes).
So, if the default size is 16KB and threads/SM is the maximum (1536), then N can't exceed 2, because of "Maximum number of threads per multiprocessor: 1536".
16KB / 4 bytes = 4096 registers, 4096 / 1536 = 2.666...
In the case of the larger 48KB cache, N can't exceed 8.
48KB / 4 bytes = 12288 registers, 12288 / 1536 = 8
Is that true? Actually, I'm so confused.
Actually, almost my full code is here.
I think the kernel is optimized when the block dimension is 16x16.
But in the case of 8x8 it is faster than 16x16, or similar.
I don't know why.
The number of registers per thread is 16, and the shared memory is 80+16 bytes.
I had asked the same question before, but I couldn't get an exact solution:
The result of an experiment different from CUDA Occupancy Calculator
#define WIDTH 512
#define HEIGHT 512
#define TILE_WIDTH 8
#define TILE_HEIGHT 8
#define CHANNELS 3
#define DEVICENUM 1
#define HEIGHTs HEIGHT/DEVICENUM
__global__ void PRINT_POLYGON( unsigned char *IMAGEin, int *MEMin, char a, char b, char c){
    int Col = blockIdx.y*blockDim.y + threadIdx.y;  //Col is y coordinate
    int Row = blockIdx.x*blockDim.x + threadIdx.x;  //Row is x coordinate

    int tid_in_block = threadIdx.x + threadIdx.y*blockDim.x;
    int bid_in_grid = blockIdx.x + blockIdx.y*gridDim.x;
    int threads_per_block = blockDim.x * blockDim.y;
    int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;

    float result_a, result_b;
    __shared__ int M[15];

    for(int k = 0; k < 5; k++){
        M[k]    = MEMin[a*5+k];
        M[k+5]  = MEMin[b*5+k];
        M[k+10] = MEMin[c*5+k];
    }

    int result_a_up = (M[11]-M[1])*(Row-M[0]) - (M[10]-M[0])*(Col-M[1]);
    int result_b_up = (M[6] -M[1])*(M[0]-Row) - (M[5] -M[0])*(M[1]-Col);
    int result_down = (M[11]-M[1])*(M[5]-M[0]) - (M[6]-M[1])*(M[10]-M[0]);

    result_a = (float)result_a_up / (float)result_down;
    result_b = (float)result_b_up / (float)result_down;

    if((0 <= result_a && result_a <= 1) && ((0 <= result_b && result_b <= 1)) && ((0 <= (result_a+result_b) && (result_a+result_b) <= 1))){
        IMAGEin[tid_in_grid*CHANNELS]   += M[2] + (M[7]-M[2])*result_a + (M[12]-M[2])*result_b;  //Red Channel
        IMAGEin[tid_in_grid*CHANNELS+1] += M[3] + (M[8]-M[3])*result_a + (M[13]-M[3])*result_b;  //Green Channel
        IMAGEin[tid_in_grid*CHANNELS+2] += M[4] + (M[9]-M[4])*result_a + (M[14]-M[4])*result_b;  //Blue Channel
    }
}
struct DataStruct {
    int deviceID;
    unsigned char IMAGE_SEG[WIDTH*HEIGHTs*CHANNELS];
};

void* routine( void *pvoidData ) {
    DataStruct *data = (DataStruct*)pvoidData;
    unsigned char *dev_IMAGE;
    int *dev_MEM;
    unsigned char *IMAGE_SEG = data->IMAGE_SEG;

    HANDLE_ERROR(cudaSetDevice(5));

    //initialize array
    memset(IMAGE_SEG, 0, WIDTH*HEIGHTs*CHANNELS);
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    printf("Device %d Starting..\n", data->deviceID);

    //Evaluate Time
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord(start, 0);

    HANDLE_ERROR( cudaMalloc( (void **)&dev_MEM, sizeof(int)*35) );
    HANDLE_ERROR( cudaMalloc( (void **)&dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS) );

    cudaMemcpy(dev_MEM, MEM, sizeof(int)*35, cudaMemcpyHostToDevice);
    cudaMemset(dev_IMAGE, 0, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS);

    dim3 grid(WIDTH/TILE_WIDTH, HEIGHTs/TILE_HEIGHT); //blocks in a grid
    dim3 block(TILE_WIDTH, TILE_HEIGHT);              //threads in a block

    PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 1, 2);
    PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 2, 3);
    PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 3, 4);
    PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 4, 5);
    PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 3, 2, 4);
    PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 2, 6, 4);

    HANDLE_ERROR( cudaMemcpy( IMAGE_SEG, dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS, cudaMemcpyDeviceToHost ) );
    HANDLE_ERROR( cudaFree( dev_MEM ) );
    HANDLE_ERROR( cudaFree( dev_IMAGE ) );

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime( &elapsed_time_ms[data->deviceID], start, stop );
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    elapsed_time_ms[DEVICENUM] += elapsed_time_ms[data->deviceID];

    printf("Device %d Complete!\n", data->deviceID);

    return 0;
}
The blockDim 8x8 is faster than 16x16 due to the increase in address divergence in your memory access when you increase the block size.
Metrics collected on GTX480 with 15 SMs.
metric                         16x16      8x8
duration                       161µs      114µs
issued_ipc                     1.24       1.31
executed_ipc                   0.88       0.59
serialization                  54.61%     28.74%
The number of instruction replays clues us in that we likely have bad memory access patterns.
achieved occupancy             88.32%     30.76%
0 warp schedulers issues        8.81%      7.98%
1 warp schedulers issues        2.36%     29.54%
2 warp schedulers issues       88.83%     52.44%
16x16 appears to keep the warp scheduler busy. However, it is keeping the schedulers busy re-issuing instructions.
l1 global load trans           524,407    332,007
l1 global store trans          401,224    209,139
l1 global load trans/request   3.56       2.25
l1 global store trans/request  16.33      8.51
The first priority is to reduce transactions per request. The Nsight VSE source view can display memory statistics per instruction. The primary issue in your kernel is the interleaved U8 load and stores for IMAGEin[] += value. At 16x16 this is resulting in 16.3 transactions per request but only 8.3 for 8x8 configuration.
Changing
IMAGEin[(i*HEIGHTs+j)*CHANNELS] += ...
to be consecutive increases performance of 16x16 by 3x. I imagine increasing channels to 4 and handling packing in the kernel will improve cache performance and memory throughput.
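A hypothetical sketch of that packing idea (not the poster's actual fix): treating the image as a 4-channel uchar4 buffer lets each thread load and store a whole pixel as one aligned 32-bit transaction instead of three interleaved byte transactions:
// Hypothetical illustration of RGBA-style packing; 'delta' stands in for the
// per-channel values computed in the original kernel.
__global__ void accumulate_packed(uchar4 *image, int numPixels, uchar4 delta)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < numPixels) {
        uchar4 pix = image[tid];   // single 32-bit load
        pix.x += delta.x;          // red
        pix.y += delta.y;          // green
        pix.z += delta.z;          // blue
        image[tid] = pix;          // single 32-bit store
    }
}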
If you fix the number of memory transactions per request you will then likely have to look at execution dependencies and try to increase your ILP.
It is faster with a block size of 8x8 because it is a smaller multiple of 32. As can be seen in the Fermi architecture diagram, 32 CUDA cores are bound together, with two different warp schedulers that actually schedule the same thing, so the same instruction is executed on these 32 cores in each execution cycle.
To clarify this better: in the first case (8x8), each block is made of two warps (64 threads), so it is finished within only two execution cycles; however, when you use 16x16 as your block size, each block takes 8 warps (256 threads), therefore taking 4 times more execution cycles and resulting in slower execution overall.
However, filling an SM with more warps is better in some cases: when memory access is heavy and each warp is likely to go into a memory stall (i.e. waiting for its operands to come from memory), the stalled warp is replaced with another warp until the memory operation completes, resulting in higher occupancy of the SM.
You should of course also factor the number of blocks per SM and the total number of SMs into your calculations; for example, assigning more than 8 blocks to a single SM might reduce its occupancy. But probably in your case you are not facing these issues, because 256 is generally a better number than 64, since it will balance your blocks among the SMs, whereas using 64 threads will result in more blocks being executed on the same SM.
EDIT: This answer is based on my speculations; for a more scientific approach, see Greg Smith's answer.
The register pool is different from shared memory/cache, down to the very bottom of the architecture!
Registers are made of flip-flops, while the L1 cache is probably SRAM.
Just to get an idea, look at the Fermi architecture diagram, then update your question to further specify the problem you are facing.
As a note, you can see how many registers and how much shared memory (smem) are used by your functions by passing the option --ptxas-options=-v to nvcc.
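For example (assuming the source file is called kernel.cu and targeting the GTX 580's compute capability 2.0):
nvcc --ptxas-options=-v -arch=sm_20 -c kernel.cu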
2D textures are a useful feature of CUDA in image processing applications. To bind pitch linear memory to 2D textures, the memory has to be aligned. cudaMallocPitch is a good option for aligned memory allocation. On my device, the pitch returned by cudaMallocPitch is a multiple of 512, i.e. the memory is 512 byte aligned.
The actual alignment requirement for the device is determined by cudaDeviceProp::texturePitchAlignment which is 32 bytes on my device.
My question is:
If the actual alignment requirement for 2D textures is 32 bytes, then why does cudaMallocPitch return 512 byte aligned memory?
Isn't it a waste of memory? For example if I create an 8 bit image of size 513 x 100, it will occupy 1024 x 100 bytes.
I get this behaviour on following systems:
1: Asus G53JW + Windows 8 x64 + GeForce GTX 460M + CUDA 5 + Core i7 740QM + 4GB RAM
2: Dell Inspiron N5110 + Windows 7 x64 + GeForce GT525M + CUDA 4.2 + Corei7 2630QM + 6GB RAM
This is a slightly speculative answer, but keep in mind that there are two alignment properties which the pitch of an allocation must satisfy for textures, one for the texture pointer and one for the texture rows. I suspect that cudaMallocPitch is honouring the former, defined by cudaDeviceProp::textureAlignment. For example:
#include <cstdio>

int main(void)
{
    const int ncases = 12;
    const size_t widths[ncases] = { 5, 10, 20, 50, 70, 90, 100,
                                    200, 500, 700, 900, 1000 };
    const size_t height = 10;

    float *vals[ncases];
    size_t pitches[ncases];

    struct cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    fprintf(stdout, "Texture alignment = %zd bytes\n",
            p.textureAlignment);

    cudaSetDevice(0);
    cudaFree(0); // establish context

    for(int i=0; i<ncases; i++) {
        cudaMallocPitch((void **)&vals[i], &pitches[i],
                        widths[i], height);
        fprintf(stdout, "width = %zd <=> pitch = %zd \n",
                widths[i], pitches[i]);
    }

    return 0;
}
which gives the following on a GT320M:
Texture alignment = 256 bytes
width = 5 <=> pitch = 256
width = 10 <=> pitch = 256
width = 20 <=> pitch = 256
width = 50 <=> pitch = 256
width = 70 <=> pitch = 256
width = 90 <=> pitch = 256
width = 100 <=> pitch = 256
width = 200 <=> pitch = 256
width = 500 <=> pitch = 512
width = 700 <=> pitch = 768
width = 900 <=> pitch = 1024
width = 1000 <=> pitch = 1024
I am guessing that cudaDeviceProp::texturePitchAlignment applies to CUDA arrays.
After doing some experiments with the memory allocation, at last I found a working solution which saves memory. If I forcefully align the memory allocated by cudaMalloc, cudaBindTexture2D works perfectly.
cudaError_t alignedMalloc2D(void** ptr, int width, int height, int* pitch, int alignment = 32)
{
    if ((width % alignment) != 0)
        width += (alignment - (width % alignment));

    (*pitch) = width;
    return cudaMalloc(ptr, width * height);
}
The memory allocated by this function is 32 byte aligned, which is the requirement of cudaBindTexture2D. My memory usage is now reduced 16 times, and all the CUDA functions which use 2D textures are also working correctly.
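For example, a usage sketch under the same assumptions (a hypothetical texture reference tex8u and the 513 x 100 8-bit image from the question), using the texture-reference API that was current in CUDA 4/5:
texture<unsigned char, 2, cudaReadModeElementType> tex8u;   // hypothetical file-scope texture reference

unsigned char *devImg = NULL;
int pitch = 0;
alignedMalloc2D((void**)&devImg, 513, 100, &pitch);          // pitch comes back as 544 (513 rounded up to a multiple of 32)

cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
cudaBindTexture2D(0, tex8u, devImg, desc, 513, 100, pitch);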
Here is a small utility function to get the currently selected CUDA device pitch alignment requirement.
int getCurrentDeviceTexturePitchAlignment()
{
    cudaDeviceProp prop;
    int currentDevice = 0;
    cudaGetDevice(&currentDevice);
    cudaGetDeviceProperties(&prop, currentDevice);
    return prop.texturePitchAlignment;
}
As the title says, I would like to know the correct execution order of threads in the case of a 3D block.
I think I remember reading something about it already, but it was some time ago; I don't remember where, and it came from someone who didn't look so reliable.
Anyway, I would like to have some confirmation about it.
Is it as follows (divided into warps)?
[0, 0, 0]...[blockDim.x-1, 0, 0] - [0, 1, 0]...[blockDim.x-1, 1, 0] - (...) - [0, blockDim.y-1, 0]...[blockDim.x-1, blockDim.y-1, 0] - [0, 0, 1]...[blockDim.x-1, 0, 1] - (...) - [0, blockDim.y-1, 1]...[blockDim.x-1, blockDim.y-1, 1] - (...) - [blockDim.x-1, blockDim.y-1, blockDim.z-1]
Yes, that is the correct ordering; threads are ordered with the x dimension varying first, then y, then z (equivalent to column-major order) within a block. The calculation can be expressed as
int threadID = threadIdx.x +
               blockDim.x * threadIdx.y +
               (blockDim.x * blockDim.y) * threadIdx.z;

int warpID = threadID / warpSize;
int laneID = threadID % warpSize;
Here threadID is the thread number within the block, warpID is the warp within the block and laneID is the thread number within the warp.
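For example, in an 8 x 4 x 2 block, the thread with (threadIdx.x, threadIdx.y, threadIdx.z) = (3, 2, 1) gets threadID = 3 + 8*2 + (8*4)*1 = 51, which places it in warp 1 (51 / 32) at lane 19 (51 % 32), assuming warpSize = 32.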
Note that threads are not necessarily executed in any sort of predictable order related to this ordering within a block. The execution model guarantees that threads in the same warp are executed in "lock-step", but you can't infer any more than that from the thread numbering within a block.