2D textures are a useful feature of CUDA for image processing applications. To bind pitch linear memory to 2D textures, the memory has to be aligned. cudaMallocPitch is a good option for aligned memory allocation. On my device, the pitch returned by cudaMallocPitch is a multiple of 512, i.e. the memory is 512-byte aligned.
The actual alignment requirement for the device is determined by cudaDeviceProp::texturePitchAlignment which is 32 bytes on my device.
My question is:
If the actual alignment requirement for 2D textures is 32 bytes, then why does cudaMallocPitch return 512-byte aligned memory?
Isn't this a waste of memory? For example, if I create an 8-bit image of size 513 x 100, it will occupy 1024 x 100 bytes.
I get this behaviour on the following systems:
1: Asus G53JW + Windows 8 x64 + GeForce GTX 460M + CUDA 5 + Core i7 740QM + 4GB RAM
2: Dell Inspiron N5110 + Windows 7 x64 + GeForce GT525M + CUDA 4.2 + Core i7 2630QM + 6GB RAM
This is a slightly speculative answer, but keep in mind that there are two alignment properties which the pitch of an allocation must satisfy for textures, one for the texture pointer and one for the texture rows. I suspect that cudaMallocPitch is honouring the former, defined by cudaDeviceProp::textureAlignment. For example:
#include <cstdio>

int main(void)
{
    const int ncases = 12;
    const size_t widths[ncases] = { 5, 10, 20, 50, 70, 90, 100,
                                    200, 500, 700, 900, 1000 };
    const size_t height = 10;

    float *vals[ncases];
    size_t pitches[ncases];

    struct cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    fprintf(stdout, "Texture alignment = %zd bytes\n",
            p.textureAlignment);
    cudaSetDevice(0);
    cudaFree(0); // establish context

    for(int i=0; i<ncases; i++) {
        cudaMallocPitch((void **)&vals[i], &pitches[i],
                        widths[i], height);
        fprintf(stdout, "width = %zd <=> pitch = %zd \n",
                widths[i], pitches[i]);
    }

    return 0;
}
which gives the following on a GT320M:
Texture alignment = 256 bytes
width = 5 <=> pitch = 256
width = 10 <=> pitch = 256
width = 20 <=> pitch = 256
width = 50 <=> pitch = 256
width = 70 <=> pitch = 256
width = 90 <=> pitch = 256
width = 100 <=> pitch = 256
width = 200 <=> pitch = 256
width = 500 <=> pitch = 512
width = 700 <=> pitch = 768
width = 900 <=> pitch = 1024
width = 1000 <=> pitch = 1024
I am guessing that cudaDeviceProp::texturePitchAlignment applies to CUDA arrays.
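For reference, the two alignment properties can be queried and printed side by side; a minimal sketch (device 0 assumed, not part of the original answer):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    std::printf("textureAlignment      = %u bytes\n", (unsigned)p.textureAlignment);
    std::printf("texturePitchAlignment = %u bytes\n", (unsigned)p.texturePitchAlignment);
    return 0;
}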
After doing some experiments with memory allocation, I finally found a working solution that saves memory: if I forcefully align the memory allocated by cudaMalloc, cudaBindTexture2D works perfectly.
cudaError_t alignedMalloc2D(void** ptr, int width, int height, int* pitch, int alignment = 32)
{
    // Round the row width (in bytes) up to the next multiple of the alignment
    if ((width % alignment) != 0)
        width += alignment - (width % alignment);

    (*pitch) = width;
    return cudaMalloc(ptr, width * height);
}
The memory allocated by this function is 32-byte aligned, which is what cudaBindTexture2D requires. My memory usage is now reduced by a factor of 16, and all the CUDA functions that use 2D textures are also working correctly.
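For illustration, binding such an allocation to a 2D texture reference might look like the sketch below; the texture reference name, the unsigned char element type, and the helper function call are assumptions for the example rather than part of the original code:

texture<unsigned char, 2, cudaReadModeElementType> tex8u;

void bindAlignedImage(int width, int height)
{
    unsigned char* d_img = NULL;
    int pitch = 0;
    alignedMalloc2D((void**)&d_img, width, height, &pitch);

    // Describe the element type and bind the aligned allocation as a 2D texture
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    size_t offset = 0;
    cudaBindTexture2D(&offset, tex8u, d_img, desc, width, height, pitch);
}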
Here is a small utility function to get the texture pitch alignment requirement of the currently selected CUDA device.
int getCurrentDeviceTexturePitchAlignment()
{
    cudaDeviceProp prop;
    int currentDevice = 0;
    cudaGetDevice(&currentDevice);
    cudaGetDeviceProperties(&prop, currentDevice);
    return prop.texturePitchAlignment;
}
Here is part of my CUDA code from cs344. The task is to convert an RGB image to grayscale.
The code runs fine as posted below, but when I use the lines that are commented out instead, it fails.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    //int x = threadIdx.x;
    //int y = threadIdx.y;
    int x = blockIdx.x;
    int y = blockIdx.y;
    if (x < numCols && y < numRows)
    {
        uchar4 rgba = rgbaImage[y * numCols + x];
        float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
        greyImage[y * numCols + x] = channelSum;
    }
}

void your_rgba_to_greyscale(const uchar4* const h_rgbaImage,
                            uchar4* const d_rgbaImage,
                            unsigned char* const d_greyImage,
                            size_t numRows, size_t numCols)
{
    // const dim3 blockSize(numCols, numRows, 1); //TODO
    // const dim3 gridSize(1, 1, 1);              //TODO
    const dim3 blockSize(1, 1, 1);                //TODO
    const dim3 gridSize(numCols, numRows, 1);     //TODO
    std::cout << numCols << " " << numRows << std::endl; // numCols=557 numRows=313
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
The code fails when I use the commented-out configuration, with this error:
CUDA error at: /home/yc/cuda_prj/cs344_bak/Problem Sets/Problem Set 1/student_func.cu:90
invalid configuration argument cudaGetLastError()
Here is my deviceQuery report:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8110 MBytes (8504279040 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1823 MHz (1.82 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Can anyone tell me why?
Robert posted the answer that solves the problem:
The total number of threads per block is the product of the block dimensions. If your dimensions are numCols=557 and numRows=313, their product is over 150,000. The limit on the total threads per block on your device is 1024, as shown in the deviceQuery output: Maximum number of threads per block: 1024.
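A minimal sketch of a launch configuration that stays within that limit tiles the image with small 2D blocks and indexes with both blockIdx and threadIdx; the 16x16 tile size is an assumption for illustration, not something from the original post:

// Inside the kernel, the pixel coordinates would then combine block and thread indices:
//   int x = blockIdx.x * blockDim.x + threadIdx.x;
//   int y = blockIdx.y * blockDim.y + threadIdx.y;
const dim3 blockSize(16, 16, 1);   // 256 threads per block, well under the 1024 limit
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                    (numRows + blockSize.y - 1) / blockSize.y, 1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);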
I want to convert my existing C++ code to CUDA:
for (int x = 0; x < 100; x++)
{
    for (int y = 0; y < 100; y++)
    {
        for (int w = 0; w < 100; w++)
        {
            for (int z = 0; z < 100; z++)
            {
                ........
            }
        }
    }
}
These loops combine to compute a new int value.
If I want to use CUDA, I have to design the thread hierarchy before writing the kernel code.
So how should I design the hierarchy?
Based on the loop bounds, I think it will need 100 * 100 * 100 * 100 = 100,000,000 threads.
Could you help me? Thanks.
My CUDA spec:
CUDA Device #0
Major revision number: 1
Minor revision number: 1
Name: GeForce G 105M
Total global memory: 536870912
Total shared memory per block: 16384
Total registers per block: 8192
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 512
Maximum dimension 3 of block: 64
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Maximum dimension 3 of grid: 1
Clock rate: 1600000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: No
Number of multiprocessors: 1
Kernel execution timeout: Yes
100,000,000 threads (or blocks) is not too many for a GPU.
Your GPU has compute capability 1.1, so it is limited to 65535 blocks in each of the first two grid dimensions (x and y). Since 100*100 = 10000, we could launch 10000 blocks in each of the first two grid dimensions, to cover your entire for-loop extent. This would launch one block per for-loop iteration (unique combination of x,y,z, and w) and assume that you would use the threads in a block to address the needs of your for-loop calculation code:
__global__ void mykernel(...){
    int idx = blockIdx.x;
    int idy = blockIdx.y;
    int w = idx/100;
    int z = idx%100;
    int x = idy/100;
    int y = idy%100;
    int tx = threadIdx.x;
    // (the body of your for-loop code here...)
}
launch:
dim3 blocks(10000, 10000);
dim3 threads(...); // can use any number here up to 512 for your device
mykernel<<<blocks, threads>>>(...);
If instead you wanted to assign one thread to each of the inner z iterations of your for-loop (which might be useful / higher performance depending on what you are doing and your data organization), you could do something like this:
__global__ void mykernel(...){
    int idx = blockIdx.x;
    int idy = blockIdx.y;
    int w = idx/100;
    int x = idx%100;
    int y = idy;
    int z = threadIdx.x;
    // (the body of your for-loop code here...)
}
launch:
dim3 blocks(10000, 100);
dim3 threads(100);
mykernel<<<blocks, threads>>>(...);
All of the above assumes your for-loop iterations are independent. If your for-loop iterations are dependent on each other (dependent on the order of execution) then these simplistic answers won't work, and you have not provided enough information in your question to discuss a reasonable strategy.
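For completeness, assuming independent iterations, here is a small self-contained version of the second mapping; the kernel name and the placeholder body (a per-y partial sum of an arbitrary expression) are illustrative assumptions, not part of the original question:

#include <cstdio>
#include <cuda_runtime.h>

// One block per (w, x) pair in blockIdx.x, y in blockIdx.y, and one thread per z,
// matching the second mapping above. Global 32-bit atomics require compute
// capability 1.1 or higher, so compile with -arch=sm_11 (or higher).
__global__ void loop_kernel(int *partial)
{
    int w = blockIdx.x / 100;
    int x = blockIdx.x % 100;
    int y = blockIdx.y;
    int z = threadIdx.x;
    // Placeholder for the real loop body: accumulate into a per-y partial sum.
    atomicAdd(&partial[y], (x + y + w + z) % 7);
}

int main(void)
{
    int *d_partial;
    cudaMalloc((void **)&d_partial, 100 * sizeof(int));
    cudaMemset(d_partial, 0, 100 * sizeof(int));

    dim3 blocks(10000, 100);   // 100*100 (w, x) pairs times 100 values of y
    dim3 threads(100);         // one thread per z
    loop_kernel<<<blocks, threads>>>(d_partial);
    cudaDeviceSynchronize();
    std::printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    int h_partial[100];
    cudaMemcpy(h_partial, d_partial, 100 * sizeof(int), cudaMemcpyDeviceToHost);
    long long total = 0;
    for (int y = 0; y < 100; y++) total += h_partial[y];
    std::printf("combined result = %lld\n", total);

    cudaFree(d_partial);
    return 0;
}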
I already posted my question on NVIDIA dev forums, but there are no definitive answers yet.
I'm just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I'm getting from deviceQuery.exe provided with the toolkit:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660M"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
I was curious enough to write a simple program to test whether it is possible to use more than 65535 blocks in the first dimension of the grid, but it doesn't work, which confirms what I found on the Internet (or, to be more precise, it works fine for 65535 blocks and fails for 65536).
My program is extremely simple and basically just adds two vectors. This is the source code:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>
#pragma comment(lib, "cudart")
typedef struct
{
float *content;
const unsigned int size;
} pjVector_t;
__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
resultVector[index] = firstVector[index] + secondVector[index];
}
int main(void)
{
//const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
const unsigned int vectorLength = 67108864; // 1024 * 65536 - doesn't work
const unsigned int vectorSize = sizeof(float) * vectorLength;
int threads = 0;
unsigned int blocks = 0;
cudaDeviceProp deviceProperties;
cudaError_t error;
pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
float *d_firstVector;
float *d_secondVector;
float *d_resultVector;
cudaMalloc((void **)&d_firstVector, vectorSize);
cudaMalloc((void **)&d_secondVector, vectorSize);
cudaMalloc((void **)&d_resultVector, vectorSize);
cudaGetDeviceProperties(&deviceProperties, 0);
threads = deviceProperties.maxThreadsPerBlock;
blocks = (unsigned int)ceil(vectorLength / (double)threads);
for (unsigned int i = 0; i < vectorLength; i++)
{
firstVector.content[i] = 1.0f;
secondVector.content[i] = 2.0f;
}
cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
error = cudaPeekAtLastError();
cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);
for (unsigned int i = 0; i < vectorLength; i++)
{
if(resultVector.content[i] != 3.0f)
{
free(firstVector.content);
free(secondVector.content);
free(resultVector.content);
cudaFree(d_firstVector);
cudaFree(d_secondVector);
cudaFree(d_resultVector);
cudaDeviceReset();
printf("Error under index: %i\n", i);
return 0;
}
}
free(firstVector.content);
free(secondVector.content);
free(resultVector.content);
cudaFree(d_firstVector);
cudaFree(d_secondVector);
cudaFree(d_resultVector);
cudaDeviceReset();
printf("Everything ok!\n");
return 0;
}
When I run it from Visual Studio in debug mode (with the bigger vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0, if it matters), so the result doesn't pass the final validation. When I try to profile it with the Visual Profiler, it returns the following error message:
2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed
As a result, the profiler measures only the cudaMalloc and cudaMemcpy operations and doesn't even show the kernel execution.
I'm not sure if I'm checking CUDA errors correctly, so please let me know if it can be done better. cudaPeekAtLastError() placed just after my kernel launch returns cudaErrorInvalidValue (11) when the bigger vector is used, while every other call (cudaMalloc and cudaMemcpy) returns cudaSuccess (0). When I run my program with the smaller vector, all CUDA functions and my kernel launch return no errors (cudaSuccess (0)) and it works just fine.
So my question is: is cudaGetDeviceProperties returning rubbish grid size values or am I doing something wrong?
If you want to run a kernel using the larger grid sizes offered by the Kepler architecture, you must compile your code for that architecture. So change your build flags to specify sm_30 as the target architecture (for example, -gencode arch=compute_30,code=sm_30 with nvcc). Otherwise the compiler will build for compute 1.0 targets.
The underlying reason for the launch failure is that the driver will attempt to recompile the compute 1.0 code for your Kepler card, but in doing so it enforces the execution grid limits dictated by the source architecture, i.e. two-dimensional grids with at most 65535 x 65535 blocks per grid.
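On the error-checking question raised above: a common pattern is to wrap every runtime call in a check and to check both the launch itself and the subsequent synchronization; a generic sketch (the macro name is arbitrary, not an established API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err__ = (call);                                       \
        if (err__ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err__), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage around a kernel launch:
//   AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
//   CHECK_CUDA(cudaPeekAtLastError());     // catches invalid launch configurations
//   CHECK_CUDA(cudaDeviceSynchronize());   // catches errors during kernel execution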
I am reading and testing the examples in the book "CUDA by Example: An Introduction to General-Purpose GPU Programming".
When testing the examples in chapter 7, which cover texture memory, I realized that access to global memory via the texture cache is much slower than direct access (my NVIDIA GPU is a GeForce GTX 260, compute capability 1.3, and I am using NVIDIA CUDA 4.2):
Time per frame with texture fetch (1D or 2D) for a 256*256 image: 93 ms
Time per frame not using texture (just direct global access) for 256*256: 8.5 ms
I have double-checked the code several times, and I have also been reading the "CUDA C Programming Guide" and "CUDA C Best Practices Guide" that come with the SDK, but I do not really understand the problem.
As far as I understand, texture memory is just global memory with a specific access mechanism that makes it behave like a cache (?). I am wondering whether coalesced access to global memory makes texture fetches slower, but I cannot be sure.
Does anybody have a similar problem?
(I found some links in the NVIDIA forums about a similar problem, but they are no longer available.)
The testing code looks like this, including only the relevant parts:
//#define TEXTURE
//#define TEXTURE2
#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif
__global__ void copy_const_kernel( float *iptr
#ifdef TEXTURE2
){
#else
,const float *cptr ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
#ifdef TEXTURE2
float c = tex1Dfetch(texConstSrc,offset);
#else
float c = cptr[offset];
#endif
if ( c != 0) iptr[offset] = c;
}
__global__ void blend_kernel( float *outSrc,
#ifdef TEXTURE
bool dstOut ) {
#else
const float *inSrc ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
int left = offset - 1;
int right = offset + 1;
if (x == 0) left++;
if (x == SXRES-1) right--;
int top = offset - SYRES;
int bottom = offset + SYRES;
if (y == 0) top += SYRES;
if (y == SYRES-1) bottom -= SYRES;
#ifdef TEXTURE
float t, l, c, r, b;
if (dstOut) {
t = tex1Dfetch(texIn,top);
l = tex1Dfetch(texIn,left);
c = tex1Dfetch(texIn,offset);
r = tex1Dfetch(texIn,right);
b = tex1Dfetch(texIn,bottom);
} else {
t = tex1Dfetch(texOut,top);
l = tex1Dfetch(texOut,left);
c = tex1Dfetch(texOut,offset);
r = tex1Dfetch(texOut,right);
b = tex1Dfetch(texOut,bottom);
}
outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
#else
outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
inSrc[bottom] + inSrc[left] + inSrc[right] -
inSrc[offset]*4);
#endif
}
// globals needed by the update routine
struct DataBlock {
unsigned char *output_bitmap;
float *dev_inSrc;
float *dev_outSrc;
float *dev_constSrc;
cudaEvent_t start, stop;
float totalTime;
float frames;
unsigned size;
unsigned char *output_host;
};
void anim_gpu( DataBlock *d, int ticks ) {
checkCudaErrors( cudaEventRecord( d->start, 0 ) );
dim3 blocks(SXRES/16,SYRES/16);
dim3 threads(16,16);
#ifdef TEXTURE
volatile bool dstOut = true;
#endif
for (int i=0; i<90; i++) {
#ifdef TEXTURE
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
#ifdef TEXTURE2
copy_const_kernel<<<blocks,threads>>>( in );
#else
copy_const_kernel<<<blocks,threads>>>( in,
d->dev_constSrc );
#endif
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
#else
copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
d->dev_constSrc );
blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
d->dev_inSrc );
swap( d->dev_inSrc, d->dev_outSrc );
#endif
}
// Some stuff for the events
// ...
}
I have been testing the results with nvvp (the NVIDIA Visual Profiler).
The results are quite curious, as they show a lot of texture cache misses (which are probably the cause of the bad performance).
The profiler results also contain information that is difficult to interpret even with the CUPTI User's Guide:
tex_cache_hit: number of texture cache hits (counted for only one SM on compute capability 1.3 devices).
tex_cache_miss: number of texture cache misses (counted for only one SM on compute capability 1.3 devices).
The following are the results for an example of 256*256 without using texture cache (only relevant info is shown):
Name Duration(ns) Grid_Size Block_Size
"copy_const_kernel(...) 22688 16,16,1 16,16,1
"blend_kernel(...)" 51360 16,16,1 16,16,1
Following are the results using 1D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 147392 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 841728 16,16,1 16,16,1 79 5041
Following are the results using 2D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 150880 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 872832 16,16,1 16,16,1 2971 2149
These results show several interesting things:
There are no cache hits at all for the "copy_const" function, although ideally the memory is "spatially local", in the sense that each thread accesses memory that is near the memory accessed by neighbouring threads. I guess this is because the threads in this function never read memory locations touched by other threads, which seems to be what makes the texture cache useful (the "spatial locality" concept is quite confusing to me).
There are some cache hits in the 1D case and a lot more in the 2D case for the "blend_kernel" function. I guess this is because within that function each thread accesses memory belonging to its neighbouring threads. I cannot understand why there are more hits in 2D than in 1D.
The duration is greater in the texture cases than in the no-texture case (by nearly an order of magnitude), perhaps related to the many texture cache misses.
For the "copy_const" function there are 1024 total accesses for the SM, and 5120 for "blend_kernel". The 5:1 ratio makes sense, since there are 5 fetches in "blend_kernel" and only 1 in "copy_const". However, I cannot understand where the 1024 comes from: ideally, the "tex cache miss/hit" events only account for one SM (I have 24 in my GeForce GTX 260) and only count warps (of 32 threads). Therefore, I have 256 threads / 32 = 8 warps per block and 256 blocks / 24 = 10 or 11 "iterations" per SM, so I would expect something like 80 or 88 fetches. (Moreover, some other events like sm_cta_launched, which is the number of thread blocks per SM and is supposed to be supported on my 1.3 device, are always 0...)
In CUDA, to cover multiple blocks and thus increase the range of indices for arrays, we do something like this:
Host-side code:
dim3 dimgrid(9,1);   // a total of 9 blocks will be launched
dim3 dimBlock(16,1); // each block has 16 threads, so the total number of
                     // threads in the grid is 16 x 9 = 144
Device-side code:
...
...
idx = blockIdx.x * blockDim.x + threadIdx.x; // idx will range from 0 to 143
a[idx]=a[idx]*a[idx];
...
...
What is the equivalent in OpenCL for achieving the above?
On the host, when you enqueue your kernel using clEnqueueNDRangeKernel, you have to specify the global and local work size. For instance:
size_t global_work_size[1] = { 144 }; // 16 * 9 == 144
size_t local_work_size[1] = { 16 };
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                       global_work_size, local_work_size,
                       0, NULL, NULL);
In your kernel, use:
size_t get_global_size(uint dim);
size_t get_global_id(uint dim);
size_t get_local_size(uint dim);
size_t get_local_id(uint dim);
to retrieve the global and local work sizes and indices respectively, where dim is 0 for x, 1 for y and 2 for z.
The equivalent of your idx will thus be simply size_t idx = get_global_id(0);
See the OpenCL Reference Pages.
Equivalences between CUDA and OpenCL are:
blockIdx.x*blockDim.x+threadIdx.x = get_global_id(0)
LocalSize = blockDim.x
GlobalSize = blockDim.x * gridDim.x
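Putting it together, the device-side code from the question might look like this as an OpenCL kernel (a sketch; the kernel name is an arbitrary choice for the example):

__kernel void square(__global float* a)
{
    size_t idx = get_global_id(0);   // ranges over 0..143 for the launch shown above
    a[idx] = a[idx] * a[idx];
}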