OpenCl equivalent of finding Consecutive indices in CUDA - cuda

In CUDA to cover multiple blocks, and thus incerase the range of indices for arrays we do some thing like this:
Host side Code:
dim3 dimgrid(9,1)// total 9 blocks will be launched
dim3 dimBlock(16,1)// each block is having 16 threads // total no. of threads in
// the grid is thus 16 x9= 144.
Device side code
...
...
idx=blockIdx.x*blockDim.x+threadIdx.x;// idx will range from 0 to 143
a[idx]=a[idx]*a[idx];
...
...
What is the equivalent in OpenCL for acheiving the above case ?

On the host, when you enqueue your kernel using clEnqueueNDRangeKernel, you have to specify the global and local work size. For instance:
size_t global_work_size[1] = { 144 }; // 16 * 9 == 144
size_t local_work_size[1] = { 16 };
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
global_work_size, local_work_size,
0, NULL, NULL);
In your kernel, use:
size_t get_global_size(uint dim);
size_t get_global_id(uint dim);
size_t get_local_size(uint dim);
size_t get_local_id(uint dim);
to retrieve the global and local work sizes and indices respectively, where dim is 0 for x, 1 for y and 2 for z.
The equivalent of your idx will thus be simply size_t idx = get_global_id(0);
See the OpenCL Reference Pages.

Equivalences between CUDA and OpenCL are:
blockIdx.x*blockDim.x+threadIdx.x = get_global_id(0)
LocalSize = blockDim.x
GlobalSize = blockDim.x * gridDim.x

Related

Threads hierarchy design in CUDA for my code

I want to convert my previous code in c++ to CUDA
for(int x=0 ; x < 100; x++)
{
for(int y=0 ; y < 100; y++)
{
for(int w=0 ; w < 100; w++)
{
for(int z=0 ; z < 100; z++)
{
........
}
}
}
}
these loops combine to make a new int value.
if I want to use CUDA I have to design threads hierarchy before building
the kernel code.
So How can I design the hierarchy ?
depend on every loop I think it will be like this:
100*100*100*100 = 100000000 thread .
Could you help me
Thanks
My CUDA spec:
CUDA Device #0
Major revision number: 1
Minor revision number: 1
Name: GeForce G 105M
Total global memory: 536870912
Total shared memory per block: 16384
Total registers per block: 8192
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 512
Maximum dimension 3 of block: 64
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Maximum dimension 3 of grid: 1
Clock rate: 1600000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: No
Number of multiprocessors: 1
Kernel execution timeout: Yes
100000000 threads (or blocks) is not too many for a GPU.
Your GPU has compute capability 1.1, so it is limited to 65535 blocks in each of the first two grid dimensions (x and y). Since 100*100 = 10000, we could launch 10000 blocks in each of the first two grid dimensions, to cover your entire for-loop extent. This would launch one block per for-loop iteration (unique combination of x,y,z, and w) and assume that you would use the threads in a block to address the needs of your for-loop calculation code:
__global__ void mykernel(...){
int idx = blockIdx.x;
int idy = blockIdx.y;
int w = idx/100;
int z = idx%100;
int x = idy/100;
int y = idy%100;
int tx = threadIdx.x;
// (the body of your for-loop code here...
}
launch:
dim3 blocks(10000, 10000);
dim3 threads(...); // can use any number here up to 512 for your device
mykernel<<<blocks, threads>>>(...);
If instead, you wanted to assign one thread to each of the inner z iterations of your for-loop (might be useful/higher performance depending on what you are doing and your data organization) you could do something like this:
__global__ void mykernel(...){
int idx = blockIdx.x;
int idy = blockIdx.y;
int w = idx/100;
int x = idx%100;
int y = idy;
int z = threadIdx.x;
// (the body of your for-loop code here...
}
launch:
dim3 blocks(10000, 100);
dim3 threads(100);
mykernel<<<blocks, threads>>>(...);
All of the above assumes your for-loop iterations are independent. If your for-loop iterations are dependent on each other (dependent on the order of execution) then these simplistic answers won't work, and you have not provided enough information in your question to discuss a reasonable strategy.

nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report correct results

I tried to count the number of DRAM (global memory) accesses for simple vector add kernel.
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
int blockStartIndex = blockIdx.x * blockDim.x * N;
int threadStartIndex = blockStartIndex + threadIdx.x;
int threadEndIndex = threadStartIndex + ( N * blockDim.x );
int i;
for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
C[i] = A[i] + B[i];
}
}
Grid Size = 180
Block size = 128
size of array = 180 * 128 * N floats where N is input parameter (elements per thread)
when N = 1, size of array = 180 * 128 * 1 floats = 90KB
All arrays A, B and C should be read from DRAM.
Therefore theoretically,
DRAM writes (C) = 2880 (32 byte accesses)
DRAM reads (A,B) = 2880 + 2880 = 5760 (32 byte accesses)
But when I used nvprof
DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32 byte accesses)
DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32 byte accesses)
Now this is the problem. Theoretically there should be 5760 DRAM reads, but nvprof only reports 30, for me this looks impossible. Further more, if you double the size of the vector (N = 2), still the reported DRAM accesses remains at 30.
It would be great, if someone can shed some light.
I have disabled the L1 cache by using compiler option "-Xptxas -dlcm=cg"
Thanks,
Waruna
If you have done cudaMemcpy before the kernel launch to copy the source buffers from host to device, that gets the source buffers in L2 cache and hence the kernel doesn't see any misses from L2 for reads and you get less number of (fb_subp0_read_sectors + fb_subp1_read_sectors).
If you comment out cudaMemcpy before the kernel launch, you will see that the event values of fb_subp0_read_sectors and fb_subp1_read_sectors include the values you are expecting.

What is the general way to launch appropriate amount of reduction kernels?

As I have read from NVIDIA's instruction in this link http://www.cuvilib.com/Reduction.pdf, for arrays bigger than blockSize, I should launch multiple reduction kernels to achieve global synchronization. What is the general way to determine how many times I should launch the reduction kernel? I tried as below but I need to Malloc 2 additional pointers, which takes a lot of processing times.
My job is to Reduce the array d_logLuminance into one minimum value min_logLum
void your_histogram_and_prefixsum(const float* const d_logLuminance,
float &min_logLum,
const size_t numRows,
const size_t numCols)
{
const dim3 blockSize(512);
unsigned int pixel = numRows*numCols;
const dim3 gridSize(pixel/blockSize.x+1);
//Reduction kernels to find max and min value
float *d_tempMin, *d_min;
checkCudaErrors(cudaMalloc((void**) &d_tempMin, sizeof(float)*pixel));
checkCudaErrors(cudaMalloc((void**) &d_min, sizeof(float)*pixel));
checkCudaErrors(cudaMemcpy(d_min, d_logLuminance, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
dim3 subGrid = gridSize;
for(int reduceLevel = pixel; reduceLevel > 0; reduceLevel /= blockSize.x) {
checkCudaErrors(cudaMemcpy(d_tempMin, d_min, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
reduceMin<<<subGrid,blockSize,blockSize.x*sizeof(float)>>>(d_tempMin, d_min);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
subGrid.x = subGrid.x / blockSize.x + 1;
}
checkCudaErrors(cudaMemcpy(&min_logLum, d_min, sizeof(float), cudaMemcpyDeviceToHost));
std::cout<< "Min value = " << min_logLum << std::endl;
checkCudaErrors(cudaFree(d_tempMin));
checkCudaErrors(cudaFree(d_min));
}
And if you are curious, here is my reduction kernel:
__global__
void reduceMin(const float* const g_inputRange,
float* g_outputRange)
{
extern __shared__ float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
sdata[tid] = g_inputRange[i];
__syncthreads();
for(unsigned int s = blockDim.x/2; s > 0; s >>= 1){
if (tid < s){
sdata[tid] = min(sdata[tid],sdata[tid+s]);
}
__syncthreads();
}
if(tid == 0){
g_outputRange[blockIdx.x] = sdata[0];
}
}
There are many ways to skin the cat, but if you want to minimize kernel launches, it can always be done with at most two kernel launches.
The first kernel launch is composed of up to however many blocks correspond to the number of threads per block that your device supports. Newer devices will support 1024, older devices, 512.
Each of these (at most 512 or 1024) blocks in the first kernel will participate in a grid-looping sum of all the data elements in global memory.
Each of these blocks will then do a partial reduction and write a partial result to global memory. There will be 512 or 1024 of these partial results.
The second kernel launch will be composed of 512 or 1024 threads in a single block. Each thread will pick up one of the partial results from global memory, and then the threads in that single block will cooperatively reduce the partial results to a single final result, and write it back to global memory.
The "grid-looping sum" is described in reduction #7 here as "multiple add/thread". All of the reductions described in this document are available in the NVIDIA reduction sample code

Texture fetch slower than direct global access, chapter 7 from "Cuda by example" book

I am reading and testing the examples in the book "Cuda By example. An introduction to General Purpose GPU Programming".
When testing the examples in chapter 7, relative to texture memory, I realized that access to global memory via texture cache is much slower than direct access (My NVIDIA GPU is GeForceGTX 260, compute capability 1.3 and I am using NVDIA CUDA 4.2):
Time per frame with texture fetch (1D or 2D) for a 256*256 image: 93 ms
Time per frame not using texture (just direct global access) for 256*256: 8.5 ms
I have double checked the code several times and I have also been reading the "CUDA C Programming guide" and "CUDA C Best practices Guide" which come along with the SDK, and I do not really understand the problem.
As far as I understand, texture memory is just global memory with a specific access mechanism implementation to make it look like a cache (?). I am wondering whether coalesced access to global memory will make texture fetch slower, but I cannot be sure.
Does anybody have a similar problem?
(I found some links in NVIDIA forums for a similar problem, but the link is no longer available.)
The testing code looks this way, only including the relevant parts:
//#define TEXTURE
//#define TEXTURE2
#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif
__global__ void copy_const_kernel( float *iptr
#ifdef TEXTURE2
){
#else
,const float *cptr ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
#ifdef TEXTURE2
float c = tex1Dfetch(texConstSrc,offset);
#else
float c = cptr[offset];
#endif
if ( c != 0) iptr[offset] = c;
}
__global__ void blend_kernel( float *outSrc,
#ifdef TEXTURE
bool dstOut ) {
#else
const float *inSrc ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
int left = offset - 1;
int right = offset + 1;
if (x == 0) left++;
if (x == SXRES-1) right--;
int top = offset - SYRES;
int bottom = offset + SYRES;
if (y == 0) top += SYRES;
if (y == SYRES-1) bottom -= SYRES;
#ifdef TEXTURE
float t, l, c, r, b;
if (dstOut) {
t = tex1Dfetch(texIn,top);
l = tex1Dfetch(texIn,left);
c = tex1Dfetch(texIn,offset);
r = tex1Dfetch(texIn,right);
b = tex1Dfetch(texIn,bottom);
} else {
t = tex1Dfetch(texOut,top);
l = tex1Dfetch(texOut,left);
c = tex1Dfetch(texOut,offset);
r = tex1Dfetch(texOut,right);
b = tex1Dfetch(texOut,bottom);
}
outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
#else
outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
inSrc[bottom] + inSrc[left] + inSrc[right] -
inSrc[offset]*4);
#endif
}
// globals needed by the update routine
struct DataBlock {
unsigned char *output_bitmap;
float *dev_inSrc;
float *dev_outSrc;
float *dev_constSrc;
cudaEvent_t start, stop;
float totalTime;
float frames;
unsigned size;
unsigned char *output_host;
};
void anim_gpu( DataBlock *d, int ticks ) {
checkCudaErrors( cudaEventRecord( d->start, 0 ) );
dim3 blocks(SXRES/16,SYRES/16);
dim3 threads(16,16);
#ifdef TEXTURE
volatile bool dstOut = true;
#endif
for (int i=0; i<90; i++) {
#ifdef TEXTURE
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
#ifdef TEXTURE2
copy_const_kernel<<<blocks,threads>>>( in );
#else
copy_const_kernel<<<blocks,threads>>>( in,
d->dev_constSrc );
#endif
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
#else
copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
d->dev_constSrc );
blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
d->dev_inSrc );
swap( d->dev_inSrc, d->dev_outSrc );
#endif
}
// Some stuff for the events
// ...
}
I have been testing the results with the nvvp (NVIDIA profiler)
The result are quite curious as they show that there are a lot of texture cache misses (which are probably the cause for the bad performance).
The result from the profiler show also information that is difficult to understand even using the guide "CUPTI_User_GUide):
text_cache_hit: Number of texture cache hits (they are accounted only for one SM according to 1.3 capability).
text_cache_miss: Number of texture cache miss (they are accounted only for one SM according to 1.3 capability).
The following are the results for an example of 256*256 without using texture cache (only relevant info is shown):
Name Duration(ns) Grid_Size Block_Size
"copy_const_kernel(...) 22688 16,16,1 16,16,1
"blend_kernel(...)" 51360 16,16,1 16,16,1
Following are the results using 1D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 147392 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 841728 16,16,1 16,16,1 79 5041
Following are the results using 2D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 150880 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 872832 16,16,1 16,16,1 2971 2149
These result show several interesting info:
There are no cache hits at all for the "copy const" function (although ideally the memory is "spatially located", in the sense that each thread accesses memory which is near to the memory acceded by other near threads). I guess that this is because the threads within this function do not access memory from other threads, which seems to be the way for the texture cache to be usable (being the "spatially located" concept quite confusing)
There are some cache hits in the 1D and a lot more in the 2D case for the function "blend_kernel". I guess that it is due to the fact that within that function, any thread access memory from their neighbours threads. I cannot understand why there are more in 2D than 1d.
The duration time is greater in the texture cases than in the no texture case (nearly about one order of magnitude). Perhaps related with the so many texture cache misses.
For the "copy_const" function there are 1024 total accesses for the SM and 5120 for the "blend kernel". The relation 5:1 is correct due to the fact that there are 5 fetches in "blend" and only 1 in "copy_const". Anyway, I cannot understand where all this 1024 come from: ideally, this event "text cache miss/hot" only accounts for one SM (I have 24 in my GeForceGTX 260) and it only accounts for warps ( 32 thread size). Therefore, I have 256 threads/32=8 warps per SM and 256 blocks/24 = 10 or 11 "iterations" per SM, so I would be expecting something like 80 or 88 fetches (more over, some other event like sm_cta_launched, which is the number of thread blocks per SM, which is supposed to be supported in my 1.3 device, is always 0...)

CUDA - Memory Limit - Vector Summation

I'm trying to learn CUDA and the following code works OK for the values N<= 16384, but fails for the greater values(Summation check at the end of the code fails, c values are always 0 for the index value of i>=16384).
#include<iostream>
#include"cuda_runtime.h"
#include"../cuda_be/book.h"
#define N (16384)
__global__ void add(int *a,int *b,int *c)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if(tid<N)
{
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
int main()
{
int a[N],b[N],c[N];
int *dev_a,*dev_b,*dev_c;
//allocate mem on gpu
HANDLE_ERROR(cudaMalloc((void**)&dev_a,N*sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_b,N*sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_c,N*sizeof(int)));
for(int i=0;i<N;i++)
{
a[i] = -i;
b[i] = i*i;
}
HANDLE_ERROR(cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice));
system("PAUSE");
add<<<128,128>>>(dev_a,dev_b,dev_c);
//copy the array 'c' back from the gpu to the cpu
HANDLE_ERROR( cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost));
system("PAUSE");
bool success = true;
for(int i=0;i<N;i++)
{
if((a[i] + b[i]) != c[i])
{
printf("Error in %d: %d + %d != %d\n",i,a[i],b[i],c[i]);
system("PAUSE");
success = false;
}
}
if(success) printf("We did it!\n");
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
I think it's a shared memory related problem, but I can't come up with a good explanation(Possible lack of knowledge). Could you provide me an explanation and a workaround to run for the values of N greater than 16384. Here is the specs for my GPU:
General Info for device 0
Name: GeForce 9600M GT
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap : Enabled
Kernel Execution timeout : Enabled
Mem info for device 0
Total global mem: 536870912
Total const mem: 65536
Max mem pitch: 2147483647
Texture Alignment 256
MP info about device 0
Multiproccessor count: 4
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512,512,64)
Max grid dimensions: (65535,65535,1)
You probably intended to write
while(tid<N)
not
if(tid<N)
You aren't running out of shared memory, your vector arrays are being copied into your device's global memory. As you can see this has far more space available than the 196608 bytes (16384*4*3) you need.
The reason for your problem is that you are only performing one addition operation per thread so hence with this structure, the maximum dimension that your vectors can be is the block*thread parameters in your kernel launch as tera has pointed out. By correcting
if(tid<N)
to
while(tid<N)
in your code, each thread will perform its addition on multiple indexes and the whole array will be considered.
For more information about the memory hierarchy and the various different places memory can sit, you should read sections 2.3 and 5.3 of the CUDA_C_Programming_Guide.pdf provided with the CUDA toolkit.
Hope that helps.
If N is:
#define N (33 * 1024) //value defined in Cuda by Examples
The same code I found in Cuda by Example, but the value of N was different. I think that o value of N cant be 33 * 1024. I must change the parameters number of block and number of threads per blocks. Because:
add<<<128,128>>>(dev_a,dev_b,dev_c); //16384 threads
(128 * 128) < (33 * 1024) so we have a crash.