I am trying to modify the original YUV->RGB kernel provided in the sample code of the NVIDIA Video SDK, and I need help understanding some of its parts.
Here is the kernel code:
template<class YuvUnitx2, class Rgb, class RgbIntx2>
__global__ static void YuvToRgbKernel(uint8_t* pYuv, int nYuvPitch, uint8_t* pRgb, int nRgbPitch, int nWidth, int nHeight) {
    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight) {
        return;
    }

    uint8_t* pSrc = pYuv + x * sizeof(YuvUnitx2) / 2 + y * nYuvPitch;
    uint8_t* pDst = pRgb + x * sizeof(Rgb) + y * nRgbPitch;

    YuvUnitx2 l0 = *(YuvUnitx2*)pSrc;
    YuvUnitx2 l1 = *(YuvUnitx2*)(pSrc + nYuvPitch);
    YuvUnitx2 ch = *(YuvUnitx2*)(pSrc + (nHeight - y / 2) * nYuvPitch);

    // YuvToRgbForPixel - returns rgba encoded in uint32_t (.d)
    *(RgbIntx2*)pDst = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l0.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l0.y, ch.x, ch.y).d,
    };
    *(RgbIntx2*)(pDst + nRgbPitch) = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l1.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l1.y, ch.x, ch.y).d,
    };
}
Here are my basic assumptions; some of them may be wrong:

1. NV12 has two planes: one for luma and one for interleaved chroma.
2. The kernel tries to write 4 pixels at a time.

If assumption 2 is correct, the question is why the same chroma (ch) values are used for all 4 pixels. And if I am wrong about 2, please explain what exactly happens here.
The chroma planes in NV12 and NV21 are subsampled by a factor of 2 in each dimension (4:2:0).
For every 2x2 block of output pixels there are 4 luma (Y) samples but only 1 Cb and 1 Cr sample, which is why the kernel reuses the same chroma pair (ch) for all 4 pixels.
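To make that concrete, here is a sketch of the NV12 address arithmetic the kernel relies on, assuming 8-bit samples (YuvUnitx2 = uchar2); the variable names are taken from the kernel above:

    // Luma plane:   nHeight rows, one Y byte per pixel.
    // Chroma plane: nHeight/2 rows of interleaved U,V bytes, starting right
    //               after the luma plane; each U,V pair covers a 2x2 pixel block.
    uint8_t* lumaRow   = pYuv + y * nYuvPitch + x;                 // == pSrc above
    uint8_t* chromaRow = pYuv + (nHeight + y / 2) * nYuvPitch + x; // U,V pair for this 2x2 block

    // The kernel reaches the same chroma address relative to pSrc:
    //    pSrc + (nHeight - y/2) * nYuvPitch
    // == pYuv + x + (y + nHeight - y/2) * nYuvPitch   (substituting pSrc)
    // == pYuv + x + (nHeight + y/2) * nYuvPitch       (y is even, so y - y/2 == y/2)

So l0 and l1 are two vertically adjacent pairs of luma samples, and ch is the single U,V pair shared by all four pixels.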
For the following CUDA kernel, I get the branch divergence shown below. How can I optimize it?
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
if (gx % 4 == 0)
    val = op1(val);
else if (gx % 4 == 1)
    val = op2(val);
else if (gx % 4 == 2)
    val = op3(val);
else if (gx % 4 == 3)
    val = op4(val);
g_data[gx] = val;
If I were programming in CUDA, I certainly wouldn't do any of this. However, to answer your question:
how to avoid thread divergence in this CUDA kernel?
You could do something like this:
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
// extract the low two bits of gx, which select among the four ops
int gx_bit_0 = gx & 1;
int gx_bit_1 = (gx & 2) >> 1;
// evaluate all four ops and keep the one whose 0/1 weight is set;
// the other three terms multiply to zero
val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val)
    + (1-gx_bit_1)*(  gx_bit_0)*op2(val)
    + (  gx_bit_1)*(1-gx_bit_0)*op3(val)
    + (  gx_bit_1)*(  gx_bit_0)*op4(val);
g_data[gx] = val;
Here is a full test case:
$ cat t1914.cu
#include <iostream>

__device__ float op1(float val) { return val + 1.0f;}
__device__ float op2(float val) { return val + 2.0f;}
__device__ float op3(float val) { return val + 3.0f;}
__device__ float op4(float val) { return val + 4.0f;}

__global__ void k(float *g_data){
    int gx = threadIdx.x + blockDim.x * blockIdx.x;
    float val = g_data[gx];
    int gx_bit_0 = gx & 1;
    int gx_bit_1 = (gx & 2) >> 1;
    val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val) + (1-gx_bit_1)*(gx_bit_0)*op2(val) + (gx_bit_1)*(1-gx_bit_0)*op3(val) + (gx_bit_1)*(gx_bit_0)*op4(val);
    g_data[gx] = val;
}

const int N = 32;

int main(){
    float *data;
    cudaMallocManaged(&data, N*sizeof(float));
    for (int i = 0; i < N; i++) data[i] = 1.0f;
    k<<<1,N>>>(data);
    cudaDeviceSynchronize();
    for (int i = 0; i < N; i++) std::cout << data[i] << std::endl;
}
$ nvcc -o t1914 t1914.cu
$ compute-sanitizer ./t1914
========= COMPUTE-SANITIZER
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
========= ERROR SUMMARY: 0 errors
$
Solution by changing the work per thread
The best solution with the existing data layout is to let every thread compute 4 consecutive values. It's better to have fewer threads that can work properly than have more that can't.
float* g_data;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
g_data[4 * gx] = op1(g_data[4 * gx]);
g_data[4 * gx + 1] = op2(g_data[4 * gx + 1]);
g_data[4 * gx + 2] = op3(g_data[4 * gx + 2]);
g_data[4 * gx + 3] = op4(g_data[4 * gx + 3]);
If the size of g_data is not a multiple of 4, put an if around the index operations. If it is always a multiple of 4 and properly aligned, load and store 4 values as a float4 for better performance.
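For illustration, a minimal sketch of that float4 variant, assuming g_data is 16-byte aligned and its element count is a multiple of 4 (the names k4 and n4 are mine):

    // Each thread loads one float4 (16 bytes), applies the four ops,
    // and stores the result back. n4 = total_elements / 4.
    __global__ void k4(float *g_data, int n4){
        int gx = threadIdx.x + blockDim.x * blockIdx.x;
        if (gx >= n4) return;
        float4 v = reinterpret_cast<float4*>(g_data)[gx];  // one vectorized load
        v.x = op1(v.x);
        v.y = op2(v.y);
        v.z = op3(v.z);
        v.w = op4(v.w);
        reinterpret_cast<float4*>(g_data)[gx] = v;         // one vectorized store
    }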
Solution by reordering the work
As all my talk about float4 may have suggested, your input data appears to be some form of 2D structure in which every fourth element shares a similar function. Maybe it is an array of structs or an array of vectors -- in other words, a matrix.
For the purpose of explaining what I mean, I consider it an Nx4 matrix. If you transpose this into a 4xN matrix and apply a kernel to that, most of your problems disappear: entries that need the same operation are then placed next to each other in memory, which makes writing an efficient kernel easier. Something like this:
float* g_data;
int rows_in_g;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
int gy = threadIdx.y;
float& own_g = g_data[gx + rows_in_g * gy];
switch(gy) {
    case 0: own_g = op1(own_g); break;
    case 1: own_g = op2(own_g); break;
    case 2: own_g = op3(own_g); break;
    case 3: own_g = op4(own_g); break;
    default: break;
}
Start this as a 2D kernel with blocksize x=32, y=4 and gridsize x=N/32, y=1.
Now your kernel is still divergent, but all threads within a warp will execute the same case and access consecutive floats in memory. That's the best you can achieve. Of course this all depends on whether you can change the data layout.
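For concreteness, a sketch of that launch, assuming the switch-based body above is wrapped in a __global__ function (k2d is my name for it) and N is a multiple of 32:

    // Hypothetical launch for the switch-based kernel above.
    dim3 block(32, 4);               // blocksize x=32, y=4: each warp sees a single gy
    dim3 grid(N / 32, 1);            // assumes N is a multiple of 32
    k2d<<<grid, block>>>(g_data, N); // rows_in_g: stride between the four op-groups (N here)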
I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated from 3 separate contributions, obtained from member functions of class instances passed to the kernel. If I use any one of these contributions alone to set the output, the kernel works. As soon as I add 2 (or all 3) of these contributions together to set the output, the kernel simply does not launch at all.
I've inserted the full kernel code at the end, but I'll try to exemplify the above first.
First define the first 2 contributions:
//contribution 1
VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);
//contribution 2
VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);
Now set output:
Case 1 - kernel launches and sets expected values:
V_pri[cell1_idx] = value1;
Case 2 - kernel launches and sets expected values:
V_pri[cell1_idx] = value2;
Case 3 - kernel does not launch:
V_pri[cell1_idx] = value1 + value2;
I am completely stumped, as this seems to defy logic, and I would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?
I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.
Here is the full kernel code:
template <typename VType, typename Class_CMBND>
__global__ void set_cmbnd_values_kernel(
    cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,
    Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,
    CMBNDInfoCUDA& contact)
{
    int box_idx = blockIdx.x * blockDim.x + threadIdx.x;

    cuINT3 box_sizes = contact.cells_box.size();

    if (box_idx < box_sizes.dim()) {
        int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;
        int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;
        int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;

        cuReal hL = contact.hshift_secondary.norm();
        cuReal hR = contact.hshift_primary.norm();
        cuReal hmax = (hL > hR ? hL : hR);

        int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;

        if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;

        int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;

        cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;

        cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);

        VType V_2 = V_pri[cell2_idx];
        VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);

        //a values
        VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);
        VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);

        //b values adjusted with weights
        cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;
        cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;

        //V'' values at cell positions -1 and 1
        VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);
        VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);

        //Formula for V1
        V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)
            - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR
            + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);
    }
}
It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.
Right, it seems I found the answer to my problem.
Looking at the error generated after the launch, I get "Too many resources requested for launch". Each added contribution increases the kernel's register usage per thread, and at 512 threads per block the total exceeded the per-block register limit.
I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.
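For anyone hitting the same thing, a minimal sketch of how this failure can be caught and diagnosed (heavy_kernel is a hypothetical stand-in for the templated kernel above):

    #include <cstdio>

    __global__ void heavy_kernel(float *out) { /* register-hungry body */ }

    void launch_checked(float *out, int blocks, int threads) {
        // Ask the runtime how many registers the compiled kernel uses per
        // thread and the largest block size it will accept.
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, heavy_kernel);
        printf("regs/thread: %d, max threads/block: %d\n",
               attr.numRegs, attr.maxThreadsPerBlock);

        heavy_kernel<<<blocks, threads>>>(out);

        // A too-many-resources failure is reported at launch time,
        // not from inside the kernel.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));
    }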
The following kernel performs a matrix copy; I came across it in this article:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
__global__ void copy(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}
I am confused by the notation used. From what I understand, the data is in row-major format: "y" corresponds to rows and "x" to columns, so the linear index is calculated as data[y][x] = data[y*width + x].
How is odata[(y+j)*width + x] coalesced? In row-major order, elements of the same row are in successive locations, so accessing elements in the fashion (y,x), (y,x+1), (y,x+2), ... is contiguous.
However, "j" above is added to "y", which does not seem coalesced.
Is my understanding of the notation incorrect, or am I missing something here?
Coalescing memory transactions only requires that threads from the same warp read and write into a contiguous block of memory which can be served by a single transaction. Your code
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
odata[(y+j)*width + x] = idata[(y+j)*width + x];
produces coalesced access because j is constant across every thread in a warp while x differs by one between adjacent threads. Writing out the addresses a warp touches on successive loop iterations:
j = 0:            (y*width + x); (y*width + x+1); (y*width + x+2); .....
j = BLOCK_ROWS:   ((y+BLOCK_ROWS)*width + x); ((y+BLOCK_ROWS)*width + x+1); .....
j = 2*BLOCK_ROWS: ((y+2*BLOCK_ROWS)*width + x); ((y+2*BLOCK_ROWS)*width + x+1); .....
Within each warp, at any value of j, the accesses still hit sequential elements of memory, so the reads and writes will coalesce.
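For contrast, a sketch of a genuinely non-coalesced access (not from the article): if the roles of x and y were swapped, adjacent threads in a warp would touch addresses width elements apart, and each warp's accesses would span many separate memory transactions.

    // Deliberately bad variant for comparison: consecutive threads
    // (consecutive x) now access addresses width floats apart.
    odata[x*width + (y+j)] = idata[x*width + (y+j)];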
How can I count the number of cycles consumed by a function like the following? Should I simply count the number of additions, multiplications, and divisions? Where can I check how many cycles an addition takes in CUDA?
__global__
void mandelbrotSet_per_element(Grayscale *image){
    float minR = -2.0f, maxR = 1.0f;
    float minI = -1.2f, maxI = minI + (maxR-minR) * c_rows / c_cols;
    float realFactor = (maxR - minR) / (c_cols-1);
    float imagFactor = (maxI - minI) / (c_rows-1);

    bool isInSet;
    float c_real, c_imag, z_real, z_imag;

    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int x = blockDim.x * blockIdx.x + threadIdx.x;

    while (y < c_rows){
        while (x < c_cols) {
            c_real = minR + x * realFactor;
            c_imag = maxI - y * imagFactor;
            z_real = c_real; z_imag = c_imag;

            isInSet = true;
            for (int k = 0; k < c_iterations; k++){
                float z_real2 = z_real * z_real;
                float z_imag2 = z_imag * z_imag;
                if (z_real2 + z_imag2 > 4){
                    isInSet = false;
                    break;
                }
                z_imag = 2 * z_real * z_imag + c_imag;
                z_real = z_real2 - z_imag2 + c_real;
            }
            if (isInSet) image[y*c_cols+x] = 255;
            else         image[y*c_cols+x] = 0;
            x += blockDim.x * gridDim.x;
        }
        x = blockDim.x * blockIdx.x + threadIdx.x;
        y += blockDim.y * gridDim.y;
    }
}
Instruction throughput is described in the programming guide here
You can also try measuring a sequence of instructions using the native clock() function described here
The compiler tends to obscure the actual count of operations at the source code level (increasing or possibly decreasing the apparent arithmetic intensity), so if you want to identify exactly what the machine is doing you may want to inspect the PTX (nvcc -ptx ...) or possibly the machine assembly code, called SASS, which you can extract from an executable using the cuobjdump utility.
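As an illustration of the clock() approach, a minimal sketch (my own, not from the guide), assuming a single-thread launch so the timed region isn't interleaved with other warps' work:

    // Times a short sequence of fused multiply-adds with clock64().
    __global__ void time_fma(float a, float b, float c, float *out, long long *cycles){
        float v = a;
        long long start = clock64();
        #pragma unroll
        for (int k = 0; k < 256; k++)
            v = v * b + c;          // the instruction sequence being measured
        long long stop = clock64();
        out[0] = v;                 // keep v live so the loop isn't optimized away
        cycles[0] = stop - start;   // elapsed cycles on this SM
    }
    // Launch as time_fma<<<1,1>>>(...); dividing by 256 gives a rough
    // per-iteration cycle count (latency, since the chain is dependent).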
There is a weird problem. I have the following code. When I call the first function it does not give the correct result. However, when I call function2 (the second function) it works fine. This is so weird to me. Does anyone have any idea about the problem? Thanks!!!
__global__ void function(int w, Class<double> C, float *result) {
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    int c = threadIdx.y + blockIdx.y * blockDim.y;
    int half_w = w / 2;
    if (r < w && c < w) {
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}

__global__ void function2(int w, Class<double> C, float *result) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int half_w = w / 2;
    int r = tid / w;
    int c = tid % w;
    if (r < w && c < w) {
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}
UPDATE:
I use function and function2 to draw an image. The pixel value is based on the distance between the image center and the current pixel position; based on that distance, C.getVal calculates the value for the pixel. So, in the kernel, I just make every thread calculate the distance and the corresponding pixel value. The result is compared with a CPU version. function just gives seemingly random values, some very large and some very small. When I changed result[c * w + r] = (float)C.getVal(dis) to result[c * w + r] = 1.0f, the generated image did not seem to change.
The image size is W x W. To launch function I set:
dim3 grid_dim(w / 64 + 1, w / 64 + 1);
dim3 block_dim(64, 64);
function<<<grid_dim, block_dim>>>(W, C, cu_img);
To launch function2:
function2<<<W / 128 + 1, 128>>>(W, C, cu_img);
Fixed:
I solved the problem: I assigned too many threads to one block. The maximum number of threads per block is 1024 on my device, but function was launched with 64 x 64 = 4096 threads per block. When I ran cuda-memcheck, I could see that function was never actually launched.
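A minimal sketch of a launch configuration that stays within the limit (16 x 16 = 256 threads per block; the bounds check inside the kernel handles the partial blocks at the edges):

    dim3 block_dim(16, 16);   // 256 threads per block, well under the 1024 limit
    dim3 grid_dim((W + block_dim.x - 1) / block_dim.x,
                  (W + block_dim.y - 1) / block_dim.y);
    function<<<grid_dim, block_dim>>>(W, C, cu_img);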