I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, to get the best performance your memory accesses should be coalesced and aligned. My GPU has Compute Capability 6.1 ("Pascal"), 48 KiB of shared memory per thread block, and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block may each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 KiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 KiB / 32 banks / sizeof(double), which equals 192:
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for (...)
    {
        // Perform calculation on shared memory
        cache[threadIdx.x * 192] = ...
    }
    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset, together with cache being an array of double, makes sure that each thread accesses the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code yet, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    buf[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            buf[tid] += cos(temp * (Y * y + Z * z));
        }
    }
}
int main(void)
{
    double *dev_mem;
    double *buf = NULL;
    cudaError_t cudaStatus;
    unsigned int screen_elems = 1000;

    if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
    {
        printf("Could not allocate memory...");
        return -1;
    }
    memset(buf, 0, screen_elems * sizeof(double));

    if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
    {
        printf("cudaMalloc failed with code %u", cudaStatus);
        return cudaStatus;
    }

    kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
    cudaDeviceSynchronize();

    if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
    {
        printf("cudaMemcpy failed with code %u", cudaStatus);
        return cudaStatus;
    }

    cudaFree(dev_mem);
    cudaDeviceReset();
    free(buf);
    return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing, among all threads in a block. Under your assumptions, you should keep your value in a register, not in shared memory. Indeed, in your "MWE added" kernel, that is probably what you should do.
If your threads were to share information, then the pattern of that sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, shared memory is much less likely to help you: to get your data into shared memory in the first place, you always have to read it from global memory at least once and write it to shared memory at least once.
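To make that concrete, here is a minimal sketch of the register-accumulator version of the MWE kernel (same launch configuration and parameters assumed as in the posted code; not benchmarked):

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Per-thread accumulator lives in a register instead of shared or global memory
    double acc = 0;
    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            acc += cos(temp * (Y * y + Z * z));
        }
    }
    // Single write of the result to global memory at the end
    buf[tid] = acc;
}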
I am facing the problem that if I set the CUDA heap size to the total amount of memory I need to allocate in a kernel, the heap is still not big enough to serve all allocations.
This is a minimal example which represents my use case:
#include <stdio.h>

#define NARR 8

__global__
void heaptest(int N){
    double* arr[NARR];
    __shared__ double* arrS[NARR];

    if(threadIdx.x == 0){
        for(int i = 0; i < NARR; i++){
            arrS[i] = (double*) malloc(sizeof(double) * N);
            if(arrS[i] == NULL)
                printf("block %d, array %d is NULL\n", blockIdx.x, i);
        }
    }
    __syncthreads();

    for(int i = 0; i < NARR; i++){
        arr[i] = arrS[i];
    }
}

size_t getHeapSizePerBlock(int N){
    return sizeof(double) * N * NARR;
}

int main(){
    int N = 4000 * 18;
    int nBlocks = 1;
    size_t myheapsize = getHeapSizePerBlock(N) * nBlocks;
    printf("set heap size to %lu\n", myheapsize);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, myheapsize);

    size_t a;
    cudaDeviceGetLimit(&a, cudaLimitMallocHeapSize);
    printf("heap size is now %lu\n", a);

    heaptest<<<nBlocks, 128>>>(N);
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
I compile with nvcc V8.0.61.
nvcc -arch=sm_60 heaptest.cu -o heaptest
Program output is
set heap size to 4608000
heap size is now 4653056
block 0, array 7 is NULL
So, even if the heap size is larger than the required size, it is not large enough.
How do I correctly calculate the required size in this case?
You may not be able to compute the exact size required by your application's heap, since you don't have control over CUDA's memory manager. Just as CPU allocations go through the OS's memory manager, CUDA has its own memory manager. When you allocate multiple arrays on the heap, you have no guarantee that they will fit exactly within the heap size; there may be some per-allocation overhead.
To illustrate, I made a small modification to your code to also print the memory address returned by malloc:
printf("block %d, array %d is %p\n", blockIdx.x, i, arrS[i]);
Here's what I get on my GTX 1070:
block 0, array 0 is 0x102059a8d20
block 0, array 1 is 0x10205600120
block 0, array 2 is 0x1020568f280
block 0, array 3 is 0x10205738520
block 0, array 4 is 0x102057c7680
block 0, array 5 is 0x10205870920
block 0, array 6 is 0x102058ffa80
block 0, array 7 is (nil)
The first thing to note is that the memory locations are not (always) contiguous or increasing (e.g., array 0 > array 6 > ... > array 1), but that's not too important here. Also, if you take the differences between adjacent addresses, you will not get the size you passed to malloc(), which in your case was always sizeof(double) * N, or 576000 bytes. For example:
0x1020568f280 - 0x10205600120 = 586080 bytes (array 1)
0x10205738520 - 0x1020568f280 = 692896 bytes (array 2)
Since these allocations are adjacent in memory, we can see that each one occupies more than the 576000 bytes requested from malloc(): between arrays 1 and 2 there are an extra 10080 bytes, and between arrays 2 and 3 an extra 116896 bytes (that's more than 20% of the requested block size!).
What I would do is avoid allocating memory dynamically on the device heap and instead allocate it from host code. But if you really need to do it that way for some reason, I would suggest setting the heap size with some overhead margin, testing beforehand what seems to be enough. Even though there is some overhead for heap allocation, I would not expect it to be too big, so maybe start by allocating an extra 10% and go up from there if necessary.
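For the host-side route, here is a minimal sketch (names are illustrative; error checking omitted) that allocates the NARR buffers with cudaMalloc up front and hands the pointers to the kernel, so no device-heap sizing is needed at all:

#include <stdio.h>

#define NARR 8

// The kernel receives ready-made device pointers instead of allocating them itself.
__global__ void heaptest(double** arrs, int N){
    double* arr[NARR];
    for(int i = 0; i < NARR; i++){
        arr[i] = arrs[i];
    }
    // ... use arr[0..NARR-1] as before ...
}

int main(){
    int N = 4000 * 18;
    double* h_ptrs[NARR];   // host copy of the device pointers
    double** d_ptrs;        // device array holding those pointers

    for(int i = 0; i < NARR; i++){
        cudaMalloc((void**)&h_ptrs[i], sizeof(double) * N);   // ordinary cudaMalloc, no device heap involved
    }
    cudaMalloc((void**)&d_ptrs, sizeof(h_ptrs));
    cudaMemcpy(d_ptrs, h_ptrs, sizeof(h_ptrs), cudaMemcpyHostToDevice);

    heaptest<<<1, 128>>>(d_ptrs, N);
    cudaDeviceSynchronize();

    for(int i = 0; i < NARR; i++) cudaFree(h_ptrs[i]);
    cudaFree(d_ptrs);
    return 0;
}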
I want all accesses from my program to go to global memory (even if the data is found in the L1/L2 cache). To this end, I found out that the L1 cache can be skipped by passing these options to the nvcc compiler:
-Xptxas -dlcm=cg
CUDA documentation states this:
.cv Cache as volatile (consider cached system memory lines stale, fetch again).
So I am assuming that when I compile with either -dlcm=cg or -dlcm=cv, the generated PTX file should be different from the one that is generated normally (the loads should have .cg or .cv appended).
My sample program:
__global__ void rh_kernel(int *datainRowX, int *datainRowY) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
        return;

    int i, x, y;
    x = datainRowX[1];
    y = datainRowY[2];
    datainRowX[0] = x + y;
}

int main(int argc, char** argv) {
    int* d_datainRowX;
    cudaMalloc((void**)&d_datainRowX, sizeof(int) * 268435456);

    int* d_datainRowY;
    cudaMalloc((void**)&d_datainRowY, sizeof(int) * 268435456);

    rh_kernel<<<1024, 1>>>(d_datainRowX, d_datainRowY);

    cudaFree(d_datainRowX);
    cudaFree(d_datainRowY);
    return(0);
}
I notice that whatever options I pass to the nvcc compiler ("-Xptxas -dlcm=cg", "-Xptxas -dlcm=cv", or nothing), in all three cases the generated PTX is the same. I am using the -ptx option to generate the PTX file.
What am I missing? Is there any other way to achieve what I am doing?
Thanks in advance for your time.
According to the CUDA Toolkit Documentation:

L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache). GK110B-based products such as the Tesla K40 GPU Accelerator, GK20A, and GK210 retain this behavior by default.
The L1 cache is not used for global memory reads on Kepler by default. Thus there is no difference in the PTX when you add -Xptxas -dlcm=cg.
Disabling the L2 cache is not possible.
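If what you want is to actually see a .cg or .cv cache operator on a particular load in the PTX, one possible way (a sketch of an alternative, not something the -dlcm flag does for you) is inline PTX assembly for the loads you care about:

// Load an int from global memory with the .cv ("cache as volatile") operator,
// so the cache qualifier shows up directly in the emitted PTX.
__device__ int load_cv(const int *ptr)
{
    int value;
    asm volatile("ld.global.cv.s32 %0, [%1];" : "=r"(value) : "l"(ptr));
    return value;
}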
I want to use a global variable in several kernel functions, but when I use the following code to initialize a __device__ variable, I get an [access violation on store (global memory)] error when initializing the second array.
__device__ short* blockTmp[4];

//init blockTmp
template<int BS>
__global__ void InitQuarterBuf_kernel()
{
    int iBufSize = 2000000;
    for (int i = 0; i < 4; i++){
        m_filteredBlockTmp[i] = new short[iBufSize];
        m_filteredBlockTmp[i][iBufSize - 1] = 1;
        printf("m_filteredBlockTmp[%d][%d] is %d.\n", i, iBufSize - 1, m_filteredBlockTmp[i][iBufSize - 1]);
    }
}
The error message:
Memory Checker detected 1 access violations.
error = access violation on store (global memory)
gridid = 94
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x003d08fe
accessSize = 2
CUDA grid launch failed: CUcontext: 1014297073536 CUmodule: 1013915337344 Function: _Z21InitBuf_kernelILi8EEvii
CUDA context created : 67e557f3e0
CUDA module loaded: 67cdc7ed80
CUDA module loaded: 67cdc7e180
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
CUcontext = 67e557f3e0
CUstream = 67cdc7f580
CUmodule = 67cdc7e180
CUfunction = 67eb64b2f0
FunctionName = _Z21InitBuf_kernelILi8EEvii
GridId = 94
gridDim = {1,1,1}
blockDim = {1,1,1}
sharedSize = 256
Parameters (raw):
0x00000780 0x00000440
GPU State:
Address Size Type Mem Block Thread blockIdx threadIdx PC Source
----------------------------------------------------------------------------------------------------------------------------------
003d08fe 2 adr st g 0 0 {0,0,0} {0,0,0} _Z21InitBuf_kernelILi8EEvii+0004c8
Summary of access violations:
xxxx_launcher.cu(481): error MemoryChecker: #misaligned=0 #invalidAddress=1
================================================================================
Memory Checker detected 1 access violations.
error = access violation on store (global memory)
gridid = 94
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x003d08fe
accessSize = 2
CUDA grid launch failed: CUcontext: 446229378016 CUmodule: 445834060160 Function: _Z21InitBuf_kernelILi8EEvii
Is there some limit for __device__ variables? How can I initialize the __device__ variable correctly?
And if I change the buffer size to 1000, it is OK.
Your posted kernel doesn't really make sense, as your __device__ variable is named blockTmp but you are initializing m_filteredBlockTmp variables in your kernel, which don't appear to be defined anywhere.
Anyway, supposing these are intended to be the same, the issue is probably not related to your usage of __device__ variables (pointers, in this case) but to your use of in-kernel new, which definitely has allocation limits.
These limits and behavior are the same as what is described in the programming guide for in-kernel malloc. In particular, the default limit on the device heap is 8 MB, and if you need more you must explicitly raise the limit with a CUDA runtime API call.
A useful error check in these situations is to check whether the pointer returned by new or malloc is NULL, which would indicate an allocation failure. If you fail to do that check, but then attempt to use the pointer anyway, you are going to run into trouble as described in your post.
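A minimal host-side sketch of raising that limit for the posted kernel: the four buffers need 4 * 2000000 * sizeof(short), i.e. roughly 16 MB, which exceeds the 8 MB default. The margin factor and the launch configuration below are assumptions, and the kernel is assumed to be declared in the same translation unit:

#include <cstdio>

int main()
{
    // In-kernel new/malloc allocates from the device heap, whose default size is 8 MB.
    size_t needed = 4 * 2000000 * sizeof(short);   // ~16 MB for the four buffers
    size_t margin = needed / 8;                    // assumed safety margin for allocator overhead
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, needed + margin);

    // Launch the poster's kernel; remember to also check the pointers returned by new inside it.
    InitQuarterBuf_kernel<8><<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}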
I am running some image processing operations on GPU and I need the histogram of the output.
I have written and tested the processing kernels, and I have also tested the histogram kernel separately on samples of the output pictures. They both work fine, but when I put all of them in one loop I get nothing.
This is my histogram kernel:
__global__ void histogram(int n, uchar* color, uchar* mask, int* bucket, int ch, int W, int bin)
{
    unsigned int X = blockIdx.x*blockDim.x+threadIdx.x;
    unsigned int Y = blockIdx.y*blockDim.y+threadIdx.y;
    int l = (256%bin==0)?256/bin: 256/bin+1;
    int c;
    if (X+Y*W < n && mask[X+Y*W])
    {
        c = color[(X+Y*W)*3]/bin;
        atomicAdd(&bucket[c], 1);
        c = color[(X+Y*W)*3+1]/bin;
        atomicAdd(&bucket[c+l], 1);
        c = color[(X+Y*W)*3+2]/bin;
        atomicAdd(&bucket[c+l*2], 1);
    }
}
It updates the histogram vectors for red, green, and blue ('l' is the length of each vector).
When I comment out the atomicAdds, it produces the output again, but of course not the histogram.
Why don't they work together?
Edit:
This is the loop:
cudaMemcpy(frame_in_gpu,frame_in.data, W*H*3*sizeof(uchar),cudaMemcpyHostToDevice);
cuda_process(frame_in_gpu, frame_out_gpu, W, H, dimGrid,dimBlock);
cuda_histogram(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin, dimg_histogram, dimb_histogram);
Then I copy the output to host memory and write it to a video.
These are C functions that only call their kernels with the dimGrid and dimBlock that are given as inputs. Also:
dim3 dimBlock(32,32);
dim3 dimGrid(W/32,H/32);
dim3 dimb_Histogram(16,16);
dim3 dimg_Histogram(W/16,H/16);
I changed this for histogram because it worked better with it. Does it matter?
Edit2:
I am using the -arch=sm_11 option for compilation. I just read about it somewhere. Could anyone tell me how I should choose it?
Perhaps you should try to compile without the -arch=sm_11 flag.
SM 1.1 is the first architecture that supported atomic operations on global memory, while your GPU supports SM 2.0. Hence there is no reason to compile for SM 1.1, except for backward compatibility.
One possible issue could be that SM 1.1 does not support atomic operations on 64-bit integers in global memory. So I would suggest recompiling the code without the -arch option, or with -arch=sm_20 if you like.
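If you are unsure which compute capability your device actually has (and therefore what to pass to -arch), a small sketch of how you could query it at runtime:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    // prop.major and prop.minor give the compute capability, e.g. 2.0 -> -arch=sm_20
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}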