How __align__ works in cuda C? - cuda

I am practicing an exercise for Array of Struct (AoS). A struct with/without __align__ has been defined like:
#ifdef TESTALIGN8
struct __align__(8) InnerStruct {
float x;
float y;
};
#else
struct InnerStruct {
float x;
float y;
};
#endif
The test case is
__global__ void testGpuInnerStruct(InnerStruct *data, InnerStruct *result, const int n) {
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
result[idx].x = data[idx].x + 10;
result[idx].y = data[idx].y + 20;
}
}
The file could be found at gist
Both cases were profiled by ncu-ui on Quadro RTX 4000 and the Memory Workload Analysis is like
Performance without __align__(8)
Performance with __align__(8)
Why L1 hit of latter case is 0%? In my mind, the minimum granularity of load/store is 32 bytes and sizeof(InnerStruct) is 8 bytes with or without __align__(8) qualifier, the InnerStruct.x and InnerStruct.y would always be read in a same load with or without L1 cache. How __align__ impacts the performance like this?

Why L1 hit of latter case is 0%?
The __align__(8) directive allows the compiler to discover that it can convert 2 separate loads into a single load. The result is that whereas in the non-decorated case, the compiler generates 2 loads, and the 2nd load derives benefit (cache hit rate) from the first load, in the decorated case, there is only one load instruction. Therefore there is no observed cache benefit.
For the non-decorated case, the compiler does something like this:
if (idx < n) {
//result[idx].x = data[idx].x + 10;
LDG R0, [result[idx].x]; // pulls two 128-byte L1 cachelines per warp: MISS L1,MISS L2
FADD R1, R0, 10;
STG [result[idx].x], R1; // HIT L2
//result[idx].y = data[idx].y + 20;
LDG R0, [result[idx].y]; // benefits from the L1 cache: HIT L1
FADD R1, R0, 20;
STG [result[idx].y], R1; // HIT L2
}
For the decorated case, the compiler does something like:
if (idx < n) {
LDG.64 R0,[result[idx].x]; // nothing populated cache prior to this: MISS L1,MISS L2
//result[idx].x = data[idx].x + 10;
FADD R0, R0, 10;
//result[idx].y = data[idx].y + 20;
FADD R1, R1, 20;
STG.64 [result[idx].x], R0; // HIT L2
}
Thus there is only one load instruction, which does not get any cache benefit.
In the non-decorated case, the compiler cannot assume that the struct is aligned to 8 bytes. It can only assume a 4-byte alignment (the natural alignment for float type). If the struct only has 4 byte alignment (and not 8-byte alignment), then the LDG.64 instruction is not legal, because that instruction requires a "natural" 8-byte alignment. Therefore in the non-decorated case, the compiler must use two 4-byte loads, because it cannot assume 8 byte alignment, whereas in the decorated case, it knows that LDG.64 is legal, and so it uses it instead.
(Aside: I suspect your GPU is not actually a Quadro 4000, but instead maybe a Quadro RTX 4000, because the Quadro 4000 was a fermi-class GPU which is not supported by any recent version of CUDA, much less nsight compute.)

Related

CUDA shared vs global memory, possible speedup

I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way to get the best performance you should make your threads coalesce and align any access. My GPU has Compute Capability 6.1 "Pascal", has 48 kiB of shared memory per thread block and 2 GiB DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block each may simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with on offset of 48 kiB / 32 banks / sizeof(double) which equals 192:
__shared__ double cache[6144];
__global__ void kernel(double *buf_out, double a, double b, double c)
{
for(...)
{
// Perform calculation on shared memory
cache[threadIdx.x * 192] = ...
}
// Write result to global memory
buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double make sure that each thread will access the first element of a different bank, at the same time. I haven't gotten around to modify and test the code, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
cache[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
cache[tid] += cos(temp * (Y * y + Z * z));
}
}
buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.

Is dynamic and static shared memory allocation allowed at the same time in CUDA? [duplicate]

Suppose I have two __device__ CUDA function, each having the following local variable:
__shared__ int a[123];
and another function (say it's my kernel, i.e. a __global__ function), with:
extern __shared__ int b[];
Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in the programming guide section B.2.3 on __shared__) Do the sizes all count together together towards the shared memory limit, or is it the maximum possibly in use at a single time? Or some other rule?
This can be considered a follow-up question to this one.
The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is calculated during compilation, and each declaration is an actual allocation - activating ptxas info during compilation illustrates it here:
ptxas info : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]
Here, we have 384 bytes, which is 3 arrays of 32 ints. (see sample corde below).
You may pass a pointer to shared memory since Kepler, to another function allowing a device sub-function to access another shared memory declaration.
Then, comes the dynamically allocated shared memory, which reserved size is declared during kernel call.
Here is an example of some various uses in a couple of functions. Note the pointer value of each shared memory region.
__device__ void dev1()
{
__shared__ int a[32] ;
a[threadIdx.x] = threadIdx.x ;
if (threadIdx.x == 0)
printf ("dev1 : %x\n", a) ;
}
__device__ void dev2()
{
__shared__ int a[32] ;
a[threadIdx.x] = threadIdx.x * 5 ;
if (threadIdx.x == 0)
printf ("dev2 : %x\n", a) ;
}
__global__ void kernel(int* res, int* res2)
{
__shared__ int a[32] ;
extern __shared__ int b[];
a[threadIdx.x] = 0 ;
b[threadIdx.x] = threadIdx.x * 3 ;
dev1();
__syncthreads();
dev2();
__syncthreads();
res[threadIdx.x] = a[threadIdx.x] ;
res2[threadIdx.x] = b[threadIdx.x] ;
if (threadIdx.x == 0)
printf ("global a : %x\n", a) ;
if (threadIdx.x == 0)
printf ("global b : %x\n", b) ;
}
int main()
{
int* dres ;
int* dres2 ;
cudaMalloc <> (&dres, 32*sizeof(int)) ;
cudaMalloc <> (&dres2, 32*sizeof(int)) ;
kernel<<<1,32,32*sizeof(float)>>> (dres, dres2);
int hres[32] ;
int hres2[32] ;
cudaMemcpy (hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost) ;
cudaMemcpy (hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost) ;
for (int k = 0 ; k < 32 ; ++k)
{
printf ("%d -- %d \n", hres[k], hres2[k]) ;
}
return 0 ;
}
This code outputs the ptxas info using 384 bytes smem, that is one array for global a array, a second for dev1 method a array, and a third for dev2 method a array. Totalling 3*32*sizeof(float)=384 bytes.
When running the kernel with dynamic shared memory equals to 32*sizeof(float), the pointer to b starts right after these three arrays.
EDIT:
The ptx file generated by this code holds declarations of statically-defined shared memory,
.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];
except for the entry-point where it is defined in the body of the method
// _ZZ6kernelPiS_E1a has been demoted
The shared space of the memory is defined in the PTX documentation here:
The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.
Though with no detail on the runtime. There is a word in the programming guide here with no further detail on the mixing of the two.
During PTX compilation, the compiler may know the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use the SR_LMEMHIOFF
1 IADD32I R1, R1, -0x8;
2 S2R R0, SR_LMEMHIOFF;
3 ISETP.GE.U32.AND P0, PT, R1, R0, PT;
and calling functions in reverse order assign different values to the statically-allocated shared memory (looks very much like a form of stackalloc).
I believe the ptxas compiler calculates all the shared memory it might need in the worst case when all method may be called (when not using one of the method and using function pointers, the b address does not change, and the unallocated shared memory region is never accessed).
Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.

Using both dynamically-allocated and statically-allocated shared memory

Suppose I have two __device__ CUDA function, each having the following local variable:
__shared__ int a[123];
and another function (say it's my kernel, i.e. a __global__ function), with:
extern __shared__ int b[];
Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in the programming guide section B.2.3 on __shared__) Do the sizes all count together together towards the shared memory limit, or is it the maximum possibly in use at a single time? Or some other rule?
This can be considered a follow-up question to this one.
The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is calculated during compilation, and each declaration is an actual allocation - activating ptxas info during compilation illustrates it here:
ptxas info : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]
Here, we have 384 bytes, which is 3 arrays of 32 ints. (see sample corde below).
You may pass a pointer to shared memory since Kepler, to another function allowing a device sub-function to access another shared memory declaration.
Then, comes the dynamically allocated shared memory, which reserved size is declared during kernel call.
Here is an example of some various uses in a couple of functions. Note the pointer value of each shared memory region.
__device__ void dev1()
{
__shared__ int a[32] ;
a[threadIdx.x] = threadIdx.x ;
if (threadIdx.x == 0)
printf ("dev1 : %x\n", a) ;
}
__device__ void dev2()
{
__shared__ int a[32] ;
a[threadIdx.x] = threadIdx.x * 5 ;
if (threadIdx.x == 0)
printf ("dev2 : %x\n", a) ;
}
__global__ void kernel(int* res, int* res2)
{
__shared__ int a[32] ;
extern __shared__ int b[];
a[threadIdx.x] = 0 ;
b[threadIdx.x] = threadIdx.x * 3 ;
dev1();
__syncthreads();
dev2();
__syncthreads();
res[threadIdx.x] = a[threadIdx.x] ;
res2[threadIdx.x] = b[threadIdx.x] ;
if (threadIdx.x == 0)
printf ("global a : %x\n", a) ;
if (threadIdx.x == 0)
printf ("global b : %x\n", b) ;
}
int main()
{
int* dres ;
int* dres2 ;
cudaMalloc <> (&dres, 32*sizeof(int)) ;
cudaMalloc <> (&dres2, 32*sizeof(int)) ;
kernel<<<1,32,32*sizeof(float)>>> (dres, dres2);
int hres[32] ;
int hres2[32] ;
cudaMemcpy (hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost) ;
cudaMemcpy (hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost) ;
for (int k = 0 ; k < 32 ; ++k)
{
printf ("%d -- %d \n", hres[k], hres2[k]) ;
}
return 0 ;
}
This code outputs the ptxas info using 384 bytes smem, that is one array for global a array, a second for dev1 method a array, and a third for dev2 method a array. Totalling 3*32*sizeof(float)=384 bytes.
When running the kernel with dynamic shared memory equals to 32*sizeof(float), the pointer to b starts right after these three arrays.
EDIT:
The ptx file generated by this code holds declarations of statically-defined shared memory,
.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];
except for the entry-point where it is defined in the body of the method
// _ZZ6kernelPiS_E1a has been demoted
The shared space of the memory is defined in the PTX documentation here:
The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.
Though with no detail on the runtime. There is a word in the programming guide here with no further detail on the mixing of the two.
During PTX compilation, the compiler may know the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use the SR_LMEMHIOFF
1 IADD32I R1, R1, -0x8;
2 S2R R0, SR_LMEMHIOFF;
3 ISETP.GE.U32.AND P0, PT, R1, R0, PT;
and calling functions in reverse order assign different values to the statically-allocated shared memory (looks very much like a form of stackalloc).
I believe the ptxas compiler calculates all the shared memory it might need in the worst case when all method may be called (when not using one of the method and using function pointers, the b address does not change, and the unallocated shared memory region is never accessed).
Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.

Texture fetch slower than direct global access, chapter 7 from "Cuda by example" book

I am reading and testing the examples in the book "Cuda By example. An introduction to General Purpose GPU Programming".
When testing the examples in chapter 7, relative to texture memory, I realized that access to global memory via texture cache is much slower than direct access (My NVIDIA GPU is GeForceGTX 260, compute capability 1.3 and I am using NVDIA CUDA 4.2):
Time per frame with texture fetch (1D or 2D) for a 256*256 image: 93 ms
Time per frame not using texture (just direct global access) for 256*256: 8.5 ms
I have double checked the code several times and I have also been reading the "CUDA C Programming guide" and "CUDA C Best practices Guide" which come along with the SDK, and I do not really understand the problem.
As far as I understand, texture memory is just global memory with a specific access mechanism implementation to make it look like a cache (?). I am wondering whether coalesced access to global memory will make texture fetch slower, but I cannot be sure.
Does anybody have a similar problem?
(I found some links in NVIDIA forums for a similar problem, but the link is no longer available.)
The testing code looks this way, only including the relevant parts:
//#define TEXTURE
//#define TEXTURE2
#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif
__global__ void copy_const_kernel( float *iptr
#ifdef TEXTURE2
){
#else
,const float *cptr ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
#ifdef TEXTURE2
float c = tex1Dfetch(texConstSrc,offset);
#else
float c = cptr[offset];
#endif
if ( c != 0) iptr[offset] = c;
}
__global__ void blend_kernel( float *outSrc,
#ifdef TEXTURE
bool dstOut ) {
#else
const float *inSrc ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
int left = offset - 1;
int right = offset + 1;
if (x == 0) left++;
if (x == SXRES-1) right--;
int top = offset - SYRES;
int bottom = offset + SYRES;
if (y == 0) top += SYRES;
if (y == SYRES-1) bottom -= SYRES;
#ifdef TEXTURE
float t, l, c, r, b;
if (dstOut) {
t = tex1Dfetch(texIn,top);
l = tex1Dfetch(texIn,left);
c = tex1Dfetch(texIn,offset);
r = tex1Dfetch(texIn,right);
b = tex1Dfetch(texIn,bottom);
} else {
t = tex1Dfetch(texOut,top);
l = tex1Dfetch(texOut,left);
c = tex1Dfetch(texOut,offset);
r = tex1Dfetch(texOut,right);
b = tex1Dfetch(texOut,bottom);
}
outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
#else
outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
inSrc[bottom] + inSrc[left] + inSrc[right] -
inSrc[offset]*4);
#endif
}
// globals needed by the update routine
struct DataBlock {
unsigned char *output_bitmap;
float *dev_inSrc;
float *dev_outSrc;
float *dev_constSrc;
cudaEvent_t start, stop;
float totalTime;
float frames;
unsigned size;
unsigned char *output_host;
};
void anim_gpu( DataBlock *d, int ticks ) {
checkCudaErrors( cudaEventRecord( d->start, 0 ) );
dim3 blocks(SXRES/16,SYRES/16);
dim3 threads(16,16);
#ifdef TEXTURE
volatile bool dstOut = true;
#endif
for (int i=0; i<90; i++) {
#ifdef TEXTURE
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
#ifdef TEXTURE2
copy_const_kernel<<<blocks,threads>>>( in );
#else
copy_const_kernel<<<blocks,threads>>>( in,
d->dev_constSrc );
#endif
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
#else
copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
d->dev_constSrc );
blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
d->dev_inSrc );
swap( d->dev_inSrc, d->dev_outSrc );
#endif
}
// Some stuff for the events
// ...
}
I have been testing the results with the nvvp (NVIDIA profiler)
The result are quite curious as they show that there are a lot of texture cache misses (which are probably the cause for the bad performance).
The result from the profiler show also information that is difficult to understand even using the guide "CUPTI_User_GUide):
text_cache_hit: Number of texture cache hits (they are accounted only for one SM according to 1.3 capability).
text_cache_miss: Number of texture cache miss (they are accounted only for one SM according to 1.3 capability).
The following are the results for an example of 256*256 without using texture cache (only relevant info is shown):
Name Duration(ns) Grid_Size Block_Size
"copy_const_kernel(...) 22688 16,16,1 16,16,1
"blend_kernel(...)" 51360 16,16,1 16,16,1
Following are the results using 1D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 147392 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 841728 16,16,1 16,16,1 79 5041
Following are the results using 2D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 150880 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 872832 16,16,1 16,16,1 2971 2149
These result show several interesting info:
There are no cache hits at all for the "copy const" function (although ideally the memory is "spatially located", in the sense that each thread accesses memory which is near to the memory acceded by other near threads). I guess that this is because the threads within this function do not access memory from other threads, which seems to be the way for the texture cache to be usable (being the "spatially located" concept quite confusing)
There are some cache hits in the 1D and a lot more in the 2D case for the function "blend_kernel". I guess that it is due to the fact that within that function, any thread access memory from their neighbours threads. I cannot understand why there are more in 2D than 1d.
The duration time is greater in the texture cases than in the no texture case (nearly about one order of magnitude). Perhaps related with the so many texture cache misses.
For the "copy_const" function there are 1024 total accesses for the SM and 5120 for the "blend kernel". The relation 5:1 is correct due to the fact that there are 5 fetches in "blend" and only 1 in "copy_const". Anyway, I cannot understand where all this 1024 come from: ideally, this event "text cache miss/hot" only accounts for one SM (I have 24 in my GeForceGTX 260) and it only accounts for warps ( 32 thread size). Therefore, I have 256 threads/32=8 warps per SM and 256 blocks/24 = 10 or 11 "iterations" per SM, so I would be expecting something like 80 or 88 fetches (more over, some other event like sm_cta_launched, which is the number of thread blocks per SM, which is supposed to be supported in my 1.3 device, is always 0...)

modular arithmetic on the gpu

I am working on the GPU algorithm which is supposed to do a lot of modular computations. Particularly, various operations on matrices in a finite field which in the long run
reduce to primitive operations like: (a*b - c*d) mod m or (a*b + c) mod m where a,b,c and d are residues modulo m and m is a 32-bit prime.
Through experimentation I learned that the performance of the algorithm is mostly limited by slow modular arithmetic because integer modulo (%) and division operations are not supported on the GPU in hardware.
I appreciate if somebody can give me an idea how to realize efficient modular computations with CUDA ?
To see how this is implemented on CUDA, I use the following code snippet:
__global__ void mod_kernel(unsigned *gout, const unsigned *gin) {
unsigned tid = threadIdx.x;
unsigned a = gin[tid], b = gin[tid * 2], m = gin[tid * 3];
typedef unsigned long long u64;
__syncthreads();
unsigned r = (unsigned)(((u64)a * (u64)b) % m);
__syncthreads();
gout[tid] = r;
}
This code is not supposed to work, I just wanted to see how modular reduction is
implemented on CUDA.
When I disassemble this with cuobjdump --dump-sass (thanks njuffa for advice!), I see the following:
/*0098*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00a0*/ /*0x1c315c4350000000*/ IMUL.U32.U32.HI R5, R3, R7;
/*00a8*/ /*0x1c311c0350000000*/ IMUL.U32.U32 R4, R3, R7;
/*00b0*/ /*0xfc01dde428000000*/ MOV R7, RZ;
/*00b8*/ /*0xe001000750000000*/ CAL 0xf8;
/*00c0*/ /*0x00000007d0000000*/ BPT.DRAIN 0x0;
/*00c8*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
Note that between the two calls to bar.red.popc there is a call to 0xf8 procedure which implements some sophisticated algorithm (about 50 instructions or even more). Not surpising that mod (%) operation is slow
Some time ago I experimented a lot with modular arithmetic on the GPU. On Fermi GPUs you can use double-precision arithmetic to avoid expensive div and mod operations. For example, modular multiplication can be done as follows:
// fast truncation of double-precision to integers
#define CUMP_D2I_TRUNC (double)(3ll << 51)
// computes r = a + b subop c unsigned using extended precision
#define VADDx(r, a, b, c, subop) \
asm volatile("vadd.u32.u32.u32." subop " %0, %1, %2, %3;" : \
"=r"(r) : "r"(a) , "r"(b), "r"(c));
// computes a * b mod m; invk = (double)(1<<30) / m
__device__ __forceinline__
unsigned mul_m(unsigned a, unsigned b, volatile unsigned m,
volatile double invk) {
unsigned hi = __umulhi(a*2, b*2); // 3 flops
// 2 double instructions
double rf = __uint2double_rn(hi) * invk + CUMP_D2I_TRUNC;
unsigned r = (unsigned)__double2loint(rf);
r = a * b - r * m; // 2 flops
// can also be replaced by: VADDx(r, r, m, r, "min") // == umin(r, r + m);
if((int)r < 0)
r += m;
return r;
}
However this only works for 31-bit integer modulos (if 1 bit is not critical for you)
and you also need to precompute 'invk' beforehand. This gives absolute minimum of instructions I can achieve, ie.:
SHL.W R2, R4, 0x1;
SHL.W R8, R6, 0x1;
IMUL.U32.U32 R4, R4, R6;
IMUL.U32.U32.HI R8, R2, R8;
I2F.F64.U32 R8, R8;
DFMA R2, R2, R8, R10;
IMAD.U32.U32 R4, -R12, R2, R4;
ISETP.GE.AND P0, pt, R4, RZ, pt;
#!P0 IADD R4, R12, R4;
For description of the algorithm, you can have a look at my paper:
gpu_resultants. Other operations like (xy - zw) mod m are also explained there.
Out of curiosity, I compared the performance of the resultant algorithm
using your modular multiplication:
unsigned r = (unsigned)(((u64)a * (u64)b) % m);
against the optimized version with mul_m.
Modular arithmetic with default % operation:
low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 5755.357910 ms; mod_inv time: 0.907008 ms; interp time: 856.015015 ms; CRA time: 44.065857 ms
GPU time elapsed: 6659.405273 ms;
Modular arithmetic with mul_m:
low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 1100.124756 ms; mod_inv time: 0.192608 ms; interp time: 220.615143 ms; CRA time: 10.376352 ms
GPU time elapsed: 1334.742310 ms;
So on the average it is about 5x faster. Note also that, you might not see a speed-up if you just evaluate raw arithmetic performance using a kernel with a bunch of mul_mod operations (like saxpy example). But in real applications with control logic, synchronization barriers etc. the speed-up is very noticeable.
A high-end Fermi GPU (e.g. a GTX 580) will likely give you the best performance among shipping cards for this. You would want all 32-bit operands to be of type "unsigned int" for best performance, as there is some additional overhead for the handling of signed divisions and modulos.
The compiler generates very efficient code for division and modulo with fixed divisor As I recall it is usually around three to five machine instructions instructions on Fermi and Kepler. You can check the generated SASS (machine code) with cuobjdump --dump-sass. You might be able to use templated functions with constant divisors if you only use a few different divisors.
You should see on the order of sixteen inlined SASS instructions being generated for the unsigned 32-bit operations with variable divisor, across Fermi and Kepler. The code is limited by the throughput of integer multiplies and for Fermi-class GPUs is competitive with hardware solutions. Somewhat reduced performance is seen on currently shipping Kepler-class GPUs due to their reduced integer multiply throughput.
[Added later, after clarification of the question:]
Unsigned 64-bit division and modulo with variable divisor on the other hand are called subroutines of about 65 instructions on Fermi and Kepler. They look close to optimal. On Fermi, this is still reasonably competitive with hardware implementations (note that 64-bit integer divisions are not exactly super fast on CPUs that provide this as a built-in instruction). Below is some code that I posted to the NVIDIA forums some time back for the kind of task described in the clarification. It avoids the expensive division, but does assume that fairly large batches of operands are sharing the same divisior. It uses double-precision arithmetic, which is especially fast on Tesla-class GPUs (as opposed to consumer cards). I only did a cursory test of the code, you might want to test this more carefully before deploying it.
// Let b, p, and A[i] be integers < 2^51
// Let N be a integer on the order of 10000
// for i from 1 to N
// A[i] <-- A[i] * b mod p
/*---- kernel arguments ----*/
unsigned long long *A;
double b, p; /* convert from unsigned long long to double before passing to kernel */
double oop; /* pass precomputed 1.0/p to kernel */
/*---- code inside kernel -----*/
double a, q, h, l, rem;
const double int_cvt_magic = 6755399441055744.0; /* 2^52+2^51 */
a = (double)A[i];
/* approximate quotient and round it to the nearest integer */
q = __fma_rn (a * b, oop, int_cvt_magic);
q = q - int_cvt_magic;
/* back-multiply, representing p*q as a double-double h:l exactly */
h = p * q;
l = __fma_rn (p, q, -h);
/* remainder is double-width product a*b minus double-double h:l */
rem = __fma_rn (a, b, -h);
rem = rem - l;
/* remainder may be negative as quotient rounded; fix if necessary */
if (rem < 0.0) rem += p;
A[i] = (unsigned long long)rem;
There are tricks to efficiently perform mod operations but if only m is radix 2.
For instance, x mod y == x & (y-1), where y is 2^n. Performing bitwise operation is the fastest.
Otherwise, probably a look-up table?
Below is a link on discussion of efficient modulo implementation. You might need to implement it yourself to get the most out of it.
Efficient computation of mod