Low memory copy throughput Host to Device - cuda

I have a vector of vectors vector<vector<double>> data.
I want to copy only the information contained in that "2D matrix" as there are no vectors in CUDA.
So the first approach I used was
vector<vector<double>> *values;
vector<vector<double>>::iterator it;
double *d_values;
double *dst;
checkCudaErr(
cudaMalloc((void**)&d_values, sizeof(double)*M*N)
);
dst = d_values;
for (it = values->begin(); it != values->end(); ++it){
double *src = &((*it)[0]);
size_t s = it->size();
checkCudaErr(
cudaMemcpy(dst, src, sizeof(double)*s, cudaMemcpyHostToDevice)
);
dst += s;
}
After profiling with NVVP I got a very low cudaMemcpy throughput. I think this makes sense, as I'm sending a very small number of
bytes in each cudaMemcpy call.
So I decided to change the code a little to try to improve this; the second approach is
double *h_values = new double[M*N];
dst = h_values;
for (it = values->begin(); it != values->end(); ++it){
double *src = &((*it)[0]);
size_t s = it->size();
memcpy(dst, src, sizeof(double)*s);
dst += s;
}
checkCudaErr(
cudaMemcpy(d_values, h_values, sizeof(double)*M*N, cudaMemcpyHostToDevice)
);
the result after profiling is still a low memcpy throughput.
So, my question is, how can I improve the copies from host to device?
I'm using a Quadro K4000. I'm getting 25 MB/s in the first case and about 2 GB/s in the second. M = 5 and N = 2000000. That value of M is typical, but sometimes it can be as high as 50.

One reason for your slow throughput can be that you allocate your double matrix with new. That memory is not page-locked. You can either use a system function (I don't know which OS you are on) or the CUDA function that provides this functionality: cudaMallocHost.
Just remove the = new double[M*N] and allocate h_values with cudaMallocHost(&h_values, sizeof(double)*M*N) (and of course don't delete[] it, but free it with cudaFreeHost).
By the way, the theoretical top speed is 8 GB/s (PCIe 2.0, 16 lanes); in practice you will stay below it (around 6 GB/s).
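Putting this together with the second approach, a minimal sketch (assuming M, N, values, and checkCudaErr are defined as in the question):
double *h_values = NULL;   // pinned (page-locked) staging buffer
double *d_values = NULL;
checkCudaErr(cudaMallocHost((void**)&h_values, sizeof(double)*M*N));
checkCudaErr(cudaMalloc((void**)&d_values, sizeof(double)*M*N));
double *dst = h_values;
for (vector<vector<double>>::iterator it = values->begin(); it != values->end(); ++it){
    memcpy(dst, &((*it)[0]), sizeof(double)*it->size());
    dst += it->size();
}
// one large transfer from pinned memory; this is where the throughput improves
checkCudaErr(cudaMemcpy(d_values, h_values, sizeof(double)*M*N, cudaMemcpyHostToDevice));
// ... later: cudaFreeHost(h_values) instead of delete[], and cudaFree(d_values)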

Related

prefix sum using CUDA

I am having trouble understanding a CUDA code for a naive prefix sum.
The code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In Example 39-1 (naive scan), we have code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
extern __shared__ float temp[]; // allocated on invocation
int thid = threadIdx.x;
int pout = 0, pin = 1;
// Load input into shared memory.
// This is exclusive scan, so shift right by one
// and set first element to 0
temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
__syncthreads();
for (int offset = 1; offset < n; offset *= 2)
{
pout = 1 - pout; // swap double buffer indices
pin = 1 - pout;
if (thid >= offset)
temp[pout*n+thid] += temp[pin*n+thid - offset];
else
temp[pout*n+thid] = temp[pin*n+thid];
__syncthreads();
}
g_odata[thid] = temp[pout*n+thid]; // write output
}
My questions are
Why do we need to create a shared-memory temp?
Why do we need the "pout" and "pin" variables? What do they do? Since we only use one block and at most 1024 threads here, can't we just use threadIdx.x to specify the element in the block?
In CUDA, do we use one thread per add operation? Is it as if each thread does what a single iteration of a for loop would do (like looping over threads or processors in OpenMP, with one thread per array element)?
My previous two questions may seem naive... I think the key is that I don't understand the relation between the above implementation and the following pseudocode:
for d = 1 to log2 n do
  for all k in parallel do
    if k >= 2^d then
      x[k] = x[k - 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1 - It's faster to put data in shared memory and do the calculations there rather than operating on global memory. It's important to sync the threads after loading shared memory, hence the __syncthreads().
2 - These variables toggle between the two halves of the double buffer on each iteration of the loop:
temp[pout*n+thid] += temp[pin*n+thid - offset];
First iteration: pout = 1 and pin = 0. Second iteration: pout = 0 and pin = 1.
The output is offset by n on odd iterations and the input is offset by n on even iterations. To come back to your question, you can't achieve the same thing with threadIdx.x alone because it doesn't change within the loop.
3 & 4 - CUDA executes threads to run the kernel. Meaning that each thread runs that code separately. If you look at the pseudo code and compare with the CUDA code you already parallelized the outer loop with CUDA. So each thread would run the loop in the kernel until the end of loop and would wait each thread to finish before writing to the Global Memory.
Hope it helps.
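For reference, a minimal sketch of how such a kernel is launched (the variable names here are illustrative, not from the chapter): the third launch parameter reserves the dynamically allocated shared memory, which for this double-buffered scan must hold 2*n floats, and the whole array must fit in a single block.
// host-side sketch: one block of n threads, 2*n floats of dynamic shared memory
int n = 1024;                                  // n must not exceed the block size limit
float *d_idata, *d_odata;
cudaMalloc(&d_idata, n * sizeof(float));
cudaMalloc(&d_odata, n * sizeof(float));
// ... copy the input into d_idata ...
scan<<<1, n, 2 * n * sizeof(float)>>>(d_odata, d_idata, n);
cudaDeviceSynchronize();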

CUDA Transfer Timing using events vs windows

I'm transferring 48 KB data blocks up to the device (with pinned memory), and although CUDA events see them go up at 5 GB/s, by the time we get back to Windows we only see half that speed. Is this just unavoidable driver overhead, or are there ways to mitigate it? I've encapsulated the process in the test program below.
void transferUp(size_t size)
{
StopWatchWin timer;
timer.start();
float tUpCopyStart,tUpCopyStop;
cudaEvent_t sendUpStopEvent,sendUpStartEvent;
checkCudaErrors(cudaEventCreate( &sendUpStartEvent ));
checkCudaErrors(cudaEventCreate( &sendUpStopEvent ));
unsigned *cpu_sending;
checkCudaErrors(cudaHostAlloc(&cpu_sending, size*sizeof(unsigned), cudaHostAllocPortable));
unsigned *gpu_receiving;
checkCudaErrors(cudaMalloc(&gpu_receiving, size*sizeof(unsigned)));
tUpCopyStart = timer.getTime();
checkCudaErrors(cudaEventRecord(sendUpStartEvent));
checkCudaErrors(cudaMemcpyAsync(gpu_receiving, cpu_sending, size*sizeof(unsigned), cudaMemcpyHostToDevice));
checkCudaErrors(cudaEventRecord(sendUpStopEvent));
checkCudaErrors(cudaEventSynchronize(sendUpStopEvent));
tUpCopyStop = timer.getTime();
double sendTimeWindows = tUpCopyStop - tUpCopyStart;
float sendTimeCuda;
checkCudaErrors(cudaEventElapsedTime( &sendTimeCuda,sendUpStartEvent,sendUpStopEvent));
float GbSec_cuda = (size*sizeof(unsigned)/1000)/(sendTimeCuda*1000);
float GbSec_win = (size*sizeof(unsigned)/1000)/(sendTimeWindows*1000);
printf("size=%06d bytes eventTime=%.03fms windowsTime=%0.3fms cudaSpeed=%.01f gb/s winSpeed=%.01f gb/s\n",
size*sizeof(unsigned),sendTimeCuda,sendTimeWindows,GbSec_cuda,GbSec_win);
checkCudaErrors(cudaEventDestroy( sendUpStartEvent ));
checkCudaErrors(cudaEventDestroy( sendUpStopEvent ));
checkCudaErrors(cudaFreeHost(cpu_sending));
checkCudaErrors(cudaFree(gpu_receiving));
}
The overhead of timing this small operation is overwhelming the measurement.
For small host->device copies (e.g., 64K or smaller), the CUDA driver will inline the data into the command buffer, so even the purportedly-synchronous memcpy calls are actually done asynchronously. But, the cudaEventSynchronize() call in your code forces the CPU to wait instead of continuing execution.
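One way to see this is to amortize the fixed per-call and synchronization overhead by timing many copies back to back instead of a single 48 KB transfer; here is a sketch reusing the buffers and events from the question (the iteration count is an assumption for illustration):
const int iterations = 1000;   // assumed repeat count
checkCudaErrors(cudaEventRecord(sendUpStartEvent));
for (int i = 0; i < iterations; ++i) {
    checkCudaErrors(cudaMemcpyAsync(gpu_receiving, cpu_sending, size*sizeof(unsigned), cudaMemcpyHostToDevice));
}
checkCudaErrors(cudaEventRecord(sendUpStopEvent));
checkCudaErrors(cudaEventSynchronize(sendUpStopEvent));
// divide the elapsed time by 'iterations': the launch and synchronization
// overhead is now paid once instead of once per measured copy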

Monitoring how thread blocks are allocated to SMs across execution time?

I am a beginner with CUDA profiling. I basically want to generate a timeline that shows each SM and the thread blocks that were assigned to it across execution time.
Something similar to this (a per-SM timeline of thread blocks; figure by Sreepathi Pai).
I have read about reading the %smid register, but I don't know how to incorporate it into the code that I want to test, or how to relate it to thread blocks or time.
The full code is beyond the scope of this answer, so this answer provides the building blocks for you to implement a block trace.
1. Allocate a device buffer equal to number of blocks * 16 bytes. Each 16-byte record will store the start and end timestamp, with the smid packed into the low bits of the start time. The buffer can be allocated per launch, or a larger buffer can be allocated and maintained for multiple launches.
2. Pass the pointer to the buffer either through a constant variable or as an additional kernel parameter.
3. Modify your global functions to accept the parameter and perform the code listed below. I recommend writing new global function wrappers and having the wrapper kernel call the old code. This makes it easier to handle kernels with multiple exit points.
Visualizing Data
- Extract the smid from the lower 4 bits of the first 8-byte value, then clear those bits. Clear the lower 4 bits of the end timestamp as well.
- On compute capability 2.x devices the timestamp function should be clock64. This clock is not synchronized across SMs. The recommended approach is to sort the times per SM and use the lowest time per SM as the time of the kernel launch. This will only be off by hundreds of cycles from the real time, so for reasonably sized kernels this drift is negligible.
static __device__ inline uint32_t __smid()
{
uint32_t smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;
}
// use globaltimer for compute capability >= 3.0 (kepler and maxwell)
// use clock64 for compute capability 2.x (fermi)
static __device__ inline uint64_t __timestamp()
{
uint64_t globaltime;
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(globaltime) );
return globaltime;
}
__global__ void blocktime(uint64_t* pBlockTime)
{
// START TIMESTAMP
uint64_t startTime = __timestamp();
// flatBlockIdx should be adjusted to 1D, 2D, and 3D launches to minimize
// overhead. Reduce to uint32_t if launch index does not exceed 32-bit.
uint64_t flatBlockIdx = (blockIdx.z * gridDim.x * gridDim.y)
+ (blockIdx.y * gridDim.x)
+ blockIdx.x;
// reduce this based upon dimensions of block to minimize overhead
if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
{
// Put the smid in the 4 lower bits. If the MultiprocessCounter exceeds
// 16 then increase to 5-bits. The lower 5-bits of globaltimer are
// junk. If using clock64 and you want the improve precision then use
// the most significant 4-5 bits.
uint64_t smid = __smid();
uint64_t data = (startTime & ~uint64_t(0xF)) | smid;
pBlockTime[flatBlockIdx * 2 + 0] = data;
}
// do work
// I would recommend changing your current __global__ function to be
// a __device__ function and calling it here from a __global__ wrapper. This will
// result in easier handling of kernels that have multiple exit points.
// END TIMESTAMP
// All threads in block will write out. This is not very efficient.
// Depending on the kernel this can be reduced to 1 thread or 1 thread per warp.
uint64_t endTime = __timestamp();
pBlockTime[flatBlockIdx * 2 + 1] = endTime;
}
__noinline__ __device__ uint get_smid(void)
{
uint ret;
asm("mov.u32 %0, %smid;" : "=r"(ret) );
return ret;
}
Source here.
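To give an idea of the host side (this is a sketch of mine, not part of the answer above; the launch configuration and names are illustrative), you would allocate the buffer, launch the instrumented kernel, copy the records back, and unpack them:
#include <cstdint>
#include <vector>
// assumes the blocktime kernel from above is defined in the same .cu file
int main()
{
    dim3 grid(64), block(256);                       // example launch configuration
    int numBlocks = grid.x * grid.y * grid.z;
    uint64_t *pBlockTime = NULL;
    cudaMalloc(&pBlockTime, numBlocks * 2 * sizeof(uint64_t));
    blocktime<<<grid, block>>>(pBlockTime);
    cudaDeviceSynchronize();
    std::vector<uint64_t> h(numBlocks * 2);
    cudaMemcpy(h.data(), pBlockTime, h.size() * sizeof(uint64_t), cudaMemcpyDeviceToHost);
    for (int b = 0; b < numBlocks; ++b) {
        uint64_t smid  = h[2 * b] & 0xF;             // smid packed into the low 4 bits
        uint64_t start = h[2 * b] & ~uint64_t(0xF);  // clear the packed bits
        uint64_t end   = h[2 * b + 1] & ~uint64_t(0xF);
        // group (start, end) by smid, subtract the per-SM minimum start time,
        // and plot block b on row 'smid' from 'start' to 'end'
        (void)smid; (void)start; (void)end;
    }
    cudaFree(pBlockTime);
    return 0;
}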

Analyzing memory access coalescing of my CUDA kernel

I would like to read (BS_X+1)*(BS_Y+1) global memory locations with BS_X*BS_Y threads, moving the contents to shared memory, and I have developed the following code.
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))
Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];
In my understanding, coalescing is the parallel equivalent of consecutive memory reads in sequential processing. How can I now detect whether the global memory accesses are coalesced? I note that there is an index jump from (i1,j1) to (i2,j2).
Thanks in advance.
I've evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows that the code does not exploit coalescing very well. Here is the coalescing analyzer, which you may find useful:
#include <stdio.h>
#include <malloc.h>
typedef struct dim3_t{
int x;
int y;
} dim3;
// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16
// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size) (addr*size)&(~(CACHEBLOCKSIZE-1))
int main(){
// fixed dim3 variables
// grid and block size
dim3 blockDim,gridDim;
blockDim.x=BLOCKDIMX;
blockDim.y=BLOCKDIMY;
gridDim.x=GRIDDIMX;
gridDim.y=GRIDDIMY;
// counters
int unq_accesses=0;
int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
int total_unq_accesses=0;
// iter over total number of threads
// and count the number of memory requests (the coalesced requests)
int I, II, III;
for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
dim3 blockIdx;
blockIdx.x = I%GRIDDIMX;
blockIdx.y = I/GRIDDIMX;
for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
if(II%COALESCINGWIDTH==0){
// new coalescing bunch
total_unq_accesses+=unq_accesses;
unq_accesses=0;
}
dim3 threadIdx;
threadIdx.x=II%BLOCKDIMX;
threadIdx.y=II/BLOCKDIMX;
////////////////////////////////////////////////////////
// Change this section to evaluate different accesses //
////////////////////////////////////////////////////////
// do your indexing here
#define BLOCK_SIZE_X BLOCKDIMX
#define BLOCK_SIZE_Y BLOCKDIMY
#define xdim 32
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
// calculate the accessed location and offset here
// change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
int size = sizeof(double);
//////////////////////////
// End of modifications //
//////////////////////////
printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
// check whether it can be merged with existing requests or not
short merged=0;
for(III=0; III<unq_accesses; III++){
if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
merged=1;
break;
}
}
if(!merged){
// new cache block accessed over this coalescing width
unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
unq_accesses++;
}
}
}
printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}
The code iterates over every thread of the grid and calculates the number of merged requests, a metric of memory access coalescing.
To use the analyzer, paste the index calculation portion of your code into the specified region and decompose the memory accesses (array) into 'address' and 'size'. I've already done this for your code, where the index calculations are:
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
and the memory access is:
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
The analyzer reports that 4096 threads access 4064 cache blocks. Run the code for your actual grid and block size and analyze the coalescing behavior.
As GPUs have evolved, the requirements for getting coalesced accesses have become less restrictive. Your description of coalesced accesses is more accurate for the earlier GPU architectures than the more recent ones. In particular, Fermi (compute capability 2.0) significantly loosened the requirements. On Fermi and later, it is not important to access the memory locations consecutively. Instead, focus has shifted to accessing memory with as few memory transactions as possible. On Fermi, global memory transactions are 128 bytes wide. So, when the 32 threads in a warp hit an instruction that performs a load or store, 128-byte transactions will be scheduled to service all the threads in the warp. Performance then depends on how many transactions are necessary. If all the threads access values within a 128-byte area that is aligned to 128 bytes, a single transaction is necessary. If all the threads access values in different 128-byte areas, 32 transactions will be necessary. That would be the worst case scenario for servicing the requests for a single instruction in a warp.
You can use one of the CUDA profilers to determine the average number of transactions required to service the requests. The number should be as close to 1 as possible. Higher numbers mean that you should look for opportunities to optimize the memory accesses in your kernel.
The visual profiler is a great tool for checking your work. After you have a piece of code functionally correct, run it from within the visual profiler. On Linux, for example, assuming you have an X session, just run nvvp from a terminal window. You will then be given a wizard which will prompt you for the application to profile along with any command line parameters.
The profiler will then do a basic run of your app to collect statistics. You can also select more advanced statistics gathering (requiring additional runs), and one of these will be memory utilization statistics. It will report memory utilization as a percentage of peak and will also flag warnings for what it considers to be low utilization that merits your attention.
If you have a utilization number above 50%, your app is probably running the way you expect. If you have a low number, you have probably missed some coalescing details. It reports statistics separately for memory reads and memory writes. To get 100%, or close to it, you will also need to make sure that your coalesced reads and writes from the warp are aligned on 128-byte boundaries.
A common mistake in these situations is to use a threadIdx.y-based variable as the most rapidly changing index. It doesn't seem to me that you've made that error. For example, it's a common mistake to do shared[threadIdx.x][threadIdx.y], since this is often the way we think about it in C. But threads are grouped together first along the x axis, so we want to use shared[threadIdx.y][threadIdx.x] or something similar. If you do make this mistake, your code can still be functionally correct, but you will get low utilization percentages in the profiler, like around 12% or even 3%.
And as already stated, to get above 50% and close to 100%, you will want to make sure that not only are all your thread requests adjacent, but that they are aligned on a 128-byte boundary. Due to the L1/L2 caches, these aren't hard-and-fast rules but guidelines; the caches may mitigate some mistakes to some degree.
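To make that point concrete, here is a small sketch (TILE, loadTile, d_in, and width are made-up names, not from the question): with row-major data, threadIdx.x should index the contiguous dimension so that the 32 threads of a warp touch as few 128-byte segments as possible.
#define TILE 16
__global__ void loadTile(const double *d_in, int width)
{
    __shared__ double tile[TILE][TILE];
    int col = blockIdx.x * TILE + threadIdx.x;   // threadIdx.x varies fastest within a warp
    int row = blockIdx.y * TILE + threadIdx.y;
    // Coalesced: adjacent threads read adjacent doubles of the same row.
    tile[threadIdx.y][threadIdx.x] = d_in[row * width + col];
    // Uncoalesced (the common mistake): adjacent threads would read
    // addresses 'width' doubles apart.
    // tile[threadIdx.x][threadIdx.y] = d_in[col * width + row];
    __syncthreads();
    // ... use the tile ...
}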

cuda register pressure

I have a kernel that does a linear least squares fit. It turns out the threads are using too many registers, and therefore the occupancy is low. Here is the kernel:
__global__
void strainAxialKernel(
float* d_dis,
float* d_str
){
int i = threadIdx.x;
float a = 0;
float c = 0;
float e = 0;
float f = 0;
int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
int j;
__shared__ float dis[WINDOW_PER_LINE];
__shared__ float str[WINDOW_PER_LINE];
// fetch data from global memory
dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
__syncthreads();
// least square fit
for (j=-shift; j<NEIGHBOURS-shift; j++)
{
a += j;
c += j*j;
e += dis[i+j];
f += (float(j))*dis[i+j];
}
str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;
// compensate attenuation
if (COMPEN_EXP>0 && COMPEN_BASE>0)
{
str[i]
= (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
}
// write back to global memory
if (!SIGN_PRESERVE && str[i]<0)
{
d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];
}
else
{
d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];
}
}
I have 32x404 blocks with 96 threads in each block. On a GTS 250, each SM should be able to handle 8 blocks. Yet the visual profiler shows I have 11 registers per thread and, as a result, occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so registers are the problem.
The performance is not the end of the world. I am just curious whether there is any way I can get around this. Thanks.
There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There's no way to "get around" that trade-off. If you reduce register usage by using global memory, you should get higher occupancy but slower memory access.
That said, here are some ideas to use fewer registers:
Can shift be precomputed and stored in constant memory? Then each thread just needs to look up shift[i].
Do a and c have to be floats?
Or, can a and c be removed from the loop and computed once? And thus removed completely?
a is a simple arithmetic series over j = -shift, ..., NEIGHBOURS-shift-1 (NEIGHBOURS terms), so it reduces to
a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2
so instead, do something like the following (you can probably reduce these expressions further; closed forms for both a and c are sketched below):
str[i] = AMP*((NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2)*e - NEIGHBOURS*f);
str[i] /= ((NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2)*(NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2) - NEIGHBOURS*c);
str[i] /= (float)BLOCK_SPACING;
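For completeness, both sums have closed forms over the kernel's loop range j = -shift, ..., NEIGHBOURS-shift-1; here is a sketch (the helper names sum_j and sum_j2 are mine):
// a = sum of j and c = sum of j*j over j = -shift ... NEIGHBOURS-shift-1 (N terms),
// derived from sum_{k=0..N-1} k = N(N-1)/2 and sum_{k=0..N-1} k^2 = N(N-1)(2N-1)/6
// via the substitution j = k - shift.
__device__ __forceinline__ float sum_j(int N, int shift)
{
    return 0.5f * N * (N - 1 - 2 * shift);
}
__device__ __forceinline__ float sum_j2(int N, int shift)
{
    return N * (N - 1) * (2 * N - 1) / 6.0f
         - (float)shift * N * (N - 1)
         + (float)N * shift * shift;
}
Since shift depends only on the thread index, these values could also be tabulated in constant memory, in line with the first suggestion above.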
Occupancy is NOT a problem.
The SM in the GTS 250 (compute capability 1.1) may be able to hold 8 blocks (8x96 threads) simultaneously in its registers, but it only has 8 execution units, meaning that only 8 out of 8x96 (or, in your case, 5x96) threads would be advancing at any given moment in time. There's very little value in trying to squeeze more blocks onto the already overloaded SM.
In fact, you could try playing with the -maxrregcount option to INCREASE the number of registers; that could have a positive effect on performance.
You can use launch bounds to instruct the compiler to generate a register mapping for a maximum number of threads and a minimum number of blocks per multiprocessor. This can reduce register counts so that you can achieve the desired occupancy.
For your case, Nvidia's occupancy calculator shows a theoretical peak occupancy of 63%, which seems to be what you're achieving. This is due to your register count, as you mention, but it is also due to the number of threads per block. Increasing the number of threads per block to 128 and decreasing the register count to 10 yields 100% theoretical peak occupancy.
To control the launch bounds for your kernel:
__global__ void
__launch_bounds__(128, 6) // maxThreadsPerBlock = 128, minBlocksPerMultiprocessor = 6
MyKernel(...)
{
...
}
Then just launch with a block size of 128 threads and enjoy your occupancy. The compiler should generate your kernel such that it uses 10 or fewer registers.
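If you want to verify the effect, compiling with nvcc -Xptxas -v prints the number of registers (and the amount of shared memory) each kernel uses, so you can confirm that the count actually dropped to 10 or fewer.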