cudaThreadSynchronize & performance - cuda

A few days ago I was comparing the performance of some code of mine, which performs a very simple replace, against a Thrust implementation of the same algorithm. I found a mismatch of one order of magnitude (!) in favor of Thrust, so I started stepping through their code with my debugger to discover where the magic happens.
Surprisingly, I discovered that my very straightforward implementation was actually very similar to theirs once I got past all the functor machinery and down to the nitty-gritty. I saw that Thrust has a clever way to decide both block_size and grid_size (by the way: how exactly does that work?), so I just took their settings and executed my code again, the kernels being so similar. I gained a few microseconds, but the situation was almost unchanged. Then, in the end, for no particular reason other than "to try", I removed the cudaThreadSynchronize() after my kernel and BINGO! The gap vanished (and better): I gained a whole order of magnitude of execution time. Checking my array's values, I saw exactly what I expected, so the execution was correct.
The questions, now, are: when can I get rid of cudaThreadSynchronize (and the like)? Why does it cause such a huge overhead? I see that Thrust itself doesn't synchronize at the end (synchronize_if_enabled(const char* message) is a NOP if the macro __THRUST_SYNCHRONOUS isn't defined, and it isn't).
Details & code follow.
// my replace code
template <typename T>
__global__ void replaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    const int gridSize = blockDim.x * gridDim.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    while (index < n)
    {
        if (dev[index] == oldval)
            dev[index] = newval;
        index += gridSize;
    }
}

// replace invocation - not in main because of cpp - cu separation
template <typename T>
void callReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    replaceSimple<<<30, 768, 0>>>(dev, n, oldval, newval);
    cudaThreadSynchronize();
}

// thrust replace invocation
template <typename T>
void callReplace(thrust::device_vector<T>& dev, const T oldval, const T newval)
{
    thrust::replace(dev.begin(), dev.end(), oldval, newval);
}
Param details: arrays of n = 10,000,000 elements, all set to 2; oldval = 2, newval = 3
Time to execute callReplace (Thrust): 0.057 ms
Time to execute callReplaceSimple with sync: 0.662 ms
Time to execute callReplaceSimple without sync: 0.011 ms
I used CUDA 5.0 with Thrust included, my card is a GeForce GTX 570 and I have a quadcore Q9550 2.83 GHz with 2 GB RAM.

Kernel launches are asynchronous. If you remove the cudaThreadSynchronize() call, you only measure the kernel launch time, not the time until completion of the kernel.
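To measure the kernel itself rather than just the launch, the timer has to wait for completion. The following is a minimal event-based timing sketch (an illustration added here, not from the original post); it assumes the replaceSimple kernel from the question is defined in the same file, and the size and initialization are hypothetical:

#include <cstdio>
#include <cuda_runtime.h>

// Assumes the templated replaceSimple kernel from the question is defined in this file.

int main()
{
    const int n = 10000000;                    // same size as in the question
    int* dev = 0;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemset(dev, 0, n * sizeof(int));       // hypothetical initialization: all zeros

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    replaceSimple<<<30, 768>>>(dev, n, 0, 3);  // asynchronous: the call returns immediately
    cudaEventRecord(stop);                     // enqueued after the kernel in the same stream
    cudaEventSynchronize(stop);                // blocks the host until the kernel and event complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time between the two events
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return 0;
}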

Related

Is there proper CUDA atomicLoad function?

I've run into the issue that the CUDA atomic API does not have an atomicLoad function.
After searching on Stack Overflow, I found the following implementation of a CUDA atomicLoad.
But it looks like this function fails to work in the following example:
#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>

template <typename T>
__device__ T atomicLoad(const T* addr) {
    const volatile T* vaddr = addr;  // To bypass cache
    __threadfence();                 // for seq_cst loads. Remove for acquire semantics.
    const T value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

__global__ void initAtomic(unsigned& count, const unsigned initValue) {
    count = initValue;
}

__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
    atomicAdd(&count, 1);
    // NOTE: When the following while loop is uncommented, addVerify gets stuck;
    // it cannot read the last proper value of the count variable
    // while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
    //     printf("count = %u\n", atomicLoad(&count));
    // }
}

int main() {
    std::cout << "Hello, CUDA atomics!" << std::endl;
    const auto atomicSize = sizeof(unsigned);

    unsigned* datomic = nullptr;
    cudaMalloc(&datomic, atomicSize);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    constexpr unsigned biasAtomicValue = 11;
    initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
    addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
    cudaStreamSynchronize(stream);

    unsigned countHost = 0;
    cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
    assert(countHost == 1024 * 1024 + biasAtomicValue);

    cudaStreamDestroy(stream);
    return 0;
}
If you uncomment the section with atomicLoad, the application gets stuck ...
Maybe I missed something? Is there a proper way to load a variable that is modified atomically?
P.S.: I know a cuda::atomic implementation exists, but this API is not supported by my hardware.
Since warps work in lockstep (at least on older architectures), if you put a conditional wait on one thread and the producer on another thread, both in the same warp, the warp can get stuck in the wait if it happens to be scheduled first. Probably only the newest architectures with independent thread scheduling within a warp can handle this, so you should query the major/minor compute capability before running such code; Volta and onwards is OK.
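A minimal host-side sketch of such a check (added here as an illustration, not part of the original answer); the thresholds are the documented minimums: compute capability 7.0 (Volta) for independent thread scheduling and 3.5 for dynamic parallelism:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    const bool hasIndependentThreadScheduling = (prop.major >= 7);
    const bool hasDynamicParallelism = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);

    printf("independent thread scheduling: %s\n", hasIndependentThreadScheduling ? "yes" : "no");
    printf("dynamic parallelism:           %s\n", hasDynamicParallelism ? "yes" : "no");
    return 0;
}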
Also, you are launching a million threads and waiting on all of them at once. The GPU may not have enough execution ports/pipelines available to keep a million threads in flight; it might only work on a GPU with 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of waiting on millions of threads, just spawn sub-kernels from the main kernel when the condition occurs: dynamic parallelism is the key feature here. You should also check the minimum compute capability required for dynamic parallelism, in case someone is using an ancient NVIDIA card.
atomicAdd returns the old value at the target address. If you meant to call a third kernel only once, and only after the condition is met, then you can simply check that returned value with an "if" before starting the dynamic-parallelism launch.
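A hedged device-side sketch of that idea (hypothetical kernel and parameter names): exactly one thread, the one whose atomicAdd completes the count, performs the follow-up step, so nobody has to spin on the counter:

#include <cstdio>

// Hypothetical sketch: every thread increments the counter once; the single
// thread that observes old == target - 1 is the one that completed the count.
__global__ void addAndMaybeFinish(unsigned* count, unsigned target)
{
    unsigned old = atomicAdd(count, 1u);    // returns the value *before* the add

    if (old == target - 1u)
    {
        // Runs exactly once per launch: do the "only after the condition" work here,
        // e.g. launch a child kernel via dynamic parallelism instead of busy-waiting.
        printf("count reached %u\n", target);
    }
}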
You are also printing from up to a million threads; that is bad for performance, and it may take some time before the text appears in the console output if you have a slow CPU/RAM.
Lastly, you can optimize the performance of atomic operations by running them in shared memory first and issuing a global atomic only once per block. This loses track of the exact point at which the condition is met if there are more threads than the condition value (assuming an increment of 1 each time), so it may not be applicable to all algorithms.
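A minimal sketch of that per-block aggregation pattern (hypothetical names, assuming an increment of 1 per matching thread): each block accumulates into a shared-memory counter and only thread 0 issues a global atomic:

__global__ void countMatches(const int* data, int n, int target, unsigned* globalCount)
{
    __shared__ unsigned blockCount;
    if (threadIdx.x == 0)
        blockCount = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] == target)
        atomicAdd(&blockCount, 1u);          // cheap shared-memory atomic

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);  // one global atomic per block
}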

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas;  // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;

static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    // Do stuff with the line indicated by index i
    // ...

    // Save result into array, gets reduced to single number in setDist
    phs_deltas[i] = phs_delta;
}

static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
    // Final step; does it need to be its own kernel if it only runs once??
    d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);

    // Save some other stuff to refer to next frame
    // ...
}

void fftExec(unsigned __int32 *host_data)
{
    // Copy image to device, do FFT, etc
    // ...

    // Last parallel analysis step, sets d_phs_deltas
    getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);

    // Should this be a serial part at the end of getDists somehow?
    setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}

// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
I would recommend benchmarking both ways and choosing which is faster. On most platforms kernel launch latency is only a few microseconds, so a short running kernel to finish an action after a long running kernel can be the most efficient way to get this done.

Monitoring how thread blocks are allocated to SMs across execution time?

I am a beginner with CUDA profiling. I basically want to generate a timeline that shows each SM and the thread block that was assigned to it across execution time.
Something similar to this: [timeline figure showing per-SM thread block occupancy over time; image by Sreepathi Pai]
I have read about reading the %smid register, but I don't know how to incorporate it into the code that I want to test, or how to relate it to thread blocks or to time.
The full code is beyond the scope of this answer, so this answer provides the building blocks for you to implement a block trace.

1. Allocate a device buffer equal to number of blocks * 16 bytes. Each 16-byte record will store the start and end timestamp as well as a 5-bit smid packed into the start time. The buffer can be allocated per launch, or a larger buffer can be allocated and maintained across multiple launches.
2. Pass the pointer to the buffer either through a constant variable or as an additional kernel parameter.
3. Modify your global functions to accept the parameter and perform the code listed below. I recommend writing new global function wrappers and having the wrapper kernel call the old code. This makes it easier to handle kernels with multiple exit points.

Visualizing the data (a host-side decoding sketch is shown at the end of this answer):
- On compute capability 2.x devices the timestamp function should be clock64. This clock is not synchronized across SMs. The recommended approach is to sort the times per SM and use the lowest time per SM as the time of the kernel launch. This will only be off by hundreds of cycles from the real time, so for reasonably sized kernels this drift is negligible.
- Remove the smid from the lower 4 bits of the first 8-byte value. Clear the lower 4 bits of the end timestamp.
static __device__ inline uint32_t __smid()
{
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// use globaltimer for compute capability >= 3.0 (kepler and maxwell)
// use clock64 for compute capability 2.x (fermi)
static __device__ inline uint64_t __timestamp()
{
    uint64_t globaltime;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(globaltime));
    return globaltime;
}

__global__ void blocktime(uint64_t* pBlockTime)
{
    // START TIMESTAMP
    uint64_t startTime = __timestamp();

    // flatBlockIdx should be adjusted to 1D, 2D, and 3D launches to minimize
    // overhead. Reduce to uint32_t if launch index does not exceed 32-bit.
    uint64_t flatBlockIdx = (blockIdx.z * gridDim.x * gridDim.y)
        + (blockIdx.y * gridDim.x)
        + blockIdx.x;

    // reduce this based upon dimensions of block to minimize overhead
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
    {
        // Put the smid in the 4 lower bits. If the multiprocessor count exceeds
        // 16 then increase to 5-bits. The lower 5-bits of globaltimer are
        // junk. If using clock64 and you want to improve precision then use
        // the most significant 4-5 bits.
        uint64_t smid = __smid();
        uint64_t data = (startTime & ~0xFull) | smid;
        pBlockTime[flatBlockIdx * 2 + 0] = data;
    }

    // do work
    // I would recommend changing your current __global__ function to be
    // a __device__ function and calling it here. This will result
    // in easier handling of kernels that have multiple exit points.

    // END TIMESTAMP
    // All threads in block will write out. This is not very efficient.
    // Depending on the kernel this can be reduced to 1 thread or 1 thread per warp.
    uint64_t endTime = __timestamp();
    pBlockTime[flatBlockIdx * 2 + 1] = endTime;
}
__noinline__ __device__ uint get_smid(void)
{
    uint ret;
    asm("mov.u32 %0, %%smid;" : "=r"(ret));
    return ret;
}
Source here.
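To make the visualization step concrete, here is a hedged host-side decoding sketch (an addition with hypothetical names, not from the original answer), assuming the 16-byte-per-block record layout written by blocktime above: the smid sits in the lower 4 bits of the first 8-byte word, and the lower 4 bits of both timestamps are treated as junk and cleared.

#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copy the trace buffer back and print one line per block.
void dumpBlockTrace(const uint64_t* dBlockTime, unsigned numBlocks)
{
    std::vector<uint64_t> h(2ull * numBlocks);
    cudaMemcpy(h.data(), dBlockTime, h.size() * sizeof(uint64_t), cudaMemcpyDeviceToHost);

    for (unsigned b = 0; b < numBlocks; ++b)
    {
        uint64_t startWord = h[2 * b + 0];
        uint64_t endWord   = h[2 * b + 1];

        uint32_t smid  = static_cast<uint32_t>(startWord & 0xF); // smid packed into low 4 bits
        uint64_t start = startWord & ~0xFull;                    // clear the packed/junk bits
        uint64_t end   = endWord   & ~0xFull;

        printf("block %u: smid %u, start %llu, end %llu, duration %llu\n",
               b, smid,
               (unsigned long long)start,
               (unsigned long long)end,
               (unsigned long long)(end - start));
    }
}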

Making CUB blockradixsort on-chip entirely?

I am reading the CUB documentations and examples:
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;

    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...

    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}
In the example, each thread has 4 keys. It looks like 'thread_keys' will be allocated in local memory, which physically resides in off-chip global memory. If I only have 1 key per thread, could I declare "int thread_key;" and keep this variable in a register only?
BlockRadixSort(temp_storage).Sort() takes a pointer to the key as a parameter. Does that mean the keys have to be in global memory?
I would like to use this code but I want each thread to hold one key in register and keep it on-chip in register/shared memory after they are sorted.
Thanks in advance!
You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without de-constructing the BlockRadixSort object.
Here's an example code that uses shared memory to hold the initial data to be sorted, and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];

    using namespace cub;

    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;

    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD] = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();
}
This seems to work correctly for me; in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
The strange/ugly static casting is needed, as far as I can tell, because the CUB Sort expects a reference to an array of length equal to the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD); a register-based single-key variant is sketched after this signature:
__device__ __forceinline__ void Sort(
    Key (&keys)[ITEMS_PER_THREAD],
    int begin_bit = 0,
    int end_bit   = sizeof(Key) * 8)
{ ...
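For the single-key-per-thread case the question asks about, here is a hedged sketch of my own (not from the original answer): hold the key in a length-1 local array, which the compiler can keep in registers because all indexing is compile-time constant, and pass that array to Sort directly, with no casting. The collective still needs the shared-memory TempStorage to exchange keys between threads, so it is "register plus shared memory", not registers alone:

#include <cub/cub.cuh>
#include <cstdio>

#define nTPB 32

__global__ void BlockSortRegKernel()
{
    // One key per thread, held in a length-1 "array" that can live in registers.
    int thread_keys[1];
    thread_keys[0] = (threadIdx.x + 5) % nTPB;   // synthetic data, as in the answer above

    typedef cub::BlockRadixSort<int, nTPB, 1> my_block_sort;
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // Sort expects a reference to an array of ITEMS_PER_THREAD keys; a plain
    // local array of matching length binds directly, so no casting is needed.
    my_block_sort(sort_temp_stg).Sort(thread_keys);

    printf("thread %d sorted key = %d\n", threadIdx.x, thread_keys[0]);
}

int main()
{
    BlockSortRegKernel<<<1, nTPB>>>();
    cudaDeviceSynchronize();
    return 0;
}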

Keeping unused variables in CUDA

I made some kernels for testing bandwidth and they do no useful computations. A minimal example is
__global__ void testKernel(float* a)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x;
    x = a[i];
}
When I compile, I get (not surprisingly)
warning: variable "x" was set but never used
and the kernel runs as quickly as an empty kernel:
__global__ void donothing()
{
}
This indicates that the read of a[i] has been optimized out.
I have tried tricks such as
volatile float x;
if(x);
(void)(x);
and they suppress the warning, but the kernel still finishes too quickly.
How can I make sure that the useless instructions actually get executed?
I found the option CU_JIT_OPTIMIZATION_LEVEL, but Google mostly turns up links to the documentation rather than examples of how to use it. Would this option help me, and how do I use it?
Try introducing a branch which stores the variable:
__global__ void testKernel(float* a, float *b)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x;
    x = a[i];
    if(b)
    {
        *b = x;
    }
}
The cost of the branch compared to the cost of memory transfer is negligible.
At the kernel launch site, simply pass a null pointer:
testKernel<<<...>>>(a, static_cast<float*>(0));
nvcc will not perform constant folding at this granularity, so your load should not be removed because the compiler cannot prove it is useless.
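For completeness, a hedged host-side driver sketch (an addition with hypothetical sizes, not from the original answer) showing how the null-pointer launch might be wrapped in an event-timed bandwidth measurement; it assumes the two-argument testKernel from this answer is defined in the same file, and it synchronizes on the stop event before reading the timer, as discussed in the first answer above:

#include <cstdio>
#include <cuda_runtime.h>

// Assumes the two-argument testKernel from the answer above is defined in this file.

int main()
{
    const unsigned int n = 1 << 24;              // hypothetical size: 16M floats (64 MB)
    const unsigned int threads = 512;
    float* a = 0;
    cudaMalloc(&a, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    testKernel<<<n / threads, threads>>>(a, static_cast<float*>(0));
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait for the kernel to finish before timing

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("read bandwidth: %.1f GB/s\n", (n * sizeof(float)) / (ms * 1.0e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    return 0;
}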