I wrote a simple kernel to test the functionality of CUDA __syncthreads. In the kernel, each thread prints an error message if an updated value is not visible to it. Ideally no thread should print the "Not visible to me" error message, but some threads end up printing it.
Here is the kernel.
__device__ int a = 0;

__global__ void kernel()
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        atomicAdd(&a, 1);
        __threadfence();
    }
    __syncthreads();
    if (atomicAdd(&a, 0) == 0)   // atomic read of a
    {
        cuPrintf("Not visible to me\n");
    }
}
int main()
{
    cudaPrintfInit();
    kernel<<<16,16>>>();
    cudaDeviceSynchronize();   // let the kernel finish before draining the printf buffer
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
    return 0;
}
Please help me with this; it is a very simple test program but it still does not work. Do I need to set some compiler flags?
__syncthreads() is a synchronization barrier primitive that only synchronizes the threads within a single block.
In your kernel, threads in blocks other than block 0 can therefore read a before block 0 has performed the increment; within a single kernel launch, CUDA has no general mechanism for safely synchronizing across thread blocks.
Communication and synchronization between thread blocks is not recommended because it breaks the scalability of execution across GPUs with varying numbers of multiprocessors, which is the reason for having thread blocks in the first place.
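If you need the write to be visible to every block, the usual pattern is to split the work into two kernel launches, since a kernel launch boundary acts as a grid-wide synchronization point. A minimal sketch (my illustration, not from the question; it uses device-side printf rather than cuPrintf):

__device__ int a = 0;

__global__ void writer()
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        atomicAdd(&a, 1);
}

__global__ void reader()
{
    // every thread in every block now sees the updated value
    if (atomicAdd(&a, 0) == 0)
        printf("Not visible to me\n");   // should never fire
}

// host side:
// writer<<<16, 16>>>();
// reader<<<16, 16>>>();   // the launch boundary orders the two kernels
// cudaDeviceSynchronize();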
Related
Shared memory is not automatically synchronized between the threads of a block. But I don't know whether a shared memory write is guaranteed to be visible to the thread that performed it.
For example, in this example:
__global__ void kernel()
{
    __shared__ int i, j;
    if (threadIdx.x == 0)
    {
        i = 10;
        j = i;
    }
    // #1
}
Is it guaranteed at #1 that, for thread 0, i=10 and j=10, or do I need some memory fence or introduce a local variable?
I'm going to assume that by "for thread 0" you mean "the thread that passed the if-test". And for the sake of this discussion, I will assume there is only one such thread.
Yes, it's guaranteed. Otherwise basic C++ compliance would be broken in CUDA.
Challenges in CUDA arise with inter-thread communication and visibility, but your question does not involve that.
As an example, it is certainly not guaranteed that for some other thread, i will be visible as 10, without some sort of fence or barrier.
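For illustration, here is a minimal sketch (my example, not from the question) of the case where a barrier does matter, namely visibility to a different thread:

__global__ void kernel()
{
    __shared__ int i;
    if (threadIdx.x == 0)
        i = 10;        // guaranteed for thread 0 itself, per the above
    __syncthreads();   // barrier needed before other threads rely on i
    if (threadIdx.x == 1)
        printf("thread 1 sees i = %d\n", i);   // reliably prints 10
}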
Here is a demo. The kernel does not overlap with the preceding cudaMemcpyAsync, even though they are issued to different streams.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void warmUp(){
    int Id = blockIdx.x*blockDim.x + threadIdx.x;
    if(Id == 0){
        printf("warm up!\n");
    }
}

__global__ void kernel(){
    int Id = blockIdx.x*blockDim.x + threadIdx.x;
    if(Id == 0){
        long long x = 0;
        for(int i = 0; i < 1000000; i++){
            x += i >> 1;
        }
        printf("kernel!%lld\n", x);
    }
}

int main(){
    //warmUp<<<1,32>>>();
    int *data, *data_dev;
    int dataSize = pow(10, 7);
    cudaMallocHost(&data, dataSize*sizeof(int));
    cudaMalloc(&data_dev, dataSize*sizeof(int));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();
}
The Visual Profiler shows that the copy and the kernel do not overlap.
After some attempts, I found out that this is due to it being the first kernel call.
Uncomment warmUp<<<1,32>>>();, and the Visual Profiler shows the two operations overlapping!
Why?
CUDA uses lazy initialization. Because of this, the first time you do a particular operation or a particular operation type, it's possible that the behavior will not be as you expect.
The operation will/should work "correctly", but performance measurements may not be as you expect.
Contrary to the linked article, there really is no specified formula to force the lazy initialization to complete, without performing the actual work you intend to do.
If the only thing you ever intend to do with your application is launch a single kernel, then having that kernel overlap with a previous copy operation doesn't seem to make a lot of sense to me. In any event, you should expect that device initialization is necessary before all operations will proceed at expected speeds or in expected ways.
Lazy initialization behavior may vary based on CUDA version, platform (e.g. OS) and GPU type.
Additionally, kernel launches are asynchronous. So this particular coding pattern:
int main(){
...
kernel<<<1, 32, 0, stream2>>>();
}
is generally not recommended in CUDA, and specifically is not recommended when using a profiler. Your code should provide the opportunity for all issued work to complete properly, in order for the profiler to provide useful results. You should provide a cudaDeviceSynchronize() or similar operation at the end of your code, if you want to profile it, for this type of pattern.
I also don't recommend doing performance analysis on kernels that issue printf calls. The printf call imposes additional host/device synchronization behavior/needs, and this can be confusing; it's not easy to predict its performance impact.
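Putting those recommendations together, a sketch of an adjusted main() (reusing the warmUp and kernel definitions from the question) might look like this:

int main(){
    warmUp<<<1,32>>>();            // absorb the lazy-initialization cost
    cudaDeviceSynchronize();       // let initialization and the warm-up finish

    int *data, *data_dev;
    int dataSize = pow(10, 7);
    cudaMallocHost(&data, dataSize*sizeof(int));
    cudaMalloc(&data_dev, dataSize*sizeof(int));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();   // now free to overlap with the copy

    cudaDeviceSynchronize();       // let all issued work complete before exit
}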
I understand that the instructions inside a kernel are executed by all the threads. Let us consider the following case:
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];
    int t = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}
Basically, this code will run as threads on different cores.
Now there is a shared memory declaration. Since all the threads encounter it, will it be allocated by every thread? (Logically not.) But I am sure at least one thread must allocate it, and I want to know which thread does it.
Kindly help me understand where my understanding is wrong.
No threads are responsible for this allocation - the threads do not run any SASS code that is involved in allocating this memory.
The same statement is true if you use dynamic (extern) shared allocation - no threads are responsible - meaning the threads do not run any SASS code that is involved in allocating this memory. There are no function calls or other mechanisms involved.
The memory is already allocated, and a pointer to it is already established, by the time the thread SASS code (i.e. the kernel) begins executing.
There is a wrinkle to be aware of. If the shared memory declaration involves a constructor, then the constructor will be run on all threads. This can be confusing behavior.
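To make the two declaration styles concrete, here is a minimal sketch (illustrative kernels, not from the question); in neither case does any thread execute allocation code:

// Static: size fixed at compile time; storage is reserved per block
// before the kernel's SASS code begins executing.
__global__ void staticKernel(int *d)
{
    __shared__ int s[64];
    s[threadIdx.x] = d[threadIdx.x];
    __syncthreads();
    d[threadIdx.x] = s[63 - threadIdx.x];
}

// Dynamic: size supplied as the third launch-configuration argument;
// the pointer is likewise established before the kernel runs.
// Assumes blockDim.x == n.
__global__ void dynamicKernel(int *d, int n)
{
    extern __shared__ int s[];
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[n - t - 1];
}

// launches:
// staticKernel<<<1, 64>>>(d);
// dynamicKernel<<<1, n, n * sizeof(int)>>>(d, n);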
I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?
I see two possibilities:
Use dynamic parallelism (if feasible). Rather than copying the result back to determine the execution parameters of the next launch, just have the device perform the next launch itself.
Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer.
Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA compatible device ever made.
An example of using managed memory, as outlined by @talonmies, allowing kernel1 to determine the number of blocks for kernel2 without an explicit memcpy:
#include <stdio.h>
#include <cuda.h>

// Managed variable: accessible from both host and device code.
__device__ __managed__ int kernel2_blocks;

__global__ void kernel1() {
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;   // "compute" the block count on the device
    }
}

__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main() {
    kernel1<<<1, 1024>>>();
    cudaDeviceSynchronize();   // required before the host reads the managed variable
    kernel2<<<kernel2_blocks, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}
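For comparison, the dynamic parallelism option avoids the host round-trip entirely: the device launches kernel2 itself. A hedged sketch (my illustration; requires compute capability 3.5+ and compilation with -rdc=true):

__global__ void kernel2();   // as defined above

__global__ void kernel1()
{
    if (threadIdx.x == 0) {
        int blocks = 42;                // "computed" on the device
        kernel2<<<blocks, 1024>>>();    // device-side launch; no host memcpy
    }
}

The zero-copy variant looks much like the managed-memory example, except that the host allocation is made with cudaHostAlloc(..., cudaHostAllocMapped) and the kernel writes through a device pointer obtained from cudaHostGetDevicePointer.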
In my application I have some part of the code that works as follows
main.cpp
int main()
{
    // First dimension usually small (1-10)
    // Second dimension (100 - 1500)
    // Third dimension (10000 - 1000000)
    vector<vector<vector<double>>> someInfo;
    SomeObject someObject(...); // Host class
    for (int i = 0; i < N; i++)
        someObject.functionB(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
    // GPU COMPUTING
    computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
    // CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So, as (I hope) you can see from the code, the function that prepares the GPU is called many times, depending on the size of the first dimension.
The values I send to constant memory always remain the same, and the sizes of the buffers allocated in global memory are always the same (only the data itself changes).
This is the actual workflow in my code, but I'm not getting any speedup from the GPU; the kernel itself executes faster, but the memory transfers have become the bottleneck (as reported by nvprof).
So I was wondering where in my application the CUDA context starts and ends, to see whether there is a way to perform the constant-memory copies and the memory allocations only once.
Normally, the CUDA context begins with the first CUDA call in your application and ends when the application terminates.
You should be able to do what you have in mind, which is to do the allocations only once (at the beginning of your app) and the corresponding free operations only once (at the end of your app) and populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repeatedly if they are not changing in size.
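A minimal sketch of that allocate-once pattern (the function names initGPU/cleanupGPU and the buffer layout are illustrative, not from the original code):

__constant__ int constVals[2];
static double *d_buffer = nullptr;

extern "C" void initGPU(int aConstValue, int aSecondConstValue, size_t maxElems)
{
    int h[2] = { aConstValue, aSecondConstValue };
    cudaMemcpyToSymbol(constVals, h, sizeof(h));        // constant memory: copied once
    cudaMalloc(&d_buffer, maxElems * sizeof(double));   // allocated once, at startup
}

extern "C" void computeOnGPU(const double *hostData, double *hostResult, size_t elems)
{
    // per-call work: only the changing data moves across the bus
    cudaMemcpy(d_buffer, hostData, elems * sizeof(double), cudaMemcpyHostToDevice);
    // launch kernel ...
    cudaMemcpy(hostResult, d_buffer, elems * sizeof(double), cudaMemcpyDeviceToHost);
}

extern "C" void cleanupGPU()
{
    cudaFree(d_buffer);   // freed once, when the application ends
}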