Can a branch in CUDA be ignored if all the warps go one path? If so, is there a way I could give the compiler/runtime this information?

Suppose we have code like the following (I have not compiled this, so it may be wrong):
__global__ void myKernel()
{
    int data = someArray[threadIdx.x];
    if (data == 0) {
        funcA();
    } else {
        funcB();
    }
}
Now suppose there's a 1024-thread block running, and someArray is all zeros.
Further suppose that funcB() is costly to run, but funcA() is not.
I assume the compiler has to emit both paths sequentially, executing funcA first and then funcB after. This is not ideal.
Is there a way to hint to CUDA not to do this? Or does the runtime notice "no threads are active, so I will skip over all the instructions as I see them"?
Or better yet, what if the branch were something like this (again, I haven't compiled this, but it illustrates what I am trying to convey):
__constant__ int constantNumber;

__global__ void myKernel()
{
    if (constantNumber == 123) {
        funcA();
    } else {
        funcB();
    }
}
and then I set constantNumber to 123 before launching the kernel. Would this still cause both paths to be taken?

This can be achieved using __builtin_assume.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#__builtin_assume
Quoting the documentation:
void __builtin_assume(bool exp)
Allows the compiler to assume that the Boolean argument is true. If the argument is not true at run time, then the behavior is undefined. The argument is not evaluated, so any side-effects will be discarded.
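For example, a minimal sketch applying it to the first kernel from the question (unchanged apart from the added assumption; the caller must guarantee that someArray really is all zeros, otherwise the behavior is undefined):

__global__ void myKernel()
{
    int data = someArray[threadIdx.x];
    // Promise the compiler that data is zero; the else path below then
    // becomes provably dead, and the optimizer is free to drop it.
    __builtin_assume(data == 0);
    if (data == 0) {
        funcA();
    } else {
        funcB();
    }
}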

Related

Can you use #defines like method parameters in HLSL?

In HLSL, is there a way to make defines act like swap-able methods? My use case is creating a method that does fractal brownian noise with a sampling function (x, y). Ideally I would be able to have a parameter that is a method, and just call that parameter, but I can't seem to do that in HLSL in Unity. It wouldn't make sense to copy and paste the entire fractal brownian method and change just the one sampler line, especially if I'm using multiple layers of different noise functions for a final output. But I can't seem to find out how to do it.
Here is what I've tried:
#define NOISE_SAMPLE Random(x, y)

float FBM()
{
    ...
    float somevalue = NOISE_SAMPLE;
    ...
}
And in a compute shader, I have something like this:
void CSMain(uint3 id : SV_DispatchThreadID)
{
    ...
    #undef NOISE_SAMPLE
    #define NOISE_SAMPLE Perlin(x, y)
    float result = FBM();
    ...
}
However, this doesn't seem to work. If I use NOISE_SAMPLE in the CSMain function, it uses the Perlin version. However, calling FBM() still uses the Random version. This doesn't seem to make sense, as I've read elsewhere that all functions are inlined, so I thought the FBM function would 'inline' itself below the redefinition with the Perlin version. Why is this the case, and what are some options for my use case?
This doesn't work, as a #define is a preprocessor instruction, and the preprocessor does its work before any other part of the HLSL compiler. So, even though your function is eventually inlined, this inlining only happens long after the preprocessor has run. In fact, the preprocessor is basically doing a purely string-based find-and-replace (just slightly smarter) before the actual compiler even sees your code. It isn't even aware of the concept of a function.
Off the top of my head, I can think of two options for your use case:
You could pass an integer as a parameter to your FBM() method, which identifies your noise function, and then have a switch (or an if-else-chain) inside your FBM() method, which selects the proper noise function based on this integer. Since the integer is passed as a compile-time constant, I'd expect that the compiler optimizes that branching away (and even if it doesn't, the cost of such a branch is fairly low, since all threads are always taking the same path through the code):
float FBM(uint noise)
{
    ...
    float somevalue = 0.0f;
    if (noise == 0)
        somevalue = Random(x, y);
    else
        somevalue = Perlin(x, y);
    ...
}

void CSMain(uint3 id : SV_DispatchThreadID)
{
    ...
    float result = FBM(1);
    ...
}
You could write your whole FBM() method as a preprocessor macro instead of a function (you can end a line in a #define with \ to have the macro span multiple lines). This is a bit more cumbersome, but your #undef and #define would work, as the inlining is then actually done by the preprocessor as well.
#define NOISE_SAMPLE Random(x, y)

#define FBM { \
    ... \
    float somevalue = NOISE_SAMPLE; \
    ... \
    result = ...; \
}

void CSMain(uint3 id : SV_DispatchThreadID)
{
    float result = 0.0f;
    ...
    #undef NOISE_SAMPLE
    #define NOISE_SAMPLE Perlin(x, y)
    FBM;
    ...
}
(Note that, with this approach, compiler errors/warnings will never reference a line inside the FBM macro, but only the line(s) where the macro is invoked, so debugging these errors/warnings is slightly harder.)

Why is CUDA function cudaLaunchKernel passed a function pointer to a host-code function?

I compile axpy.cu with the following command.
nvcc --cuda axpy.cu -o axpy.cu.cpp.ii
Within axpy.cu.cpp.ii, I observe that the function cudaLaunchKernel, nested in __device_stub__Z4axpyfPfS_, accepts a function pointer to axpy, which is defined in axpy.cu.cpp.ii. So my confusion is: shouldn't cudaLaunchKernel have been passed a function pointer to the kernel function axpy? Why is there a function definition with the same name as the kernel function? Any help would be appreciated! Thanks in advance.
Both functions are shown below.
void __device_stub__Z4axpyfPfS_(float __par0, float *__par1, float *__par2)
{
    void *__args_arr[3];
    int __args_idx = 0;
    __args_arr[__args_idx] = (void *)(char *)&__par0;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par1;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par2;
    ++__args_idx;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void (*)(float, float *, float *))axpy));
        dim3 __gridDim, __blockDim;
        size_t __sharedMem;
        cudaStream_t __stream;
        if (__cudaPopCallConfiguration(&__gridDim, &__blockDim, &__sharedMem, &__stream) != cudaSuccess)
            return;
        if (__args_idx == 0) {
            (void)cudaLaunchKernel(((char *)((void (*)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[__args_idx], __sharedMem, __stream);
        } else {
            (void)cudaLaunchKernel(((char *)((void (*)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[0], __sharedMem, __stream);
        }
    };
}

void axpy(float __cuda_0, float *__cuda_1, float *__cuda_2)
# 3 "axpy.cu"
{
    __device_stub__Z4axpyfPfS_(__cuda_0, __cuda_1, __cuda_2);
}
So my confusion is: shouldn't cudaLaunchKernel have been passed a function pointer to the kernel function axpy?
No, because that isn't the design NVIDIA chose, and your assumptions about how this function works are probably not correct. As I understand it, the first argument to cudaLaunchKernel is treated as a key, not as a function pointer to be called. Elsewhere in the nvcc-emitted code you will find boilerplate like this:
static void __nv_cudaEntityRegisterCallback(void **__T0)
{
    __nv_dummy_param_ref(__T0);
    __nv_save_fatbinhandle_for_managed_rt(__T0);
    __cudaRegisterEntry(__T0, ((void (*)(float, float *, float *))axpy), _Z4axpyfPfS_, (-1));
}
You can see that the __cudaRegisterEntry call takes both a static pointer to axpy and a form of the mangled symbol for the compiled GPU kernel. __cudaRegisterEntry is an internal, completely undocumented API within the CUDA runtime API. Many years ago, by reverse engineering an earlier version of the runtime API, I satisfied myself that there is an internal lookup mechanism which allows the correct instance of a host kernel stub to be matched to the correct instance of the compiled GPU code at runtime. The compiler emits a large amount of boilerplate and statically defined objects holding all the definitions necessary to make the runtime API work seamlessly, without the additional API overhead you need in the CUDA driver API or comparable compute APIs like OpenCL.
Why is there a function definition with the same name as the kernel function?
Because that is how NVIDIA decided to implement the runtime API. What you are asking about are undocumented, internal implementation details of the runtime API. You as a programmer are not supposed to have to see or use them, and they are not guaranteed to be the same from version to version.
If, as indicated in comments, you want to implement some additional compiler infrastructure in the CUDA compilation process, use the CUDA driver API, not the runtime API.
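For illustration, here is a minimal driver API launch sketch (a sketch only: it assumes the kernel was compiled separately into a hypothetical axpy.cubin, reuses the mangled name _Z4axpyfPfS_ seen above, and elides all error checking):

#include <cuda.h>

void launchAxpy(float a, float *dX, float *dY, int n)
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // The driver API makes the lookup explicit: you load the module and
    // fetch the kernel by its mangled name yourself, with no hidden
    // registration boilerplate.
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, "axpy.cubin");
    cuModuleGetFunction(&fn, mod, "_Z4axpyfPfS_");

    void *args[] = { &a, &dX, &dY };
    cuLaunchKernel(fn,
                   (n + 255) / 256, 1, 1,  // grid dimensions
                   256, 1, 1,              // block dimensions
                   0, NULL,                // shared memory bytes, stream
                   args, NULL);            // kernel parameters, extra

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}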

How to best make use of the same constants in both host and device code?

Suppose I have some global constant data I use in host-side code:
const float my_array[20] = { 45.146, 54.633, 74.669, 12.734, 74.240, 100.524 };
(Note: I've kept them C-ish, no constexpr here.)
I now want to also use these in device-side code. I can't simply start using them: They are not directly accessible from the device, and trying to use them gives:
error: identifier "my_array" is undefined in device code
What is, or what are, the idiomatic way(s) to make such constants usable on both the host and the device?
This approach was suggested by Mark Harris in an answer in 2012:
#define MY_ARRAY_VALUES 45.146, 54.633, 74.669, 12.734, 74.240, 100.524
__constant__ float device_side_my_array[20] = { MY_ARRAY_VALUES };
const float host_side_my_array[20] = { MY_ARRAY_VALUES };
#undef MY_ARRAY_VALUES

__device__ __host__ float my_array(size_t i) {
#ifdef __CUDA_ARCH__
    return device_side_my_array[i];
#else
    return host_side_my_array[i];
#endif
}
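For clarity, a hypothetical usage sketch (the kernel and function names here are illustrative, not from the original answer); the same accessor compiles for both sides:

__global__ void scaleKernel(float *out)
{
    // Device path: reads device_side_my_array from __constant__ memory.
    out[threadIdx.x] = 2.0f * my_array(threadIdx.x % 20);
}

void hostCode()
{
    // Host path: reads host_side_my_array from ordinary host memory.
    float x = my_array(3);
}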
But this has some drawbacks:
Not actually using the same constants, just constants with the same value.
Duplication of data.
Takes up constant memory, which is a rather limited resource.
Seems a bit verbose (although maybe other options are even more so).
I wonder if this is what most people use in practice.
Note:
In C++ one might use the same name for both arrays by placing them in different sub-namespaces within a detail:: namespace (a sketch of this follows below).
This approach doesn't use cudaMemcpyToSymbol().
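A minimal sketch of the namespace variant, under the same assumptions as above (the sub-namespace names are made up for illustration):

#define MY_ARRAY_VALUES 45.146, 54.633, 74.669, 12.734, 74.240, 100.524
namespace detail {
namespace host_side   { const float my_array[20]        = { MY_ARRAY_VALUES }; }
namespace device_side { __constant__ float my_array[20] = { MY_ARRAY_VALUES }; }
}
#undef MY_ARRAY_VALUES

__device__ __host__ inline float my_array(size_t i) {
#ifdef __CUDA_ARCH__
    return detail::device_side::my_array[i];
#else
    return detail::host_side::my_array[i];
#endif
}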

MQL5 function converted to OpenCL kernel using only one core

I have tried to convert the MQL5 function for a simple moving average into an OpenCL kernel program.
Here is what I did:
MQL5 function
void CalculateSimpleMA(int rates_total, int prev_calculated, int begin, const double &price[])
{
    int i, limit;
    //--- first calculation or number of bars was changed
    if(prev_calculated == 0) // first calculation
    {
        limit = InpMAPeriod + begin;
        //--- set empty value for first limit bars
        for(i = 0; i < limit - 1; i++) ExtLineBuffer[i] = 0.0;
        //--- calculate first visible value
        double firstValue = 0;
        for(i = begin; i < limit; i++)
            firstValue += price[i];
        firstValue /= InpMAPeriod;
        ExtLineBuffer[limit - 1] = firstValue;
    }
    else limit = prev_calculated - 1;
    //--- main loop
    for(i = limit; i < rates_total && !IsStopped(); i++)
        ExtLineBuffer[i] = ExtLineBuffer[i - 1] + (price[i] - price[i - InpMAPeriod]) / InpMAPeriod;
    //---
}
OpenCL
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void CalculateSimpleMA(
    int rates_total,
    int prev_calculated,
    int begin,
    int InpMAPeriod,
    __global double *price,
    __global double *ExtLineBuffer
)
{
    int i, limit;
    int len_price = get_global_id(4);
    if(prev_calculated == 0) // first calculation
    {
        limit = InpMAPeriod + begin;
        for(i = 0; i < limit - 1; i++)
            ExtLineBuffer[i] = 0.0;
        double firstValue = 0;
        for(i = begin; i < limit; i++)
            firstValue += price[i];
        firstValue /= InpMAPeriod;
        ExtLineBuffer[limit - 1] = firstValue;
    }
    else limit = prev_calculated - 1;
    for(i = limit; i < rates_total; i++)
        ExtLineBuffer[i] = ExtLineBuffer[i - 1] + (price[i] - price[i - InpMAPeriod]) / InpMAPeriod;
}
The program works fine. But the reason I used OpenCL was to make use of the GPU's multiple cores, and what I see is that only a single core is being consumed. When I tried using multiple workers in the execution, the kernel failed to execute. This is what is going on.
I think I made some mistake while converting the function to the OpenCL program.
Kindly suggest what I missed that leads to this behavior. I want to use all the cores if possible.
EDITED
The question is related to the conversion of the function from one language to another.

Thrust - accessing neighbors

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        // etc.
    }
    return mark;
}
Currently I use a work-around: I first launch a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements, and then pass the marked elements to Thrust's copy_if to distill the indices of the marked elements (a sketch of this two-step approach follows below).
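For reference, a minimal sketch of that two-step work-around in 1D (buffer and kernel names are assumed, not taken from the original code):

// Assumed setup: thrust::device_vector<float> x(n);
//                thrust::device_vector<bool>  marks(n);
//                thrust::device_vector<int>   indicesOut(n);

// Step 1: a plain CUDA kernel marks each element; neighbor access is direct.
__global__ void markElements(const float *x, bool *marks, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool mark = x[i] < 0.0f;
    if (mark) {
        if (i > 0     && x[i - 1] > 1.0f) mark = false; // left neighbor
        if (i < n - 1 && x[i + 1] > 1.0f) mark = false; // right neighbor
    }
    marks[i] = mark;
}

// Step 2: copy_if with the marks as a stencil distills the marked indices.
markElements<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(x.data()),
                                       thrust::raw_pointer_cast(marks.data()), n);
thrust::copy_if(thrust::counting_iterator<int>(0),
                thrust::counting_iterator<int>(n),
                marks.begin(),       // stencil
                indicesOut.begin(),
                thrust::identity<bool>());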
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know I'm not trying to access memory in an unaligned fashion. Anybody knows what's going on and/or how to fix this?
struct IsEmpty2 {
    float *xi;

    IsEmpty2(float *pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            // etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
@phoad: you're right about the shared mem; it struck me after I had already posted my reply, and I subsequently thought that the cache will probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.
Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code-readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results, and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;
    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.