Can you use #defines like method parameters in HLSL?

In HLSL, is there a way to make defines act like swappable methods? My use case is a method that does fractal Brownian noise using a sampling function (x, y). Ideally I would have a parameter that is a method, and just call that parameter, but I can't seem to do that in HLSL in Unity. It wouldn't make sense to copy and paste the entire fractal Brownian method and change just the one sampler line, especially if I'm using multiple layers of different noise functions for a final output. But I can't find out how to do it.
Here is what I've tried:
#define NOISE_SAMPLE Random(x, y)

float FBM()
{
    ...
    float somevalue = NOISE_SAMPLE;
    ...
}
And in a compute buffer, I have something like this:
void CSMain(uint3 id : SV_DispatchThreadID)
{
    ...
    #undef NOISE_SAMPLE
    #define NOISE_SAMPLE Perlin(x, y)
    float result = FBM();
    ...
}
However, this doesn't work. If I use NOISE_SAMPLE in the CSMain function, it uses the Perlin version, but calling FBM() still uses the Random version. This doesn't seem to make sense, as I've read elsewhere that all functions are inlined, so I thought the FBM function would 'inline' itself below the redefinition with the Perlin version. Why is this the case, and what are some options for my use case?

This doesn't work, as a #define is a preprocessor instruction, and the preprocessor does its work before any other part of the HLSL compiler. So, even though your function is eventually inlined, this inlining only happens long after the preprocessor has run. In fact, the preprocessor is basically doing a purely string-based find-and-replace (just slightly smarter) before the actual compiler even sees your code. It isn't even aware of the concept of a function.
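To make the ordering concrete: by the time the compiler sees FBM(), the macro has already been substituted at the point where FBM() is written in the source, so it effectively compiles something like this (a sketch of the preprocessor output, assuming a Random(x, y) helper exists):

float FBM()
{
    // ...
    float somevalue = Random(x, y); // NOISE_SAMPLE was expanded here, permanently
    // ...
}

// The later #undef/#define pair in CSMain() cannot reach back and rewrite
// this body; it only affects uses of NOISE_SAMPLE that appear further down
// in the source text.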
Off the top of my head, I can think of two options for your use case:
You could pass an integer as a parameter to your FBM() method, which identifies your noise function, and then have a switch (or an if-else-chain) inside your FBM() method, which selects the proper noise function based on this integer. Since the integer is passed as a compile-time constant, I'd expect that the compiler optimizes that branching away (and even if it doesn't, the cost of such a branch is fairly low, since all threads are always taking the same path through the code):
float FBM(uint noise)
{
    ...
    float somevalue = 0.0f;
    if (noise == 0)
        somevalue = Random(x, y);
    else
        somevalue = Perlin(x, y);
    ...
}
void CSMain(uint3 id : SV_DispatchThreadID)
{
    ...
    float result = FBM(1);
    ...
}
You could write your whole FBM() method as a preprocessor macro instead of a function (you can end a line in a #define with \ to have the macro span multiple lines). This is a bit more cumbersome, but your #undef and #define would then work, as the inlining is actually done by the preprocessor as well.
#define NOISE_SAMPLE Random(x, y)

#define FBM { \
    ... \
    float somevalue = NOISE_SAMPLE; \
    ... \
    result = ...; \
}
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float result = 0.0f;
    ...
    #undef NOISE_SAMPLE
    #define NOISE_SAMPLE Perlin(x, y)
    FBM;
    ...
}
(Note that, with this approach, compiler errors and warnings will never reference a line inside the FBM macro, only the line(s) where the macro is invoked, so debugging them is slightly harder.)

Related

Can a branch in CUDA be ignored if all the warps go one path? If so, is there a way I could give the compiler/runtime this information?

Suppose we have code like the following (I have not compiled this, it may be wrong)
__global__ void myKernel()
{
    int data = someArray[threadIdx.x];
    if (data == 0) {
        funcA();
    } else {
        funcB();
    }
}
Now suppose there's a 1024-thread block running, and someArray is all zeros.
Further suppose that funcB() is costly to run, but funcA() is not.
I assume the compiler has to emit both paths sequentially, like doing funcA first, then funcB after. This is not ideal.
Is there a way to hint to CUDA to not do it? Or does the runtime notice "no threads are active so I will skip over all the instructions as I see them"?
Or better yet, what if the branch was something like this (again, haven't compiled this, but it illustrates what I am trying to convey)
__constant__ int constantNumber;

__global__ void myKernel()
{
    if (constantNumber == 123) {
        funcA();
    } else {
        funcB();
    }
}
and then I set constantNumber to 123 before launching the kernel. Would this still cause both paths to be taken?
This can be achieved using __builtin_assume.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#__builtin_assume
Quoting the documentation:
void __builtin_assume(bool exp)
Allows the compiler to assume that the Boolean argument is true. If the argument is not true at run time, then the behavior is undefined. The argument is not evaluated, so any side-effects will be discarded.
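Applied to the second example above, a minimal sketch might look like this (not compiled or verified; funcA()/funcB() are stand-ins for your real device functions):

__constant__ int constantNumber;

__device__ void funcA() { /* cheap path */ }
__device__ void funcB() { /* expensive path */ }

__global__ void myKernel()
{
    // Promise the compiler that the condition is true; if it is false
    // at run time, the behavior is undefined.
    __builtin_assume(constantNumber == 123);
    if (constantNumber == 123) {
        funcA();
    } else {
        funcB();
    }
}

With the assumption in place, the compiler is free to fold the branch and drop the funcB() path entirely.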

How to best make use of the same constants in both host and device code?

Suppose I have some global constant data I use in host-side code:
const float my_array[20] = { 45.146, 54.633, 74.669, 12.734, 74.240, 100.524 };
(Note: I've kept them C-ish, no constexpr here.)
I now want to also use these in device-side code. I can't simply start using them: They are not directly accessible from the device, and trying to use them gives:
error: identifier "my_array" is undefined in device code
What is, or what are, the idiomatic way(s) to make such constants usable on both the host and the device?
This approach was suggested by Mark Harris in an answer in 2012:
#define MY_ARRAY_VALUES 45.146, 54.633, 74.669, 12.734, 74.240, 100.524

__constant__ float device_side_my_array[6] = { MY_ARRAY_VALUES };
const float host_side_my_array[6] = { MY_ARRAY_VALUES };

#undef MY_ARRAY_VALUES

__device__ __host__ float my_array(size_t i) {
#ifdef __CUDA_ARCH__
    return device_side_my_array[i];
#else
    return host_side_my_array[i];
#endif
}
But this has some drawbacks:
Not actually using the same constants, just constants with the same value.
Duplication of data.
Takes up constant memory, which is a rather limited resource.
Seems a bit verbose (although maybe other options are even more so).
I wonder if this is what most people use in practice.
Note:
In C++ one might use the same name, but in different sub-namespaces within the detail:: namespace.
This approach doesn't use cudaMemcpyToSymbol().
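For illustration, a minimal usage sketch building on the my_array() accessor above (the kernel name is made up, and this assumes the definitions from the snippet are in scope):

#include <cstdio>

__global__ void printFromDevice()
{
    // Compiled with __CUDA_ARCH__ defined, so this reads device_side_my_array.
    printf("device: %f\n", my_array(0));
}

int main()
{
    // Host compilation takes the #else branch and reads host_side_my_array.
    printf("host: %f\n", my_array(0));
    printFromDevice<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}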

GDB: Can I add a "watch" for a variable in another scope?

It seems that watch only works when I step into a function and watch the value of a function-local variable. My question is: can I watch a function's input parameter and have gdb stop when it is larger than some number? E.g., I have this code:
$ cat testWatch.cpp
#include <stdio.h>

void f(int i) {
    ++i;
    printf("%d\n", i);
}

int main() {
    int i = 1;
    f(2);
    f(3);
    ++i;
    f(4);
    ++i;
    return 0;
}
I wish to:
(1) While the program is in the "main" function, set a "watch" on a variable inside f(). Is that possible?
(2) Set a "watch" point at the beginning of the f() function, so that gdb stops when the input "int i" is larger than 2. Is that possible?
(1) Do you really need a 'watch'? It's trivial to set a conditional breakpoint inside f() by specifying the line number (or, in less trivial programs, fileName:lineNum).
(2) The behavior you describe is exactly a conditional breakpoint:
(gdb) break 2 if (i > 2)
Breakpoint 5 at 0x400531: file test.c, line 2.
(gdb) run
Starting program: /tmp/test
3
Breakpoint 5, f (i=3) at test.c:3
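For reference, gdb also accepts a function name as the breakpoint location, which avoids hard-coding a line number:
(gdb) break f if (i > 2)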

Is there any negative performance implication to using local functions in Rust?

I recently realized that I can create local functions in Rust (a function within a function). It seems like a good way to clean up my code without polluting the function space of a file. Below is a small sample of what I mean by a local function vs. an 'external' function:
fn main() {
    fn local_plus(x: i64, y: i64) -> i64 {
        x + y
    }

    let x = 2i64;
    let y = 5i64;

    let local_res = local_plus(x, y);
    let external_res = external_plus(x, y);
    assert_eq!(local_res, external_res);
}

fn external_plus(x: i64, y: i64) -> i64 {
    x + y
}
I was wondering if there is any negative performance implication of doing this? Like does Rust re-declare the function or take up some undesired amount of function space each time the containing function runs? Or does it have literally no performance implication?
As a bit of an aside, any tips on how I could have found out the answer for myself (either through reading any specific set of documents, or tooling I could use) would be welcome.
There is no impact; I checked the assembly generated for both variants and it is identical.
The two versions I compared:
"external":
fn main() {
let x = 2i64;
let y = 5i64;
let external_res = external_plus(x,y);
}
fn external_plus(x: i64, y: i64) -> i64 {
x + y
}
"local":
fn main() {
fn local_plus(x: i64, y: i64) -> i64 {
x + y
}
let x = 2i64;
let y = 5i64;
let local_res = local_plus(x, y);
}
And both yield the same asm result (release mode in today's nightly):
        .text
        .file   "rust_out.cgu-0.rs"
        .section        .text._ZN8rust_out4main17hb497928495d48c40E,"ax",@progbits
        .p2align        4, 0x90
        .type   _ZN8rust_out4main17hb497928495d48c40E,@function
_ZN8rust_out4main17hb497928495d48c40E:
        .cfi_startproc
        retq
.Lfunc_end0:
        .size   _ZN8rust_out4main17hb497928495d48c40E, .Lfunc_end0-_ZN8rust_out4main17hb497928495d48c40E
        .cfi_endproc
        .section        .text.main,"ax",@progbits
        .globl  main
        .p2align        4, 0x90
        .type   main,@function
main:
        .cfi_startproc
        movq    %rsi, %rax
        movq    %rdi, %rcx
        leaq    _ZN8rust_out4main17hb497928495d48c40E(%rip), %rdi
        movq    %rcx, %rsi
        movq    %rax, %rdx
        jmp     _ZN3std2rt10lang_start17h14cbded5fe3cd915E@PLT
.Lfunc_end1:
        .size   main, .Lfunc_end1-main
        .cfi_endproc
        .section        ".note.GNU-stack","",@progbits
Which means there will be zero difference (not only performance-wise) in the generated binary.
What is more, it doesn't even matter if you use a function; the following approach:
fn main() {
    let x = 2i64;
    let y = 5i64;
    let res = x + y;
}
Also yields the same assembly.
The bottom line is that, in general, the functions get inlined regardless of whether you declare them in main() or outside it.
Edit: as Shepmaster pointed out, in this program there are no side effects, so the generated assembly for both variants is actually the same as the one for:
fn main() {}
However, the MIR output for both is the same, too (and different from the one for a blank main()), so there shouldn't be any difference coming from the function's location even if side effects were present.
As a bit of an aside, any tips on how I could have found out the answer for myself (either through reading any specific set of documents, or tooling I could use) would be welcome.
Do you know of the Rust playground?
Enter your code, click on "LLVM IR", "Assembly" or "MIR" instead of "Run", and you get to see what is the low-level representation emitted for said code.
I personally prefer LLVM IR (I'm used to reading it from C++), which is still quite a bit higher-level than assembly whilst already being past the language itself.
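If you prefer to work locally, rustc can emit the same artifacts directly (a sketch of the usual invocations):

$ rustc -O --emit=asm main.rs       # writes main.s
$ rustc -O --emit=llvm-ir main.rs   # writes main.ll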
I was wondering if there is any negative performance implication of doing this?
That's actually a very complicated question.
The only difference between declaring a function locally or externally in Rust is one of scope. Declaring it locally simply reduces its scope. Nothing else.
However... scope, and usage, can have drastic effects on compilation.
A function that is used only once, for example, is much more likely to be inlined than a function that is used 10 times. A compiler cannot easily estimate the number of uses of a pub function (unbounded), but has perfect knowledge for local or non-pub functions. And whether a function is inlined or not can drastically affect the performance profile (for worse or better).
So, by reducing the scope, and thereby limiting the usage, you are encouraging the compiler to consider your function for inlining (unless you mark it "cold").
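For illustration, #[cold] is the attribute in question; here is a minimal runnable sketch (the function name and message are made up):

#[cold]
fn report_failure(msg: &str) {
    // Marked cold: the compiler treats calls to this function as
    // unlikely and is discouraged from inlining it into the hot path.
    eprintln!("error: {}", msg);
}

fn main() {
    let sum = 2i64 + 5i64;
    if sum != 7 {
        report_failure("arithmetic is broken");
    }
}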
On the other hand, since the scope is reduced, it cannot be shared (obviously).
So... what?
Follow the usage: define an item in the tightest scope possible.
This is encapsulation: now, the next time you need to modify this piece, you will know exactly the impacted scope.
Have some trust in Rust; it won't introduce overhead if it can avoid it.

Thrust - accessing neighbors

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        // etc.
    }
    return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
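Roughly, that marking kernel looks like the following (a simplified sketch with illustrative names, assuming a row-major 2D grid; the real code checks all 8/26 neighbors):

__global__ void markCells(const float* x, int* marked, int width, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    bool mark = x[i] < 0.0f;
    if (mark) {
        int col = i % width;
        // Neighbor indexing is straightforward in a plain kernel:
        if (col > 0         && x[i - 1]     > 1.0f) mark = false; // left
        if (col < width - 1 && x[i + 1]     > 1.0f) mark = false; // right
        if (i - width >= 0  && x[i - width] > 1.0f) mark = false; // top
        if (i + width < n   && x[i + width] > 1.0f) mark = false; // bottom
        // diagonal neighbors handled the same way
    }
    marked[i] = mark ? 1 : 0;
}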
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but compiling it gives me "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix this?
struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            // etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
#phoad: you're right about the shared mem; it struck me after I had already posted my reply, and I subsequently figured the cache would probably help me. But you beat me to it with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.
Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code-readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to get some unexplainable results, and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>

struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;

    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I'm stuck with my current work-around.