PTX has a prmt instruction with many variants. This question regards the default one, which, if formatted as a C/C++ function, would look like this:
uint32_t prmt(uint32_t a, uint32_t b, uint32_t byte_selectors);
and this is what it does (adapted from the official docs):
In the generic form (no mode specified), byte_selectors consists of four 4-bit selection values. The bytes in the two source parameters a and b are numbered from 0 to 7: {b, a} = {{b7, b6, b5, b4}, {b3, b2, b1, b0}}. For each byte in the function's output, a 4-bit selection value is defined.
The 3 lsbs of the selection value specify which of the 8 source bytes should be moved into the target position. The msb defines if the byte value should be copied, or if the sign (msb of the byte) should be replicated over all 8 bits of the target position (sign extend of the byte value); msb=0 means copy the literal value; msb=1 means replicate the sign.
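To make the semantics concrete, here is a plain-C reference model of the behavior described above (my own illustration, not taken from the docs):
#include <stdint.h>
// Reference model of the generic (default-mode) prmt described above.
uint32_t prmt_ref(uint32_t a, uint32_t b, uint32_t byte_selectors)
{
    uint8_t src[8];
    for (int i = 0; i < 4; i++) {
        src[i]     = (uint8_t)(a >> (8 * i)); // source bytes 0..3
        src[i + 4] = (uint8_t)(b >> (8 * i)); // source bytes 4..7
    }
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t sel = (byte_selectors >> (4 * i)) & 0xf;
        uint8_t byte = src[sel & 7];            // 3 lsbs select the source byte
        if (sel & 8)                            // msb set: replicate the sign
            byte = (byte & 0x80) ? 0xff : 0x00;
        r |= (uint32_t)byte << (8 * i);
    }
    return r;
}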
My question: When is this kind of operation useful? What kind of computation can make use of it?
The PTX instruction prmt exposes the functionality of the machine instruction PRMT. The default mode of the prmt instruction is used when none of the special modes .f4e, .b4e, .rc8, .ecl, .ecr, .rc16 is specified.
The default mode has two per-byte sub-modes, controlled by the most significant bit of the 4-bit selector field for each of eight source bytes. The commonly used sub-mode is to have the msb of the selector field zero, which means the destination byte is copied verbatim from the specified source byte. This sub-mode is exposed via a device function intrinsic __byte_perm(), and generally used to extract, insert, and permute bytes or perform bit shifts by multiples of 8. Example usage can be seen in this answer.
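For instance, two typical __byte_perm() idioms (the selector constants here are my own examples, not from the linked answer):
// Reverse the byte order of x: destination byte i takes source byte (3 - i).
__device__ uint32_t bswap32(uint32_t x)
{
    return __byte_perm(x, 0, 0x0123);
}
// Logical right shift by one byte: bytes 1..3 of x, then byte 0 of the zero operand.
__device__ uint32_t shr_byte(uint32_t x)
{
    return __byte_perm(x, 0, 0x4321);
}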
The other sub-mode is special, in that instead of copying the entire source byte, it replicates the most significant bit of the specified source byte across the destination byte. For this, the msb of the selector field needs to be set to one. Programmers must use PTX inline assembly to access this functionality.
I did not design the GPU hardware, so I cannot speak to why that sub-mode was implemented. It is generally useful when the msb of each byte serves as a boolean value that one needs to convert into a mask covering the whole byte. This in turn is generally useful for byte-wise processing within a 32-bit register. Note that CUDA includes a good many device function intrinsics for such processing, and disassembly will confirm that the msb-replication sub-mode of the prmt default mode is used for many of those.
A fully worked example, emulation of the paddsb operation (byte-wise addition with signed saturation), is shown below. Note the use of prmt with msb replication inside masked_sign_to_byte_mask().
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#if (__CUDACC__)
#define __HOST__ __host__
#define __DEVICE__ __device__
#else // __CUDACC__
#define __HOST__
#define __DEVICE__
#endif // __CUDACC__
#define MSB_MASK (0x80808080U) // mask for msb of each byte
// r = (a ^ b) & ~c
__HOST__ __DEVICE__ uint32_t lop3_14 (uint32_t a, uint32_t b, uint32_t c)
{
uint32_t r;
#if (__CUDA_ARCH__ >= 500)
asm ("lop3.b32 %0,%1,%2,%3,0x14;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
r = (a ^ b) & ~c;
#endif // __CUDA_ARCH__
return r;
}
// r = (a ^ b) & c
__HOST__ __DEVICE__ uint32_t lop3_28 (uint32_t a, uint32_t b, uint32_t c)
{
uint32_t r;
#if (__CUDA_ARCH__ >= 500)
asm ("lop3.b32 %0,%1,%2,%3,0x28;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
r = (a ^ b) & c;
#endif // __CUDA_ARCH__
return r;
}
// r = a ^ (~b & c)
__HOST__ __DEVICE__ uint32_t lop3_d2 (uint32_t a, uint32_t b, uint32_t c)
{
uint32_t r;
#if (__CUDA_ARCH__ >= 500)
asm ("lop3.b32 %0,%1,%2,%3,0xd2;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
r = a ^ (~b & c);
#endif // __CUDA_ARCH__
return r;
}
// r = (a & c) | (b & ~c)
__HOST__ __DEVICE__ uint32_t lop3_f4 (uint32_t a, uint32_t b, uint32_t c)
{
uint32_t r;
#if (__CUDA_ARCH__ >= 500)
asm ("lop3.b32 %0,%1,%2,%3,0xf4;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
r = (a & c) | (b & ~c);
#endif // __CUDA_ARCH__
return r;
}
__HOST__ __DEVICE__ uint32_t masked_sign_to_byte_mask (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
#else
a = a & MSB_MASK;
a = a + a - (a >> 7); // extend MSBs to full byte to create mask
#endif
return a;
}
__HOST__ __DEVICE__ uint32_t masked_select (uint32_t a, uint32_t b, uint32_t m)
{
#if (__CUDA_ARCH__ >= 500)
return lop3_f4 (a, b, m);
#elif 0
return (a & m) | (b & ~m);
#else
return ((a ^ b) & m) ^ b;
#endif
}
/*
my_paddsb() performs byte-wise addition with signed saturation. In the
case of overflow, positive results are clamped at 127, while negative
results are clamped at -128.
*/
__HOST__ __DEVICE__ uint32_t my_paddsb (uint32_t a, uint32_t b)
{
uint32_t sum, res, ofl, sga, msk;
res = (a & ~MSB_MASK) + (b & ~MSB_MASK);
sum = a ^ b;
ofl = lop3_14 (res, a, sum); // ofl = (res ^ a) & ~sum
sga = masked_sign_to_byte_mask (a); // sign(a)-mask
msk = masked_sign_to_byte_mask (ofl);// overflow-mask
res = lop3_d2 (res, ~MSB_MASK, sum); // res = res ^ (MSB_MASK & sum)
sga = lop3_28 (sga, ~MSB_MASK, msk); // sga = (sga ^ ~MSB_MASK) & msk
res = masked_select (sga, res, msk); // res = (sga & msk) | (res & ~msk)
return res;
}
__global__ void kernel (uint32_t a, uint32_t b)
{
printf ("GPU: %08x\n", my_paddsb (a, b));
}
int main (void)
{
uint32_t a = 0x12ef70a0;
uint32_t b = 0x34cd6090;
kernel<<<1,1>>>(a, b);
cudaDeviceSynchronize();
printf ("CPU: %08x\n", my_paddsb (a, b));
return EXIT_SUCCESS;
}
I am trying to do a parallel sum scan on a test vector. I am using both the Thrust and CUB libraries for this purpose:
struct CustomSum
{
template <typename T>
CUB_RUNTIME_FUNCTION __forceinline__
T operator()(const T &a, const T &b) const {
return a + b;
}
};
// 2d array stored in row-major order [(0,0), (0,1), (0,2), ... ]
thrust::host_vector<int> hVec_I1(SIZE_IMG, 1);
thrust::host_vector<int> hVec_I2(SIZE_IMG, 1);
thrust::host_vector<int> h_out(SIZE_IMG, 1);
CustomSum sum_op;
// Initialize vector with synthetic image:
initialize(N, N, hVec_I1, hVec_I2);
// Compute Integral Image M1 and M2
thrust::device_vector<int> dVec_M1 = hVec_I1;
thrust::device_vector<int> dVec_M2 = hVec_I2;
thrust::device_vector<int> d_o = h_out;
//thrust::device_ptr<double> d_in = dVec_M1.data();
//thrust::device_ptr<double> d_out1 = d_out.data();
int* d_in = thrust::raw_pointer_cast(&dVec_M1[0]);
int *d_out = thrust::raw_pointer_cast(&d_o[0]);
//d_in = thrust::raw_pointer_cast(dVec_M2.data());
//thrust::device_vector<int> d_out;
//int *d_out = thrust::raw_pointer_cast(dVec_M1.data());
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
// Run inclusive prefix sum-scan
cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, sum_op, SIZE_IMG);
// Allocate temporary storage for inclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run inclusive prefix sum-scan
cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, sum_op, SIZE_IMG);
The error I am getting is
Error 43 error : calling a __host__ function("CustomSum::operator ()<int> ") from a __device__ function("cub::TilePrefixCallbackOp<int, CustomSum, cub::ScanTileState<int, (bool)1> > ::operator ()") is not allowed c:\users\asu_cuda_laptop\documents\visual studio 2013\projects\stats_kernel\cub\agent\single_pass_scan_operators.cuh 747 1 stats_kernel
I could not interpret the error correctly and I am sure there is a problem with the way I am handling raw pointers. Any help is appreciated.
Related link: How to use CUB and Thrust in one CUDA code
Try defining CustomSum::operator() as a __device__ function. More on __host__ vs __device__ functions in the CUDA C programming guide.
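A minimal sketch of the suggested change (marking it __host__ __device__ also keeps it callable from host code, though __device__ alone would satisfy CUB here):
struct CustomSum
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return a + b;
    }
};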
I would like to create a list of function pointers dynamically on the CPU (with some sort of push_back() method called from main()) and copy it to a GPU __constant__ or __device__ array, without needing to resort to static __device__ function pointers. I believe this question is related to my problem; however, my goal is to create the __host__ function pointer array iteratively and then copy it to the __constant__ function pointer array instead of initialising the latter on declaration.
A working code example with static function pointers (as seen here or here) would be:
common.h:
#ifndef COMMON_H
#define COMMON_H
#include <stdio.h>
#include <iostream>
#define num_functions 3
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
// fptr_t: Pointer to void function that takes two integer lvalues
typedef void (*fptr_t)(int&, int&);
// some examples of void(int&, int&) functions...
__device__ void Add(int &a, int &b) {printf("Add... %i + %i = %i\n", a, b, a+b);}
__device__ void Subtract(int &a, int &b) {printf("Subtract... %i - %i = %i\n", a, b, a-b);}
__device__ void Multiply(int &a, int &b) {printf("Multiply... %i * %i = %i\n", a, b, a*b);}
// List of function pointers in device memory
__constant__ fptr_t constant_fList[num_functions];
// Kernel called from main(): choose the function to apply whose index is equal to thread ID
__global__ void kernel(int a, int b) {
fptr_t f;
if (threadIdx.x < num_functions) {
f = constant_fList[threadIdx.x];
f(a,b);
}
}
#endif
main.cu:
#include "common.h"
// Static device function pointers
__device__ fptr_t p_Add = Add;
__device__ fptr_t p_Sub = Subtract;
__device__ fptr_t p_Mul = Multiply;
// Load function list to constant memory
void loadList_staticpointers() {
fptr_t h_fList[num_functions];
gpuErrchk( cudaMemcpyFromSymbol(&h_fList[0], p_Add, sizeof(fptr_t)) );
gpuErrchk( cudaMemcpyFromSymbol(&h_fList[1], p_Sub, sizeof(fptr_t)) );
gpuErrchk( cudaMemcpyFromSymbol(&h_fList[2], p_Mul, sizeof(fptr_t)) );
gpuErrchk( cudaMemcpyToSymbol(constant_fList, h_fList, num_functions * sizeof(fptr_t)) );
}
int main() {
loadList_staticpointers();
int a = 12, b = 15;
kernel<<<1,3>>>(a, b);
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
Specs: GeForce GTX 670, compiled for -arch=sm_30, CUDA 6.5, Ubuntu 14.04
I wish to avoid the use of static device function pointers, as appending each function would require code maintenance on the user side - declaration of a new static pointer like p_Add or p_Mul, manipulation of void loadList_functionpointers(), etc. To make it clear, I am trying something like the following (crashing) code:
main_wrong.cu:
#include "common.h"
#include <vector>
// Global variable: list of function pointers in host memory
std::vector<fptr_t> vec_fList;
// Add function to functions list
void addFunc(fptr_t f) {vec_fList.push_back(f);}
// Upload the functions in the std::vector<fptr_t> to GPU memory
// Copies CPU-side pointers to constant_fList, therefore crashes on kernel call
void UploadVector() {
fptr_t* h_vpointer = vec_fList.data();
gpuErrchk( cudaMemcpyToSymbol(constant_fList, h_vpointer, vec_fList.size() * sizeof(fptr_t)) );
}
int main() {
addFunc(Add);
addFunc(Subtract);
addFunc(Multiply);
int a = 12, b = 15;
UploadVector();
kernel<<<1,3>>>(a, b); // Wrong to call a host-side function pointer from a kernel
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
My understanding is that function pointers pointing to host addresses are copied to the GPU and are unusable by the kernel, which needs pointers pointing to GPU addresses when the function f(a,b) is called. Populating a host-side array with device-side pointers would work for me with raw data (see this question) but not with function pointers. Trivial attempts with Unified Memory have failed as well... so far, I have only found static device-side pointers to work. Is there no other way to copy a dynamically created CPU array of function pointers onto the GPU?
If you can use C++11 (supported since CUDA 7), you could use the following to auto-generate the function table:
template <fptr_t... Functions>
__global__ void kernel(int a, int b)
{
constexpr auto num_f = sizeof...(Functions);
constexpr fptr_t table[] = { Functions... };
if (threadIdx.x < num_f)
{
fptr_t f = table[threadIdx.x];
f(a,b);
}
}
You would then call this kernel using
kernel<Add, Subtract, Multiply><<<1,3>>>(a, b);
Inspired by m.s.'s answer, I chose to pass the function pointer as a template parameter (this was in fact the key to solving my problem) and discovered that filling a __device__ array of function pointers dev_fList from the main() function iteratively, without the help of static function pointers, is indeed possible; C++11 compatibility is not even needed!
Here is a working example on a __device__ array in global memory. I have not tried its constant memory counterpart yet, but once a global memory array has been satisfactorily created, my guess is that a cudaMemcpyToSymbol(..., cudaMemcpyDeviceToDevice) should do the trick.
A kernel kernel() creates a GPU address for function pointer dev_f and copies the function f that was passed as a template argument. Since this is an iterative process from the CPU, only one thread (thread 0) is involved in this kernel, which is launched with configuration <<<1,1>>>. The static variable count_f takes care of indexing in dev_fList.
common.h:
#ifndef COMMON_H
#define COMMON_H
#include <stdio.h>
#include <iostream>
#define num_functions 3
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
// fptr_t: Pointer to void function that takes two integer lvalues
typedef void (*fptr_t)(int&, int&);
// some examples of void(int&, int&) functions...
__device__ void Add(int &a, int &b) {printf("Add... %i + %i = %i\n", a, b, a+b);}
__device__ void Subtract(int &a, int &b) {printf("Subtract... %i - %i = %i\n", a, b, a-b);}
__device__ void Multiply(int &a, int &b) {printf("Multiply... %i * %i = %i\n", a, b, a*b);}
// List of function pointers in device memory
// Note that, in my example, it resides in global memory space, not constant memory
__device__ fptr_t dev_fList[num_functions];
#endif
main.cu:
#include "common.h"
// Index in dev_fList[] == number of times addFunc<>() was launched
static int count_f = 0;
// Kernel that copies function f to the GPU
template<fptr_t f>
__global__ void kernel(int a, int b, int idx) {
fptr_t dev_f = f; // Create device function pointer
dev_fList[idx] = dev_f; // Populate the GPU array of function pointers
dev_fList[idx](a,b); // Make sure that the array was populated correctly
}
// Add function to functions list
template<fptr_t f>
void addFunc(const int &a, const int &b) {
if (count_f >= num_functions) {
std::cout << "Error: not enough memory statically allocated on device!\n";
exit(EXIT_FAILURE);
}
kernel<f><<<1,1>>>(a,b,count_f);
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaDeviceSynchronize());
count_f++;
}
int main() {
int a = 12, b = 15;
addFunc<Add>(a,b);
addFunc<Subtract>(a,b);
addFunc<Multiply>(a,b);
return 0;
}
Edit: Added copy of the array of function pointers to constant memory
For what it's worth, here is how to copy our dev_fList array to constant memory:
In common.h:
__constant__ fptr_t cst_fList[num_functions];
__global__ void cst_test(int a, int b, int idx) {
if (threadIdx.x < idx) cst_fList[threadIdx.x](a,b);
}
In main.cu main() function, after all desired functions have been added:
fptr_t temp[num_functions];
gpuErrchk( cudaMemcpyFromSymbol(temp, dev_fList, count_f * sizeof(fptr_t)) );
gpuErrchk( cudaMemcpyToSymbol(cst_fList, temp, count_f * sizeof(fptr_t)) );
cst_test<<<1,count_f>>>(a,b, count_f);
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaDeviceSynchronize());
It may look ugly, as I understand that memory is transferred to the host via temp and then back to the device; more elegant suggestions are welcome (one possibility is sketched below).
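One possibility (an untested sketch on my part) is a plain device-to-device copy between the two symbols, which avoids the host round trip:
// Fetch the device address of dev_fList, then copy its contents straight
// into the constant-memory symbol.
fptr_t *d_src = NULL;
gpuErrchk( cudaGetSymbolAddress((void**)&d_src, dev_fList) );
gpuErrchk( cudaMemcpyToSymbol(cst_fList, d_src, count_f * sizeof(fptr_t),
                              0, cudaMemcpyDeviceToDevice) );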
It is impossible to use dynamically created CUDA device function pointers (at least not without crashes or UB). The template-based solutions work at compile time, not at run time, so they are not dynamic. The CUDA device function pointer approaches you see everywhere need device symbols in global scope; this means that for every function, a device function pointer must already be declared. This also means that you cannot use normal C function pointers as references, e.g. ones that are set at runtime. In summary, using CUDA device function pointers is questionable. Template-based approaches look user-friendly, but are by definition not dynamic.
Example showing structure with function pointers:
This example shows a structure holding some function pointers. In normal C++ code, you can set and change the function pointers while the program is running (dynamically). With CUDA, the example below is impossible, because the function pointers in the struct are not valid device symbols, which means they cannot be used with cudaMemcpyFromSymbol. To circumvent this, either the original function (the target of the function pointer) or a global CUDA device function pointer must be used. Neither is dynamic.
This is dynamic assignment:
typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);
// In C++ you can set and reset the function pointer during run time whenever you want ..
struct DistFunction {
/*__host__ __device__*/ pDistanceFu distance; // uncomment for NVCC ..
/*__host__ __device__*/ pDecayFu rad_decay;
/*__host__ __device__*/ pDecayFu lrate_decay;
};
// you can do what you want ..
DistFunction foo, bar;
foo.distance = bar.distance;
// ..
This is how it should be with CUDA, but it will fail, because there is no valid device symbol :(
pDistanceFu hDistance;
pDecayFu hRadDay;
pDecayFu hLRateDecay;
void DeviceAssign(DistFunction &dist) {
cudaMemcpyFromSymbol(&hDistance, dist.distance, sizeof(pDistanceFu) );
cudaMemcpyFromSymbol(&hRadDay, dist.rad_decay, sizeof(pDecayFu) );
cudaMemcpyFromSymbol(&hLRateDecay, dist.lrate_decay, sizeof(pDecayFu) );
dist.distance = hDistance;
dist.rad_decay = hRadDay;
dist.lrate_decay = hLRateDecay;
}
Here is the classical way, but notice that it is not dynamic anymore, because the device symbol must refer to the function itself, not to a pointer that may change during run-time.
// .. and this would work
#ifdef __CUDACC__
__host__ __device__
#endif
inline float fcn_rad_decay (float sigma0, float T, float lambda) {
return std::floor(sigma0*exp(-T/lambda) + 0.5f);
}
__device__ pDistanceFu pFoo= fcn_rad_decay; // pointer must target a reference, no host pointer possible
void DeviceAssign2(DistFunction &dist) {
cudaMemcpyFromSymbol(&hLRateDecay, &fcn_rad_decay, sizeof(pDecayFu) );
// the same:
// cudaMemcpyFromSymbol(&hLRateDecay, pFoo, sizeof(pDecayFu) );
// ..
dist.lrate_decay = hLRateDecay;
// ..
}
I have found some strange runtime behaviour while experimenting with function pointers in CUDA.
Goal
My goal is to make my function pointers choose which function to apply to two objects according to an internal property of the latter.
In short, I want to emulate C++ templates with a CUDA kernel - without actually using template arguments or switch clauses, but function pointers and class/struct members instead.
Approach
Define my custom objects struct customObj with one property (int type) that will emulate the arguments of a template.
Define a bunch of dummy functions (Sum(), Subtract(), etc) to choose from.
Keep the list of functions to apply (functionsList) and the respective type members to look up (first_type, second_type) in __constant__ memory, such that function functionsList[i](obj1,obj2) is applied to objects with obj1.type == first_type[i] and obj2.type == second_type[i].
Working code
The following code has been compiled for Linux x86_64 with CUDA 5.0, on a GPU with compute capability 3.0 (GeForce GTX 670), and works.
#include <stdio.h>
#include <iostream>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
struct customObj
{
int type;
double d;
// Constructors
__device__ __host__ customObj() {}
__device__ __host__ customObj(const int& _type, const double& _d) : type(_type), d(_d) {}
};
typedef void (*function_t)(customObj&, customObj&);
// Define a bunch of functions
__host__ __device__ void Sum(customObj& obj1, customObj& obj2) {printf("Sum chosen! d1 + d2 = %f\n", obj1.d + obj2.d);}
__host__ __device__ void Subtract(customObj& obj1, customObj& obj2) {printf("Subtract chosen! d1 - d2 = %f\n", obj1.d - obj2.d);}
__host__ __device__ void Multiply(customObj& obj1, customObj& obj2) {printf("Multiply chosen! d1 * d2 = %f\n", obj1.d * obj2.d);}
#define ARRAYLENGTH 3
__constant__ int first_type[ARRAYLENGTH] = {1, 2, 3};
__constant__ int second_type[ARRAYLENGTH] = {1, 1, 2};
__constant__ function_t functionsList[ARRAYLENGTH] = {Sum, Sum, Subtract};
// Kernel to loop through functions list
__global__ void choosefunction(customObj obj1, customObj obj2) {
int i = 0;
function_t f = NULL;
do {
if ((obj1.type == first_type[i]) && (obj2.type == second_type[i])) {
f = functionsList[i];
break;
}
i++;
} while (i < ARRAYLENGTH);
if (f == NULL) printf("No possible interaction!\n");
else f(obj1,obj2);
}
int main() {
customObj obj1(1, 5.2), obj2(1, 2.6);
choosefunction<<<1,1>>>(obj1, obj2);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
The issue
The problem that I have found is that, as soon as I replace the datatype of the member int type and the related variables and functions (__constant__ int first_type[...] and so on), the code compiles but stops working!
If I change the datatype from int to char or int8_t, the memory checker throws error 4 on my call to cudaDeviceSynchronize().
If I change the datatype to unsigned short int, I get a hardware stack overflow.
So, is anybody having similar issues when working with __constant__ memory? I really have no clue about what is going on. As far as I know, char and int8_t are built-in types of 1 byte length, while the size of int is 4 bytes, so maybe it is about data alignment, but I'm just guessing here. Besides, CUDA is supposed to support function pointers on the GPU since compute capability 2.0. Are there any special constraints for function pointers in __constant__ memory that I'm missing?
I was able to reproduce the problem (error 4, unspecified launch failure) on CUDA 5.0 on 64bit RHEL 5.5, but not on CUDA 6.0.
Please update/upgrade to CUDA 6.
What is the difference between the following two functions?
__device__ inline void comparator_volatile(volatile float &A, volatile float &B, uint dir) {
float t;
if ((A > B) == dir) {
t = A;
A = B;
B = t;
}
}
__device__ inline void comparator(float &A, float &B, uint dir) {
float t;
if ((A > B) == dir) {
t = A;
A = B;
B = t;
}
__threadfence();
}
Could anyone help me?
I implemented BitonicSort in several different versions based on the CUDA SDK version.
For the ATOMIC version (bitonicSortAtomic), I tried to use __threadfence() in __syncblocks_atomic to maintain memory consistency, but it doesn't work (the output is incorrect). I have to call comparator_volatile instead of comparator to get the correct result. Any idea?
The BitonicSort benchmark:
// (C) Copyright 2013, University of Illinois. All Rights Reserved
#include <stdlib.h>
#include <stdio.h>
#include "parboil.h"
#define THREADS 256
#define BLOCKS 32
#define NUM_VALS 2*THREADS*BLOCKS
__device__ volatile int mutex = 0;
__device__ inline void __syncblocks_atomic(int goal) {
__syncthreads();
// __threadfence();
int tx = threadIdx.x;
if (tx == 0) {
atomicAdd((int *)&mutex, 1);
while (mutex != goal) {}
}
__syncthreads();
}
__device__ inline void comparator(float &A, float &B, uint dir) {
float t;
if ((A > B) == dir) {
t = A;
A = B;
B = t;
}
}
__device__ inline void comparator_volatile(volatile float &A, volatile float &B, uint dir) {
float t;
if ((A > B) == dir) {
t = A;
A = B;
B = t;
}
}
#ifdef NAIVE
__global__ void bitonicSortNaive(float *src, int stride, int size) {
unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
uint dir = (tid & (size / 2)) == 0;
unsigned int pos = 2*tid - (tid & (stride - 1));
comparator(src[pos], src[pos+stride], dir);
}
#endif
#ifdef ATOMIC
__global__ void bitonicSortAtomic(float *src, int length) {
uint numBlocks = gridDim.x * gridDim.y * gridDim.z;
uint goalVal = 0;
uint tid = threadIdx.x + blockDim.x * blockIdx.x;
for(uint size=2; size<=length; size<<=1) {
for(uint stride=size>>1; stride>0; stride=stride>>1) {
uint dir = (tid & (size / 2)) == 0;
uint pos = 2*tid - (tid & (stride - 1));
comparator_volatile(src[pos], src[pos+stride], dir);
if(stride>THREADS || (stride==1 && size>=THREADS)) {
goalVal += numBlocks;
__syncblocks_atomic(goalVal);
}
else
__syncthreads();
} // end for stride
} // end for size
}
#endif
int main() {
printf("[BENCH] Bitonic Sort %d elements\n", NUM_VALS);
printf("[BENCH] Xuhao Chen <cxh#illinois.edu>\n");
#ifdef NAIVE
printf("[BENCH] Naive version\n");
#endif
#ifdef ATOMIC
printf("[BENCH] Atomic Barrier\n");
#endif
float *values = (float*) malloc( NUM_VALS * sizeof(float));
array_init(values, NUM_VALS);
float *dev_values;
size_t size = NUM_VALS * sizeof(float);
cudaMalloc((void**) &dev_values, size);
cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice);
dim3 blocks(BLOCKS,1);
dim3 threads(THREADS,1);
cudaDeviceSynchronize();
#ifdef NAIVE
int j, k;
for (k = 2; k <= NUM_VALS; k <<= 1) {
for (j=k>>1; j>0; j=j>>1) {
bitonicSortNaive<<<blocks, threads>>>(dev_values, j, k);
}
}
#endif
#ifdef ATOMIC
bitonicSortAtomic<<<blocks, threads>>>(dev_values, NUM_VALS);
#endif
cudaDeviceSynchronize();
cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost);
cudaFree(dev_values);
free(values);
}
__syncblocks_atomic is a function to implement global barrier. Since there is inter-block communication, I have to keep data consistency.
The CUDA programming guide states:
If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.
This basically means that a value is written out to memory immediately when you assign it to the variable, and is fetched directly from memory (bypassing the cache) when you read it.
In your first code sample, since both A and B are volatile, six actual memory instructions are generated, one read or write each time you use either A or B. The good point is that other threads are able to see the modifications earlier, as they are made. The downside is that execution will be slower, because the caches are disabled.
In your second code sample, on the other hand, the GPU is authorized to use caches to accelerate its execution until the end of the function, when it is forced to issue a memory write. If both A and B are already cached, only two memory writes are issued. The downside is that other threads might only be able to see the changed values after the fence.
Another thing you should consider is that operations are not atomic.
If other threads try to access A and B while your function is executing, they might see a partial execution of the function in either case. In the second code sample, this is a bit less likely to happen, because the thread will probably use its cached values and flush the final values all at once (in any case, you should not rely on this).
Also, volatile works as a faster version of __threadfence() among threads in the same warp (because threads in a warp act synchronously).
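For illustration, here is a sketch of the classic warp-synchronous reduction idiom (my own example, in the style of the well-known SDK reduction sample), which relies on exactly this property: within a warp, marking the shared memory volatile makes each store visible to the other lanes without an explicit fence or __syncthreads().
// Final reduction steps within a single warp; volatile forces each store
// to shared memory so the other lanes read the updated partial sums.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}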