Vectors with CUDA [duplicate] - cuda

I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.
I would like to know if I can pass a device_vector to my own kernel? How?
If yes, what are the operations permitted on it inside kernel/device code?

As it was originally written, Thrust is purely a host side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:
thrust::device_vector< Foo > fooVector;
// Do something thrust-y with fooVector
Foo* fooArray = thrust::raw_pointer_cast( fooVector.data() );
// Pass raw array and its size to kernel
someKernelCall<<< x, y >>>( fooArray, fooVector.size() );
and you can also use device memory not allocated by thrust within thrust algorithms by instantiating a thrust::device_ptr with the bare cuda device memory pointer.
Edited four and half years later to add that as per #JackOLantern's answer, thrust 1.8 adds a sequential execution policy which means you can run single threaded versions of thrust's alogrithms on the device. Note that it still isn't possible to directly pass a thrust device vector to a kernel and device vectors can't be directly used in device code.
Note that it is also possible to use the thrust::device execution policy in some cases to have parallel thrust execution launched by a kernel as a child grid. This requires separate compilation/device linkage and hardware which supports dynamic parallelism. I am not certain whether this is actually supported in all thrust algorithms or not, but certainly works with some.

This is an update to my previous answer.
Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel within a single CUDA thread exploiting CUDA dynamic parallelism. Below, an example is reported.
#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
#define BLOCKSIZE_1D 256
#define BLOCKSIZE_2D_X 32
#define BLOCKSIZE_2D_Y 32
/*************************/
/* TEST KERNEL FUNCTIONS */
/*************************/
__global__ void test1(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {
const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::seq, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}
__global__ void test2(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {
const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::device, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}
/********/
/* MAIN */
/********/
int main() {
const int Nrows = 64;
const int Ncols = 2048;
gpuErrchk(cudaFree(0));
// size_t DevQueue;
// gpuErrchk(cudaDeviceGetLimit(&DevQueue, cudaLimitDevRuntimePendingLaunchCount));
// DevQueue *= 128;
// gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, DevQueue));
float *h_data = (float *)malloc(Nrows * Ncols * sizeof(float));
float *h_results = (float *)malloc(Nrows * sizeof(float));
float *h_results1 = (float *)malloc(Nrows * sizeof(float));
float *h_results2 = (float *)malloc(Nrows * sizeof(float));
float sum = 0.f;
for (int i=0; i<Nrows; i++) {
h_results[i] = 0.f;
for (int j=0; j<Ncols; j++) {
h_data[i*Ncols+j] = i;
h_results[i] = h_results[i] + h_data[i*Ncols+j];
}
}
TimingGPU timerGPU;
float *d_data; gpuErrchk(cudaMalloc((void**)&d_data, Nrows * Ncols * sizeof(float)));
float *d_results1; gpuErrchk(cudaMalloc((void**)&d_results1, Nrows * sizeof(float)));
float *d_results2; gpuErrchk(cudaMalloc((void**)&d_results2, Nrows * sizeof(float)));
gpuErrchk(cudaMemcpy(d_data, h_data, Nrows * Ncols * sizeof(float), cudaMemcpyHostToDevice));
timerGPU.StartCounter();
test1<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing approach nr. 1 = %f\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<Nrows; i++) {
if (h_results1[i] != h_results[i]) {
printf("Approach nr. 1; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
return 0;
}
}
timerGPU.StartCounter();
test2<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing approach nr. 2 = %f\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<Nrows; i++) {
if (h_results1[i] != h_results[i]) {
printf("Approach nr. 2; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
return 0;
}
}
printf("Test passed!\n");
}
The above example performs reductions of the rows of a matrix in the same sense as Reduce matrix rows with CUDA, but it is done differently from the above post, namely, by calling CUDA Thrust primitives directly from user written kernels. Also, the above example serves to compare the performance of the same operations when done with two execution policies, namely, thrust::seq and thrust::device. Below, some graphs showing the difference in performance.
The performance has been evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.

I would like to provide an updated answer to this question.
Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.
If you want parallel execution within a thread, then you may consider using CUB which provides reduction routines that can be called from within a threadblock, provided that your card enables dynamic parallelism.
Here is the example with Thrust
#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void test(float *d_A, int N) {
float sum = thrust::reduce(thrust::seq, d_A, d_A + N);
printf("Device side result = %f\n", sum);
}
int main() {
const int N = 16;
float *h_A = (float*)malloc(N * sizeof(float));
float sum = 0.f;
for (int i=0; i<N; i++) {
h_A[i] = i;
sum = sum + h_A[i];
}
printf("Host side result = %f\n", sum);
float *d_A; gpuErrchk(cudaMalloc((void**)&d_A, N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));
test<<<1,1>>>(d_A, N);
}

If you mean to use the data allocated / processed by thrust yes you can, just get the raw pointer of the allocated data.
int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);
if you want to allocate thrust vectors in the kernel I never tried but I don't think will work
and also if it works I don't think it will provide any benefit.

Related

Profiling a CUDA matrix addition code, using nvprof: the code API profiles, the kernel does not

I am using a remote workstation with nvidia Geforce Gpu , and after compiling and executing when I try to profile this shows up in the screen
and this is the output when i run nvidia-smi
#include <stdio.h>
#include <cuda.h>
#include <math.h>
__global__ void matrixInit(double *matrix, int width, int height, double value){
for(int i = (threadIdx.x + blockIdx.x * blockDim.x); i<width; i+=(blockDim.x * gridDim.x)){
for(int j = (threadIdx.y + blockIdx.y * blockDim.y); j<height; j+=(blockDim.y * gridDim.y)){
matrix[j * width +i] = value;
}
}
}
__global__ void matrixAdd(double *d_A, double *d_B, double *d_C, int width, int height){
int ix = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
int stride_x = blockDim.x * gridDim.x;
int stride_y = blockDim.y * gridDim.y;
for(int j=iy; j<height; j+=stride_y){
for(int i=ix; i<width; i+=stride_x){
int index = j * width +i;
d_C[index] = d_A[index-1] + d_B[index];
}
}
}
int main(){
int Nx = 1<<12;
int Ny = 1<<15;
size_t size = Nx*Ny*sizeof(double);
// host memory pointers
double *A, *B, *C;
// device memory pointers
double *d_A, *d_B, *d_C;
// allocate host memory
A = (double*)malloc(size);
B = (double*)malloc(size);
C = (double*)malloc(size);
// kernel call
int thread = 32;
int block_x = ceil(Nx + thread -1)/thread;
int block_y = ceil(Ny + thread -1)/thread;
dim3 THREADS(thread,thread);
dim3 BLOCKS(block_y,block_x);
// initialize variables
matrixInit<<<BLOCKS,THREADS>>>(A, Nx, Ny, 1.0);
matrixInit<<<BLOCKS,THREADS>>>(B, Nx, Ny, 2.0);
matrixInit<<<BLOCKS,THREADS>>>(C, Nx, Ny, 0.0);
//allocated device memory
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
//copy to device
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Add matrix at GPU
matrixAdd<<<BLOCKS,THREADS>>>(A, B, C, Nx, Ny);
//copy back to host
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(A);
cudaFree(B);
cudaFree(C);
return 0;
}
This is my code. in summary, the result shows these 2 warning messages:
==525867== Warning: 4 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==525867== Warning: 1 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size.
==525867== Profiling result: No kernels were profiled.
matrixInit<<<BLOCKS,THREADS>>>(A, Nx, Ny, 1.0);
matrixInit<<<BLOCKS,THREADS>>>(B, Nx, Ny, 2.0);
matrixInit<<<BLOCKS,THREADS>>>(C, Nx, Ny, 0.0);
You are writing to host memory here, which is not allowed.
Instead you can do matrixInit() directly on the device arrays d_A, d_B and d_C, after the allocations.
Another mistake here:
cudaFree(A);
cudaFree(B);
cudaFree(C);
Those should be d_A, d_B and d_C. Use regular free() for A, B and C.
Your kernels are also not doing what you want. You launch them with a thread per matrix entry, which means there should be no for() loops in the kernels.

Unspecified launch failure after cudaDeviceSynchronize() call when program starts. But no errors using step-through debugging. CUDA

I've spent several hours struggling with unspecified launch failure.
I've come up with a tiny task for myself in order to understand how shared memory works.
Task is to divide array [1, 2, 3, ... , N] into K group of (N / K) elements and find the sum of each group. (Difference between current and previous element of the array equals 1)
I was planning to use N threads in grid divided between K blocks. So every threadblock contains (N / K) threads. Thus one threadblock could be used to compute sum of one group. Also I wanted to dynamically allocate shared memory.
When I start program I got unspecified launch failure after cudaDeviceSynchronize() call. But when I try step-through debugging everthing is ok and works fine.
What am I doing wrong? (Visual Studio 2012 Professional, Compute Capability 2.1) I would very appreciate any help.
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
printf("Error at %s:%d\n",__FILE__,__LINE__); \
printf("%s\n",cudaGetErrorString(x)); \
system("pause"); \
return EXIT_FAILURE;}} while(0)
extern __shared__ double shrd[];
__global__ void kernel(double * a){
size_t threadID_block = blockDim.x * threadIdx.y + threadIdx.x;
size_t blockID_global = (gridDim.x * blockIdx.y + blockIdx.x );
size_t threadID_global = blockID_global * blockDim.x * blockDim.y + threadID_block;
double * temp = &shrd[blockID_global * blockDim.x * blockDim.y];
temp[threadID_block] = static_cast<double>(threadID_global);
__syncthreads();
if (threadID_block == 0){
a[blockID_global] = 0.0;
for (size_t index = 0; index < blockDim.x * blockDim.y; index++){
a[blockID_global] += temp[index];
}
}
}
int main(){
int devNum = 0;
CUDA_CALL(cudaGetDevice(&devNum));
CUDA_CALL(cudaSetDevice(devNum));
dim3 gridSize(2,2,1);
dim3 blockSize(4,4,1);
double * dev_a = NULL;
size_t length = gridSize.x * gridSize.y ;
size_t byteSize = length * sizeof(double);
CUDA_CALL(cudaMalloc(&dev_a,byteSize));
size_t shmem_perBlock = blockSize.x * blockSize.y * sizeof(double);
kernel <<< gridSize, blockSize, shmem_perBlock >>> (dev_a);
CUDA_CALL(cudaGetLastError());
CUDA_CALL(cudaDeviceSynchronize());
double * a = new double [length];
CUDA_CALL(cudaMemcpy(a,dev_a,byteSize,cudaMemcpyDeviceToHost));
for (size_t index = 0; index < length; index++){
printf("%.3f\n",a[index]);
}
printf("\n");
CUDA_CALL(cudaFree(dev_a));
CUDA_CALL(cudaDeviceReset());
delete[]a;
system("pause");
return 0;
}
If you are on kepler or later first read this:
http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
Otherwise if you are pre-kepler read this:
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
There are some fundamentals you are missing in terms of CUDA programming. I have given you a template of your code below. It is for clarification of some of these fundamentals. Do not expect this to be optimized as I am expecting you to program the parallel reduction. This will get you started with an understanding on how to use shared memory.
Good Luck!
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#define N 10000
#define K 100
#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
printf("Error at %s:%d\n",__FILE__,__LINE__); \
printf("%s\n",cudaGetErrorString(x)); \
system("pause"); \
return EXIT_FAILURE;}} while(0)
__global__ void kernel(double* a, double* results){
extern __shared__ double shared[];
size_t tid, tid_local, stride;
tid = blockDim.x*blockIdx.x+threadIdx.x; //thread id within all blocks
tid_local = threadIdx.x; //thread id within a block
stride = blockDim.x*gridDim.x; //total number of threads
double *start = &a[K*blockIdx.x]; //each block will get K of a block.
shared[tid_local]=start[tid_local]; //copy K elements into shared memory
__syncthreads();
//Perform Parallel reduction, you will have to implement this
//After parallel reduction, result should be in shared[0]
//for demonstration I made the code serial for each block on thread 0.
//This is for demonstration only.
double sum=0;
if(tid_local==0){
for(int i=0; i<K; i++){
sum+=shared[i];
}
a[blockIdx.x]=sum;
}
}
int main(){
int devNum = 0;
CUDA_CALL(cudaGetDevice(&devNum));
CUDA_CALL(cudaSetDevice(devNum));
double * dev_a = NULL;
double * dev_results=NULL;
CUDA_CALL(cudaMalloc(&dev_a, N*sizeof(double) ));
CUDA_CALL(cudaMalloc(&dev_results, (N/K)*sizeof(double)));
//copy dev_a onto GPU (this is the array you are summing).
dim3 block_size(K, 1, 1);
dim3 grid_size (N/K, 1, 1);
size_t shmem_perBlock = K * sizeof(double);
kernel <<< grid_size, block_size, shmem_perBlock >>> (dev_a, dev_results);
CUDA_CALL(cudaGetLastError());
CUDA_CALL(cudaDeviceSynchronize());
//copy dev_results back to CPU, this is your result.
CUDA_CALL(cudaFree(dev_a));
CUDA_CALL(cudaFree(dev_results));
system("pause");
return 0;
}

loop unrolling with dynamic parallelism decrease the time performance

I have a simple program to calculate square root, loop unrolling was done as
loop unrolling
#include <stdio.h>
#include <cuda.h>
__global__ void square(float *a, int N,int idx);
// Kernel that executes on the CUDA device
__global__ void first(float *arr, int N)
{
int idx = 2*(blockIdx.x * blockDim.x + threadIdx.x);
int n=N;
//printf("%d\n",n);
for(int q=0;q<2;q++)
{
if(N<2000)
{
arr[idx+q] = arr[idx+q] * arr[idx+q];
}
}
}
// main routine that executes on the host
int main(void)
{
clock_t start = clock(),diff;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 1000; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
//int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
first <<< 4, 128 >>> (a_d, N);
//cudaThreadSynchronize();
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000);
}
then realizing that the loop calculation can be minimized with dynamic parallelism .
unrolling with dynamic parallelism was implemented as
unrolling with dynamic parallelism
#include <stdio.h>
#include <cuda.h>
__global__ void square(float *a, int N,int idx);
// Kernel that executes on the CUDA device
__global__ void first(float *arr, int N)
{
int idx = 2*(blockIdx.x * blockDim.x + threadIdx.x);
int n=N;
square <<< 1,2 >>> (arr, n,idx);
}
__global__ void square(float *a, int N,int idx)
{
int tdx = blockIdx.x * blockDim.x + threadIdx.x;
printf("%d\n",N);
if(N<2000)
{
a[tdx+idx] = a[tdx+idx] * a[tdx+idx];
}
}
// main routine that executes on the host
int main(void)
{
clock_t start = clock(),diff;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 1000; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
//int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
first <<< 4, 128 >>> (a_d, N);
//cudaThreadSynchronize();
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000);
}
the implementation of dynamic parallelism with unrolling takes more time for executio than only unrolling. Aren,t we suppose to improve execution time with dynamic parallelism in such case?
Dynamic parallelism is mainly useful in cases where you have parallelism that is dynamic. That is: cases where you don't know how much parallelism you're going to need until you've done some calculation. Rather than transfer data back to the host which is then instantly fed into parameterising another launch, you launch from within the kernel. In this pattern, with memcpys between kernel launches avoided, you'll see speedup.
In your example above this is not the case. You could have just launched twice as many threads from the host. There's nothing dynamic required as there's no parallelism available there that you didn't know about at the time of the first kernel launch.
Furthermore, performance requirements for kernels launched using dynamic parallelism are similar to that of those launched from the host. You have to launch a reasonable amount of work or the launch latency will dominate your computation time.

Using thrust::device_vector in device code [duplicate]

I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.
I would like to know if I can pass a device_vector to my own kernel? How?
If yes, what are the operations permitted on it inside kernel/device code?
As it was originally written, Thrust is purely a host side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:
thrust::device_vector< Foo > fooVector;
// Do something thrust-y with fooVector
Foo* fooArray = thrust::raw_pointer_cast( fooVector.data() );
// Pass raw array and its size to kernel
someKernelCall<<< x, y >>>( fooArray, fooVector.size() );
and you can also use device memory not allocated by thrust within thrust algorithms by instantiating a thrust::device_ptr with the bare cuda device memory pointer.
Edited four and half years later to add that as per #JackOLantern's answer, thrust 1.8 adds a sequential execution policy which means you can run single threaded versions of thrust's alogrithms on the device. Note that it still isn't possible to directly pass a thrust device vector to a kernel and device vectors can't be directly used in device code.
Note that it is also possible to use the thrust::device execution policy in some cases to have parallel thrust execution launched by a kernel as a child grid. This requires separate compilation/device linkage and hardware which supports dynamic parallelism. I am not certain whether this is actually supported in all thrust algorithms or not, but certainly works with some.
This is an update to my previous answer.
Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel within a single CUDA thread exploiting CUDA dynamic parallelism. Below, an example is reported.
#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
#define BLOCKSIZE_1D 256
#define BLOCKSIZE_2D_X 32
#define BLOCKSIZE_2D_Y 32
/*************************/
/* TEST KERNEL FUNCTIONS */
/*************************/
__global__ void test1(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {
const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::seq, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}
__global__ void test2(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {
const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::device, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}
/********/
/* MAIN */
/********/
int main() {
const int Nrows = 64;
const int Ncols = 2048;
gpuErrchk(cudaFree(0));
// size_t DevQueue;
// gpuErrchk(cudaDeviceGetLimit(&DevQueue, cudaLimitDevRuntimePendingLaunchCount));
// DevQueue *= 128;
// gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, DevQueue));
float *h_data = (float *)malloc(Nrows * Ncols * sizeof(float));
float *h_results = (float *)malloc(Nrows * sizeof(float));
float *h_results1 = (float *)malloc(Nrows * sizeof(float));
float *h_results2 = (float *)malloc(Nrows * sizeof(float));
float sum = 0.f;
for (int i=0; i<Nrows; i++) {
h_results[i] = 0.f;
for (int j=0; j<Ncols; j++) {
h_data[i*Ncols+j] = i;
h_results[i] = h_results[i] + h_data[i*Ncols+j];
}
}
TimingGPU timerGPU;
float *d_data; gpuErrchk(cudaMalloc((void**)&d_data, Nrows * Ncols * sizeof(float)));
float *d_results1; gpuErrchk(cudaMalloc((void**)&d_results1, Nrows * sizeof(float)));
float *d_results2; gpuErrchk(cudaMalloc((void**)&d_results2, Nrows * sizeof(float)));
gpuErrchk(cudaMemcpy(d_data, h_data, Nrows * Ncols * sizeof(float), cudaMemcpyHostToDevice));
timerGPU.StartCounter();
test1<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing approach nr. 1 = %f\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<Nrows; i++) {
if (h_results1[i] != h_results[i]) {
printf("Approach nr. 1; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
return 0;
}
}
timerGPU.StartCounter();
test2<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing approach nr. 2 = %f\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<Nrows; i++) {
if (h_results1[i] != h_results[i]) {
printf("Approach nr. 2; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
return 0;
}
}
printf("Test passed!\n");
}
The above example performs reductions of the rows of a matrix in the same sense as Reduce matrix rows with CUDA, but it is done differently from the above post, namely, by calling CUDA Thrust primitives directly from user written kernels. Also, the above example serves to compare the performance of the same operations when done with two execution policies, namely, thrust::seq and thrust::device. Below, some graphs showing the difference in performance.
The performance has been evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.
I would like to provide an updated answer to this question.
Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.
If you want parallel execution within a thread, then you may consider using CUB which provides reduction routines that can be called from within a threadblock, provided that your card enables dynamic parallelism.
Here is the example with Thrust
#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void test(float *d_A, int N) {
float sum = thrust::reduce(thrust::seq, d_A, d_A + N);
printf("Device side result = %f\n", sum);
}
int main() {
const int N = 16;
float *h_A = (float*)malloc(N * sizeof(float));
float sum = 0.f;
for (int i=0; i<N; i++) {
h_A[i] = i;
sum = sum + h_A[i];
}
printf("Host side result = %f\n", sum);
float *d_A; gpuErrchk(cudaMalloc((void**)&d_A, N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));
test<<<1,1>>>(d_A, N);
}
If you mean to use the data allocated / processed by thrust yes you can, just get the raw pointer of the allocated data.
int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);
if you want to allocate thrust vectors in the kernel I never tried but I don't think will work
and also if it works I don't think it will provide any benefit.

Thrust inside user written kernels

I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.
I would like to know if I can pass a device_vector to my own kernel? How?
If yes, what are the operations permitted on it inside kernel/device code?
As it was originally written, Thrust is purely a host side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:
thrust::device_vector< Foo > fooVector;
// Do something thrust-y with fooVector
Foo* fooArray = thrust::raw_pointer_cast( fooVector.data() );
// Pass raw array and its size to kernel
someKernelCall<<< x, y >>>( fooArray, fooVector.size() );
and you can also use device memory not allocated by thrust within thrust algorithms by instantiating a thrust::device_ptr with the bare cuda device memory pointer.
Edited four and half years later to add that as per #JackOLantern's answer, thrust 1.8 adds a sequential execution policy which means you can run single threaded versions of thrust's alogrithms on the device. Note that it still isn't possible to directly pass a thrust device vector to a kernel and device vectors can't be directly used in device code.
Note that it is also possible to use the thrust::device execution policy in some cases to have parallel thrust execution launched by a kernel as a child grid. This requires separate compilation/device linkage and hardware which supports dynamic parallelism. I am not certain whether this is actually supported in all thrust algorithms or not, but certainly works with some.
This is an update to my previous answer.
Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel within a single CUDA thread exploiting CUDA dynamic parallelism. Below, an example is reported.
#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include "TimingGPU.cuh"
#include "Utilities.cuh"
#define BLOCKSIZE_1D 256
#define BLOCKSIZE_2D_X 32
#define BLOCKSIZE_2D_Y 32
/*************************/
/* TEST KERNEL FUNCTIONS */
/*************************/
__global__ void test1(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {
const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::seq, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}
__global__ void test2(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {
const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::device, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}
/********/
/* MAIN */
/********/
int main() {
const int Nrows = 64;
const int Ncols = 2048;
gpuErrchk(cudaFree(0));
// size_t DevQueue;
// gpuErrchk(cudaDeviceGetLimit(&DevQueue, cudaLimitDevRuntimePendingLaunchCount));
// DevQueue *= 128;
// gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, DevQueue));
float *h_data = (float *)malloc(Nrows * Ncols * sizeof(float));
float *h_results = (float *)malloc(Nrows * sizeof(float));
float *h_results1 = (float *)malloc(Nrows * sizeof(float));
float *h_results2 = (float *)malloc(Nrows * sizeof(float));
float sum = 0.f;
for (int i=0; i<Nrows; i++) {
h_results[i] = 0.f;
for (int j=0; j<Ncols; j++) {
h_data[i*Ncols+j] = i;
h_results[i] = h_results[i] + h_data[i*Ncols+j];
}
}
TimingGPU timerGPU;
float *d_data; gpuErrchk(cudaMalloc((void**)&d_data, Nrows * Ncols * sizeof(float)));
float *d_results1; gpuErrchk(cudaMalloc((void**)&d_results1, Nrows * sizeof(float)));
float *d_results2; gpuErrchk(cudaMalloc((void**)&d_results2, Nrows * sizeof(float)));
gpuErrchk(cudaMemcpy(d_data, h_data, Nrows * Ncols * sizeof(float), cudaMemcpyHostToDevice));
timerGPU.StartCounter();
test1<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing approach nr. 1 = %f\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<Nrows; i++) {
if (h_results1[i] != h_results[i]) {
printf("Approach nr. 1; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
return 0;
}
}
timerGPU.StartCounter();
test2<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing approach nr. 2 = %f\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));
for (int i=0; i<Nrows; i++) {
if (h_results1[i] != h_results[i]) {
printf("Approach nr. 2; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
return 0;
}
}
printf("Test passed!\n");
}
The above example performs reductions of the rows of a matrix in the same sense as Reduce matrix rows with CUDA, but it is done differently from the above post, namely, by calling CUDA Thrust primitives directly from user written kernels. Also, the above example serves to compare the performance of the same operations when done with two execution policies, namely, thrust::seq and thrust::device. Below, some graphs showing the difference in performance.
The performance has been evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.
I would like to provide an updated answer to this question.
Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.
If you want parallel execution within a thread, then you may consider using CUB which provides reduction routines that can be called from within a threadblock, provided that your card enables dynamic parallelism.
Here is the example with Thrust
#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void test(float *d_A, int N) {
float sum = thrust::reduce(thrust::seq, d_A, d_A + N);
printf("Device side result = %f\n", sum);
}
int main() {
const int N = 16;
float *h_A = (float*)malloc(N * sizeof(float));
float sum = 0.f;
for (int i=0; i<N; i++) {
h_A[i] = i;
sum = sum + h_A[i];
}
printf("Host side result = %f\n", sum);
float *d_A; gpuErrchk(cudaMalloc((void**)&d_A, N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));
test<<<1,1>>>(d_A, N);
}
If you mean to use the data allocated / processed by thrust yes you can, just get the raw pointer of the allocated data.
int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);
if you want to allocate thrust vectors in the kernel I never tried but I don't think will work
and also if it works I don't think it will provide any benefit.