I am learning CUDA. Today I tried some code from the book CUDA Application Design and Development, and the result surprised me. Why is CUDA Thrust so slow? Here is the code and the output.
#include <iostream>
using namespace std;

#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <device_launch_parameters.h>
#include "GpuTimer.h"

__global__ void fillKernel(int *a, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) a[tid] = tid;
}

void fill(int *d_a, int n)
{
    int nThreadsPerBlock = 512;
    int nBlock = n/nThreadsPerBlock + ((n/nThreadsPerBlock)?1:0);
    fillKernel<<<nBlock, nThreadsPerBlock>>>(d_a, n);
}

int main()
{
    const int N = 500000;
    GpuTimer timer1, timer2;
    thrust::device_vector<int> a(N);
    fill(thrust::raw_pointer_cast(&a[0]), N);

    timer1.Start();
    int sumA = thrust::reduce(a.begin(), a.end(), 0);
    timer1.Stop();
    cout << "Thrust reduce costs " << timer1.Elapsed() << "ms." << endl;

    int sumCheck = 0;
    timer2.Start();
    for (int i = 0; i < N; i++)
        sumCheck += i;
    timer2.Stop();
    cout << "Traditional reduce costs " << timer2.Elapsed() << "ms." << endl;

    if (sumA == sumCheck)
        cout << "Correct!" << endl;
    return 0;
}
You don't have a valid comparison. Your GPU code is doing this:
int sumA = thrust::reduce(a.begin(), a.end(), 0);
Your CPU code is doing this:
for(int i = 0; i < N; i++)
sumCheck += i;
There are so many problems with this methodology I'm not sure where to start. First of all, the GPU operation is a valid reduction which will give a valid result for any sequence of numbers in the vector a. It so happens that you have the sequence 0 to N-1 in a, but it doesn't have to be that way, and it would still give a correct result. The CPU code only gives the correct answer for that specific sequence. Secondly, a smart compiler may be able to optimize the heck out of your CPU code, essentially reducing that entire loop to a constant assignment statement. (The sum of 0 to N-1 is just N(N-1)/2, isn't it?) I have no idea what optimizations may be going on under the hood on the CPU side.
A more valid comparison would be to do an actual arbitrary reduction in both cases. An example might be to benchmark thrust::reduce operating on a device vector vs. operating on a host vector. Or write your own serial CPU reduction code that actually operates on a vector, rather than summing the integers from 1 to N.
And as indicated in the comments, if you're serious about wanting help, document things like the hardware and software platform you are running on, and provide all the code. I have no idea what GpuTimer does. I'm voting to close this as "too localized" because I don't think anyone would find a comparison using this methodology useful.
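For reference, here is a minimal sketch of the kind of fairer comparison suggested above (my own illustration, not from the book or the original code): both sides reduce an actual vector of N values, the GPU side is timed with CUDA events so the reduction has finished before the timer stops, and the CPU side uses std::accumulate over an equally sized host vector.

// Sketch only: fairer benchmark of a GPU reduction vs. a serial CPU reduction.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <vector>
#include <numeric>
#include <chrono>
#include <iostream>

int main()
{
    const int N = 500000;
    thrust::device_vector<int> d(N);
    thrust::sequence(d.begin(), d.end());          // 0, 1, ..., N-1 on the device
    std::vector<int> h(N);
    std::iota(h.begin(), h.end(), 0);              // same data on the host

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    long long gpuSum = thrust::reduce(d.begin(), d.end(), 0LL);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMs = 0; cudaEventElapsedTime(&gpuMs, start, stop);

    auto t0 = std::chrono::high_resolution_clock::now();
    long long cpuSum = std::accumulate(h.begin(), h.end(), 0LL);
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    std::cout << "GPU: " << gpuMs << " ms, CPU: " << cpuMs << " ms, "
              << (gpuSum == cpuSum ? "match" : "mismatch") << std::endl;
    return 0;
}

The long long accumulator also sidesteps the int overflow that both sums in the original code quietly rely on cancelling out.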
I am trying to copy non-zero elements of an array to a different array using pointers. I have tried implementing the solution in "thrust copy_if: incomplete type is not allowed", but I get zeros in my resulting array. Here is my code:
This is the predicate functor:
struct is_not_zero
{
    __host__ __device__
    bool operator()(double x)
    {
        return (x != 0);
    }
};
And this is where the copy_if function is used:
double out[5];
thrust::device_ptr<double> output = thrust::device_pointer_cast(out);
double *test1;
thrust::device_ptr<double> gauss_res(hostResults1);

thrust::copy_if(thrust::host, gauss_res, gauss_res+3, output, is_not_zero());

test1 = thrust::raw_pointer_cast(output);
for (int i = 0; i < 6; i++) {
    cout << test1[i] << " the number " << endl;
}
where hostResults1 is the output array from a kernel.
You are making a variety of errors, as discussed in the comments, and you've not provided a complete code, so it's not possible to state exactly what all of them are. Generally speaking, you appear to be mixing up device and host activity and pointers. These should generally be kept separate and treated separately in algorithms. The exception would be copying from device to host, but this can't be done with thrust::copy and raw pointers; you must use vector iterators or properly decorated thrust device pointers.
Here is a complete example based on what you have shown:
$ cat t66.cu
#include <thrust/copy.h>
#include <iostream>
#include <thrust/device_ptr.h>

struct is_not_zero
{
    __host__ __device__
    bool operator()(double x)
    {
        return (x != 0);
    }
};

int main(){

    const int ds = 5;
    double *out, *hostResults1;
    cudaMalloc(&out, ds*sizeof(double));
    cudaMalloc(&hostResults1, ds*sizeof(double));
    cudaMemset(out, 0, ds*sizeof(double));
    double test1[ds];
    for (int i = 0; i < ds; i++) test1[i] = 1;
    test1[3] = 0;
    cudaMemcpy(hostResults1, test1, ds*sizeof(double), cudaMemcpyHostToDevice);
    thrust::device_ptr<double> output = thrust::device_pointer_cast(out);
    thrust::device_ptr<double> gauss_res(hostResults1);
    thrust::copy_if(gauss_res, gauss_res+ds, output, is_not_zero());
    cudaMemcpy(test1, out, ds*sizeof(double), cudaMemcpyDeviceToHost);
    for (int i = 0; i < ds; i++) {
        std::cout << test1[i] << " the number " << std::endl;
    }
}
$ nvcc -o t66 t66.cu
$ ./t66
1 the number
1 the number
1 the number
1 the number
0 the number
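As a variant (my own sketch, not part of the original answer), the same thing can be written without raw device pointers at all by using thrust::device_vector, which also tells you how many elements were actually kept:

// Sketch: copy_if with thrust vectors instead of raw pointers.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

struct is_not_zero
{
    __host__ __device__
    bool operator()(double x) { return x != 0; }
};

int main(){
    thrust::host_vector<double> h(5, 1.0);
    h[3] = 0.0;                                    // one zero to be filtered out
    thrust::device_vector<double> in = h;          // host -> device copy
    thrust::device_vector<double> out(in.size());
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(), is_not_zero());
    int kept = end - out.begin();                  // number of non-zero elements
    thrust::host_vector<double> res(out.begin(), end);
    std::cout << kept << " elements kept" << std::endl;
    for (int i = 0; i < kept; i++) std::cout << res[i] << std::endl;
}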
Is there a way to get CUDA's nvprof to include function calls like malloc in its statistical profiler?
I've been trying to improve the performance of my application. Naturally, I've been using nvprof as a tool in that effort.
Recently, in an effort to reduce the GPU memory footprint of my application, I wrote code that made it take twice as long to run. However, the new code that caused the slow-down accounted for only a small share of the samples in the profiler (instruction sampling indicated that about 10% of the time was being spent in the new code, whereas a naive estimate would put it at 50%). Maybe the new code caused more cache thrashing, maybe putting the implementation in a header file so it could be inlined confused the profiler, etc. However, on little more than a hunch, I suspected the new code's calls to malloc.
Indeed, after I reduced the number of malloc calls, my performance increased, almost back to where it was before incorporating the new code.
This led me to a related question: why didn't the calls to malloc show up in the statistical profiler? Are the malloc calls some sort of GPU system call that can't be observed?
Below, I include an example program and its output that showcase this particular issue.
#include <iostream>
#include <numeric>
#include <thread>
#include <chrono>   // needed for the std::chrono timing below
#include <stdlib.h>
#include <stdio.h>
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

__global__ void countup()
{
    long sum = 0;
    for (long i = 0; i < (1 << 23); ++i) {
        sum += i;
    }
    printf("sum is %li\n", sum);
}

__global__ void malloc_a_lot() {
    long sum = 0;
    for (int i = 0; i < (1 << 17) * 3; ++i) {
        int * v = (int *) malloc(sizeof(int));
        sum += (long) v;
        free(v);
    }
    printf("sum is %li\n", sum);
}

__global__ void both() {
    long sum = 0;
    for (long i = 0; i < (1 << 23); ++i) {
        sum += i;
    }
    printf("sum is %li\n", sum);
    sum = 0;
    for (int i = 0; i < (1 << 17) * 3; ++i) {
        int * v = (int *) malloc(sizeof(int));
        sum += (long) v;
        free(v);
    }
    printf("sum is %li\n", sum);
}

int main(void)
{
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t1 = std::chrono::system_clock::now();
    countup<<<8,1>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t2 = std::chrono::system_clock::now();
    malloc_a_lot<<<8,1>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t3 = std::chrono::system_clock::now();
    both<<<8,1>>>();
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    std::chrono::time_point<std::chrono::system_clock> t4 = std::chrono::system_clock::now();
    std::chrono::duration<double> duration_1_to_2 = t2 - t1;
    std::chrono::duration<double> duration_2_to_3 = t3 - t2;
    std::chrono::duration<double> duration_3_to_4 = t4 - t3;
    printf("timer for countup() took %.3lf\n", duration_1_to_2.count());
    printf("timer for malloc_a_lot() took %.3lf\n", duration_2_to_3.count());
    printf("timer for both() took %.3lf\n", duration_3_to_4.count());
    return 0;
}

static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
An elided version of the results is:
sum is 35184367894528...
sum is -319453208467532096...
sum is 35184367894528...
sum is -319453208467332416...
timer for countup() took 4.034
timer for malloc_a_lot() took 4.306
timer for both() took 8.343
A profiling result is shown in the following graphic. The numbers that show up when mousing-over the light blue bars are consistent with the size of the bars. Specifically, Line 41 has 16,515,077 samples associated with it, but Line 47 only has 633,996 samples.
BTW, the program above is compiled with debug information and presumably no optimization -- the default "Debug" mode for compiling in Nsight Eclipse. If I compile in "Release" mode, optimization is invoked, and the countup() call's duration is very close to 0 seconds.
The current NVIDIA GPU PC Sampler only collects the current warp program counter (not a call stack). The PC sampler will correctly collect samples inside of malloc; however, the tool does not show SASS or high-level source for internal syscalls. Specifically:
(1) The tool does not have a UI to show an aggregated count for samples inside a syscall module.
(2) The tool does not know the PC ranges for malloc, free, or other syscalls, so it cannot correctly attribute the samples to the user-called syscall.
If (1) or (2) were fixed, the data would be shown on a separate row simply labelled "syscall" or "malloc". The hardware does not collect call stacks, so it is not possible to attribute the samples to line 48 of your source.
Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each block either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
Now, to write the actual values, atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, the atomic operation, and the fact that a warp executes in lockstep, I would expect the array at any point in time to contain only 1's or only 0's, never a mix of the two.
However, when running code like this, there are several cases where at some point in time the array contains a mix of 0's and 1's. This appears to indicate that atomic operations are not executed per warp but are instead scheduled using some other scheme.
From other sources I have not found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. I need to write many small vectors, each consisting of several 32-bit integers, atomically to global memory, so an operation that is guaranteed to write a single vector atomically is very important to me.
For those wondering, the code I wrote was executed on a GTX 970, compiled on compute capability 5.2, using CUDA 8.0.
The atomic instructions, like all instructions, are scheduled per warp. However, there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through that pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage of the pipeline. This gives rise to the behavior you observe.
I believe a simple thought experiment will demonstrate that this must be true: what if two threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where multiple instructions are issued per clock within an SM, and even to execution across SMs, as additional examples.
If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
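For illustration only (this sketch is mine, not part of the original answer; the slot layout and names are made up), here is what such a vector-type write might look like, with one lane per warp publishing a 16-byte record:

// Sketch: each warp's lane 0 writes one naturally aligned 16-byte record.
// A 16-byte store to a naturally aligned address is issued as a single
// 128-bit transaction, which is the property the paragraph above relies on.
__global__ void publish(int4 *slots, int value)
{
    int lane = threadIdx.x % 32;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (lane == 0) {
        slots[warp] = make_int4(value, value, value + 1, value + 2);  // arbitrary payload
    }
}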
However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that really what is desired is access control (so that only one warp or threadblock is updating a particular vector at a time) and OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
With this 3-level scheme, we can allow ordinary access to the vector by both the consumer and multiple producer workers, using a single, ordinary atomic-variable access-control mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>
const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0}; // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;
static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
// this code won't work on a GPU with a single SM!
__global__ void kernel(){
__shared__ volatile int done, update, next_slot;
int my_block_id = atomicAdd(&block_id, 1);
int my_sm = __mysmid();
if (my_block_id == 0){
if (!threadIdx.x){
restricted_sm = my_sm;
__threadfence();
// I am "block 0" and process the vectors, checking for coherency
// "consumer"
next_slot = 0;
volatile int *vslot_control = slot_control;
volatile int *vslots = slots;
int scount = 0;
while(scount < max_act){
if (vslot_control[next_slot] == slot_full){
scount++;
int slot_val = vslots[next_slot*slot_length];
for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
observations[slot_val]++;
vslot_control[next_slot] = slot_empty;
correct++;
__threadfence();
}
next_slot++;
if (next_slot >= num_slots) next_slot = 0;
}
}}
else {
// "producer"
while (restricted_sm < 0); // wait for signaling
if (my_sm == restricted_sm) return;
next_slot = 0;
done = 0;
__syncthreads();
while (!done) {
if (!threadIdx.x){
while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
next_slot++;
if (next_slot >= num_slots) next_slot = 0;}
// we grabbed an empty slot, fill it with my_sm
if (atomicAdd(&num_act, 1) < max_act) update = 1;
else {done = 1; update = 0;}
}
__syncthreads();
if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
__threadfence(); //enforce ordering
if ((update) && (!threadIdx.x)){
slot_control[next_slot] = 2; // mark slot full
atomicAdd(actives+my_sm, 1);}
__syncthreads();
}
}
}
int main(){
kernel<<<256, slot_length>>>();
cudaDeviceSynchronize();
cudaError_t res= cudaGetLastError();
if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
int *h_obs = new int[max_sm];
int *h_act = new int[max_sm];
int h_correct;
cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
int h_total_act = 0;
int h_total_obs = 0;
for (int i = 0; i < max_sm; i++){
std::cout << h_act[i] << "," << h_obs[i] << " ";
h_total_act += h_act[i];
h_total_obs += h_obs[i];}
std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect-free for any use case. It is offered to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on Linux, on a couple of different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer and the remaining SMs are used by the producers.
In a CUDA C project, I would like to try and use the Thrust library in order to find the maximum element inside an array of floats. It seems like the Thrust function thrust::max_element() is what I need. The array on which I want to use this function is the result of a cuda kernel (which seems to work fine) and so it is already present in device memory when calling thrust::max_element().
I am not very familiar with the Thrust library but after looking at the documentation for thrust::max_element() and reading the answers to similar questions on this site, I thought I had grasped the working principles of this process. Unfortunately I get wrong results and it seems that I am not using the library functions correctly. Can somebody please tell me what is wrong in my code?
float* deviceArray;
float* max;
int length = 1025;
*max = 0.0f;
size = (int) length*sizeof(float);
cudaMalloc(&deviceArray, size);
cudaMemset(deviceArray, 0.0f, size);
// here I launch a cuda kernel which modifies deviceArray
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(deviceArray);
*max = *(thrust::max_element(d_ptr, d_ptr + length));
I use the following headers:
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
I keep getting zero values for *max even though I am sure that deviceArray contains non-zero values after running the kernel.
I am using nvcc as a compiler (CUDA 7.0) and I am running the code on a device with compute capability 3.5.
Any help would be much appreciated. Thanks.
This is not proper C code:
float* max;
int length = 1025;
*max = 0.0f;
You're not allowed to store data using a pointer (max) until you properly provide an allocation for that pointer (and set the pointer equal to the address of that allocation).
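For illustration only (this small snippet is mine, not part of the original answer), the simplest correction of that fragment is to use a plain float object rather than an unallocated pointer:

// A plain object needs no allocation and can be assigned directly:
float max = 0.0f;
int length = 1025;
// (Alternatively, allocate valid storage and point max at it before writing through it.)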
Apart from that, the rest of your code seems to work for me:
$ cat t990.cu
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <iostream>
int main(){

    float* deviceArray;
    float max, test;
    int length = 1025;
    max = 0.0f;
    test = 2.5f;
    int size = (int) length*sizeof(float);
    cudaMalloc(&deviceArray, size);
    cudaMemset(deviceArray, 0.0f, size);
    cudaMemcpy(deviceArray, &test, sizeof(float), cudaMemcpyHostToDevice);
    thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(deviceArray);
    max = *(thrust::max_element(d_ptr, d_ptr + length));
    std::cout << max << std::endl;
}
$ nvcc -o t990 t990.cu
$ ./t990
2.5
$
I have an array of doubles stored in GPU global memory and I need to find the maximum value in it. I have read some texts about parallel reduction, so I know that one should divide the array between blocks and have each of them find its own block-level maximum, and so on.
But they never seem to address the issue of threads trying to write to the same memory position simultaneously.
Let's say that local_max = 0.0 at the beginning of a block's execution. Then each thread reads its value from the input vector, decides that it is larger than local_max, and then tries to write its value to local_max. When all of this happens at the same time (at least within the same warp), how can this work and end up with the actual maximum within the block?
I would think either an atomic function or some kind of lock or critical section would be needed, but I haven't seen this addressed in the answers I have found (e.g. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf).
The answer to your question is contained in the very document you linked to, and the SDK reduction example shows concrete implementations of the reduction concept.
For completeness, here is a concrete example of a reduction kernel:
template <typename T, int BLOCKSIZE>
__global__ void reduction(T *inputvals, T *outputvals, int N)
{
    __shared__ volatile T data[BLOCKSIZE];

    T maxval = inputvals[threadIdx.x];
    for (int i = blockDim.x + threadIdx.x; i < N; i += blockDim.x)
    {
        maxfunc(maxval, inputvals[i]);
    }

    data[threadIdx.x] = maxval;
    __syncthreads();

    // Here maxfunc(a,b) sets a to the maximum of a and b
    if (threadIdx.x < 32) {
        for (int i = 32 + threadIdx.x; i < BLOCKSIZE; i += 32) {
            maxfunc(data[threadIdx.x], data[i]);
        }
        if (threadIdx.x < 16) maxfunc(data[threadIdx.x], data[threadIdx.x+16]);
        if (threadIdx.x < 8)  maxfunc(data[threadIdx.x], data[threadIdx.x+8]);
        if (threadIdx.x < 4)  maxfunc(data[threadIdx.x], data[threadIdx.x+4]);
        if (threadIdx.x < 2)  maxfunc(data[threadIdx.x], data[threadIdx.x+2]);
        if (threadIdx.x == 0) {
            maxfunc(data[0], data[1]);
            outputvals[blockIdx.x] = data[0];
        }
    }
}
The key point is using the synchronization that is implicit within a warp to perform the final stage of the reduction in shared memory. The result is a single per-block maximum value. A second reduction pass is required to reduce the set of per-block maximums to the global maximum (often it is faster to do this on the host). In this example, maxfunc is the "compare and set" function, which could be as simple as
template <typename T>
__device__ void maxfunc(T &a, T &b)
{
    a = (b > a) ? b : a;
}
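As a side note (my own sketch, not part of the original answer): on current CUDA versions, relying on implicit warp-synchronous execution is discouraged, and the intra-warp part of such a reduction is usually written with __shfl_down_sync instead:

// Sketch of an intra-warp max reduction using warp shuffles (CUDA 9+).
// Assumes the full warp is active; after the loop, lane 0 holds the warp-wide maximum.
__inline__ __device__ double warpReduceMax(double val)
{
    for (int offset = 16; offset > 0; offset >>= 1) {
        double other = __shfl_down_sync(0xffffffff, val, offset);
        val = (other > val) ? other : val;
    }
    return val;
}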
Don't cook your own code, use some Thrust (included in version 4.0 of the CUDA SDK):
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/extrema.h>
#include <iterator>
#include <iostream>

int main(void)
{
    thrust::host_vector<int> h_vec(10000);
    thrust::sequence(h_vec.begin(), h_vec.end());

    // show h_vec
    thrust::copy(h_vec.begin(), h_vec.end(),
                 std::ostream_iterator<int>(std::cout, "\n"));

    // transfer to device
    thrust::device_vector<int> d_vec = h_vec;

    int max_dvec_value = *thrust::max_element(d_vec.begin(), d_vec.end());

    std::cout << "max value: " << max_dvec_value << "\n";

    return 0;
}
And watch out: thrust::max_element returns an iterator, not the value itself, which is why it is dereferenced above.
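As a small addition of mine (not in the original answer), the returned iterator also gives you the position of the maximum, using the d_vec from the example above:

// Position and value of the maximum element (sketch):
thrust::device_vector<int>::iterator it =
    thrust::max_element(d_vec.begin(), d_vec.end());
int max_pos = it - d_vec.begin();   // index of the maximum
int max_val = *it;                  // value of the maximum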
Your question is clearly answered in the document you link to. I think you just need to spend some more time reading it and understanding the CUDA concepts used in it. In particular, I would focus on shared memory, the __syncthreads() barrier, and how to uniquely identify a thread inside a kernel. Additionally, you should try to understand why the reduction may need to be run in two passes to find the global maximum.