Using nvprof to measure floating point operations of my sample kernels, it seems that there is no metrics for flop_count_dp_div, and the actual double-precision division operations is measured in terms of add/mul/fma of double-precision and even some fma of single-precision operations.
I am wondering why is the case, and how to deduce the dynamic number of division operations of a kernel from nvprof report if I don't have the source code?
My simple test kernel:
#include <iostream>
__global__ void mul(double a, double* x, double* y) {
y[threadIdx.x] = a * x[threadIdx.x];
}
__global__ void div(double a, double* x, double* y) {
y[threadIdx.x] = a / x[threadIdx.x];
}
int main(int argc, char* argv[]) {
const int kDataLen = 4;
double a = 2.0f;
double host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
double host_y[kDataLen];
// Copy input data to device.
double* device_x;
double* device_y;
cudaMalloc(&device_x, kDataLen * sizeof(double));
cudaMalloc(&device_y, kDataLen * sizeof(double));
cudaMemcpy(device_x, host_x, kDataLen * sizeof(double),
cudaMemcpyHostToDevice);
// Launch the kernel.
mul<<<1, kDataLen>>>(a, device_x, device_y);
div<<<1, kDataLen>>>(a, device_x, device_y);
// Copy output data to host.
cudaDeviceSynchronize();
cudaMemcpy(host_y, device_y, kDataLen * sizeof(double),
cudaMemcpyDeviceToHost);
// Print the results.
for (int i = 0; i < kDataLen; ++i) {
std::cout << "y[" << i << "] = " << host_y[i] << "\n";
}
cudaDeviceReset();
return 0;
}
And nvprof output of the two kernels:
nvprof --metrics flop_count_sp \
--metrics flop_count_sp_add \
--metrics flop_count_sp_mul \
--metrics flop_count_sp_fma \
--metrics flop_count_sp_special \
--metrics flop_count_dp \
--metrics flop_count_dp_add \
--metrics flop_count_dp_mul \
--metrics flop_count_dp_fma \
./a.out
==14380== NVPROF is profiling process 14380, command: ./a.out
==14380== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "mul(double, double*, double*)" (done)
Replaying kernel "div(double, double*, double*)" (done)
y[0] = 24 internal events
y[1] = 1
y[2] = 0.666667
y[3] = 0.5
==14380== Profiling application: ./a.out
==14380== Profiling result:
==14380== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce GTX 1080 Ti (0)"
Kernel: mul(double, double*, double*)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 flop_count_dp Floating Point Operations(Double Precision) 4 4 4
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 4 4 4
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
Kernel: div(double, double*, double*)
1 flop_count_sp Floating Point Operations(Single Precision) 8 8 8
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 4 4 4
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 4 4 4
1 flop_count_dp Floating Point Operations(Double Precision) 44 44 44
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 4 4 4
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 20 20 20
it seems that there is no metrics for flop_count_dp_div, t
Because there are no floating point division instructions in CUDA hardware.
and the actual double-precision division operations is measured in terms of add/mul/fma of double-precision and even some fma of single-precision operations.
Because floating point division is implemented using a Newton Raphson iterative method using multiply-add and multiply operations. Possibly even in mixed precision (thus the single precision operations)
how to deduce the dynamic number of division operations of a kernel from nvprof report if I don't have the source code?
You really can't.
Related
I am faced with cuda invalid resource handle error when allocating buffer on gpu.
1, I download the code from git clone https://github.com/Funatiq/gossip.git.
2, I built this project in the gossip directory: git submodule update --init && make. Then I got the compile binary excute here.
3, then I generate a scatter and gather plan for my main GPU, here, it is 0.
$python3 scripts/plan_from_topology_asynch.py gather 0
$python3 scripts/plan_from_topology_asynch.py scatter 0
then it generates scatter_plan.json and gather_plan.json.
4, finally, I execute the plan:
./execute scatter_gather scatter_plan.json gather_plan.json
The error was pointing to these lines of code:
std::vector<size_t> bufs_lens_scatter = scatter.calcBufferLengths(table[main_gpu]);
print_buffer_sizes(bufs_lens_scatter);
std::vector<data_t *> bufs(num_gpus);
std::vector<size_t> bufs_lens(bufs_lens_scatter);
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
cudaSetDevice(context.get_device_id(gpu)); CUERR
cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)
The detailed error is shown as:
RUN: scatter_gather
INFO: 32768 bytes (scatter_gather)
TIMING: 0.463872 ms (malloc_devices)
TIMING: 0.232448 ms (zero_gpu_buffers)
TIMING: 0.082944 ms (init_data)
TIMING: 0.637952 ms (multisplit)
Partition Table:
470 489 534 553 514 515 538 483
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Required buffer sizes:
0 538 717 604 0 344 0 687
TIMING: 3.94455e-31 ms (malloc_buffers)
CUDA error: invalid resource handle : executor.cuh, line 405
For reference, I attached the complete error report here. The curious part is that the author cannot reproduce these error on his server. But when I ran it on DGX workstation with 8 GPUs. This error occurs. I doubt if it is cuda programming error or environment specific issues.
The code has a defect in it, in the handling of cudaEventRecord() as used in the TIMERSTART and TIMERSTOP macros defined here and used here (with the malloc_buffers label).
CUDA events have a device association, impliclitly defined, when they are created. That means they are associated with the device selected by the most recent cudaSetDevice() call. As stated in the programming guide:
cudaEventRecord() will fail if the input event and input stream are associated to different devices.
(note that each device has its own null stream - these events are being recorded into the null stream)
And if we run the code with cuda-memcheck, we observe that the invalid resource handle error is indeed being returned by a call to cudaEventRecord().
Specifically referring to the code here:
...
std::vector<size_t> bufs_lens(bufs_lens_scatter);
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
cudaSetDevice(context.get_device_id(gpu)); CUERR
cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)
The TIMERSTART macro defines and creates 2 cuda events, one of which it immediately records (the start event). The TIMERSTOP macro uses the timer stop event that was created in the TIMERSTART macro. However, we can see that the intervening code has likely changed the device from the one that was in effect when these two events were created (due to the cudaSetDevice call in the for-loop). Therefore, the cudaEventRecord (and cudaEventElapsedTime) calls are failing due to this invalid usage.
As a proof point, when I add cudaSetDevice calls to the macro definitions as follows:
#define TIMERSTART(label) \
cudaEvent_t timerstart##label, timerstop##label; \
float timerdelta##label; \
cudaSetDevice(0); \
cudaEventCreate(&timerstart##label); \
cudaEventCreate(&timerstop##label); \
cudaEventRecord(timerstart##label, 0);
#endif
#ifndef __CUDACC__
#define TIMERSTOP(label) \
stop##label = std::chrono::system_clock::now(); \
std::chrono::duration<double> \
timerdelta##label = timerstop##label-timerstart##label; \
std::cout << "# elapsed time ("<< #label <<"): " \
<< timerdelta##label.count() << "s" << std::endl;
#else
#define TIMERSTOP(label) \
cudaSetDevice(0); \
cudaEventRecord(timerstop##label, 0); \
cudaEventSynchronize(timerstop##label); \
cudaEventElapsedTime( \
&timerdelta##label, \
timerstart##label, \
timerstop##label); \
std::cout << \
"TIMING: " << \
timerdelta##label << " ms (" << \
#label << \
")" << std::endl;
#endif
The code runs without error for me. I'm not suggesting this is the correct fix. The correct fix may be to properly set the device before calling the macro. It seems evident that either the macro writer did not expect this kind of usage, or else was unaware of the hazard.
The only situation I could imagine where the error would not occur would be in a single-device system. When the code maintainer responded to your issue that they could not reproduce the issue, my guess is they have not tested the code on a multi-device system. As near as I can tell, the error would be unavoidable in a multi-device setup.
In cusp, there is a multiply to calculate spmv(sparse matrix vector multiplication) that takes a reduce and a combine:
template <typename LinearOperator,
typename MatrixOrVector1,
typename MatrixOrVector2,
typename UnaryFunction,
typename BinaryFunction1,
typename BinaryFunction2>
void multiply(const LinearOperator& A,
const MatrixOrVector1& B,
MatrixOrVector2& C,
UnaryFunction initialize,
BinaryFunction1 combine,
BinaryFunction2 reduce);
From the interface it seems like custom combine and reduce should be possible for any matrix/vector multiplication. I think cusp supports to use other combine and reduce function defined in thrust/functional.h besides multiplication and plus to calculate spmv. For example, can I use thrust::plus to replace multiplication the original combine function(i.e. multiplication)?
And I guess, this scaled spmv also support those sparse matrix in coo,csr,dia,hyb format.
However, I got a wrong answer when I tested the below example in a.cu whose matrix A was in coo format.
It used plus operator to combine. And I compiled it with cmd : nvcc a.cu -o a to .
#include <cusp/csr_matrix.h>
#include <cusp/monitor.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/krylov/cg.h>
int main(void)
{
// COO format in host memory
int host_I[13] = {0,0,1,1,2,2,2,3,3,3,4,5,5}; // COO row indices
int host_J[13] = {0,1,1,2,2,4,6,3,4,5,5,5,6}; // COO column indices
int host_V[13] = {1,1,1,1,1,1,1,1,1,1,1,1,1};
// x and y arrays in host memory
int host_x[7] = {1,1,1,1,1,1,1};
int host_y[6] = {0,0,0,0,0,0};
// allocate device memory for COO format
int * device_I;
cudaMalloc(&device_I, 13 * sizeof(int));
int * device_J;
cudaMalloc(&device_J, 13 * sizeof(int));
int * device_V;
cudaMalloc(&device_V, 13 * sizeof(int));
// allocate device memory for x and y arrays
int * device_x;
cudaMalloc(&device_x, 7 * sizeof(int));
int * device_y;
cudaMalloc(&device_y, 6 * sizeof(int));
// copy raw data from host to device
cudaMemcpy(device_I, host_I, 13 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_J, host_J, 13 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_V, host_V, 13 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_x, host_x, 7 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_y, host_y, 6 * sizeof(int), cudaMemcpyHostToDevice);
// matrices and vectors now reside on the device
// *NOTE* raw pointers must be wrapped with thrust::device_ptr!
thrust::device_ptr<int> wrapped_device_I(device_I);
thrust::device_ptr<int> wrapped_device_J(device_J);
thrust::device_ptr<int> wrapped_device_V(device_V);
thrust::device_ptr<int> wrapped_device_x(device_x);
thrust::device_ptr<int> wrapped_device_y(device_y);
// use array1d_view to wrap the individual arrays
typedef typename cusp::array1d_view< thrust::device_ptr<int> > DeviceIndexArrayView;
typedef typename cusp::array1d_view< thrust::device_ptr<int> > DeviceValueArrayView;
DeviceIndexArrayView row_indices (wrapped_device_I, wrapped_device_I + 13);
DeviceIndexArrayView column_indices(wrapped_device_J, wrapped_device_J + 13);
DeviceValueArrayView values (wrapped_device_V, wrapped_device_V + 13);
DeviceValueArrayView x (wrapped_device_x, wrapped_device_x + 7);
DeviceValueArrayView y (wrapped_device_y, wrapped_device_y + 6);
// combine the three array1d_views into a coo_matrix_view
typedef cusp::coo_matrix_view<DeviceIndexArrayView,
DeviceIndexArrayView,
DeviceValueArrayView> DeviceView;
// construct a coo_matrix_view from the array1d_views
DeviceView A(6, 7, 13, row_indices, column_indices, values);
std::cout << "\ndevice coo_matrix_view" << std::endl;
cusp::print(A);
cusp::constant_functor<int> initialize;
thrust::plus<int> combine;
thrust::plus<int> reduce;
cusp::multiply(A , x , y , initialize, combine, reduce);
std::cout << "\nx array" << std::endl;
cusp::print(x);
std::cout << "\n y array, y = A * x" << std::endl;
cusp::print(y);
cudaMemcpy(host_y, device_y, 6 * sizeof(int), cudaMemcpyDeviceToHost);
// free device arrays
cudaFree(device_I);
cudaFree(device_J);
cudaFree(device_V);
cudaFree(device_x);
cudaFree(device_y);
return 0;
}
And I got the below answer.
device coo_matrix_view
sparse matrix <6, 7> with 13 entries
0 0 (1)
0 1 (1)
1 1 (1)
1 2 (1)
2 2 (1)
2 4 (1)
2 6 (1)
3 3 (1)
3 4 (1)
3 5 (1)
4 5 (1)
5 5 (1)
5 6 (1)
x array
array1d <7>
(1)
(1)
(1)
(1)
(1)
(1)
(1)
y array, y = A * x
array1d <6>
(4)
(4)
(6)
(6)
(2)
(631)
The vector y I got is strange, I think the correct answer y should be:
[9,
9,
10,
10,
8,
9]
So I do not sure that whether such replacement of combine and reduce can be adapted to other sparse matrix format, like coo. Or maybe the code I wrote above is incorrect to call multiply.
Can you give me some help? Any info will help.
Thank you!
From a very brief reading of the code and instrumentation of your example, this seems to be something badly broken in CUSP causing the problem for this usage case. The code only appears to accidentally work correctly for the case where the combine operator is multiplication because the spurious operations it performs with zero elements do not effect the reduction operation (ie. it just sums a lot of additional zeros).
I'm trying to reach peak performance of each SM from the code below. The peak lies somewhere between 25 GFlops(GTX275-GT200 Arch.). This code gives 8 GFlops at the max.
__global__ void new_ker(float *x)
{
int index = threadIdx.x+blockIdx.x*blockDim.x;
float a,b;
a=0;
b=x[index];
//LOOP=10000000
//No. of blocks = 1
//Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
#pragma unroll 2048
for(int i=0;i<LOOP;i++){
a=a*b+b;
}
x[index] = a;
}
I don't want to increase ILP in the code. Any ideas why it's not reaching peak??
int main(int argc,char **argv)
{
//Initializations
float *x;
float *dx;
cudaEvent_t new_start,new_stop;
float elapsed;
double gflops;
x = 0;
flag = 0;
cudaMalloc((void **)&dx,sizeof(float)*THPB);
//ILP=1
cudaEventCreate(&new_start);
cudaEventCreate(&new_stop);
printf("Kernel1:\n");
cudaEventRecord(new_start, 0);
new_ker<<<BLOCKS,THPB>>>(dx);
cudaEventRecord(new_stop,0);
cudaEventSynchronize(new_stop);
cudaEventElapsedTime(&elapsed,new_start,new_stop);
x = (float *)malloc(sizeof(float)*THPB);
cudaMemcpy(x,dx,sizeof(float)*THPB,cudaMemcpyDeviceToHost);
gflops = ((double)(BLOCKS)*(THPB)*LOOP/elapsed)/1000000;
printf("\t%f",gflops);
cudaEventDestroy(new_start);
cudaEventDestroy(new_stop);
return 0;
}
Platform:
CUDA 3.0
NVIDIA GeForce GTX275 (GT200)
If I put together a complete repro case from your code, using the correct FLOP calculation:
#include <stdio.h>
#define LOOP (10000000)
#define BLOCKS (30)
#define THPB (512)
__global__ void new_ker(float *x)
{
int index = threadIdx.x+blockIdx.x*blockDim.x;
float a,b;
a=0;
b=x[index];
#pragma unroll 2048
for(int i=0;i<LOOP;i++){
a=a*b+b;
}
x[index] = a;
}
int main(int argc,char **argv)
{
//Initializations
float *x;
float *dx;
cudaEvent_t new_start,new_stop;
float elapsed;
double gflops;
x = 0;
cudaMalloc((void **)&dx,sizeof(float)*THPB);
//ILP=1
cudaEventCreate(&new_start);
cudaEventCreate(&new_stop);
printf("Kernel1:\n");
cudaEventRecord(new_start, 0);
new_ker<<<BLOCKS,THPB>>>(dx);
cudaEventRecord(new_stop,0);
cudaEventSynchronize(new_stop);
cudaEventElapsedTime(&elapsed,new_start,new_stop);
x = (float *)malloc(sizeof(float)*THPB*BLOCKS);
cudaMemcpy(x,dx,sizeof(float)*THPB*BLOCKS,cudaMemcpyDeviceToHost);
gflops = 2.0e-6 * ((double)(LOOP)*double(THPB*BLOCKS)/(double)elapsed);
printf("\t%f\n",gflops);
cudaEventDestroy(new_start);
cudaEventDestroy(new_stop);
return 0;
}
And I compile it and run it on a 1.4GHz GTX275 with CUDA 3.2 on a 64 bit linux platform:
$ nvcc -arch=sm_13 -Xptxas="-v" -o perf perf.cu
ptxas info : Compiling entry function '_Z7new_kerPf' for 'sm_13'
ptxas info : Used 4 registers, 8+16 bytes smem, 8 bytes cmem[1]
$ ./perf
Kernel1:
671.806039
I get within 0.01% of peak FLOP/s for that card running a pure FMAD code (1.4 GHz * 2 FLOP * 8 cores/MP * 30 MP) = 672 GFLOP/s.
So it seems that the code does, in fact, hit peak FLOP/s with one block per multiprocessor, but you just are not calculating the FLOP/s number correctly.
I found what Thrust can provide is quite limited, as below code shows:
I end up to have 9*9*2 (1 multiple + 1 reduce) Thrust calls, which is 162 kernel launches.
While if I write my own kernel, only 1 kernel launch needed.
for(i=1;i<=9;i++)
{
for(j=i;j<=9;j++)
{
ATA[i][j]=0;
for(m=1;m<=50000;m++)
ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
}
}
Then I end up with below Thrust implementation:
for(i=1;i<=dim0;i++)
{
for(j=i;j<=dim0;j++)
{
thrust::transform(t_d_X+(idx0[i]-1)*(1+iNumPaths)+1, t_d_X+(idx0[i]-1)*(1+iNumPaths)+iNumPaths+1, t_d_X+(idx0[j]-1)*(1+iNumPaths)+1,t_d_cdataMulti, thrust::multiplies<double>());
ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti+iNumPaths, (double) 0, thrust::plus<double>());
}
}
Some analysis:
transform_reduce: will NOT help, as there is a pointer redirect idx0[i], and basically there are 2 arrays involved. 1st one is X[idx0[i]], 2nd one is X[idx0[j]]
reduce_by_key: will help. But I need to store all interim results into one big array, and prepare a huge mapping key table with same size. Will try it out.
transform_iterator: will NOT help, same reason as 1.
Think I can't avoid writing my own kernel?
I'll bet #m.s. can provide a more efficient approach. But here is one possible approach. In order to get the entire computation reduced to a single kernel call by thrust, it is necessary to handle everything with a single thrust algorithm call. At the heart of the operation, we are summing many computations together, to fill a matrix. Therefore I believe thrust::reduce_by_key is an appropriate thrust algorithm to use. This means we must realize all other transformations using various thrust "fancy iterators", which are mostly covered in the thrust getting started guide.
Attempting to do this (handle everything with a single kernel call) makes the code very dense and hard to read. I don't normally like to demonstrate thrust this way, but since it is the crux of your question, it cannot be avoided. Therefore let's unpack the sequence of operations contained in the call to reduce_by_key, approximately from the inward out. The general basis of this algorithm is to "flatten" all data into a single long logical vector. Let's assume for understanding that our square matrix dimensions are only 2x2 and the length of our m vector is 3. You can think of the "flattening" or linear-index conversion like this:
linear index: 0 1 2 3 4 5 6 7 8 9 10 11
i index: 0 0 0 0 0 0 1 1 1 1 1 1
j index: 0 0 0 1 1 1 0 0 0 1 1 1
m index: 0 1 2 0 1 2 0 1 2 0 1 2
k index: 0 0 0 1 1 1 2 2 2 3 3 3
The "k index" above is our keys that will ultimately be used by reduce_by_key to collect product terms together, for each element of the matrix. Note that the code has EXT_I, EXT_J, EXT_M, and EXT_K helper macros which will define, using thrust placeholders, the operation to be performed on the linear index (created using a counting_iterator) to produce the various other "indices".
The first thing we will need to do is construct a suitable thrust operation to convert the linear index into the transformed value of idx0[i] (again, working from "inward to outward"). We can do this with a permutation iterator on idx0 vector, with a transform_iterator supplying the "map" for the permutation iterator - this transform iterator just converts the linear index (mb) to an "i" index:
thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I))
Now we need to combine the result from step 1 with the other index - m in this case, to generate a linearized version of the 2D index into X (d_X is the vector-linearized version of X). To do this, we will combine the result of step one in a zip_iterator with another transform iterator that creates the m index. This zip_iterator will be passed to a transform_iterator which takes the two indices and converts it into a linearized index to "look into" the d_X vector:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx()))
create_Xidx is the functor that takes the two computed indices and converts it into the linear index into d_X
With the result from step 2, we can then use a permutation iterator to grab the appropriate value from d_X for the first term in the multiplication:
thrust::make_permutation_iterator(d_X.begin(), {code from step 2})
repeat steps 1,2,3, using EXT_J instead of EXT_I, to create the second term in the multiplication:
X[idx0[i]][m]*X[idx0[j]][m]
Place the terms created in step 3 and 4 into a zip_iterator, for use by the transform_iterator that will multiply the two together (using my_mult functor) to create the actual product:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple({result from step 3}, {result from step 4}, my_mult())
The remainder of the reduce_by_key is fairly straightforward. We create the keys index as described previously, and then use it to sum together the various products for each element of the square matrix.
Here is a fully worked example:
$ cat t875.cu
#include <iostream>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
// rows
#define D1 9
// cols
#define D2 9
// size of m
#define D3 50
// helpers to convert linear indices to i,j,m or "key" indices
#define EXT_I (_1/(D2*D3))
#define EXT_J ((_1/(D3))%D2)
#define EXT_M (_1%D3)
#define EXT_K (_1/D3)
void test_cpu(float ATA[][D2], float X[][D3], int idx0[]){
for(int i=0;i<D1;i++)
{
for(int j=0;j<D2;j++)
{
ATA[i][j]=0;
for(int m=0;m<D3;m++)
ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
}
}
}
using namespace thrust::placeholders;
struct create_Xidx : public thrust::unary_function<thrust::tuple<int, int>, int>{
__host__ __device__
int operator()(thrust::tuple<int, int> &my_tuple){
return (thrust::get<0>(my_tuple) * D3) + thrust::get<1>(my_tuple);
}
};
struct my_mult : public thrust::unary_function<thrust::tuple<float, float>, float>{
__host__ __device__
float operator()(thrust::tuple<float, float> &my_tuple){
return thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple);
}
};
int main(){
//synthesize data
float ATA[D1][D2];
float X[D1][D3];
int idx0[D1];
thrust::host_vector<float> h_X(D1*D3);
thrust::host_vector<int> h_idx0(D1);
for (int i = 0; i < D1; i++){
idx0[i] = (i + 2)%D1; h_idx0[i] = idx0[i];
for (int j = 0; j < D2; j++) {ATA[i][j] = 0;}
for (int j = 0; j < D3; j++) {X[i][j] = j%(i+1); h_X[i*D3+j] = X[i][j];}}
thrust::device_vector<float> d_ATA(D1*D2);
thrust::device_vector<float> d_X = h_X;
thrust::device_vector<int> d_idx0 = h_idx0;
// helpers
thrust::counting_iterator<int> mb = thrust::make_counting_iterator(0);
thrust::counting_iterator<int> me = thrust::make_counting_iterator(D1*D2*D3);
// perform computation
thrust::reduce_by_key(thrust::make_transform_iterator(mb, EXT_K), thrust::make_transform_iterator(me, EXT_K), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_X.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())), thrust::make_permutation_iterator(d_X.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_J)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())))), my_mult()), thrust::make_discard_iterator(), d_ATA.begin());
thrust::host_vector<float> h_ATA = d_ATA;
test_cpu(ATA, X, idx0);
std::cout << "GPU: CPU: " << std::endl;
for (int i = 0; i < D1*D2; i++)
std::cout << i/D1 << "," << i%D2 << ":" << h_ATA[i] << " " << ATA[i/D1][i%D2] << std::endl;
}
$ nvcc -o t875 t875.cu
$ ./t875
GPU: CPU:
0,0:81 81
0,1:73 73
0,2:99 99
0,3:153 153
0,4:145 145
0,5:169 169
0,6:219 219
0,7:0 0
0,8:25 25
1,0:73 73
1,1:169 169
1,2:146 146
1,3:193 193
1,4:212 212
1,5:313 313
1,6:280 280
1,7:0 0
1,8:49 49
2,0:99 99
2,1:146 146
2,2:300 300
2,3:234 234
2,4:289 289
2,5:334 334
2,6:390 390
2,7:0 0
2,8:50 50
3,0:153 153
3,1:193 193
3,2:234 234
3,3:441 441
3,4:370 370
3,5:433 433
3,6:480 480
3,7:0 0
3,8:73 73
4,0:145 145
4,1:212 212
4,2:289 289
4,3:370 370
4,4:637 637
4,5:476 476
4,6:547 547
4,7:0 0
4,8:72 72
5,0:169 169
5,1:313 313
5,2:334 334
5,3:433 433
5,4:476 476
5,5:841 841
5,6:604 604
5,7:0 0
5,8:97 97
6,0:219 219
6,1:280 280
6,2:390 390
6,3:480 480
6,4:547 547
6,5:604 604
6,6:1050 1050
6,7:0 0
6,8:94 94
7,0:0 0
7,1:0 0
7,2:0 0
7,3:0 0
7,4:0 0
7,5:0 0
7,6:0 0
7,7:0 0
7,8:0 0
8,0:25 25
8,1:49 49
8,2:50 50
8,3:73 73
8,4:72 72
8,5:97 97
8,6:94 94
8,7:0 0
8,8:25 25
$
Notes:
If you profile the above code with e.g. nvprof --print-gpu-trace ./t875, you will witness two kernel calls. The first is associated with the device_vector creation. The second kernel call handles the entire reduce_by_key operation.
I don't know if all this is slower or faster than your CUDA kernel, since you haven't provided it. Sometimes, expertly written CUDA kernels can be faster than thrust algorithms doing the same operation.
It's quite possible that what I have here is not precisely the algorithm you had in mind. For example, your code suggests you're only filling in a triangular portion of ATA. But your description (9*9*2) suggests you want to populate every position in ATA. Nevertheless, my intent is not to give you a black box but to demonstrate how you can use various thrust approaches to achieve whatever it is you want in a single kernel call.
I'm trying to reach peak performance of each SM from the code below. The peak lies somewhere between 25 GFlops(GTX275-GT200 Arch.). This code gives 8 GFlops at the max.
__global__ void new_ker(float *x)
{
int index = threadIdx.x+blockIdx.x*blockDim.x;
float a,b;
a=0;
b=x[index];
//LOOP=10000000
//No. of blocks = 1
//Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
#pragma unroll 2048
for(int i=0;i<LOOP;i++){
a=a*b+b;
}
x[index] = a;
}
I don't want to increase ILP in the code. Any ideas why it's not reaching peak??
int main(int argc,char **argv)
{
//Initializations
float *x;
float *dx;
cudaEvent_t new_start,new_stop;
float elapsed;
double gflops;
x = 0;
flag = 0;
cudaMalloc((void **)&dx,sizeof(float)*THPB);
//ILP=1
cudaEventCreate(&new_start);
cudaEventCreate(&new_stop);
printf("Kernel1:\n");
cudaEventRecord(new_start, 0);
new_ker<<<BLOCKS,THPB>>>(dx);
cudaEventRecord(new_stop,0);
cudaEventSynchronize(new_stop);
cudaEventElapsedTime(&elapsed,new_start,new_stop);
x = (float *)malloc(sizeof(float)*THPB);
cudaMemcpy(x,dx,sizeof(float)*THPB,cudaMemcpyDeviceToHost);
gflops = ((double)(BLOCKS)*(THPB)*LOOP/elapsed)/1000000;
printf("\t%f",gflops);
cudaEventDestroy(new_start);
cudaEventDestroy(new_stop);
return 0;
}
Platform:
CUDA 3.0
NVIDIA GeForce GTX275 (GT200)
If I put together a complete repro case from your code, using the correct FLOP calculation:
#include <stdio.h>
#define LOOP (10000000)
#define BLOCKS (30)
#define THPB (512)
__global__ void new_ker(float *x)
{
int index = threadIdx.x+blockIdx.x*blockDim.x;
float a,b;
a=0;
b=x[index];
#pragma unroll 2048
for(int i=0;i<LOOP;i++){
a=a*b+b;
}
x[index] = a;
}
int main(int argc,char **argv)
{
//Initializations
float *x;
float *dx;
cudaEvent_t new_start,new_stop;
float elapsed;
double gflops;
x = 0;
cudaMalloc((void **)&dx,sizeof(float)*THPB);
//ILP=1
cudaEventCreate(&new_start);
cudaEventCreate(&new_stop);
printf("Kernel1:\n");
cudaEventRecord(new_start, 0);
new_ker<<<BLOCKS,THPB>>>(dx);
cudaEventRecord(new_stop,0);
cudaEventSynchronize(new_stop);
cudaEventElapsedTime(&elapsed,new_start,new_stop);
x = (float *)malloc(sizeof(float)*THPB*BLOCKS);
cudaMemcpy(x,dx,sizeof(float)*THPB*BLOCKS,cudaMemcpyDeviceToHost);
gflops = 2.0e-6 * ((double)(LOOP)*double(THPB*BLOCKS)/(double)elapsed);
printf("\t%f\n",gflops);
cudaEventDestroy(new_start);
cudaEventDestroy(new_stop);
return 0;
}
And I compile it and run it on a 1.4GHz GTX275 with CUDA 3.2 on a 64 bit linux platform:
$ nvcc -arch=sm_13 -Xptxas="-v" -o perf perf.cu
ptxas info : Compiling entry function '_Z7new_kerPf' for 'sm_13'
ptxas info : Used 4 registers, 8+16 bytes smem, 8 bytes cmem[1]
$ ./perf
Kernel1:
671.806039
I get within 0.01% of peak FLOP/s for that card running a pure FMAD code (1.4 GHz * 2 FLOP * 8 cores/MP * 30 MP) = 672 GFLOP/s.
So it seems that the code does, in fact, hit peak FLOP/s with one block per multiprocessor, but you just are not calculating the FLOP/s number correctly.