I have a problem when using cublasDgemm (the cuBLAS routine that computes A*B; here A is 750x600 and B is 600x1000).
for (i=0; i < N; ++i) {
cublasDgemm();
}
N=10, total time is 0.000473s, average call is 0.0000473
N=100, total time is 0.00243s, average call is 0.0000243
N=1000, total time is 0.715072s, average call is 0.000715
N=10000, total time is 10.4998s, average call is 0.00104998
Why does the average time per call increase so much?
#include <cuda_runtime.h>
#include <string.h>
#include <cublas.h>
#include <cublas_v2.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
using namespace std;
#define IDX2C(i,j,leading) (((j)*(leading))+(i))
#define CHECK_EQ(a,b) do { \
if ((a) != (b)) { \
cout <<__FILE__<<" : "<< __LINE__<<" : check failed because "<<a<<"!="<<b<<endl;\
exit(1);\
}\
} while(0)
#define CUBLAS_CHECK(condition) \
do {\
cublasStatus_t status = condition; \
CHECK_EQ(status, CUBLAS_STATUS_SUCCESS); \
} while(0)
#define CUDA_CHECK(condition)\
do {\
cudaError_t error = condition;\
CHECK_EQ(error, cudaSuccess);\
} while(0)
//check after kernel function
#define CUDA_POST_KERNEL_CHECK CUDA_CHECK(cudaPeekAtLastError())
template <class T>
void randMtx(T *mat, int n, double range) {
srand((unsigned int)time(NULL));
for (int i = 0; i < n; ++i) {
//mat[i] = 1.0;
double flag = 1.0;
if (rand() % 2 == 0) flag = -1.0;
mat[i] = flag * rand()/RAND_MAX * range;
}
}
int main(int argc, char *argv[]) {
if (argc != 9) {
cout << "m1_row m1_col m2_row m2_col m1 m2 count range\n";
return -1;
}
int row1 = atoi(argv[1]);
int col1 = atoi(argv[2]);
int row2 = atoi(argv[3]);
int col2 = atoi(argv[4]);
int count = atoi(argv[7]);
double range = atof(argv[8]);
cublasOperation_t opt1 = CUBLAS_OP_N;
cublasOperation_t opt2 = CUBLAS_OP_N;
int row3 = row1;
int col3 = col2;
int k = col1;
if (argv[5][0] == 't') {
opt1 = CUBLAS_OP_T;
row3 = col1;
k = row1;
}
if (argv[6][0] == 't') {
opt2 = CUBLAS_OP_T;
col3 = row2;
}
double *mat1_c = (double*)malloc(sizeof(double)*row1*col1);
double *mat2_c = (double*)malloc(sizeof(double)*row2*col2);
double *mat3_c = (double*)malloc(sizeof(double)*row3*col3);
srand((unsigned int)time(NULL));
randMtx(mat1_c, row1*col1, range);
randMtx(mat2_c, row2*col2, range);
double *mat1_g;
double *mat2_g;
double *mat3_g;
double alpha = 1.0;
double beta = 0.0;
CUDA_CHECK(cudaMalloc((void **)&(mat1_g), sizeof(double)*row1*col1));
CUDA_CHECK(cudaMalloc((void **)&(mat2_g), sizeof(double)*row2*col2));
CUDA_CHECK(cudaMalloc((void **)&(mat3_g), sizeof(double)*row3*col3));
CUDA_CHECK(cudaMemcpy(mat1_g, mat1_c, sizeof(double)*row1*col1, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(mat2_g, mat2_c, sizeof(double)*row2*col2, cudaMemcpyHostToDevice));
cublasHandle_t handle;
CUBLAS_CHECK(cublasCreate(&handle));
struct timeval beg, end, b1, e1;
gettimeofday(&beg, NULL);
for (int i = 0; i < count ;++i) {
CUBLAS_CHECK(cublasDgemm(handle, opt1, opt2, row3, col3, k, &alpha, mat1_g, row1, mat2_g, row2, &beta, mat3_g, row3));
}
cudaDeviceSynchronize(); // wait for all queued Dgemm calls to complete
gettimeofday(&end, NULL);
cout << "real time used: " << end.tv_sec-beg.tv_sec + (double)(end.tv_usec-beg.tv_usec)/1000000 <<endl;
free(mat1_c);
free(mat2_c);
free(mat3_c);
cudaFree(mat1_g);
cudaFree(mat2_g);
cudaFree(mat3_g);
return 1;
}
This is the code. I added cudaDeviceSynchronize after the loop block, and no matter the value of count, the average call time is about 0.001 s.
As pointed out by @talonmies, this behavior is probably exactly what would be expected.
When you call cublasDgemm, the call (usually) returns control to the host (CPU) thread before the operation is complete. Each such call places the requested operation into a queue, and your host code simply continues on.
Furthermore, CUDA and CUBLAS usually have some one-time overhead that is associated with using the API. For example, the call to create a CUBLAS handle usually incurs some measurable time, in order to initialize the library.
So your measurements can be broken into 3 groups:
"Small" iteration counts (e.g. 10). In this case, each call pays the cost to put a Dgemm request into the queue, plus the amortization of the startup costs over a relatively small number of iterations. This corresponds to your measurements like this: "average call is 0.0000473"
"Medium" iteration counts (e.g. 100-1000). In this case, the amortization of the start up costs becomes very small per call, and so most of the measurement is just the time to add a Dgemm request to the queue. This corresponds to your measurements like this: "average call is 0.0000243"
"Large" iteration counts (e.g. 10000). At some point, the internal request queue becomes full and can no longer accept new requests, until some requests have been completed and removed from the queue. What happens at this point is that the Dgemm call switches from non-blocking to blocking. It blocks (holds up the host/CPU thread) until a queue slot becomes available. What happens at this point then, is that suddenly new requests must wait effectively for a previous request to finish, so now the cost for a new Dgemm request approximately equals the time to execute and complete a (previous) Dgemm request. So the per-call cost jumps up dramatically from the cost to add an item to the queue to the cost to complete a request. This corresponds to your measurements like this: "average call is 0.00104998"
In the CUDA examples, e.g. here, __match_all_sync and __match_any_sync are used.
Here is an example where a warp is split into multiple (one or more) groups that each keep track of their own atomic counter.
// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
int pred;
//const auto mask = __match_all_sync(__activemask(), ptr, &pred); //error, should be any_sync, not all_sync
const auto mask = __match_any_sync(__activemask(), (unsigned long long)ptr); // note: unlike __match_all_sync, __match_any_sync takes no predicate argument
const auto leader = __ffs(mask) - 1; // select a leader
int res;
const auto lane_id = threadIdx.x % warpSize; // lane index within the warp
if (lane_id == leader) { // leader does the update
res = atomicAdd(ptr, __popc(mask));
}
res = __shfl_sync(mask, res, leader); // get leader’s old value
return res + __popc(mask & ((1 << lane_id) - 1)); //compute old value
}
The __match_any_sync here splits up the threads in the warp into groups that have the same ptr value, so that each group can update its own ptr atomically without getting in the way of other threads. (For example, if lanes 0-3 pass the same ptr and no other lane does, each of those four lanes receives the mask 0x0000000F.)
I know the nvcc compiler (since CUDA 9) does this sort of optimization under the hood automatically, but this question is just about the mechanics of __match_any_sync.
Is there a way to do this on devices of compute capability lower than 7.0?
EDIT: The blog article has now been modified to reflect __match_any_sync() rather than __match_all_sync(), so any commentary to that effect below should be disregarded. The answer below is edited to reflect this.
Based on your statement:
this is just about the mechanics of __match_any_sync
we will focus on a replacement for __match_any_sync itself, not any other form of rewriting the atomicAggInc function. Therefore, we must provide a mask that has the same value as would be returned by __match_any_sync() on cc7.0 or higher architectures.
I believe this will require a loop that broadcasts each thread's ptr value and tests which threads have the same value, in the worst case one iteration per thread in the warp (since each thread could have a unique ptr value). There are various ways we could "optimize" this loop so as to possibly reduce the trip count from 32 to some lesser value, based on the actual ptr values in each thread, but such optimization in my view introduces considerable complexity, which makes the worst-case processing time longer (as is typical of early-exit optimizations). So I will demonstrate a fairly simple method without this optimization.
The other consideration is what to do when the warp is not converged. For that, we can employ __activemask() to identify the threads that are actually participating.
Here is a worked example:
$ cat t1646.cu
#include <iostream>
#include <stdio.h>
// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
int mask;
#if __CUDA_ARCH__ >= 700
mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
#else
unsigned tmask = __activemask();
for (int i = 0; i < warpSize; i++){
#ifdef USE_OPT
if ((1U<<i) & tmask){
#endif
unsigned long long tptr = __shfl_sync(tmask, (unsigned long long)ptr, i);
unsigned my_mask = __ballot_sync(tmask, (tptr == (unsigned long long)ptr));
if (i == (threadIdx.x & (warpSize-1))) mask = my_mask;}
#ifdef USE_OPT
}
#endif
#endif
int leader = __ffs(mask) - 1; // select a leader
int res;
unsigned lane_id = threadIdx.x % warpSize;
if (lane_id == leader) { // leader does the update
res = atomicAdd(ptr, __popc(mask));
}
res = __shfl_sync(mask, res, leader); // get leader’s old value
return res + __popc(mask & ((1 << lane_id) - 1)); //compute old value
}
__global__ void k(int *d){
int *ptr = d + threadIdx.x/4;
if ((threadIdx.x >= 16) && (threadIdx.x < 32))
atomicAggInc(ptr);
}
const int ds = 32;
int main(){
int *d_d, *h_d;
h_d = new int[ds];
cudaMalloc(&d_d, ds*sizeof(d_d[0]));
cudaMemset(d_d, 0, ds*sizeof(d_d[0]));
k<<<1,ds>>>(d_d);
cudaMemcpy(h_d, d_d, ds*sizeof(d_d[0]), cudaMemcpyDeviceToHost);
for (int i = 0; i < ds; i++)
std::cout << h_d[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1646 t1646.cu -DUSE_OPT
$ cuda-memcheck ./t1646
========= CUDA-MEMCHECK
0 0 0 0 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
========= ERROR SUMMARY: 0 errors
$
(CentOS 7, CUDA 10.1.243, with device 0 being Tesla V100, device 1 being a cc3.5 device).
I've added an optional optimization for the case where the warp is diverged (i.e. tmask is not 0xFFFFFFFF). This can be selected by defining USE_OPT.
I was working with texture references when I noticed they were deprecated. I tried to update my test function to work with the 'new' bindless texture objects using tex1Dfetch, but was not able to produce the same results.
I'm currently exploring the use of texture memory to speed up my aho-corasick implementation; I was able to get tex1D() working with texture references, however, I noticed they were deprecated and decided to use texture objects instead.
I'm getting some immensely weird behaviour with the kernels when I try to use the results in any way; I can do results[tidx] = tidx; without any issues, but results[tidx] = temp + 1; only ever gives back the value of temp, not temp + 1, temp * 3, or any other expression involving temp.
I can see no logical reason for this behaviour, and the documentation examples look similar enough that I can't see where I've gone wrong.
I've already read CUDA tex1Dfetch() wrong behaviour and New CUDA Texture Object — getting wrong data in 2D case, but neither seems related to the issue I am having.
Just in case it makes a difference: I am using CUDA release 10.0, V10.0.130, with an Nvidia GTX 980 Ti.
#include <iostream>
__global__ void test(cudaTextureObject_t tex ,int* results){
int tidx = threadIdx.y * blockDim.x + threadIdx.x;
unsigned temp = tex1Dfetch<unsigned>(tex, threadIdx.x);
results[tidx] = temp * 3;
}
int main(){
int *host_arr;
const int host_arr_size = 8;
// Create and populate host array
std::cout << "Host:" << std::endl;
cudaMallocHost(&host_arr, host_arr_size*sizeof(int));
for (int i = 0; i < host_arr_size; ++i){
host_arr[i] = i * 2;
std::cout << host_arr[i] << std::endl;
}
// Create resource description
struct cudaResourceDesc resDesc;
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = &host_arr;
resDesc.res.linear.sizeInBytes = host_arr_size*sizeof(unsigned);
resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned>();
// Create texture description
struct cudaTextureDesc texDesc;
texDesc.readMode = cudaReadModeElementType;
// Create texture
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// Allocate results array
int * result_arr;
cudaMalloc(&result_arr, host_arr_size*sizeof(unsigned));
// launch test kernel
test<<<1, host_arr_size>>>(tex, result_arr);
// fetch results
std::cout << "Device:" << std::endl;
cudaMemcpy(host_arr, result_arr, host_arr_size*sizeof(unsigned), cudaMemcpyDeviceToHost);
// print results
for (int i = 0; i < host_arr_size; ++i){
std::cout << host_arr[i] << std::endl;
}
// Tidy Up
cudaDestroyTextureObject(tex);
cudaFreeHost(host_arr);
cudaFree(result_arr);
}
I expected the above to work similarly to the below (which does work):
texture<int, 1, cudaReadModeElementType> tex_ref;
cudaArray* cuda_array;
__global__ void test(int* results){
const int tidx = threadIdx.x;
results[tidx] = tex1D(tex_ref, tidx) * 3;
}
int main(){
int *host_arr;
int host_arr_size = 8;
// Create and populate host array
cudaMallocHost((void**)&host_arr, host_arr_size * sizeof(int));
for (int i = 0; i < host_arr_size; ++i){
host_arr[i] = i * 2;
std::cout << host_arr[i] << std::endl;
}
// bind to texture
cudaChannelFormatDesc cuDesc = cudaCreateChannelDesc <int >();
cudaMallocArray(&cuda_array, &cuDesc, host_arr_size);
cudaMemcpyToArray(cuda_array, 0, 0, host_arr , host_arr_size * sizeof(int), cudaMemcpyHostToDevice);
cudaBindTextureToArray(tex_ref , cuda_array);
// Allocate results array
int * result_arr;
cudaMalloc((void**)&result_arr, host_arr_size*sizeof(int));
// launch kernel
test<<<1, host_arr_size>>>(result_arr);
// fetch results
cudaMemcpy(host_arr, result_arr, host_arr_size * sizeof(int), cudaMemcpyDeviceToHost);
// print results
for (int i = 0; i < host_arr_size; ++i){
std::cout << host_arr[i] << std::endl;
}
// Tidy Up
cudaUnbindTexture(tex_ref);
cudaFreeHost(host_arr);
cudaFreeArray(cuda_array);
cudaFree(result_arr);
}
Expected results:
Host:
0
2
4
6
8
10
12
14
Device:
0
6
12
18
24
30
36
42
Actual results:
Host:
0
2
4
6
8
10
12
14
Device:
0
2
4
6
8
10
12
14
Does anyone know what on earth is going wrong?
CUDA API function calls return error codes. You want to check these error codes. Especially when something is clearly going wrong somewhere…
You use the same array to store the initial array data and to receive the result from the device. Your kernel launch fails with an illegal address error because you do not have a valid texture object. You do not have a valid texture object because the creation of your texture object failed. The first API call right after the kernel launch is the cudaMemcpy() to get the results back. Since there was an error during the kernel launch, cudaMemcpy() will fail, returning the most recent error instead of performing the copy. As a result, the contents of your host_arr buffer are unchanged and you just end up displaying the original input data again.
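A minimal sketch of what such checking might look like in this program (the names are taken from the code in the question):
// Sketch: check the return codes of the calls involved in the failure.
cudaError_t err = cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
if (err != cudaSuccess)
    std::cout << "cudaCreateTextureObject failed: " << cudaGetErrorString(err) << std::endl;
test<<<1, host_arr_size>>>(tex, result_arr);
err = cudaGetLastError();      // reports errors detected at launch time
if (err != cudaSuccess)
    std::cout << "kernel launch failed: " << cudaGetErrorString(err) << std::endl;
err = cudaDeviceSynchronize(); // reports errors that occur while the kernel runs
if (err != cudaSuccess)
    std::cout << "kernel execution failed: " << cudaGetErrorString(err) << std::endl;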
The reason why the creation of your texture object failed is explained in the documentation (emphasis mine):
If cudaResourceDesc::resType is set to cudaResourceTypeLinear, cudaResourceDesc::res::linear::devPtr must be set to a valid device pointer, that is aligned to cudaDeviceProp::textureAlignment. […]
A texture object cannot reference host memory. The problem in your code lies here:
resDesc.res.linear.devPtr = &host_arr;
You need to allocate a buffer in device memory, e.g. using cudaMalloc(), copy your data there, and create a texture object that refers to that device buffer.
Furthermore, your texDesc is not initialized properly. In your case, it should be sufficient to just zero-initialize it:
struct cudaTextureDesc texDesc = {};
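Putting the two fixes together, a sketch of the corrected setup might look like the following (dev_arr is a hypothetical device buffer introduced here for illustration; error checks omitted for brevity):
// Sketch: the texture object must reference device memory, not host memory.
unsigned *dev_arr;                                   // hypothetical device buffer
cudaMalloc(&dev_arr, host_arr_size * sizeof(unsigned));
cudaMemcpy(dev_arr, host_arr, host_arr_size * sizeof(unsigned), cudaMemcpyHostToDevice);
struct cudaResourceDesc resDesc = {};                // zero-initialize the descriptor
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = dev_arr;                 // device pointer, not &host_arr
resDesc.res.linear.sizeInBytes = host_arr_size * sizeof(unsigned);
resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned>();
struct cudaTextureDesc texDesc = {};                 // zero-initialize the descriptor
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);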
4 steps:
declare
texture<unsigned char,1,cudaReadModeElementType> tex1;
bind
cudaBindTexture(0,tex1,dev_A);
fetch/read via index
tex1Dfetch(tex1,2);
unbind
cudaUnbindTexture(tex1);
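For completeness, a minimal sketch of how these four steps might fit together (dev_A, dev_out and N are placeholder names, not from the original post):
// Sketch combining the four steps above (deprecated texture-reference API).
texture<unsigned char, 1, cudaReadModeElementType> tex1;        // 1. declare (file scope)
__global__ void readThroughTexture(unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex1, i);                           // 3. fetch/read via index
}
void run(unsigned char *dev_A, unsigned char *dev_out, int N)
{
    cudaBindTexture(0, tex1, dev_A, N * sizeof(unsigned char)); // 2. bind
    readThroughTexture<<<(N + 255) / 256, 256>>>(dev_out, N);
    cudaUnbindTexture(tex1);                                    // 4. unbind
}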
I want to compute the density of particles over a grid. Therefore, I have a vector that contains the cellID of each particle, as well as a vector with each particle's mass, which does not have to be uniform.
I have taken the non-sparse example from Thrust to compute a histogram of my particles.
However, to compute the density, I need to include the weight of each particle instead of simply summing the number of particles per cell, i.e. I'm interested in rho[i] = sum of W[j] over all j that satisfy cellID[j] = i (probably unnecessary to explain, since everybody knows that).
Implementing this with Thrust has not worked for me. I also tried to use a CUDA kernel and thrust::raw_pointer_cast, but I did not succeed with that either.
EDIT:
Here is a minimal working example which should compile via nvcc file.cu under CUDA 6.5 and with Thrust installed.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>
// Predicate
struct is_out_of_bounds {
__host__ __device__ bool operator()(int i) {
return (i < 0); // out of bounds elements have negative id;
}
};
// cf.: https://code.google.com/p/thrust/source/browse/examples/histogram.cu, but modified
template<typename T1, typename T2>
void computeHistogram(const T1& input, T2& histogram) {
typedef typename T1::value_type ValueType; // input value type
typedef typename T2::value_type IndexType; // histogram index type
// copy input data (could be skipped if input is allowed to be modified)
thrust::device_vector<ValueType> data(input);
// sort data to bring equal elements together
thrust::sort(data.begin(), data.end());
// there are elements that we don't want to count, those have ID -1;
data.erase(thrust::remove_if(data.begin(), data.end(), is_out_of_bounds()),data.end());
// number of histogram bins is equal to the maximum value plus one
IndexType num_bins = histogram.size();
// find the end of each bin of values
thrust::counting_iterator<IndexType> search_begin(0);
thrust::upper_bound(data.begin(), data.end(), search_begin,
search_begin + num_bins, histogram.begin());
// compute the histogram by taking differences of the cumulative histogram
thrust::adjacent_difference(histogram.begin(), histogram.end(),
histogram.begin());
}
int main(void) {
thrust::device_vector<int> cellID(5);
cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4]=1;
thrust::device_vector<float> mass(5);
mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;
thrust::device_vector<int> histogram(3);
thrust::device_vector<float> density(3);
computeHistogram(cellID,histogram);
std::cout<<"\nHistogram:\n";
thrust::copy(histogram.begin(), histogram.end(),
std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
// this will print: " Histogram 1 2 1 "
// meaning one element with ID 0, two elements with ID 1
// and one element with ID 2
/* here is what I am unable to implement:
*
*
* computeDensity(cellID,mass,density);
*
* print(density): 2.0 5.0 3.0
*
*
*/
}
I hope the comment at the end of the file also makes clear what I mean by computing the density. If there is any question open, please feel free to ask. Thanks!
There still seems to be some misunderstanding of my problem, which I am sorry for! Therefore I have added some pictures.
Consider the first picture. To my understanding, a histogram would simply be the count of particles per grid cell. In this case a histogram would be an array of size 36, since there are 36 cells. Also, there would be a lot of zero entries in the vector, since, for example, in the upper left corner almost no cell contains a particle. This is what I already have in my code.
Now consider the slightly more complicated case. Here each particle has a different mass, indicated by the different sizes in the plot. To compute the density I can't just count the number of particles per cell; I have to add up the masses of all particles in each cell. This is what I'm unable to implement.
What you described in your example does not look like a histogram but rather like a segmented reduction.
The following example code uses thrust::reduce_by_key to sum up the masses of particles within the same cell:
density.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <iostream>
#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
std::cout << name << ":\t\t";
thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
std::cout << std::endl << std::endl;
}
int main()
{
const int particle_count = 5;
const int cell_count = 10;
thrust::device_vector<int> cellID(particle_count);
cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4]=1;
thrust::device_vector<float> mass(particle_count);
mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;
std::cout << "input data" << std::endl;
PRINTER(cellID);
PRINTER(mass);
thrust::sort_by_key(cellID.begin(), cellID.end(), mass.begin());
std::cout << "after sort_by_key" << std::endl;
PRINTER(cellID);
PRINTER(mass);
thrust::device_vector<int> reduced_cellID(particle_count);
thrust::device_vector<float> density(particle_count);
int new_size = thrust::reduce_by_key(cellID.begin(), cellID.end(),
mass.begin(),
reduced_cellID.begin(),
density.begin()
).second - density.begin();
if (reduced_cellID[0] == -1)
{
density.erase(density.begin());
reduced_cellID.erase(reduced_cellID.begin());
new_size--;
}
density.resize(new_size);
reduced_cellID.resize(new_size);
std::cout << "after reduce_by_key" << std::endl;
PRINTER(density);
PRINTER(reduced_cellID);
thrust::device_vector<float> final_density(cell_count);
thrust::scatter(density.begin(), density.end(), reduced_cellID.begin(), final_density.begin());
PRINTER(final_density);
}
compile using
nvcc -std=c++11 density.cu -o density
output
input data
cellID: -1 1 0 2 1
mass: 0.5 1 2 3 4
after sort_by_key
cellID: -1 0 1 1 2
mass: 0.5 2 1 4 3
after reduce_by_key
density: 2 5 3
reduced_cellID: 0 1 2
final_density: 2 5 3 0 0 0 0 0 0 0
I have two floating point time series A, B of length N each. I have to calculate their circular convolution and find its maximum value. The classic and fastest way of doing this is
C = iFFT(FFT(A) * FFT(B))
Now, let's suppose that both A and B are series which contain only 1s and 0s, so in principle we can represent them as bitstreams.
Question: Is there any faster way of doing the convolution (and finding its maximum value) if I am somehow able to make use of the fact above?
(I was already thinking a lot about Walsh-Hadamard transforms, SSE instructions and popcounts, but found no faster way for N > 2^20, which is my case.)
Thanks,
gd
The 1D circular convolution c of two arrays a and b of size n is an array such that:
c[i] = sum over j from 0 to n-1 of a[j] * b[(i+j) mod n]
This formula can be rewritten in an iterative way:
c[i] = c[i-1] + sum over j of (b[j] - b[(j-1+n) mod n]) * a[(j-i+n) mod n]
The non-zero terms of the second sum are limited to the number of changes nb of b: if b is a simple pattern, this sum reduces to a few terms. An algorithm may now be designed to compute c:
1 : compute c[0] (about n operations)
2 : for 0<i<n compute c[i] using the formula (about nb*n operations)
If nb is small, this method may be faster than fft. Note that it will provide exact results for bitstream signals, while the fft needs oversampling and floating point precision to deliver accurate results.
Here is a piece of code implementing this trick with input type unsigned char.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <fftw3.h>
typedef struct{
unsigned int nbchange;
unsigned int index[1000];
int change[1000];
}pattern;
void topattern(unsigned int n, unsigned char* b,pattern* bp){
//initialisation
bp->nbchange=0;
unsigned int i;
unsigned char former=b[n-1];
for(i=0;i<n;i++){
if(b[i]!=former){
bp->index[bp->nbchange]=i;
bp->change[bp->nbchange]=((int)b[i])-former;
bp->nbchange++;
}
former=b[i];
}
}
void printpattern(pattern* bp){
int i;
printf("pattern :\n");
for(i=0;i<bp->nbchange;i++){
printf("index %d change %d\n",bp->index[i],bp->change[i]);
}
}
//https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
unsigned int NumberOfSetBits(unsigned int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
//https://stackoverflow.com/questions/2525310/how-to-define-and-work-with-an-array-of-bits-in-c
unsigned int convol_longint(unsigned int a, unsigned int b){
return NumberOfSetBits(a&b);
}
int main(int argc, char* argv[]) {
unsigned int n=10000000;
//the array a
unsigned char* a=malloc(n*sizeof(unsigned char));
if(a==NULL){printf("malloc failed\n");exit(1);}
unsigned int i,j;
for(i=0;i<n;i++){
a[i]=rand();
}
memset(&a[2],5,2);
memset(&a[10002],255,20);
for(i=0;i<n;i++){
//printf("a %d %d \n",i,a[i]);
}
//pattern b
unsigned char* b=malloc(n*sizeof(unsigned char));
if(b==NULL){printf("malloc failed\n");exit(1);}
memset(b,0,n*sizeof(unsigned char));
memset(&b[2],1,20);
//memset(&b[120],1,10);
//memset(&b[200],1,10);
int* c=malloc(n*sizeof(int)); //nb bit in the array
memset(c,0,n*sizeof(int));
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
//computing c[0]
for(i=0;i<n;i++){
//c[0]+= convol_longint(a[i],b[i]);
c[0]+= ((int)a[i])*((int)b[i]);
//printf("c[0] %d %d\n",c[0],i);
}
printf("c[0] %d\n",c[0]);
//need to store b as a pattern.
pattern bpat;
topattern( n,b,&bpat);
printpattern(&bpat);
//computing c[i] according to formula
for(i=1;i<n;i++){
c[i]=c[i-1];
for(j=0;j<bpat.nbchange;j++){
c[i]+=bpat.change[j]*((int)a[(bpat.index[j]-i+n)%n]);
}
}
//finding max
int currmax=c[0];
unsigned int currindex=0;
for(i=1;i<n;i++){
if(c[i]>currmax){
currmax=c[i];
currindex=i;
}
//printf("c[i] %d %d\n",i,c[i]);
}
printf("c[max] is %d at index %d\n",currmax,currindex);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("computation took %lf seconds\n",time_spent);
double* dp = malloc(sizeof (double) * n);
fftw_complex * cp = fftw_malloc(sizeof (fftw_complex) * (n/2+1));
begin = clock();
fftw_plan plan = fftw_plan_dft_r2c_1d(n, dp, cp, FFTW_ESTIMATE);
fftw_execute ( plan );
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("fftw took %lf seconds\n",time_spent); /* covers plan creation and execution */
fftw_destroy_plan(plan);
free(dp);
fftw_free(cp); /* cp was allocated with fftw_malloc */
free(a);
free(b);
free(c);
return 0;
}
To compile : gcc main.c -o main -lfftw3 -lm
For n=10 000 000 and nb=2 (b is just a "rectangular 1D window") this algorithm runs in 0.65 seconds on my computer. A double-precision fft using fftw took approximately the same time. This comparison, like most comparisons, may be unfair since:
nb=2 is the best case for the algorithm presented in this answer.
The fft-based algorithm would have needed oversampling.
double precision may not be required for the fft-based algorithm
The implementation exposed here is not optimized. It is just basic code.
This implementation can handle n=100 000 000. At this point, using long int for c could be advised to avoid any risk of overflow.
If the signals are bitstreams, this program may be optimized in various ways. For bitwise operations, see this question and this one.
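As an illustration, here is a sketch only, assuming a and b have been packed 1 bit per sample into 32-bit words (no wrap-around handling shown): a term such as c[0] then becomes a sum of popcounts over ANDed words, reusing NumberOfSetBits() from the code above. The names bitstream_dot, a_words, b_words and n_words are hypothetical:
/* Sketch: one correlation term on bit-packed inputs (32 samples per word). */
unsigned int bitstream_dot(const unsigned int *a_words,
                           const unsigned int *b_words,
                           unsigned int n_words)
{
    unsigned int sum = 0;
    unsigned int w;
    for (w = 0; w < n_words; w++) {
        sum += NumberOfSetBits(a_words[w] & b_words[w]); /* popcount of the AND */
    }
    return sum;
}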
I came across some peculiar performance behaviour when trying out the CUDA shuffle instruction. The test kernel below is based on an image processing algorithm which adds input-dependent values to all neighbouring pixels within a square of side rad. The output for each block is accumulated in shared memory. If only one thread per warp adds its result to shared memory, the performance is poor (Option 1), whereas if all threads add to shared memory (one thread adds the desired value, the rest just add 0), the execution time drops by a factor of 2-3 (Option 2).
#include <iostream>
#include "cuda_runtime.h"
#define warpSz 32
#define tileY 32
#define rad 32
__global__ void test(float *out, int pitch)
{
// Set shared mem to 0
__shared__ float tile[(warpSz + 2*rad) * (tileY + 2*rad)];
for (int i = threadIdx.y*blockDim.x+threadIdx.x; i<(tileY+2*rad)*(warpSz+2*rad); i+=blockDim.x*blockDim.y) {
tile[i] = 0.0f;
}
__syncthreads();
for (int row=threadIdx.y; row<tileY; row += blockDim.y) {
// Loop over pixels in neighbourhood
for (int i=0; i<2*rad+1; ++i) {
float res = 0.0f;
int rowStartIdx = (row+i)*(warpSz+2*rad);
for (int j=0; j<2*rad+1; ++j) {
res += float(threadIdx.x+row); // Substitute for real calculation
// Option 1: one thread writes to shared mem
if (threadIdx.x == 0) {
tile[rowStartIdx + j] += res;
res = 0.0f;
}
//// Option 2: all threads write to shared mem
//float tmp = 0.0f;
//if (threadIdx.x == 0) {
// tmp = res;
// res = 0.0f;
//}
//tile[rowStartIdx + threadIdx.x+j] += tmp;
res = __shfl(res, (threadIdx.x+1) % warpSz);
}
res += float(threadIdx.x+row);
tile[rowStartIdx + threadIdx.x+2*rad] += res;
__syncthreads();
}
}
// Add result back to global mem
for (int row=threadIdx.y; row<tileY+2*rad; row+=blockDim.y) {
for (int col=threadIdx.x; col<warpSz+2*rad; col+=warpSz) {
int idx = (blockIdx.y*tileY + row)*pitch + blockIdx.x*warpSz + col;
atomicAdd(out+idx, tile[row*(warpSz+2*rad) + col]);
}
}
}
int main(void)
{
int2 dim = make_int2(512, 512);
int pitchOut = (((dim.x+2*rad)+warpSz-1) / warpSz) * warpSz;
int sizeOut = pitchOut*(dim.y+2*rad);
dim3 gridDim((dim.x+warpSz-1)/warpSz, (dim.y+tileY-1)/tileY, 1);
float *devOut;
cudaMalloc((void**)&devOut, sizeOut*sizeof(float));
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaFree(0);
cudaEventRecord(start, 0);
test<<<gridDim, dim3(warpSz, 8)>>>(devOut, pitchOut);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaFree(devOut);
cudaDeviceReset();
std::cout << "Elapsed time: " << elapsedTime << " ms.\n";
std::cin.ignore();
}
Is this expected behaviour/can anyone explain why this happens?
One thing I have noted is that Option 1 uses only 15 registers, whereas Option 2 uses 37, which seems like a big difference to me.
Another is that the if-statement in the innermost loop is converted to explicit bra instructions in the PTX code for Option 1, whereas for Option 2 it is converted to two selp instructions. Could it be that the explicit branching is behind the 2-3x slowdown, similar to what is suspected in this question?
There are two reasons why I am reluctant to go for Option 2. First, when profiling the original application it seems to be limited by shared memory bandwidth, which suggests that there is potential to increase the performance by having fewer threads access it. Second, unless we use the volatile keyword, writes to shared memory can be optimised into registers. Since we are only interested in the contribution from the last thread to access each memory location (threadIdx.x == 0), and all others add 0, this is not a problem as long as all changes temporarily held in registers are guaranteed to be written back to shared memory in the same order they were issued. Is this the case, though? (So far, both options have produced the exact same result.)
Any thoughts or ideas are much appreciated!
PS. I compile for compute capability 3.0. (However, the shuffle instruction is not necessary to demonstrate the behaviour and can be commented out.)