Efficient bitstream convolution - fft

I have two floating-point time series A, B of length N each. I have to calculate their circular convolution and find its maximum value. The classic and fastest way of doing this is
C = iFFT(FFT(A) * FFT(B))
Now, let's suppose that both A and B are series containing only 1s and 0s, so in principle we can represent them as bitstreams.
Question: Is there any faster way of doing the convolution (and finding its maximum value) if I am somehow able to make use of the fact above?
(I was already thinking a lot about Walsh-Hadamard transforms, SSE instructions and popcounts, but found no faster way for N > 2**20, which is my case.)
Thanks,
gd

The 1D circular convolution c of two arrays a and b of size n is an array such that:
c[i] = sum_{j=0..n-1} a[j] * b[(i+j) mod n]
This formula can be rewritten in an iterative way:
c[i] = c[i-1] + sum_{j=0..n-1} a[j] * ( b[(i+j) mod n] - b[(i+j-1) mod n] )
The non-null terms of the sum are limited to the number of changes nb of b: if b is a simple pattern, this sum reduces to a few terms. An algorithm may now be designed to compute c:
1 : compute c[0] (about n operations)
2 : for 0<i<n compute c[i] using the formula (about nb*n operations)
If nb is small, this method may be faster than fft. Note that it will provide exact results for bitstream signals, while the fft needs oversampling and floating point precision to deliver accurate results.
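For example, with n = 8 and b = [0,0,1,1,1,0,0,0] (a rectangular window), b changes only at index 2 (0 -> 1, change +1) and at index 5 (1 -> 0, change -1), so nb = 2 and each c[i] follows from c[i-1] with just two terms: c[i] = c[i-1] + a[(2-i) mod 8] - a[(5-i) mod 8].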
Here is a piece of code implementing this trick with input type unsigned char.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <fftw3.h>
typedef struct{
unsigned int nbchange;
unsigned int index[1000];
int change[1000];
}pattern;
void topattern(unsigned int n, unsigned char* b,pattern* bp){
//initialisation
bp->nbchange=0;
unsigned int i;
unsigned char former=b[n-1];
for(i=0;i<n;i++){
if(b[i]!=former){
bp->index[bp->nbchange]=i;
bp->change[bp->nbchange]=((int)b[i])-former;
bp->nbchange++;
}
former=b[i];
}
}
void printpattern(pattern* bp){
int i;
printf("pattern :\n");
for(i=0;i<bp->nbchange;i++){
printf("index %d change %d\n",bp->index[i],bp->change[i]);
}
}
//https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
unsigned int NumberOfSetBits(unsigned int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
//https://stackoverflow.com/questions/2525310/how-to-define-and-work-with-an-array-of-bits-in-c
unsigned int convol_longint(unsigned int a, unsigned int b){
return NumberOfSetBits(a&b);
}
int main(int argc, char* argv[]) {
unsigned int n=10000000;
//the array a
unsigned char* a=malloc(n*sizeof(unsigned char));
if(a==NULL){printf("malloc failed\n");exit(1);}
unsigned int i,j;
for(i=0;i<n;i++){
a[i]=rand();
}
memset(&a[2],5,2);
memset(&a[10002],255,20);
for(i=0;i<n;i++){
//printf("a %d %d \n",i,a[i]);
}
//pattern b
unsigned char* b=malloc(n*sizeof(unsigned char));
if(b==NULL){printf("malloc failed\n");exit(1);}
memset(b,0,n*sizeof(unsigned char));
memset(&b[2],1,20);
//memset(&b[120],1,10);
//memset(&b[200],1,10);
int* c=malloc(n*sizeof(int)); //nb bit in the array
memset(c,0,n*sizeof(int));
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
//computing c[0]
for(i=0;i<n;i++){
//c[0]+= convol_longint(a[i],b[i]);
c[0]+= ((int)a[i])*((int)b[i]);
//printf("c[0] %d %d\n",c[0],i);
}
printf("c[0] %d\n",c[0]);
//need to store b as a pattern.
pattern bpat;
topattern( n,b,&bpat);
printpattern(&bpat);
//computing c[i] according to formula
for(i=1;i<n;i++){
c[i]=c[i-1];
for(j=0;j<bpat.nbchange;j++){
c[i]+=bpat.change[j]*((int)a[(bpat.index[j]-i+n)%n]);
}
}
//finding max
int currmax=c[0];
unsigned int currindex=0;
for(i=1;i<n;i++){
if(c[i]>currmax){
currmax=c[i];
currindex=i;
}
//printf("c[i] %d %d\n",i,c[i]);
}
printf("c[max] is %d at index %d\n",currmax,currindex);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("computation took %lf seconds\n",time_spent);
//rough fftw comparison: plan first (planning is not the transform), then time one r2c transform
double* dp = malloc(sizeof (double) * n);
fftw_complex * cp = fftw_malloc(sizeof (fftw_complex) * (n/2+1));
fftw_plan plan = fftw_plan_dft_r2c_1d(n, dp, cp, FFTW_ESTIMATE);
for(i=0;i<n;i++){dp[i]=(double)a[i];} //give the transform real input data
begin = clock();
fftw_execute ( plan );
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("fftw took %lf seconds\n",time_spent);
fftw_destroy_plan(plan);
free(dp);
fftw_free(cp);
free(a);
free(b);
free(c);
return 0;
}
To compile : gcc main.c -o main -lfftw3 -lm
For n=10 000 000 and nb=2 (b is just a rectangular 1D window) this algorithm runs in 0.65 seconds on my computer. A double-precision fft using fftw took approximately the same time. This comparison, like most comparisons, may be unfair since:
nb=2 is the best case for the algorithm presented in this answer.
The fft-based algorithm would have needed oversampling.
double precision may not be required for the fft-based algorithm.
The implementation shown here is not optimized; it is just basic code.
This implementation can handle n=100 000 000. At that size, using a 64-bit integer type (e.g. long long) for c is advised to avoid any risk of overflow.
If the signals are bitstreams, this program may be optimized in various ways using bitwise operations; see this question and this one.
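As a rough sketch of the bitwise route (illustrative only, not benchmarked here, and the function name is made up): pack the 0/1 samples into 64-bit words, then one lag of the correlation becomes a word-wise AND followed by a popcount; the other lags would need a bit-shifted copy of b, which is where the real work lies.
#include <stdio.h>
#include <stdint.h>
/* correlation value at lag 0 for two bit-packed signals of nwords 64-bit words */
static uint64_t corr_lag0(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    uint64_t sum = 0;
    for (size_t w = 0; w < nwords; w++)
        sum += (uint64_t)__builtin_popcountll(a[w] & b[w]); /* GCC/Clang builtin */
    return sum;
}
int main(void)
{
    uint64_t a[2] = { 0xF0F0F0F0F0F0F0F0ULL, 0x1ULL };
    uint64_t b[2] = { 0xFF00FF00FF00FF00ULL, 0x1ULL };
    printf("lag 0: %llu\n", (unsigned long long)corr_lag0(a, b, 2));
    return 0;
}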

Related

CUDA: Confused as to why this test program doesn't seem to do anything

I've got a CUDA test program which is supposed to invert the RGB values of an image. On my system at least, this is producing an output image, but it is completely transparent.
Here's CudaLodepng.cu
#include <stdio.h>
#include <stdlib.h>
#include "lodepng.h"
__global__
void NegativeFilter(unsigned char *inputImage, unsigned char *outputImage)
{
int r;
int g;
int b;
int t;
int threadIndex = blockDim.x * blockIdx.x + threadIdx.x;
int pixel = threadIndex * 4;
printf("uid = %d\n", pixel);
r = inputImage[pixel];
g = inputImage[pixel+1];
b = inputImage[pixel+2];
t = inputImage[pixel+3];
outputImage[pixel] = 255-r;
outputImage[pixel+1] = 255-g;
outputImage[pixel+2] = 255-b;
outputImage[pixel+3] = t;
}
int main(int argc, char ** argv){
unsigned int errorDecode;
unsigned char* cpuImage;
unsigned int width, height;
char *filename = argv[1];
char *newFilename = argv[2];
errorDecode = lodepng_decode32_file(&cpuImage, &width, &height, filename);
if(errorDecode){
printf("error %u: %s\n", errorDecode, lodepng_error_text(errorDecode));
}
int arraySize = width*height*4;
int memorySize = arraySize * sizeof(unsigned char);
unsigned char *cpuOutImage = (unsigned char*)malloc(memorySize);
unsigned char* gpuInput;
unsigned char* gpuOutput;
cudaMalloc((void**)&gpuInput, memorySize);
cudaMalloc((void**)&gpuOutput, memorySize);
cudaMemcpy(gpuInput, cpuImage, memorySize, cudaMemcpyHostToDevice);
NegativeFilter<<<1, width * height>>>(gpuInput, gpuOutput);
cudaDeviceSynchronize();
cudaMemcpy(cpuOutImage, gpuOutput, memorySize, cudaMemcpyDeviceToHost);
unsigned int errorEncode = lodepng_encode32_file(newFilename, cpuOutImage, width, height);
if(errorEncode) {
printf("error %u: %s\n", errorEncode, lodepng_error_text(errorEncode));
}
cudaFree(gpuInput);
cudaFree(gpuOutput);
free(cpuImage);
free(cpuOutImage);
}
A couple of other files are required for this to compile: lodepng.h and lodepng.cpp.
You can obtain them here: https://github.com/lvandeve/lodepng
Finally, to compile and run:
nvcc CudaLodepng.cu lodepng.cpp
./a.out image.png imageout.png
If you don't want to bother downloading lodepng and running this code on a file, you might be able to spot the issue in the code itself. I've looked for an hour or so and can't figure it out.
I'm not new to CUDA but it's been about 5 years since I've done any, so this somewhat took me by surprise when it didn't appear to do anything.
(It compiles and runs fine by the way, but the output is just a transparent image on my system. I've been testing it with a 4x4 test image containing 4 color squares. You could knock the same thing up with gimp. I will attach the test image below but I have no idea if the data will transfer correctly. It's a 32 bit png, supposedly rgba format.)
(A tiny 4x4 test image was attached to the original post here.)
Totally unrelated to the code above: the issue is that I am on a Linux laptop with both a discrete and an embedded GPU.
optirun ./a.out
is required to execute CUDA code on the NVIDIA GPU.
I would have deleted the question, but there might be someone else on a Linux system with a similar configuration, and reading this answer might prevent them from wasting several hours going back and forth trying to find a solution to a problem which doesn't exist (in the code, that is).
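As a side note (not part of the original post; the helper name is illustrative): adding a minimal CUDA error check after the launch would have pointed at the environment rather than the code, since on a misconfigured Optimus setup the runtime reports an error such as "no CUDA-capable device is detected" instead of silently producing an empty image.
// drop this helper into CudaLodepng.cu
static void checkCuda(const char *what)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("%s: %s\n", what, cudaGetErrorString(err));
}
// and call it in main(), right after the kernel launch:
//   NegativeFilter<<<1, width * height>>>(gpuInput, gpuOutput);
//   checkCuda("kernel launch");
//   cudaDeviceSynchronize();
//   checkCuda("kernel execution");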

CUDA float precision not matching CPU implementation

I am using CUDA 5.5 compute 3.5 on GTX 1080Ti and want to compute this formula:
y = a * a * b / 64 + c * c
Suppose I have these parameters:
a = 5876
b = 0.4474222958088
c = 664
I am computing this both via GPU and on the CPU and they give me different inexact answers:
h_data[0] = 6.822759375000e+05,
h_ref[0] = 6.822760000000e+05,
difference = -6.250000000000e-02
h_data is the CUDA answer, h_ref is the CPU answer. When I plug these into my calculator, the GPU answer is closer to the exact answer, and I suspect this has to do with floating-point precision. My question now is: how can I get the CUDA solution to match the precision/roundoff of the CPU version? If I offset the a parameter by +/-1 the solutions match, but if I offset, say, the c parameter I still get a difference of 1/16.
Here's the working code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
__global__ void test_func(float a, float b, int c, int nz, float * __restrict__ d_out)
{
float *fdes_out = d_out + blockIdx.x * nz;
float roffout2 = a * a / 64.f;
//float tmp = fma(roffout2,vel,index*index);
for (int tid = threadIdx.x; tid < nz; tid += blockDim.x) {
fdes_out[tid] = roffout2 * b + c * c;
}
}
int main (int argc, char **argv)
{
// parameters
float a = 5876.0f, b = 0.4474222958088f;
int c = 664;
int nz = 1;
float *d_data, *h_data, *h_ref;
h_data = (float*)malloc(nz*sizeof(float));
h_ref = (float*)malloc(nz*sizeof(float));
// CUDA
cudaMalloc((void**)&d_data, sizeof(float)*nz);
dim3 nb(1,1,1); dim3 nt(64,1,1);
test_func <<<nb,nt>>> (a,b,c,nz,d_data);
cudaMemcpy(h_data, d_data, sizeof(float)*nz, cudaMemcpyDeviceToHost);
// Reference
float roffout2 = a * a / 64.f;
h_ref[0] = roffout2*b + c*c;
// Compare
printf("h_data[0] = %1.12e,\nh_ref[0] = %1.12e,\ndifference = %1.12e\n",
h_data[0],h_ref[0],h_data[0]-h_ref[0]);
// Free
free(h_data); free(h_ref);
cudaFree(d_data);
return 0;
}
I'm compiling only with the -O3 flag.
This small numerical difference of one single-precision ulp occurs because the CUDA compiler applies FMA-merging by default, whereas the host compiler does not do that. FMA-merging can be turned off by adding the command line flag -fmad=false to the invocation of the CUDA compiler driver nvcc.
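For example (the file name test_func.cu is just a placeholder here):
nvcc -O3 -fmad=false -o test_func test_func.cu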
FMA-merging is a compiler optimization in which an FMUL and a dependent FADD are transformed into a single fused multiply-add, or FMA, instruction. An FMA instruction computes a*b+c such that the full unrounded product a*b enters into the addition with c before a final rounding is applied to produce the final result.
Usually, this has performance advantages, since a single FMA instruction executes instead of the two instructions FMUL and FADD, and all these instructions have similar latency. Usually, it also has accuracy advantages, as the use of FMA eliminates one rounding step and guards against subtractive cancellation when a*b and c have opposite signs.
In this case, as noted by OP, the GPU result computed with FMA is slightly more accurate than the host result computed without FMA. Using a higher precision reference, I find that the relative error in the GPU result is -4.21e-8, while the relative error in the host result is 4.95e-8.
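As a hedged illustration (not part of the original answer), the same effect can be reproduced on the host with C's fmaf(), which applies a single rounding to a*b+c just as the GPU's FMA instruction does. Compile with contraction disabled (e.g. gcc -O3 -ffp-contract=off fma_demo.c -lm, file name arbitrary) so the plain expression really uses a separate multiply and add:
#include <stdio.h>
#include <math.h>
int main(void)
{
    float a = 5876.0f, b = 0.4474222958088f;
    int   c = 664;
    float roffout2 = a * a / 64.f;
    float separate = roffout2 * b + c * c;              /* FMUL then FADD: two roundings */
    float fused    = fmaf(roffout2, b, (float)(c * c)); /* fused multiply-add: one rounding */
    printf("separate (host-style): %1.12e\n", separate);
    printf("fused    (GPU-style) : %1.12e\n", fused);
    printf("difference           : %1.12e\n", fused - separate);
    return 0;
}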

is it data race in nested thrust functor

I have tested this snippet and tried to explain its cause, as well as find a way to resolve it, but have failed to do so.
#include <thrust/inner_product.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <cmath>
#include <boost/concept_check.hpp>
struct alter_tuple {
alter_tuple(const int& a_, const int& b_) : a(a_), b(b_){};
__host__ __device__
thrust::tuple<int,int> operator()(thrust::tuple<int,int> X)
{
int Xx = thrust::get<0>(X);
int Xy = thrust::get<1>(X);
int Xpx = a*Xx-b*Xy;
int Xpy = -b*Xx+a*Xy;
printf("in (%d,%d) -> (%d,%d)\n",Xx,Xy,Xpx,Xpy);
return thrust::make_tuple(Xpx,Xpy);
}
int a; // these variables a,b are shared between different threads used by this functor kernel
int b; // which easily creates racing problem
};
struct alter_tuple_arr {
alter_tuple_arr(int* a_, int* b_, int* c_, int* d_) : a(a_), b(b_), c(c_), d(d_) {};
__host__ __device__
thrust::tuple<int,int> operator()(const int& idx)
{
int Xx = a[idx];
int Xy = b[idx];
int Xpx = a[idx]*Xx-b[idx]*Xy;
int Xpy = -b[idx]*Xx+a[idx]*Xy;
printf("in (%d,%d) -> (%d,%d)\n",Xx,Xy,Xpx,Xpy);
return thrust::make_tuple(Xpx,Xpy);
}
int* a;
int* b;
int* c;
int* d;
};
struct bFuntor
{
bFuntor(int* av__, int* bv__, int* cv__, int* dv__, const int& N__) : av_(av__), bv_(bv__), cv_(cv__), dv_(dv__), N_(N__) {};
__host__ __device__
int operator()(const int& idx)
{
thrust::device_ptr<int> av_dpt = thrust::device_pointer_cast(av_);
thrust::device_ptr<int> av_dpt1 = thrust::device_pointer_cast(av_+N_);
thrust::device_ptr<int> bv_dpt = thrust::device_pointer_cast(bv_);
thrust::device_ptr<int> bv_dpt1 = thrust::device_pointer_cast(bv_+N_);
thrust::device_ptr<int> cv_dpt = thrust::device_pointer_cast(cv_);
thrust::device_ptr<int> cv_dpt1 = thrust::device_pointer_cast(cv_+N_);
thrust::device_ptr<int> dv_dpt = thrust::device_pointer_cast(dv_);
thrust::device_ptr<int> dv_dpt1 = thrust::device_pointer_cast(dv_+N_);
thrust::detail::normal_iterator<thrust::device_ptr<int>> a0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(av_dpt);
thrust::detail::normal_iterator<thrust::device_ptr<int>> a1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(av_dpt1);
thrust::detail::normal_iterator<thrust::device_ptr<int>> b0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(bv_dpt);
thrust::detail::normal_iterator<thrust::device_ptr<int>> b1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(bv_dpt1);
thrust::detail::normal_iterator<thrust::device_ptr<int>> c0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(cv_dpt);
thrust::detail::normal_iterator<thrust::device_ptr<int>> c1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(cv_dpt1);
thrust::detail::normal_iterator<thrust::device_ptr<int>> d0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(dv_dpt);
thrust::detail::normal_iterator<thrust::device_ptr<int>> d1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(dv_dpt1);
// ** alter_tuple is WRONG
#define WRONG
#ifdef WRONG
thrust::transform(thrust::device,
thrust::make_zip_iterator(thrust::make_tuple(a0,b0)),
thrust::make_zip_iterator(thrust::make_tuple(a1,b1)),
// thrust::make_zip_iterator(thrust::make_tuple(cv_dpt,dv_dpt)), // cv_dpt
thrust::make_zip_iterator(thrust::make_tuple(c0,d0)), // cv_dpt
alter_tuple(cv_[idx],dv_[idx]));
#endif
#ifdef RIGHT
// ** alter_tuple_arr is CORRECT way to do it
thrust::transform(thrust::device,
thrust::counting_iterator<int>(0),
thrust::counting_iterator<int>(N_),
// thrust::make_zip_iterator(thrust::make_tuple(cv_dpt,dv_dpt)), // cv_dpt
thrust::make_zip_iterator(thrust::make_tuple(c0,d0)), // cv_dpt
alter_tuple_arr(av_,bv_,cv_,dv_));
#endif
for (int i=0; i<N_; i++)
printf("out: (%d,%d) -> (%d,%d)\n",av_[i],bv_[i],cv_[i],dv_[i]);
return cv_dpt[idx];
}
int* av_;
int* bv_;
int* cv_;
int* dv_;
int N_;
float af; // are these variables host side or device side??
};
__host__ __device__
unsigned int hash(unsigned int a)
{
a = (a+0x7ed55d16) + (a<<12);
a = (a^0xc761c23c) ^ (a>>19);
a = (a+0x165667b1) + (a<<5);
a = (a+0xd3a2646c) ^ (a<<9);
a = (a+0xfd7046c5) + (a<<3);
a = (a^0xb55a4f09) ^ (a>>16);
return a;
}
int main(void)
{
int N = 10;
std::vector<int> av,bv,cv,dv;
unsigned int seed = hash(10);
thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> u01(0,10);
for (int i=0;i<N;i++) {
av.push_back((int)u01(rng));
bv.push_back((int)u01(rng));
cv.push_back((int)u01(rng));
dv.push_back((int)u01(rng));
// printf("%d %d %d %d \n",av[i],bv[i],cv[i],dv[i]);
}
thrust::device_vector<int> av_d(N);
thrust::device_vector<int> bv_d(N);
thrust::device_vector<int> cv_d(N);
thrust::device_vector<int> dv_d(N);
av_d = av; bv_d = bv; cv_d = cv; dv_d = dv;
thrust::transform(thrust::counting_iterator<int>(0),
thrust::counting_iterator<int>(N),
cv_d.begin(),
bFuntor(thrust::raw_pointer_cast(av_d.data()),
thrust::raw_pointer_cast(bv_d.data()),
thrust::raw_pointer_cast(cv_d.data()),
thrust::raw_pointer_cast(dv_d.data()),
N));
thrust::host_vector<int> bv_h(N);
thrust::copy(bv_d.begin(), bv_d.end(), bv_h.begin()); // probably I forgot this! to copy back the result from device to host!
return 0;
}
In these nested thrust calls, two nested functors were tested, and one of them worked (the one with "#define RIGHT"). In the case of the WRONG functor, i.e. alter_tuple:
Where do the two variables int a, int b reside? Host or device? Local kernel registers? Or are they shared between the threads running this functor's operator?
Inside the alter_tuple functor, I tried to print out the result (in printf("in...")) and the calculation is correct. However, when this result is returned to the caller functor and printed out (in printf("out....")), the values are incorrect and differ from the previous calculation.
How can these results be different? I can't seem to explain it, and there are no documents or examples to refer to.
This difference is shown in the output here.
Edit 1:
A minimum-size test code shows that the functors (literally, a*x = y) in both cases receive/initialize their values correctly: SO_example_no_tuple_arr_wo_c.cu
Its printout is:
out: 9*8 -> 72
out: 9*8 -> 72
out: 9*8 -> 72
out: 6*4 -> 24
out: 6*4 -> 24
out: 6*4 -> 24
out: 1*8 -> 8
out: 1*8 -> 8
out: 1*6 -> 6
out: 9*1 -> 9
out: 9*1 -> 9
which shows the correct received values.
A minimum test code that passes the input values without a pointer/array shows that, even though the input values are correctly initialized, the returned results are wrong: SO_example_no_tuple.cu
Its output in the case N=2:
in 9*8 -> 72
in 6*4 -> 24
in 9*8 -> 72
in 6*4 -> 24
out: 9*8 -> 24
out: 9*8 -> 24
out: 6*4 -> 24
out: 6*4 -> 24
The difference in values is not strictly due to a data race problem.
Your two approaches do not do the same thing, and it has to do with the values of a and b that will be selected for each invocation of the nested thrust::transform call. This is evident if you set N = 1, which should remove any concerns about data racing. The results are still different.
In the "failing" case, you are invoking the alter_tuple() operator like so:
thrust::transform(thrust::device,
...
alter_tuple(cv_[idx],dv_[idx]));
These values (cv_[idx], dv_[idx]) then become your initializing parameters ending up in a and b variables inside the functor. But your "passing" case is effectively initializing these variables differently, using a[idx] and b[idx], which correspond to av_[idx] and bv_[idx]. If we change the alter_tuple invocation to use a and b:
alter_tuple(av_[idx],bv_[idx]));
then the N = 1 case results now match. This was easier to understand because we had in fact only one entry in the a, b, c, d vectors.
When we expand to the N = 10 case, however, we no longer get matching results. To explain why, we need to understand the use of a and b inside the functor in this case. In the "failing" case, we are passing a single initializing value for each of a and b as used in the functor:
alter_tuple(av_[idx],bv_[idx]));
so, for a given thread, which means for a given invocation of the nested thrust::transform call, a single value will be used for a and b:
alter_tuple(const int& a_, const int& b_) : a(a_), b(b_){};
...
int a; // these values are constant across variation of "idx"
int b; // passed to the functor
on the other hand, in the "passing" case, the a and b values will vary for each element passed to the functor, within the nested transform call:
thrust::tuple<int,int> operator()(const int& idx)
{
int Xx = a[idx]; // these values of a and b *vary* for each idx
int Xy = b[idx]; // passed to the functor
Once that is understood, if the "passing" case is the desired case, then I have no idea how to transform the first case to produce passing results, as there is no way you can cause a single initializing value to take on the behavior of the varying values for a and b in the "passing" case.
None of the above involves data racing, but since each thread is writing to every value of c and d, I don't think this overall approach makes any sense, and I'm not sure what you are trying to accomplish. I think if you expanded this to more elements/threads, then you could certainly experience unpredictable/variable results.
To answer some of your other questions, the variables a and b end up as thread-local variables, on the device. So each data member in either functor is a thread-local variable on the device.
Inside the alter_tuple functor, I tried to print out the result (in printf("in...")) and the calculation is correct. However, when this result is returned to the caller functor and printed out (in printf("out....")), the values are incorrect and differ from the previous calculation.
Each thread writes to the same locations in the c and d vectors. Therefore, since each thread writes to the entire vector, but (in the failing case) each thread uses a different initializing value for a and b inside the functor, it stands to reason that each thread will compute different values for c and d, and the results you get after completion of the thrust call will depend on which thread "wins" the output write operation. This is unpredictable, and certainly not every thread's printout will match the final result, because each thread computes different values for c and d.
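To make the distinction concrete, here is a hedged, standalone sketch (not the OP's code; scale_fixed and scale_indexed are made-up names): a functor whose member is initialized with a scalar applies the same a to every element, while a functor holding a raw device pointer reads a different a[idx] per element.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdio>
struct scale_fixed {                    // a is fixed once, at construction time
    int a;
    scale_fixed(int a_) : a(a_) {}
    __host__ __device__ int operator()(int x) const { return a * x; }
};
struct scale_indexed {                  // a varies per element via the raw pointer
    const int *a;
    scale_indexed(const int *a_) : a(a_) {}
    __host__ __device__ int operator()(int idx) const { return a[idx] * idx; }
};
int main()
{
    thrust::device_vector<int> a(4);
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    thrust::device_vector<int> out(4);
    // fixed: every element is scaled by the single value captured on the host (a[0])
    thrust::transform(a.begin(), a.end(), out.begin(), scale_fixed(a[0]));
    for (int i = 0; i < 4; i++) std::printf("fixed   out[%d]=%d\n", i, (int)out[i]);
    // indexed: each element uses its own a[idx]
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(4),
                      out.begin(),
                      scale_indexed(thrust::raw_pointer_cast(a.data())));
    for (int i = 0; i < 4; i++) std::printf("indexed out[%d]=%d\n", i, (int)out[i]);
    return 0;
}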

Best strategy for grid search with CUDA

Recently I started working with CUDA and I read an introductory book on the computing language. To see if I understood it well, I considered the following problem.
Consider minimizing a function f(x,y) on the grid [-1,1] X [-1,1]. This provided me with a few practical questions, and I would like to have your take on things.
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
Suppose I don't explicitly make a grid. I could discretise the X and Y directions with constant float arrays (which provide fast memory access) and then use a 1D list of blocks.
Thanks!
This was an interesting question for me because it represents a type of problem that I think is rare:
potentially high compute load
little to no data that needs to be communicated host->device
very low volume of results that need to be communicated device->host
In other words, pretty much all compute, with not much dependence on data transfer, or even global memory usage/bandwidth.
Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. But as a learning exercise, it's interesting (to me, anyway). It may also be useful for functions that are otherwise difficult to handle such as functions with discontinuities or other irregularities.
To answer your questions:
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
I wouldn't bother calculating the grid on the CPU. (I assume by "grid" you mean the functional value of f at each point on the grid.) First of all, this is a fairly computationally intensive task - which GPUs are good at, and secondly, it is potentially a large data set, so transferring it to the GPU (so the GPU can then do the search) will take time. I propose to let the GPU do this (compute the functional value at each grid point.) Since we won't be using global access to data for this, texture memory is not an issue.
Suppose I don't explicitly make a grid. I could discretise the X and Y directions with constant float arrays (which provide fast memory access) and then use a 1D list of blocks.
Yes, you could use a 1D array of blocks (list) or a 2D array. I don't think this significantly impacts the problem either way, and I think the 2D grid approach fits the problem better (and I think allows for slightly cleaner code) so I would suggest starting with a 2D array of blocks.
Here's a sample code that might be interesting to play with or crystallize ideas. Each thread has the responsibility to compute its respective value of x and y, and then the functional value f at that point. Then a reduction followed by a block-draining reduction is used to search over all computed values for the minimum value (in this case).
$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>
// grid dimensions and divisions
#define XNR -1.0f
#define XPR 1.0f
#define YNR -1.0f
#define YPR 1.0f
#define DX 0.0001f
#define DY 0.0001f
// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16
// optimization functions - these are currently set for minimization
#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)
// error check macro
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) +0.1f;
}
// variable for block-draining reduction block counter
__device__ int blkcnt = 0;
// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){
__shared__ float sh_f[BLK_X*BLK_Y];
__shared__ float sh_x[BLK_X*BLK_Y];
__shared__ float sh_y[BLK_X*BLK_Y];
__shared__ int lblock;
// compute x,y coordinates for this thread
float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;
int thid = (threadIdx.y*BLK_X)+threadIdx.x;
lblock = 0;
sh_x[thid] = x;
sh_y[thid] = y;
sh_f[thid] = f(x,y); // compute functional value of f(x,y)
__syncthreads();
// perform block-level shared memory reduction
// assume block size is a power of 2
for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
if (thid < i)
if (TST(sh_f[thid],sh_f[thid+i])){
sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
__syncthreads();}
volatile float *vf = sh_f;
volatile float *vx = sh_x;
volatile float *vy = sh_y;
for (int i = 16; i > 0; i>>=1)
if (thid < i)
if (TST(vf[thid],vf[thid+i])){
vf[thid] = OPT(vf[thid],vf[thid+i]);
vx[thid] = OPT(vx[thid],vx[thid+i]);
vy[thid] = OPT(vy[thid],vy[thid+i]);}
// save block reduction result, and check if last block
if (!thid){
bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
int myblock = atomicAdd(&blkcnt, 1);
if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
__syncthreads();
if (lblock){
// do last-block reduction
float my_x, my_y, my_f;
int myid = thid;
if (myid < gridDim.x * gridDim.y){
my_x = bx[myid];
my_y = by[myid];
my_f = bf[myid];}
else { assert(0);} // does not work correctly if block dims are greater than grid dims
myid += blockDim.x*blockDim.y;
while (myid < gridDim.x*gridDim.y){
if TST(my_f,bf[myid]){
my_x = OPT(my_x,bx[myid]);
my_y = OPT(my_y,by[myid]);
my_f = OPT(my_f,bf[myid]);}
myid += blockDim.x*blockDim.y;}
sh_f[thid] = my_f;
sh_x[thid] = my_x;
sh_y[thid] = my_y;
__syncthreads();
for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
if (thid < i)
if (TST(sh_f[thid],sh_f[thid+i])){
sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
__syncthreads();}
if (!thid){
bf[0] = sh_f[0];
bx[0] = sh_x[0];
by[0] = sh_y[0];
}
}
}
// cpu (naive,serial) function for comparison
float3 opt_cpu(){
float optx = XNR;
float opty = YNR;
float optf = f(optx,opty);
for (float x = XNR; x < XPR; x += DX)
for (float y = YNR; y < YPR; y += DY){
float test = f(x,y);
if (TST(optf,test)){
optf = OPT(optf,test);
optx = OPT(optx,x);
opty = OPT(opty,y);}}
return make_float3(optf, optx, opty);
}
int main(){
// compute threadblock and grid dimensions
int nx = ceil(XPR-XNR)/DX;
int ny = ceil(YPR-YNR)/DY;
int bx = ceil(nx/(float)BLK_X);
int by = ceil(ny/(float)BLK_Y);
dim3 threads(BLK_X, BLK_Y);
dim3 blocks(bx, by);
float *d_bx, *d_by, *d_bf;
cudaFree(0);
// run GPU test case
unsigned long gtime = dtime_usec(0);
cudaMalloc(&d_bx, bx*by*sizeof(float));
cudaMalloc(&d_by, bx*by*sizeof(float));
cudaMalloc(&d_bf, bx*by*sizeof(float));
opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
float rf, rx, ry;
cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("some error");
gtime = dtime_usec(gtime);
printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);
//run CPU test case
unsigned long ctime = dtime_usec(0);
float3 cpu_res = opt_cpu();
ctime = dtime_usec(ctime);
printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$
Notes:
This problem is set up to find the minimum value of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain: x(-1,1), y(-1,1)
The test was run on Fedora 20, CUDA 7, Quadro5000 GPU (cc2.0) and a Xeon X5560 2.8GHz CPU. Different CPU or GPU will obviously affect the comparison.
The observed speedup here is about 14x. The CPU code is a naive, single threaded code.
It should be possible, for example, via modification of the OPT and TST macros, to perform a different kind of optimization - such as maximum instead of minimum (see the sketch after these notes).
The domain (and grid) dimensions and granularity to search over can be modified by the compile time constants such as XNR, XPR, etc.
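For instance, switching to maximization only requires flipping the comparison in TST (a small, untested tweak of the listing above):
// search for the maximum instead of the minimum
#define TST(X1,X2) ((X1)<(X2))
#define OPT(X1,X2) (X2)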

Simplest Possible Example to Show GPU Outperform CPU Using CUDA

I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU on a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of the GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#include <cstdio>
#define N (1024*1024)
#define M (1000000)
int main()
{
static float data[N]; // static: a 4 MB local array can overflow the default stack on some systems
for(int i = 0; i < N; i++)
{
data[i] = 1.0f * i / N;
for(int j = 0; j < M; j++)
{
data[i] = data[i] * data[i] - 0.25f;
}
}
int sel;
printf("Enter an index: ");
scanf("%d", &sel);
printf("data[%d] = %f\n", sel, data[sel]);
}
Use something like this in CUDA/C:
#include <cstdio>
#define N (1024*1024)
#define M (1000000)
__global__ void cudakernel(float *buf)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
buf[i] = 1.0f * i / N;
for(int j = 0; j < M; j++)
buf[i] = buf[i] * buf[i] - 0.25f;
}
int main()
{
static float data[N]; // static: avoid a 4 MB stack allocation
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
cudakernel<<<N/256, 256>>>(d_data);
cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
int sel;
printf("Enter an index: ");
scanf("%d", &sel);
printf("data[%d] = %f\n", sel, data[sel]);
}
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
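For completeness, assuming the two listings are saved as cpu.cpp and gpu.cu (the file names are arbitrary), they can be built and run with:
g++ -O2 cpu.cpp -o cpu && ./cpu
nvcc -O2 gpu.cu -o gpu && ./gpu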
A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and lends itself to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
Make sure you're not running, say, Crysis while running your benchmarks.
Shut down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two year old start Despicable Me by inserting the disk. Yay.)
As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
For reference, I made a similar example with time measurements. With GTX 660, the GPU speedup was 24X where its operation includes data transfers in addition to actual computation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024
void serial_add(double *a, double *b, double *c, int n, int m)
{
for(int index=0;index<n;index++)
{
for(int j=0;j<m;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
}
__global__ void vector_add(double *a, double *b, double *c)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
for(int j=0;j<M;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
int main()
{
clock_t start,end;
double *a, *b, *c;
int size = N * sizeof( double );
a = (double *)malloc( size );
b = (double *)malloc( size );
c = (double *)malloc( size );
for( int i = 0; i < N; i++ )
{
a[i] = b[i] = i;
c[i] = 0;
}
start = clock();
serial_add(a, b, c, N, M);
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
end = clock();
float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("Serial: %f seconds\n",time1);
start = clock();
double *d_a, *d_b, *d_c;
cudaMalloc( (void **) &d_a, size );
cudaMalloc( (void **) &d_b, size );
cudaMalloc( (void **) &d_c, size );
cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
free(a);
free(b);
free(c);
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
end = clock();
float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);
return 0;
}
I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the
int gpu = 1;
line in the hello.c source file to 0 for CPU, 1 for GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (at around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.