FFTW - Difference between FFTW_REDFT00 and FFTW_DHT flags

I am interested in doing a 1D FFT with FFTW on real data.
For this, I am using a cosine signal with a frequency of 10 Hz and a sampling frequency of sizex*frequency_signal, where sizex is the number of sampling points.
With the FFTW_DHT flag in fftw_plan_r2r_1d(sizex, Array, Array, FFTW_DHT, FFTW_ESTIMATE);, I get the Dirac impulse at f = 10 Hz. Here is the output (column 1: k*f_sampling/sizex, column 2: X_k) after the forward FFT:
0.000000 7.304123e-14
10.000000 5.000000e+01
20.000000 -2.227743e-14
30.000000 -1.300521e-14
40.000000 -3.774757e-15
50.000000 -2.989904e-15
60.000000 -4.879698e-15
70.000000 -2.838093e-15
80.000000 -5.479074e-16
90.000000 1.605429e-15
100.000000 -1.491050e-15
110.000000 -2.587601e-16
...
But with FFTW_REDFT00, I cannot get the Dirac impulse at f = 10 Hz. In this case, I get the following output:
0.000000 -1.998027e+00
10.000000 2.682414e+00
20.000000 9.843837e+01
30.000000 -1.543229e+00
40.000000 6.493255e-01
50.000000 -3.723752e-01
60.000000 2.449150e-01
70.000000 -1.744771e-01
80.000000 1.310807e-01
90.000000 -1.023168e-01
100.000000 8.221456e-02
110.000000 -6.758738e-02
...
Can I get the Dirac impulse at f = 10 Hz with the FFTW_REDFT00 flag?
What exactly is the difference between these two flags, i.e., how can I reproduce the results of FFTW_DHT with the FFTW_REDFT00 flag?
From the FFTW DFT documentation, I thought these two flags produced the same results, but apparently this is not the case.
I would just like to switch from one to the other. If I knew how to convert between them, it would help me with a code that uses the FFTW_REDFT00 flag.

First, note that in http://www.fftw.org/fftw3_doc/1d-Real_002deven-DFTs-_0028DCTs_0029.html, the REDFT00 definition is missing the factor of 2 before π in the argument of the cosine. That's why you saw a peak around 20 Hz instead of 10 Hz.
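For reference, the documented definition of REDFT00 is Y_k = X_0 + (-1)^k X_{n-1} + 2 * sum_{j=1..n-2} X_j cos(pi j k / (n-1)): the cosine argument carries pi, not the 2*pi of the ordinary DFT, so a 10 Hz signal lines up with the basis function of index k = 20.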
Second, REDFT00 is particularly tricky because it requires you to allocate one additional element. That is, Array should contain sizex + 1 elements, and you should create the plan as
fftw_plan_r2r_1d(sizex + 1, Array, Array, FFTW_REDFT00, FFTW_ESTIMATE);
Without the additional element, the peak is widened, as you have seen.
To avoid the peak widening, follow these rules. If you prepare the signal as
for (i = 0; i <= n; i++) a[i] = cos(2*M_PI*10*i/n);
Then detect it by:
fftw_plan_r2r_1d(n + 1, a, a, FFTW_REDFT00, FFTW_ESTIMATE);
If you prepare the signal as
for (i = 0; i < n; i++) a[i] = cos(2*M_PI*10*(i+0.5)/n);
Then detect it by:
fftw_plan_r2r_1d(n, a, a, FFTW_REDFT10, FFTW_ESTIMATE);
If you prepare the signal as
for (i = 0; i < n; i++) a[i] = cos(2*M_PI*10.25*i/n);
Then detect it by:
fftw_plan_r2r_1d(n, a, a, FFTW_REDFT01, FFTW_ESTIMATE);
If you prepare the signal as
for (i = 0; i < n; i++) a[i] = cos(2*M_PI*10.25*(i + .5)/n);
Then detect it by:
fftw_plan_r2r_1d(n, a, a, FFTW_REDFT11, FFTW_ESTIMATE);
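Putting the first rule together into a complete program (a minimal sketch; it assumes FFTW3 is installed and links with something like gcc redft.c -lfftw3 -lm): because of the pi-versus-2*pi convention noted above, the 10-cycle cosine produces its sharp peak at bin k = 20.

#include <stdio.h>
#include <math.h>
#include <fftw3.h>

int main(void)
{
    const int n = 100;  /* number of intervals; REDFT00 needs n + 1 samples */
    double a[101];

    /* rule 1: sample the 10-cycle cosine on the closed interval i = 0 .. n */
    for (int i = 0; i <= n; i++)
        a[i] = cos(2 * M_PI * 10 * i / n);

    fftw_plan p = fftw_plan_r2r_1d(n + 1, a, a, FFTW_REDFT00, FFTW_ESTIMATE);
    fftw_execute(p);

    /* expect a single sharp peak at k = 20 */
    for (int k = 0; k <= 25; k++)
        printf("%3d % e\n", k, a[k]);

    fftw_destroy_plan(p);
    return 0;
}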


#pragma acc host_data use_device() with complex variables

I need to pass an element of an array of pointers to #pragma acc host_data use_device():
static double *send_bufL[3];
static double *recv_bufL[3];
send_bufL[IDIR] = ARRAY_1D(NVAR*grid->nghost[IDIR]*nx2*nx3, double);
recv_bufL[IDIR] = ARRAY_1D(NVAR*grid->nghost[IDIR]*nx2*nx3, double);
#pragma acc enter data copyin(send_bufL[IDIR:1][:NVAR*grid->nghost[IDIR]*nx2*nx3], \
recv_bufL[IDIR:1][:NVAR*grid->nghost[IDIR]*nx2*nx3])
#pragma acc parallel loop collapse(4) present(d, grid, send_bufL[1:1][:NVAR*grid->nghost[JDIR]*nx1*nx3])
for (nv = 0; nv < NVAR; nv++){
  for (k = kbeg; k <= kend; k++){
    for (i = ibeg; i <= iend; i++){
      for (j = 0; j < grid->nghost[JDIR]; j++){
        index = nv*nx3*nx1*grid->nghost[JDIR] + (k-kbeg)*nx1*grid->nghost[JDIR] + (i-ibeg)*grid->nghost[JDIR] + j;
        send_bufL[JDIR][index] = d->Vc[nv][k][jbeg+j][i];
}}}}
count = NVAR*nx3*nx2*nghost;
#pragma acc host_data use_device(send_bufL, recv_bufL)
{
MPI_Isend (send_bufL[IDIR], count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[0]);
MPI_Irecv (recv_bufL[IDIR], count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[1]);
}
Written like this, with only send_bufL and recv_bufL in the use_device clause, I get:
[marco-Inspiron-7501:41130:0:41130] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7f4fafa608)
[marco-Inspiron-7501:41131:0:41131] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd697afa608)
Trying to add the array dimensions, I get compile errors. What can I do?
I should add that, apparently, for some elements of the array of pointers there is no need to use #pragma acc host_data use_device(): the compiler seems able to use the device buffers correctly. For other elements, however, this doesn't work and the host buffers are used instead, generating wrong results.
The problem is with the "send_bufL[IDIR]" and "recv_bufL[IDIR]". Since these are device pointers, dereferencing them on the host will give a seg fault.
I'm thinking that the best solution here would be to use temp pointers to the correct element in the buffers. Something like:
double * sbtmp;
double * rbtmp;
...
sbtmp = send_bufL[IDIR];
rbtmp = recv_bufL[IDIR];
#pragma acc host_data use_device(sbtmp, rbtmp)
{
MPI_Isend (sbtmp, count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[0]);
MPI_Irecv (rbtmp, count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[1]);
}
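For what it's worth, this works for the reason given above: host_data only translates the names listed in use_device to their device addresses, so with plain pointer variables like sbtmp and rbtmp, MPI receives the device buffers directly and no device pointer is ever dereferenced on the host, which is what send_bufL[IDIR] inside the region would have done.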
If this doesn't work, please post a small reproducing example and I'll see if we can find a solution.

How to avoid append when computing the intersection of two lists?

I am writing a function to compute the intersection between two sorted arrays (which may contain duplicates). So if the input is [0,3,7,7,7,9,12] and [2,7,7,8,12], the output should be [7,7,12], for example.
Here is my code:
cimport cython

@cython.wraparound(False)
@cython.cdivision(True)
@cython.boundscheck(False)
def sorting(int[:] A, int[:] B):
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef int lenA = A.shape[0]
    cdef int lenB = B.shape[0]
    intersect = []
    while (i < lenA and j < lenB):
        if A[i] == B[j]:
            intersect.append(A[i])
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        elif A[i] < B[j]:
            i += 1
    return intersect
As you will see, I use a list to store the answers and append to add the answers as they arrive. I am happy to return a python or numpy array if that will speed things up.
How can I avoid append to speed up the cython?
For this kind of thing you usually want to pre-allocate the array (it's basically free to shrink it later). In this case it can't be longer than the shortest of your input arrays, so that gives you a starting size:
cdef int[::1] intersect = np.empty(A.shape[0] if A.shape[0] < B.shape[0] else B.shape[0], dtype=np.intc)
You then just keep a running count of the index you're at in that array (say k), so append is replaced by:
intersect[k] = A[i]
k += 1
At the end you can either return the memoryview intersect[:k] or convert it to a numpy array with np.asarray(intersect[:k]).
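Putting the pieces together, the whole function might look like this (a sketch assembled from the snippets above; np.intc is used so the element type matches a C int):

import numpy as np
cimport cython

@cython.wraparound(False)
@cython.boundscheck(False)
def sorting(int[:] A, int[:] B):
    cdef Py_ssize_t i = 0, j = 0, k = 0
    cdef int lenA = A.shape[0]
    cdef int lenB = B.shape[0]
    # pre-allocate to the shorter input's length; shrink when returning
    cdef int[::1] intersect = np.empty(lenA if lenA < lenB else lenB, dtype=np.intc)
    while i < lenA and j < lenB:
        if A[i] == B[j]:
            intersect[k] = A[i]
            k += 1
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        else:
            i += 1
    return np.asarray(intersect[:k])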
As an aside: I'd remove the Cython directive @cython.cdivision(True), since you aren't doing any division. You should think about whether these directives are useful and whether they apply to your code, rather than copying them in blindly out of habit.

Summing up elements in an array using managedCuda

Problem Description
I am trying to get a kernel that sums up all elements of an array to work. The kernel is intended to be launched with 256 threads per block and an arbitrary number of blocks. The length of the array passed in as a is always a multiple of 512; in fact, it is #blocks * 512. One block of the kernel should sum up 'its' 512 elements (256 threads can sum up 512 elements using this algorithm), storing the result in out[blockIdx.x]. The final summation over the values in out, and therefore over the results of the blocks, will be done on the host.
This kernel works fine for up to 6 blocks, meaning up to 3072 elements. But launching it with more than 6 blocks results in the first block calculating a strictly greater, wrong result than the other blocks (e.g. out = {572, 512, 512, 512, 512, 512, 512}). This wrong result is reproducible; the wrong value is the same across multiple executions.
I guess this means there is a structural error somewhere in my code that has something to do with blockIdx.x, but its only use is to calculate blockStart, and that seems to be a correct calculation, also for the first block.
I verified that my host code computes the correct number of blocks for the kernel and passes in an array of the correct size. That's not the problem.
Of course I have read a lot of similar questions here on Stack Overflow, but none seems to describe my problem (see e.g. here or here).
The kernel is called via managedCuda (C#); I don't know if this might be a problem.
Hardware
I use an MX150 with the following specifications:
Revision Number: 6.1
Total global memory: 2147483648
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Max Threads per block: 1024
Max Blocks: 2147483648
Number of multiprocessors: 3
Code
Kernel
__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
    int tid = threadIdx.x;
    int blockStart = blockDim.x * blockIdx.x * 2;
    int i = tid + blockStart;
    int leftSumElementIdx = blockStart + tid * 2;

    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    __syncthreads();
    if (tid < 128)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 64)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 32)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 16)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 8)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 4)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 2)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid == 0)
    {
        out[blockIdx.x] = a[blockStart] + a[blockStart + 1];
    }
}
Kernel Invocation
//Get the cuda kernel
//PathToPtx and MangledKernelName must be replaced
CudaContext cntxt = new CudaContext();
CUmodule module = cntxt.LoadModule("pathToPtx");
CudaKernel vectorReduceAsSumKernel = new CudaKernel("MangledKernelName", module, cntxt);

//Get an array to reduce
float[] array = new float[4096];
for (int i = 0; i < array.Length; i++)
{
    array[i] = 1;
}

//Calculate execution info for the kernel
int threadsPerBlock = 256;
int numOfBlocks = array.Length / (threadsPerBlock * 2);

//Memory on the device
CudaDeviceVariable<float> m_d = array;
CudaDeviceVariable<float> out_d = new CudaDeviceVariable<float>(numOfBlocks);

//Give the kernel necessary execution info
vectorReduceAsSumKernel.BlockDimensions = threadsPerBlock;
vectorReduceAsSumKernel.GridDimensions = numOfBlocks;

//Run the kernel on the device
vectorReduceAsSumKernel.Run(out_d.DevicePointer, m_d.DevicePointer);

//Fetch the result
float[] out_h = out_d;

//Sum up the partial sums on the cpu
float sum = 0;
for (int i = 0; i < out_h.Length; i++)
{
    sum += out_h[i];
}

//Verify the correctness
if (sum != 4096)
{
    throw new Exception("Thats the wrong result!");
}
Update:
The very helpful and only answer addressed all my problems. Thank you! The problem was an unforeseen race condition.
Important hint:
In the comments, the author of managedCuda pointed out that all NPP methods are indeed already implemented in managedCuda (using ManagedCuda.NPP.NPPsExtensions;). I wasn't aware of that, and I guess neither are many people reading this question.
You are not correctly incorporating into your code the idea that each block will process 512 elements out of your total array. According to my testing, you need to make at least 2 changes to fix this:
In the kernel, you have incorrectly calculated the starting point for each block:
int blockStart = blockDim.x * blockIdx.x;
Since blockDim.x is 256 but each block processes 512 elements, you must multiply this by 2. (The multiplication by 2 in your calculation of leftSumElementIdx doesn't take care of this, since it only multiplies tid.)
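That is, the corrected starting point reads:

int blockStart = blockDim.x * blockIdx.x * 2;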
In your host code, your number of blocks calculation is incorrect:
vectorReduceAsSumKernel.GridDimensions = array.Length / threadsPerBlock;
For a value of 2048 for array.Length and a value of 256 for threadsPerBlock, this creates 8 blocks. But as you already indicated, your intention is to launch four blocks (2048/512). So you need to multiply the denominator by 2:
vectorReduceAsSumKernel.GridDimensions = array.Length / (2*threadsPerBlock);
In addition, your reduction sweep pattern is broken: its correctness depends on the order in which warps execute, and CUDA does not specify a warp execution order.
To see why, let's take a simple example. Let's consider just a single threadblock, with a starting point of the array being all 1, just as you have initialized it.
Now, warp 0 consists of threads 0-31. Your reduction sweep operation is like this:
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
So each thread in warp 0 will collect two other values and add them, and store them. Thread 31 will take the values a[62] and a[63] and add them together. If the values of a[62] and a[63] are still 1, as initialized, then this will work as expected. But the values of a[62] and a[63] are written to by warp 1, consisting of threads 32-63. So if warp 1 executes before warp 0 (perfectly legal), then you will get a different result. This is a global memory race condition. It is arising due to the fact that your input array is both the source and destination of your intermediate results, and __syncthreads() will not sort this out for you. It doesn't force warps to execute in any particular order.
One possible solution is to fix your sweep pattern. On any given reduction cycle, let's have a sweep pattern where each thread writes and reads values that are not touched by any other thread during that cycle. The following adaptation of your kernel code accomplishes that:
__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
    int tid = threadIdx.x;
    int blockStart = blockDim.x * blockIdx.x * 2;
    int i = tid + blockStart;
    for (int j = blockDim.x; j > 0; j >>= 1) {
        if (tid < j)
            a[i] += a[i + j];
        __syncthreads();
    }
    if (tid == 0)
    {
        out[blockIdx.x] = a[i];
    }
}
For general purpose reductions, this is still a very slow method. This tutorial covers how to write faster reductions. And, as already pointed out, managedCuda may have methods to avoid writing a kernel at all.
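To give a flavor of what the tutorial covers, here is an untested sketch of a shared-memory variant that keeps the 256-threads/512-elements-per-block convention (unlike the version above, it also leaves the input array intact):

__global__ void Vector_Reduce_As_Sum_Kernel(float* out, const float* a)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    int blockStart = blockDim.x * blockIdx.x * 2;
    // the first addition is performed while loading from global into shared memory
    s[tid] = a[blockStart + tid] + a[blockStart + tid + blockDim.x];
    __syncthreads();
    // same halving sweep as above, but entirely in shared memory
    for (int j = blockDim.x / 2; j > 0; j >>= 1) {
        if (tid < j)
            s[tid] += s[tid + j];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];
}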

Bit tricks to find the first position where the number of 0s equals the number of 1s

Suppose I have a 32 or 64 bit unsigned integer.
What is the fastest way to find the index i of the leftmost bit such that the number of 0s in the leftmost i bits equals the number of 1s in the leftmost i bits?
I was thinking of some bit tricks like the ones mentioned here.
I am interested in recent x86_64 processors. This might be relevant, as some processors support instructions such as POPCNT (count the number of 1s) or LZCNT (count the number of leading 0s).
If it helps, it is possible to assume that the first bit always has a certain value.
Example (with 16 bits):
If the integer is
1110010100110110b
         ^
         i
then i=10, corresponding to the marked position: the first 10 bits, 1110010100, contain five 1s and five 0s.
A possible (slow) implementation for 16-bit integers could be:
mask = 1000000000000000b
pos = 0
count = 0
do {
    if (x & mask)
        count++;
    else
        count--;
    pos++;
    x <<= 1;
} while (count);
return pos;
Edit: fixed bug in code as per @njuffa's comment.
I don't have any bit tricks for this, but I do have a SIMD trick.
First, a few observations:
Interpreting 0 as -1, this problem becomes "find the first i so that the first i bits sum to 0".
0 is even, but all the bits have odd values under this interpretation, which gives the insight that i must be even; the problem can therefore be analyzed in blocks of 2 bits.
01 and 10 don't change the balance.
After spreading the groups of 2 out to bytes (none of the following is tested),
// optionally use AVX2 _mm_srlv_epi32 instead of ugly variable set
__m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
_mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
spread = _mm_and_si128(spread, _mm_set1_epi8(3));
Replace 00 by -1, 11 by 1, and 01 and 10 by 0:
__m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1, 0,0,0,0,0,0,0,0,0,0,0,0),
spread);
Calculate the prefix sum:
__m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));
Find the highest 0:
__m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;
The << 15 and * 2 appear because the resulting mask is 16 bits but the clz operates on 32 bits. It is shifted one less than 16 because a zero in the top byte indicates that 1 group of 2 is taken, not zero groups.
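For reference, the snippets assemble into a single function along these lines (untested, like the snippets themselves; requires SSSE3 and a compiler with __builtin_clz; the name first_balanced_prefix is mine, and the function assumes a balanced prefix exists, since __builtin_clz(0) is undefined):

#include <stdint.h>
#include <immintrin.h>

int first_balanced_prefix(uint32_t x)
{
    // spread the 16 bit-duos of x into 16 bytes (duo 0 lands in byte 0)
    __m128i spread = _mm_shuffle_epi8(
        _mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
        _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
    spread = _mm_and_si128(spread, _mm_set1_epi8(3));
    // map 00 -> -1, 11 -> +1, 01/10 -> 0
    __m128i r = _mm_shuffle_epi8(
        _mm_setr_epi8(-1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
        spread);
    // suffix sums: byte i ends up holding the balance of the leftmost 16 - i duos
    __m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
    pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
    pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
    pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));
    // the highest zero byte marks the shortest balanced prefix
    __m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
    return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;
}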
This is a solution for 32-bit data using classical bit-twiddling techniques. The intermediate computation requires 64-bit arithmetic and logic operations. I have tried to stick to portable operations as far as possible. Required is an implementation of the POSIX function ffsll to find the least-significant 1-bit in a 64-bit long long, and a custom function rev_bit_duos that reverses the bit-duos in a 32-bit integer. The latter could be replaced with a platform-specific bit-reversal intrinsic, such as the __rbit intrinsic on ARM platforms.
The basic observation is that if a bit-group with an equal number of 0-bits and 1-bits can be extracted, it must contain an even number of bits. This means we can examine the operand in 2-bit groups. We can further restrict ourselves to tracking whether each 2-bit group increases (0b11), decreases (0b00), or leaves unchanged (0b01, 0b10) a running balance of bits. If we count positive and negative changes with separate counters, 4-bit counters will suffice unless the input is 0 or 0xffffffff, which can be handled separately. Based on comments to the question, these cases shouldn't occur. By subtracting the negative change count from the positive change count for each 2-bit group, we can find at which group the balance becomes zero. There may be multiple such bit groups; we need to find the first one.
The processing can be parallelized by expanding each 2-bit group into a nibble that can then serve as a change counter. The prefix sum can be computed via an integer multiply with an appropriate constant, which provides the necessary shift-and-add operations at each nibble position. Efficient ways of performing nibble-wise subtraction in parallel are well known; likewise, there is a well-known technique due to Alan Mycroft for detecting zero bytes that is trivially adaptable to zero-nibble detection. The POSIX function ffsll is then applied to find the bit position of that nibble.
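As a small illustration of the multiply-as-prefix-sum idea in 16-bit arithmetic: 0x0012 * 0x1111 = 0x3332 (truncated to 16 bits); reading nibbles from the least significant end, the input counters 2, 1, 0, 0 turn into the running sums 2, 3, 3, 3.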
Slightly problematic is the requirement to extract the left-most bit group rather than the right-most, since Alan Mycroft's trick only finds the first zero-nibble from the right. Also, handling the prefix sum for the left-most bit group requires a mulhi operation, which may not be easily available and may be less efficient than standard integer multiplication. I have addressed both of these issues by simply bit-reversing the original operand up front.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>  /* CHAR_BIT (needed by the reference implementation below) */

/* Reverse bit-duos using classic binary partitioning algorithm */
inline uint32_t rev_bit_duos (uint32_t a)
{
    uint32_t m;
    a = (a >> 16) | (a << 16);                            // swap halfwords
    m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
    m = (m << 4)^m; a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
    m = (m << 2)^m; a = ((a >> 2) & m) | ((a << 2) & ~m); // swap bit-duos
    return a;
}
/* Return the number of most significant (leftmost) bits that must be extracted
   to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
   Return 0 if no such bit group exists.
*/
int solution (uint32_t x)
{
    const uint64_t mask16 = 0x0000ffff0000ffffULL; // alternate half-words
    const uint64_t mask8  = 0x00ff00ff00ff00ffULL; // alternate bytes
    const uint64_t mask4h = 0x0c0c0c0c0c0c0c0cULL; // alternate nibbles, high bit-duo
    const uint64_t mask4l = 0x0303030303030303ULL; // alternate nibbles, low bit-duo
    const uint64_t nibble_lsb = 0x1111111111111111ULL;
    const uint64_t nibble_msb = 0x8888888888888888ULL;
    uint64_t a, b, r, s, t, expx, pc_expx, nc_expx;
    int res;
    /* common path can't handle all 0s and all 1s due to counter overflow */
    if ((x == 0) || (x == ~0)) return 0;
    /* make zero-nibble detection work, and simplify prefix sum computation */
    x = rev_bit_duos (x); // reverse bit-duos
    /* expand each bit-duo into a nibble */
    expx = x;
    expx = ((expx << 16) | expx) & mask16;
    expx = ((expx << 8) | expx) & mask8;
    expx = ((expx << 4) | expx);
    expx = ((expx & mask4h) * 4) + (expx & mask4l);
    /* compute positive and negative change counts for each nibble */
    pc_expx =  expx & ( expx >> 1) & nibble_lsb;
    nc_expx = ~expx & (~expx >> 1) & nibble_lsb;
    /* produce prefix sums for positive and negative change counters */
    a = pc_expx * nibble_lsb;
    b = nc_expx * nibble_lsb;
    /* subtract positive and negative prefix sums, nibble-wise */
    s = a ^ ~b;
    r = a | nibble_msb;
    t = b & ~nibble_msb;
    s = s & nibble_msb;
    r = r - t;
    r = r ^ s;
    /* find first nibble that is zero using Alan Mycroft's magic */
    r = (r - nibble_lsb) & (~r & nibble_msb);
    res = ffsll (r) / 2; // account for bit-duo to nibble expansion
    return res;
}
/* Return the number of most significant (leftmost) bits that must be extracted
   to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
   Return 0 if no such bit group exists.
*/
int reference (uint32_t x)
{
    int count = 0;
    int bits = 0;
    uint32_t mask = 0x80000000;
    do {
        bits++;
        if (x & mask) {
            count++;
        } else {
            count--;
        }
        x = x << 1;
    } while ((count) && (bits <= (int)(sizeof(x) * CHAR_BIT)));
    return (count) ? 0 : bits;
}

int main (void)
{
    uint32_t x = 0;
    do {
        uint32_t ref = reference (x);
        uint32_t res = solution (x);
        if (res != ref) {
            printf ("x=%08x res=%u ref=%u\n\n", x, res, ref);
        }
        x++;
    } while (x);
    return EXIT_SUCCESS;
}
A possible solution (for 32-bit integers). I'm not sure whether it can be improved or whether the lookup tables can be avoided. Here x is the input integer.
//Look-up table of 2^16 elements.
//The y-th element is associated with the first 2 bytes y of x.
//If the wanted bit is in y, LUT1[y] is minus the position of the bit.
//If the wanted bit is not in y, LUT1[y] is the number of excess ones in y, minus 1 (between 0 and 15).
LUT1 = ....
//Look-up table of 16 * 2^16 elements.
//The y-th element is associated with two integers y' and y'' of 4 and 16 bits, respectively.
//y' is the number of excess ones in the first two bytes of x, minus 1.
//y'' is the last two bytes of x. The table contains the answer to return.
LUT2 = ....

if (LUT1[x>>16] < 0)
    return -LUT1[x>>16];
return LUT2[(LUT1[x>>16] << 16) | (x & 0xFFFF)];
This requires ~1MB for the lookup tables.
The same idea also works using 4 lookup tables (one per byte of x). This requires more operations but brings the memory down to 12KB.
LUT1 = ... //2^8 elements
LUT2 = ... //8 * 2^8 elements
LUT3 = ... //16 * 2^8 elements
LUT4 = ... //24 * 2^8 elements

y = x>>24;
if (LUT1[y] < 0)
    return -LUT1[y];
y = (LUT1[y]<<8) | ((x>>16) & 0xFF);
if (LUT2[y] < 0)
    return -LUT2[y];
y = (LUT2[y]<<8) | ((x>>8) & 0xFF);
if (LUT3[y] < 0)
    return -LUT3[y];
return LUT4[(LUT3[y]<<8) | (x & 0xFF)];

How to optimize matrix multiplication using OpenACC?

I am learning OpenACC (with PGI's compiler) and trying to optimize the matrix multiplication example. The fastest implementation I have come up with so far is the following:
void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate)
{
#pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
  {
#pragma acc region if(accelerate)
    {
#pragma acc loop independent vector(32)
      for (int j = 0; j < N; j++)
      {
#pragma acc loop independent vector(32)
        for (int i = 0; i < N; i++)
        {
          float sum = 0;
          for (int k = 0; k < N; k++) {
            sum += a[i + k*N] * b[k + j*N];
          }
          r[i + j*N] = sum;
        }
      }
    }
  }
}
This results in thread blocks of size 32x32 threads and gives me the best performance so far.
Here are the benchmarks:
Matrix multiplication (1500x1500):
GPU: Geforce GT650 M, 64-bit Linux
Data sz : 1500
Unaccelerated:
matrix_mul() time : 5873.255333 msec
Accelerated:
matrix_mul() time : 420.414700 msec
Data size : 1750 x 1750
matrix_mul() time : 876.271200 msec
Data size : 2000 x 2000
matrix_mul() time : 1147.783400 msec
Data size : 2250 x 2250
matrix_mul() time : 1863.458100 msec
Data size : 2500 x 2500
matrix_mul() time : 2516.493200 msec
Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:
Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af: 0.03166 seconds
Arrayfire 1750 x 1750 matrix mul
af: 0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
af: 0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
af: 0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
af: 0.14795 seconds
I wonder if there are any suggestions on how to get better performance from OpenACC.
Perhaps my choice of directives is not right?
You're getting right at a 14x speedup, which is pretty good for PGI's compiler in my experience.
First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
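(For example, something along the lines of pgcc -acc -Minfo=accel matrix_mul.c, where the file name is of course yours to replace, prints the compiler's scheduling and data-movement decisions.)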
You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
Declaring a and b with restrict might let the compiler generate better code.
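For illustration only, an untested variant combining these two suggestions (16x16 scheduling and restrict on all three pointers) could look like:

void matrix_mul(float *restrict r, const float *restrict a,
                const float *restrict b, int N, int accelerate)
{
#pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
  {
#pragma acc region if(accelerate)
    {
#pragma acc loop independent vector(16)
      for (int j = 0; j < N; j++)
      {
#pragma acc loop independent vector(16)
        for (int i = 0; i < N; i++)
        {
          float sum = 0;
          for (int k = 0; k < N; k++)
            sum += a[i + k*N] * b[k + j*N];
          r[i + j*N] = sum;
        }
      }
    }
  }
}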
Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.
It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.