I am trying to optimize this FDTD code with CUDA Fortran. I have three sets of 3-D cube matrices: input, output and constants.
attributes (global) subroutine kernel_h(k,num_cells_x,num_cells_y,num_cells_z,Hx,Hy,Hz,Ex,Ey,Ez,Cbdx,Cbdy,Cbdz)
    implicit none
    integer :: idx,idy
    integer,value :: k,num_cells_x,num_cells_y,num_cells_z
    real(kind=8), intent(in), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Ex, Ey, Ez
    real(kind=8), intent(inout), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Hx, Hy, Hz
    real(kind=8), intent(in), constant, dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Cbdx,Cbdy,Cbdz
    idx = threadIdx%x + ((blockIdx%x-1) * blockDim%x)
    idy = threadIdx%y + ((blockIdx%y-1) * blockDim%y)
    do while (idx < num_cells_x)
        Hz(idx,idy,k) = Hz(idx,idy,k) + ((Ex(idx,idy+1,k)-Ex(idx,idy,k))*Cbdy(idx,idy,k) + (Ey(idx,idy,k)-Ey(idx+1,idy,k))*Cbdx(idx,idy,k))
        Hx(idx,idy,k) = Hx(idx,idy,k) + ((Ey(idx,idy,k+1)-Ey(idx,idy,k))*Cbdz(idx,idy,k) + (Ez(idx,idy,k)-Ez(idx,idy+1,k))*Cbdy(idx,idy,k))
        Hy(idx,idy,k) = Hy(idx,idy,k) + ((Ez(idx+1,idy,k)-Ez(idx,idy,k))*Cbdx(idx,idy,k) + (Ex(idx,idy,k)-Ex(idx,idy,k+1))*Cbdz(idx,idy,k))
        idx = idx + (blockDim%x * gridDim%x)
        idy = idy + (blockDim%y * gridDim%y)
    end do
end subroutine kernel_h
and my kernel launch is:
bdim=dim3(16,16,1)
gdim=dim3((num_cells_x+(bdim%x-1))/bdim%x,(num_cells_y+(bdim%y-1))/bdim%y,1)
do k=1,num_cells_z
    call kernel_h<<<gdim,bdim>>>(k,num_cells_x,num_cells_y,num_cells_z,Hx_d,Hy_d,Hz_d,Ex_d,Ey_d,Ez_d,Cbdx_d,Cbdy_d,Cbdz_d)
end do
My questions are: why can't I load more than a 100x100x100 matrix? If I try, I get a kernel launch failure error. And can I improve my code's performance? I think it could be written in a better way.
I would guess that you are accessing out of bounds.
Consider a 10x10x10 volume (x,y,z). In that case you will launch a single block of 16x16 threads. These threads will access a 17x17 slice (since the stencil radius is 1), which is clearly going to end up out of bounds. You would need to disable the threads that would access out of bounds, and also the threads whose stencil would reach beyond the boundary.
Consider looking at the FDTD3D sample in the CUDA SDK. Granted, it's in C, but it illustrates how to handle this problem, and it also shows how to use shared memory for a much more efficient implementation.
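To make the guarding concrete, here is a minimal CUDA C sketch (my own illustration with made-up names, not your Fortran code or the SDK sample) of a 2-D grid-stride update for one z-slice that keeps both the thread index and the +1 stencil reach inside the volume; the same guard logic carries over directly to the CUDA Fortran kernel:
// Hypothetical sketch of a guarded H-field update for one z-slice.
// nx, ny are the slice dimensions; only the Hz curl term is shown.
__global__ void kernel_h_guarded(int k, int nx, int ny,
                                 const float *Ex, const float *Ey,
                                 float *Hz, const float *Cbdx, const float *Cbdy)
{
    int idx0 = blockIdx.x * blockDim.x + threadIdx.x;   // 0-based indices here
    int idy0 = blockIdx.y * blockDim.y + threadIdx.y;

    // Grid-stride loops with guards: the idx+1 / idy+1 stencil accesses
    // stay inside the nx-by-ny slice.
    for (int idy = idy0; idy < ny - 1; idy += blockDim.y * gridDim.y) {
        for (int idx = idx0; idx < nx - 1; idx += blockDim.x * gridDim.x) {
            size_t c  = ((size_t)k * ny + idy) * nx + idx;  // (idx, idy, k)
            size_t xp = c + 1;                              // (idx+1, idy, k)
            size_t yp = c + nx;                             // (idx, idy+1, k)
            Hz[c] += (Ex[yp] - Ex[c]) * Cbdy[c] + (Ey[c] - Ey[xp]) * Cbdx[c];
        }
    }
}
Note that the grid-stride increments live in separate for loops here, so idx and idy advance independently instead of being incremented together inside a single while loop as in your kernel.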
EDIT: new minimal working example to illustrate the question and better explanation of nvvp's outcome (following suggestions given in the comments).
So, I have crafted a "minimal" working example, which follows:
#include <cuComplex.h>
#include <iostream>
int const n = 512 * 100;
typedef float real;
template < class T >
struct my_complex {
    T x;
    T y;
};

__global__ void set( my_complex< real > * a )
{
    my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
    d = { 1.0f, 0.0f };
}

__global__ void duplicate_whole( my_complex< real > * a )
{
    my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
    d = { 2.0f * d.x, 2.0f * d.y };
}

__global__ void duplicate_half( real * a )
{
    real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
    d *= 2.0f;
}

int main()
{
    my_complex< real > * a;
    cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );

    set<<< n, 1024 >>>( a );
    cudaDeviceSynchronize();
    duplicate_whole<<< n, 1024 >>>( a );
    cudaDeviceSynchronize();
    duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
    cudaDeviceSynchronize();

    my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
    cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );

    std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t'
              << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )" << std::endl;

    return 0;
}
When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.
However, when I analyze the kernels using nvvp I get different reports for each of the kernels, in the following sense. For kernel duplicate_whole, nvvp warns me that at the line d = { 2.0f * d.x, 2.0f * d.y }; the kernel is performing
Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4
I agree that I am loading 8-byte words. What I do not understand is why 4 bytes is the ideal word size. In particular, there is no performance difference between the kernels.
I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?
And why is it that I do not get a performance hit?
I hope that this edit has clarified some unclear points.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I'll start with some kernel code to exemplify my question, which follows below:
template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
                                        std::complex< data_t > * data,
                                        int M,
                                        int row_length,
                                        int b,
                                        int i_0
                                      )
{
#ifndef CUGALE_MUL_SHUFFLE
    // Output array length:
    int plane_area = row_length * M;
    // Process element:
    int i = blockIdx.x * row_length + threadIdx.x + i_0;
    my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
    my_complex< data_t > datum;
    my_complex< data_t > datum_new;
    for ( int i_b = 0; i_b < b; ++ i_b )
    {
        my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
        datum = ref_datum;
        datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
        datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
        ref_datum = datum_new;
    }
#else
    // Output array length:
    int plane_area = row_length * M;
    // Element to process:
    int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
    my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
    // Real and imaginary part of datum (not respectively for odd threads):
    data_t datum_a;
    data_t datum_b;
    // Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
    int parity = ( threadIdx.x % 2 );
    int shuffle_dir = 1 - 2 * parity;
    int inwarp_tid = threadIdx.x % warpSize;
    for ( int i_b = 0; i_b < b; ++ i_b )
    {
        int data_idx = i_b * plane_area + i;
        datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
        datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );
        // Even TIDs compute real part, odd TIDs compute imaginary part:
        reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
    }
#endif // #ifndef CUGALE_MUL_SHUFFLE
}
Let us consider the case where data_t is float, in which case the kernel is memory-bandwidth limited. As can be seen above, there are two versions of the kernel: one reads/writes 8 bytes (a whole complex number) per thread, and the other reads/writes 4 bytes per thread and then shuffles the results so that the complex product is computed correctly.
The reason I wrote the version using shuffle is that nvvp insisted that reading 8 bytes per thread was not the best idea, because this memory access pattern would be inefficient. This is the case even though on both systems tested (a GTX 1050 and a GTX Titan Xp) memory bandwidth was very close to the theoretical maximum.
Sure enough, I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:
Why is it that nvvp reports that reading 8 bytes would be less efficient than reading 4 bytes per thread? In which circumstances would that be the case?
As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double there is no execution-time difference between the two kernel versions either, even though in that case the kernel is compute bound and the shuffle version performs somewhat more flops than the original version.
Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M lines), and the chirp_factor array is row_length * M. Both kernels run perfectly fine (I can edit the question to show you the calls to both versions if you have doubts about it).
The issue here has to do with how the compiler is processing your code. nvvp is merely dutifully reporting what is happening when you run your code.
If you use the cuobjdump -sass tool on your executable, you will discover that the duplicate_whole routine is doing two 4-byte loads and two 4-byte stores. This is not optimal, partly because there is a stride in each load and store (each load and store touches alternate elements in memory).
The reason for this is that the compiler does not know the alignment of your my_complex struct. Your struct would be legal for use in situations that would prevent the compiler from generating a (legal) 8-byte load. As discussed here we can fix this by informing the compiler that we only intend to use the struct in alignment scenarios where a CUDA 8-byte load is legal (i.e. it is "naturally aligned"). The modification to your struct looks like this:
template < class T >
struct __align__(8) my_complex {
    T x;
    T y;
};
With that change to your code, the compiler generates 8-byte loads for the duplicate_whole kernel, and you should see a different report from the profiler. You should use this sort of decoration only when you understand what it means and are willing to enter into a contract with the compiler that you will ensure this is the case. If you do something unusual, like unusual pointer casting, you can violate your end of the bargain and generate a machine fault.
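As a side note, here is a small sketch of my own (not code from the question) of what the 8-byte access amounts to once the alignment is promised: with the __align__(8) struct in place you could equally well read each complex number through CUDA's built-in float2 type, which corresponds to the single 8-byte load/store per thread that the compiler now emits for duplicate_whole:
// Sketch only: equivalent to duplicate_whole once my_complex< real > is 8-byte aligned.
// The reinterpret_cast to float2 is legal here precisely because of that alignment promise.
__global__ void duplicate_whole_vec( my_complex< real > * a )
{
    float2 * p = reinterpret_cast< float2 * >( a );
    float2 v = p[ blockIdx.x * 1024 + threadIdx.x ];   // one 8-byte load
    v.x *= 2.0f;
    v.y *= 2.0f;
    p[ blockIdx.x * 1024 + threadIdx.x ] = v;          // one 8-byte store
}
Re-running cuobjdump -sass after the change should show single 64-bit load/store instructions in place of the split 32-bit pairs.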
The reason you don't see much performance difference almost certainly has to do with CUDA load/store behavior and the GPU caches.
When you do a strided load, the GPU loads an entire cacheline anyway, even though (in this case) you only need half the elements (the real elements) for that particular load operation. However you need the other half of the elements (the imaginary elements) anyway; they will be loaded on the next instruction, and this instruction most likely hits in the cache, due to the previous load.
On a strided store in this case, writing strided elements in one instruction and the alternate elements in the next instruction will end up using one of the caches as a "coalescing buffer". This isn't coalescing in the typical sense used in CUDA terminology; that sort of coalescing only applies to a single instruction. However the cache "coalescing buffer" behavior allows it to "accumulate" multiple writes to an already-resident line, before that line gets written out or evicted. This is approximately equivalent to "write-back" cache behavior.
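As a rough illustration of that cache behaviour (again a sketch of my own, not the actual generated code), the unaligned struct effectively made duplicate_whole behave like the following explicitly split version, and it is this pattern that the caches absorb:
// Sketch: roughly what the compiler generated before __align__(8) was added.
// a points at the same interleaved {x, y} storage, viewed as plain floats.
__global__ void duplicate_whole_split( real * a )
{
    int idx = blockIdx.x * 1024 + threadIdx.x;
    real x = a[ 2 * idx ];        // strided load of the real parts; full lines are fetched
    real y = a[ 2 * idx + 1 ];    // the imaginary parts then mostly hit in the cache
    a[ 2 * idx ]     = 2.0f * x;  // strided stores: the cache acts as a "coalescing buffer",
    a[ 2 * idx + 1 ] = 2.0f * y;  // collecting both halves of a line before write-back
}
Either way every byte of each line is eventually consumed, which is why the measured bandwidth, and hence the runtime, comes out the same.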
I am attempting to use PGFortran for CUDA. I installed PGFortran on my computer and linked everything up to the best of my knowledge. To get going I decided to follow a tutorial listed here. When trying to compile the code:
module mathOps
contains
    attributes(global) subroutine saxpy(x, y, a)
        implicit none
        real :: x(:), y(:)
        real, value :: a
        integer :: i, n
        n = size(x)
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= n) y(i) = y(i) + a*x(i)
    end subroutine saxpy
end module mathOps

program testSaxpy
    use mathOps
    use cudafor
    implicit none
    integer, parameter :: N = 40000
    real :: x(N), y(N), a
    real, device :: x_d(N), y_d(N)
    type(dim3) :: grid, tBlock
    tBlock = dim3(256,1,1)
    grid = dim3(ceiling(real(N)/tBlock%x),1,1)
    x = 1.0; y = 2.0; a = 2.0
    x_d = x
    y_d = y
    call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
    y = y_d
    write(*,*) 'Max error: ', maxval(abs(y-4.0))
end program testSaxpy
I got:
PGF90-S-0034-Syntax error at or near identifier saxpy (main.cuf: 29)
0 inform, 0 warnings, 1 severes, 0 fatal for testsaxpy
The error points to the line call saxpy<<<grid, tblock="">>>(x_d, y_d, a). For some reason it apparently hates the fact that I am using <<< and >>>. Going by the tutorial, those triple chevrons are meant to be there:
The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel.
Removing these chevrons would not make any sense, since they are the whole point of the program. So why does PGFortran dislike this?
As for the compilation: I followed the tutorial by using pgf90 -o saxpy main.cuf, but since that gave an error I also tried pgf90 -Mcuda -o saxpy main.cuf. Same results.
There does seem to be a text error in that blog at the kernel invocation line:
call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
tblock="" is not correct. You'll notice elsewhere in that blog text, the kernel invocation line is given correctly as:
call saxpy<<<grid,tBlock>>>(x_d, y_d, a)
So if you change that line accordingly in your actual code, I think you'll have better results.
I have a problem with the data copy from host to device. Here is my problem. I have arrays defined as
real, allocatable :: cpuArray(:,:,:)
real, device, allocatable :: gpuArrray(:,:,:)
allocate(cpuArray(0:imax-1,0:jmax-1,0:kmax-1))
allocate(gpuArrray(-1:imax,-1:jmax,-1:kmax))
!array initialization
cpuArrray = randomValue !non 0 value
gpuArray = 0.0 !first 0 all gpu array elements
gpuArrray(0:imax-1,0:jmax-1,0:kmax-1)= cpuArray
My expectation is that only the designated index range in the gpuArray will receive data from the host; however, it does not work.
Could you help me find what is wrong with this?
PS: I based my approach on this tutorial from the PGI home page.
--
When I give both the cpuArray and the gpuArray the same dimensions, I get exactly the correct result. But the current situation produces 0 for every element of the gpuArray. I changed the default value to something non-zero (i.e. gpuArray = 10.0, to first set all gpu array elements to 10), but the result is still 0.
My apologies to the whole community. I was able to solve my problem. It was a silly bug I introduced in the test program: instead of cpuArrray = cpuArray(0:imax-1,0:jmax-1,0:kmax-1) in the check program, I did cpuArrray = cpuArray. So the program was working well, but the result-check program was buggy.
Thank you for your follow up.
For your reference, here is part of the program (it can be built and run):
module mytest
use cudafor
implicit none
integer :: imax , jmax, kmax
integer :: i,j,k
!host arrays
real,allocatable:: h_a(:,:,:)
real,allocatable:: h_b(:,:,:)
real,allocatable:: h_c(:,:,:)
!device array
real,device,allocatable:: d_b(:,:,:)
real,device,allocatable:: d_c(:,:,:)
real,device,allocatable:: d_b_copy(:,:,:)
real,device,allocatable:: d_c_copy(:,:,:)
contains
attributes(global) subroutine testdata()
integer :: d_i, d_j,d_k
d_i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
d_j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do d_k = 0, 1
d_b_copy(d_i, d_j, d_k) = d_b(d_i, d_j, d_k)
d_c_copy(d_i, d_j, d_k) = d_c(d_i, d_j, d_k)
end do
end subroutine testdata
end module mytest
program Test
use mytest
type(dim3) :: dimGrid, dimBlock,dimGrid1, dimBlock1
imax = 32
jmax = 32
kmax = 2
dimGrid = dim3(2,2, 1)
dimBlock = dim3(imax,jmax,1)
allocate(h_a(0:imax-1,0:jmax-1,0:1))
allocate(h_b(0:imax-1,0:jmax-1,0:1))
allocate(h_c(0:imax-1,0:jmax-1,0:1))
!real,device,allocatable::d_c(:,:,:)
allocate(d_b(0:imax-1,0:jmax-1,0:1))
allocate(d_c(-1:imax,-1:jmax,-1:16))
allocate(d_b_copy(0:imax-1,0:jmax-1,0:1))
allocate(d_c_copy(-1:imax,-1:jmax,-1:1))
!array initialization
do k = 0,kmax-1
do j=0, jmax-1
do i = 0, imax-1
h_a(i,j,k) = i*0.1
end do
end do
end do
!data transfer (cpu to gpu)
d_b = h_a
d_c(0:imax-1,0:jmax-1,0:kmax-1)= h_a
call testdata<<<dimGrid,dimBlock>>>()
!copy back to cpu
h_b = d_b_copy(0:imax-1,0:jmax-1,0:kmax-1)
h_c = d_c_copy(0:imax-1,0:jmax-1,0:kmax-1)
!just for visual test
write(*,*), h_b
open(24,file='h_a.dat')
write(24,*) h_a
close(24)
open(24,file='d_b_copy.dat')
write(24,*) h_b
close(24)
open(24,file='d_c_copy.dat')
write(24,*) h_c
close(24)
end program Test
I am reading the CUDA C Programming Guide, and in the shared memory topics I have come across an example:
Device Compute capability: 1.0, 16 banks in shared memory
extern __shared__ float shared[];
float data = shared[BaseIndex + s * tid];
In the explanation they conclude that s has to be odd. Can anyone please help me understand what happens when s is even and what happens when s is odd?
The conclusion that s must be odd is not easy to see directly, but you can derive when a bank conflict occurs (two threads tid and tid' access the same bank), assuming the number of banks is 32:
s*tid == s*tid' (mod 32)
s*tid == s*(tid + n) (mod 32) where tid' = tid + n
s*tid == s*tid + s*n (mod 32)
s*n == 0 (mod 32)
n = (32/d)*k for some integer k, where d = gcd(s, 32)
The smallest positive such n is 32/d, and a conflict requires two threads of the same warp, i.e. some n with 1 <= n <= 31. So no bank conflict occurs exactly when 32 <= 32/d, that is, when d = 1;
and since d = gcd(s, 2^5), d = 1 means s has to be odd.
About your question in the comments: I didn't fully get what you don't understand, but here is a simple explanation: if two threads try to access the same bank (that is, two different words that map to the same bank), the accesses are serialized.
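To make the stride effect concrete, here is a small self-contained sketch (my own illustration, not from the Programming Guide, and following the answer's assumption of 32 banks of 4-byte words); launch it with a block size that is a multiple of 32 so the modulo does not disturb the bank pattern:
// Each thread reads shared[(s*tid) % blockDim.x], which lands in bank (s*tid) % 32.
// For odd s every thread of a warp hits a distinct bank; s = 2 gives 2-way conflicts,
// s = 4 gives 4-way conflicts, and so on up to a 32-way conflict for s = 32.
__global__ void strided_read( const float * in, float * out, int s )
{
    extern __shared__ float shared[];
    int tid = threadIdx.x;
    shared[ tid ] = in[ blockIdx.x * blockDim.x + tid ];   // conflict-free fill
    __syncthreads();
    out[ blockIdx.x * blockDim.x + tid ] = shared[ ( s * tid ) % blockDim.x ];
}
// Hypothetical launch: 256 threads per block, 256 floats of dynamic shared memory.
// strided_read<<< num_blocks, 256, 256 * sizeof(float) >>>( d_in, d_out, s );
Timing or profiling the shared-memory load for s = 1, 2, 4, ... makes the serialization described above visible.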
I made a simple CUDA program for practice. It simply copies over data from one array to another:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
# Global constants
N = 2**20 # size of array a
a = np.linspace(0, 1, N)
e = np.empty_like(a)
block_size_x = 512
# Instantiate block and grid sizes.
block_size = (block_size_x, 1, 1)
grid_size = (N / block_size_x, 1)
# Create the CUDA kernel, and run it.
mod = SourceModule("""
__global__ void D2x_kernel(double* a, double* e, int N) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid > 0 && tid < N - 1) {
        e[tid] = a[tid];
    }
}
""")
func = mod.get_function('D2x_kernel')
func(a, cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
print str(e)
However, I get this error: pycuda._driver.LogicError: cuLaunchKernel failed: invalid value
When I get rid of the second argument double* e in my kernel function and invoke the kernel without the argument e, the error goes away. Why is that? What does this error mean?
Your a array does not exist in device memory, so I suspect that PyCUDA is ignoring (or otherwise handling) the first argument to your kernel invocation and only passing in e and N...so you get an error because the kernel was expecting three arguments and it has only received two. Removing double* e from your kernel definition might eliminate the error message you're getting, but your kernel still won't work properly.
A quick fix to this should be to wrap a in a cuda.In() call, which instructs PyCUDA to copy a to the device before launching the kernel. That is, your kernel launch line should be:
func(cuda.In(a), cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
Edit: Also, do you realize that your kernel is not copying the first and last elements of a to e? Your if (tid > 0 && tid < N - 1) statement is preventing that. For the entire array, it should be if (tid < N).
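Putting both points together, the kernel body inside the SourceModule string would look something like this (a sketch of the corrected kernel only; the host-side change is just the cuda.In(a) wrapper shown above):
// Corrected kernel: the guard now covers every element, including the first and last.
__global__ void D2x_kernel(double* a, double* e, int N) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N) {
        e[tid] = a[tid];
    }
}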