Multi-dimensional array copy OpenACC - cuda

I have a 2D matrix SIZE x SIZE, which I'm trying to copy to the GPU.
I allocate the matrix this way:
#define SIZE 1024
float (*a)(SIZE) = (float(*)[SIZE]) malloc(SIZE * SIZE * sizeof(float));
And I have this on my ACC region:
void mmul_acc(restrict float a[][SIZE],
restrict float b[][SIZE],
restrict float c[][SIZE]) {
#pragma acc data copyin(a[0:SIZE][0:SIZE], b[0:SIZE][0:SIZE]) \
copyout c[0:SIZE][0:SIZE])
{
... code here...
}
When compiling with the PGI compiler, using -Minfo=acc, the compiler tells me:
Generating copyin(a[0:1024][0:])
What does a[0:1024][0:] mean? Why not a[0:1024][0:1024] ???
If instead of declaring matrices I declare arrays with size SIZE*SIZE, doing
#pragma acc copyin(a[0:SIZE*SIZE])
Generates the following compiler message
Generating copyin(a[0:16777216])
The code actually works the same way, same performance, same result.
Apparently in both ways the compiler generates the same code, as it should be, but the message is not straightforward.
I'm using the PGI accelerator 12.8, in a Linux64 machine. I'm compiling with -Minfo=acc
Note: this question was edited and now it doesn't really make much sense, but maybe it can useful to more people.

This issue is fixed in latest PGI Compiler 12.9.0. The compiler now returns following messsage:
Generating copyin(a[0:1024][0:1024])

Related

How can I dereference a thrust::device_vector from within a thrust functor?

I'm doing a thrust transform_reduce and need to access a thrust::device_vector from within the functor. I am not iterating on the device_vector. It allows me to declare the functor, passing in the device_vector reference, but won't let me dereference it, either with begin() or operator[].
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include\thrust/detail/function.h(187): warning : calling a host function("thrust::detail::vector_base > ::operator []") from a host device function("thrust::detail::host_device_function ::operator () ") is not allowed
I assume I'll be able to pass in the base pointer and do the pointer math myself, but is there a reason this isn't supported?
Just expanding on what #JaredHoberock has already indicated. I think he will not mind.
A functor usable by thrust must (for the most part) conform to the requirements imposed on any CUDA device code.
Both thrust::host_vector and thrust::device_vector are host code containers used to manipulate host data and device data respectively. A reference to the host code container cannot be used successfully in device code. This means even if you passed a reference to the container successfully, you could not use it (i.e. could not do .push_back(), for example) in device code.
For direct manipulation in device code (such as functors, or kernels) you must extract raw device pointers from thrust and use those directly, with your own pointer arithmetic. And advanced functions (such as .push_back()) will not be available.
There are a variety of ways to extract the raw device pointer corresponding to thrust data, and the following example code demonstrates two possibilities:
$ cat t651.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
__global__ void printkernel(float *data){
printf("data = %f\n", *data);
}
int main(){
thrust::device_vector<float> mydata(5);
thrust::sequence(mydata.begin(), mydata.end());
printkernel<<<1,1>>>(mydata.data().get());
printkernel<<<1,1>>>(thrust::raw_pointer_cast(&mydata[2]));
cudaDeviceSynchronize();
return 0;
}
$ nvcc -o t651 t651.cu
$ ./t651
data = 0.000000
data = 2.000000
$

Usage of same constant memory array on different source files

I have a __constant__ memory array holding information that is needed by many kernels, which are placed in different source files. This constant memory array is defined in the header GlobalParameters.h, which is #included by all files containing kernels that need to access to this array.
I just discovered (look at talonmies' answer) that __constant memory__ is only available in the translation unit where it is defined, unless you turn on separate compilation (with CUDA 5.0 or later).
I still do not get completely what this means for my case.
Assuming that I cannot turn on separate compilation, is there a way for dealing with my needs? Where should I place the definition of my constant memory array? What if I place it in my header, which is #included in many translation units?
Assuming I can turn on separate compilation, should I declare my __constant__ memory array in the header as extern and place the definition inside a source file (e.g. GlobalParameters.cu)?
One way to make constant memory available to translation units other than the one where it is declared, is to call cudaGetSymbolAddress() and make the pointer available to the other functions.
This technique is playing with fire to some degree, because if you use the pointer to write to the memory without appropriate barriers and synchronization, you may run afoul of the lack of coherency between constant memory and global memory.
Also, you may not get the full performance benefits of constant memory if you use this method. That should be less true on SM 2.x and later hardware - disassemble the object code and make sure the compiler is emitting "load uniform" instructions.
The below example assumes the possibility of using separate compilation. In this case, the below example shows how using extern to work with constant memory across different compilation units.
FILE kernel.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "Utilities.cuh"
__constant__ int N_GPU;
__constant__ float a_GPU;
__global__ void printKernel();
int main()
{
const int N = 5;
const float a = 10.466;
gpuErrchk(cudaMemcpyToSymbol(N_GPU, &N, sizeof(int)));
gpuErrchk(cudaMemcpyToSymbol(a_GPU, &a, sizeof(float)));
printKernel << <1, 1 >> > ();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
FILE otherCompilationUnit.cu
#include <stdio.h>
extern __constant__ int N_GPU;
extern __constant__ float a_GPU;
__global__ void printKernel() {
printf("N = %i; a = %f\n", N_GPU, a_GPU);
}
No, without using separate compilation it won't be possible to use the same constant memory, that is declared once, over several .cu files.
In my oppinion there are two ways for a workaround.
First one is to implement all kernels within one .cu file. Therefore you will get the disadvantage that this file will become very large with a bad overview.
A second way would be to declare in every .cu file the constant memory again. Then once with a wrapper copy the values into the constant memory for every single .cu file - like I described in an answer here. Disadvantages would be that you have to ensure that you don't forget to copy the values in single .cu files and you have to check that you won't run in the limitation of total available constant memory.
Yes. The later CUDA doc says:
When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), device, shared, managed and constant variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated shared variable).

Counting registers/thread in Cuda kernel

The nSight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
If each register has 4 bytes I don't understand how we get to 52 registers, even
assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which
case using shared memory as I did doesn't make sense). I would
like to increase occupancy, but I don't get why I'm using so many registers.
Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your code, nobody can actually help with the register usage. If you want to better understand the register usage, you will need to look at the PTX code directly. To do this, compile your code using nvcc with the -ptx switch, and inspect the resultant .ptx file directly. To do this you may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20 which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, however, or you may need to tune the parameter to a value which doesn't sacrifice too much performance. However you may find an optimum choice which doesn't sacrifice too much basic performance but allows you to improve occupancy. If you constrain the compiler too much, it will begin to spill it's needed register usage to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
If you want to increase the occupancy, a direct way is using compiler flag: maxregcount to restrict the usage of registers, but it may suffer a performance loss because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to there.
In Debug Perspective, inside the CUDA Thread, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The window "Disassembly" will open your kernel PTX Assembly. You can continue stepping in your kernel to track the correlation of your source code and the assembly. So you can discover which register is used for.

How threads/blocks are mapped on GPU while calling cublasSgemm/clAmdBlasSgemm routines?

I am interested in knowing how cublasSgemm/clAmdBlasSgemm routines are mapped on GPU while calculating matrix multiplication (C = A * B).
Assume the dimensions of input Matrix ::A_rows = 6144;
A_cols = 12288; B_rows = 12288; B_cols = 15360;
and dimensions of resultant matrix :: C_rows = 6144; C_cols = 15360;
Assume i have initialized the input matrices on host and i copied the matrix data into device memory. After that i am calling following cuBlas or clAmdBlas routines to do matrix multiplication on GPU.
void cublasSgemm (char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const float *B, int ldb, float beta, float *C, int ldc);
where m = A_rows; and
n = B_cols;
So my doubts are:
1. ) How these routines are implemented on GPU ?
2. ) Does m and n values mapped on one compute unit (SM)? If No, then what can be maximum value for m and n ?
3. ) Do we have control of threads/Blocks ?
For the host side CUBLAS API (note that I have no idea why you would assume that clAmdBlasSgemm would be the same), the short answer to your questions are as follows:
Modern CUBLAS is closed source. There are code bases like Magma which you could look at to at least get a feel for how CUBLAS might be implemented. You can also run CUBLAS code in one of the NVIDIA supplied profilers to see what it does on the GPU. But the point is that you don't need to know how it works. There is an API and some very thorough documentation. That is all you need to know.
You example problem requires roughly 1.2Gb of memory. If you have a GPU with that much memory, and either enough computational capacity to avoid the display driver watchdog timer, or a compute dedicated GPU, it will work. Memory and the display driver time limitations (where applicable) are the only limitations.
No.
Note that there is also a CUBLAS device API for K20 Kepler devices, and the answers I provided above do not apply to that library.
Before going any further you must read the papers of Volkov and Demmel, have a look here: http://www.cs.berkeley.edu/~volkov/ see his article regarding SGEMM. The answers are there since 2008.

Variable Sizes Array in CUDA

Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About the printf, the problem is it only works for compute capability 2.x. There is an alternative cuPrintf that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you write extern __shared__ int[];
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes, stores them in an array, then after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because it would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently-large shared memory array and let your threads negociate splitting it amongst themselves.