cublasSetVector() vs cudaMemcpy() - cuda

I am wondering if there is a difference between:
// cumalloc.c - Create a vector on the device
HOST float * cudamath_vector(const float * h_vector, const int m)
{
float *d_vector = NULL;
cudaError_t cudaStatus;
cublasStatus_t cublasStatus;
cudaStatus = cudaMalloc(&d_vector, sizeof(float) * m );
if(cudaStatus == cudaErrorMemoryAllocation) {
printf("ERROR: cumalloc.cu, cudamath_vector() : cudaErrorMemoryAllocation");
return NULL;
}
/* THIS: */ cublasSetVector(m, sizeof(*d_vector), h_vector, 1, d_vector, 1);
/* OR THAT: */ cudaMemcpy(d_vector, h_vector, sizeof(float) * m, cudaMemcpyHostToDevice);
return d_vector;
}
cublasSetVector() has two arguments incx and incy and the documentation says:
The storage spacing between consecutive elements is given by incx for
the source vector x and for the destination vector y.
In the NVIDIA forum someone said:
iona_me: "incx and incy are strides measured in floats."
So does this mean that for incx = incy = 1 all elements of a float[] will be sizeof(float)-aligned and for incx = incy = 2 there would be a sizeof(float)-padding between each element?
Except for those two parameters and the cublasHandle, does cublasSetVector() do anything else that cudaMemcpy() doesn't do?
Would it be safe to pass a vector/matrix which was not created with their respective cublas*() function to other CUBLAS functions to manipulate them?

There is a comment by Massimiliano Fatica in a thread on the NVIDIA Forum that answers this. In particular:
cublasSetVector, cublasGetVector, cublasSetMatrix, cublasGetMatrix are thin wrappers around cudaMemcpy and cudaMemcpy2D. Therefore, no significant performance differences are expected between the two sets of copy functions.
Accordingly, you can safely pass any array created by cudaMalloc as input to cublasSetVector.
Concerning the strides, perhaps there is a misprint in the guide (as of CUDA 6.0), which says that
The storage spacing between consecutive elements is given by incx for the source
vector x and for the destination vector y.
but perhaps should be read as
The storage spacing between consecutive elements is given by incx for the source
vector x and incy for the destination vector y.
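To make the stride semantics concrete, here is a minimal host-only C++ sketch of the element placement a strided vector copy performs under the corrected reading (incx for the source, incy for the destination). The names copy_strided and every_second are hypothetical and are not part of cuBLAS; the sketch only models which elements are read and written, not the host-to-device transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only helper mimicking the element placement of a strided
// vector copy: element k is read from x[k * incx] and written to y[k * incy].
// Strides are measured in elements, not bytes.
void copy_strided(int n, const float* x, int incx, float* y, int incy) {
    assert(n >= 0 && incx > 0 && incy > 0);
    for (int k = 0; k < n; ++k) {
        y[static_cast<std::size_t>(k) * incy] =
            x[static_cast<std::size_t>(k) * incx];
    }
}

// Example: gather every second element of src into a densely packed result
// (incx = 2 for the source, incy = 1 for the destination).
std::vector<float> every_second(const std::vector<float>& src) {
    const int n = static_cast<int>((src.size() + 1) / 2);
    std::vector<float> dst(n);
    copy_strided(n, src.data(), 2, dst.data(), 1);
    return dst;
}
```

With incx = incy = 1 the elements are contiguous. With a stride of 2, every other element is skipped, so there is a gap of one float between consecutive copied elements.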

Related

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>
extern "C" {
__device__ float get_random_number(curandState* global_state, int thread_id) {
curandState local_state = global_state[thread_id];
float num = curand_uniform(&local_state);
global_state[thread_id] = local_state;
return num;
}
__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
int x = threadIdx.x + blockIdx.x * blockDim.x; // column index of cell
int y = threadIdx.y + blockIdx.y * blockDim.y; // row index of cell
// make sure this cell is within bounds of grid
if (x < grid_size && y < grid_size) {
int thread_id = y * grid_size + x; // thread index
grid_b[thread_id] = grid_a[thread_id]; // copy current cell
float num;
// ignore cell if it is not already populated
if (grid_a[thread_id] > 0.0) {
num = get_random_number(global_state, thread_id);
// agents in this cell die
if (num < survival_probabilities[thread_id]) {
grid_b[thread_id] = 0.0; // cell dies
//printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
}
}
}
}
} // extern "C"
'''
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between the code in the PyCUDA curandom module, which determines how much internal state the generator allocates, and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimension you select in your code. That is obviously unlikely, particularly at large grid sizes. You need to either:
Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
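As a rough illustration of why the indexing in the question can run past the allocated state, the following host-only C++ sketch computes how many states the kernel's thread_id = y * grid_size + x scheme touches. The function names and the allocated_states parameter are hypothetical; the number of states the generator object really allocates has to be looked up in the PyCUDA source for the installed version.

```cpp
#include <cassert>
#include <cstddef>

// Number of generator states needed when every cell of a square
// matrix_size x matrix_size grid indexes its own state.
std::size_t states_needed(std::size_t matrix_size) {
    return matrix_size * matrix_size;
}

// True if an allocation of `allocated_states` entries covers the largest
// thread_id the kernel can produce. `allocated_states` is a hypothetical
// input; the real value depends on what the generator allocated.
bool state_index_in_bounds(std::size_t matrix_size,
                           std::size_t allocated_states) {
    assert(matrix_size > 0);
    const std::size_t max_thread_id = states_needed(matrix_size) - 1;
    return max_thread_id < allocated_states;
}

// Bytes required to hold one state per cell, given the size of one state.
std::size_t state_bytes(std::size_t matrix_size, std::size_t bytes_per_state) {
    return states_needed(matrix_size) * bytes_per_state;
}
```

For an 8,000 x 8,000 grid this is 64,000,000 states, so a self-managed allocation with one state per cell is large; check sizeof(curandState) on your toolkit before committing to that approach.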

dot_product with CUDA_CUB

__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) {
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
// --- Specialize BlockReduce for type float.
typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
// --- Allocate temporary storage in shared memory
__shared__ typename BlockReduceT::TempStorage temp_storage;
// --- Every thread of the block must call Sum(); out-of-range threads contribute 0
float result = BlockReduceT(temp_storage).Sum((tid < N) ? indata[tid] : 0.0f);
// --- Update block reduction value
if(threadIdx.x == 0) outdata[blockIdx.x] = result;
return;
}
I have successfully tested the reduction sum (as shown in the code snippet above) with CUDA CUB, and I want to compute the inner product of two vectors based on this code. But I am confused about two things:
We need two input vectors for the inner product. Do I need to do a component-wise multiplication of the two input vectors first, and then run the reduction sum on the resulting vector?
In the CUDA CUB code examples, the dimension of the input vectors is equal to blocknumber*threadnumber. What if we have a very large vector?
Yes, with cub, and assuming your vectors were stored separately (i.e. not interleaved), you would need to do an element-wise multiplication first. On the other hand, thrust transform_reduce could handle it in a single function call.
blocknumber*threadnumber should give you all the range you need. On a cc3.0 or higher GPU, blocknumber (i.e. gridDim.x) can range up to 2^31-1 and threadnumber (i.e. blockDim.x) can range up to 1024. This gives you the possibility to handle about 2^41 elements. If each element is 4 bytes, this would require about 2^43 bytes. That is about 8TB (or double that if you are considering 2 input vectors), which is much larger than any GPU memory currently. So you will run out of GPU memory space before you run out of grid dimension.
Note that what you are showing is cub::BlockReduce. However if you are doing a vector dot product of two large vectors, you might want to use cub::DeviceReduce instead.
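The two-stage structure described above (multiply element-wise, reduce within each block, then reduce the per-block results) can be sketched on the host as follows. This is a plain C++ model, not CUB code; block_partial_dots and dot are hypothetical names, and the inner loop stands in for what a block-wide reduction would do on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of a block-wise dot product. Each "block" multiplies up to
// block_size pairs element-wise and reduces them to one partial sum. Indices
// past the end of the vectors contribute nothing, like zero padding.
std::vector<float> block_partial_dots(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t block_size) {
    assert(a.size() == b.size() && block_size > 0);
    const std::size_t n = a.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;
    std::vector<float> partial(num_blocks, 0.0f);
    for (std::size_t blk = 0; blk < num_blocks; ++blk) {
        const std::size_t begin = blk * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i) {
            partial[blk] += a[i] * b[i];  // multiply, then accumulate
        }
    }
    return partial;
}

// Second pass: reduce the per-block partial sums to the final dot product.
float dot(const std::vector<float>& a, const std::vector<float>& b,
          std::size_t block_size) {
    float total = 0.0f;
    for (float p : block_partial_dots(a, b, block_size)) total += p;
    return total;
}
```

The vector length does not have to be a multiple of the block size here, because the last block simply processes fewer elements.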

Parallel Anti diagonal 'for' loop?

I have an N x N square matrix of integers (which is stored in the device as a 1-d array for convenience).
I'm implementing an algorithm which requires the following to be performed:
There are 2N - 1 anti-diagonals in this square. (Anti-diagonals are parallel lines running from the top edge to the left edge and from the right edge to the bottom edge.)
I need a for loop with 2N - 1 iterations, with each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.
In each iteration, all the elements in that anti-diagonal must be computed in parallel.
Each anti-diagonal is calculated based on the values of the previous anti-diagonal.
So, how do I index the threads with this requirement in CUDA?
As far as I understand, you want something like
Parallelizing the Smith-Waterman Local Alignment Algorithm using CUDA A
At each iteration, the kernel is launched with a different number of threads.
Perhaps the code in Parallel Anti diagonal 'for' loop could be modified as
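int iDivUp(const int a, const int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }
#define BLOCKSIZE 32
__global__ void antiparallel(float* d_A, int step, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = step - i;
    // skip threads that fall outside the N x N matrix
    if (i >= N || j < 0 || j >= N) return;
    /* do work on d_A[i*N+j] */
}
for (int step = 0; step < 2*N-1; step++) {
    dim3 dimBlock(BLOCKSIZE);
    // i runs from 0 to min(step, N-1), so at most min(step+1, N) threads are needed
    int nThreads = (step + 1 < N) ? (step + 1) : N;
    dim3 dimGrid(iDivUp(nThreads, dimBlock.x));
    antiparallel<<<dimGrid, dimBlock>>>(d_A, step, N);
}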
This code is untested and is just a sketch of a possible solution (provided that I have not misunderstood your question). Furthermore, I do not know how efficient a solution like that would be, since you will have kernels launched with very few threads.
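To check the index arithmetic without a GPU, here is a small host-only C++ sketch that enumerates the cells on each anti-diagonal, using the same i + j == step convention as the kernel above. The function names are hypothetical helpers for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Number of cells on anti-diagonal `step` (0-based) of an N x N matrix.
int antidiagonal_length(int step, int N) {
    assert(N > 0 && step >= 0 && step <= 2 * N - 2);
    return (step < N) ? step + 1 : 2 * N - 1 - step;
}

// Cells (i, j) with i + j == step that lie inside the matrix, ordered by
// increasing row index i. On the GPU, each of these would be one thread.
std::vector<std::pair<int, int>> antidiagonal_cells(int step, int N) {
    const int i_begin = std::max(0, step - (N - 1));
    const int len = antidiagonal_length(step, N);
    std::vector<std::pair<int, int>> cells;
    for (int t = 0; t < len; ++t) {
        const int i = i_begin + t;
        cells.emplace_back(i, step - i);
    }
    return cells;
}
```

Summing antidiagonal_length over all 2N - 1 steps gives N * N, which is a quick sanity check that every cell is visited exactly once.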

CUDA: Thread and Array Allocation

I have read many times about CUDA threads/blocks and arrays, but I still don't understand one point: how and when does CUDA start running multiple threads for a kernel function? When the host calls the kernel function, or inside the kernel function?
For example, I have this example, which simply transposes an array (so it just copies values from one array to another).
__global__
void transpose(float* in, float* out, uint width) {
uint tx = blockIdx.x * blockDim.x + threadIdx.x;
uint ty = blockIdx.y * blockDim.y + threadIdx.y;
out[tx * width + ty] = in[ty * width + tx];
}
int main(int args, char** vargs) {
/*const int HEIGHT = 1024;
const int WIDTH = 1024;
const int SIZE = WIDTH * HEIGHT * sizeof(float);
dim3 bDim(16, 16);
dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);
float* M = (float*)malloc(SIZE);
for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }
float* Md = NULL;
cudaMalloc((void**)&Md, SIZE);
cudaMemcpy(Md,M, SIZE, cudaMemcpyHostToDevice);
float* Bd = NULL;
cudaMalloc((void**)&Bd, SIZE); */
transpose<<<gDim, bDim>>>(Md, Bd, WIDTH); // CALLING FUNCTION TRANSPOSE
cudaMemcpy(M,Bd, SIZE, cudaMemcpyDeviceToHost);
return 0;
}
(I have commented out all the lines that are not important and left only the line calling the transpose function.)
I understand all the lines in main except the line calling transpose. Is it true that when we call transpose<<<gDim, bDim>>>(Md, Bd, WIDTH), CUDA automatically assigns each element of the array to one thread (and block), and that when we call transpose "one time", CUDA runs transpose gDim * bDim times on gDim * bDim threads?
This point frustrates me a lot, because it is not like the multithreading I use in Java :( Please tell me.
Thanks :)
Your understanding is in essence correct.
transpose is not a function, but a CUDA kernel. When you call a regular function, it only runs once. But when you launch a kernel a single time, CUDA will automatically run the code in the kernel many times. CUDA does this by starting many threads. Each thread runs the code in your kernel one time. The numbers inside the triple brackets (<<< >>>) are called the kernel execution configuration. It determines how many threads will be launched by CUDA and specifies some relationships between the threads.
The number of threads that will be started is calculated by multiplying up all the values in the grid and block dimensions inside the triple brackets. For instance, the number of threads will be 1,048,576 (16 * 16 * 64 * 64) in your example.
Each thread can read some variables to find out which thread it is. Those are the blockIdx and threadIdx structures at the top of the kernel. The values reflect the ones in the kernel execution configuration. So, if you run your kernel with a grid configuration of 16 x 16 (the first dim3 in the triple brackets), you will get threads that, when they each read the x and y values in the blockIdx structure, will get all possible combinations of x and y between 0 and 15.
So, as you see, CUDA does not know anything about array elements or any other data structures that are specific to your kernel. It just deals with threads, thread indexes and block indexes. You then use those indexes to determine what a given thread should do (in particular, which values in your application-specific data it should work on).
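One way to picture a kernel launch is as a set of nested loops over block and thread indexes, with the kernel body executed once per combination. The host-only C++ sketch below models the transpose launch from the question that way. Dim2, total_threads and simulate_transpose are hypothetical names for illustration, and a real GPU runs these iterations concurrently rather than in loop order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim2 { unsigned x, y; };  // simplified stand-in for a 2D dim3

// Total number of threads started by one launch with this configuration.
std::size_t total_threads(Dim2 grid, Dim2 block) {
    return static_cast<std::size_t>(grid.x) * grid.y * block.x * block.y;
}

// Host-only model of launching the transpose kernel: the four loops stand in
// for blockIdx.y, blockIdx.x, threadIdx.y and threadIdx.x. The loop body is
// the kernel body, run once per simulated thread.
std::vector<float> simulate_transpose(const std::vector<float>& in,
                                      unsigned width, Dim2 grid, Dim2 block) {
    assert(grid.x * block.x == width && grid.y * block.y == width);
    assert(in.size() == static_cast<std::size_t>(width) * width);
    std::vector<float> out(in.size());
    for (unsigned by = 0; by < grid.y; ++by)
        for (unsigned bx = 0; bx < grid.x; ++bx)
            for (unsigned ly = 0; ly < block.y; ++ly)
                for (unsigned lx = 0; lx < block.x; ++lx) {
                    const unsigned tx = bx * block.x + lx;  // global column
                    const unsigned ty = by * block.y + ly;  // global row
                    out[tx * width + ty] = in[ty * width + tx];
                }
    return out;
}
```

For the configuration in the question (a 64 x 64 grid of 16 x 16 blocks), total_threads returns 1,048,576, one thread per element of the 1024 x 1024 array.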

CUDA: Allocating 2D array on GPU

I have already read the following thread, but I couldn't get my code to work.
I am trying to allocate a 2D array on GPU, fill it with values, and copy it back to the CPU. My code is as follows:
__global__ void Kernel(char **result,int N)
{
//do something like result[0][0]='a';
}
int N=20;
int Count=5;
char **result_h=(char**)malloc(sizeof(char*)*Count);
char **result_d;
cudaMalloc(&result_d, sizeof(char*)*Count);
for(int i=0;i<Count;i++)
{
result_h[i] = (char*)malloc(sizeof(char)*N);
cudaMalloc(&result_d[i], sizeof(char)*N); //get exception here
}
//call kernel
//copy values from result_d to result_h
printf("%c",result_h[0][0]); //should print a
How can I achieve this?
You can't manipulate device pointers in host code, which is why the cudaMalloc call inside the loop fails. You should probably just allocate a single contiguous block of memory and then treat that as a flattened 2D array.
For doing the simplest 2D operations on a GPU, I'd recommend you just treat it as a 1D array. cudaMalloc a block of size w*h*sizeof(char). You can access the element (i,j) through index j*w+i.
Alternatively, you could use cudaMallocArray to get a 2D array. This has a better sense of locality than linear mapped 2D memory. You can easily bind this to a texture, for example.
Now in terms of your example, the reason why it doesn't work is that cudaMalloc manipulates a host pointer to point at a block of device memory. Your example allocated the pointer array result_d itself on the device, so the host cannot write to result_d[i]. If you just change the cudaMalloc call for result_d to a regular malloc, it should work as you originally intended.
That said, perhaps one of the two options I outlined above might work better from an ease of code maintenance perspective.
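The flattened layout suggested above can be tried on the host first. The following C++ sketch uses the same j*w+i index formula on an ordinary host buffer; Flat2D and flat_index are hypothetical names, and on the GPU the backing store would be a single cudaMalloc'd block rather than a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear index of element (i, j) in a w x h array stored row by row,
// where i is the column (0 <= i < w) and j is the row (0 <= j < h).
std::size_t flat_index(std::size_t i, std::size_t j, std::size_t w) {
    return j * w + i;
}

// A w x h char grid held in one contiguous buffer of w * h elements.
struct Flat2D {
    std::size_t w, h;
    std::vector<char> data;
    Flat2D(std::size_t width, std::size_t height)
        : w(width), h(height), data(width * height, '\0') {}
    char& at(std::size_t i, std::size_t j) {
        assert(i < w && j < h);
        return data[flat_index(i, j, w)];
    }
};
```

Because the whole grid is one allocation, it can be copied between host and device with a single copy of w * h * sizeof(char) bytes.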
When allocating in that way you are allocating addresses that are valid in CPU memory.
The value of each address is transferred as a number without problems, but once in device memory the char* address has no meaning.
Create an array of N * max text length, and another array of length N that tells how long each word is.
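A minimal host-side C++ sketch of that layout follows: one buffer of N * max_len characters plus a lengths array with N entries. PackedWords, pack_words and word_at are hypothetical names used only for illustration; both arrays are flat, so each could be copied to the device with one plain memory copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct PackedWords {
    std::size_t max_len;       // width of each fixed-size slot, in chars
    std::vector<char> chars;   // N * max_len chars, zero padded
    std::vector<int> lengths;  // N entries: actual length of each word
};

// Pack variable-length words into fixed-width slots of max_len chars each.
PackedWords pack_words(const std::vector<std::string>& words) {
    PackedWords p{0, {}, {}};
    for (const std::string& w : words) p.max_len = std::max(p.max_len, w.size());
    p.chars.assign(words.size() * p.max_len, '\0');
    p.lengths.reserve(words.size());
    for (std::size_t k = 0; k < words.size(); ++k) {
        std::copy(words[k].begin(), words[k].end(),
                  p.chars.data() + k * p.max_len);
        p.lengths.push_back(static_cast<int>(words[k].size()));
    }
    return p;
}

// Recover word k from its slot, using the stored length.
std::string word_at(const PackedWords& p, std::size_t k) {
    assert(k < p.lengths.size());
    const char* start = p.chars.data() + k * p.max_len;
    return std::string(start, static_cast<std::size_t>(p.lengths[k]));
}
```

Word k starts at offset k * max_len, so a kernel thread can find its word from its index alone, without following any pointers.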
This is a bit more advanced, but if you are processing a set of defined text (passwords, for example), I would suggest grouping it by text length and creating a specialized kernel for each length:
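template<int text_width>
__global__ void Kernel(char *result, int N)
{
    // one thread per word; words are stored back to back, text_width chars each
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    for (int i = 0; i < text_width; ++i)
        result[idx * text_width + i] = 'a';
}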
and in the kernel invocation code you specify:
switch (text_length) {
case 16:
    Kernel<16><<< /* grid, block */ >>>( /* result, N */ );
    break;
}
The following code sample allocates a width×height 2D array of floating-point values and shows how to loop over the array elements in device code[1]:
// host code
int width = 64, height = 64;
float* devPtr;
size_t pitch; // cudaMallocPitch returns the pitch as a size_t, in bytes
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);
// device code
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // rows are pitch bytes apart, so step through memory as char*
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
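The row addressing used in the device code above can be reproduced on the host to see what the pitch does. In this C++ sketch, padded_pitch and row_ptr are hypothetical helpers, and the alignment passed to padded_pitch is an assumed illustrative value: the real pitch is chosen by cudaMallocPitch and should always be taken from its output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round a row width in bytes up to a multiple of `alignment`. This only
// models the idea of padded rows; it does not predict the runtime's pitch.
std::size_t padded_pitch(std::size_t row_bytes, std::size_t alignment) {
    assert(alignment > 0);
    return (row_bytes + alignment - 1) / alignment * alignment;
}

// Start of row r in a pitched allocation. The offset is computed in bytes,
// which is why the base pointer is treated as char* before adding r * pitch.
float* row_ptr(float* base, std::size_t pitch, std::size_t r) {
    assert(pitch % sizeof(float) == 0);
    return reinterpret_cast<float*>(reinterpret_cast<char*>(base) + r * pitch);
}

// Sum all width x height elements, skipping the padding at the end of rows.
float sum_pitched(float* base, std::size_t pitch, std::size_t width,
                  std::size_t height) {
    float total = 0.0f;
    for (std::size_t r = 0; r < height; ++r) {
        const float* row = row_ptr(base, pitch, r);
        for (std::size_t c = 0; c < width; ++c) total += row[c];
    }
    return total;
}
```

Element (r, c) lives at byte offset r * pitch + c * sizeof(float), so indexing with r * width + c would be wrong whenever the pitch is larger than width * sizeof(float).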
The following code sample allocates a width×height CUDA array of one 32-bit
floating-point component[1]
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
The following code sample copies the 2D array to the CUDA array allocated in the
previous code samples[1]:
cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch, width * sizeof(float), height,
cudaMemcpyDeviceToDevice);
The following code sample copies a host memory array to device memory[1]:
float data[256];
int size = sizeof(data);
float* devPtr;
cudaMalloc((void**)&devPtr, size);
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);
You can study these examples and adapt them to your purpose.
[1] NVIDIA CUDA Compute Unified Device Architecture