Understanding Cublas: vector addition (asum) [duplicate] - cuda

This question already has answers here:
Finding maximum and minimum with CUBLAS
(2 answers)
Closed 7 years ago.
According to the CUBLAS reference, the asum function (which computes the sum of the absolute values of a vector's elements) is:
cublasStatus_t cublasSasum(cublasHandle_t handle, int n, const float *x, int incx, float *result)
The parameter descriptions are in the linked reference; roughly, we have a vector x of n elements with a stride of incx between consecutive elements.
My code is (quite simplified, but I also tested this one and there is still the error):
int arraySize = 10;
float* a = (float*) malloc (sizeof(float) * arraySize);
float* d_a;
cudaMalloc((void**) &d_a, sizeof(float) * arraySize);
for (int i = 0; i < arraySize; i++)
    a[i] = 0.8f;
cudaMemcpy(d_a, a, sizeof(float) * arraySize, cudaMemcpyHostToDevice);
cublasStatus_t ret;
cublasHandle_t handle;
ret = cublasCreate(&handle);
float* cb_result = (float*) malloc (sizeof(float));
ret = cublasSasum(handle, arraySize, d_a, sizeof(float), cb_result);
printf("\n\nCUBLAS: %.3f", *cb_result);
cublasDestroy(handle);
I have removed error checking, free, and cudaFree to simplify the code (there are no errors; the CUBLAS functions return CUBLAS_STATUS_SUCCESS).
It compiles, it runs, and it doesn't report any error, but the printed result is 0 and, in the debugger, it is actually 1.QNAN.
What did I miss?

One of the arguments to cublasSasum is incorrect. The call should look like this:
ret = cublasSasum(handle, arraySize, d_a, 1, cb_result);
Note that the second-to-last argument, incx, is a stride specified in elements, not bytes.
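For reference, a minimal corrected sketch (reusing handle, arraySize, and d_a from the question; the status check is illustrative):

// incx = 1 means the elements are contiguous; the stride is counted in elements.
// Note that cublasSasum returns the sum of the *absolute values* of the elements.
float cb_result = 0.0f;
cublasStatus_t stat = cublasSasum(handle, arraySize, d_a, 1, &cb_result);
if (stat != CUBLAS_STATUS_SUCCESS) {
    printf("cublasSasum failed with status %d\n", (int)stat);
}
printf("CUBLAS asum: %.3f\n", cb_result);   // 10 elements of 0.8f -> 8.000

In the default (host) pointer mode the result is available in cb_result as soon as the call returns, so no extra synchronization is needed here.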

Related

Is DeviceToDevice copy in a CUDA program needed?

I am doing the following two operations:
Addition of two arrays => a + b = AddResult
Multiply of two arrays => AddResult * a = MultiplyResult
In the above logic, AddResult is an intermediate result and is used as input to the next multiplication operation.
#define N 4096 // size of array
__global__ void add(const int* a, const int* b, int* c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

__global__ void multiply(const int* a, const int* b, int* c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] * b[tid];
    }
}
int main()
{
    int T = 1024, B = 4; // threads per block and blocks per grid
    int a[N], b[N], c[N], d[N], e[N];
    int* dev_a, * dev_b, * dev_AddResult, * dev_Temp, * dev_MultiplyResult;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_AddResult, N * sizeof(int));
    cudaMalloc((void**)&dev_Temp, N * sizeof(int));
    cudaMalloc((void**)&dev_MultiplyResult, N * sizeof(int));

    for (int i = 0; i < N; i++)
    {
        // load arrays with some numbers
        a[i] = i;
        b[i] = i * 1;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_AddResult, c, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_Temp, d, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_MultiplyResult, e, N * sizeof(int), cudaMemcpyHostToDevice);

    // ADD
    add<<<B, T>>>(dev_a, dev_b, dev_AddResult);
    cudaDeviceSynchronize();

    // Multiply
    cudaMemcpy(dev_Temp, dev_AddResult, N * sizeof(int), cudaMemcpyDeviceToDevice); //<---------DO I REALLY NEED THIS?
    multiply<<<B, T>>>(dev_a, dev_Temp, dev_MultiplyResult);
    //multiply<<<B, T>>>(dev_a, dev_AddResult, dev_MultiplyResult);

    // Copy Final Results D to H
    cudaMemcpy(e, dev_MultiplyResult, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
    {
        printf("(%d+%d)*%d=%d\n", a[i], b[i], a[i], e[i]);
    }

    // clean up
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_AddResult);
    cudaFree(dev_Temp);
    cudaFree(dev_MultiplyResult);
    return 0;
}
In the above sample code, I am transferring the Addition results (i.e. dev_AddResult) to another device array (i.e. dev_Temp) to perform the multiplication operation.
QUESTION: Since the Addition results array (i.e. dev_AddResult) is already on the GPU device, do I really need to transfer it to another array? I have already tried to execute the next kernel by directly providing dev_AddResult as input and it produced the same results. Is there any risk involved in directly passing the output of one kernel as the input of the next kernel? Any best practices to follow?
Yes, for the case you have shown, you can use the "output" of one kernel as the "input" to the next, without any copying. You've already done that and confirmed it works. The changes are trivial: eliminate the intervening cudaMemcpy operation and use the same dev_AddResult pointer in place of the dev_Temp pointer in your multiply kernel invocation (a short sketch is given at the end of this answer).
Regarding "risks" I'm not aware of any for the example you have given. Moving away from that example to possibly more general usage, you would want to make sure that the add output calculations are finished before being used somewhere else.
Your example already does this, redundantly, using at least 2 mechanisms:
1. The intervening cudaDeviceSynchronize(): this forces the previously issued work to complete.
2. Stream semantics: one rule of stream semantics is that work issued into a particular stream will execute in issue order. Item B issued into stream X will not begin until the previously issued item A in stream X has completed.
So you don't really need the cudaDeviceSynchronize() in this case. It isn't "hurting" anything from a functionality perspective, but it is probably adding a few microseconds to the overall execution time.
More generally, if you had issued your add and multiply kernel into separate streams, then CUDA provides no guarantees of execution order, even though you "issued" the multiply kernel after the add kernel.
In that case (not the one you have here) if you needed the multiply operation to use the previously computed add results, you would need to enforce that somehow (enforce the completion of the add kernel before the multiply kernel). You have already shown one method to do that here, using a synchronize call.
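As a minimal sketch of the simplification described above (reusing the variable names from the question; only the copy and the temporary buffer are removed):

// ADD: writes its result into dev_AddResult
add<<<B, T>>>(dev_a, dev_b, dev_AddResult);

// MULTIPLY: issued into the same (default) stream, so it cannot start until
// the add kernel has finished; no cudaMemcpy and no dev_Temp are needed.
multiply<<<B, T>>>(dev_a, dev_AddResult, dev_MultiplyResult);

// The device-to-host copy synchronizes before the host reads the results.
cudaMemcpy(e, dev_MultiplyResult, N * sizeof(int), cudaMemcpyDeviceToHost);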

Should cudaMemset work on the device pointer mapped from cudaHostRegister

I came across sample code from one of my colleagues where cudaMemset doesn't seem to work properly when run on a V100.
#include <iostream>
#include <stdio.h>

#define CUDACHECK(cmd) \
{\
    cudaError_t error = cmd;\
    if (error != cudaSuccess) { \
        fprintf(stderr, "info: '%s'(%d) at %s:%d\n", cudaGetErrorString(error), error, __FILE__, __LINE__);\
    }\
}

__global__ void setValue(int value, int* A_d) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tx == 0) {
        A_d[tx] = A_d[tx] + value;
    }
}

__global__ void printValue(int* A_d) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tx == 0) {
        printf("A_d: %d\n", A_d[tx]);
    }
}

int main(int argc, char* argv[]) {
    int *A_h, *A_d;
    int size = sizeof(int);

    A_h = (int*)malloc(size);
    A_h[0] = 1;

    CUDACHECK(cudaSetDevice(0));
    CUDACHECK(cudaHostRegister(A_h, size, 0));
    CUDACHECK(cudaHostGetDevicePointer((void**)&A_d, A_h, 0));

    setValue<<<64, 1, 0, 0>>>(5, A_d);
    cudaDeviceSynchronize();
    printf("A_h: %d\n", A_h[0]);

    A_h[0] = 100;
    printf("A_h: %d\n", A_h[0]);
    printValue<<<64, 1, 0, 0>>>(A_d);
    cudaDeviceSynchronize();

    CUDACHECK(cudaMemset(A_d, 1, size));
    printf("A_h: %d\n", A_h[0]);
    printValue<<<64, 1, 0, 0>>>(A_d);
    cudaDeviceSynchronize();

    cudaHostUnregister(A_h);
    free(A_h);
}
When this sample is compiled and run, the output is seen as below.
/usr/local/cuda-11.0/bin/nvcc memsettest.cu -o test
./test
A_h: 6
A_h: 100
A_d: 100
A_h: 16843009
A_d: 16843009
We expect A_h and A_d to be set to 1 with cudaMemset, but they are set to some huge value as seen above.
So, is cudaMemset expected to work on the device pointer A_d returned by cudaHostGetDevicePointer?
Is this A_d expected to be used only in kernels?
We also see that cudaMemcpy DtoH and HtoD seem to work on the same device pointer A_d.
Can someone help us with the correct behavior?
We expect A_h and A_d to be set to 1 with cudaMemset.
You're confused about how cudaMemset works. Conceptually, it is very similar to memset from the C standard library. You should try your same test case with memset and see what it does.
Anyway, cudaMemset takes a pointer, a byte value, and a size in bytes to set, just like memset.
So your cudaMemset command:
CUDACHECK (cudaMemset(A_d, 1, size) );
is setting each byte to 1. Since size is 4, that means you are setting A_d[0] to 0x01010101 (hexadecimal). If you plug that value into the Windows programmer calculator, it is 16843009 in decimal. So everything is working as expected here, from what I can see.
Again, I'm pretty sure you would see the same behavior with memset for the same test case/usage.
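If the intent is to store the integer value 1 rather than the byte pattern 0x01010101, a minimal sketch (reusing A_h, A_d, and CUDACHECK from the code above) could look like this:

// A_h is host-registered (zero-copy) memory, so the simplest option is to
// assign the value on the host; the device pointer A_d aliases the same allocation.
A_h[0] = 1;

// Alternatively, copy the value through the device pointer.
int one = 1;
CUDACHECK(cudaMemcpy(A_d, &one, sizeof(int), cudaMemcpyHostToDevice));

// By contrast, cudaMemset(A_d, 1, sizeof(int)) sets every byte to 0x01,
// which is why the question observes 0x01010101 = 16843009.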

CUDA cudaMemcpyFromSymbol "invalid device symbol" error?

So I see a parent question about how to copy from host to the constant memory on GPU using cudaMemcpyToSymbol.
My question is how to do the reverse, copying from device constant memory to the host using cudaMemcpyFromSymbol.
In the following minimal reproducible example, I either get
1) an invalid device symbol error using cudaMemcpyFromSymbol(const_d_a, b, size);, or
2) a segmentation fault if I use cudaMemcpyFromSymbol(&b, const_d_a, size, cudaMemcpyDeviceToHost).
I have consulted the manual, which suggests I code as in 1), and this SO question, which suggests I code as in 2). Neither of them works here.
Could anyone kindly help suggesting a workaround with this? I must be understanding something improperly... Thanks!
Here is the code:
// a basic CUDA function to test working with device constant memory
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

const unsigned int N = 10; // size of vectors

__constant__ float const_d_a[N * sizeof(float)];

int main()
{
    float * a, * b; // a and b are vectors. c is the result
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    /**************************** Exp 1: sequential ***************************/
    int i;
    int size = N * sizeof(float);
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
    }

    // 1. copy a to constant memory
    cudaError_t err = cudaMemcpyToSymbol(const_d_a, a, size);
    if (err != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }

    cudaError_t err2 = cudaMemcpyFromSymbol(const_d_a, b, size);
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }

    double checksum0 = 0, checksum1 = 0;
    for (i = 0; i < N; i++){
        checksum0 += a[i];
        checksum1 += b[i];
    }
    printf("Checksum for elements in host memory is %f\n.", checksum0);
    printf("Checksum for elements in constant memory is %f\n.", checksum1);
    return 0;
}
In CUDA, the various cudaMemcpy* operations are modeled after the C standard library memcpy routine. In that function, the first pointer is always the destination pointer and the second pointer is always the source pointer. That is true for all cudaMemcpy* functions as well.
Therefore, if you want to do cudaMemcpyToSymbol, the symbol had better be the first (destination) argument passed to the function (the second argument would be a host pointer). If you want to do cudaMemcpyFromSymbol, the symbol needs to be the second argument (the source position), and the host pointer is the first argument. That's not what you have here:
cudaError_t err2 = cudaMemcpyFromSymbol(const_d_a, b, size);
                                        ^          ^
                                        |          |
                                        |          This should be the symbol.
                                        |
                                        This is supposed to be the host destination pointer.
You can discover this with a review of the API documentation.
If we reverse the order of those two arguments in that line of code:
cudaError_t err2 = cudaMemcpyFromSymbol(b, const_d_a, size);
Your code will run with no errors and the final results printed will match.
There is no need to use an ampersand with either of the a or b pointers in these functions. a and b are already pointers. In the example you linked, pi_gpu_h is not a pointer. It is an ordinary variable. To copy something to it using cudaMemcpyFromSymbol, it is necessary to take the address of that ordinary variable, because the function expects a (destination) pointer.
As an aside, this doesn't look right:
__constant__ float const_d_a[N * sizeof(float)];
This is effectively a static array declaration, and apart from the __constant__ decorator it should be done equivalently to how you would do it in C or C++. It's not necessary to multiply N by sizeof(float) here, if you want storage for N float quantities. Just N by itself will do that:
__constant__ float const_d_a[N];
However, leaving that as-is does not create problems for the code you have posted.
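Putting the answer's points together, a minimal sketch of the round trip through the __constant__ symbol (same names as in the question) might look like this:

__constant__ float const_d_a[N];              // storage for N floats

// host -> constant memory: the symbol is the destination (first argument)
cudaMemcpyToSymbol(const_d_a, a, N * sizeof(float));

// constant memory -> host: the symbol is the source (second argument)
cudaMemcpyFromSymbol(b, const_d_a, N * sizeof(float));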

cuBLAS argmin -- segfault if outputing to device memory?

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array.
Here's the full function declaration:
cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result)
The cuBLAS programmer guide describes the cublasIsamin() parameters.
If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example:
void argmin_experiment_hostOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;

    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));

    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    int result; //host memory
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, &result));
    printf("argmin = %d, min = %f \n", result, h_A[result]);

    CHECK_CUBLAS(cublasDestroy(handle));
}
However, if I use device (GPU) memory for result, then cublasIsamin segfaults. Here's an example that segfaults:
void argmin_experiment_deviceOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;

    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));

    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    int* d_result = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_result, 1 * sizeof(d_result[0]))); //just enough device memory for 1 result
    CHECK_CUDART(cudaMemset(d_result, 0, 1 * sizeof(d_result[0])));

    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, d_result)); //SEGFAULT!

    CHECK_CUBLAS(cublasDestroy(handle));
}
The Nvidia guide says that `cublasIsamin()` can output to device memory. What am I doing wrong?
Motivation: I want to compute the argmin() of several vectors concurrently in multiple streams. Outputting to host memory requires CPU-GPU synchronization and seems to kill the multi-kernel concurrency. So, I want to output the argmin to device memory instead.
The CUBLAS V2 API does support writing scalar results to device memory. But it doesn't support this by default. As per Section 2.4 "Scalar parameters" of the documentation, you need to use cublasSetPointerMode() to make the API aware that scalar argument pointers will reside in device memory. Note this also makes these level 1 BLAS functions asynchronous, so you must ensure that the GPU has completed the kernel(s) before trying to access the result pointer.
See this answer for a complete working example.
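For reference, a minimal sketch of the device pointer mode (reusing handle, d_A, N, and d_result from the question's second example):

// Tell cuBLAS that scalar result arguments are device pointers.
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

// The 1-based argmin index is now written to d_result in device memory,
// and the call is asynchronous with respect to the host.
cublasIsamin(handle, N, d_A, 1, d_result);

// Ensure the result has been written before it is consumed
// (for example, before copying it back to the host).
cudaDeviceSynchronize();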

Modulus computation of an array of cufftComplex data type in CUDA

I made a DLL in Visual C++ to compute the modulus of an array of complex numbers in CUDA. The array is of type cufftComplex. I then called the DLL from LabVIEW to check the accuracy of the result, and I'm receiving an incorrect result. Could anyone tell me what is wrong with the following code? I think there may be something wrong with my kernel function (the way I am retrieving the cufftComplex data may be incorrect).
#include <math.h>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

extern "C" __declspec(dllexport) void Modulus(cufftComplex *digits, float *result);

__global__ void ModulusComputation(cufftComplex *a, int N, float *temp)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
    {
        temp[idx] = sqrt((a[idx].x * a[idx].x) + (a[idx].y * a[idx].y));
    }
}

void Modulus(cufftComplex *digits, float *result)
{
    #define N 1024
    cufftComplex *d_data;
    float *temp;
    size_t size = sizeof(cufftComplex) * N;

    cudaMalloc((void**)&d_data, size);
    cudaMalloc((void**)&temp, sizeof(float) * N);
    cudaMemcpy(d_data, digits, size, cudaMemcpyHostToDevice);

    int blockSize = 16;
    int nBlocks = N / blockSize;
    if (N % blockSize != 0)
        nBlocks++;

    ModulusComputation<<<nBlocks, blockSize>>>(d_data, N, temp);

    cudaMemcpy(result, temp, size, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(temp);
}
In the final cudaMemcpy in your code, you have:
cudaMemcpy(result, temp, size, cudaMemcpyDeviceToHost);
It should be:
cudaMemcpy(result, temp, sizeof(float)*N, cudaMemcpyDeviceToHost);
If you had included error checking for your cuda calls, you would have seen this cuda call (as originally written) throw an error.
There are other comments that could be made. For example, your block size (16) should ideally be an integral multiple of 32, but this does not prevent proper operation.
After the kernel call, when copying back the result, you are using size as the memory size. The third argument of cudaMemcpy should be N * sizeof(float).
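Both answers point at the same oversized copy; wrapping the runtime calls in an error check would have surfaced it immediately. A minimal sketch of such a wrapper (the macro name is illustrative; it needs <cstdio> for fprintf) together with the corrected copy:

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
        }                                                             \
    } while (0)

// temp holds N floats, so the device-to-host copy must use sizeof(float) * N,
// not size (which is sizeof(cufftComplex) * N).
CUDA_CHECK(cudaMemcpy(result, temp, sizeof(float) * N, cudaMemcpyDeviceToHost));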