Launch out of resources - CUDA

I wrote the following simple CUDA kernel:
__global__ void pr_kernel(float* O, const float* I, const float* W, int N)
{
    int x = threadIdx.x;
    float sum = 0.0f;  // must be initialized before accumulating
    int i;
    if (x < N) {
        for (i = 0; i < N; i++) {
            if (i == x) continue;
            sum += W[x*N+i] * I[x];
        }
        O[x] = (0.15 / N) + 0.85 * sum;
    }
}
The variables are allocated in Python as follows:
N = np.int32(4)
W = np.float32(np.asarray(
    [0, 1, 0, 1, 1, 0, 1, 1,
     0, 1, 0, 1, 1, 1, 0]))
I = np.float32(np.asarray(
    [0.25, 0.25, 0.25, 0.25]))
O = np.float32(np.zeros(N))
I'm transferring the variables using gpuarray.to_gpu, and I'm calling the kernel on a Tesla C2070 with the following line:
pr_kernel(O_d, I_d, W_d, N_d, block=blocksize, grid=gridsize)
Where:
blocksize = (128, 1, 1)
gridsize = (1, 1)
I get the error message:
pycuda.driver.LaunchError: cuLaunchKernel failed: launch out of resources.
This happens even if I reduce blocksize to something like (8, 1, 1). I can run other CUDA programs on the GPU with a blocksize of (512, 1, 1) so I'm confident this is not due to a GPU configuration issue.
What am I doing wrong? Thanks for any help.

The problem was that I was transferring the integer N to the GPU using gpuarray.to_gpu, when I should have been passing N directly to the pr_kernel call as a scalar argument.

I hit a similar problem when the type declared in the kernel signature didn't match the type of the argument I passed. Presumably the mismatch makes the launch request more resources than expected, which triggers the error.
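For reference, the same rule applies in CUDA C++: a kernel parameter declared as a plain int is passed by value at launch time, so the host passes the scalar itself rather than a device allocation. A minimal sketch (the scale kernel here is hypothetical, not from the question):

// Hypothetical CUDA C++ analogue: scalar kernel parameters are passed
// by value at launch; they must not be replaced by device pointers.
__global__ void scale(float* out, int n)
{
    int x = threadIdx.x;
    if (x < n) out[x] *= 2.0f;
}

int main()
{
    float* out_d;
    int n = 4;
    cudaMalloc(&out_d, n * sizeof(float));
    scale<<<1, 128>>>(out_d, n);  // n itself, like N in the PyCUDA call
    cudaDeviceSynchronize();
    cudaFree(out_d);
    return 0;
}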

Wrong scan pseudocode by CUDA?

I'm trying to implement the pseudocode of the prefix-sum (scan) operation given in the CUDA documentation. The results I'm getting are completely wrong, and I've gone over my code many times without finding the problem.
Here is the pseudocode given by CUDA:
1: for d = 1 to log2 n do
2:     for all k in parallel do
3:         if k >= power(2, d) then
4:             x[k] = x[k - power(2, d-1)] + x[k]
And the CUDA kernel I've coded so far is:
// CUDA Kernel
__global__ void
prefixSumCUDA(int *a, size_t n)
{
    int tId = threadIdx.x;
    for (int offset = 1; offset < n; offset *= 2) {
        if (tId >= pow((float)2, offset)) {
            int temp = tId - pow((float)2, offset - 1);
            a[tId] += a[temp];
        }
    }
}
Please let me know if I am making any mistakes here. I know this implementation depends heavily on the block and grid sizes, so here is my kernel call:
// Kernel launch
prefixSumCUDA<<<1, 32>>>(d_A, n);
The input array is an 8-element integer array:
[-] array: 1, 2, 3, 4, 5, 6, 7, 8
And the result of the CUDA kernel is as following:
[-] array: 1, 2, 5, 7, 14, 18, 22, 26
Thanks for any help in advance!
I solved the problem by implementing it another way. It works better to start the offset loop from 0 rather than 1, which gives the following code.
__global__ void
prefixSumCUDA(int *a, size_t n)
{
    int tId = threadIdx.x;
    int end = ceil(log2((float)n));
    for (int offset = 0; offset < end; offset++) {
        if (tId >= (1 << offset)) {
            a[tId] += a[tId - (1 << offset)];
        }
    }
}
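Note that this version still has no synchronization between the loop iterations, so it only happens to work because all 32 participating threads are in a single warp executing in lockstep; on architectures with independent thread scheduling even that is not guaranteed. A more robust variant, sketched below under the assumption that the whole array fits in one thread block (BLOCK_SIZE is a hypothetical compile-time bound, not from the original code), double-buffers through shared memory and synchronizes between steps:

// Sketch: block-wide inclusive scan with explicit synchronization.
// Assumes n <= blockDim.x <= BLOCK_SIZE and a single block.
#define BLOCK_SIZE 1024

__global__ void
prefixSumSafe(int *a, size_t n)
{
    __shared__ int buf[2][BLOCK_SIZE];
    int tId = threadIdx.x;
    int in = 0, out = 1;

    if (tId < n) buf[in][tId] = a[tId];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        if (tId < n) {
            if (tId >= offset)
                buf[out][tId] = buf[in][tId] + buf[in][tId - offset];
            else
                buf[out][tId] = buf[in][tId];
        }
        __syncthreads();
        // Swap the read and write buffers for the next step.
        in = 1 - in;
        out = 1 - out;
    }

    if (tId < n) a[tId] = buf[in][tId];
}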

CUDA atomicAdd producing odd results [duplicate]

This question already has an answer here:
Atomic Operation failed in CUDA
(1 answer)
I have a situation where atomicAdd is not performing as I would expect. I am very new to CUDA, so I am likely missing something, but I've been stuck on this for nearly a day and have rewritten most other areas of my program thinking it was a memory-allocation issue. That doesn't seem to be the case, though.
Essentially, the code calls the 'analyze' kernel, which should produce the min/max and sum of the data; the same data is used for the min/max as for the sum. The result from the atomicAdd operation, however, reads like a memory address: very large numbers. Is there something I am missing? I have gone over this a hundred times and stripped almost everything out of the kernel except the min/max and sum.
__global__ void analyze(int *data, int *min, int *max, int *mean)
{
    int t_id = (threadIdx.x * AXIS_COUNT) + blockIdx.x;
    int b_id = blockIdx.x;
    int localVal = data[t_id];
    atomicMin(&min[b_id], localVal);
    atomicMax(&max[b_id], localVal);
    atomicAdd(&mean[b_id], localVal);
}
...........
int r;
int step = WINDOW_LENGTH * AXIS_COUNT;
for (r = 0; r < out_rows; r++) {
    analyze<<<AXIS_COUNT, WINDOW_LENGTH>>>(
        &d_data[r * step],
        &d_min[r * AXIS_COUNT],
        &d_max[r * AXIS_COUNT],
        &d_mean[r * AXIS_COUNT]);
}
cudaDeviceSynchronize();
cudaMemcpy(h_min, d_min, int_size, cudaMemcpyDeviceToHost);
cudaMemcpy(h_max, d_max, int_size, cudaMemcpyDeviceToHost);
cudaMemcpy(h_mean, d_mean, int_size, cudaMemcpyDeviceToHost);
for (r = 0; r < out_rows; r++) {
    fprintf(stderr, "mean %d, x: %d, y: %d z: %d\n",
            r, h_mean[r*AXIS_COUNT], h_mean[r*AXIS_COUNT + 1], h_mean[r*AXIS_COUNT + 2]);
}
The results are of the form:
mean 5025, x: 2078310793, y: 1999653847 z: -1453684997
mean 5026, x: 2078308025, y: 1999646363 z: -1453660854
mean 5027, x: 2078305391, y: 1999639383 z: -1453636904
mean 5028, x: 2078304342, y: 1999630356 z: -1453613212
I have validated the min/max values against the source data to confirm they are correct.
The answer was to initialize the shared memory in the kernel before accumulating into it.
__shared__ double sum[AXIS_COUNT];
if (threadIdx.x == 0) {
    int i;
    for (i = 0; i < AXIS_COUNT; i++)
        sum[i] = 0;
}
__syncthreads();
int t_id = (threadIdx.x * AXIS_COUNT) + blockIdx.x;
int b_id = blockIdx.x;
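The same requirement applies to the global output buffers: atomicMin, atomicMax, and atomicAdd combine each thread's value with whatever is already stored in d_min, d_max, and d_mean, so those buffers must hold sensible starting values (INT_MAX, INT_MIN, and 0 respectively) before the kernel runs. A minimal host-side sketch, assuming the buffer sizes follow the question's out_rows * AXIS_COUNT layout:

// Sketch: initialize the accumulator buffers before launching analyze().
// Requires <vector> and <climits>.
int total = out_rows * AXIS_COUNT;
std::vector<int> init_min(total, INT_MAX);  // so atomicMin starts high
std::vector<int> init_max(total, INT_MIN);  // so atomicMax starts low
cudaMemcpy(d_min, init_min.data(), total * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_max, init_max.data(), total * sizeof(int), cudaMemcpyHostToDevice);
cudaMemset(d_mean, 0, total * sizeof(int));  // sums must start at zero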

Optimize vector matrix multiplication in cuda with large number of zeros

I am using the following kernel to optimize vector-matrix multiplication for the case where both the vector and the matrix contain a large number of zeros. Using this kernel can cut the time for such a multiplication to as little as half of the time taken by cublasSgemv when more than 90% of the entries are zero. But it is still much slower than an equivalent BLAS gemm call on the host under Ubuntu 14.04.
vec = 1 x m, mat = m x m and prod = 1 x m; all are in row-major order
m >= 5000
__global__ void calc_v_m(float *vec, float *mat, float *prod, int m)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    if (x < m)
    {
        prod[x] = 0;
        for (int i = 0; i < m; i++)
        {
            int offset = i*m + x;
            if (mat[offset] != 0 && vec[i] != 0)
                prod[x] += vec[i] * mat[offset];
        }
    }
}
What can be done to further improve the performance of this kernel, apart from using libraries like cuSPARSE?
It would be nice if the optimization were compatible with compute capability 1.2.
Thanks
EDIT
Corrected: prod = 1 x m
GPU = Quadro FX 1800M, CUDA 5.0 on Ubuntu 14.04
EDIT
Complete code that performs the multiplication using (i) BLAS, (ii) CUBLAS, and (iii) the above kernel, for m = 6000. Please enter 0 when asked to enter a value.
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
#include <math.h>
using namespace std;

const int m = 6000;
const int BS = 512; // threads per block
const int NB = ceil((float) m / BS); // number of blocks

__global__ void calc_v_m(float *vec, float *mat, float *prod, int m)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    if (x < m)
    {
        prod[x] = 0;
        for (int i = 0; i < m; i++)
        {
            int offset = i*m + x;
            if (mat[offset] != 0 && vec[i] != 0)
                prod[x] += vec[i] * mat[offset];
        }
    }
}

int main()
{
    timespec blas_start, blas_end, cublas_start, cublas_end, opt_start, opt_end;
    long totalnsec; // total nanoseconds
    double totalsec, totaltime;
    int i, j;
    float *A = new float[m];   // 1 x m
    float *B = new float[m*m]; // m x m
    float *C = new float[m];   // 1 x m
    float input;
    cout << "Enter a value to populate the vector (0 to make it sparse) ";
    cin >> input;

    // input vector A: every 600th element is non-zero, i.e. ~90% zeros
    for (i = 0; i < m; i++)
    {
        A[i] = input;
        if (i % 600 == 0) // adjust for sparsity
            A[i] = i;
    }

    // input matrix B: identity matrix
    for (i = 0; i < m; i++)
        for (j = 0; j < m; j++)
            B[j*m + i] = (i == j);

    // blas on host
    clock_gettime(CLOCK_REALTIME, &blas_start);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
    //cblas_sgemv(CblasRowMajor, CblasTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);
    clock_gettime(CLOCK_REALTIME, &blas_end);
    /* for(i = 0; i < m; i++) printf("%f ", C[i]); */

    // cublas section
    cudaError_t cudaStat;
    cublasHandle_t handle;
    cublasCreate(&handle);
    float *A_d, *B_d, *C_d;
    cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
    if (cudaStat != cudaSuccess) printf("Error Allocating Memory for A_d\n");
    cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
    if (cudaStat != cudaSuccess) printf("Error Allocating Memory for B_d\n");
    cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
    if (cudaStat != cudaSuccess) printf("Error Allocating Memory for C_d\n");
    cudaMemcpy(A_d, A, sizeof(float)*m, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, sizeof(float)*m*m, cudaMemcpyHostToDevice);
    float alpha = 1.0f, beta = 0.0f;
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_REALTIME, &cublas_start);
    cublasSgemv(handle, CUBLAS_OP_N, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1);
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_REALTIME, &cublas_end);
    cudaMemcpy(C, C_d, sizeof(float)*m, cudaMemcpyDeviceToHost);
    /* for(i = 0; i < m; i++) printf("%f ", C[i]); */

    // Call kernel having optimization for zeros
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_REALTIME, &opt_start);
    /////////////////// call kernel //////////////////
    calc_v_m<<<NB, BS>>>(A_d, B_d, C_d, m);
    //////////////////////////////////////////////////
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_REALTIME, &opt_end);
    cudaMemcpy(C, C_d, sizeof(float)*m, cudaMemcpyDeviceToHost);
    /* for(i = 0; i < m; i++) printf("%f ", C[i]); */

    // Print times
    // blas time
    totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
    totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
    if (totalnsec < 0)
    {
        totalnsec += 1e9;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "blas Time = " << totaltime << "\n";

    // cublas time
    totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
    totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
    if (totalnsec < 0)
    {
        totalnsec += 1e9;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "cublas Time = " << totaltime << "\n";

    // optimized kernel time
    totalsec = (double)opt_end.tv_sec - (double)opt_start.tv_sec;
    totalnsec = opt_end.tv_nsec - opt_start.tv_nsec;
    if (totalnsec < 0)
    {
        totalnsec += 1e9;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec*1e-9;
    cout << "Opt Kernel Time = " << totaltime << "\n";
    return 0;
}
Results
$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
$ ./blascomp.o
Enter a value to populate the vector (0 to make it sparse) 0
blas Time = 0.000105207
cublas Time = 0.0070294
Opt Kernel Time = 0.00642797
At least on my system, BLAS is still the fastest for such a scenario.
Things get even more interesting if every 1200th element, instead of every 600th, is made non-zero (i.e. the vector is made even sparser):
Enter a value to populate the vector (0 to make it sparse) 0
blas Time = 7.84e-05
cublas Time = 0.00698783
Opt Kernel Time = 0.00643042
The important thing to recognise here is that the gemv operation you are concerned with is fundamentally memory-throughput limited on GPUs, rather than compute-throughput limited. This implies that an "optimisation" like the one in your kernel:
__global__ void calc_v_m(float *vec, float *mat, float *prod, int m)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    if (x < m)
    {
        prod[x] = 0;
        for (int i = 0; i < m; i++)
        {
            int offset = i*m + x;
            if (mat[offset] != 0 && vec[i] != 0)
                prod[x] += vec[i] * mat[offset];
        }
    }
}
isn't really an optimisation at all, simply because memory transactions, not floating-point arithmetic, are the performance bottleneck in the kernel, and your code must perform most of those memory transactions regardless of whether the zero check allows the multiply-add to be skipped.
Consider the following, instrumented version of roughly the same code:
__constant__ float cvec1[2];
__global__ void
__launch_bounds__(512, 4)
calc_v_m1(const float* __restrict__ vec,
          const float* __restrict__ mat,
          float* __restrict__ prod,
          int m,
          int do_reads = 1,
          int do_write = 1)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    if (x < m)
    {
        float res = 0;
        float mval = cvec1[0], vval = cvec1[1];
        #pragma unroll 8
        for (int i = 0; i < m; i++)
        {
            int offset = i*m + x;
            if (do_reads) {
                mval = mat[offset];
                vval = vec[i];
            }
            res += mval * vval;
        }
        if (do_write) prod[x] = res;
    }
}
Here I have added two optional arguments which control whether the kernel will load from global memory, and whether the kernel will store to global memory. This allows me to quantify the performance impact of the memory loads, computation, and memory stores independently. The results using your test code are instructive:
Function                              nvprof time
---------------------------------------------------
cublasSgemv                           942.75us
calc_v_m                              2798.4us
calc_v_m1(do_reads=1, do_write=1)     962.40us
calc_v_m1(do_reads=1, do_write=0)     970.40us
calc_v_m1(do_reads=0, do_write=1)     55.166us
calc_v_m1(do_reads=0, do_write=0)     55.102us
[All benchmarking done on a GTX970 using the CUDA 7.5 release toolchain and CUBLAS 7.5 library]
In no particular order:
The fully instrumented kernel's runtime is within a few percent of the equivalent CUBLAS call.
The memory fetches from global memory are the bottleneck.
The actual computations in the kernel constitute only about 5% of the running time.
The "fire-and-forget" nature of write operations in CUDA means that the latency of the write has no significant effect on throughput.
Your "optimised" kernel is considerably slower than either CUBLAS or the instrumented kernel, probably because all you are introducing is branch divergence without addressing the source of the kernel's bottleneck (the latency of the memory loads).
The only time conditionally executing the FMAD operation would make sense is on an architecture where memory had near-zero latency and floating-point throughput was severely constrained. The GPU definitely doesn't fall into that category.
The only other option for optimising this would be to exploit a priori information about the sparsity pattern of the LHS matrix to remove the need to read zero entries, which is precisely what sparse matrix formats and sparse linear algebra codes are designed to accommodate.
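For illustration, here is a generic one-thread-per-row CSR sparse matrix-vector multiply sketch (a textbook formulation, not code from this answer; row_ptr, col_idx, and val are the standard CSR arrays, assumed to be built on the host beforehand):

// Sketch: CSR sparse matrix-vector multiply, y = A * x.
__global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                         const float *val, const float *x,
                         float *y, int n_rows)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < n_rows)
    {
        float sum = 0.0f;
        // Only the stored (non-zero) entries of the row are read,
        // which is where the bandwidth saving comes from.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}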

Unable to use cublasXt

I tried the following simple program using cublasXt to multiply two matrices, and I get all-zero output. Can someone let me know why? My computer can use other CUDA libraries normally, and I have two GPUs. My machine is 64-bit, as required by cublasXt.
By the way, I've checked that none of the function calls in the program returns an error.
#include <stdio.h>
#include "cublasXt.h"
#include <curand.h>

void fill(double* &x, long m, long n, double val) {
    x = new double[m * n];
    for (long i = 0; i < m; ++i) {
        for (long j = 0; j < n; ++j) {
            x[i * n + j] = val;
        }
    }
}

int main() {
    cublasXtHandle_t xt_;
    cublasXtCreate(&xt_);
    double *A, *B, *C;
    long m = 10, n = 10, k = 20;
    fill(A, m, k, 0.2);
    fill(B, k, n, 0.3);
    fill(C, m, n, 0.0);
    double alpha = 1.0;
    double beta = 0.0;
    cublasXtDgemm(xt_, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k, &alpha, A, m, B, k, &beta, C, m);
    cudaDeviceSynchronize();
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            printf("%lf ", C[i * n + j]);
        }
        printf("\n");
    }
    cublasXtDestroy(xt_);
    return 0;
}
The first issue with your code is that you have no call to cublasXtDeviceSelect. This is a necessary part of any cublasXt program: it tells the CUBLAS runtime how many devices to use and which ones.
As a simple proof point, try adding the following immediately after your handle creation call:
if (cublasXtCreate(&xt_) != CUBLAS_STATUS_SUCCESS) { printf("handle create fail\n"); return 1; }
int devices[1] = { 0 };  // add this line
if (cublasXtDeviceSelect(xt_, 1, devices) != CUBLAS_STATUS_SUCCESS) { printf("set devices fail\n"); return 1; }  // add this line
This should cause your output to change from all zeros to all 1.2 (although it only uses one GPU).
However, you will probably want to read the section of the documentation I linked above (for example, if you want to use two GPUs of the correct type). The multi-GPU cublasXt functionality included in the toolkit is at this time limited to 2 devices (but note my comments below), and those 2 GPUs must be on a dual-GPU board, such as a Tesla K10 or GeForce GTX 690 (I think a Titan Z or Tesla K80 should also work, to pick other examples).
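For reference, selecting two devices follows the same pattern; a minimal sketch, assuming devices 0 and 1 form a supported dual-GPU configuration:

int devices[2] = { 0, 1 };  // assumes a supported dual-GPU pair
if (cublasXtDeviceSelect(xt_, 2, devices) != CUBLAS_STATUS_SUCCESS) {
    printf("set devices fail\n");
    return 1;
}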
Additional details of licensing are here. You can get an evaluation version of the "Premier" package that has fewer restrictions on GPUs.

CUDA programming

I am new to CUDA. I have a question about a simple program and hope someone can spot my mistake.
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < 16 && iy < 16)
    {
        for (int i = 0; i < 256; i++)
            C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
    }
}
extern "C" void cuda_p(float* A, float* B, float* C)
{
float* dev_A;
float* dev_B;
float* dev_C;
cudaMalloc((void**) &dev_A, sizeof(float) * 256);
cudaMalloc((void**) &dev_B, sizeof(float) * 256);
cudaMalloc((void**) &dev_C, sizeof(float) * 256);
cudaMemcpy(dev_A, A, sizeof(float) * 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_B, B, sizeof(float) * 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_C, C, sizeof(float) * 256, cudaMemcpyHostToDevice);
ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
cudaMemcpy(A, dev_A, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaMemcpy(B, dev_B, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaMemcpy(C, dev_C, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaFree(dev_A);
cudaFree(dev_B);
cudaFree(dev_C);
}
Are you sure about the kernel launch configuration? In your code you try to launch some unknown function ADDD. And your execution configuration is gridDim = (16, 1, 1) and blockDim = (16, 1, 1), so in your kernel blockIdx.x = [0..16) and threadIdx.x = [0..16). If I understood you right, then what you want is:
ix = threadIdx.x;
iy = blockIdx.x;
Read about it in the CUDA Programming Guide (Appendix B.15).
But that's not the only mistake. When you accumulate values in C[i] you have a race condition: 16 threads (1 warp) simultaneously read C[i], add some value (A[ix+iy*16] + B[ix+iy*16]), and write the result back to C[i]. You should either use atomic add operations (CUDA Programming Guide, Appendix B.11.1.1) or redesign your kernel to maximize memory coalescing (CUDA C Best Practices Guide 3.2.1), because atomics are very, VERY slow...
Your primary issue is that the core of your kernel doesn't make sense. What you have is:
for (int i = 0; i < 256; i++)
    C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
This has each thread go through and read every entry in C, add its own part of A and B to it, and write it back. Since each thread does this at the same time, they step on each other. If you really want every entry in C to be the sum of all entries in A and all entries in B, you want to make each thread responsible for a single entry in C:
for (int i = 0; i < 256; i++)
    C[ix+iy*16] += A[i] + B[i];
If instead you want every entry in C to be the sum of the corresponding entries in A and B, which seems more likely, then you would get rid of the loop, and your kernel would look like:
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < 16 && iy < 16)
    {
        C[ix+iy*16] = A[ix+iy*16] + B[ix+iy*16];
    }
}
Each thread grabs one entry from A and one from B, and writes one entry in C.
Your secondary issue is that you're launching the kernel wrong. You're doing:
ADDD<<<16,16>>>(dev_A, dev_B, dev_C);
This launches a one-dimensional grid of 16 blocks, each with 16 threads (of the typo'd kernel name). If you want your threads positioned in two dimensions (using both the x and y indexes), you need to use dim3 as your size specifier type. Something like:
// Use a grid of 4x4 blocks
dim3 gridSize;
gridSize.x = 4;
gridSize.y = 4;
// Use blocks of 4x4 threads.
dim3 blockSize;
blockSize.x = 4;
blockSize.y = 4;
// Run a 4x4 grid of blocks, each with 4x4 threads.
// So you end up with a 16x16 group of threads, matching your data layout.
ADD<<<gridSize,blockSize>>>(dev_A,dev_B,dev_C);
To avoid using atomicAdd, you can have each thread write its value into shared memory, reduce the values there, and then write the result out (a sketch follows below). Note: do not try to use atomicAdd on shared memory; it can be even slower than atomicAdd on global memory (only integer atomicAdd in shared memory tends to be faster than its global counterpart). Also, writes into shared memory should avoid bank conflicts. In my tests, using a shared-memory reduction made the algorithm about 1-5% faster than atomicAdd, and warp-level synchronization can be faster still.
In general, my suggestions are:
Use shared memory instead of atomicAdd.
Use __syncwarp() rather than __syncthreads() (this needs a special design).
You might then see a 5-10% increase in speed.
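As a rough illustration of the shared-memory approach described above (a generic block-sum reduction sketch, not code from this answer; BLOCK is an assumed block size):

// Sketch: block-level sum reduction in shared memory, as an alternative
// to having every thread issue its own global atomicAdd.
#define BLOCK 256

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[BLOCK];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element (0 if out of range).
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    // One atomicAdd per block instead of one per thread.
    if (tid == 0)
        atomicAdd(out, buf[0]);
}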