I was getting launch errors on the following code (it is a pattern reduction). After spending some time on it, I noticed that values of q smaller than 39 are fine, but anything higher gives a launch error.
At first I thought it was an excessive number of nested loops, but in the end I noticed that lower values of q are fine even with additional nested loops.
In CUDA debug mode, no error is reported.
Question
Is it a stack error?
Assuming the maximum value of q is equal to the maximum value of unsigned short, is it still doable?
I made the code as simple as possible:
#include "device_launch_parameters.h"
#include "stdlib.h"
#include "cuda.h"
#include <helper_functions.h> // includes cuda.h and cuda_runtime_api.h
#include <helper_cuda.h> // helper functions for CUDA error check
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void loopTest(int q, int *ops, short* best) {
int i, j, k, l, m, n, o, p;
const int off(8);
int maxSum(0), sum;
const int qi = q - (blockDim.x * blockIdx.x + threadIdx.x);
if (qi < 0) return;
// qi, the upper for limit reduces as threadId increases
for (i = 0; i < qi - off + 0; i++)
for (j = i + 1; j < qi - off + 1; j++)
for (k = j + 1; k < qi - off + 2; k++)
for (l = k + 1; l < qi - off + 3; l++)
for (m = l + 1; m < qi - off + 4; m++)
for (n = m + 1; n < qi - off + 5; n++)
for (o = n + 1; o < qi - off + 6; o++)
for (p = o + 1; p < qi - off + 7; p++)
{
sum = i + j + k + l + m + n + o + p;
if (sum > maxSum) {
best[0] = i;
best[1] = j;
best[2] = k;
best[3] = l;
best[4] = m;
best[5] = n;
best[6] = o;
best[7] = p;
maxSum = sum;
}
}
ops[0] = maxSum;
printf("max %d:", maxSum);
}
int main() {
int *d_ops;
short *d_best;
cudaError_t cudaStatus;
cudaStatus = cudaMalloc((void**)(&d_ops), sizeof(int));
cudaStatus = cudaMalloc((void**)(&d_best), sizeof(short) * 8);
// any q value smaller than 39 is fine (no error), but anything higher gives a launch error
loopTest<<<1, 1>>>(38, d_ops, d_best);
cudaDeviceSynchronize();
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "failure: %s", cudaGetErrorString(cudaStatus));
return 99;
}
cudaStatus = cudaFree(d_ops);
cudaStatus = cudaFree(d_best);
cudaStatus = cudaDeviceReset();
cudaStatus = cudaGetLastError();
getchar();
return cudaStatus;
}
Background
Despite the high frequency of inactive threads (since the q value is initial_q - threadIdx.x), this approach avoids data transfers from the host. It is the best way I found to sweep across alternative cluster partitions.
Rules
all the elements must belong to a single cluster (a.k.a. hard clustering)
all the clusters must have at least one element
the elements' positions in the vector are fixed
Example
(4 partitions, 10 elements, cluster boundaries are shown below)
alt pat 1: 1-1, 2-2, 3-3, 4-10
(one element per cluster, except the last one, which has the elements {4, 5, 6, 7, 8, 9, 10})
alt pat 2: 1-1, 2-2, 3-4, 5-10
(same as above, but the 3rd cluster has the elements {3, 4} and the last has the elements {5, 6, 7, 8, 9, 10})
...
alt pat x: 1-1, 2-2, 3-9, 10-10
alt pat x+1: 1-1, 2-3, 4-4, 5-10
alt pat x+2: 1-1, 2-3, 4-5, 6-10
...
alt pat y: 1-7, 8-8, 9-9, 10-10
the last possible partition has the maximum number of elements in the 1st cluster, so every other cluster has a single element
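To make the boundary sweep concrete, here is a minimal host-side sketch (an illustration only, not the kernel above: it uses 3 boundary indices for 4 clusters instead of the kernel's 8 indices) that enumerates every valid partition of 10 elements:
#include <cstdio>
int main() {
    const int nElem = 10;   // elements, positions fixed
    int count = 0;
    // i, j, k are the zero-based indices of the last element of clusters 1, 2 and 3;
    // cluster 4 then holds elements k+1 .. nElem-1, so every cluster is non-empty.
    for (int i = 0; i < nElem - 3; i++)
        for (int j = i + 1; j < nElem - 2; j++)
            for (int k = j + 1; k < nElem - 1; k++)
                count++;
    printf("alternative partitions: %d\n", count);  // C(9,3) = 84 for 10 elements, 4 clusters
    return 0;
}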
The unspecified launch failure was due to a kernel timeout.
It was caused by the long kernel execution time combined with the Windows TDR (Timeout Detection and Recovery) option being active.
Turning TDR off fixed it.
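As a quick check (a minimal sketch, assuming the affected GPU is device 0), you can query whether a kernel execution timeout / display watchdog applies to the device:
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    // 1 means a run-time limit (e.g. the Windows TDR watchdog) applies to kernels on this device
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}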
Related
I have implemented Dijkstra's algorithm as follows:
#include <iostream>
#include <bits/stdc++.h>
#include<cstdio>
#define ll long long int
#define mod 1000000007
#define pi 3.141592653589793
#define f first
#define s second
#define pb push_back
#define pf push_front
#define pob pop_back
#define pof pop_front
#define vfor(e, a) for (vector<ll> :: iterator e = a.begin(); e != a.end(); e++)
#define vfind(a, e) find(a.begin(), a.end(), e)
#define forr(i, n) for (ll i = 0; i < n; i++)
#define rfor(i, n) for (ll i = n - 1; i >= 0; i--)
#define fors(i, b, e, steps) for(ll i = b; i < e; i += steps)
#define rfors(i, e, b, steps) for(ll i = e; i > b; i -= steps)
#define mp make_pair
using namespace std;
void up(pair<ll, ll> a[], ll n, ll i, ll indArray[]) {
ll ind = (i - 1) / 2;
while (ind >= 0 && a[ind].s > a[i].s) {
swap(a[ind], a[i]);
indArray[a[ind].f] = ind;
indArray[a[i].f] = i;
i = ind;
ind = (i - 1) / 2;
}
}
void down(pair<ll, ll> a[], ll n, ll i, ll indArray[]) {
ll left = 2 * i + 1;
ll right = 2 * i + 2;
ll m = a[i].s;
ll ind = i;
if (left < n && a[left].s < m) {
ind = left;
m = a[left].s;
}
if (right < n && a[right].s < m) {
ind = right;
}
if (ind != i) {
swap(a[i], a[ind]);
indArray[a[i].f] = i;
indArray[a[ind].f] = ind;
}
}
int main() {
ios_base::sync_with_stdio(false);
cin.tie(NULL);
cout.tie(NULL);
// cout << setprecision(10);
ll n, m;
cin >> n >> m;
vector<pair<ll, ll>> a[n];
forr(i, m) {
ll u, v, w;
cin >> u >> v >> w;
a[u].pb(mp(v, w));
a[v].pb(mp(u, w));
}
ll parent[n];
parent[0] = -1;
pair<ll, ll> dist[n];
forr(i, n) {
dist[i] = mp(i, INT_MAX);
}
dist[0].s = 0;
ll ind[n];
iota(ind, ind + n, 0);
ll ans[n];
ans[0] = 0;
bool visited[n];
fill(visited, visited + n, false);
ll size = n;
forr(i, n) {
ll u = dist[0].f;
visited[u] = true;
ll d1 = dist[0].s;
ans[u] = dist[0].s;
swap(dist[0], dist[size - 1]);
size--;
down(dist, size, 0, ind);
for (auto e : a[u]) {
if (visited[e.f]){
continue;
}
ll v = e.f;
ll j = ind[v];
if (dist[j].s > d1 + e.s) {
dist[j].s = d1 + e.s;
up(dist, size, j, ind);
parent[v] = u;
}
}
}
stack<ll> st;
forr(i, n) {
ll j = i;
while (j != -1) {
st.push(j);
j = parent[j];
}
while (!st.empty()) {
cout << st.top() << "->";
st.pop();
}
cout << " Path length is " << ans[i];
cout << '\n';
}
}
This implementation is correct and gives correct output.
As can be seen, each time I select the node with the lowest key value (distance from the source), and then I update the keys of all nodes adjacent to the selected node. After updating the keys of the adjacent nodes, I call the up function to maintain the min-heap property. But a priority queue is available in the C++ STL. How can I use it to avoid the functions up and down?
The thing is, I need to be able to find the index of the node-key pair in the min heap whose key needs to be updated. In this code I have used a separate ind array, which is updated every time the min heap is updated.
But how can I make use of the C++ STL?
As you implied, we cannot random-access efficiently with std::priority_queue. For this case I would suggest that you use std::set. It is not actually a heap but a balanced binary search tree; however, it works the way you want: find, insert and erase are all O(log n), so you can insert/erase/update a value in the desired time, since an update can be done as erase-then-insert. And accessing the minimum is O(1).
You may refer to this reference implementation, which is written exactly the way I mentioned. With your adjacency list, the time complexity is O(E log V), where E is the number of edges and V is the number of vertices.
And please note that
With default comparator, std::set::begin() method returns the min element if non-empty
In this code, it puts the distance as first and index as second. By doing so, the set elements are sorted with distance in ascending order
P.S. I did not look into the implementation of up and down in your code in detail.
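Here is a minimal sketch (my own illustration, not the linked reference implementation) of Dijkstra's algorithm using std::set as the updatable priority structure, storing pairs as {distance, vertex} so the set orders them by distance:
#include <bits/stdc++.h>
using namespace std;
typedef long long ll;
vector<ll> dijkstra(const vector<vector<pair<ll,ll>>> &adj, ll src) {
    ll n = adj.size();
    vector<ll> dist(n, LLONG_MAX);
    set<pair<ll,ll>> s;                      // {distance, vertex}, ordered by distance
    dist[src] = 0;
    s.insert({0, src});
    while (!s.empty()) {
        pair<ll,ll> top = *s.begin();        // begin() is the minimum-distance node
        s.erase(s.begin());
        ll d = top.first, u = top.second;
        for (auto &e : adj[u]) {
            ll v = e.first, w = e.second;
            if (dist[v] > d + w) {
                s.erase({dist[v], v});       // update = erase the old entry ...
                dist[v] = d + w;
                s.insert({dist[v], v});      // ... then insert the new one
            }
        }
    }
    return dist;
}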
I've written the CUDA code below. It's supposed to transpose a matrix using tiled blocks, and the code works with small values, but with, for example, TILE = 32 and a 128 x 128 matrix, it doesn't complete the transpose; it stops after 96. On the host, these are my thread/block dimensions:
dim3 dimGrid((nEven + TILE_DIM - 1) / TILE_DIM, (nEven + TILE_DIM - 1) / TILE_DIM);
dim3 dimBlock(TILE_DIM, TILE_DIM);
where I let the number of threads per block dimension equal the tile size.
The global code is simple, and it should theoretically work:
__global__ void transposeMain( int *idata)
{
__shared__ int tile2[TILE_DIM][TILE_DIM];
int yyy = blockIdx.y * TILE_DIM ; // col values (0,32,64,96)
int xxx = blockIdx.x * TILE_DIM ; // row values (0,32,64,96)
if (xxx < nEven && yyy < nEven)
{
tile2[threadIdx.x][threadIdx.y] = idata[(threadIdx.x + xxx)*nEven + (threadIdx.y + yyy)];
__syncthreads();
idata[(threadIdx.y + yyy)*nEven + (threadIdx.x + xxx)] = tile2[threadIdx.x][threadIdx.y];
}
}
Any idea what might be the problem?
The problem is you are trying to do an in-place transpose.
CUDA device code execution is broken up into threadblocks. Threadblocks (groups of threads) can execute in any order, and do not all (typically) execute at the same time. So when you read a tile in here:
tile2[threadIdx.x][threadIdx.y] = idata[(threadIdx.x + xxx)*nEven + (threadIdx.y + yyy)];
That is OK. But when you write the tile:
idata[(threadIdx.y + yyy)*nEven + (threadIdx.x + xxx)] = tile2[threadIdx.x][threadIdx.y];
You are frequently over-writing data (in some other tile in the original matrix) which you haven't read yet (because the threadblock responsible for reading that tile hasn't even begun to execute yet). Once you overwrite it like this, it's lost.
The solution (for square matrix transpose) has several aspects to it:
Each threadblock must first read 2 tiles. These 2 tiles from the input data will be swapped.
Then each threadblock can write those two tiles.
The tiles along the main diagonal need special casing.
Since most threadblocks are handling 2 tiles, only threadblocks on, or on one side of, the main diagonal need to do any work.
You haven't shown a complete MCVE (which is expected when you have questions like this), and your code has other issues, such as the potential for uncoalesced access (lower performance), so I'm not going to try to "fix" your code.
Instead, here's a fully worked example, lifted from here:
$ cat t469.cu
#include <stdio.h>
#include <cublas_v2.h>
#include <time.h>
#include <sys/time.h>
#define uS_PER_SEC 1000000
#define uS_PER_mS 1000
#define N 4096
#define M 4096
#define TILE_DIM 32
#define BLOCK_ROWS 8
__global__ void transposeCoalesced(float *odata, const float *idata)
{
__shared__ float tile[TILE_DIM][TILE_DIM+1];
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];
__syncthreads();
x = blockIdx.y * TILE_DIM + threadIdx.x; // transpose block offset
y = blockIdx.x * TILE_DIM + threadIdx.y;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}
__global__ void iptransposeCoalesced(float *data)
{
__shared__ float tile_s[TILE_DIM][TILE_DIM+1];
__shared__ float tile_d[TILE_DIM][TILE_DIM+1];
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;
if (blockIdx.y>blockIdx.x) { // handle off-diagonal case
int dx = blockIdx.y * TILE_DIM + threadIdx.x;
int dy = blockIdx.x * TILE_DIM + threadIdx.y;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
tile_s[threadIdx.y+j][threadIdx.x] = data[(y+j)*width + x];
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
tile_d[threadIdx.y+j][threadIdx.x] = data[(dy+j)*width + dx];
__syncthreads();
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
data[(dy+j)*width + dx] = tile_s[threadIdx.x][threadIdx.y + j];
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
data[(y+j)*width + x] = tile_d[threadIdx.x][threadIdx.y + j];
}
else if (blockIdx.y==blockIdx.x){ // handle on-diagonal case
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
tile_s[threadIdx.y+j][threadIdx.x] = data[(y+j)*width + x];
__syncthreads();
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
data[(y+j)*width + x] = tile_s[threadIdx.x][threadIdx.y + j];
}
}
int validate(const float *mat, const float *mat_t, int n, int m){
int result = 1;
for (int i = 0; i < n; i++)
for (int j = 0; j < m; j++)
if (mat[(i*m)+j] != mat_t[(j*n)+i]) result = 0;
return result;
}
int main(){
timeval t1, t2;
float *matrix = (float *) malloc (N * M * sizeof(float));
for (int i = 0; i < N; i ++)
for (int j = 0; j < M; j++)
matrix[(i*M) + j] = i;
// Starting the timer
gettimeofday(&t1, NULL);
float *matrixT = (float *) malloc (N * M * sizeof(float));
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++)
matrixT[(j*N)+i] = matrix[(i*M)+j]; // matrix is obviously filled
//Ending the timer
gettimeofday(&t2, NULL);
if (!validate(matrix, matrixT, N, M)) {printf("fail!\n"); return 1;}
float et1 = (((t2.tv_sec*uS_PER_SEC)+t2.tv_usec) - ((t1.tv_sec*uS_PER_SEC)+t1.tv_usec))/(float)uS_PER_mS;
printf("CPU time = %fms\n", et1);
float *h_matrixT , *d_matrixT , *d_matrix;
h_matrixT = (float *) (malloc (N * M * sizeof(float)));
cudaMalloc((void **)&d_matrixT , N * M * sizeof(float));
cudaMalloc((void**)&d_matrix , N * M * sizeof(float));
cudaMemcpy(d_matrix , matrix , N * M * sizeof(float) , cudaMemcpyHostToDevice);
//Starting the timer
gettimeofday(&t1, NULL);
const float alpha = 1.0;
const float beta = 0.0;
cublasHandle_t handle;
//gettimeofday(&t1, NULL);
cublasCreate(&handle);
gettimeofday(&t1, NULL);
cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, N, M, &alpha, d_matrix, M, &beta, d_matrix, N, d_matrixT, N);
cudaDeviceSynchronize();
gettimeofday(&t2, NULL);
cublasDestroy(handle);
//Ending the timer
float et2 = (((t2.tv_sec*uS_PER_SEC)+t2.tv_usec) - ((t1.tv_sec*uS_PER_SEC)+t1.tv_usec))/(float)uS_PER_mS;
printf("GPU Sgeam time = %fms\n", et2);
cudaMemcpy(h_matrixT , d_matrixT , N * M * sizeof(float) , cudaMemcpyDeviceToHost);
if (!validate(matrix, h_matrixT, N, M)) {printf("fail!\n"); return 1;}
cudaMemset(d_matrixT,0, N*M*sizeof(float));
memset(h_matrixT, 0, N*M*sizeof(float));
dim3 threads(TILE_DIM, BLOCK_ROWS);
dim3 blocks(N/TILE_DIM, M/TILE_DIM);
gettimeofday(&t1, NULL);
transposeCoalesced<<<blocks, threads >>>(d_matrixT, d_matrix);
cudaDeviceSynchronize();
gettimeofday(&t2, NULL);
cudaMemcpy(h_matrixT , d_matrixT , N * M * sizeof(float) , cudaMemcpyDeviceToHost);
if (!validate(matrix, h_matrixT, N, M)) {printf("fail!\n"); return 1;}
float et3 = (((t2.tv_sec*uS_PER_SEC)+t2.tv_usec) - ((t1.tv_sec*uS_PER_SEC)+t1.tv_usec))/(float)uS_PER_mS;
printf("GPU kernel time = %fms\n", et3);
memset(h_matrixT, 0, N*M*sizeof(float));
gettimeofday(&t1, NULL);
iptransposeCoalesced<<<blocks, threads >>>(d_matrix);
cudaDeviceSynchronize();
gettimeofday(&t2, NULL);
cudaMemcpy(h_matrixT , d_matrix , N * M * sizeof(float) , cudaMemcpyDeviceToHost);
if (!validate(matrix, h_matrixT, N, M)) {printf("fail!\n"); return 1;}
float et4 = (((t2.tv_sec*uS_PER_SEC)+t2.tv_usec) - ((t1.tv_sec*uS_PER_SEC)+t1.tv_usec))/(float)uS_PER_mS;
printf("GPU in-place kernel time = %fms\n", et4);
cudaFree(d_matrix);
cudaFree(d_matrixT);
return 0;
}
$ nvcc -arch=sm_20 -o t469 t469.cu -lcublas
$ ./t469
CPU time = 450.095001ms
GPU Sgeam time = 1.937000ms
GPU kernel time = 1.694000ms
GPU in-place kernel time = 1.839000ms
$
Note that this compares several different approaches to matrix transpose.
If you study the iptransposeCoalesced you will see that it is adhering to the 4 specific aspects I outlined above.
It is fishy to use __syncthreads() inside an if statement in CUDA. Try to move it outside that block, simply:
if (xxx < nEven && yyy < nEven)
{
tile2[threadIdx.x][threadIdx.y] = idata[(threadIdx.x + xxx)*nEven + (threadIdx.y + yyy)];
}
__syncthreads();
if (xxx < nEven && yyy < nEven)
{
idata[(threadIdx.y + yyy)*nEven + (threadIdx.x + xxx)] = tile2[threadIdx.x][threadIdx.y];
}
I'm working on the problem of summing the rows of a matrix in CUDA. I give the following example.
Suppose we have the following 20 x 4 array:
1 2 3 4
4 1 2 3
3 4 1 2
.
1 2 3 4
.
.
.
.
.
.
.
.
2 1 3 4
After flattening the 2D array to a 1D array (either in row-major or column-major order), I need to assign each thread to a different row and calculate the cost for that row.
For example
- thread 1 should calculate the cost for 1 2 3 4
- thread 2 should calculate the cost for 4 1 2 3
How can I do that in CUDA?
Thank you all for your replies.
#include <stdio.h>
#include <stdlib.h>
#define MROWS 20
#define NCOLS 4
#define nTPB 256
__global__ void mykernel(int *costdata, int rows, int cols, int *results){
int tidx = threadIdx.x + blockDim.x*blockIdx.x;
if (tidx < rows){
int mycost = 0;
for (int i = 0; i < cols; i++)
mycost += costdata[(tidx*cols)+i];
results[tidx] = mycost;
}
}
int main(){
//define and initialize host and device storage for cost and results
int *d_costdata, *h_costdata, *d_results, *h_results;
h_results = (int *)malloc(MROWS*sizeof(int));
h_costdata = (int *)malloc(MROWS*NCOLS*sizeof(int));
for (int i=0; i<(MROWS*NCOLS); i++)
h_costdata[i] = rand()%4;
cudaMalloc((void **)&d_results, MROWS*sizeof(int));
cudaMalloc((void **)&d_costdata, MROWS*NCOLS*sizeof(int));
//copy cost data from host to device
cudaMemcpy(d_costdata, h_costdata, MROWS*NCOLS*sizeof(int), cudaMemcpyHostToDevice);
mykernel<<<(MROWS + nTPB - 1)/nTPB, nTPB>>>(d_costdata, MROWS, NCOLS, d_results);
// copy results back from device to host
cudaMemcpy(h_results, d_results, MROWS*sizeof(int), cudaMemcpyDeviceToHost);
for (int i=0; i<MROWS; i++){
int loc_cost = 0;
for (int j=0; j<NCOLS; j++) loc_cost += h_costdata[(i*NCOLS)+j];
printf("cost[%d]: host= %d, device = %d\n", i, loc_cost, h_results[i]);
}
}
This assumes "cost" of each row is just the sum of the elements in each row. If you have a different "cost" function, you can modify the activity in the kernel for-loop accordingly. This also assumes C-style row-major data storage (1 2 3 4 4 1 2 3 3 4 1 2 etc.)
If you instead use column-major storage (1 4 3 etc.), you can slightly improve the performance, since the data reads can be fully coalesced. Then your kernel code could look like this:
for (int i = 0; i < cols; i++)
mycost += costdata[(i*rows)+tidx];
You should also use proper cuda error checking on all CUDA API calls and kernel calls.
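For reference, a minimal sketch of that kind of error checking (my own wording of the usual pattern, not a specific library's macro); it assumes stdio.h, stdlib.h and the CUDA runtime header are already included:
#define checkCuda(call)                                               \
  do {                                                                \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
              cudaGetErrorString(err), __FILE__, __LINE__);           \
      exit(1);                                                        \
    }                                                                 \
  } while (0)
// usage: wrap every API call, and check both the launch and the asynchronous execution of a kernel
// checkCuda(cudaMemcpy(d_costdata, h_costdata, MROWS*NCOLS*sizeof(int), cudaMemcpyHostToDevice));
// mykernel<<<(MROWS + nTPB - 1)/nTPB, nTPB>>>(d_costdata, MROWS, NCOLS, d_results);
// checkCuda(cudaGetLastError());        // launch (configuration) errors
// checkCuda(cudaDeviceSynchronize());   // asynchronous execution errors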
EDIT: As discussed in the comments below, for the row-major storage case, in some situations it may be possible to improve memory efficiency by electing to load 16-byte quantities rather than the base type. Following is a modified version that implements this idea for arbitrary dimensions and (more or less) arbitrary base types:
#include <iostream>
#include <typeinfo>
#include <cstdlib>
#include <vector_types.h>
#define MROWS 1742
#define NCOLS 801
#define nTPB 256
typedef double mytype;
__host__ int sizetype(){
int size = 0;
if ((typeid(mytype) == typeid(float)) || (typeid(mytype) == typeid(int)) || (typeid(mytype) == typeid(unsigned int)))
size = 4;
else if (typeid(mytype) == typeid(double))
size = 8;
else if ((typeid(mytype) == typeid(unsigned char)) || (typeid(mytype) == typeid(char)))
size = 1;
return size;
}
template<typename T>
__global__ void mykernel(const T *costdata, int rows, int cols, T *results, int size, size_t pitch){
int chunk = 16/size; // assumes size is a factor of 16
int tidx = threadIdx.x + blockDim.x*blockIdx.x;
if (tidx < rows){
T *myrowptr = (T *)(((unsigned char *)costdata) + tidx*pitch);
T mycost = (T)0;
int count = 0;
while (count < cols){
if ((cols-count)>=chunk){
// read 16 bytes
int4 temp = *((int4 *)(myrowptr + count));
int bcount = 16;
int j = 0;
while (bcount > 0){
mycost += *(((T *)(&temp)) + j++);
bcount -= size;
count++;}
}
else {
// read one quantity at a time
for (; count < cols; count++)
mycost += myrowptr[count];
}
results[tidx] = mycost;
}
}
}
int main(){
int typesize = sizetype();
if (typesize == 0) {std::cout << "invalid type selected" << std::endl; return 1;}
//define and initialize host and device storage for cost and results
mytype *d_costdata, *h_costdata, *d_results, *h_results;
h_results = (mytype *)malloc(MROWS*sizeof(mytype));
h_costdata = (mytype *)malloc(MROWS*NCOLS*sizeof(mytype));
for (int i=0; i<(MROWS*NCOLS); i++)
h_costdata[i] = (mytype)(rand()%4);
size_t pitch = 0;
cudaMalloc((void **)&d_results, MROWS*sizeof(mytype));
cudaMallocPitch((void **)&d_costdata, &pitch, NCOLS*sizeof(mytype), MROWS);
//copy cost data from host to device
cudaMemcpy2D(d_costdata, pitch, h_costdata, NCOLS*sizeof(mytype), NCOLS*sizeof(mytype), MROWS, cudaMemcpyHostToDevice);
mykernel<<<(MROWS + nTPB - 1)/nTPB, nTPB>>>(d_costdata, MROWS, NCOLS, d_results, typesize, pitch);
// copy results back from device to host
cudaMemcpy(h_results, d_results, MROWS*sizeof(mytype), cudaMemcpyDeviceToHost);
for (int i=0; i<MROWS; i++){
mytype loc_cost = (mytype)0;
for (int j=0; j<NCOLS; j++) loc_cost += h_costdata[(i*NCOLS)+j];
if ((i < 10) && (typesize > 1))
std::cout <<"cost[" << i << "]: host= " << loc_cost << ", device = " << h_results[i] << std::endl;
if (loc_cost != h_results[i]){ std::cout << "mismatch at index" << i << "should be:" << loc_cost << "was:" << h_results[i] << std::endl; return 1; }
}
std::cout << "Results are correct!" << std::endl;
}
I've been trying to write a kernel that calculates the sum of the inverses of the distances between N given points, divided by N. A serial version in C would look like this:
average = 0;
for(int i = 0; i < Np; i++){
for(int j = i + 1; j < Np; j++){
average += 1.0e0f/sqrtf((rx[i]-rx[j])*(rx[i]-rx[j]) + (ry[i]-ry[j])*(ry[i]-ry[j]));
}
}
average = average/(float)N;
where rx and ry are the x and y coordinates, respectively.
I generate the points via a kernel that uses a random number generator. For the kernel, I used 128 (256) threads per block for 4k (8k) points. In it, every thread performs the inner loop above; then the results are passed to a reduction-sum function, as follows.
Generate points:
__global__ void InitRNG ( curandState * state, const int seed ){
int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
curand_init (seed, tIdx, 0, &state[tIdx]);
}
__global__
void SortPoints(float* X, float* Y,const int N, curandState *state){
float rdmn1, rdmn2;
unsigned int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
float range;
if(tIdx < N){
rdmn1 = curand_uniform(&state[tIdx]);
rdmn2 = curand_uniform(&state[tIdx]);
range = sqrtf(0.25e0f*N*rdmn1);
X[tIdx] = range*cosf(2.0e0f*pi*rdmn2);
Y[tIdx] = range*sinf(2.0e0f*pi*rdmn2);
}
}
Reduction:
__device__
float ReduceSum2(float In){
__shared__ float data[BlockSize];
unsigned int tIdx = threadIdx.x;
data[tIdx] = In;
__syncthreads();
for(unsigned int i = blockDim.x/2; i > 0; i >>= 1){
if(tIdx < i){
data[tIdx] += data[tIdx + i];
}
__syncthreads();
}
return data[0];
}
Kernel:
__global__
void AvgDistance(float *X, float *Y, float *Avg, const int N){
int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
int bIdx = blockIdx.x;
float x , y;
float d = 0.0f;
if(tIdx < N){
for(int i = tIdx + 1; i < N ; i++){
x = X[tIdx] - X[i];
y = Y[tIdx] - Y[i];
d += 1.0e0f/(sqrtf(x*x + y*y));
}
__syncthreads();
Avg[bIdx] = ReduceSum2(d);
}
}
The kernel is configured and launched as follows:
dim3 threads(BlockSize,BlockSize);
dim3 blocks(ceil(Np/threads.x),ceil(Np/threads.y));
InitRNG<<<blocks.x,threads.x>>>(d_state,seed);
SortPoints<<<blocks.x,threads.x>>>(d_rx,d_ry,Np,d_state);
AvgDistance<<<blocks.x,threads.x,threads.x*sizeof(float)>>>(d_rx,d_ry,d_Avg,Np);
Finally, I copy the data back to host and then perform the remaining sum:
Avg = new float[blocks.x];
CHECK(cudaMemcpy(Avg,d_Avg,blocks.x*sizeof(float),cudaMemcpyDeviceToHost),ERROR_CPY_DEVTOH);
float average = 0;
for(int i = 0; i < blocks.x; i++){
average += Avg[i];
}
average = average/(float)Np;
For 4k points it's OK! The results are:
Average distance between points (via Kernel) = 108.615
Average distance between points (via CPU) = 110.191
In this case the sum may be performed in a different order, causing the two results to diverge slightly from each other, I don't know...
But when it comes to 8k points, the results are quite different:
Average distance between points (via Kernel) = 153.63
Average distance between points (via CPU) = 131.471
To me it seems that both the kernel and the serial code are written the same way, which leads me to distrust the precision of CUDA floating-point calculations. Does this make sense? Or are the accesses to global memory causing conflicts when some threads load the same data from X and Y at the same time? Or is the way I wrote the kernel in some way 'wrong' (I mean, am I doing something that is causing the two results to diverge from each other)?
Actually, from what I can tell, the problem seems to be on the CPU side. I created a sample code based on your code.
I was able to reproduce your results.
First I switched all instances of sinf, cosf, and sqrtf to their corresponding double versions. This made no difference in the results.
Next I included a typedef so I could easily switch the precision from float to double and back, replacing every relevant instance of float in the code with mytype which is my typedef.
When I run the code with typedef of float and a data size of 4096 I get these results:
GPU average = 108.294922
CPU average = 109.925285
When I run the code with typedef of double and a data size of 4096 I get these results:
GPU average = 108.294903
CPU average = 108.294903
When I run the code with typedef of float and a data size of 8192 I get these results:
GPU average = 153.447327
CPU average = 131.473526
When I run the code with typedef of double and a data size of 8192 I get these results:
GPU average = 153.447380
CPU average = 153.447380
There are at least 2 observations:
The GPU results don't vary between float and double, except in the 5th decimal place
The CPU results vary by 1-20% or so between float and double, but when double is selected, they line up exactly (to the 6th decimal place, anyway) with the GPU results.
Based on this, I believe the CPU is providing the variable, questionable behavior.
Here's my code for reference:
#include <stdio.h>
#include <curand.h>
#include <curand_kernel.h>
#define DSIZE 8192
#define BlockSize 32
#define pi 3.14159f
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
typedef double mytype;
__global__ void InitRNG ( curandState * state, const int seed ){
int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
curand_init (seed, tIdx, 0, &state[tIdx]);
}
__global__
void SortPoints(mytype* X, mytype* Y,const int N, curandState *state){
mytype rdmn1, rdmn2;
unsigned int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
mytype range;
if(tIdx < N){
rdmn1 = curand_uniform(&state[tIdx]);
rdmn2 = curand_uniform(&state[tIdx]);
range = sqrt(0.25e0f*N*rdmn1);
X[tIdx] = range*cos(2.0e0f*pi*rdmn2);
Y[tIdx] = range*sin(2.0e0f*pi*rdmn2);
}
}
__device__
mytype ReduceSum2(mytype In){
__shared__ mytype data[BlockSize];
unsigned int tIdx = threadIdx.x;
data[tIdx] = In;
__syncthreads();
for(unsigned int i = blockDim.x/2; i > 0; i >>= 1){
if(tIdx < i){
data[tIdx] += data[tIdx + i];
}
__syncthreads();
}
return data[0];
}
__global__
void AvgDistance(mytype *X, mytype *Y, mytype *Avg, const int N){
int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
int bIdx = blockIdx.x;
mytype x , y;
mytype d = 0.0f;
if(tIdx < N){
for(int i = tIdx + 1; i < N ; i++){
x = X[tIdx] - X[i];
y = Y[tIdx] - Y[i];
d += 1.0e0f/(sqrt(x*x + y*y));
}
__syncthreads();
Avg[bIdx] = ReduceSum2(d);
}
}
mytype cpu_avg(const mytype *rx, const mytype *ry, const int size){
mytype average = 0.0f;
for(int i = 0; i < size; i++){
for(int j = i + 1; j < size; j++){
average += 1.0e0f/sqrt((rx[i]-rx[j])*(rx[i]-rx[j]) + (ry[i]-ry[j])*(ry[i]-ry[j]));
}
}
average = average/(mytype)size;
return average;
}
int main() {
int Np = DSIZE;
mytype *rx, *ry, *d_rx, *d_ry, *d_Avg, *Avg;
curandState *d_state;
int seed = 1;
dim3 threads(BlockSize,BlockSize);
dim3 blocks((int)ceilf(Np/(float)threads.x),(int)ceilf(Np/(float)threads.y));
printf("number of blocks = %d\n", blocks.x);
printf("number of threads= %d\n", threads.x);
rx = (mytype *)malloc(DSIZE*sizeof(mytype));
if (rx == 0) {printf("malloc fail\n"); return 1;}
ry = (mytype *)malloc(DSIZE*sizeof(mytype));
if (ry == 0) {printf("malloc fail\n"); return 1;}
cudaMalloc((void**)&d_rx, DSIZE * sizeof(mytype));
cudaMalloc((void**)&d_ry, DSIZE * sizeof(mytype));
cudaMalloc((void**)&d_Avg, blocks.x * sizeof(mytype));
cudaMalloc((void**)&d_state, DSIZE * sizeof(curandState));
cudaCheckErrors("cudamalloc");
InitRNG<<<blocks.x,threads.x>>>(d_state,seed);
SortPoints<<<blocks.x,threads.x>>>(d_rx,d_ry,Np,d_state);
AvgDistance<<<blocks.x,threads.x,threads.x*sizeof(mytype)>>>(d_rx,d_ry,d_Avg,Np);
cudaCheckErrors("kernels");
Avg = new mytype[blocks.x];
cudaMemcpy(Avg,d_Avg,blocks.x*sizeof(mytype),cudaMemcpyDeviceToHost);
cudaMemcpy(rx, d_rx, DSIZE*sizeof(mytype),cudaMemcpyDeviceToHost);
cudaMemcpy(ry, d_ry, DSIZE*sizeof(mytype),cudaMemcpyDeviceToHost);
cudaCheckErrors("cudamemcpy");
mytype average = 0;
for(int i = 0; i < blocks.x; i++){
average += Avg[i];
}
average = average/(mytype)Np;
printf("GPU average = %f\n", average);
average = cpu_avg(rx, ry, DSIZE);
printf("CPU average = %f\n", average);
return 0;
}
I am running on RHEL 5.5, CUDA 5.0, Intel Xeon X5560
compiled with:
nvcc -O3 -arch=sm_20 -lcurand -lm -o t93 t93.cu
EDIT:
After observing that the variability was on the CPU side, I found that I could eliminate most of the CPU variability by modifying your CPU averaging code like this:
mytype cpu_avg(const mytype *rx, const mytype *ry, const int size){
mytype average = 0.0f;
mytype temp = 0.0f;
for(int i = 0; i < size; i++){
for(int j = i + 1; j < size; j++){
temp += 1.0e0f/sqrt((rx[i]-rx[j])*(rx[i]-rx[j]) + (ry[i]-ry[j])*(ry[i]-ry[j]));
}
average += temp/(mytype)size;
temp = 0.0f;
}
return average;
}
So I would say there's a problem with intermediate results on the CPU side. It's interesting that it doesn't show up on the GPU result. I suspect the reason for this is that the final summation of GPU averages is done on the CPU (therefore each individual GPU block result is scaled down by the size, e.g. 8192), and these may have an intermediate precision that is sufficient to survive until the final division. If you inlined the CPU average calculation, you may observe something different again.
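As a small host-side illustration of that intermediate-precision point (my own example, not part of the original code): a long single-precision accumulation eventually stops absorbing new terms, while double precision does not.
#include <cstdio>
int main() {
    const int n = 1 << 25;      // ~33.5 million terms
    float  fs = 0.0f;
    double ds = 0.0;
    for (int i = 0; i < n; i++) { fs += 1.0f; ds += 1.0; }
    printf("float sum  = %f\n", fs);   // stalls at 16777216 (2^24), where 1.0f falls below 1 ulp
    printf("double sum = %f\n", ds);   // 33554432, as expected
    return 0;
}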
Hello
I'm trying to write a CUDA kernel to perform the following piece of code.
for (n = 0; n < (total-1); n++)
{
a = values[n];
for ( i = n+1; i < total ; i++)
{
b = values[i] - a;
c = b*b;
if( c < 10)
newvalues[i] = c;
}
}
This is what I have currently, but it does not seem to be giving the correct results. Does anyone know what I'm doing wrong? Cheers
__global__ void calc(int total, float *values, float *newvalues){
float a,b,c;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x){
a = values[n];
for(int i = n+1; i < total; i++){
b = values[i] - a;
c = b*b;
if( c < 10)
newvalues[i] = c;
}
}
}
Realize this problem in 2D and launch your kernel with 2D thread blocks. The total number of threads in each of the x and y dimensions should equal total. The kernel code should look like this:
__global__ void calc(float *values, float *newvalues, int total){
float a,b,c;
int n= blockIdx.y * blockDim.y + threadIdx.y;
int i= blockIdx.x * blockDim.x + threadIdx.x;
if (n>=total || i>=total)
return;
a = values[n];
b = values[i] - a;
c = b*b;
if( c < 10)
newvalues[i] = c;
// I don't know your problem statement but i think it should be like: newvalues[n*total+i] = c;
}
Update:
This is how you should call the kernel
dim3 block(16,16);
dim3 grid ( (total+15)/16, (total+15)/16 );
calc<<<grid,block>>>(values, newvalues, total);
Also make sure you add this line in kernel (see updated kernel)
if (n>=total || i>=total)
return;
Update 2:
fixed blockIdy.y; the correct identifier is blockIdx.y
I'll probably be way wrong but the n < (total-1) check in
for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x)
seems different from the original version.
Why don't you just remove the outer loop and start the kernel with as many threads as you need for this loop? It's a bit weird to have a loop that depends on your block ID; normally you try to avoid such loops.
Secondly, it seems to me that newvalues[i] can be overwritten by different threads.
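A minimal sketch of that suggestion (my reading of it, not the original poster's code): one thread per outer-loop index n and no block-dependent loop. Note that the possible overwriting of newvalues[i] by different threads, mentioned above, is still present.
__global__ void calc(int total, const float *values, float *newvalues) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per n
    if (n >= total - 1) return;
    float a = values[n];
    for (int i = n + 1; i < total; i++) {
        float b = values[i] - a;
        float c = b * b;
        if (c < 10)
            newvalues[i] = c;   // different threads may still write the same i
    }
}
// launched with enough threads to cover total-1 values of n, e.g.
// (d_values and d_newvalues are whatever your device pointers are named):
// int nTPB = 256;
// calc<<<(total - 1 + nTPB - 1) / nTPB, nTPB>>>(total, d_values, d_newvalues);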