Interpret CUDA profiler log file

Here is the log file from running the CUDA profiler (nvprof) on some code that mixes Thrust, cuBLAS, and cuRAND. The first line is a kernel I wrote, so no problem there. But I'm not sure how to interpret the 2nd to 5th lines, which take up substantial run time.
> Time(%) Time(s) Calls Avg(ms) Min(ms) Max(ms) Name
>
> 28.12 6.82 24,543.00 0.28 0.01 0.64 dev_update_dW1(doub....)
> 23.78 5.77 12,272.00 0.47 0.46 0.49 void thrust::system::cud....
> 14.32 3.47 12,272.00 0.28 0.28 0.29 void thrust::system::cud....
> 10.82 2.62 12,272.00 0.21 0.21 0.22 void thrust::system::cud....
> 4.93 1.20 24,544.00 0.05 0.05 0.05 void thrust::system::cud....
> 3.98 0.96 12,272.00 0.08 0.08 0.09 Act_dAct(double*, long, double*, double*)
The 2nd to 5th lines are printed below in full:
2nd line : void thrust::system::cuda::detail::detail::launch_closure_by_value>, thrust::counting_iterator<__int64, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, __int64, thrust::tuple, thrust::detail::normal_iterator, thrust::system::cuda::detail::tag, thrust::use_default, thrust::use_default>>, thrust::system::detail::generic::detail::max_element_reduction>, thrust::system::cuda::detail::detail::blocked_thread_array>>(double)
3rd line : void thrust::system::cuda::detail::detail::launch_closure_by_value>, thrust::detail::normal_iterator>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, unsigned int, thrust::detail::device_unary_transform_functor, thrust::system::cuda::detail::detail::blocked_thread_array>>(double)
4th line : void thrust::system::cuda::detail::detail::launch_closure_by_value>, double, thrust::use_default>, __int64, double, thrust::detail::normal_iterator>, thrust::plus, thrust::system::cuda::detail::detail::blocked_thread_array>>(exp_functor)
5th line : void thrust::system::cuda::detail::detail::launch_closure_by_value, unsigned int, thrust::detail::device_generate_functor>, thrust::system::cuda::detail::detail::blocked_thread_array>>(double)
EDIT:
I have this function (softmax) that uses max_element and transform_reduce:
void Softmax_ThrustMatrix(thrust::device_vector<double>& mat, int Nrow, int Ncol, thrust::device_vector<double>& Outmat) {
    thrust::device_vector<double> x(Ncol, 0.0);
    thrust::device_vector<double> v(Ncol, 0.0);
    thrust::device_vector<double>::iterator mx;
    double tmp = 0.0, logsm = 0.0;
    dim3 grid, block;
    block.x = 16;
    block.y = 1;
    grid.x = Ncol / block.x + 1;
    grid.y = 1;

    for ( int i = 0; i < Nrow; i++ ) {
        GetRow<<<grid,block>>>(thrust::raw_pointer_cast(&mat[0]), i, Nrow, Ncol, thrust::raw_pointer_cast(&x[0]));
        mx = thrust::max_element(x.begin(), x.end());
        tmp = thrust::transform_reduce(x.begin(), x.end(), exp_functor(*mx), 0.0, thrust::plus<double>());
        logsm = *mx + log(tmp);
        thrust::transform(x.begin(), x.end(), v.begin(), exp_functor(logsm));
        SetRow<<<grid,block>>>(thrust::raw_pointer_cast(&v[0]), i, Nrow, Ncol, thrust::raw_pointer_cast(&Outmat[0]));
    }
}

At a low level, Thrust code is no different from ordinary CUDA code (at least for Thrust code targeting a GPU). Thrust, as a template library, abstracts away many aspects of CUDA at the source code level, but the profiler doesn't distinguish between Thrust code and ordinary CUDA code.
Lines 2-5 represent the profiler data for 4 kernel launches. It's evident from their syntax that they are probably not kernels you wrote; they are emanating from within the depths of Thrust template functions.
"Launch closure" is Thrust-under-the-hood-speak for a kernel launched by Thrust to perform some function. You have 3 Thrust calls in the code you have shown, and you are also showing GetRow and SetRow kernels that you wrote, yet those kernels don't show up anywhere in your profiler output. So it's not evident to me that the profiler output you have shown is related to the code you have shown. You also haven't shown the code that calls the kernels that do appear in your output (dev_update_dW1 and Act_dAct), so the code you have shown is not sufficient for further interpretation of your profiler output.
In any event, lines 2-5 represent CUDA kernels, launched by thrust, that are emanating from thrust calls in your code (somewhere).
Note that it's also possible for Thrust to launch kernels for other non-obvious purposes, such as the instantiation of device vectors.
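For example (a minimal sketch; attributing the 5th profiler line to this mechanism is an inference from its device_generate_functor name, not something the profiler states directly), merely constructing an initialized device vector launches a kernel:

#include <thrust/device_vector.h>

void example(int Ncol) {
    // This single constructor call launches a fill/generate kernel on the
    // device to set all Ncol elements to 0.0 -- plausibly the source of the
    // launch_closure_by_value<... device_generate_functor ...> entry
    // (the 5th line) in the profiler output above.
    thrust::device_vector<double> x(Ncol, 0.0);
}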

Related

error : identifier "atomicOr" is undefined in Thrust program

I have found that the CUDA atomicOr function is not recognized in my Thrust program, compiled in Visual Studio 2012.
I have read that all necessary header files should already be included when the NVIDIA nvcc compiler is invoked. Most postings on this issue state that this must mean the architecture settings are incorrect.
I have tried it with these settings based on other postings:
How to set CUDA compiler flags in Visual Studio 2010?
...as well as using:
http://s1240.photobucket.com/user/fireshot8888/media/cuda_settings.png.html
main.cpp:
#include <thrust/device_vector.h>
#include <cstdlib>
#include <iostream>
#include "cuda.h"

using namespace std;

//Visual C++ compiled main function to launch the GPU calling code
int main(int argc, char *argv[])
{
    //Just some random data hand keyed to make it a complete example for stack overflow while not being too complicated
    float data[] = {1.2, 3.4, 3.4, 3.3, 4.4, 4.4, 4.4, 3.4, 4.4, 4.4,
                    1.2, 3.4, 3.4, 3.3, 4.4, 4.4, 4.4, 3.4, 4.4, 4.4};
    thrust::host_vector<float> h_data(data, data + 20); //Holds the contents of the file as they are read; it will be cleared once we are done with it.

    const int numVars = 10;
    int numBins = 4;
    int rowCount = 2;

    doHistogramGPU(numVars, h_data, numBins, rowCount);

    return 0;
}
cuda.cu:
#include "cuda.h"
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
//I GAVE THIS A TRY BUT IT DID NOT FIX MY ISSUE::::
#include <cuda_runtime.h>
#include <cuda.h>
using namespace std;
//Function to call the kernel
void doHistogramGPU(int numVars, thrust::host_vector<float> h_buffer, int numBins, int numRecords)
{
int dataSize = sizeof(BYTE_UNIT);
int shiftSize = dataSize - 1;
thrust::device_vector<float> d_buffer(h_buffer.begin(), h_buffer.end());
int bitVectorSize = ceil(numRecords * numVars / (float)dataSize);
thrust::device_vector<BYTE_UNIT> d_bitData(bitVectorSize * numBins);
thrust::counting_iterator<int> counter(0);
auto zipInFirst = thrust::make_zip_iterator(thrust::make_tuple(d_buffer.begin(), counter));
auto zipInLast = thrust::make_zip_iterator(thrust::make_tuple(d_buffer.end(), counter + d_buffer.size()));
float minValues[] = {579.8, 72.16, 0.000385, 7.576e-005, 6.954e-005, 0, 0, 2.602e-012, 1.946e-013, 7.393e-015};
float maxValues[] = {1053, 22150, 0.7599, 0.7596, 0.24, 0.2398, 0.1623, 1.167e-007, 4.518e-006, 5.322e-008};
//Get things loaded onto the device then call the kernel
thrust::device_vector<float> d_minValues(minValues, minValues+10);
thrust::device_vector<float> d_maxValues(maxValues, maxValues+10);
thrust::device_ptr<float> minDevPtr = &d_minValues[0];
thrust::device_ptr<float> maxDevPtr = &d_maxValues[0];
thrust::device_ptr<BYTE_UNIT> dataDevPtr = &d_bitData[0];
//Invoke the Thrust Kernel
thrust::for_each(zipInFirst, zipInLast, BinFinder(thrust::raw_pointer_cast(dataDevPtr), thrust::raw_pointer_cast(minDevPtr), thrust::raw_pointer_cast(maxDevPtr), numVars, numBins, numRecords));
cout << endl;
return;
}
cuda.h:
#ifndef CUDA_H
#define CUDA_H

#include <thrust/device_vector.h>
#include <iostream>

//I tried these here, too...
#include <cuda_runtime.h>
#include <cuda.h>

using namespace std;

typedef long BYTE_UNIT; //32 bit storage

void doHistogramGPU(int numvars, thrust::host_vector<float> h_buffer, int numBins, int numRecords);

struct BinFinder
{
    BYTE_UNIT * data;
    float * rawMinVector;
    float * rawMaxVector;
    int numVars;
    int numBins;
    int numRecords;

    BinFinder(BYTE_UNIT * data, float * rawMinVector, float * rawMaxVector, int numVars, int numBins, int numRecords)
    {
        this->data = data;
        this->rawMinVector = rawMinVector;
        this->rawMaxVector = rawMaxVector;
        this->numVars = numVars;
        this->numBins = numBins;
        this->numRecords = numRecords;
    }

    //This kernel converts the multidimensional bin representation to a single dimensional representation
    template <typename Tuple>
    __device__ void operator()( Tuple param )
    {
        int dataSize = sizeof(BYTE_UNIT);
        int shiftSize = dataSize - 1;
        int bitVectorSize = ceil(numRecords * numVars / float(dataSize));

        float value = thrust::get<0>(param);
        int id = thrust::get<1>(param);

        //Look up the min and max values for this data column using the index
        float min = rawMinVector[id % numVars];
        float max = rawMaxVector[id % numVars];

        //Calculate the bin id
        float percentage = (value - min) / float(max - min);
        char bin = percentage * numBins;
        if (bin == numBins)
        {
            bin--;
        }

        //////////////////////////////////////////////////////////////
        //Set a 1 in the appropriate bitvector for the calculated bin
        //////////////////////////////////////////////////////////////

        //What I originally tried to do that appeared to have generated race conditions (using data from a file):
        //data[bin * bitVectorSize + id / dataSize] |= (1 << (shiftSize - id % dataSize));

        //What I've been trying to do now that generates a compilation error:
        atomicOr(data + (bin * bitVectorSize + id / dataSize), 1 << (shiftSize - id % dataSize)); //<----THIS DOESN'T COMPILE!!!!!!!!!
    }
};

#endif
nvcc command for cuda.cu (which includes my cuda.h file):
"C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.0/bin/nvcc.exe" "C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu" -c -o "C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/build/CMakeFiles/CudaLib.dir//Debug/CudaLib_generated_cuda.cu.obj" -ccbin "C:/Program Files (x86)/Microsoft Visual Studio 11.0/VC/bin" -m64 -Xcompiler ,\"/DWIN32\",\"/D_WINDOWS\",\"/W3\",\"/GR\",\"/EHsc\",\"/D_DEBUG\",\"/MDd\",\"/Zi\",\"/Ob0\",\"/Od\",\"/RTC1\" -DNVCC "-IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.0/include" "-IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.0/include"
Full error output by nvcc:
1>nvcc : warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(107): warning : variable "minValues" was declared but never referenced
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(108): warning : variable "maxValues" was declared but never referenced
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(462): warning : variable "shiftSize" was declared but never referenced
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(602): warning : initial value of reference to non-const must be an lvalue
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(618): warning : dynamic initialization in unreachable code
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(522): warning : variable "shiftSize" was declared but never referenced
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(975): warning : initial value of reference to non-const must be an lvalue
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(993): warning : initial value of reference to non-const must be an lvalue
1>
1>C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(1022): warning : variable "shiftSize" was declared but never referenced
1>
1>c:\users\datahead8888\documents\visual studio 2012\projects\thrust-space-data\src\cuda.h(188): error : identifier "atomicOr" is undefined
1> detected during:
1> instantiation of "void BinFinder::operator()(Tuple) [with Tuple=thrust::detail::tuple_of_iterator_references]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/detail/function.h(119): here
1> instantiation of "Result thrust::detail::device_function::operator()(const Argument &) const [with Function=BinFinder, Result=void, Argument=thrust::detail::tuple_of_iterator_references, int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/system/cuda/detail/for_each.inl(82): here
1> instantiation of "thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure::result_type thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure::operator()() [with RandomAccessIterator=thrust::zip_iterator>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, Size=unsigned int, UnaryFunction=BinFinder, Context=thrust::system::cuda::detail::detail::blocked_thread_array]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/system/cuda/detail/detail/launch_closure.inl(49): here
1> instantiation of "void thrust::system::cuda::detail::detail::launch_closure_by_value(Closure) [with Closure=thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, unsigned int, BinFinder, thrust::system::cuda::detail::detail::blocked_thread_array>]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/system/cuda/detail/detail/launch_closure.inl(77): here
1> instantiation of "thrust::system::cuda::detail::detail::closure_launcher_base::launch_function_t thrust::system::cuda::detail::detail::closure_launcher_base::get_launch_function() [with Closure=thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, unsigned int, BinFinder, thrust::system::cuda::detail::detail::blocked_thread_array>, launch_by_value=true]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/system/cuda/detail/detail/launch_closure.inl(185): here
1> [ 2 instantiation contexts not shown ]
1> instantiation of "thrust::tuple thrust::system::cuda::detail::for_each_n_detail::configure_launch(Size) [with Closure=thrust::system::cuda::detail::for_each_n_detail::for_each_n_closure>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, unsigned int, BinFinder, thrust::system::cuda::detail::detail::blocked_thread_array>, Size=long long]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/system/cuda/detail/for_each.inl(163): here
1> instantiation of "RandomAccessIterator thrust::system::cuda::detail::for_each_n(thrust::system::cuda::detail::execution_policy &, RandomAccessIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cuda::detail::tag, RandomAccessIterator=thrust::zip_iterator>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, Size=long long, UnaryFunction=BinFinder]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/system/cuda/detail/for_each.inl(191): here
1> instantiation of "RandomAccessIterator thrust::system::cuda::detail::for_each(thrust::system::cuda::detail::execution_policy &, RandomAccessIterator, RandomAccessIterator, UnaryFunction) [with DerivedPolicy=thrust::system::cuda::detail::tag, RandomAccessIterator=thrust::zip_iterator>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, UnaryFunction=BinFinder]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/detail/for_each.inl(43): here
1> instantiation of "InputIterator thrust::for_each(const thrust::detail::execution_policy_base &, InputIterator, InputIterator, UnaryFunction) [with DerivedPolicy=thrust::system::cuda::detail::tag, InputIterator=thrust::zip_iterator>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, UnaryFunction=BinFinder]"
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include\thrust/detail/for_each.inl(57): here
1> instantiation of "InputIterator thrust::for_each(InputIterator, InputIterator, UnaryFunction) [with InputIterator=thrust::zip_iterator>, thrust::counting_iterator, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, UnaryFunction=BinFinder]"
1> C:/Users/datahead8888/Documents/Visual Studio 2012/Projects/thrust-space-data/src/cuda.cu(597): here
1>
1> 1 error detected in the compilation of "C:/Users/DATAHE~1/AppData/Local/Temp/tmpxft_00001f78_00000000-8_cuda.cpp1.ii".
1> cuda.cu
The reason it is undefined is that you are not specifying the project settings correctly to compile for an architecture (cc 1.1 or higher) that supports atomics.
You will need to modify the compile settings to target an architecture that your GPU supports and that also supports atomics.
Your compile command includes no architecture switches at all, so the default architecture (cc 1.0) is being targeted. That architecture does not support atomics, and it is also deprecated in CUDA 6, which is why the compiler issues a warning to let you know you are compiling for a deprecated architecture.
You'll need to study the available questions and documentation to learn how to set the target architecture, and you must be sure not to include cc 1.0, or the compile will fail. (For example, in the question you linked, use the methods discussed in the answers, not in the question; the method described in the question does not work. And read all the answers, noting that the setting can be made both in the project properties and in file-specific places.)
If you're having difficulty getting the settings arranged, you might try opening a CUDA sample project that depends on atomics, e.g. simpleAtomicIntrinsics, removing the existing code from that project, and placing your code in it. You should then pick up from that project the proper settings to use atomics.
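For reference, an explicit architecture switch added to an nvcc command line looks something like the following (sm_20 is only an example; substitute a compute capability your GPU actually supports):

nvcc -gencode arch=compute_20,code=sm_20 -c cuda.cu -o cuda.obj

In Visual Studio, the same setting is typically found under Project Properties -> CUDA C/C++ -> Device -> Code Generation.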

cudaMemcpy is too slow on Tesla C2075

I'm currently working on a server with 2 CUDA-capable GPUs: a Quadro 400 and a Tesla C2075. I made a simple vector addition test program. My problem is that while the Tesla C2075 is supposed to be more powerful than the Quadro 400, it takes more time to do the job. I found that cudaMemcpy takes up most of the execution time, and it works slower on the more powerful GPU.
Here's the source:
void get_matrix(float* arr1, float* arr2, int N1, int N2)
{
    int Nx, Ny;
    int n_blocks, n_threads;
    int dev = 0; // 1
    float time;
    size_t size;
    clock_t start, end;
    float *d_A, *d_B; // device buffers

    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;

    start = clock();
    cudaGetDeviceProperties(&deviceProp, dev);
    Nx = N1;
    Ny = N2;
    n_threads = 256;
    n_blocks = (Nx*Ny + n_threads - 1) / n_threads;
    size = Nx*Ny*sizeof(float);

    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_A, arr1, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, arr2, size, cudaMemcpyHostToDevice);
    vector_add<<<n_blocks, n_threads>>>(d_A, d_B, size);
    cudaMemcpy(arr1, d_A, size, cudaMemcpyDeviceToHost);

    printf("Running device %s \n", deviceProp.name);
    end = clock();
    time = float(end - start) / float(CLOCKS_PER_SEC);
    printf("time = %e\n", time);
}
int main()
{
    int const nx = 20000, ny = nx;
    static float a[nx*ny], b[nx*ny];

    for (int i = 0; i < nx; i++)
    {
        for (int j = 0; j < ny; j++)
        {
            a[j+ny*i] = j + 10*i;
            b[j+ny*i] = -(j + 10*i);
        }
    }

    get_matrix(a, b, nx, ny);
    return 0;
}
The output is:
Running device Quadro 400
time = 1.100000e-01
Running device Tesla C2075
time = 1.050000e+00
And my questions are:
Should I modify the code depending on what GPU I am going to use?
Is there any connection between the number of blocks, threads per block specified in the code and the number of multiprocessors, cores per multiprocessor available on a GPU?
I'm running Linux Open Suse 11.2. The source code is compiled using the nvcc compiler (version 4.2).
Thanks for your help!
Try invoking get_matrix(a,b,nx,ny) twice and take the second timing result. The first call to the CUDA API creates the CUDA context, which often takes a long time.
Please refer to this section in the CUDA C Best Practices Guide for how to determine the block size and grid size.
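A minimal sketch of that suggestion; the second timing then excludes the one-time context-creation cost:

get_matrix(a, b, nx, ny); // first call pays the context-creation cost; discard this timing
get_matrix(a, b, nx, ny); // take this second timing as representative

Alternatively, calling cudaFree(0) once at the start of main() forces context creation before any timed region.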

cuda -- out of memory (threads and blocks issue) --Address is out of bounds

I am using 63 registers per thread, so (with 32768 being the maximum) I can use about 520 threads. I am using 512 threads in this example.
(The parallelism is in the function "computeEvec" inside the __global__ computeEHfields function.)
The problems are:
1) The mem-check error below.
2) When I use numPointsRp>2000 it shows me "out of memory", but (if I am not doing something wrong) I computed the global memory usage and it should be OK.
-------------------------------UPDATED---------------------------
I ran the program with cuda-memcheck and it gives me this (only when numPointsRs>numPointsRp):
========= Invalid global read of size 4
========= at 0x00000428 in computeEHfields
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x4001076e0 is out of bounds
=========
========= Invalid global read of size 4
========= at 0x00000428 in computeEHfields
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x4001076e0 is out of bounds
=========
========= Invalid global read of size 4
========= at 0x00000428 in computeEHfields
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x4001076e0 is out of bounds
ERROR SUMMARY: 160 errors
-----------EDIT----------------------------
Also, sometimes (if I use only threads and not blocks; I haven't tested it with blocks), if for example I have numPointsRs=1000 and numPointsRp=100, then change to numPointsRp=200, and then change back to numPointsRp=100, I do not get the first results again!
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv

# numPointsRs, numPointsRp, kp and eta are assumed to be defined earlier

Rs = np.zeros((numPointsRs,3)).astype(np.float32)
for k in range(numPointsRs):
    Rs[k] = [0, k, 0]

Rp = np.zeros((numPointsRp,3)).astype(np.float32)
for k in range(numPointsRp):
    Rp[k] = [1+k, 0, 0]

#---- Initialization and passing (allocate memory and transfer data) to GPU ----
Rs_gpu = gpuarray.to_gpu(Rs)
Rp_gpu = gpuarray.to_gpu(Rp)

J_gpu = gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu = gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu = gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu = gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu  = gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

mod = SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

#define RowRsSize %(numrs)d
#define RowRpSize %(numrp)d

typedef pycuda::complex<float> cmplx;

extern "C"{
    __device__ void computeEvec(float Rs_mat[][3], int numPointsRs,
                                cmplx J[][3],
                                cmplx M[][3],
                                float *Rp,
                                cmplx kp,
                                cmplx eta,
                                cmplx *Evec,
                                cmplx *Hvec, cmplx *All)
    {
        while (c<numPointsRs){
            ...
            c++;
        }
    }

    __global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                    float *Rp_mat_, int numPointsRp,
                                    cmplx *J_,
                                    cmplx *M_,
                                    cmplx kp,
                                    cmplx eta,
                                    cmplx E[][3],
                                    cmplx H[][3], cmplx *All )
    {
        float Rs_mat[RowRsSize][3];
        float Rp_mat[RowRpSize][3];
        cmplx J[RowRsSize][3];
        cmplx M[RowRsSize][3];

        int k = threadIdx.x + blockIdx.x*blockDim.x;
        while (k < numPointsRp)
        {
            computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
            k += blockDim.x*gridDim.x;
        }
    }
}
""" % {"numrs": numPointsRs, "numrp": numPointsRp}, no_extern_c=1)

func = mod.get_function("computeEHfields")
func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp), J_gpu, M_gpu, np.complex64(kp), np.complex64(eta), Evec_gpu, Hvec_gpu, All_gpu, block=(128,1,1), grid=(200,1))

print(" \n")

#----- get data back from GPU -----
Rs = Rs_gpu.get()
Rp = Rp_gpu.get()
J = J_gpu.get()
M = M_gpu.get()
Evec = Evec_gpu.get()
Hvec = Hvec_gpu.get()
All = All_gpu.get()
--------------------GPU MODEL------------------------------------------------
Device 0: "GeForce GTX 560"
CUDA Driver Version / Runtime Version 4.20 / 4.10
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 0) Multiprocessors x (48) CUDA Cores/MP: 0 CUDA Cores //CUDA Cores 336 => 7 MP and 48 Cores/MP
When I use numPointsRp>2000 it shows me "out of memory"
Now that we have some real code to work with, let's compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200 and compiling with the CUDA 4.2 toolchain, I get:
nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu
ptxas info : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
ptxas info : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]
The key numbers are 57 registers and a 122432-byte stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SMs. That gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) needed just to run your kernel, before you have allocated a single byte of memory for input and output using PyCUDA. On a GPU with 1 GB of memory, it isn't hard to imagine running out. Your kernel has an enormous local memory footprint. Start thinking about ways to reduce it.
As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:
typedef pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__global__
void computeEHfields2(
    float *Rs_mat_, int numPointsRs,
    float *Rp_mat_, int numPointsRp,
    cmplx *J_,
    cmplx *M_,
    cmplx kp,
    cmplx eta,
    cmplx E[][3],
    cmplx H[][3],
    cmplx *All )
{
    // Reinterpret the flat input arrays as arrays of 3-element rows
    fp3 * Rs_mat = (fp3 *)Rs_mat_;
    cp3 * J = (cp3 *)J_;
    cp3 * M = (cp3 *)M_;

    int k = threadIdx.x + blockIdx.x*blockDim.x;
    while (k < numPointsRp)
    {
        fp3 * Rp_mat = (fp3 *)Rp_mat_ + k; // row k of the N x 3 matrix
        computeEvec2( Rs_mat, numPointsRs, J, M, *Rp_mat, kp, eta, E[k], H[k], All );
        k += blockDim.x*gridDim.x;
    }
}
and the main __device__ function it calls to something like this:
__device__ void computeEvec2(
    fp3 Rs_mat[], int numPointsRs,
    cp3 J[],
    cp3 M[],
    fp3 Rp,
    cmplx kp,
    cmplx eta,
    cmplx *Evec,
    cmplx *Hvec,
    cmplx *All)
{
    ....
}
and eliminate every byte of thread local memory without changing the functionality of the computational code at all.
Using R=1000 and then block=(R/2,1,1) and grid=(1,1), everything is OK.
If I try R=10000 with block=(R/20,1,1) and grid=(20,1), then it shows me "out of memory".
I'm not familiar with pycuda and didn't read into your code too deeply. However, you have more blocks and more threads, so the kernel will use more:
local memory (probably the kernel's stack; it's allocated per thread),
shared memory (allocated per block), or
global memory that gets allocated based on grid or gridDim.
You can reduce the stack size by calling
cudaDeviceSetLimit(cudaLimitStackSize, N);
(the code is for the C runtime API, but the pycuda equivalent shouldn't be too hard to find).
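A minimal sketch of that call in the C runtime API (the 16 KB value is purely illustrative, not a recommendation):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Shrink the per-thread stack before any kernel launch, then verify it.
    cudaDeviceSetLimit(cudaLimitStackSize, 16 * 1024);
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("stack size per thread: %zu bytes\n", stackSize);
    return 0;
}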

atomic operation disrupting all kernels

I am running some image processing operations on GPU and I need the histogram of the output.
I have written and tested the processing kernels. I have also tested the histogram kernel separately on samples of the output pictures. They both work fine, but when I put all of them in one loop I get nothing.
This is my histogram kernel:
__global__ void histogram(int n, uchar* color, uchar* mask, int* bucket, int ch, int W, int bin)
{
    unsigned int X = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int Y = blockIdx.y*blockDim.y + threadIdx.y;
    int l = (256 % bin == 0) ? 256/bin : 256/bin + 1;
    int c;
    if (X + Y*W < n && mask[X + Y*W])
    {
        c = color[(X + Y*W)*3] / bin;
        atomicAdd(&bucket[c], 1);
        c = color[(X + Y*W)*3 + 1] / bin;
        atomicAdd(&bucket[c + l], 1);
        c = color[(X + Y*W)*3 + 2] / bin;
        atomicAdd(&bucket[c + l*2], 1);
    }
}
It updates the histogram vectors for red, green, and blue. ('l' is the length of each vector.)
When I comment out the atomicAdds, it again produces the output, but of course not the histogram.
Why don't they work together?
Edit:
This is the loop:
cudaMemcpy(frame_in_gpu,frame_in.data, W*H*3*sizeof(uchar),cudaMemcpyHostToDevice);
cuda_process(frame_in_gpu, frame_out_gpu, W, H, dimGrid,dimBlock);
cuda_histogram(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin, dimg_histogram, dimb_histogram);
Then I copy the output to host memory and write it to a video.
These are C functions that only call their kernels with the dimGrid and dimBlock that are given as inputs. Also:
dim3 dimBlock(32,32);
dim3 dimGrid(W/32,H/32);
dim3 dimb_Histogram(16,16);
dim3 dimg_Histogram(W/16,H/16);
I changed this for the histogram because it worked better. Does it matter?
Edit2:
I am using the -arch=sm_11 option for compilation. I just read about it somewhere. Could anyone tell me how I should choose it?
Perhaps you should try to compile without the -arch=sm_11 flag.
SM 1.1 is the first architecture that supported atomic operations on global memory, while your GPU supports SM 2.0. Hence there is no reason to compile for SM 1.1 unless you need backward compatibility.
One possible issue could be that SM 1.1 does not support atomic operations on 64-bit ints in global memory. So I would suggest you recompile the code without the -arch option, or use -arch=sm_20 if you like.
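For example (a sketch; histogram.cu stands in for your actual source file):

nvcc -arch=sm_20 -o histogram histogram.cu

-arch=sm_20 matches the SM 2.0 capability mentioned above and supports the 32-bit global-memory atomicAdd calls the kernel uses.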

Simple adding of two int's in Cuda, result always the same

I'm starting my journey to learn CUDA. I am playing with some hello-world-type CUDA code, but it's not working, and I'm not sure why.
The code is very simple: take two ints, add them on the GPU, and return the result. But no matter what I change the numbers to, I get the same result. (If math worked that way I would have done a lot better in the subject than I actually did.)
Here's the sample code:
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

extern "C"
void runCudaPart();

// Main cuda function
void runCudaPart() {
    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 1, 4, dev_c );
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );
}
The output seems a bit off: 1 + 4 = -1065287167
I'm working on setting up my environment and just wanted to know if there was a problem with the code otherwise its probably my environment.
Update: I tried to add some code to show the error, but I don't get any error output, and the number changes. (Is it outputting error codes instead of answers? Even if I don't do any work in the kernel other than assigning a variable, I still get similar results.)
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    //*c = a + b;
    *c = 5;
}

extern "C"
void runCudaPart();

// Main cuda function
void runCudaPart() {
    int c;
    int *dev_c;

    cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
    if (err != cudaSuccess) {
        printf("The error is %s", cudaGetErrorString(err));
    }

    add<<<1,1>>>( 1, 4, dev_c );

    cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    if (err2 != cudaSuccess) {
        printf("The error is %s", cudaGetErrorString(err2));
    }

    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );
}
The code appears to be fine, so maybe it's related to my setup. It's been a nightmare to get CUDA installed on OS X Lion, but I thought it worked, since the examples in the SDK seemed fine. The steps I took so far: go to the NVIDIA website and download the latest Mac releases of the driver, toolkit, and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and export PATH=/usr/local/cuda/bin:$PATH. I ran deviceQuery and it passed with the following info about my system:
[deviceQuery] starting...
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED
UPDATE: What's really weird is that even if I remove all the work in the kernel, I still get a result for c. I have reinstalled CUDA and used make on the examples, and all of them pass.
Basically there are two problems here:
You are not compiling the kernel for the correct architecture (gleaned from comments).
Your code contains imperfect error checking which misses the point at which the runtime error occurs, leading to mysterious and unexplained symptoms.
In the runtime API, most context related actions are performed "lazily". When you launch a kernel for the first time, the runtime API will invoke code to intelligently find a suitable CUBIN image from inside the fat binary image emitted by the toolchain for the target hardware and load it into the context. This can also include JIT recompilation of PTX for a backwards compatible architecture, but not the other way around. So if you had a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture. But the reverse doesn't work. So in your example, the runtime API will generate an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for a bit more information).
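As an illustration (choosing sm_12 here is an inference from the deviceQuery output above, which reports compute capability 1.2), compiling explicitly for this device would look something like:

nvcc -arch=sm_12 -c kernel.cu -o kernel.o

where kernel.cu is a placeholder name for the file containing runCudaPart().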
If your code contained error checking like this:
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.
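A reusable way to express the same pattern (a hypothetical helper, not part of the original code) is to wrap every runtime call in a checking macro and check again after each kernel launch:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report any CUDA runtime error with file/line context.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t e = (call);                                    \
        if (e != cudaSuccess) {                                    \
            printf("CUDA error %s at %s:%d\n",                     \
                   cudaGetErrorString(e), __FILE__, __LINE__);     \
        }                                                          \
    } while (0)

__global__ void add(int a, int b, int *c) { *c = a + b; }

int main() {
    int c, *dev_c;
    CUDA_CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
    add<<<1,1>>>(1, 4, dev_c);
    CUDA_CHECK(cudaPeekAtLastError()); // catches kernel load/launch failures
    CUDA_CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("1 + 4 = %d\n", c);
    CUDA_CHECK(cudaFree(dev_c));
    return 0;
}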
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void Addition(int *a, int *b, int *c)
{
    *c = *a + *b;
}

int main()
{
    int a, b, c;
    int *dev_a, *dev_b, *dev_c;
    int size = sizeof(int);

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    a = 5, b = 6;

    cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    Addition<<<1,1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

    // cudaFree takes the device pointer itself, not its address
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    printf("%d\n", c);
    getch();
    return 0;
}
}