I am trying to implement a box filter in C/CUDA, starting by implementing a matrix averaging problem in CUDA first. When I run the following code with the lines inside the for loops uncommented, I get a certain output. But when I comment those lines out, I get the same output again!
if(tx==0)
    for(int i=1;i<=radius;i++)
    {
        //sharedTile[radius+ty][radius-i] = 6666.0;
    }

if(tx==(Dx-1))
    for(int i=0;i<radius;i++)
    {
        //sharedTile[radius+ty][radius+Dx+i] = 7777;
    }

if(ty==0)
    for(int i=1;i<=radius;i++)
    {
        //sharedTile[radius-i][radius+tx]= 8888;
    }

if(ty==(Dy-1))
    for(int i=0;i<radius;i++)
    {
        //sharedTile[radius+Dy+i][radius+tx] = 9999;
    }

if((tx==0)&&(ty==0))
    for(int i=globalRow,l=0;i<HostPaddedRow,l<radius;i++,l++)
    {
        for(int j=globalCol,m=0;j<HostPaddedCol,m<radius;j++,m++)
        {
            //sharedTile[l][m]=8866;
        }
    }

if((tx==(Dx-1))&&(ty==(Dx-1)))
    for(int i=(HostPaddedRow+1),l=(radius+Dx);i<(HostPaddedRow+1+radius),l<(TILE+2*radius);i++,l++)
    {
        for(int j=HostPaddedCol,m=(radius+Dx);j<(HostPaddedCol+radius),m<(TILE+2*radius);j++,m++)
        {
            //sharedTile[l][m]=7799.0;
        }
    }

if((tx==(Dx-1))&&(ty==0))
    for(int i=(globalRow),l=0;i<HostPaddedRow,l<radius;i++,l++)
    {
        for(int j=(HostPaddedCol+1),m=(radius+Dx);j<(HostPaddedCol+1+radius),m<(TILE+2*radius);j++,m++)
        {
            //sharedTile[l][m]=9966;
        }
    }

if((tx==0)&&(ty==(Dy-1)))
    for(int i=(HostPaddedRow+1),l=(radius+Dy);i<(HostPaddedRow+1+radius),l<(TILE+2*radius);i++,l++)
    {
        for(int j=globalCol,m=0;j<HostPaddedCol,m<radius;j++,m++)
        {
            //sharedTile[l][m]=0.0;
        }
    }
__syncthreads();
You can ignore the for-loop conditions; they are irrelevant here.
My basic problem and question is: why am I getting the same values even after commenting those lines out? I tried making some modifications in my main program and in the kernel as well. I also introduced deliberate errors, removed them, recompiled and executed the same code, and I still get the same values. Is there any way to clear cache memory in CUDA?
I am using Nsight + Red Hat + CUDA 5.5.
Thanks in advance.
why am I getting the same values even after commenting those lines?
It seems that sharedTile is pointing to the same piece of memory across multiple consecutive runs, which is absolutely normal. The commented-out code therefore does not "generate" anything; your pointer is simply pointing to the same memory, which was never flushed.
Is there any way to clear cache memory in CUDA
I believe you are talking about clearing shared memory? If so, you can use an approach analogous to the one described here. Instead of using cudaMemset in host code, you zero out your shared memory from inside the kernel. The simplest approach is to place the following code at the beginning of the kernel that declares sharedTile (this is for one-dimensional thread blocks and a one-dimensional shared memory array):
__global__ void your_kernel(int count) {
    extern __shared__ float sharedTile[];   // dynamically sized shared memory array, not a pointer
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedTile[i] = 0.0f;
    __syncthreads();
    // your code here
}
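Since your actual sharedTile appears to be two-dimensional with a 2D thread block, a rough 2D variant would look like the sketch below; TILE and RADIUS are assumed compile-time constants standing in for your own tile width and radius:

#define TILE 16      // assumed value, substitute your own
#define RADIUS 2     // assumed value, substitute your own

__global__ void your_2d_kernel(/* your parameters */)
{
    __shared__ float sharedTile[TILE + 2*RADIUS][TILE + 2*RADIUS];

    // each thread strides over the padded tile and zeroes part of it
    for (int row = threadIdx.y; row < TILE + 2*RADIUS; row += blockDim.y)
        for (int col = threadIdx.x; col < TILE + 2*RADIUS; col += blockDim.x)
            sharedTile[row][col] = 0.0f;
    __syncthreads();

    // your code here
}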
The following approaches do not guarantee clean shared memory, as Robert Crovella pointed out in the comment below:
Or possibly call nvidia-smi with the --gpu-reset parameter.
Yet another solution was offered in another SO thread, which involves unloading and reloading the driver.
I'm trying to create a class that will be allocated on the device. I want the constructor to run on the device so that the whole object, including its fields, is automatically allocated on the device, instead of having to create a host object and then copy it manually to the device.
I'm using Thrust's device_new.
Here is my code:
#include <cstdio>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

using namespace thrust;

class Particle
{
public:
    int* data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
I annotated the constructor with __device__ and used Thrust's device_new, but this doesn't work. Can someone explain why?
Cheers for the help.
I believe the answer lies in the description given here. Someone who knows thrust under the hood will probably come along and indicate whether this is true or not.
Although thrust has changed a lot since 2009, I believe device_new may still be using some form of operation where the object is actually temporarily instantiated on the host, then copied to the device. I believe the size limitation described in the above reference is no longer applicable, however.
I was able to get this to work:
#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 512

using namespace thrust;

class Particle
{
public:
    int data[N];

    __device__ __host__ Particle()
    {
        // data = new int[10];
        for (int i=0; i<N; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i=0; i<N; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
Interestingly, it gives bogus results if I omit the __host__ decorator on the constructor, suggesting to me that the temporary object copy mechanism is still in place. It also gives bogus results (and cuda-memcheck reports out-of-bounds access errors) if I switch to using the dynamic allocation for data instead of static, also suggesting to me that device_new is using a temporary object creation on the host followed by a copy to the device.
First of all, thanks to Robert Crovella for his input (and previous answers).
So apparently I "overestimated" what device_new can do: I thought it could initialise the object directly on the device, so that any dynamically allocated memory inside would be allocated on the device too.
But it seems like device_new is basically just doing the same as the manual way:
Particle temp;
Particle *d_p;
cudaMalloc(&d_p, sizeof(Particle));
cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);
So it makes a temporary host object and copies it, just as it would be done manually. That means the memory allocated inside the object is allocated on the host, and only the pointer gets copied as part of the object. As a result, you cannot use that memory in a kernel; you would have to copy it to the device manually, and Thrust doesn't seem to do that.
So it's just a cleaner way of creating a temporary host object and copying it, except that you lose the ability to copy the dynamically allocated memory inside, since you don't have access to that temporary variable.
I hope that in the future there will be a method or a feature in CUDA that lets you initialise the object directly on the device, so that any dynamically allocated data in the constructor (or elsewhere) is allocated on the device too, instead of the tedious way of copying every piece of memory manually.
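For what it's worth, one pattern that already achieves this (my own sketch, not something from this thread) is to allocate raw storage with cudaMalloc and then run the __device__ constructor from inside a kernel via placement new; the constructor's new int[10] then draws from the device heap (compute capability 2.0+):

#include <cstdio>
#include <new>      // placement new

class Particle
{
public:
    int* data;

    __device__ Particle()
    {
        data = new int[10];          // device-side heap allocation
        for (int i = 0; i < 10; i++)
            data[i] = i * 2;
    }
};

// run the __device__ constructor on the device via placement new
__global__ void construct(Particle* p)
{
    new (p) Particle();
}

__global__ void test(Particle* p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    Particle* d_p;
    cudaMalloc((void**)&d_p, sizeof(Particle));  // raw storage only, no constructor call
    construct<<<1,1>>>(d_p);                     // constructor (and its new int[10]) runs on the device
    test<<<1,1>>>(d_p);
    cudaDeviceSynchronize();
    printf("Done!\n");
    cudaFree(d_p);   // note: the device-heap array is leaked unless a destructor kernel frees it
    return 0;
}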
Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;

    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N*sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch will immediately return to the host code before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block host execution. So it's no surprise that you have to wait a bit or else use a barrier (like cudaDeviceSynchronize()) to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device,
some function calls are asynchronous: Control is returned to the host
thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional, of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. If your kernel is producing data that you will immediately copy back to the host, you don't need a separate barrier: the cudaMemcpy call after the kernel will wait until the kernel has completed before it begins its copy operation.
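To make that implicit synchronization concrete, here is a minimal, self-contained sketch (my own example, using ordinary device memory rather than the mapped memory from the question):

#include <stdio.h>

__global__ void negate(int *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= -1;
}

int main(void)
{
    const int N = 8;
    int h_data[N], *d_data;

    for (int i = 0; i < N; i++) h_data[i] = i;

    cudaMalloc((void **)&d_data, N * sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    negate<<<1, N>>>(d_data, N);   // asynchronous launch, returns immediately
    // no cudaDeviceSynchronize() needed: the blocking copy below queues behind
    // the kernel in the default stream and waits for it to finish
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) printf("%i ", h_data[i]);   // prints the negated values
    printf("\n");

    cudaFree(d_data);
    return 0;
}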
I guess to answer your question: you want kernel launches to be synchronous without even having to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?). It is possible to do this:
"Programmers can globally disable asynchronous kernel launches for all
CUDA applications running on a system by setting the
CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
provided for debugging purposes only and should never be used as a way
to make production software run reliably. "
If you want this synchronous behavior, it's better just to use the barriers (or depend on another subsequent cuda call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set. So it's really not a good idea.
A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working example (with C++ host code; I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>

__device__ unsigned int process(int & val)
{
    return (++val < 10);
}

template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    // Do calculations then check for change/convergence
    // and set tchanged to be !=0 if required
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Simple blockwise reduction using voting primitives:
    // increments kchanged if any thread in the block
    // returned tchanged != 0
    tchanged = __any(tchanged != 0);
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for(int i=0; i<nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}
int main(void)
{
    const int N = 2048;
    const int min = 5, max = 15;

    std::vector<int> data(N);
    for(int i=0; i<N; i++) {
        data[i] = min + (std::rand() % (int)(max - min + 1));
    }

    int* _data;
    size_t datasz = sizeof(int) * (size_t)N;
    cudaMalloc<int>(&_data, datasz);
    cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);

    unsigned int *kchanged, *_kchanged;
    cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);

    const int nwarps = 4;
    dim3 blcksz(32*nwarps), grdsz(16);

    // Loop while the kernel signals it needs to run again
    do {
        *kchanged = 0;
        kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
        cudaDeviceSynchronize();
    } while (*kchanged != 0);

    cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    int minval = *std::min_element(data.begin(), data.end());
    assert(minval == 10);

    return 0;
}
Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each thread's processing, it participates in a warp vote, after which one thread from each warp loads the vote result into shared memory. One thread reduces the warp results and then atomically updates the kchanged value. The host thread waits until the device is finished and can then read the result directly from the mapped host variable.
You should be able to adapt this to whatever your application requires.
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global__ void iter_kernel(float *matrix){
    ...
    if (new_val[i] != matrix[i]){
        matrix[i] = new_val[i];
        flag = 1;
    }
    ...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads, in separate blocks or even separate grids, are writing the flag value, as long as the only thing they ever do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
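As a rough illustration of the per-block shared flag variant just described (my own sketch; new_val is a placeholder for however the updated values are produced):

__device__ int flag;   // the same global flag declared earlier

__global__ void iter_kernel(float *matrix, const float *new_val, int n)
{
    __shared__ int block_flag;
    if (threadIdx.x == 0) block_flag = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && new_val[i] != matrix[i]) {
        matrix[i] = new_val[i];
        block_flag = 1;            // benign race: every writer stores the same value
    }
    __syncthreads();

    // at most one global write per block
    if (threadIdx.x == 0 && block_flag)
        flag = 1;
}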
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".
I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}
Currently I use a work-around: I first launch a CUDA kernel (in which it is easy to access neighbors) to mark the elements appropriately, and after that I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
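A rough sketch of that work-around (my reconstruction, not the actual code): a marking kernel handles the neighbor checks, and copy_if with a stencil then compacts the indices of the marked elements.

__global__ void mark_kernel(const float *x, int *marks, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    int idx = row * width + col;
    bool mark = x[idx] < 0.0f;
    if (mark) {
        // neighbor checks, guarded against the image border
        if (col > 0         && x[idx - 1]     > 1.0f) mark = false;
        if (col < width - 1 && x[idx + 1]     > 1.0f) mark = false;
        if (row > 0         && x[idx - width] > 1.0f) mark = false;
        // ... remaining neighbors ...
    }
    marks[idx] = mark ? 1 : 0;
}

// host side, roughly:
// thrust::copy_if(thrust::counting_iterator<int>(0),
//                 thrust::counting_iterator<int>(width * height),
//                 d_marks.begin(),             // stencil
//                 d_indices.begin(),
//                 thrust::identity<int>());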
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix this?
struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
@phoad: you're right about the shared memory; it struck me after I had already posted my reply, and I subsequently figured the cache would probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared memory or relying on the cache will probably have a negligible impact on performance.
Tuples only support 10 values, so that would mean I would need tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results, and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>

struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;

    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.
So, I'm trying to write some code that utilizes Nvidia's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I am trying to move a large amount of data onto the device.
As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance.
So, I have device functions that want to access a device allocated array.
Ideally, I could do something like:
__device__ float* global_data;

main()
{
    cudaMalloc(global_data);
    kernel1<<<blah>>>(blah); //access global data
    kernel2<<<blah>>>(blah); //access global data again
}
However, I haven't figured out how to create a dynamic array. I figured out a work-around by declaring the array as follows:
__device__ float global_data[REALLY_LARGE_NUMBER];
And while that doesn't require a cudaMalloc call, I would prefer the dynamic allocation approach.
Something like this should probably work.
#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

#define NDEBUG
#define CUT_CHECK_ERROR(errorMessage) do {                                    \
        cudaThreadSynchronize();                                              \
        cudaError_t err = cudaGetLastError();                                 \
        if( cudaSuccess != err) {                                             \
            fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
                    errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) ); \
            exit(EXIT_FAILURE);                                               \
        } } while (0)

__device__ float *devPtr;

__global__
void kernel1(float *some_neat_data)
{
    devPtr = some_neat_data;
}

__global__
void kernel2(void)
{
    devPtr[threadIdx.x] *= .3f;
}

int main(int argc, char *argv[])
{
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    CUT_CHECK_ERROR("kernel1");

    kernel2<<<1,128>>>();
    CUT_CHECK_ERROR("kernel2");

    return 0;
}
Give it a whirl.
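As an aside (my own sketch, not part of the answer above): instead of a setup kernel like kernel1, you can also point the __device__ pointer at the allocation from the host with cudaMemcpyToSymbol, copying the pointer value itself into the symbol. Reusing devPtr and kernel2 from the snippet above:

float *otherDevPtr;
cudaMalloc((void**)&otherDevPtr, 256 * sizeof(float));
cudaMemset(otherDevPtr, 0, 256 * sizeof(float));

// copy the *value* of the device pointer into the __device__ symbol devPtr
// (older CUDA versions also accepted the symbol name as a string, "devPtr")
cudaMemcpyToSymbol(devPtr, &otherDevPtr, sizeof(otherDevPtr));

kernel2<<<1,128>>>();   // kernel2 dereferences devPtr, which now points at the allocation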
Spend some time focusing on the copious documentation offered by NVIDIA.
From the Programming Guide:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:
__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}
So now you can invoke them like so:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);
As this data is used in numerous functions, I would like it to be global.
There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
EDIT:
Ok, the fundamental problem is this: your kernels can only access device memory and the only global-scope pointers that they can use are GPU ones. When calling a kernel from your CPU, behind the scenes what happens is that the pointers and primitives get copied into GPU registers and/or shared memory before the kernel gets executed.
So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the background, consider that a different approach might be the Right Thing.
#include <algorithm>

__constant__ float devPtr[1024];

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}

int main(int argc, char *argv[])
{
    float some_data[256];
    for (int i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)
    {
        some_data[i] = i * 2;
    }
    cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr) ));

    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    kernel2<<<1,128>>>(otherDevPtr);

    return 0;
}
Don't forget '--host-compilation=c++' for this example.
I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple global function similar to kernel1.
The good news is that it does work :)
However, I think it confuses the compiler as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.
Anyway, for the record: I have looked at many of the examples, and I did run through the NVIDIA exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an SDK example that does dynamic global device memory allocation, I would still like to know about it.
Erm, moving devPtr to global scope was exactly my problem.
I have an implementation that does exactly that, with the two kernels having a pointer to the data passed in. I explicitly don't want to pass in those pointers.
I have read the documentation fairly closely and hit up the NVIDIA forums (and Google-searched for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (I have tried several that compile and then fail in new and interesting ways).
Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.
As this data is used in numerous functions, I would like it to be global.
There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
What if the kernel operates on a large const structure consisting of arrays? Using so-called constant memory is not an option, because it's very limited in size... so then you have to put it in global memory?
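Not an answer from the thread, but for what it's worth: a large read-only structure that doesn't fit in constant memory can simply live in ordinary global memory; you allocate it once, copy it up, and pass a const pointer to the kernels that need it. A minimal sketch (all names and sizes here are my own, purely for illustration):

#include <stdio.h>

struct BigTable {
    float coeffs[4096];
    int   lookup[4096];
};

__global__ void use_table(const BigTable *table, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table->coeffs[table->lookup[i]];
}

int main(void)
{
    const int n = 256;
    static BigTable h_table;               // static: too large for the stack on some systems
    for (int i = 0; i < 4096; i++) {
        h_table.coeffs[i] = 0.5f * i;
        h_table.lookup[i] = i;
    }

    BigTable *d_table;
    float *d_out;
    cudaMalloc((void**)&d_table, sizeof(BigTable));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_table, &h_table, sizeof(BigTable), cudaMemcpyHostToDevice);

    use_table<<<(n + 127) / 128, 128>>>(d_table, d_out, n);

    float h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[10] = %f\n", h_out[10]);   // expect 5.0

    cudaFree(d_table);
    cudaFree(d_out);
    return 0;
}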