How do you get the stream associated with a thrust execution policy? - cuda

I want to be able to get the stream-id which is associated with an execution policy in thrust. I am trying to access this function.
I have tried this :
cudaStream_t stream = 0;
auto policy = thrust::cuda::par.on(stream);
cudaStream_t str = stream(policy);
but I am getting a compilation error :
stream.cu(7): error: expression preceding parentheses of apparent call must have (pointer-to-) function type
Could I get some ideas on how to do this?

"I am trying to access this function." Trying to directly use e.g. things in detail are part of the implementation and may change from one version to the next. To wit: the file you are referring to does not even exist in the the current thrust distributed with CUDA 10.
However, this seems to work for me:
$ cat t354.cu
#include <thrust/execution_policy.h>
#include <iostream>
#include <cstring>
int main(){
cudaStream_t mystream;
cudaStreamCreate(&mystream);
auto policy = thrust::cuda::par.on(mystream);
cudaStream_t str = stream(policy);
for (int i = 0; i < sizeof(cudaStream_t); i++)
if ( *(reinterpret_cast<unsigned char *>(&mystream)+i) != *(reinterpret_cast<unsigned char *>(&str)+i)) {std::cout << "mismatch" << std::endl; return -1;}
std::cout << "match" << std::endl;
}
$ nvcc -std=c++11 -o t354 t354.cu
$ cuda-memcheck ./t354
========= CUDA-MEMCHECK
match
========= ERROR SUMMARY: 0 errors
$

Related

How to prevent thrust::copy_if illegal memory access

In a CUDA C++ code, I am using thrust::copy_if to copy those integer values that are not -1 from the array x to y:
thrust::copy_if(x_ptr, x_ptr + N, y_ptr, is_not_minus_one());
I put the code in a try/catch and I update the array x regularly, and again push numbers that are not -1 to the end of the y until it accessed to the out of the array and returns
CUDA Runtime Error: an illegal memory access was encountered
As I don't know how many values are not -1 in each iteration, I should keep generating and copying numbers in the main array until it becomes full. However, I wanna it to stop once it reaches the end of the array. How can I manage it?
A naive idea would be using another array and then copying if the number of new values does not exceed the size. but it might be inefficient. Any better idea?
Here is one possible method:
use remove_if() instead, (with the opposite condition - remove if -1) on x. Then use (x start and) the returned iterator to copy to your final array (y). Every time you do a copy to y, you will know how much space is left in y, and you can adjust the copy quantity if needed, before doing the copy.
Example (removing 1 values):
$ cat t2088.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/complex.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <iostream>
using mt = int;
using namespace thrust::placeholders;
int main(){
thrust::device_vector<mt> a(5);
thrust::device_vector<mt> c(25);
bool done = false;
int empty_spaces = c.size();
int copy_start_index = 0;
while (!done){
thrust::sequence(a.begin(), a.end());
int num_to_copy = (thrust::remove_if(a.begin(), a.end(), _1 == 1) - a.begin());
int actual_num_to_copy = std::min(num_to_copy, empty_spaces);
if (actual_num_to_copy != num_to_copy) done = true;
thrust::copy_n(a.begin(), actual_num_to_copy, c.begin()+copy_start_index);
copy_start_index += actual_num_to_copy;
empty_spaces -= actual_num_to_copy;
}
std::cout << "output array is full!" << std::endl;
thrust::host_vector<mt> h_c = c;
thrust::copy(h_c.begin(), h_c.end(), std::ostream_iterator<mt>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -o t2088 t2088.cu
$ compute-sanitizer ./t2088
========= COMPUTE-SANITIZER
output array is full!
0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,
========= ERROR SUMMARY: 0 errors
$

What does nvprof output: "No kernels were profiled" mean, and how to fix it

I have recently installed Cuda on my arch-Linux machine through the system's package manager, and I have been trying to test whether or not it is working by running a simple vector addition program.
I simply copy-paste the code from this tutorial (Both the one using one and more kernels) into a file titled cuda_test.cu and run
> nvcc cuda_test.cu -o cuda_test
In either case, the program can run, and I get no errors (both as in the program doesn't crash and the output is that there were no errors). But when I try to run the Cuda profiler on the program:
> sudo nvprof ./cuda_test
I get result:
==3201== NVPROF is profiling process 3201, command: ./cuda_test
Max error: 0
==3201== Profiling application: ./cuda_test
==3201== Profiling result:
No kernels were profiled.
No API activities were profiled.
==3201== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
The latter warning is not my main problem or the topic of my question, my problem is the message saying that No Kernels were profiled and no API activities were profiled.
Does this mean that the program was run entirely on my CPU? or is it an error in nvprof?
I have found a discussion about the same error here, but there the answer was that the wrong version of Cuda was installed, and in my case, the version installed is the latest version installed through the systems package manager (Version 10.1.243-1)
Is there any way I can get either nvprof to display the expected output?
Edit
Trying to adhere to the warning at the end does not solve the problem:
Adding call to cudaProfilerStop() (or cuProfilerStop()), and also adding cudaDeviceReset(); at end as suggested and linking the appropriate library (cuda_profiler_api.h or cudaProfiler.h) and compiling with
> nvcc cuda_test.cu -o cuda_test -lcuda
Yields a program which can still run, but which, when uppon which nvprof is run, returns:
==12558== NVPROF is profiling process 12558, command: ./cuda_test
Max error: 0
==12558== Profiling application: ./cuda_test
==12558== Profiling result:
No kernels were profiled.
No API activities were profiled.
==12558== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
This has not solved the original problem, and has in fact created a new error; the same happens when cudaProfilerStop() is used on its own or alongside cuProfilerStop() and cudaDeviceReset();
The code
The code is, as mentioned copied from a tutorial to test if Cuda is working, though I also have included calls to cudaProfilerStop() and cudaDeviceReset(); for clarity, it is here included:
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
int index = threadIdx.x;
int stride = blockDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
int main(void)
{
int N = 1<<20;
float *x, *y;
cudaProfilerStart();
// Allocate Unified Memory – accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Run kernel on 1M elements on the GPU
add<<<1, 1>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i]-3.0f));
std::cout << "Max error: " << maxError << std::endl;
// Free memory
cudaFree(x);
cudaFree(y);
cudaDeviceReset();
cudaProfilerStop();
return 0;
}
This problem was apparently somewhat well known, after some searching I found this thread about the error-code in the edited version; the solution as discussed there is to call nvprof with the flag --unified-memory-profiling off:
> sudo nvprof --unified-memory-profiling off ./cuda_test
This makes nvprof work as expected-- even without the call to cudaProfileStop.
You can solve the problem by using
sudo nvprof --unified-memory-profiling per-process-device <your program>

Clang fails to throw a std::bad_alloc when allocating objects that would exceed the limit

I am having trouble understanding how clang throws exceptions when I try to allocate an object that would exceed its limit. For instance if I compile and run the following bit of code:
#include <limits>
#include <new>
#include <iostream>
int main(int argc, char** argv) {
typedef unsigned char byte;
byte*gb;
try{
gb=new byte[std::numeric_limits<std::size_t>::max()];
}
catch(const std::bad_alloc&){
std::cout<<"Normal"<<std::endl;
return 0;}
delete[]gb;
std::cout<<"Abnormal"<<std::endl;
return 1;
}
then when I compile using "clang++ -O0 -std=c++11 main.cpp" the result I get is "Normal" as expected, but as soon as I enable optimizations 1 through 3, the program unexpectedly returns "Abnormal".
I am saying unexpectedly, because according to the C++11 standard 5.3.4.7:
When the value of the expression in a noptr-new-declarator is zero, the allocation function is called to
allocate an array with no elements. If the value of that expression is less than zero or such that the size
of the allocated object would exceed the implementation-defined limit, or if the new-initializer is a braced-
init-list for which the number of initializer-clauses exceeds the number of elements to initialize, no storage
is obtained and the new-expression terminates by throwing an exception of a type that would match a
handler (15.3) of type std::bad_array_new_length (18.6.2.2).
[This behavior is observed with both clang 3.5 using libstd++ on linux and clang 3.3 using libc++ on Mac. The same behavior is also observed when the -std=c++11 flag is removed.]
The plot thickens when I compile the same program using gcc 4.8, using the exact same command line options. In that case, the program returns "Normal" for any chosen optimization level.
I cannot find any undefined behavior in the code posted above that would explain why clang would feel free not to throw an exception when code optimizations are enabled. As far as the bug database is concerned, the closest I can find is http://llvm.org/bugs/show_bug.cgi?id=11644 but it seems to be related to the type of exception being thrown rather than a behavior difference between debug and release code.
So it this a bug from Clang? Or am I missing something? Thanks,
It appears that clang eliminates the allocation as the array is unused:
#include <limits>
#include <new>
#include <iostream>
int main(int argc, char** argv)
{
typedef unsigned char byte;
bytes* gb;
const size_t max = std::numeric_limits<std::size_t>::max();
try
{
gb = new bytes[max];
}
catch(const std::bad_alloc&)
{
std::cout << "Normal" << std::endl;
return 0;
}
try
{
gb[0] = 1;
gb[max - 1] = 1;
std::cout << gb[0] << gb[max - 1] << "\n";
}
catch ( ... )
{
std::cout << "Exception on access\n";
}
delete [] gb;
std::cout << "Abnormal" << std::endl;
return 1;
}
This code prints "Normal" with -O0 and -O3, see this demo. That means that in this code, it is actually tried to allocate the memory and it indeed fails, hence we get the exception. Note that if we don't output, clang is still smart enough to even ignore the writes.
It appears that clang++ on Mac OSX does throw bad_alloc, but it also prints an error message from malloc.
Program:
// bad_alloc example
#include <iostream> // std::cout
#include <sstream>
#include <new> // std::bad_alloc
int main(int argc, char *argv[])
{
unsigned long long memSize = 10000;
if (argc < 2)
memSize = 10000;
else {
std::istringstream is(argv[1]); // C++ atoi
is >> memSize;
}
try
{
int* myarray= new int[memSize];
std::cout << "alloc of " << memSize << " succeeded" << std::endl;
}
catch (std::bad_alloc& ba)
{
std::cerr << "bad_alloc caught: " << ba.what() << '\n';
}
std::cerr << "Program exiting normally" << std::endl;
return 0;
}
Mac terminal output:
david#Godel:~/Dropbox/Projects/Miscellaneous$ badalloc
alloc of 10000 succeeded
Program exiting normally
david#Godel:~/Dropbox/Projects/Miscellaneous$ badalloc 1234567891234567890
badalloc(25154,0x7fff7622b310) malloc: *** mach_vm_map(size=4938271564938272768)
failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
bad_alloc caught: std::bad_alloc
Program exiting normally
I also tried the same program using g++ on Windows 7:
C:\Users\David\Dropbox\Projects\Miscellaneous>g++ -o badallocw badalloc.cpp
C:\Users\David\Dropbox\Projects\Miscellaneous>badallocw
alloc of 10000 succeeded
C:\Users\David\Dropbox\Projects\Miscellaneous>badallocw 1234567890
bad_alloc caught: std::bad_alloc
Note: The program is a modified version of the example at
http://www.cplusplus.com/reference/new/bad_alloc/

thrust 1.7 tabulate on CUDA device fails

The new thrust::tabulate function works for me on the host but not on the device. The device is a K20x with compute capability 3.5. The host is an Ubuntu machine with 128GB of memory. Help?
I think that the unified addressing is not the problem since I can sort a unifiedly addressed array on the device.
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/tabulate.h>
#include <thrust/version.h>
using namespace std;
// Print an expression's name then its value, possible followed by a
// comma or endl. Ex: cout << PRINTC(x) << PRINTN(y);
#define PRINT(arg) #arg "=" << (arg)
#define PRINTC(arg) #arg "=" << (arg) << ", "
#define PRINTN(arg) #arg "=" << (arg) << endl
// Execute an expression and check for CUDA errors.
#define CE(exp) { \
cudaError_t e; e = (exp); \
if (e != cudaSuccess) { \
cerr << #exp << " failed at line " << __LINE__ << " with error " << cudaGetErrorString(e) << endl; \
exit(1); \
} \
}
const int N(10);
int main(void) {
int major = THRUST_MAJOR_VERSION;
int minor = THRUST_MINOR_VERSION;
cout << "Thrust v" << major << "." << minor
<< ", CUDA_VERSION: " << CUDA_VERSION << ", CUDA_ARCH: " << __CUDA_ARCH__
<< endl;
cout << PRINTN(N);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (!prop.unifiedAddressing) {
cerr << "Unified addressing not available." << endl;
exit(1);
}
cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory) {
cerr << "Can't map host memory." << endl;
exit(1);
}
cudaSetDeviceFlags(cudaDeviceMapHost);
int *p, *q;
CE(cudaHostAlloc(&p, N*sizeof(int), cudaHostAllocMapped));
CE(cudaHostAlloc(&q, N*sizeof(int), cudaHostAllocMapped));
thrust::tabulate(thrust::host, p, p+N, thrust::negate<int>());
thrust::tabulate(thrust::device, q, q+N, thrust::negate<int>());
for (int i=0; i<N; i++)
cout << PRINTC(i) << PRINTC(p[i]) << PRINTN(q[i]);
}
Output:
Thrust v1.7, CUDA_VERSION: 6000, CUDA_ARCH: 0
N=10
i=0, p[i]=0, q[i]=0
i=1, p[i]=-1, q[i]=0
i=2, p[i]=-2, q[i]=0
i=3, p[i]=-3, q[i]=0
i=4, p[i]=-4, q[i]=0
i=5, p[i]=-5, q[i]=0
i=6, p[i]=-6, q[i]=0
i=7, p[i]=-7, q[i]=0
i=8, p[i]=-8, q[i]=0
i=9, p[i]=-9, q[i]=0
The following does not add any info content to my post but is required before stackoverflow will accept it: Much of the program is error checking and version checking.
The problem appears to be fixed in the thrust master branch at the moment. This master branch currently identifies itself as Thrust v1.8.
I ran your code with CUDA 6RC (appears to be what you are using) and I was able to duplicate your observation.
I then updated to the master branch, and removed the __CUDA_ARCH__ macro from your code, and I got the expected results (host and device tabulations match).
Note that according to the programming guide, the __CUDA_ARCH__ macro is only defined when it's used in code that is being compiled by the device code compiler. It is officially undefined in host code. Therefore it's acceptable to use it as follows in host code:
#ifdef __CUDA_ARCH__
but not as you are using it. Yes, I understand the behavior is different between thrust v1.7 and thrust master in this regard, but that appears to (also) be a thrust issue, that has been fixed in the master branch.
Both of these issues I expect would be fixed whenever the next version of thrust gets incorporated into an official CUDA drop. Since we are very close to CUDA 6.0 official release, I'd be surprised if these issues were fixed in CUDA 6.0.
Further notes about the tabulate issue:
One workaround would be to update thrust to master
Issue doesn't appear to be specific to thrust::tabulate in my testing. Many thrust functions that I tested seem to fail in that when used with thrust::device and raw pointers, they fail to write values correctly (seem to write all zeroes), but they do seem to be able to read values correctly (e.g. thrust::reduce seems to work)
Another possible workaround is to wrap your raw pointers with thrust::device_ptr<> using thrust::device_ptr_cast<>(). That seemed to work for me as well.

storing return value of thrust reduce_by_key on device vectors

I have been trying to use thrust function reduce_by_key on device vectors. In the documentation they have given example on host_vectors instead of any device vector. The main problem I am getting is in storing the return value of the function. To be more specific here is my code:
thrust::device_vector<int> hashValueVector(10)
thrust::device_vector<int> hashKeysVector(10)
thrust::device_vector<int> d_pathId(10);
thrust::device_vector<int> d_freqError(10); //EDITED
thrust::pair<thrust::detail::normal_iterator<thrust::device_ptr<int> >,thrust::detail::normal_iterator<thrust::device_ptr<int> > > new_end; //THE PROBLEM
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
I tried declaring them as device_ptr first in the definition since for host_vectors too they have used pointers in the definition in the documentation. But I am getting compilation error when I try that then I read the error statement and converted the declaration to the above, this is compiling fine but I am not sure whether this is the right way to define or not.
I there any other standard/clean way of declaring that (the "new_end" variable)? Please comment if my question is not clear somewhere.
EDIT: I have edited the declaration of d_freqError. It was supposed to be int I wrote it as hashElem by mistake, sorry for that.
There is a problem in the setup of your reduce_by_key operation:
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
Notice that in the documentation it states:
OutputIterator2 is a model of Output Iterator and and InputIterator2's value_type is convertible to OutputIterator2's value_type.
The value type of your InputIterator2 (i.e. hashValueVector.begin()) is int . The value type of your OutputIterator2 is struct hashElem. Thrust is not going to know how to convert an int to a struct hashElem.
Regarding your question, it should not be difficult to capture the return entity from reduce_by_key. According to the documentation it is a thrust pair of two iterators, and these iterators should be consistent with (i.e. of the same vector type and value type) as your keys iterator type and your values iterator type, respectively.
Here's an updated sample based on what you posted, which compiles cleanly:
$ cat t353.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/pair.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
typedef thrust::device_vector<int>::iterator dIter;
int main(){
thrust::device_vector<int> hashValueVector(10);
thrust::device_vector<int> hashKeysVector(10);
thrust::device_vector<int> d_pathId(10);
thrust::device_vector<int> d_freqError(10);
thrust::sequence(hashValueVector.begin(), hashValueVector.end());
thrust::fill(hashKeysVector.begin(), hashKeysVector.begin()+5, 1);
thrust::fill(hashKeysVector.begin()+6, hashKeysVector.begin()+10, 2);
thrust::pair<dIter, dIter> new_end;
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
std::cout << "Number of results are: " << new_end.first - d_pathId.begin() << std::endl;
thrust::copy(d_pathId.begin(), new_end.first, std::ostream_iterator<int>(std::cout, "\n"));
thrust::copy(d_freqError.begin(), new_end.second, std::ostream_iterator<int>(std::cout, "\n"));
}
$ nvcc -arch=sm_20 -o t353 t353.cu
$ ./t353
Number of results are: 3
1
0
2
10
5
30
$