The new thrust::tabulate function works for me on the host but not on the device. The device is a K20x with compute capability 3.5. The host is an Ubuntu machine with 128GB of memory. Help?
I don't think unified addressing is the problem, since I can sort an array in unified (mapped) memory on the device.
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/tabulate.h>
#include <thrust/version.h>
using namespace std;

// Print an expression's name then its value, possibly followed by a
// comma or endl. Ex: cout << PRINTC(x) << PRINTN(y);
#define PRINT(arg)  #arg "=" << (arg)
#define PRINTC(arg) #arg "=" << (arg) << ", "
#define PRINTN(arg) #arg "=" << (arg) << endl

// Execute an expression and check for CUDA errors.
#define CE(exp) { \
  cudaError_t e; e = (exp); \
  if (e != cudaSuccess) { \
    cerr << #exp << " failed at line " << __LINE__ \
         << " with error " << cudaGetErrorString(e) << endl; \
    exit(1); \
  } \
}

const int N(10);

int main(void) {
  int major = THRUST_MAJOR_VERSION;
  int minor = THRUST_MINOR_VERSION;
  cout << "Thrust v" << major << "." << minor
       << ", CUDA_VERSION: " << CUDA_VERSION << ", CUDA_ARCH: " << __CUDA_ARCH__
       << endl;
  cout << PRINTN(N);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  if (!prop.unifiedAddressing) {
    cerr << "Unified addressing not available." << endl;
    exit(1);
  }
  if (!prop.canMapHostMemory) {
    cerr << "Can't map host memory." << endl;
    exit(1);
  }
  cudaSetDeviceFlags(cudaDeviceMapHost);

  int *p, *q;
  CE(cudaHostAlloc(&p, N*sizeof(int), cudaHostAllocMapped));
  CE(cudaHostAlloc(&q, N*sizeof(int), cudaHostAllocMapped));

  thrust::tabulate(thrust::host,   p, p+N, thrust::negate<int>());
  thrust::tabulate(thrust::device, q, q+N, thrust::negate<int>());

  for (int i = 0; i < N; i++)
    cout << PRINTC(i) << PRINTC(p[i]) << PRINTN(q[i]);
}
Output:
Thrust v1.7, CUDA_VERSION: 6000, CUDA_ARCH: 0
N=10
i=0, p[i]=0, q[i]=0
i=1, p[i]=-1, q[i]=0
i=2, p[i]=-2, q[i]=0
i=3, p[i]=-3, q[i]=0
i=4, p[i]=-4, q[i]=0
i=5, p[i]=-5, q[i]=0
i=6, p[i]=-6, q[i]=0
i=7, p[i]=-7, q[i]=0
i=8, p[i]=-8, q[i]=0
i=9, p[i]=-9, q[i]=0
The problem appears to be fixed in the thrust master branch at the moment. This master branch currently identifies itself as Thrust v1.8.
I ran your code with CUDA 6RC (appears to be what you are using) and I was able to duplicate your observation.
I then updated to the master branch, and removed the __CUDA_ARCH__ macro from your code, and I got the expected results (host and device tabulations match).
Note that according to the programming guide, the __CUDA_ARCH__ macro is only defined in code that is being compiled by the device code compiler; it is officially undefined in host code. It is therefore acceptable to test for its presence in host code, as in:
#ifdef __CUDA_ARCH__
but not to use its value the way you are using it. Yes, the behavior differs between thrust v1.7 and thrust master in this regard, but that appears to (also) be a thrust issue, which has been fixed in the master branch.
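For illustration, here is a minimal sketch of the sanctioned pattern (my example, not from the original post):

// __host__ __device__ functions are compiled for both the host and device
// trajectories; __CUDA_ARCH__ distinguishes them. Testing it with #ifdef is
// fine; streaming its value out from host code (as in the question) is not.
__host__ __device__ int arch_flag()
{
#ifdef __CUDA_ARCH__
    return 1;  // device-compilation trajectory
#else
    return 0;  // host-compilation trajectory
#endif
}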
I expect both of these issues will be fixed whenever the next version of thrust gets incorporated into an official CUDA release. Since we are very close to the official CUDA 6.0 release, I'd be surprised if the fixes made it into CUDA 6.0.
Further notes about the tabulate issue:
One workaround is to update thrust to the master branch.
The issue doesn't appear to be specific to thrust::tabulate. In my testing, many thrust functions used with thrust::device and raw pointers fail to write values correctly (they appear to write all zeroes), although they do seem to read values correctly (e.g. thrust::reduce works).
Another possible workaround is to wrap your raw pointers in thrust::device_ptr<> using thrust::device_pointer_cast(). That worked for me as well; a sketch follows.
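As a hedged illustration of that second workaround (my code, not from the original post, reusing q and N from the question):

#include <thrust/device_ptr.h>
#include <thrust/tabulate.h>
#include <thrust/functional.h>

// Wrap the raw mapped pointer so thrust unambiguously treats it as device
// data; with device_ptr iterators no explicit execution policy is needed.
void tabulate_on_device(int *q, int N)
{
    thrust::device_ptr<int> qd = thrust::device_pointer_cast(q);
    thrust::tabulate(qd, qd + N, thrust::negate<int>());
}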
I run into this issue over and over in CUDA: for a set of elements, I have done some GPU calculation, and the result is a set of values with a linear meaning (for instance, in terms of memory):
element_sizes = [ 10, 100, 23, 45 ]
And now, for the next stage of GPU calculation, I need the following values:
memory_size = sum(element_sizes)
memory_offsets = [ 0, 10, 110, 133 ]
I can calculate memory_size at 80 GB/s on my GPU using the reduction code available from NVIDIA. However, I can't use that code as-is, because its branching technique does not also produce the memory_offsets array. I have tried many things, but what I have found is that simply copying element_sizes to the host and calculating the offsets with a simple for loop is the simplest, fastest way to go:
// in pseudo code
host_element_sizes = copy_to_host(element_sizes);
host_offsets = (... *) malloc(...);
int total_size = 0;
for (int i = 0; i < ...; ...) {
    host_offsets[i] = total_size;
    total_size += host_element_sizes[i];
}
device_offsets = (... *) device_malloc(...);
device_offsets = copy_to_device(host_offsets,...);
However, I have done this many times now, and it is starting to become a bottleneck. This seems like a typical problem, but I have found no work-around.
What is the expected way for a CUDA programmer to solve this problem?
I think the algorithm you are looking for is a prefix sum. A prefix sum on a vector produces another vector containing the cumulative sums of the input vector. A prefix sum exists in at least two variants, an exclusive scan and an inclusive scan; conceptually these are similar. For the input [10, 100, 23, 45], an exclusive scan yields [0, 10, 110, 133] while an inclusive scan yields [10, 110, 133, 178].
If your element_sizes vector has been deposited in GPU global memory (which appears to be the case based on your pseudocode), then there are library functions that run on the GPU which you can call at that point to produce the memory_offsets vector, and the memory_size value can be obtained trivially from the last value(s) in the vectors, with a slight variation depending on whether you are doing an inclusive scan or an exclusive scan.
Here's a trivial worked example using thrust:
$ cat t319.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

int main(){

  const int element_sizes[] = { 10, 100, 23, 45 };
  const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
  thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
  thrust::device_vector<int> dv_mo(ds);
  // an exclusive scan of the sizes gives the starting offsets
  thrust::exclusive_scan(dv_es.begin(), dv_es.end(), dv_mo.begin());
  std::cout << "element_sizes:" << std::endl;
  thrust::copy_n(dv_es.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "memory_offsets:" << std::endl;
  thrust::copy_n(dv_mo.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
  // total size = last offset + last element size
  std::cout << std::endl << "memory_size:" << std::endl
            << dv_es[ds-1] + dv_mo[ds-1] << std::endl;
}
$ nvcc -o t319 t319.cu
$ ./t319
element_sizes:
10,100,23,45,
memory_offsets:
0,10,110,133,
memory_size:
178
$
I am trying to build a "run length encoder" which produces a report of occurrences of runs within a file using CUDA Thrust. I will use this "report" to perform the run length encoding step later.
e.g.
Input sequence:
inputSequence = [a, a, b, c, a, a, a];
Output sequences:
runChar = [a, a];
runCount = [2, 3];
runPosition = [0, 4];
The output describes a run of 2 a's starting at position 0 and a run of 3 a's starting at position 4.
The Thrust run length encoder example described below outputs two arrays - one for the output char and one for its length.
I would like to modify this so runs of less than 2 are excluded and it also outputs the position each run occurs.
// input data on the host
const char data[] = "aaabbbbbcddeeeeeeeeeff";
const size_t N = (sizeof(data) / sizeof(char)) - 1;

// copy input data to the device
thrust::device_vector<char> input(data, data + N);

// allocate storage for output data and run lengths
thrust::device_vector<char> output(N);
thrust::device_vector<int> lengths(N);

// print the initial data
std::cout << "input data:" << std::endl;
thrust::copy(input.begin(), input.end(), std::ostream_iterator<char>(std::cout, ""));
std::cout << std::endl << std::endl;

// compute run lengths
size_t num_runs = thrust::reduce_by_key
                    (input.begin(), input.end(),          // input key sequence
                     thrust::constant_iterator<int>(1),   // input value sequence
                     output.begin(),                      // output key sequence
                     lengths.begin()                      // output value sequence
                    ).first - output.begin();             // compute the output size

// print the output
std::cout << "run-length encoded output:" << std::endl;
for (size_t i = 0; i < num_runs; i++)
    std::cout << "(" << output[i] << "," << lengths[i] << ")";
std::cout << std::endl;

return 0;
One possible approach, building on what you have shown already:
Take your output lengths and do an exclusive_scan on them. This creates a corresponding vector of the starting indexes of each run.
Use stream compaction (thrust::copy_if) to drop the elements of all arrays (output, lengths, and indexes) whose corresponding length is 1. We do this in two steps: the first copy_if operation compacts output and indexes together, using lengths as the stencil, and the second operates directly on lengths. This could probably be improved significantly by operating on all 3 at once, which would make the output-length calculation a bit more complicated. How you handle this exactly will depend on which sets of data you intend to retain.
Here is a fully worked example, extending your code:
$ cat t601.cu
#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/zip_iterator.h>

// predicate: true for any length other than 1
struct is_not_one{
  template <typename T>
  __host__ __device__
  bool operator()(T data){
    return data != 1;
  }
};

int main(){

  // input data on the host
  const char data[] = "aaabbbbbcddeeeeeeeeeff";
  const size_t N = (sizeof(data) / sizeof(char)) - 1;

  // copy input data to the device
  thrust::device_vector<char> input(data, data + N);

  // allocate storage for output data and run lengths
  thrust::device_vector<char> output(N);
  thrust::device_vector<int> lengths(N);

  // print the initial data
  std::cout << "input data:" << std::endl;
  thrust::copy(input.begin(), input.end(), std::ostream_iterator<char>(std::cout, ""));
  std::cout << std::endl << std::endl;

  // compute run lengths
  size_t num_runs = thrust::reduce_by_key
                      (input.begin(), input.end(),          // input key sequence
                       thrust::constant_iterator<int>(1),   // input value sequence
                       output.begin(),                      // output key sequence
                       lengths.begin()                      // output value sequence
                      ).first - output.begin();             // compute the output size

  // print the output
  std::cout << "run-length encoded output:" << std::endl;
  for (size_t i = 0; i < num_runs; i++)
    std::cout << "(" << output[i] << "," << lengths[i] << ")";
  std::cout << std::endl;

  // starting index of each run = exclusive scan of the run lengths
  thrust::device_vector<int> indexes(num_runs);
  thrust::exclusive_scan(lengths.begin(), lengths.begin()+num_runs, indexes.begin());

  // compact away runs of length 1, using lengths as the stencil
  thrust::device_vector<char> foutput(num_runs);
  thrust::device_vector<int> findexes(num_runs);
  thrust::device_vector<int> flengths(num_runs);
  thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(output.begin(), indexes.begin())),
                  thrust::make_zip_iterator(thrust::make_tuple(output.begin()+num_runs, indexes.begin()+num_runs)),
                  lengths.begin(),
                  thrust::make_zip_iterator(thrust::make_tuple(foutput.begin(), findexes.begin())),
                  is_not_one());
  size_t fnum_runs = thrust::copy_if(lengths.begin(), lengths.begin()+num_runs,
                                     flengths.begin(), is_not_one())
                     - flengths.begin();

  std::cout << "output: " << std::endl;
  thrust::copy_n(foutput.begin(), fnum_runs, std::ostream_iterator<char>(std::cout, ","));
  std::cout << std::endl << "lengths: " << std::endl;
  thrust::copy_n(flengths.begin(), fnum_runs, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "indexes: " << std::endl;
  thrust::copy_n(findexes.begin(), fnum_runs, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc -arch=sm_20 -o t601 t601.cu
$ ./t601
input data:
aaabbbbbcddeeeeeeeeeff
run-length encoded output:
(a,3)(b,5)(c,1)(d,2)(e,9)(f,2)
output:
a,b,d,e,f,
lengths:
3,5,2,9,2,
indexes:
0,3,9,11,20,
$
I'm certain that this code can be improved upon, but my purpose is to show you one possible general approach.
In my opinion, for future reference, it's not very helpful to strip the include headers off your sample code; it's better to provide a complete, compilable code. Not a big deal in this case.
Also note that there are thrust example codes for run length encoding and decoding.
I am having trouble understanding how clang handles the exception that should be thrown when I try to allocate an object that would exceed the implementation limit. For instance, if I compile and run the following bit of code:
#include <limits>
#include <new>
#include <iostream>

int main(int argc, char** argv) {
    typedef unsigned char byte;
    byte* gb;
    try {
        gb = new byte[std::numeric_limits<std::size_t>::max()];
    }
    catch (const std::bad_alloc&) {
        std::cout << "Normal" << std::endl;
        return 0;
    }
    delete[] gb;
    std::cout << "Abnormal" << std::endl;
    return 1;
}
then when I compile using "clang++ -O0 -std=c++11 main.cpp", the result I get is "Normal", as expected. But as soon as I enable optimization levels 1 through 3, the program unexpectedly returns "Abnormal".
I say "unexpectedly" because, according to the C++11 standard, 5.3.4/7:
When the value of the expression in a noptr-new-declarator is zero, the allocation function is called to allocate an array with no elements. If the value of that expression is less than zero or such that the size of the allocated object would exceed the implementation-defined limit, or if the new-initializer is a braced-init-list for which the number of initializer-clauses exceeds the number of elements to initialize, no storage is obtained and the new-expression terminates by throwing an exception of a type that would match a handler (15.3) of type std::bad_array_new_length (18.6.2.2).
[This behavior is observed with both clang 3.5 using libstdc++ on Linux and clang 3.3 using libc++ on Mac. The same behavior is also observed when the -std=c++11 flag is removed.]
The plot thickens when I compile the same program using gcc 4.8, using the exact same command line options. In that case, the program returns "Normal" for any chosen optimization level.
I cannot find any undefined behavior in the code posted above that would explain why clang would feel free not to throw an exception when code optimizations are enabled. As far as the bug database is concerned, the closest I can find is http://llvm.org/bugs/show_bug.cgi?id=11644 but it seems to be related to the type of exception being thrown rather than a behavior difference between debug and release code.
So is this a bug in Clang? Or am I missing something? Thanks,
It appears that clang eliminates the allocation as the array is unused:
#include <limits>
#include <new>
#include <iostream>
#include <cstddef>

int main(int argc, char** argv)
{
    typedef unsigned char byte;
    byte* gb;
    const std::size_t max = std::numeric_limits<std::size_t>::max();
    try
    {
        gb = new byte[max];
    }
    catch (const std::bad_alloc&)
    {
        std::cout << "Normal" << std::endl;
        return 0;
    }
    try
    {
        // actually touch the array so the allocation cannot be elided
        gb[0] = 1;
        gb[max - 1] = 1;
        // cast to int so the byte values are printed visibly
        std::cout << (int)gb[0] << (int)gb[max - 1] << "\n";
    }
    catch ( ... )
    {
        std::cout << "Exception on access\n";
    }
    delete [] gb;
    std::cout << "Abnormal" << std::endl;
    return 1;
}
This code prints "Normal" with both -O0 and -O3: because the array is now used, the allocation is actually attempted, it indeed fails, and we get the exception. Note that if we remove the output statement, clang is still smart enough to elide even the writes.
It appears that clang++ on Mac OSX does throw bad_alloc, but it also prints an error message from malloc.
Program:
// bad_alloc example
#include <iostream>   // std::cout
#include <sstream>
#include <new>        // std::bad_alloc

int main(int argc, char *argv[])
{
    unsigned long long memSize = 10000;
    if (argc < 2)
        memSize = 10000;
    else {
        std::istringstream is(argv[1]);  // C++ atoi
        is >> memSize;
    }
    try
    {
        int* myarray = new int[memSize];
        std::cout << "alloc of " << memSize << " succeeded" << std::endl;
        delete[] myarray;  // free the demo allocation
    }
    catch (std::bad_alloc& ba)
    {
        std::cerr << "bad_alloc caught: " << ba.what() << '\n';
    }
    std::cerr << "Program exiting normally" << std::endl;
    return 0;
}
Mac terminal output:
david@Godel:~/Dropbox/Projects/Miscellaneous$ badalloc
alloc of 10000 succeeded
Program exiting normally
david@Godel:~/Dropbox/Projects/Miscellaneous$ badalloc 1234567891234567890
badalloc(25154,0x7fff7622b310) malloc: *** mach_vm_map(size=4938271564938272768)
failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
bad_alloc caught: std::bad_alloc
Program exiting normally
I also tried the same program using g++ on Windows 7:
C:\Users\David\Dropbox\Projects\Miscellaneous>g++ -o badallocw badalloc.cpp
C:\Users\David\Dropbox\Projects\Miscellaneous>badallocw
alloc of 10000 succeeded
C:\Users\David\Dropbox\Projects\Miscellaneous>badallocw 1234567890
bad_alloc caught: std::bad_alloc
Note: The program is a modified version of the example at
http://www.cplusplus.com/reference/new/bad_alloc/
I'm having trouble sending a char (i.e. "R") from my Qt 5 application on Windows 7 to a COM port connected to an Arduino.
I intend to blink an LED on the Arduino, and my Arduino part works OK.
Here is my Qt code:
#include <QTextStream>
#include <QCoreApplication>
#include <QtSerialPort/QSerialPortInfo>
#include <QSerialPort>
#include <iostream>
#include <QtCore>
QT_USE_NAMESPACE
using namespace std;
QSerialPort serial;
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    QTextStream out(stdout);

    QList<QSerialPortInfo> serialPortInfoList = QSerialPortInfo::availablePorts();
    out << QObject::tr("Total number of ports available: ") << serialPortInfoList.count() << endl;
    foreach (const QSerialPortInfo &serialPortInfo, serialPortInfoList) {
        out << endl
            << QObject::tr("Port: ") << serialPortInfo.portName() << endl
            << QObject::tr("Location: ") << serialPortInfo.systemLocation() << endl
            << QObject::tr("Description: ") << serialPortInfo.description() << endl
            << QObject::tr("Manufacturer: ") << serialPortInfo.manufacturer() << endl
            << QObject::tr("Vendor Identifier: ") << (serialPortInfo.hasVendorIdentifier() ? QByteArray::number(serialPortInfo.vendorIdentifier(), 16) : QByteArray()) << endl
            << QObject::tr("Product Identifier: ") << (serialPortInfo.hasProductIdentifier() ? QByteArray::number(serialPortInfo.productIdentifier(), 16) : QByteArray()) << endl
            << QObject::tr("Busy: ") << (serialPortInfo.isBusy() ? QObject::tr("Yes") : QObject::tr("No")) << endl;
    }

    serial.setPortName("COM5");
    serial.open(QIODevice::ReadWrite);
    serial.setBaudRate(QSerialPort::Baud9600);
    serial.setDataBits(QSerialPort::Data8);
    serial.setParity(QSerialPort::NoParity);
    serial.setStopBits(QSerialPort::OneStop);
    serial.setFlowControl(QSerialPort::NoFlowControl);

    if (!serial.isOpen()) {
        std::cout << "port is not open" << endl;
        //serial.open(QIODevice::ReadWrite);
    }
    if (serial.isWritable() == true) {
        std::cout << "port writable..." << endl;
    }

    QByteArray data("R");
    serial.write(data);
    serial.flush();
    std::cout << "value sent!!! " << std::endl;
    serial.close();
    return 0;
}
My source code consists of two parts:
1- the serialportinfolist part, which works just fine;
2- opening and writing data. I get no issue when running the code, and the display shows the result as if nothing has gone wrong!
HOWEVER, the LED on the board does not turn on when I run this code.
When I test with the Arduino Serial Monitor the LED turns on, but I can't turn it on from Qt.
Are you waiting for CR LF (0x0D 0x0A) in your Arduino code?
QByteArray ba;
ba.resize(3);
ba[0] = 0x52; // 'R'
ba[1] = 0x0d; // CR
ba[2] = 0x0a; // LF
Or append it to your string with
QByteArray data("R\r\n");
Or
QByteArray data("R\n");
I think I have found a partial solution, but it is still incomplete.
When I press debug the first time, Qt does not send any signal to the Arduino, but when I press debug a second time it behaves as expected.
So isn't it weird that one has to run it twice to get it working?
Let me know if the problem lies somewhere else.
Any help is appreciated.
I have a weird error that I don't understand when I initialize a device vector using a placeholder expression.
I want to create a device_vector of size 1000 with elements:
A[i] = i*0.05;
Here is my code (adapted from the example here: Thrust - Initial device_vector):
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>

using namespace thrust::placeholders;

int main(void)
{
    // A[i] = i * 0.05, generated on the fly by a transform iterator
    thrust::device_vector<float> A(
        thrust::make_transform_iterator(
            thrust::counting_iterator<float>(0), _1 * 0.05),
        thrust::make_transform_iterator(
            thrust::counting_iterator<float>(0), _1 * 0.05) + 1000);

    std::cout << "--------------------------" << std::endl;
    std::cout << "A[0] is : " << A[0] << std::endl;
    std::cout << "A[1] is : " << A[1] << std::endl;
    std::cout << "A[100] is : " << A[100] << std::endl;
    std::cout << "A[500] is : " << A[500] << std::endl;
    return 0;
}
The code compiles fine (using thrust v1.6), but when I try to access any value of my device vector (such as A[0]), I get a runtime error.
What am I doing wrong? This is probably very basic, but it's late Friday night, and somehow I can't see it!
Thanks a lot for the help.
Thrust programs frequently will not behave correctly when compiled with the -G switch, so it's recommended not to use it with thrust.
This specific behavior may vary with thrust versions and may improve in newer ones, but in general, at the moment, if you're having trouble with thrust code, try compiling without the -G switch.
Device code will also frequently run faster when compiled without -G. In general, -G should only be used when you expect to do device code debugging (e.g. with Nsight VSE or cuda-gdb), or in other special cases where you are focusing on some aspect of device code generation for test purposes. Otherwise, do not compile codes with -G as a general practice.
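For illustration, assuming the code above is saved as init.cu (my file name, not from the original post), the only difference between the failing and the working build is the -G flag:
$ nvcc -G -o init init.cu    # device-debug build: thrust may misbehave
$ nvcc -o init init.cu       # no -G: expected results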