Building an STL application in Borland C++ Builder 5.0

I am new to Borland C++ Builder 5.0. I have a small STL application which compiles successfully on one machine (Windows 2003 Server SP2), but not on another (Windows XP SP3). I have included a code snippet and the error message:
Error E2285 Could not find a match for 'distance<>(const AnsiString *,const AnsiString *,int)'
I opened a Borland C++ form and added the code below to its FormCreate handler:
#include <vcl.h>
#pragma hdrstop

#include <vector>
using namespace std;
using std::distance;

static const AnsiString Text_FieldsInTypen[] =
{
    "code_segment_national_2"
};

#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner)
    : TForm(Owner)
{
}
//---------------------------------------------------------------------------
void __fastcall TForm1::FormCreate(TObject *Sender)
{
    vector<AnsiString> aVec;
    aVec.push_back("Test");
    const AnsiString* Iter;
    int Index = 0;
    distance(Text_FieldsInTypen, Iter, Index);
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------

The distance algorithm takes two iterators:
template <class InputIterator>
typename iterator_traits<InputIterator>::difference_type
distance(
    InputIterator _First,
    InputIterator _Last
);
not three unrelated arguments. Iter is also used uninitialized in your code.
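For context: the three-argument distance(first, last, n) was a pre-standard HP/SGI STL extension, which may explain why the code compiled against one machine's library but not the other's. A minimal sketch of the standard two-iterator form (note that the pointer must first be initialized to point into the range):

#include <iterator>  // std::distance

void __fastcall TForm1::FormCreate(TObject *Sender)
{
    // Point the iterator at an element of the array before measuring:
    const AnsiString *Iter = &Text_FieldsInTypen[0];
    int Index = std::distance(&Text_FieldsInTypen[0], Iter);  // yields 0 here
}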

Related

__CUDA_ARCH__ flag with Thrust execution policy

I have a __host__ __device__ function which is a wrapper that calls the "sort" function of the Thrust library. Inside this wrapper, I am using the __CUDA_ARCH__ macro to set the execution policy to "thrust::device" when called from the host and "thrust::seq" when called from the device. The following piece of code generates a runtime error:
#ifndef __CUDA_ARCH__
thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
#else
thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
#endif
The error is:
Unexpected Standard exception:
What() is:merge_sort: failed on 2nd step: invalid device function
As per my understanding, __CUDA_ARCH__ can be used for conditional compilation. I would appreciate help understanding why this error is thrown.
It seems you are stepping on this issue. In a nutshell, Thrust uses CUB functionality under the hood for certain algorithms (including sort). Your use of the __CUDA_ARCH__ macro in your code, which wraps around thrust algorithm calls that use CUB, is interfering with CUB code that expects to be able to use this macro for all paths.
A possible workaround is to do "your own dispatch":
$ cat t142.cu
#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
    __host__ __device__ bool operator()(T &t1, T &t2){
        return (t1 > t2);}
};

template <typename T>
__host__ __device__
void my_sort_wrapper(T *data, size_t num){
    int hostdev = 0;  // 0=device code
#ifndef __CUDA_ARCH__
    hostdev = 1;      // 1=host code
#endif
    if (hostdev == 0) thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
    else thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
    my_sort_wrapper(data, num);
}

typedef int mytype;
const size_t sz = 10;

int main(){
    mytype *d_data;
    cudaMalloc(&d_data, sz*sizeof(mytype));
    cudaMemset(d_data, 0, sz*sizeof(mytype));
    my_sort_wrapper(d_data, sz);
    my_dev_sort<<<1,1>>>(d_data, sz);
    cudaDeviceSynchronize();
}
$ nvcc t142.cu -o t142
$ cuda-memcheck ./t142
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
With this approach, the use of the __CUDA_ARCH__ macro does not perturb the compilation of the thrust algorithms.
Another possible workaround is simply to use thrust::device policy for both cases (no dispatch - just the thrust algorithm call). Except in the case of CUDA Dynamic Parallelism, thrust::device will "decay" to thrust::seq when used in device code.
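A minimal sketch of that second workaround (the wrapper name is illustrative; it assumes the same customGreater functor as above):

// No dispatch at all: thrust::device is usable from both host and device
// code, and in device code (without dynamic parallelism) it decays to a
// sequential execution.
template <typename T>
__host__ __device__
void my_sort_wrapper_nodispatch(T *data, size_t num){
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}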
I would expect that these suggestions would only be necessary/relevant when the thrust algorithm uses CUB functionality in the underlying implementation.
If you don't like this behavior, you could file a thrust issue.
Unfortunately, we can't fix this in Thrust. The trouble here is that the NVCC compiler needs to see all __global__ function template instantiations during host compilation (e.g. when __CUDA_ARCH__ is not defined), otherwise the kernels will be treated as unused and discarded. See this CUB GitHub issue for more details.
As Robert suggested, a workaround such as this should be fine:
#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
    __host__ __device__ bool operator()(T &t1, T &t2){
        return (t1 > t2);}
};

#if defined(__CUDA_ARCH__)
#define DEVICE_COMPILATION 1
#else
#define DEVICE_COMPILATION 0
#endif

template <typename T>
__host__ __device__
void my_sort(T *data, size_t num){
    if (DEVICE_COMPILATION)
        thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
    else
        thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
    my_sort(data, num);
}

typedef int mytype;
const size_t sz = 10;

int main(){
    mytype *d_data;
    cudaMallocManaged(&d_data, sz*sizeof(mytype));
    cudaMemset(d_data, 0, sz*sizeof(mytype));
    my_sort(d_data, sz);
    my_dev_sort<<<1,1>>>(d_data, sz);
    cudaFree(d_data);
    cudaDeviceSynchronize();
}

Thrust error with CUDA separate compilation

I'm running into an error when I try to compile CUDA code with relocatable device code enabled (-rdc=true). I'm using Visual Studio 2013 as the compiler with CUDA 7.5. Below is a small example that shows the error. To clarify, the code below runs fine when -rdc=false, but when it is set to true, the error shows up.
The error simply says: CUDA error 11 [\cuda\detail\cub\device\dispatch/device_radix_sort_dispatch.cuh, 687]: invalid argument
Then I found this, which says:
When invoked with primitive data types, thrust::sort, thrust::sort_by_key,thrust::stable_sort, thrust::stable_sort_by_key may fail to link in some cases with nvcc -rdc=true.
Is there some workaround to allow separate compilation?
main.cpp:
#include <stdio.h>
#include <vector>
#include "cuda_runtime.h"
#include "RadixSort.h"

typedef unsigned int uint;
typedef unsigned __int64 uint64;

int main()
{
    RadixSort sorter;
    uint n = 10;
    std::vector<uint64> test(n);
    for (uint i = 0; i < n; i++)
        test[i] = i + 1;
    uint64 * d_array;
    uint64 size = n * sizeof(uint64);
    cudaMalloc(&d_array, size);
    cudaMemcpy(d_array, test.data(), size, cudaMemcpyHostToDevice);
    try
    {
        sorter.Sort(d_array, n);
    }
    catch (const std::exception & ex)
    {
        printf("%s\n", ex.what());
    }
}
RadixSort.h:
#pragma once

typedef unsigned int uint;
typedef unsigned __int64 uint64;

class RadixSort
{
public:
    RadixSort() {}
    ~RadixSort() {}
    void Sort(uint64 * input, const uint n);
};
RadixSort.cu:
#include "RadixSort.h"
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

void RadixSort::Sort(uint64 * input, const uint n)
{
    thrust::device_ptr<uint64> d_input = thrust::device_pointer_cast(input);
    thrust::stable_sort(d_input, d_input + n);
    cudaDeviceSynchronize();
}
As mentioned in the comments by Robert Crovella:
Changing the CUDA architecture to a higher value solved this problem. In my case I changed it to compute_30 and sm_30 under CUDA C++ -> Device -> Code Generation.
Edit:
The general recommendation is to select the architecture that best fits your specific GPU. See the link in the comments for additional information.
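For builds outside Visual Studio, a rough sketch of the equivalent nvcc invocation (file names taken from the question; compute_30/sm_30 is the target named above):

nvcc -rdc=true -gencode arch=compute_30,code=sm_30 main.cpp RadixSort.cu -o radixsort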

Using CUDA Thrust algorithms sequentially on the host

I wish to compare a Thrust algorithm's runtime when executed sequentially on a single CPU core versus a parallel execution on a GPU.
Thrust provides the thrust::seq execution policy, but how can I explicitly target the host backend system? I wish to avoid executing the algorithm sequentially on the GPU.
CUDA Thrust is architecture agnostic. Accordingly, consider the code I provided as an answer to Cumulative summation in CUDA. In that code, MatingProbability and CumulativeProbability were thrust::device_vectors, and thrust::transform and thrust::inclusive_scan were automatically able to recognize that and operate accordingly on the GPU.
Below, I'm providing the same code with thrust::device_vector changed to thrust::host_vector. Again, thrust::transform and thrust::inclusive_scan automatically recognize that the vectors to operate on reside on the CPU and operate accordingly.
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <cstdio>

template <class T>
struct scaling {
    const T _a;
    scaling(T a) : _a(a) { }
    __host__ __device__ T operator()(const T &x) const { return _a * x; }
};

int main()
{
    const int N = 20;
    double a = -(double)N;
    double b = 0.;
    double Dx = -1./(0.5*N*(N+1));
    thrust::host_vector<double> MatingProbability(N);
    thrust::host_vector<double> CumulativeProbability(N+1, 0.);
    thrust::transform(thrust::make_counting_iterator(a), thrust::make_counting_iterator(b), MatingProbability.begin(), scaling<double>(Dx));
    thrust::inclusive_scan(MatingProbability.begin(), MatingProbability.end(), CumulativeProbability.begin() + 1);
    for (int i = 0; i < N+1; i++)
    {
        double val = CumulativeProbability[i];
        printf("%d %3.15f\n", i, val);
    }
    return 0;
}
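As an aside, beyond the host_vector approach shown above: Thrust also provides an explicit thrust::host execution policy in <thrust/execution_policy.h>, which targets the host backend even with raw pointers. A minimal sketch:

#include <thrust/execution_policy.h>
#include <thrust/sort.h>

int main()
{
    double data[5] = {3., 1., 4., 1., 5.};
    // Runs on the host backend; the GPU is not involved.
    thrust::sort(thrust::host, data, data + 5);
    return 0;
}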

CUDA shared object between threads

I am totally new to CUDA. I want to create one object on the device, and access its member from different threads. I use nvcc -arch=sm_20 (on Tesla M2090), and if I run my code I get an 'unspecified launch failure'. Here is my code:
#include <stdio.h>
#include <string>
using namespace std;

#ifdef __CUDACC__
#define CUDA_CALLABLE __host__ __device__
#else
#define CUDA_CALLABLE
#endif

class SimpleClass {
public:
    int i;
    CUDA_CALLABLE SimpleClass(){i=1;};
    CUDA_CALLABLE ~SimpleClass(){};
};

__global__ void initkernel(SimpleClass *a){
    a = new SimpleClass();
}
__global__ void delkernel(SimpleClass *a){
    delete a;
}
__global__ void kernel(SimpleClass *a){
    printf("%d\n", a->i);
}

int main() {
    SimpleClass *a;
    initkernel<<<1,1>>>(a);
    cudaThreadSynchronize();
    kernel<<<1,10>>>(a);
    cudaThreadSynchronize();
    delkernel<<<1,1>>>(a);
    cudaThreadSynchronize();
    cudaError_t error = cudaGetLastError();
    string lastError = cudaGetErrorString(error);
    printf("%s\n",lastError.c_str());
    return 0;
}
You get the 'unspecified launch failure' during your first kernel call because 'a' is a pointer stored on the host, but you want to give it a value from a device function. If you want to allocate the object on the device, then you first have to allocate a pointer on the device, and then you can read and write it from device (kernel) code; but be careful, because it requires double indirection.
Your code should look something like this (the rest of the functions should be modified similarly):
__global__ void initkernel(SimpleClass** a){
    *a = new SimpleClass();
}

int main() {
    SimpleClass** a;
    cudaMalloc((void**)&a, sizeof(SimpleClass**));
    initkernel<<<1,1>>>(a);
    cudaThreadSynchronize();
}
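A sketch of how the remaining kernels could be modified in the same double-indirection style (my completion of the answer's pattern, not part of the original):

__global__ void kernel(SimpleClass** a){
    printf("%d\n", (*a)->i);  // dereference twice to reach the object
}
__global__ void delkernel(SimpleClass** a){
    delete *a;  // delete the device-side object, not the pointer cell
}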
PS: pQB is absolutely right that you should do an error check after each kernel call to detect errors as soon as possible (and, in this case, to find the exact location of the error in your code).
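A common error-checking pattern looks something like this (one of many variants, not from the original answer):

#define CUDA_CHECK(call) do {                                    \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess)                                      \
        printf("CUDA error %s at %s:%d\n",                       \
               cudaGetErrorString(err), __FILE__, __LINE__);     \
} while (0)

// Usage after each kernel launch:
// kernel<<<1,10>>>(a);
// CUDA_CHECK(cudaGetLastError());        // launch errors
// CUDA_CHECK(cudaDeviceSynchronize());   // execution errors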

How to advance iterator in thrust function

I'm doing some study on Thrust, but I don't understand how to get the value an iterator points to.
An example code is like:
#include <thrust/for_each.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <vector>
using namespace std;

class ADD
{
private:
    typedef typename thrust::device_vector<int>::iterator PTR;
public:
    ADD(){}
    ~ADD(){}
    void setPtr(PTR &ptr)
    {this->ptr=ptr;}
    __host__ __device__
    void operator()(int &x)
    {
        // note that using printf in a __device__ function requires
        // code compiled for a GPU with compute capability 2.0 or
        // higher (nvcc --arch=sm_20)
        x+=add();
    }
    __host__ __device__
    int add()
    {return *ptr++;}
private:
    PTR ptr;
};

int main()
{
    thrust::device_vector<int> d_vec(3);
    d_vec[0] = 0; d_vec[1] = 1; d_vec[2] = 2;
    thrust::device_vector<int>::iterator itr=d_vec.begin();
    ADD *addtest=new ADD();
    addtest->setPtr(itr);
    thrust::for_each(d_vec.begin(), d_vec.end(), *addtest);
    for(int i=0;i<3;i++)
        cout<<d_vec[i]<<endl;
    return 0;
}
When I compile this using nvcc -arch=sm_20 test.cu, I get the following warnings:
test.cu(28): warning: calling a host function("thrust::experimental::iterator_facade<thrust::detail::normal_iterator<thrust::device_ptr<int> > , thrust::device_ptr<int> , int, thrust::detail::cuda_device_space_tag, thrust::random_access_traversal_tag, thrust::device_reference<int> , long> ::operator *") from a __device__/__global__ function("printf_functor::add") is not allowed
test.cu(28): warning: calling a host function("thrust::experimental::iterator_facade<thrust::detail::normal_iterator<thrust::device_ptr<int> > , thrust::device_ptr<int> , int, thrust::detail::cuda_device_space_tag, thrust::random_access_traversal_tag, thrust::device_reference<int> , long> ::operator *") from a __device__/__global__ function("printf_functor::add") is not allowed
I cannot get this to compile. How can I solve this problem?
@Gang.Wang: I think you are just mixing up two different things: all the STL-like functionality, including for_each, device_vector iterators, etc., is just a "facade" which exists on the host only, while operator() contains the actual GPU code, which is compiled to a CUDA kernel and applied to each element of your vector in parallel. Hence, device_vector iterators are not accessible from your functor.
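To achieve the questioner's goal of consuming one value per element, the usual alternative is to pass a second range to thrust::transform rather than advancing an iterator inside a functor (a sketch of mine, not part of the answer; advancing shared state inside a functor cannot work in parallel anyway):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main()
{
    thrust::device_vector<int> d_vec(3);
    d_vec[0] = 0; d_vec[1] = 1; d_vec[2] = 2;
    thrust::device_vector<int> d_src = d_vec;  // one value to add per element

    // d_vec[i] = d_vec[i] + d_src[i], computed on the device
    thrust::transform(d_vec.begin(), d_vec.end(),
                      d_src.begin(), d_vec.begin(),
                      thrust::plus<int>());

    for (int i = 0; i < 3; i++)
        std::cout << d_vec[i] << std::endl;  // prints 0, 2, 4
    return 0;
}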