What is the behavior of the vector here? - stl

I can't work out why, in the test shown below, iterator p never reaches the end, so the loop breaks only when k = 20. What exactly is push_back doing to cause undefined behavior? Is it because the vector dynamically allocated extra storage for the new elements, and the amount allocated is not necessarily the amount I will use?
#include <iostream>
#include <vector>
#include <list>
using namespace std;
const int MAGIC = 11223344;
void test()
{
bool allValid = true;
int k = 0;
vector<int> v2(5, MAGIC);
k = 0;
for (vector<int>::iterator p = v2.begin(); p != v2.end(); p++, k++)
{
if (k >= 20) // prevent infinite loop
break;
if (*p != MAGIC)
{
cout << "Item# " << k << " is " << *p << ", not " << MAGIC <<"!" << endl;
allValid = false;
}
if (k == 2)
{
for (int i = 0; i < 5; i++)
v2.push_back(MAGIC);
}
}
if (allValid && k == 10)
cout << "Passed test 3" << endl;
else
cout << "Failed test 3" << "\n" << k << endl;
}
int main()
{
test();
}

Inserting into a vector while iterating over it is a bad idea. An insertion may trigger a memory reallocation, which invalidates all iterators into the vector. In this case, the capacity was not large enough to hold the additional elements, so the vector allocated new storage at a different address and p was left pointing into the old storage. You can check it yourself:
void test()
{
bool allValid = true;
int k = 0;
vector<int> v2(5, MAGIC);
k = 0;
for (vector<int>::iterator p = v2.begin(); p != v2.end(); p++, k++)
{
cout << v2.capacity() << endl; // Print the vector capacity
if (k >= 20) // prevent infinite loop
break;
if (*p != MAGIC) {
//cout << "Item# " << k << " is " << *p << ", not " << MAGIC <<"!" << endl;
allValid = false;
}
if (k == 2) {
for (int i = 0; i < 5; i++)
v2.push_back(MAGIC);
}
}
if (allValid && k == 10)
cout << "Passed test 3" << endl;
else
cout << "Failed test 3" << "\n" << k << endl;
}
This code will output something like the following:
5
5
5
10 <-- the capacity has changed
10
... skipped ...
10
10
Failed test 3
20
We can see that when k is equal to 2 (third line), the capacity of the vector doubles (fourth line) because we are adding new elements; the exact growth factor is implementation-specific. The memory is reallocated, and the vector elements are most likely now located elsewhere. You can also check this by printing the vector's base address with the data() member function instead of the capacity:
Address: 0x136dc20 k: 0
Address: 0x136dc20 k: 1
Address: 0x136dc20 k: 2
Address: 0x136e050 k: 3 <-- the address has changed
Address: 0x136e050 k: 4
... skipped ...
Address: 0x136e050 k: 19
Address: 0x136e050 k: 20
Failed test 3
20
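The address check can be sketched as follows. This is a minimal illustration, not the code that produced the output above; GrowthReport and grow_past_capacity are hypothetical names introduced here, and the addresses are stored as integers so that no dangling pointer is ever compared.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not from the original post): where the vector's
// storage lived, and how large it was, before and after growing.
struct GrowthReport {
    std::uintptr_t address_before;
    std::uintptr_t address_after;
    std::size_t capacity_before;
    std::size_t capacity_after;
};

GrowthReport grow_past_capacity() {
    std::vector<int> v(5, 11223344);
    GrowthReport r{};
    r.address_before = reinterpret_cast<std::uintptr_t>(v.data());
    r.capacity_before = v.capacity();
    // Push until the size exceeds the old capacity; this forces a reallocation.
    while (v.size() <= r.capacity_before)
        v.push_back(11223344);
    r.address_after = reinterpret_cast<std::uintptr_t>(v.data());
    r.capacity_after = v.capacity();
    return r;
}
```

After the call, capacity_after is larger than capacity_before and the two addresses differ, which is exactly the situation that leaves an iterator obtained before the push_back calls dangling.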
The code is fragile as written; you can make it more robust by using indices instead of iterators.
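A minimal sketch of the index-based variant, assuming the same scenario as the question (five elements, five more appended when k == 2). The function name visit_with_indices is hypothetical, and the function reports its result through a return value instead of printing.

```cpp
#include <cstddef>
#include <vector>

const int MAGIC = 11223344;

// The loop tracks a position (index) rather than an iterator, so a
// reallocation inside push_back cannot leave it dangling.
// Returns the number of elements visited, or -1 if an element is corrupted.
int visit_with_indices() {
    std::vector<int> v2(5, MAGIC);
    int visited = 0;
    // v2.size() is re-read on every pass, so appended elements are visited too.
    for (std::size_t k = 0; k < v2.size(); ++k) {
        if (v2[k] != MAGIC)
            return -1;
        if (k == 2) {
            for (int i = 0; i < 5; ++i)
                v2.push_back(MAGIC);
        }
        ++visited;
    }
    return visited;
}
```

Called from main, visit_with_indices() returns 10: the five original elements plus the five appended ones. If the final size is known in advance, calling reserve(10) before the loop is another option, since push_back only reallocates when the size would exceed the capacity.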

Related

How to cope with "cudaErrorMissingConfiguration" from "cudaMallocPitch" function of CUDA?

I'm making a Mandelbrot set program with CUDA. However, I can't make any progress until the cudaErrorMissingConfiguration error returned by cudaMallocPitch() is solved. Could you tell me something about it?
My GPU is GeForce RTX 2060 SUPER.
My command line is shown below.
> nvcc MandelbrotCUDA.cu -o MandelbrotCUDA -O3
I tried cudaDeviceSetLimit( cudaLimitMallocHeapSize, 7*1024*1024*1024 ) to
resize the heap.
cudaDeviceSetLimit succeeded.
However, I still cannot get any further: "CUDA malloc done!" is never printed.
#include <iostream>
#include <thrust/complex.h>
#include <fstream>
#include <string>
#include <stdlib.h>
using namespace std;
#define D 0.0000025 // Tick
#define LIMIT_N 255
#define INF_NUM 2
#define PLOT_METHOD 2 // dat file : 0, ppm file : 1, ppm file with C : 2
__global__
void calculation(const int indexTotalX, const int indexTotalY, int ***n, thrust::complex<double> ***c){ // n, c are the pointers of dN, dC.
for(int i = 0; i < indexTotalY ; i++){
for(int j = 0; j < indexTotalX; j++){
thrust::complex<double> z(0.0f, 0.0f);
n[i][j] = 0;
for(int ctr=1; ctr <= LIMIT_N ; ctr++){
z = z*z + (*(c[i][j]));
n[i][j] = n[i][j] + (abs(z) < INF_NUM);
}
}
}
}
int main(){
// Data Path
string filePath = "Y:\\Documents\\Programming\\mandelbrot\\";
string fileName = "mandelbrot4.ppm";
string filename = filePath+fileName;
//complex<double> c[N][M];
double xRange[2] = {-0.76, -0.74};
double yRange[2] = {0.05, 0.1};
const int indexTotalX = (xRange[1]-xRange[0])/D;
const int indexTotalY = (yRange[1]-yRange[0])/D;
thrust::complex<double> **c;
//c = new complex<double> [N];
cout << "debug_n" << endl;
int **n;
n = new int* [indexTotalY];
c = new thrust::complex<double> * [indexTotalY];
for(int i=0;i<indexTotalY;i++){
n[i] = new int [indexTotalX];
c[i] = new thrust::complex<double> [indexTotalX];
}
cout << "debug_n_end" << endl;
for(int i = 0; i < indexTotalY; i++){
for(int j = 0; j < indexTotalX; j++){
thrust::complex<double> tmp( xRange[0]+j*D, yRange[0]+i*D );
c[i][j] = tmp;
//n[i*sqrt(N)+j] = 0;
}
}
// CUDA malloc
cout << "CUDA malloc initializing..." << endl;
int **dN;
thrust::complex<double> **dC;
cudaError_t error;
error = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 7*1024*1024*1024);
if(error != cudaSuccess){
cout << "cudaDeviceSetLimit's ERROR CODE = " << error << endl;
return 0;
}
size_t tmpPitch;
error = cudaMallocPitch((void **)dN, &tmpPitch,(size_t)(indexTotalY*sizeof(int)), (size_t)(indexTotalX*sizeof(int)));
if(error != cudaSuccess){
cout << "CUDA ERROR CODE = " << error << endl;
cout << "indexTotalX = " << indexTotalX << endl;
cout << "indexTotalY = " << indexTotalY << endl;
return 0;
}
cout << "CUDA malloc done!" << endl;
The console output is below.
debug_n
debug_n_end
CUDA malloc initializing...
CUDA ERROR CODE = 1
indexTotalX = 8000
indexTotalY = 20000
There are several problems here:
int **dN;
...
error = cudaMallocPitch((void **)dN, &tmpPitch,(size_t)(indexTotalY*sizeof(int)), (size_t)(indexTotalX*sizeof(int)));
The correct type of pointer to use in CUDA allocations is a single pointer:
int *dN;
not a double pointer:
int **dN;
(so your kernel, where you are trying to pass triple pointers:
void calculation(const int indexTotalX, const int indexTotalY, int ***n, thrust::complex<double> ***c){ // n, c are the pointers of dN, dC.
is almost certainly not going to work, and should not be designed that way, but that is not the question you are asking.)
The pointer is passed to the allocating function by its address:
error = cudaMallocPitch((void **)&dN,
For cudaMallocPitch, only the horizontal requested dimension is scaled by the size of the data element. The allocation height is not scaled this way. Also, I will assume X corresponds to your allocation width, and Y corresponds to your allocation height, so you also have those parameters reversed:
error = cudaMallocPitch((void **)&dN, &tmpPitch,(size_t)(indexTotalX*sizeof(int)), (size_t)(indexTotalY));
Setting cudaLimitMallocHeapSize should not be necessary to make any of this work. It applies only to in-kernel allocations. Reserving 7GB on an 8GB card may also cause problems. Until you are sure you need it (it's not needed for what you have shown), I would simply remove it.
$ cat t1488.cu
#include <iostream>
#include <thrust/complex.h>
#include <fstream>
#include <string>
#include <stdlib.h>
using namespace std;
#define D 0.0000025 // Tick
#define LIMIT_N 255
#define INF_NUM 2
#define PLOT_METHOD 2 // dat file : 0, ppm file : 1, ppm file with C : 2
__global__
void calculation(const int indexTotalX, const int indexTotalY, int ***n, thrust::complex<double> ***c){ // n, c are the pointers of dN, dC.
for(int i = 0; i < indexTotalY ; i++){
for(int j = 0; j < indexTotalX; j++){
thrust::complex<double> z(0.0f, 0.0f);
n[i][j] = 0;
for(int ctr=1; ctr <= LIMIT_N ; ctr++){
z = z*z + (*(c[i][j]));
n[i][j] = n[i][j] + (abs(z) < INF_NUM);
}
}
}
}
int main(){
// Data Path
string filePath = "Y:\\Documents\\Programming\\mandelbrot\\";
string fileName = "mandelbrot4.ppm";
string filename = filePath+fileName;
//complex<double> c[N][M];
double xRange[2] = {-0.76, -0.74};
double yRange[2] = {0.05, 0.1};
const int indexTotalX = (xRange[1]-xRange[0])/D;
const int indexTotalY = (yRange[1]-yRange[0])/D;
thrust::complex<double> **c;
//c = new complex<double> [N];
cout << "debug_n" << endl;
int **n;
n = new int* [indexTotalY];
c = new thrust::complex<double> * [indexTotalY];
for(int i=0;i<indexTotalY;i++){
n[i] = new int [indexTotalX];
c[i] = new thrust::complex<double> [indexTotalX];
}
cout << "debug_n_end" << endl;
for(int i = 0; i < indexTotalY; i++){
for(int j = 0; j < indexTotalX; j++){
thrust::complex<double> tmp( xRange[0]+j*D, yRange[0]+i*D );
c[i][j] = tmp;
//n[i*sqrt(N)+j] = 0;
}
}
// CUDA malloc
cout << "CUDA malloc initializing..." << endl;
int *dN;
thrust::complex<double> **dC;
cudaError_t error;
size_t tmpPitch;
error = cudaMallocPitch((void **)&dN, &tmpPitch,(size_t)(indexTotalX*sizeof(int)), (size_t)(indexTotalY));
if(error != cudaSuccess){
cout << "CUDA ERROR CODE = " << error << endl;
cout << "indexTotalX = " << indexTotalX << endl;
cout << "indexTotalY = " << indexTotalY << endl;
return 0;
}
cout << "CUDA malloc done!" << endl;
}
$ nvcc -o t1488 t1488.cu
t1488.cu(68): warning: variable "dC" was declared but never referenced
$ cuda-memcheck ./t1488
========= CUDA-MEMCHECK
debug_n
debug_n_end
CUDA malloc initializing...
CUDA malloc done!
========= ERROR SUMMARY: 0 errors
$
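The width-in-bytes versus height-in-rows distinction can be illustrated on the host with plain C++. This is only a sketch of the layout arithmetic: the helper names are hypothetical, and the 512-byte alignment used in the note below is an assumed value, since the real pitch is whatever cudaMallocPitch returns in its second argument.

```cpp
#include <cstddef>

// Hypothetical helper: rounds a row width in bytes up to a multiple of
// `alignment`. The actual pitch is chosen by cudaMallocPitch, not by this.
std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment) {
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// Byte offset of element (row, col) in a pitched allocation:
// rows are `pitch` bytes apart, columns are `elem_size` bytes apart.
std::size_t pitched_offset(std::size_t row, std::size_t col,
                           std::size_t pitch, std::size_t elem_size) {
    return row * pitch + col * elem_size;
}

// Total bytes occupied: the pitch (already in bytes) times the row count.
// The height is a plain count and is never multiplied by the element size.
std::size_t pitched_bytes(std::size_t pitch, std::size_t rows) {
    return pitch * rows;
}
```

For the sizes in the question (8000 columns of 4-byte int, 20000 rows), a row is 32000 bytes wide; with the assumed 512-byte alignment that pads to 32256 bytes per row, so element (2, 3) sits 64524 bytes from the base address.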

namespace::function cannot be used as a function

main.cpp
#include "Primes.h"
#include <iostream>
int main(){
std::string choose;
int num1, num2;
while(1 == 1){
std::cout << "INSTRUCTIONS" << std::endl << "Enter:" << std::endl
<< "'c' to check whether a number is a prime," << std::endl
<< "'u' to view all the prime numbers between two numbers "
<< "that you want," << std::endl << "'x' to exit,"
<< std::endl << "Enter what you would like to do: ";
std::cin >> choose;
std::cout << std::endl;
if(choose == "c"){
std::cout << "Enter number: ";
std::cin >> num1;
Primes::checkPrimeness(num1) == 1 ?
std::cout << num1 << " is a prime." << std::endl << std::endl :
std::cout << num1 << " isn't a prime." << std::endl << std::endl;
}else if(choose == "u"){
std::cout << "Enter the number you want to start seeing primes "
<< "from: ";
std::cin >> num1;
std::cout << "\nEnter the number you want to stop seeing primes "
<< "till: ";
std::cin >> num2;
std::cout << std::endl;
for(num1; num1 <= num2; num1++){
Primes::checkPrimeness(num1) == 1 ?
std::cout << num1 << " is a prime." << std::endl :
std::cout << num1 << " isn't a prime." << std::endl;
}
}else if(choose == "x"){
return 0;
}
std::cout << std::endl;
}
}
Primes.h
#ifndef PRIMES_H
#define PRIMES_H
namespace Primes{
extern int num, count;
extern bool testPrime;
// Returns true if the number is a prime and false if it isn't.
int checkPrimeness(num);
}
#endif
Primes.cpp
#include "Primes.h"
#include <iostream>
int Primes::checkPrimeness(num){
if(num < 2){
return(0);
}else if(num == 2){
return(1);
}else{
for(count = 0; count < num; count++){
for(count = 2; count < num; count++){
if(num % count == 0){
return(0);
}else{
testPrime = true;
if(count == --num && testPrime == true){
return(1);
}
}
}
}
}
}
I get the following 3 errors: Errors from terminal
I've spent hours on this over several days and still can't seem to fix the errors.
I've tried using extern and pretty much everything I can imagine.
There is an error in the function declaration:
int checkPrimeness(num);
defines an integer variable named checkPrimeness initialized with num, not a function! To declare a function, give the parameter a type:
int checkPrimeness(int);
The definition in Primes.cpp needs the same fix: int Primes::checkPrimeness(int num). I can't see why you declare the parameters as extern variables. To separate declaration from implementation, declare all functions and classes in the header file and define them in the source file.
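A minimal sketch of that split, shown in one listing for brevity. The bool return type and the trial-division loop are choices made for this illustration, not the original poster's code; the parameter and the loop counter are ordinary local variables, so no extern declarations are needed.

```cpp
// --- what would go in Primes.h ---
namespace Primes {
    // Returns true if num is a prime and false if it isn't.
    bool checkPrimeness(int num);
}

// --- what would go in Primes.cpp ---
bool Primes::checkPrimeness(int num) {
    if (num < 2)
        return false;
    // Trial division; "divisor <= num / divisor" stops at the square root
    // without risking overflow from divisor * divisor.
    for (int divisor = 2; divisor <= num / divisor; ++divisor) {
        if (num % divisor == 0)
            return false;
    }
    return true;
}
```

The existing calls in main.cpp of the form Primes::checkPrimeness(num1) == 1 still compile with this version, because true converts to 1 in the comparison.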

Error: 2.5e-1 cannot be used as a function

I wrote a simple program and I'm getting this error, which I have never encountered before. Can you help me out?
line 13: error: 2.5e-1 cannot be used as a function
#include <iostream>
#include <iomanip>
using namespace std;
int dirac(int);
int main()
{
float y;
for(int k = 0; k <= 4; k++){
y = 2*dirac(k)-0.5*dirac(k-1)*0.25(2*dirac(k-2)-0.5*dirac(k-3));
cout << "k = " << k << ": ";
cout << setw(8) << setfill(' ');
cout << setprecision(3) << fixed << y << endl;
}
return 0;
}
int dirac(int x){
if(x == 0){
x = 1;
return x;
}else{
x = 0;
return x;
}
}
y = 2*dirac(k)-0.5*dirac(k-1)*0.25(2*dirac(k-2)-0.5*dirac(k-3));
                                  ^---
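You probably forgot a * at the indicated spot, between 0.25 and the opening parenthesis. C++ has no implicit multiplication, so 0.25(...) is parsed as an attempt to call 0.25 as a function.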
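With the * written out, the expression compiles. The sketch below assumes multiplication really is what was intended at that spot; y_at is a hypothetical wrapper around the loop body, added so the value for a given k can be inspected.

```cpp
// Unit impulse: 1 at x == 0, otherwise 0 (same behaviour as the question's dirac).
int dirac(int x) {
    return x == 0 ? 1 : 0;
}

// The expression from the question with the missing '*' inserted
// between 0.25 and the parenthesised term.
float y_at(int k) {
    return static_cast<float>(
        2 * dirac(k)
        - 0.5 * dirac(k - 1) * 0.25 * (2 * dirac(k - 2) - 0.5 * dirac(k - 3)));
}
```

With this form, y_at(0) is 2 and y_at(k) is 0 for k from 1 to 4, because dirac(k - 1) and the parenthesised term are never non-zero for the same k. If that is not the expected result, the operator that was meant at that position may be something other than *.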

thrust::device_vector not working

I have written some code using Thrust. I am pasting the code and its output below. Strangely, when the device_vector line is reached during execution, the program just hangs and no more output appears. It was working in the morning. Please help me.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>
int main(void)
{
// H has storage for 4 integers
thrust::host_vector<int> H(4);
// initialize individual elements
H[0] = 14;
H[1] = 20;
H[2] = 38;
H[3] = 46;
// H.size() returns the size of vector H
std::cout << "H has size " << H.size() << std::endl;
// print contents of H
for(size_t i = 0; i < H.size(); i++)
std::cout << "H[" << i << "] = " << H[i] << std::endl;
// resize H
H.resize(2);
std::cout << "H now has size " << H.size() << std::endl;
// Copy host_vector H to device_vector D
thrust::device_vector<int> D = H;
// elements of D can be modified
D[0] = 99;
D[1] = 88;
// print contents of D
for(size_t i = 0; i < D.size(); i++)
std::cout << "D[" << i << "] = " << D[i] << std::endl;
// H and D are automatically deleted when the function returns
return 0;
}
The output is :
H has size 4
H[0] = 14
H[1] = 20
H[2] = 38
H[3] = 46
H now has size 2
* After this nothing happens
Run the deviceQuery sample from the CUDA samples to check that the GPU is still detected. If the code was working in the morning, I am confident that the problem is due to the graphics card.

Why does this CUDA code for calculating a Mandelbrot set fail when setting the maximum iteration count higher than 5,500,000?

I'm writing a code synthesizer which converts high-level models into CUDA C code. As a test model, I'm using a Mandelbrot generator application which computes the iteration count for each X-Y coordinate in parallel on a GPGPU. The image is 70x70 pixels, and the X-Y coordinates range from (-1, -1) to (1, 1). For simplicity, the application expects a large float array, where each group of 3 elements contains the X and Y coordinates, followed by the maximum iteration count. Each thread on the GPGPU receives a pointer to the beginning of its 3-element group and calculates the iteration count.
The synthesized CUDA code works perfectly when the maximum iteration count is less than 5,500,000, but when it goes higher than that the output becomes completely bogus. To illustrate, see the examples below:
Normal output when max_it is set to 5,000,000:
output[0]: 3
output[1]: 3
output[2]: 3
output[3]: 3
output[4]: 3
output[5]: 3
output[6]: 3
output[7]: 3
output[8]: 3
output[9]: 4
output[10]: 4
output[11]: 4
output[12]: 4
output[13]: 4
output[14]: 4
output[15]: 5
output[16]: 5
output[17]: 5
output[18]: 5
output[19]: 5
output[20]: 6
output[21]: 7
output[22]: 9
output[23]: 11
output[24]: 19
output[25]: 5000000
output[26]: 5000000
output[27]: 5000000
...
output[4878]: 2
output[4879]: 2
output[4880]: 2
output[4881]: 2
output[4882]: 2
output[4883]: 2
output[4884]: 2
output[4885]: 2
output[4886]: 2
output[4887]: 2
output[4888]: 2
output[4889]: 2
output[4890]: 2
output[4891]: 2
output[4892]: 2
output[4893]: 2
output[4894]: 2
output[4895]: 2
output[4896]: 2
output[4897]: 2
output[4898]: 2
output[4899]: 2
Bogus output when max_it is set to 6,000,000:
output[0]: 0
output[1]: 0
output[2]: 0
output[3]: 0
output[4]: 0
output[5]: 0
output[6]: 0
output[7]: 0
output[8]: 0
output[9]: 0
output[10]: 0
output[11]: 0
output[12]: 0
output[13]: 0
output[14]: 0
output[15]: 0
output[16]: 0
output[17]: 0
output[18]: 0
output[19]: 0
output[20]: 0
output[21]: 0
output[22]: 0
output[23]: 0
output[24]: 0
output[25]: 0
output[26]: 0
output[27]: 0
...
output[4877]: 0
output[4878]: -1161699328
output[4879]: 32649
output[4880]: -1698402160
output[4881]: 32767
output[4882]: -1177507963
output[4883]: 32649
output[4884]: 6431616
output[4885]: 0
output[4886]: -1174325376
output[4887]: 32649
output[4888]: -1698402384
output[4889]: 32767
output[4890]: 4199904
output[4891]: 0
output[4892]: -1698402160
output[4893]: 32767
output[4894]: -1177511704
output[4895]: 32649
output[4896]: -1174325376
output[4897]: 32649
output[4898]: -1177559142
output[4899]: 32649
And here follows the code:
mandelbrot.cpp (main file)
#include "mandelbrot.h"
#include <iostream>
#include <cstdlib>
using namespace std;
int main(int argc, char** argv) {
const int kNumPixelsRow = 70;
const int kNumPixelsCol = 70;
if (argc != 6) {
cout << "Must provide 5 arguments: " << endl
<< " #1: Lower left corner X coordinate (x0)" << endl
<< " #2: Lower left corner Y coordinate (y0)" << endl
<< " #3: Upper right corner X coordinate (x1)" << endl
<< " #4: Upper right corner Y coordinate (y1)" << endl
<< " #5: Maximum number of iterations" << endl;
return 0;
}
float x0 = (float) atof(argv[1]);
if (x0 < -2.5) {
cout << "x0 is too small, must be larger than -2.5" << endl;
return 0;
}
float y0 = (float) atof(argv[2]);
if (y0 < -1) {
cout << "y0 is too small, must be larger than -1" << endl;
return 0;
}
float x1 = (float) atof(argv[3]);
if (x1 > 1) {
cout << "x1 is too large, must be smaller than 1" << endl;
return 0;
}
float y1 = (float) atof(argv[4]);
if (y1 > 1) {
cout << "y1 is too large, must be smaller than 1" << endl;
return 0;
}
int max_it = atoi(argv[5]);
if (max_it <= 0) {
cout << "max_it is too small, must be larger than 0" << endl;
return 0;
}
cout << "Generating input data..." << endl;
float input_array[kNumPixelsRow][kNumPixelsCol][3];
float delta_x = (x1 - x0) / kNumPixelsRow;
float delta_y = (y1 - y0) / kNumPixelsCol;
for (int x = 0; x < kNumPixelsCol; ++x) {
for (int y = 0; y < kNumPixelsRow; ++y) {
if (x == 0) {
input_array[x][y][0] = x0;
}
else {
input_array[x][y][0] = input_array[x - 1][y][0] + delta_x;
}
if (y == 0) {
input_array[x][y][1] = y0;
}
else {
input_array[x][y][1] = input_array[x][y - 1][1] + delta_y;
}
input_array[x][y][2] = (float) max_it;
}
}
cout << "Executing..." << endl;
struct ModelOutput output = executeModel((float*) input_array);
cout << "Done." << endl;
for (int i = 0; i < kNumPixelsRow * kNumPixelsCol; ++i) {
cout << "output[" << i << "]: " << output.value1[i] << endl;
}
return 0;
}
mandelbrot.h (header file)
////////////////////////////////////////////////////////////
// AUTO-GENERATED BY f2cc 0.1
////////////////////////////////////////////////////////////
/**
* C struct for retrieving the output values from the model.
* This is needed since C functions can only return a single
* value.
*/
struct ModelOutput {
/**
* Output from process "parallelmapSY_1".
*/
int value1[4900];
};
/**
* Executes the model.
*
* @param input1
* Input to process "parallelmapSY_1".
* Expects an array of size 14700.
* @returns A struct containing the model outputs.
*/
struct ModelOutput executeModel(const float* input1);
mandelbrot.cu (CUDA file)
////////////////////////////////////////////////////////////
// AUTO-GENERATED BY f2cc 0.1
////////////////////////////////////////////////////////////
#include "mandelbrot.h"
__device__
int parallelmapSY_1_func1(const float* args) {
float x0 = args[0];
float y0 = args[1];
int max_it = (int) args[2];
float x = 0;
float y = 0;
int i = 0;
while (x*x + y*y < (2*2) && i < max_it) {
float x_temp = x*x - y*y + x0;
y = 2*x*y + y0;
x = x_temp;
++i;
}
return i;
}
__global__
void parallelmapSY_1__kernel(const float* input, int* output) {
unsigned int index = (blockIdx.x * blockDim.x + threadIdx.x);
if (index < 4900) {
output[index] = parallelmapSY_1_func1(&input[index * 3]);
}
}
void parallelmapSY_1__kernel_wrapper(const float* input, int* output) {
float* device_input;
int* device_output;
struct cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int max_block_size = prop.maxThreadsPerBlock;
int num_blocks = (4900 + max_block_size - 1) / max_block_size;
cudaMalloc((void**) &device_input, 14700 * sizeof(float));
cudaMalloc((void**) &device_output, 4900 * sizeof(int));
cudaMemcpy((void*) device_input, (void*) input, 14700 * sizeof(float), cudaMemcpyHostToDevice);
dim3 grid(num_blocks, 1);
dim3 blocks(max_block_size, 1);
parallelmapSY_1__kernel<<<grid, blocks>>>(device_input, device_output);
cudaMemcpy((void*) output, (void*) device_output, 4900 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree((void*) device_input);
cudaFree((void*) device_output);
}
struct ModelOutput executeModel(const float* input1) {
// Declare signal variables
// Signals part of DelaySY processes are also initiated with delay value
float model_input_to_parallelmapSY_1_in[14700];
int parallelmapSY_1_out_to_model_output[4900];
// Copy model inputs to signal variables
for (int i = 0; i < 14700; ++i) {
model_input_to_parallelmapSY_1_in[i] = input1[i];
}
// Execute processes
parallelmapSY_1__kernel_wrapper(model_input_to_parallelmapSY_1_in, parallelmapSY_1_out_to_model_output);
// Copy model output values to return container
struct ModelOutput outputs;
for (int i = 0; i < 4900; ++i) {
outputs.value1[i] = parallelmapSY_1_out_to_model_output[i];
}
return outputs;
}
The interesting file is mandelbrot.cu as that contains the computational code; mandelbrot.cpp is just a driver to get user input and generate input data, and mandelbrot.h is just a header file so that mandelbrot.cpp can easily use mandelbrot.cu.
The function executeModel() is a wrapper function which takes care of propagating data between the processes in the model. In this case there is only one process so executeModel() is rather pointless.
parallelmapSY_1__kernel_wrapper() prepares the parallel execution by allocating memory on the device, transfers the input data, invokes the kernel, and transfers the result back to the host.
parallelmapSY_1__kernel() is the kernel function, which simply calls parallelmapSY_1_func1() with the appropriate input data. It also prevents execution when too many threads have been spawned.
So the real area of interest is parallelmapSY_1_func1(). As I said, it works perfectly when the maximum iteration count is less than 5,500,000, but when I go higher it just doesn't seem to work as it's supposed to (see output log above). Some may ask "Why are you setting the iteration count so high? That's not necessary!". True, but since the pure C equivalent works perfectly with higher maximum iteration counts, why shouldn't the CUDA version? Since I'm designing a general tool, I need to know why it doesn't work in this example.
So does anyone have any idea why the code fails when the maximum iteration count exceeds 5,500,000?
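It may be a time-out problem: when the video card is also driving a display, the OS can abort a CUDA kernel that runs for too long. Checking the return values of the CUDA calls, which the code currently ignores, would help confirm this. See e.g. "CUDA apps time out & fail after several seconds - how to work around this?"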