Moving parts of data inside a single cudaArray with cudaMemcpy3D possible?

I want to implement a shift operation for a volume texture in CUDA. My idea is to run several iterations of a memcpy operation that moves data inside a cudaArray from one position to another.
What am I doing wrong? I always get an 'invalid argument' error. Here is a sketch of what I am doing:
/* My volume texture */
cudaArray* g_pVolumeTexture; // its size is 256^3 voxels of type uchar2
...
cudaMemcpy3DParms prms;
prms.srcArray = g_pVolumeTexture;
prms.dstArray = g_pVolumeTexture; // src = dst, because I wanna rather shift than
// copy
prms.extent = make_cudaExtent(24, 256, 256);
prms.srcPos = make_cudaPos(0, 0, 0);
prms.dstPos = make_cudaPos(48, 0, 0); // this will mean a move of 48 voxels in
// x-direction; the piece of data moved
// measures 24 voxels in x-direction
cudaMemcpy3D(&prms);
// Here cudaGetLastError always returns 'invalid argument error'

The answer is yes: it is possible to call cudaMemcpy3D with the same cudaArray as both srcArray and dstArray. The problem I faced was caused by not zero-initializing the cudaMemcpy3DParms struct before filling it in:
cudaMemcpy3DParms p = {0};
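To illustrate why the zero-initialization matters, here is a minimal host-only C++ sketch. CopyParms, Pos3, Extent3 and makeShiftParms are hypothetical stand-ins that only mimic the shape of cudaMemcpy3DParms; they are not the CUDA types. The point is that `= {0}` clears every member the caller never touches, whereas a plain declaration leaves them holding garbage.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins (assumption): they only mimic the layout idea of
// cudaPos, cudaExtent and cudaMemcpy3DParms so the sketch runs without CUDA.
struct Pos3    { std::size_t x, y, z; };
struct Extent3 { std::size_t width, height, depth; };
struct CopyParms {
    void*   srcArray;
    Pos3    srcPos;
    void*   srcPtr;    // unused in an array-to-array copy
    void*   dstArray;
    Pos3    dstPos;
    void*   dstPtr;    // unused in an array-to-array copy
    Extent3 extent;
    int     kind;
};

// Builds the parameters for the in-array shift described in the question:
// a 24 x 256 x 256 block moved from x = 0 to x = 48 within the same volume.
CopyParms makeShiftParms(void* volume) {
    assert(volume != nullptr);
    CopyParms p = {0};           // zero every member before setting any of them
    p.srcArray = volume;
    p.dstArray = volume;         // same array as source and destination
    p.srcPos   = {0, 0, 0};
    p.dstPos   = {48, 0, 0};
    p.extent   = {24, 256, 256};
    // Real code would probably also set p.kind explicitly (for example to
    // cudaMemcpyDeviceToDevice); this sketch leaves it at zero.
    return p;
}
```

After makeShiftParms returns, srcPtr and dstPtr are guaranteed to be null, which is what the uninitialized version in the question could not promise.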


step doubling Runge Kutta implementation stuck shrinking stepsize to machine precision

I need to integrate a system of ODEs using an adaptive RK4 method with stepsize control via step doubling.
The problem is that the program runs forever, shrinking the stepsize down to machine precision without advancing time.
The idea is to advance the solution once by a single step and once by two successive half steps, take the difference between the two results and store it in eps, so that eps is a measure of the error. I then want to choose the next stepsize according to whether eps is greater than a specified accuracy eps0 (as described in the book "Numerical Recipes").
RK4Step(double t, double* Y, double *Yout, void (*RHSFunc)(double, double *, double *),double h) advances the solution vector Y by h using the function RHSFunc and puts the result into Yout.
#define NEQ 4 //problem dimension
int main(int argc, char* argv[])
{
ofstream frames("./frames.dat");
ofstream graphs("./graphs.dat");
double Y[4] = {2.0, 2.0, 1.0, 0.0}; //initial conditions for solution vector
double finaltime = 100; //end of integration
double eps0 = 10e-5; //error to compare with eps
double t = 0.0;
double step = 0.01;
while(t < finaltime)
{
double eps = 0.0;
double Y1[4], Y2[4]; //Y1 will store half step solution
//Y2 will store double step solution
double dt = step; //cache current stepsize
for(;;)
{
//make a step starting from state stored in Y and
//put solution into Y1. Then from Y1 make another half step
//and store into Y1.
RK4Step(t, Y, Y1, RHS, step); //two half steps
RK4Step(t+step,Y1, Y1, RHS, step);
RK4Step(t, Y, Y2, RHS, 2*step); //one long step
//compute eps as maximum of differences between Y1 and Y2
//(an alternative would be quadrature sums)
for(int i=0; i<NEQ; i++)
eps=max(eps, fabs( (Y1[i]-Y2[i])/15.0 ) );
//if error is within tolerance we grow stepsize
//and advance time
if(eps < eps0)
{
//stepsize is accepted, grow stepsize
//save solution from Y1 into Y,
//advance time by the previous (cached) stepsize
Y[0] = Y1[0]; Y[1] = Y1[1];
Y[2] = Y1[2]; Y[3] = Y1[3];
step = 0.9*step*pow(eps0/eps, 0.20); //(0.9 is the safety factor)
t+=dt;
break;
}
//if the error is too big we shrink stepsize
step = 0.9*step*pow(eps0/eps, 0.25);
}
}
frames.close();
graphs.close();
return 0;
}
You never reset eps in the inner loop, which could be the direct cause of your problem. While the actual error shrinks with ever-decreasing step sizes, the maximum stored in eps stays constant and above eps0. The step size update therefore applies the same reduction factor on every pass, and the loop never gets a chance to break.
Another "wrong thing" is that the error estimate and the tolerance are incompatible. The tolerance eps0 is an error density, or unit-step error. To bring your error estimate eps into that format you need to divide eps by step. Put another way, you are currently forcing the actual step error to be close to 0.5*eps0, so the global error is 0.5*eps0 times the number of steps taken, with the number of steps loosely proportional to eps0^(-0.2). In the version using the unit-step error, the local error is forced to be "dynamically" close to 0.5*eps0*step, so the global error is about 0.5*eps0 times the length of the integration interval. I'd say the second variant is more in line with intuition about the expected behavior.
This is not a critical error, but may lead to sub-optimal step sizes and an actual global error that deviates non-trivially from the desired error tolerance.
You also have a coding inconsistency: the declarations of the state vectors and the propagation of the state hard-code 4 components, while the error computation loops over NEQ components. As you are using C++, you could use a state vector class that handles all dimension-dependent loops internally. (If taken too far, frequent allocation of short-lived instances could become an efficiency issue.)
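Putting these points together, a minimal sketch of the corrected loop might look like the following. It is not a drop-in replacement for the code in the question. The test problem (a harmonic oscillator), the names rhs, rk4Step and integrate, the single exponent 0.25 and the clamping of the step factor to [0.1, 5] are all assumptions of this sketch. What it demonstrates is eps being reset on every attempt, the estimate being divided by the step to get a unit-step error, and std::array keeping the dimension in one place.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

constexpr int NEQ = 2;                       // problem dimension, used everywhere
using State = std::array<double, NEQ>;

// Assumed test problem: harmonic oscillator y0' = y1, y1' = -y0.
State rhs(double /*t*/, const State& y) { return {y[1], -y[0]}; }

// One classical RK4 step of size h starting from (t, y).
State rk4Step(double t, const State& y, double h) {
    State k1 = rhs(t, y), tmp, out;
    for (int i = 0; i < NEQ; ++i) tmp[i] = y[i] + 0.5 * h * k1[i];
    State k2 = rhs(t + 0.5 * h, tmp);
    for (int i = 0; i < NEQ; ++i) tmp[i] = y[i] + 0.5 * h * k2[i];
    State k3 = rhs(t + 0.5 * h, tmp);
    for (int i = 0; i < NEQ; ++i) tmp[i] = y[i] + h * k3[i];
    State k4 = rhs(t + h, tmp);
    for (int i = 0; i < NEQ; ++i)
        out[i] = y[i] + h / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    return out;
}

// Integrates from t = 0 to tEnd with step doubling; eps0 is a unit-step tolerance.
State integrate(State y, double tEnd, double eps0, double step) {
    assert(eps0 > 0.0 && step > 0.0);
    double t = 0.0;
    while (t < tEnd) {
        for (;;) {
            const bool last = step >= tEnd - t;      // do not overshoot tEnd
            const double h = last ? tEnd - t : step;
            State half = rk4Step(t, y, 0.5 * h);     // two half steps
            half = rk4Step(t + 0.5 * h, half, 0.5 * h);
            const State full = rk4Step(t, y, h);     // one full step
            double eps = 0.0;                        // reset on every attempt
            for (int i = 0; i < NEQ; ++i)
                eps = std::max(eps, std::fabs(half[i] - full[i]) / 15.0);
            eps /= h;                                // error per unit step
            // The unit-step error scales like h^4, hence the exponent 1/4.
            double factor = eps > 0.0 ? 0.9 * std::pow(eps0 / eps, 0.25) : 5.0;
            factor = std::min(5.0, std::max(0.1, factor));
            step = h * factor;                       // used by the next attempt
            if (eps < eps0) {                        // accept: advance by h
                y = half;
                t = last ? tEnd : t + h;
                break;
            }
        }
    }
    return y;
}
```

For example, integrate({1.0, 0.0}, 1.0, 1e-8, 0.1) should land close to (cos 1, -sin 1). Note that the accepted state comes from two half steps of total length h, so time advances by exactly the interval the solution covers.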

(cudaBindTexture2D) How to bind a pitched-array from the middle

I am trying to bind only part of a pitched array to a texture, starting from the middle of the array rather than from its beginning, as follows.
/* 1. allocate */
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output, &FloatPitch, cols*sizeof(float), rows);
/* 2. set row-length of target region (i.e., dividing rows 10 times) */
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows*FloatPitch/sizeof(float);
dim3 threads(16,16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (part_rows + threads.y - 1) / threads.y);
/* 3. processing divided rows, iteratively */
for (int i = 0; i < row_div_times; i++)
{
size_t offsetsize= i*part_offset;
/*computing values of "d_texinput"*/
calibration<<<Part_Blocks, threads, 0, stream[i]>>>
(d_texinput + i*part_offset );
/*
//###(QUESTION point!) I want to bind the device memory "d_texinput" to texture "tex_mem" only partly like below.
cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, part_rows, FloatPitch); //tentative code a;
,,, or something like,,,
cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, part_rows, FloatPitch); //tentative code b;
*/
//final computation with texture
final_computationwithtexture<<<Part_Blocks, threads, 0, stream[i]>>>
( d_output + i*part_offset );
cudaUnbindTexture(tex_mem);
}
How should I revise the code at the QUESTION point above so that only the target region of the device memory array is bound to the texture?
I first took the first argument of cudaBindTexture2D to be an "offset" value, but according to the documentation it is not a value, it is an address.
I still do not understand the documentation.
I hope that seeing the correct way to call cudaBindTexture2D will make it clear.
The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you bind &float[66], the proper alignment might be &float[64], so CUDA starts its texture at that offset and you have to add an offset of 8 bytes for each access to get the desired result. I'm picking random numbers here, I don't know the alignment requirements.
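The round-down arithmetic can be sketched in plain C++. This is only an illustration of the idea, not what the CUDA runtime does internally. The 256-byte alignment is an assumed value chosen so the numbers match the example above (on a real device the requirement is reported in cudaDeviceProp::textureAlignment), and AlignedBinding and alignDown are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Result of rounding a requested byte address down to an alignment boundary.
struct AlignedBinding {
    std::size_t baseBytes;    // where the texture would actually start
    std::size_t offsetBytes;  // what must be added back on every access
};

// Rounds requestedBytes down to a multiple of alignment (a power of two)
// and reports the remainder as the offset the caller has to apply.
AlignedBinding alignDown(std::size_t requestedBytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::size_t base = requestedBytes & ~(alignment - 1);
    return {base, requestedBytes - base};
}
```

With 4-byte floats and the assumed 256-byte alignment, alignDown(66 * sizeof(float), 256) gives a base of 256 bytes (element 64) and an offset of 8 bytes, i.e. 2 floats. That offset is the value the function writes through its first argument.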

3D binary image to 3D mesh using itk

I'm trying to generate a 3D mesh from a 3D RLE binary mask.
In ITK, I found a class named itkBinaryMask3DMeshSource, which is based on the marching cubes algorithm.
Some examples, such as ExtractIsoSurface, use this class.
In my case, I have an RLE 3D binary mask, but it is represented as a 1D vector.
I'm writing a function for this task. It takes the following parameters:
Inputs: crle 1D vector (computed RLE), dimension Int3
Output: coord + coord indices (or a single generated file containing both arrays, which I can then use to visualize the mesh)
As a first step, I decode the computed RLE.
Next, I use an image iterator to create an image compatible with BinaryMask3DMeshSource.
I'm stuck on the last step.
This is my code:
void GenerateMeshFromCrle(const std::vector<int>& crle, const Int3 & dim,
std::vector<float>* coords, std::vector<int>*coord_indices, int* nodes,
int* cells, const char* outputmeshfile) {
std::vector<int> mask(crle.back());
CrleDecode(crle, mask.data());
// here we define our itk Image type with a 3 dimension
using ImageType = itk::Image< unsigned char, 3 >;
ImageType::Pointer image = ImageType::New();
// an Image is defined by start index and size for each axes
// By default, we set the first start index from x=0,y=0,z=0
ImageType::IndexType start;
start[0] = 0; // first index on X
start[1] = 0; // first index on Y
start[2] = 0; // first index on Z
// until here, no problem
// We set the image size on x,y,z from the dim input parameters
// itk takes Z Y X
ImageType::SizeType size;
size[0] = dim.z; // size along X
size[1] = dim.y; // size along Y
size[2] = dim.x; // size along Z
ImageType::RegionType region;
region.SetSize(size);
region.SetIndex(start);
image->SetRegions(region);
image->Allocate();
// Set the pixels to value from rle
// This is a fast way
itk::ImageRegionIterator<ImageType> imageIterator(image, region);
int n = 0;
while (!imageIterator.IsAtEnd() && n < mask.size()) {
// Set the current pixel to the value from rle
imageIterator.Set(mask[n]);
++imageIterator;
++n;
}
// In this step, we launch itkBinaryMask3DMeshSource
using BinaryThresholdFilterType = itk::BinaryThresholdImageFilter< ImageType, ImageType >;
BinaryThresholdFilterType::Pointer threshold =
BinaryThresholdFilterType::New();
threshold->SetInput(image->GetOutput()); // here it's an error, since no GetOutput member for image
threshold->SetLowerThreshold(0);
threshold->SetUpperThreshold(1);
threshold->SetOutsideValue(0);
using MeshType = itk::Mesh< double, 3 >;
using FilterType = itk::BinaryMask3DMeshSource< ImageType, MeshType >;
FilterType::Pointer filter = FilterType::New();
filter->SetInput(threshold->GetOutput());
filter->SetObjectValue(1);
using WriterType = itk::MeshFileWriter< MeshType >;
WriterType::Pointer writer = WriterType::New();
writer->SetFileName(outputmeshfile);
writer->SetInput(filter->GetOutput());
}
Any ideas?
I appreciate your time.
Since image is not a filter, you can plug it in directly: threshold->SetInput(image);. At the end of this function, you also need writer->Update();. The rest looks good.
Side note: it looks like you might benefit from using an import filter (itk::ImportImageFilter) instead of manually iterating over the buffer and copying values one at a time.

CMSIS real-FFT on 8192 samples in Q15

I need to perform an FFT on a block of 8192 samples on an STM32F446 microcontroller.
For that I wanted to use the CMSIS DSP library as it's available easily and optimised for the STM32F4.
My 8192 input samples will ultimately be values from the internal 12-bit ADC (left-aligned and converted to q15 by flipping the sign bit), but for testing purposes I'm feeding the FFT with test buffers.
With CMSIS's FFT functions, only the Q15 version supports a length of 8192. Thus I am using arm_rfft_q15().
Because the FFT functions of the CMSIS library include about 32k of LUTs by default, to adapt to many FFT lengths, I have "rewritten" them to remove all the tables corresponding to lengths other than the one I'm interested in. I haven't touched anything except removing the unused code.
My samples are stored in an external SDRAM that I access via DMA.
When using the FFT, I have several problems:
Both my source buffer and my destination buffer get modified;
the result is not at all as expected.
To confirm that the results were wrong, I ran an IFFT right after the FFT, but it just confirmed that the code wasn't working.
Here is my code:
status_codes FSM::fft_state(void)
{
// Flush the SDRAM section
si_ovf_buf_clr_u16((uint16_t *)0xC0000000, 8192);
q15_t* buf = (q15_t*)(0xC0000000);
for(int i = 0; i<50; i++)
buf[i] = 0x0FFF; // Fill the buffer with test vector (50 sp gate)
// initialise FFT
// ---> Forward, 8192 samples, bitReversed
arm_rfft_instance_q15 S;
if(arm_rfft_init_q15(&S, 8192, 0, 1) != ARM_MATH_SUCCESS)
return state_error;
// perform FFT
arm_rfft_q15(&S, (q15_t*)0xC0000000, (q15_t*)0xC0400000);
// Post-shift by 12, in place (see doc)
arm_shift_q15((q15_t*)0xC0400000, 12, (q15_t*)0xC0400000, 16384);
// Init inverse FFT
if(arm_rfft_init_q15(&S, 8192, 1, 1) != ARM_MATH_SUCCESS)
return state_error;
// Perform iFFT
arm_rfft_q15(&S, (q15_t*)0xC0400000, (q15_t*)0xC0800000);
// Post shift
arm_shift_q15((q15_t*)0xC0800000, 12, (q15_t*)0xC0800000, 8192);
return state_success;
}
And here is the result (from GDB)
PS : I'm using ChibiOS - not sure if it is relevant.

Invalid argument in cudaMemcpy3D using width in bytes?

I've made a simple texture3D test and found strange behavior when copying data to the device. The function cudaMemcpy3D returns 'invalid argument'.
I found that the problem is related to cudaExtent. According to the CUDA Toolkit Reference Manual 4.0, the cudaExtent parameters are as follows:
w - Width in bytes
h - Height in elements
d - Depth in elements
So, I prepared the texture as follows:
// prepare texture
cudaChannelFormatDesc t_desc = cudaCreateChannelDesc<baseType>();
// CUDA extent parameters w - Width in bytes, h - Height in elements, d - Depth in elements
cudaExtent t_extent = make_cudaExtent(NCOLS*sizeof(baseType), NROWS, DEPTH);
// CUDA arrays are opaque memory layouts optimized for texture fetching
cudaArray *i_ArrayPtr = NULL;
// allocate 3D
status = cudaMalloc3DArray(&i_ArrayPtr, &t_desc, t_extent);
And configured the 3D parameters as follows:
// prepare input data
cudaMemcpy3DParms i_3DParms = { 0 };
i_3DParms.srcPtr = make_cudaPitchedPtr( (void*)h_idata, NCOLS*sizeof(baseType), NCOLS, NROWS);
i_3DParms.dstArray = i_ArrayPtr;
i_3DParms.extent = t_extent;
i_3DParms.kind = cudaMemcpyHostToDevice;
And finally copied the data to device memory:
// copy input data from host to device
status = cudaMemcpy3D( &i_3DParms );
The problem is solved if I specify the x dimension as a number of elements only:
cudaExtent t_extent = make_cudaExtent(NCOLS, NROWS, DEPTH);
which does not produce any error, and the test works as expected.
I'm wondering whether I am missing something about the cudaExtent function or something else. Why does the width parameter not need to be expressed in bytes?
For CUDA arrays, the extent is specified with the width given in array elements. For allocating linear memory, the extent is specified with the width given in bytes. Because you are allocating an array with cudaMalloc3DArray, use the width in elements. If you were using cudaMalloc3D, the extent would have a width in bytes.
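As a small host-only illustration of the two conventions, the sketch below builds both kinds of extent. Extent3, arrayExtent and linearExtent are hypothetical helpers that only mirror the three fields of cudaExtent so the example runs without CUDA. In real code the results would be passed to cudaMalloc3DArray and cudaMalloc3D respectively.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cudaExtent (assumption): same three fields.
struct Extent3 { std::size_t width, height, depth; };

// Extent for a CUDA array allocation (the cudaMalloc3DArray case):
// the width is a number of elements.
Extent3 arrayExtent(std::size_t cols, std::size_t rows, std::size_t depth) {
    assert(cols > 0 && rows > 0 && depth > 0);
    return {cols, rows, depth};
}

// Extent for a linear memory allocation (the cudaMalloc3D case):
// the width is a number of bytes, so it is scaled by the element size.
template <typename T>
Extent3 linearExtent(std::size_t cols, std::size_t rows, std::size_t depth) {
    assert(cols > 0 && rows > 0 && depth > 0);
    return {cols * sizeof(T), rows, depth};
}
```

For a 64 x 32 x 16 volume of 4-byte floats, arrayExtent gives a width of 64 while linearExtent<float> gives a width of 256. Passing the second value to an array allocation asks for four times as many elements per row, which matches the mismatch described in the question.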