odeint thrust slows down massively when using nested zip_iterators - thrust

I am building an analysis tool on top of odeint and thrust. The tool solves a large number of initial condition problems. I have had success following the odeint and thrust tutorials / demos. Up until now, the systems I have been working with have had less than 5 state variables, but now I need to extend the system to be capable of handling more.
As mentioned in the tutorial, there is a small problem that thrust's tuples can only have up to 10 items. This problem can be solved by using nested zip_iterators.
Quoting from near the bottom of one of the tutorials:
"a small difficulty that Thrust's tuples have a maximal arity of 10. But this is only a small problem since we can create a zip iterator packed with zip iterators. So the top level zip iterator contains one zip iterator for the state, one normal iterator for the parameter, and one zip iterator for the derivative."
I have implemented this solution, and it works, but unfortunately with a massive loss in speed. For the same system (a two-variable system, simultaneously solving 8192 initial conditions), the original, simple but not-extensible to more than 5 variables solution runs in just over a second:
real 0m1.244s
user 0m0.798s
sys 0m0.356s
whereas the more complex, but extensible nested solution takes ~200 times longer!
real 4m3.613s
user 2m15.124s
sys 1m47.363s
The only differences between the two programs are:
inside the functor's operator(), where I refer to the state variables and derivatives, and
inside the functor's constructor, specifically the for_each command, where I create the zip_iterators and tuples.
Below I have included excerpts from these sections, for each program. I hope I am doing something wrong here, because this kind of speed loss is devastating! Any help would be much appreciated.
Excerpts for "SIMPLE" code (non-nested-iterators)
//////////////////////////////////////////////////////////
//// Inside ICFunctor's operator()
// getting the state variable
value_type X = thrust::get< 0 >( t );
// setting the derivative
thrust::get< 2 >( t ) = 0.5*A - 1.5*X;
//////////////////////////////////////////////////////////
//// the for_each statement that creates the zip_iterator
thrust::for_each(
//// START INDICES
thrust::make_zip_iterator( thrust::make_tuple(
// State-variables
boost::begin( x ) + 0*m_N,
boost::begin( x ) + 1*m_N,
// Derivatives
boost::begin( dxdt ) + 0*m_N,
boost::begin( dxdt ) + 1*m_N)) ,
//// END INDICES
thrust::make_zip_iterator( thrust::make_tuple(
// State-variables
boost::begin( x ) + 1*m_N,
boost::begin( x ) + 2*m_N,
// Derivatives
boost::begin( dxdt ) + 1*m_N,
boost::begin( dxdt ) + 2*m_N)) ,
ICFunctor() );
Excerpts for "EXTENSIBLE" code (nested-iterators)
//////////////////////////////////////////////////////////
//// Inside ICFunctor's operator()
// getting the state variable
const int STATE_VARIABLES = 0; // defined as a global constant
value_type X = thrust::get<0>(thrust::get<STATE_VARIABLES>( t ));
// setting the derivative
const int DERIVATIVES = 1; // defined as a global constant
thrust::get<0>(thrust::get<DERIVATIVES>( t )) = 0.5*A - 1.5*X;
//////////////////////////////////////////////////////////
//// the for_each statement that creates the zip_iterator
thrust::for_each(
//// START INDICES
thrust::make_zip_iterator( thrust::make_tuple(
// State variables
thrust::make_zip_iterator( thrust::make_tuple(
boost::begin( x ) + 0*m_N,
boost::begin( x ) + 1*m_N)),
// Derivatives
thrust::make_zip_iterator( thrust::make_tuple(
boost::begin( dxdt ) + 0*m_N,
boost::begin( dxdt ) + 1*m_N))
)),
//// END INDICES
thrust::make_zip_iterator( thrust::make_tuple(
// State variables
thrust::make_zip_iterator( thrust::make_tuple(
boost::begin( x ) + 1*m_N,
boost::begin( x ) + 2*m_N)),
// Derivatives
thrust::make_zip_iterator( thrust::make_tuple(
boost::begin( dxdt ) + 1*m_N,
boost::begin( dxdt ) + 2*m_N))
)),
ICFunctor() );


Catching the "Pixel outside the boundaries" exception?

I have an image with atoms that are periodically arranged.
I am trying to write a script that counts how many atoms are arranged in the first column by placing a ROI on the top-left atom and then letting the script scan from left to right (column by column). My idea is that the ROI scans from left to right and, when it hits a pixel that is out of bounds (that is, outside the image), the script returns the number of atoms in one line instead of giving the error "A pixel outside the boundaries of the image has been referenced".
Is there a way to do this in a script?
Thank you.
You can catch any exception thrown by the script-code to handle it yourself using the
Try(){ }
Catch{ break; }
construct. However, this is not the nicest solution for your problem. If you know the size of your image, you really should use that knowledge to avoid accessing data outside the bounds. Try{}Catch{} is better left to those situations where "anything unexpected" could happen and you still want to handle the problem.
Here is a code example for your question.
number boxSize = 3
number sx = 10
number sy = 10
image img := realImage( "Test", 4, sx, sy )
img = random()
// Output "sum" over scanned ROI area
// Variant 1: Just scan -- Hits an exception, as you're moving out of range
/*
for( number j=0;j<sy;j++)
for( number i=0;i<sx;i++)
{
Result("\n ScanPos: " + i +" / " + j )
Result("\t SUM: "+ sum( img[j,i,j+boxSize,i+boxSize] ) );
}
*/
// Variant 2: As above, but catch exception to just continue
for( number j=0;j<sy;j++)
for( number i=0;i<sx;i++)
{
Result("\n ScanPos: " + i +" / " + j )
Try
{
Result( "\t SUM: "+ sum( img[j,i,j+boxSize,i+boxSize] ) );
}
catch
{
Result( "\t ROI OUT OF RANGE" )
break; // Needed in scripting, or the exception is re-thrown
}
}
// Variant 3: (Better) Avoid hitting the exception by using the knowledge of data size
for( number j=0;j<sy-boxSize;j++)
for( number i=0;i<sx-boxSize;i++)
{
Result("\n ScanPos: " + i +" / " + j )
Result("\t SUM: "+ sum( img[j,i,j+boxSize,i+boxSize] ) );
}
To answer the questions in the comments of the accepted answer:
You can query any ROI/Selection on an image and use this info to limit the iteration.
The following example shows this. It also shows how the ROI object is properly used, both to get the selection and to add new ROIs. More info can be found in the F1 help.
The script takes the front-most image with exactly one selection on it.
It then iterates the selection from the top-left onward (in a given step-size ) and outputs the region's sum value. Finally, you can opt to draw the used areas as new ROIs.
ClearResults()
// 1) Get image and image size
image img := GetFrontImage()
number sx = img.ImageGetDimensionSize(0)
number sy = img.ImageGetDimensionSize(1)
Result( "Image size: "+ sx + "x"+ sy+"\n")
// 2) Get the size of the user-drawn selection (ROI)
// 2a)
// If you are only dealing with the simple, user-drawn rectangle selections
// you can use the simplified code below instead.
number t, l, b, r
img.GetSelection(t,l,b,r)
Result( "Marker coordinates (simple): ["+t+","+l+","+b+","+r+"]\n" )
// 2b)
// Or you can use the "full" commands to catch other situations.
// The following lines check ROIs in a more general way.
// Not strictly needed, but to get you started if you want
// to use the commands outlined in F1 help section
// "Scripting > Objects > Document Object Model > ROI Object"
imageDisplay disp = img.ImageGetImageDisplay(0)
if( 0 == disp.ImageDisplayCountROIs() )
Throw( "No ROI on the image." )
if( 1 < disp.ImageDisplayCountROIs() )
Throw( "More than one ROI on the image." )
ROI theMarker = disp.ImageDisplayGetROI(0) // First (and only) ROI
if ( !theMarker.ROIIsRectangle() )
Throw( "ROI not a rectangle selection." )
if ( !theMarker.ROIGetVolatile() )
Throw( "ROI not volatile." ) // Volatile = ROI disappears when another is drawn. Dashed outline.
number top, left, bottom, right
theMarker.ROIGetRectangle( top, left, bottom, right )
Result( "Marker coordinates (ROI commands): ["+top+","+left+","+bottom+","+right+"]\n" )
// 3) Iterate within bounds
number roiWidth = right - left
number roiHeight = bottom - top
number roiXStep = 100 // We shift the ROI in bigger steps
number roiYStep = 100 // We shift the ROI in bigger steps
if ( !GetNumber( "The ROI is " + roiWidth + " pixels wide. Shift in X?", roiWidth, roiXStep ) )
exit(0)
if ( !GetNumber( "The ROI is " + roiHeight + " pixels high. Shift in Y?", roiHeight, roiYStep ) )
exit(0)
for ( number j = 0; j<sy-roiHeight; j+=roiYStep )
for ( number i = 0; i<sx-roiWidth; i+=roiXStep )
{
Result( "Sum at "+i+"/"+j+": " + sum( img[j,i,j+roiHeight,i+roiWidth] ) + "\n" )
}
// 4) If you want you can "show" the used positions.
if ( !TwoButtonDialog("Draw ROIs?","Yes","No") )
exit(0)
for ( number j = 0; j<sy-roiHeight; j+=roiYStep )
for ( number i = 0; i<sx-roiWidth; i+=roiXStep )
{
roi markerROI = NewRoi()
markerROI.ROISetRectangle( j, i, j+roiHeight, i+roiWidth )
markerROI.ROISetVolatile(0)
markerROI.ROISetColor(0,0.4,0.4)
markerROI.ROISetMoveable(0)
markerROI.ROISetLabel( ""+i+"/"+j ) // Start with "" to ensure the parameter is recognized as string
disp.ImageDisplayAddRoi( markerROI )
}

proper thrust call for subtraction

Following from here.
Assuming that dev_X is a vector.
int * X = (int*) malloc( ThreadsPerBlockX * BlocksPerGridX * sizeof(*X) );
for ( int i = 0; i < ThreadsPerBlockX * BlocksPerGridX; i++ )
X[ i ] = i;
// create device vectors
thrust::device_vector<int> dev_X ( ThreadsPerBlockX * BlocksPerGridX );
//copy to device
thrust::copy( X , X + ThreadsPerBlockX * BlocksPerGridX , dev_X.begin() );
The following is making a subtraction:
thrust::transform( dev_Kx.begin(), dev_Kx.end(), dev_X.begin() , distX.begin() , thrust::minus<float>() );
dev_Kx - dev_X.
I want to use the whole dev_Kx vector (which it does, because the range goes from .begin() to .end()) and the whole dev_X vector.
The above code only passes dev_X.begin().
Does that mean it will use the whole dev_X vector, starting from the beginning?
Or do I have to pass an extra argument pointing to dev_X.end()? (The function call above does not accept such an extra argument.)
Also, for example, if I want to use
thrust::transform( dev_Kx.begin(), dev_Kx.begin() + i , dev_X.begin() ,distX.begin() , thrust::minus<int>() );
then dev_Kx would go from 0 to i. What about dev_X.begin()? Will it use the same length (0 to i), or the length of dev_X?
Many thrust (and standard library) functions take a range as a first parameter and then assume all other iterators are backed by containers of the same size. A range is a pair of iterators indicating the beginning and end of a sequence.
For example:
thrust::copy(
X.begin(), // begin input iterator
X.end(), // end input iterator
dev_X.begin() // begin output iterator
);
This copies the entire contents of X into dev_X. Why is dev_X.end() not needed? Because thrust requires that you, the programmer, take the care of properly sizing dev_X to be able to contain at least as many elements as there are in the input range. If you don't meet that guarantee, then the behavior is undefined.
When you do this:
thrust::transform(
dev_Kx.begin(), // begin input (1) iterator
dev_Kx.end(), // end input (1) iterator
dev_X.begin(), // begin input (2) iterator
distX.begin(), // output iterator
thrust::minus<float>()
);
What thrust sees is an input range from dev_Kx.begin() to dev_Kx.end(). It has an explicit size of dev_Kx.end() - dev_Kx.begin(). Why are dev_X.end() and distX.end() not needed? Because they have an implicit size of dev_Kx.end() - dev_Kx.begin() too. For example, if there are 10 elements in dev_Kx, then transform will:
Use the 10 elements of dev_Kx
Use 10 elements of dev_X (which must hold at least 10 elements)
Perform the subtraction and store the 10 results in distX, which must be able to hold at least 10 elements.
Maybe looking at the implementation would clear up any doubts. Here's some pseudo code:
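template <typename InputIterator1, typename InputIterator2,
          typename OutputIterator, typename BinaryFunction>
OutputIterator transform(InputIterator1 input1_begin, InputIterator1 input1_end,
                         InputIterator2 input2_begin, OutputIterator output,
                         BinaryFunction op) {
    // Only the first input carries an end; the other two advance in lockstep.
    while (input1_begin != input1_end) {
        *output++ = op(*input1_begin++, *input2_begin++);
    }
    return output;
}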
Notice how only one end iterator is needed.
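The same convention can be tried on the host with std::transform, which takes its arguments in the same order. The sketch below is only an illustration: subtract_prefix is a hypothetical helper, and it uses std::vector instead of thrust::device_vector so that it compiles without CUDA. It also covers the partial-range case from the question: passing begin() and begin() + count processes exactly count elements of every sequence involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: computes a[k] - b[k] for the first `count` elements.
// Only the first input range has an explicit end; the second input and the
// output are assumed to hold at least `count` elements.
std::vector<int> subtract_prefix(const std::vector<int>& a,
                                 const std::vector<int>& b,
                                 std::size_t count) {
    assert(count <= a.size() && count <= b.size());
    const auto n = static_cast<std::ptrdiff_t>(count);
    std::vector<int> out(count);
    std::transform(a.begin(), a.begin() + n,  // explicit range: `count` elements
                   b.begin(),                 // implicit length: `count`
                   out.begin(),               // implicit length: `count`
                   std::minus<int>());
    return out;
}
```

Calling subtract_prefix(a, b, a.size()) uses all of a, while b may be longer; its extra elements are simply never read.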
On an unrelated note, the following:
int * X = (int*) malloc( ThreadsPerBlockX * BlocksPerGridX * sizeof(*X) );
for ( int i = 0; i < ThreadsPerBlockX * BlocksPerGridX; i++ )
X[ i ] = i;
Could be rewritten in more idiomatic, less error-prone C++ to:
std::vector<int> X(ThreadsPerBlockX * BlocksPerGridX);
std::iota(X.begin(), X.end(), 0);

thrust equivalent of cilk::reducer_list_append

I have a list of n intervals or domains. I would like to subdivide in parallel each interval into k parts making a new list (unordered). However, most of the subdivision won't pass certain criteria and shouldn't be added to the new list.
cilk::reducer_list_append extends the idea of parallel reduction to forming a list with push_back. This way I can collect in parallel only valid sub-intervals.
What is the thrust way of accomplishing the task? I suspect one way would be to form a large nxk list, then use parallel filter and stream compaction? But I really hope there is a reduction list append operation, because nxk can be very large indeed.
I am new to this forum, but maybe you will find some of this useful.
If you are not tied to Thrust, you can also have a look at ArrayFire.
I learned about it quite recently, and it is free for these sorts of problems.
For example, with ArrayFire you can evaluate the selection criterion for each interval
in parallel using the gfor construct. Consider:
// # of intervals n and # of subintervals k
const int n = 10, k = 5;
// this array represents original intervals
array A = seq(n); // A = 0,1,2,...,n-1
// for each interval A[i], subI[i] counts # of subintervals
array subI = zeros(n);
gfor(array i, n) { // in parallel for all intervals
// here evaluate your predicate for interval's subdivision
array pred = A(i)*A(i) + 1234;
subI(i) = pred % (k + 1);
}
//array acc = accum(subI);
int n_total = sum<float>(subI); // compute the total # of intervals
// this array keeps intervals after subdivision
array B = zeros(n_total);
std::cout << "total # of subintervals: " << n_total << "\n";
print(A);
print(subI);
gfor(array i, n_total) {
// populate the array of new intervals
B(i) = ...
}
print(B);
Of course, it depends on how your intervals are represented and
which criterion you use for subdivision.
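For comparison, the "generate all candidates, then compact" approach mentioned in the question can be sketched on the host as follows. This is a serial illustration only, not a Thrust implementation: the Interval alias, subdivide_and_filter and the predicate are hypothetical names, and it does materialize the full n x k candidate list that the question would like to avoid. In Thrust, the compaction step would correspond to thrust::copy_if.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

using Interval = std::pair<double, double>;  // (lo, hi); hypothetical representation

// Split every interval into k equal parts, then keep only the parts that
// satisfy `keep` (stream compaction via copy_if).
template <typename Predicate>
std::vector<Interval> subdivide_and_filter(const std::vector<Interval>& in,
                                           int k, Predicate keep) {
    assert(k > 0);
    std::vector<Interval> candidates;
    candidates.reserve(in.size() * static_cast<std::size_t>(k));
    for (const Interval& iv : in) {
        const double step = (iv.second - iv.first) / k;
        for (int p = 0; p < k; ++p)
            candidates.emplace_back(iv.first + p * step, iv.first + (p + 1) * step);
    }
    std::vector<Interval> kept;
    std::copy_if(candidates.begin(), candidates.end(),
                 std::back_inserter(kept), keep);
    return kept;
}
```

The output order follows the input order here; the question only asks for an unordered list, so a parallel version is free to drop that guarantee.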

Using a settable tolerance in the comparison object for an STL set

I want a tool to remove duplicate nodes ("nodes" in the finite element sense, simply a point in space with certain coordinates). Basically, I want to be able to take a collection of nodes, and reduce it by eliminating extra nodes that are within a certain tolerance in distance of other nodes. I figure that an STL set of nodes sorted by coordinate will fit the bill nicely performance wise. My question is how to incorporate a tolerance that is settable at run time.
Consider the simple key comparison function for the 1D case (a single coordinate per node). I know my logic could be shortened; that isn't what my question is about.
struct NodeSorter_x{
//Stores Node objects sorted by x coordinate within tolerance TOL
bool operator()(const Node N1, const Node N2) const
{
//returns true if N1 < N2 and not within TOL distance
return ( N1.x < N2.x) && !( fabs( N1.x - N2.x ) < TOL );
}
};
And the set containing unique node objects (no duplicates)...
std::set <Node,NodeSorter_x> UniqueNodeSet;
So, I want to be able to set TOL used in the comparison at run time. How do I go about doing so?
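Store the tolerance as a data member of the comparison object and pass an instance to the set's constructor:
struct NodeSorter_x{
NodeSorter_x(double tol) : TOL(tol) {}
bool operator()(const Node N1, const Node N2) const
{
//returns true if N1 < N2 and not within TOL distance
return ( N1.x < N2.x) && !( fabs( N1.x - N2.x ) < TOL );
}
double TOL; // tolerance chosen at run time
};
std::set <Node,NodeSorter_x> UniqueNodeSet( NodeSorter_x(0.1) );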
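A self-contained version of this idea might look like the sketch below. The Node struct, the NodeSorterX name and the make_unique_node_set helper are assumptions made for illustration; only the comparison logic is taken from the code above. One caveat worth keeping in mind: "within tolerance" is not a transitive relation, so this comparator is not a strict weak ordering in general, and the result can depend on insertion order when nodes form chains closer than the tolerance.

```cpp
#include <cassert>
#include <cmath>
#include <set>

struct Node { double x; };  // hypothetical minimal 1D node

// Comparator carrying a tolerance that is chosen at run time.
struct NodeSorterX {
    explicit NodeSorterX(double tol) : tol_(tol) { assert(tol >= 0.0); }

    // True if a lies before b and the two are not within tol_ of each other.
    bool operator()(const Node& a, const Node& b) const {
        return a.x < b.x && !(std::fabs(a.x - b.x) < tol_);
    }

private:
    double tol_;
};

using UniqueNodeSet = std::set<Node, NodeSorterX>;

// The set copies the comparator instance it is given, so each set can
// carry its own tolerance.
UniqueNodeSet make_unique_node_set(double tol) {
    return UniqueNodeSet(NodeSorterX(tol));
}
```

Two sets built with different tolerances have the same type but behave differently, because the tolerance lives in the comparator object rather than in the type.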

How to store a symmetric matrix?

Which is the best way to store a symmetric matrix in memory?
It would be good to save half of the space without compromising speed and complexity of the structure too much. This is a language-agnostic question, but if you need to make some assumptions just assume it's a good old plain programming language like C or C++.
It seems this only makes sense if there is a way to keep things simple, or when the matrix itself is really big. Am I right?
For the sake of formality: the following assertion always holds for the data I want to store
matrix[x][y] == matrix[y][x]
Here is a good method to store a symmetric matrix, it requires only N(N+1)/2 memory:
int fromMatrixToVector(int i, int j, int N)
{
if (i <= j)
return i * N - (i - 1) * i / 2 + j - i;
else
return j * N - (j - 1) * j / 2 + i - j;
}
For some triangular matrix
0 1 2 3
4 5 6
7 8
9
The 1D representation (stored in std::vector, for example) looks as follows:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The call fromMatrixToVector(1, 2, 4) returns 5, so the matrix element is read from vector[5], which holds 5.
For more information see http://www.codeguru.com/cpp/cpp/algorithms/general/article.php/c11211/TIP-Half-Size-Triangular-Matrix.htm
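To make the mapping concrete, here is a small sketch that wraps the index function in two helpers for allocating and accessing the packed storage. The index formula is the one given above; packedSize and symAt are hypothetical helper names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index formula from above: row-major upper triangle, including the diagonal.
int fromMatrixToVector(int i, int j, int N) {
    assert(i >= 0 && j >= 0 && i < N && j < N);
    if (i <= j)
        return i * N - (i - 1) * i / 2 + j - i;
    else
        return j * N - (j - 1) * j / 2 + i - j;
}

// Number of stored elements for an N x N symmetric matrix.
std::size_t packedSize(int N) {
    return static_cast<std::size_t>(N) * (static_cast<std::size_t>(N) + 1) / 2;
}

// Access element (i, j); (i, j) and (j, i) refer to the same slot.
int& symAt(std::vector<int>& packed, int i, int j, int N) {
    return packed[static_cast<std::size_t>(fromMatrixToVector(i, j, N))];
}
```

Writing through symAt(v, 0, 3, 4) and reading back through symAt(v, 3, 0, 4) touches the same element, which is what makes the half-size storage transparent to the caller.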
I find that many high performance packages just store the whole matrix, but then only read the upper triangle or lower triangle. They might then use the additional space for storing temporary data during the computation.
However if storage is really an issue then just store the n(n+1)/2 elements making the upper triangle in a one-dimensional array. If that makes access complicated for you, just define a set of helper functions.
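In C, to access a matrix matA that is stored in full (dim*dim elements) while only ever reading its upper triangle, you could define a macro:
#define A(i, j, dim) (((i) <= (j)) ? matA[(i)*(dim) + (j)] : matA[(j)*(dim) + (i)])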
then you can access your array nearly normally.
Well I would try a triangular matrix, like this:
int[][] sym = new int[rows][];
for( int i = 0; i < rows; ++i ) {
sym[i] = new int[i+1];
}
But then you will have to face the problem when someone wants to access the "other side". E.g. they want to access [0][10], but in your case this value is stored in [10][0] (assuming the matrix is at least 11x11).
The probably "best" way is the lazy one: don't do anything until the user requests it. So you could load the specific row if the user types something like print(matrix[4]).
If you want to use a one dimensional array the code would look something like this:
int[] matrix = new int[(rows * (rows + 1 )) >> 1];
int z;
matrix[ ( ( z = ( x < y ? y : x ) ) * ( z + 1 ) >> 1 ) + ( y < x ? y : x ) ] = yourValue;
You can get rid of the multiplications if you create an additional look-up table:
int[] matrix = new int[(rows * (rows + 1 )) >> 1];
int[] lookup = new int[rows];
for ( int i= 0; i < rows; i++)
{
lookup[i] = (i * (i+1)) >> 1;
}
matrix[ lookup[ x < y ? y : x ] + ( x < y ? x : y ) ] = yourValue;
If you're using something that supports operator overloading (e.g. C++), it's pretty easy to handle this transparently. Just create a matrix class that checks the two subscripts, and if the second is greater than the first, swap them:
template <class T>
class sym_matrix {
std::vector<std::vector<T> > data;
public:
T operator()(int x, int y) {
if (y>x)
return data[y][x];
else
return data[x][y];
}
};
For the moment I've skipped over everything else, and just covered the subscripting. In reality, to handle use as both an lvalue and an rvalue correctly, you'll typically want to return a proxy instead of a T directly. You'll want a ctor that creates data as a triangle (i.e., for an NxN matrix, the first row will have N elements, the second N-1, and so on -- or, equivalently, 1, 2, ...N). You might also consider creating data as a single vector -- you have to compute the correct offset into it, but that's not terribly difficult, and it will use a bit less memory, run a bit faster, etc. I'd use the simple code for the first version, and optimize later if necessary.
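As a rough sketch of the single-vector variant described above, the class below stores the lower triangle row by row (rows of length 1, 2, ..., N) and swaps the subscripts when needed. The name packed_sym_matrix and its members are assumptions made for this illustration. It returns a plain reference rather than a proxy, which is enough for element types that std::vector stores directly (so not bool).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

template <class T>
class packed_sym_matrix {
    std::size_t n_;
    std::vector<T> data_;  // lower triangle, rows of length 1, 2, ..., n

    // Offset of (x, y) after ordering the subscripts so that x >= y.
    std::size_t index(std::size_t x, std::size_t y) const {
        assert(x < n_ && y < n_);
        if (y > x) std::swap(x, y);
        return x * (x + 1) / 2 + y;
    }

public:
    explicit packed_sym_matrix(std::size_t n)
        : n_(n), data_(n * (n + 1) / 2) {}  // value-initialized elements

    T& operator()(std::size_t x, std::size_t y) { return data_[index(x, y)]; }
    const T& operator()(std::size_t x, std::size_t y) const {
        return data_[index(x, y)];
    }

    std::size_t size() const { return n_; }             // matrix dimension
    std::size_t stored() const { return data_.size(); } // elements actually held
};
```

With this layout, m(1, 3) = 2.5 followed by reading m(3, 1) yields the same element, and a 4 x 4 matrix holds 10 values instead of 16.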
You could use a staggered array (or whatever they're called) if your language supports it, and when x < y, switch the position of x and y. So...
Pseudocode (somewhat Python style, but not really) for an n x n matrix:
matrix[n][]
for i from 0 to n-1:
matrix[i] = some_value_type[i + 1]
[next, assign values to the elements of the half-matrix]
And then when referring to values....
if x < y:
return matrix[y][x]
else:
return matrix[x][y]