cublas<t>gemmBatched with aliased Carray parameter - cuda

I'm trying to implement something like scipy.sparse.bsr_matrix operations with cublas<t>gemmBatched. Unfortunately I can't do this with cusparse, since my BSR matrix isn't square.
I'm new to cublas, and I wonder whether it's OK (correctness-wise and performance-wise) to pass an aliased pointer array (in the pointer-aliasing sense) as float * Carray[],
e.g.
/* given float * out as the real output array */
float * Carray[] = {
    out + 1*stride, out + 2*stride, out + 3*stride,
    out + 1*stride, out + 2*stride, out + 3*stride,
    /* and repeat */
};
Also, although I'm fairly sure it will be correct if I use an aliased Aarray or Barray, is there any performance impact?
Thanks!

In general, there is no problem with that sort of aliasing in CUBLAS. In fact, it is the normal way to deal with submatrices, and most LAPACK style solvers use pointer indexing or aliasing extensively to perform sub-block operations on matrices.
I don't believe there is a performance penalty in working this way, at least for the batched solvers, although the only way to be certain would be to benchmark, which should be trivial to do yourself.
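To make the layout concrete, here is a small host-side sketch (my own illustration; the function name, sizes, and parameters are made up). It builds an aliased pointer array of the kind described in the question; the resulting array is what you would copy to the device (e.g. with cudaMemcpy) and pass as the Carray argument of cublasSgemmBatched:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a host-side batched C-pointer array in which batch entries alias
// sub-blocks of one output buffer. Names and sizes are illustrative only.
std::vector<float*> makeAliasedCarray(float* out, std::size_t stride,
                                      std::size_t blocks, std::size_t repeats)
{
    std::vector<float*> carray;
    carray.reserve(blocks * repeats);
    for (std::size_t r = 0; r < repeats; ++r)
        for (std::size_t b = 0; b < blocks; ++b)
            carray.push_back(out + b * stride);  // repeated entries alias
    // This array would then be copied to the device and passed as the
    // Carray argument of cublasSgemmBatched(...).
    return carray;
}
```

The aliasing itself is nothing more than this pointer arithmetic; cuBLAS only sees an array of device addresses.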

Related

Evaluate function at constant speed relative to arc length

I'm implementing a realtime graphics engine (C++ / OpenGL) that moves a vehicle over time along a specified course that is described by a polynomial function. The function itself was programmatically generated outside the application and is of a high order (I believe >25), so I can't really post it here (I don't think it matters anyway). During runtime the function does not change, so it's easy to calculate the first and second derivatives once to have them available quickly later on.
My problem is that I have to move along the curve with a constant speed (say 10 units per second), so my function parameter is not equal to the time directly, since the arc length between two points x1 and x2 differs dependent on the function values. For example the difference f(a+1) - f(a) may be way larger or smaller than f(b+1) - f(b), depending on how the function looks at points a and b.
I don't need a 100% accurate solution, since the movement is only visual and will not be processed any further, so any approximation is OK as well. Also please keep in mind that the whole thing has to be calculated at runtime each frame (60fps), so solving huge equations with complex math may be out of the question, depending on computation time.
I'm a little lost on where to start, so even any train of thought would be highly appreciated!
Since the criterion was not to have an exact solution, but a visually appealing approximation, there were multiple possible solutions to try out.
The first approach I implemented (suggested by Alnitak in the comments and later answered by coproc) approximates the actual arc-length integral in tiny iterations. This version worked really well most of the time, but was not reliable at really steep angles and used too many iterations at flat angles. As coproc already pointed out in the answer, a possible improvement would be to base dx on the second derivative.
All these adjustments could be made; however, I need a runtime-friendly algorithm, and with this one it is hard to predict the number of iterations, which is why I was not happy with it.
The second approach (also inspired by Alnitak) is utilizing the first derivative by "pushing" the vehicle along the calculated slope (which is equal to the derivative at the current x value). The function for calculating the next x value is really compact and fast. Visually there is no obvious inaccuracy and the result is always consistent. (That's why I chose it)
float current_x = ...; // stores the current x value

float f(float x) { ... }      // the course polynomial
float f_derv(float x) { ... } // its first derivative

void calc_next_x(float units_per_second, float time_delta) {
    float arc_length = units_per_second * time_delta;
    float derv_squared = f_derv(current_x) * f_derv(current_x);
    // ds = sqrt(1 + f'(x)^2) dx, so dx ≈ ds / sqrt(1 + f'(x)^2)
    current_x += arc_length / sqrt(derv_squared + 1);
}
This approach, however, will likely only be accurate enough at high frame rates (mine is >60 fps), since the object is always pushed along a straight line whose length depends on the frame time.
Given the constant speed and the time between frames, the desired arc length per frame can be computed, so the following function should do the job:
#include <cmath>

typedef double (*Function)(double);

double moveOnArc(Function f, const double xStart, const double desiredArcLength, const double dx = 1e-2)
{
    double arcLength = 0.;
    double fPrev = f(xStart);
    double x = xStart;
    double dx2 = dx*dx;
    while (arcLength < desiredArcLength)
    {
        x += dx;
        double fx = f(x);
        double dfx = fx - fPrev;
        arcLength += std::sqrt(dx2 + dfx*dfx);
        fPrev = fx;
    }
    return x;
}
Since you say that accuracy is not a top criterion, the above function might work right away with an appropriately chosen dx. Of course, it could be improved by adjusting dx automatically (e.g. based on the second derivative) or by refining the endpoint with a binary search.
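As a quick sanity check (my addition, not part of the original answer): for the straight line f(x) = x the arc length from 0 to x is x·sqrt(2), so consuming an arc length of 1 should land near x = 1/sqrt(2) ≈ 0.707:

```cpp
#include <cassert>
#include <cmath>

typedef double (*Function)(double);

// moveOnArc exactly as in the answer above, repeated so this snippet is
// self-contained.
double moveOnArc(Function f, const double xStart, const double desiredArcLength, const double dx = 1e-2)
{
    double arcLength = 0.;
    double fPrev = f(xStart);
    double x = xStart;
    const double dx2 = dx * dx;
    while (arcLength < desiredArcLength) {
        x += dx;
        const double fx = f(x);
        const double dfx = fx - fPrev;
        arcLength += std::sqrt(dx2 + dfx * dfx);
        fPrev = fx;
    }
    return x;
}

double line(double x) { return x; }  // arc length from 0 to x is x * sqrt(2)
```

With dx = 1e-3 the result agrees with the exact value 1/sqrt(2) to about three decimal places, which illustrates why a visually acceptable dx is easy to find.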

Numerical stability of ODE system

I have to perform a numerical solving of an ODE system which has the following form:
du_j/dt = f_1(u_j, v_j, t) + g_1(t)v_(j-1) + h_1(t)v_(j+1),
dv_j/dt = f_2(u_j, v_j, t) + g_2(t)u_(j-1) + h_2(t)u_(j+1),
where u_j(t) and v_j(t) are complex-valued scalar functions of time t, f_i and g_i are given functions, and j = -N,..N. This is an initial value problem and the task is to find the solution at a certain time T.
If g_i(t) = h_i(t) = 0, then the equations for different values of j can be solved independently, and in this case I obtain stable and accurate solutions with the fourth-order Runge-Kutta method. However, once I turn on the couplings, the results become very unstable with respect to the time-grid step and to the explicit form of the functions g_i, h_i.
I guess it is reasonable to try an implicit Runge-Kutta scheme, which might be stable in such a case, but then I would have to invert a huge matrix of size 4*N*c, where c depends on the order of the method (e.g. c = 3 for the Gauss–Legendre method), at each step. Of course, the matrix will mostly contain zeros and has a block tridiagonal form, but it still seems very time-consuming.
So I have two questions:
Is there a stable explicit method which works even when the coupling functions g_i and h_i are (very) large?
If an implicit method is, indeed, a good solution, what is the fastest way to invert a block tridiagonal matrix? At the moment I just perform plain Gaussian elimination, avoiding the redundant operations that arise from the specific structure of the matrix.
Additional info and details that might help us:
I use Fortran 95.
I currently consider g_1(t) = h_1(t) = g_2(t) = h_2(t) = -iAF(t)sin(omega*t), where i is the imaginary unit, A and omega are given constants, and F(t) is a smooth envelope going slowly, first, from 0 to 1 and then from 1 to 0, so F(0) = F(T) = 0.
Initially u_j = v_j = 0 unless j = 0. The functions u_j and v_j with great absolute values of j are extremely small for all t, so the initial peak does not reach the "boundaries".
To 1) There will be no stable explicit method if your coupling functions are very large. This is due to the fact that the stability region of explicit (Runge-Kutta) methods is bounded, so large couplings force prohibitively small time steps.
To 2) If your matrices are larger than 100x100, you could use the method described in this paper:
Inverses of Block Tridiagonal Matrices and Rounding Errors.
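For intuition on why the tridiagonal structure makes the implicit solve cheap, here is the scalar Thomas algorithm in C++ (my illustration, not from the paper; the questioner uses Fortran 95, and the block version replaces the scalar divisions with small dense solves but keeps the same O(N) sweep):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Thomas algorithm: solves a tridiagonal system in O(n) instead of O(n^3).
// a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal
// (c[n-1] unused), d = right-hand side. No pivoting, so it assumes the
// system is diagonally dominant or otherwise well conditioned.
std::vector<double> solveTridiagonal(std::vector<double> a, std::vector<double> b,
                                     std::vector<double> c, std::vector<double> d)
{
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {   // forward elimination
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )  // back substitution
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

In the block tridiagonal case each scalar entry becomes a small dense block, so each division becomes a small LU solve, but the overall cost stays linear in the number of blocks rather than cubic in the full matrix size.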

How to Solve non-specific non-linear equations?

I am attempting to fit a circle to some data. This requires numerically solving a set of three non-linear simultaneous equations (see the Full Least Squares Method of this document).
To me it seems that the NEWTON function provided by IDL is fit for solving this problem. NEWTON requires the name of a function that will compute the values of the equation system for particular values of the independent variables:
FUNCTION newtfunction, X
  RETURN, [Some function of X, Some other function of X]
END
While this works fine, it requires that all parameters of the equation system (in this case the set of data points) are hard-coded in newtfunction. This would be fine if there were only one data set to solve for; however, I have many thousands of data sets, and defining a new function for each by hand is not an option.
Is there a way around this? Is it possible to define functions programmatically in IDL, or even just pass in the data set in some other manner?
I am not an expert on this matter, but if I were to solve this problem I would do the following. Instead of solving a system of three non-linear equations for the three unknowns (i.e. xc, yc and r), I would use an optimization routine to converge to a solution from an initial guess. Steepest descent, conjugate gradient, or any other multivariate optimization method can be used for this.
I just quickly derived the least-squares objective for your problem as (please check before use):
F = sum_{i=1}^{N} ((xi - xc)^2 + (yi - yc)^2 - r^2)^2
Calculating the gradient of this function is fairly easy, since it is just a summation, so writing a steepest-descent code to compute xc, yc and r would be straightforward.
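To make that concrete, here is a hedged steepest-descent sketch (written in C++ rather than IDL; the function name, learning rate, and iteration count are my own choices, and the same loop structure would carry over to IDL):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Circle { double xc, yc, r; };

// Minimize F(xc, yc, r) = sum_i ((xi-xc)^2 + (yi-yc)^2 - r^2)^2 by plain
// steepest descent from an initial guess. Rate and iteration count are
// illustrative; a production fit would use a line search or a better method.
Circle fitCircle(const std::vector<double>& xs, const std::vector<double>& ys,
                 Circle guess, double rate = 1e-3, int iterations = 20000)
{
    for (int it = 0; it < iterations; ++it) {
        double gx = 0.0, gy = 0.0, gr = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            const double d = (xs[i] - guess.xc) * (xs[i] - guess.xc)
                           + (ys[i] - guess.yc) * (ys[i] - guess.yc)
                           - guess.r * guess.r;
            gx += -4.0 * d * (xs[i] - guess.xc);  // dF/dxc
            gy += -4.0 * d * (ys[i] - guess.yc);  // dF/dyc
            gr += -4.0 * d * guess.r;             // dF/dr
        }
        guess.xc -= rate * gx;
        guess.yc -= rate * gy;
        guess.r  -= rate * gr;
    }
    return guess;
}
```

A sensible initial guess is the centroid of the points for (xc, yc) and the mean distance to the centroid for r; with that starting point the descent only has to polish the estimate.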
I hope it helps.
It's usual to use a COMMON block in these types of functions to pass in other parameters, cached values, etc. that are not part of the calling signature of the numeric routine.

How to avoid memory leaks in this case?

In order to prevent memory leaks in ActionScript 3.0, I use a member vector in classes that have to work with vectors, for example:
public class A
{
    private static var mHelperPointVector:Vector.<Point> = new Vector.<Point>();

    public static function GetFirstData():Vector.<Point>
    {
        mHelperPointVector.length = 0;
        ....
        return mHelperPointVector;
    }

    public static function GetSecondData():Vector.<Point>
    {
        mHelperPointVector.length = 0;
        ....
        return mHelperPointVector;
    }
}
and then I have consumers that use the GetFirstData and GetSecondData methods, storing references to the vectors returned by them, for example:
public function OnEnterFrame():void
{
    var vector:Vector.<Point> = A.GetSecondData();
    ....
}
This trick seems good, but sometimes I need to process the vector returned by GetSecondData() after some period of time, and by then it has been overwritten by another call to GetSecondData() or GetFirstData(). The obvious fix is to copy the vector into a new one, but then it would be better to avoid the trick altogether. How do you deal with these problems? I have to work with a large number of vectors (each of length 1-10).
The thing about garbage collection is to avoid instantiating (and disposing of) objects as much as possible. It's hard to say what the best approach is without seeing how/why you're using your Vector data, but at first glance I think that with your approach you'll constantly lose data (you're pretty much creating the equivalent of weak references, since they can easily be overwritten). Also, changing the length of a Vector doesn't really avoid garbage collection; it may delay and reduce it, but you're still constantly throwing data away.
I frankly don't think you'd have memory leaks with Vectors of Points unless you're leaking the reference to the Vector left and right; in that case, it'd be better to fix those leftover references than to come up with a scheme to reuse the same vectors (which can have many more adverse effects).
However, if you're really concerned about memory, your best solution, I think, is either creating all vectors you need in advance (if it's a fixed number and you know their length ahead of time) or, better yet, using Object Pools. The latter would definitely be a more robust solution, but it requires some setup on your end, both by creating a Pool class and then when using it. To put it in code, once implemented, it would be used like this:
// Need a vector with length of 9
var myVector:Vector.<Point> = VectorPool.get(9);
// Use the vector for stuff
...
// Vector not needed anymore, put it back in the pool
VectorPool.put(myVector);
myVector = null; // just so it's clear we can't use it anymore
VectorPool would control the list of Vectors you have, letting other parts of your code "borrow" vectors as needed (in which they would be marked as being "used" inside the VectorPool) and give them back (marking them back as unused). Your code could also create vectors on the spot (inside get()), as needed, if no usable vectors are available within the list of unused objects; this would make it more flexible (not recommended in some cases since you're still spending time with instantiation, but probably negligible in this case).
This is a very macro explanation (you'd still have to write VectorPool), but object pools like that are believed to be the definitive solution to avoid re-instantiating as well as garbage collection of objects that are just going to be reused.
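For illustration, a minimal pool might look like this (sketched in C++ rather than ActionScript, with made-up names; an AS3 VectorPool would follow the same get/put structure around Vector.<Point> instances):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal object pool: hands out reusable buffers instead of allocating
// fresh ones each time. Names are illustrative only.
class BufferPool {
public:
    // Borrow a buffer of the requested length, reusing a returned one if
    // available, otherwise creating it on the spot.
    std::vector<double> get(std::size_t length) {
        if (!free_.empty()) {
            std::vector<double> buf = std::move(free_.back());
            free_.pop_back();
            buf.assign(length, 0.0);  // reuses the old allocation if large enough
            return buf;
        }
        return std::vector<double>(length, 0.0);
    }

    // Give a buffer back so a later get() can reuse it.
    void put(std::vector<double> buf) { free_.push_back(std::move(buf)); }

private:
    std::vector<std::vector<double>> free_;
};
```

The key property is that once the pool is warm, get() stops allocating entirely, which is exactly what keeps the garbage collector (or, here, the allocator) quiet.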
For reference, here's what I used as a very generic Object Pool:
https://github.com/zeh/as3/blob/master/com/zehfernando/data/ObjectPool.as
Or a more specialized one, that I use in situations when I need a bunch of throwaway BitmapData instances of similar sizes:
https://github.com/zeh/as3/blob/master/com/zehfernando/data/BitmapDataPool.as
I believe the implementation of a VectorPool class in the molds of what you need would be similar to the link above.
As a side note, if performance is a concern, I'd suggest using vectors of fixed length too, e.g.
// Create a vector of 9 items, filled with `nulls`
var myPoints:Vector.<Point> = new Vector.<Point>(9, true);
This makes it faster since you won't have micro allocations over time. You have to set the items directly, instead of using push():
myPoints[0] = new Point(0, 0);
But that's actually an advantage, since setting vector items directly is faster than push().

Why doesn't `list::iterator + 1` work?

std::list<CPoint>::iterator iter= vertices.end();
CPoint point = *(iter+1);
In such cases I've tried to assign the value of (iter-1) or (iter+1) to a variable. Why doesn't it work, when iter++ and iter-- do?
Simply, these operations are not part of the list iterator's definition. You can use the std::advance() function instead.
Obviously, operator+(int) could be overloaded to do that, just as operator++() is, but it probably is not because this operation is not guaranteed to have constant complexity, and a syntax like (iter + n) could suggest otherwise.
From advance:
Complexity: Linear. However, if InputIt additionally meets the requirements of RandomAccessIterator, complexity is constant.
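A minimal illustration (my addition): since std::list iterators are bidirectional, iter + 1 does not compile, but std::advance and std::next do the same job in linear time:

```cpp
#include <cassert>
#include <iterator>
#include <list>

// std::list iterators are bidirectional, so iter + 1 does not compile;
// std::advance / std::next / std::prev do the same job in linear time.
int secondElement(const std::list<int>& values) {
    auto iter = values.begin();
    std::advance(iter, 1);  // or: auto iter = std::next(values.begin());
    return *iter;
}
```

Note that the snippet in the question has a second problem: iter is vertices.end(), and end() must never be dereferenced or advanced past, even with iterators that do support +.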