When we tried to implement a program with a single for loop with a limit of 100, Octave failed to show all the results. We think this is a problem with the buffer. How can we overcome that?
Try to express the problem in terms of matrices. MATLAB and Octave are optimized for matrix operations. Here is an excerpt of what the MATLAB documentation site says about vectorizing loops:
The MATLAB software uses a matrix language, which means it is designed for vector and matrix operations. You can often speed up your code by using vectorizing algorithms that take advantage of this design. Vectorization means converting for and while loops to equivalent vector or matrix operations.
They also provide a simple example of vectorizing a loop to compute the sine of 1001 values ranging from 0 to 10:
i = 0;
for t = 0:.01:10
    i = i + 1;
    y(i) = sin(t);
end
which they convert to a vectorized version of the same code:
t = 0:.01:10;
y = sin(t);
There are more details in the MATLAB Code Vectorization Guide, and some examples in these related questions:
Loopless function calls on vector/matrix members in Matlab/Octave
Matlab - Speeding up a Nested For Loop
I was wondering whether it is possible to extend the CUDA.@atomic operation to a custom type.
Here is an example of what I am trying to do:
using CUDA
struct Dual
    x::Int  # concrete field types so the struct is isbits and can be stored in a CuArray
    y::Int
end
cu0 = CuArray([Dual(1, 2), Dual(2,3)])
cu1 = CuArray([Dual(1, 2), Dual(2,3)])
indexes = CuArray([1, 1])
function my_kernel(dst, src, idx)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds if index <= length(idx)
        CUDA.@atomic dst[idx[index]] = dst[idx[index]] + src[index]
    end
    return nothing
end
@cuda threads = 100 my_kernel(cu0, cu1, indexes)
The problem with this code is that the CUDA.@atomic call only supports basic types like
Int, Float, or Real.
I need it to work with my own struct.
It would be nice if someone had an idea of how this could be done.
The underlying PTX instruction set for CUDA provides a subset of atomic store, exchange, add/subtract, increment/decrement, min/max, and compare-and-set operations for global and shared memory locations (not all architectures support all operations with all POD types, and there is evidence that not all operations are implemented in hardware on all architectures).
What all these instructions have in common is that they execute only one operation atomically. I am completely unfamiliar with Julia, but if
CUDA.@atomic dst[idx[index]] = dst[idx[index]] + src[index]
means "atomically add src[].x and src[].y to dst[].x and dst[].y" then that isn't possible because that implies two additions on separate memory locations in one atomic operation. If the members of your structure could be packed into a compatible type (a 32 bit or 64 bit unsigned integer, for example), you could perform atomic store, exchange or compare-and-set in CUDA. But not arithmetic.
If you consult this section of the programming guide, you can see an example of a brute force double precision add implementation using compare-and-set in a tight loop. If your structure can be packed into something which can be manipulated with compare-and-set, then it might be possible to roll your own atomic add for a custom type (limited to a maximum of 64 bits).
How you might approach that in Julia is definitely an exercise left to the reader.
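Purely as an illustration of that last point (in CUDA C rather than Julia, and with the names Dual2i, pack, unpack, and atomicAddDual invented for this sketch): if the struct happens to fit in 64 bits, say two 32-bit integers, a hand-rolled atomic add can follow the same compare-and-set loop pattern as the double-precision atomicAdd example in the programming guide.
struct Dual2i { int x; int y; };   // assumed layout: two 32-bit fields
__device__ unsigned long long pack(Dual2i d)
{
    return (static_cast<unsigned long long>(static_cast<unsigned int>(d.y)) << 32)
         | static_cast<unsigned int>(d.x);
}
__device__ Dual2i unpack(unsigned long long p)
{
    Dual2i d;
    d.x = static_cast<int>(p & 0xffffffffULL);   // low 32 bits
    d.y = static_cast<int>(p >> 32);             // high 32 bits
    return d;
}
__device__ void atomicAddDual(unsigned long long* addr, Dual2i val)
{
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        Dual2i cur = unpack(assumed);
        cur.x += val.x;   // both components are updated in the new 64-bit value...
        cur.y += val.y;
        // ...which is only written if nobody changed *addr in the meantime.
        old = atomicCAS(addr, assumed, pack(cur));
    } while (assumed != old);
}
The array of Dual2i values would have to be reinterpreted as an array of unsigned long long for the call; this is where the 64-bit limit mentioned above comes from.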
I am trying to use CUDA in order to parallelize the simulated annealing algorithm. The GPU I am using is an NVIDIA GTX 660. I am trying to speed the program up, and in order to do so I am considering replacing this
int r= rand();
if (condition)
{
    r += 1;
}
with
int r = rand() + (condition)*1;
I understand that jump/branch instructions (like if-then-else constructs) are the slowest to execute, but unless my understanding is incorrect, a typecast involves a memory access and then copying the number to a new location as an int before accessing it. Could the result of 'condition' be stored in a register and fed into the ALU without modification? If so, wouldn't that be a faster way to calculate the value of the variable r? The above runs on every thread.
Generally, you'd try very hard to avoid branching on GPUs, since that's classically the point where the GPU needs to halt all threads that don't go through that branch, execute those that do, then halt those, and do the other branch.
That being said, the branching doesn't necessarily happen just because you write if; a comparison such as < merely sets a register based on what you're comparing, and the compiler may turn the conditional add into predicated code with no real branch at all. But that depends very much on your actual condition and the language/architecture you're on – my knowledge is from first-generation CUDA and might not fully apply anymore.
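To make the comparison concrete, here is a minimal CUDA sketch (the kernel names and the threshold condition are made up, and an array read stands in for rand() so the kernels are self-contained). On recent compilers both versions frequently end up as the same predicated code, so timing them or inspecting the SASS with cuobjdump is the only reliable way to tell whether the rewrite gains anything.
__global__ void with_branch(int* out, const int* in, int n, int threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int r = in[i];
    if (in[i] > threshold)   // diverges only if threads within a warp disagree
        r += 1;
    out[i] = r;
}
__global__ void without_branch(int* out, const int* in, int n, int threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // The comparison result (0 or 1) stays in a register; no typecast or
    // extra memory access is involved.
    out[i] = in[i] + (in[i] > threshold);
}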
I have to perform a numerical solving of an ODE system which has the following form:
du_j/dt = f_1(u_j, v_j, t) + g_1(t)v_(j-1) + h_1(t)v_(j+1),
dv_j/dt = f_2(u_j, v_j, t) + g_2(t)u_(j-1) + h_2(t)u_(j+1),
where u_j(t) and v_j(t) are complex-valued scalar functions of time t, f_i and g_i are given functions, and j = -N,..N. This is an initial value problem and the task is to find the solution at a certain time T.
If g_i(t) = h_i(t) = 0, then the equations for different values of j can be solved independently. In this case I obtain a stable and accurate solutions with the aid of the fourth-order Runge-Kutta method. However, once I turn on the couplings, the results become very unstable with respect to the time grid step and explicit form of the functions g_i, h_i.
I guess it is reasonable to try to employ an implicit Runge-Kutta scheme, which might be stable in such a case, but if I do so, I will have to evaluate the inverse of a huge matrix of size 4*N*c, where c depends on the order of the method (e.g. c = 3 for the Gauss–Legendre method), at each step. Of course, the matrix will mostly contain zeros and have a block tridiagonal form, but it still seems to be very time-consuming.
So I have two questions:
Is there a stable explicit method which works even when the coupling functions g_i and h_i are (very) large?
If an implicit method is, indeed, a good solution, what is the fastest method of the inversion of a block tridiagonal matrix? At the moment I just perform a simple Gauss method avoiding redundant operations which arise due to the specific structure of the matrix.
Additional info and details that might help us:
I use Fortran 95.
I currently consider g_1(t) = h_1(t) = g_2(t) = h_2(t) = -iAF(t)sin(omega*t), where i is the imaginary unit, A and omega are given constants, and F(t) is a smooth envelope going slowly, first, from 0 to 1 and then from 1 to 0, so F(0) = F(T) = 0.
Initially u_j = v_j = 0 unless j = 0. The functions u_j and v_j with great absolute values of j are extremely small for all t, so the initial peak does not reach the "boundaries".
To 1) There will be no stable explicit method if your coupling functions are very large. This is because the stability region of explicit (Runge-Kutta) methods is bounded.
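For the fourth-order Runge-Kutta method mentioned in the question, for example, a step of size dt is stable only where |R(z)| <= 1 with

R(z) = 1 + z + z^2/2 + z^3/6 + z^4/24,   z = lambda*dt,

where lambda runs over the eigenvalues of the Jacobian of the system. That set is a bounded region of the complex plane, so large coupling terms g_i, h_i (and hence large eigenvalues) force an impractically small dt.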
To 2) If your matrices are larger than 100x100, you could use this method:
Inverses of Block Tridiagonal Matrices and Rounding Errors.
I recently wanted to perform a simple CUDA matrix-vector multiplication. I found what looked like a suitable function in the cuBLAS library: cublas<t>gbmv. Here is the official documentation.
But it is actually very poor, so I didn't manage to understand what the kl and ku parameters mean. Moreover, I have no idea what the stride is (it must also be provided).
There is a brief explanation of these parameters (Page 37), but it looks like I need to know something else.
A search on the internet doesn't provide tons of useful information on this question, mostly references to different versions of the documentation.
So I have several questions to GPU/CUDA/cublas gurus:
How do I find more understandable docs or guides about using cublas?
If you know how to use this particular function, could you explain how to use it?
Maybe the cuBLAS library is somehow unusual and everyone uses something more popular, better documented, and so on?
Thanks a lot.
So BLAS (Basic Linear Algebra Subprograms) generally is an API to, as the name says, basic linear algebra routines. It includes vector-vector operations (level 1 blas routines), matrix-vector operations (level 2) and matrix-matrix operations (level 3). There is a "reference" BLAS available that implements everything correctly, but most of the time you'd use an optimized implementation for your architecture. cuBLAS is an implementation for CUDA.
The BLAS API was so successful as an API that describes the basic operations that it's become very widely adopted. However, (a) the names are incredibly cryptic because of architectural limitations of the day (this was 1979, and the API was defined using names of 8 characters or less to ensure it could widely compile), and (b) it is successful because it's quite general, and so even the simplest function calls require a lot of extraneous arguments.
Because it's so widespread, it's often assumed that if you're doing numerical linear algebra, you already know the general gist of the API, so implementation manuals often leave out important details, and I think that's what you're running into.
The Level 2 and 3 routines generally have function names of the form TMMOO.. where T is the numerical type of the matrix/vector (S/D for single/double precision real, C/Z for single/double precision complex), MM is the matrix type (GE for general - eg, just a dense matrix you can't say anything else about; GB for a general banded matrix, SY for symmetric matrices, etc), and OO is the operation.
This all seems slightly ridiculous now, but it worked and works relatively well -- you quickly learn to scan these for familiar operations so that SGEMV is a single-precision general-matrix times vector multiplication (which is probably what you want, not SGBMV), DGEMM is double-precision matrix-matrix multiply, etc. But it does take some practice.
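As a worked example of the naming scheme, the cublas<t>gbmv from the question decodes as <t> (the numerical type: S, D, C, or Z) + GB (general banded matrix) + MV (matrix-vector multiply); its kl and ku arguments are the numbers of sub- and super-diagonals of the band, machinery you don't need for a plain dense matrix-vector product.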
So if you look at the cublas sgemv instructions, or in the documentation of the original, you can step through the argument list. First, the basic operation is
This function performs the matrix-vector multiplication

y = alpha * op(A) * x + beta * y

where A is an m x n matrix stored in column-major format, x and y are vectors, and alpha and beta are scalars.
where op(A) can be A, A^T, or A^H. So if you just want y = Ax, as is the common case, then alpha = 1, beta = 0, and transa == CUBLAS_OP_N.
incx is the stride between consecutive elements in x (and likewise incy for y); there are lots of situations where this comes in handy, but if x is just a simple 1d array containing the vector, then the stride would be 1.
And that's about all you need for SGEMV.
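To make that concrete, here is a minimal sketch of the call (the wrapper name and pointer names are just placeholders), assuming a cublasHandle_t created with cublasCreate and device arrays d_A (m x n, column-major), d_x (length n), and d_y (length m) already allocated and filled; error checking is omitted.
#include <cublas_v2.h>

// y = A*x for a dense, column-major m x n matrix A (no transpose).
void dense_gemv(cublasHandle_t handle, int m, int n,
                const float* d_A, const float* d_x, float* d_y)
{
    const float alpha = 1.0f;   // y = 1*A*x + 0*y
    const float beta  = 0.0f;
    // lda = m because A is stored column-major with no padding;
    // incx = incy = 1 because x and y are plain contiguous arrays.
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m,
                d_x, 1,
                &beta, d_y, 1);
}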
Started learning octave recently. How do I generate a matrix from another matrix by applying a function to each element?
eg:
Apply 2x+1 or 2x/(x^2+1) or 1/x+3 to a 3x5 matrix A.
The result should be a 3x5 matrix B whose entries are, say, 2x+1 evaluated at each element x of A.
For example, if A(1,1) = 1, then after the operation B(1,1) = 2*1 + 1 = 3.
My main concern is how to apply a function that uses the value of x, like taking 1/x or the other expressions indicated above.
regards.
You can try
B = A.*2 + 1
The . prefix makes the following operator (* in this case) apply element by element to the matrix.
You will find a lot of documentation for Octave in the distribution package and on the web. Even better, you can usually also rely on the extensive MATLAB documentation.
ADDED. For more complex operations you can use arrayfun(), e.g.
B = arrayfun(@(x) 2*x/(x^2+1), A)