Speed of finding non-zero elements in (non-)diagonal matrix - octave

What's happening in this example?
Looking at the full matrix, the find operation is faster on the non-diagonal matrices.
Contrarily, getting the sparse representation is faster on the diagonal matrix (which seems reasonable).
The find operation on the sparse matrices is almost equal.
For curiosity, can someone tell me whats happening under the hood here? Why is finding non-zero elements on a full matrix faster than finding them on a diagonal matrix?
printf("Diagonal Mat:\n\n")
A = eye(10000);
printf("Full mat: ")
tic
find(A);
toc
printf("Building sparse representation: ")
tic
As = sparse(A);
toc
printf("Sparse mat: ")
tic
find(As);
toc
printf("\n\nNon-Diagonally flagged Mat:\n\n")
A = A | A; # This removes the "Diagonal Matrix" flag from A
printf("Full mat: ")
tic
find(A);
toc
printf("Building sparse representation: ")
tic
As = sparse(A);
toc
printf("Sparse mat: ")
tic
find(As);
toc
printf("\n\nActually Non-Diagonal Mat:\n\n")
A(:,:) = 0;
A(:,1) = 1;
printf("Full mat: ")
tic
find(A);
toc
printf("Building sparse representation: ")
tic
As = sparse(A);
toc
printf("Sparse mat: ")
tic
find(As);
toc
Output:
Diagonal Mat:
Full mat: Elapsed time is 0.204636 seconds.
Building sparse representation: Elapsed time is 5.19753e-05 seconds.
Sparse mat: Elapsed time is 7.60555e-05 seconds.
Non-Diagonally flagged Mat:
Full mat: Elapsed time is 0.0800331 seconds.
Building sparse representation: Elapsed time is 0.0924602 seconds.
Sparse mat: Elapsed time is 7.48634e-05 seconds.
Actually Non-Diagonal Mat:
Full mat: Elapsed time is 0.0799708 seconds.
Building sparse representation: Elapsed time is 0.092248 seconds.
Sparse mat: Elapsed time is 7.70092e-05 seconds.

First of all, the following would be a nicer way to measure that:
for i = 1:10, find (d); endfor
t = cputime ();
for i = 1:100, find (d); endfor
cputime () -t
for i = 1:10, find (f); endfor
t = cputime ();
for i = 1:100, find (f); endfor
cputime () -t
That's a good question. Octave has an internal specialisation for diagonal matrices where only the diagonal values are stored. You can see how much less memory it uses:
octave> d = eye (10000);
octave> f = full (eye (10000));
octave> typeinfo (d)
ans = diagonal matrix
octave> typeinfo (f)
ans = matrix
octave> whos d f
Variables in the current scope:
Attr Name Size Bytes Class
==== ==== ==== ===== =====
d 10000x10000 80000 double
f 10000x10000 800000000 double
Total is 200000000 elements using 800080000 bytes
The specialisation is for reduced memory and performance for cases where diagonal matrices are common. The problem with such specialisations is that they add special cases all over the place, specially when you want to access the data directly which Octave often does.
In the case of find, it has special cases for boolean arrays, integer arrays, permutation matrices, and sparse matrices. There is no special handling for diagonal matrices so the the case for real type double precision array is used instead. This means that diagonal matrices get internally converted to full arrays when calling find anyway.
Weirdly, calling full on the diagonal matrix before calling find seems to still be more efficient, so maybe my reasoning is wrong. I have opened a performance bug report

It has something to do with how your computer (or in detail, your cpu) is processing and buffering values on the stack and heap. When you allocate memory on the heap for your arrays, it will do so allocating a "list" of values, one after the other. Thus, when you iterate over an array, value by value, the cpu will just jump from one value on that list to the next one (very fast) but if you jump from value i to value i + n, where n is not 1, that means the cpu has to find this next value somewhere else on that list. Therefore, the way you save values in your programming environment and the way you iterate over those values influences the speed of the following process.
This was just a short and very simple try to explain this topic. Actually, it is more complicated as more factors and different cpu and memory techniques are possible. If you are interested in this kind of things, I would suggest starting here: https://en.wikipedia.org/wiki/System_programming (To have a Top Down view at the topic).

Related

Chisel Synchronization

Here is a description of one of the states in my state machine. What I would like to do is to go to the next state after the for loops.
is(s_multiplier){
when(ready){state := s_ready}
// Initialization of C memory to 0
for(i <- 0 to matrixSize - 1){
for(j <- 0 to matrixSize - 1){
memC.write(i + j, 0.asSInt((2 * cellSize).W))
}
}
// Objective 1 : Multiplication for the 128X128
// Objective 2 : Multiplication for the n.m and m.p size parameters given
for(i <- 0 to matrixSize - 1){
for(j <- 0 to matrixSize - 1){
sum := 0.asSInt(cellSize.W)
for(k <- 0 to matrixSize - 1){
sum = sum + memA.read(i * matrixSize + k, true.B) * memB.read(k * matrixSize + j, true.B)
}
memC.write(i * matrixSize + j, sum)
}
}
ready := true.B
}
I just created a boolean variable ready that I put to true after the loops. But as everything is supposed to be executed in parallel, I Don't think that my code is correct :/
There is a fundamental difference between writing software algorithms and using chisel to construct the hardware necessary to perform equivalent calculations.
Before discussing the matrix multiplication, consider (as a simpler example) your memory initialization operation loop. The way you have done it makes sense, but for hardware every time the inner body of the loop is executed the hardware necessary to init that memory cell is added to the hardware graph. That means you have created the necessary wires to initialize 16384 memory locations all at the same time. That a lot of wires. Not only that, it would require a memory that has 16384 write ports (you probably can't find that). Your hardware would initialize all this memory in one clock cycle, which is good, but by devoting an enormous number of gates to do so.
Typically one would initialize memory over a number of clock cycles and in this way reducing the amount of hardware required.
Similarly in the matrix multiplication section you are generating all the hardware necessary to compute a matrix multiplication in 1 clock cycle. This is great for performance but the number of multiplications required for this approach is 2,097,152 hardware multipliers plus a further large number of adders. Every * and + operation in the inner loop generates hardware. The number of gates required to multiply two 32 bit numbers is roughly 1024 gates.
The way to go about this is to figure out a way of breaking down the problem into stages. Maybe this would be module that can multiply one row by one column and sum the total. You would then need to use registers to work your way through the matrix, keeping track of the row and columns in order to compute the value at every point in the result matrix. In order to reduce the number of hardware elements you instead perform the calculation over multiple clock cycles keeping state information (indices to the rows and columns) on the progress of the calculation in registers or in memory.
There's a lot of ways to try and optimize a function this and Chisel is a great language for experimenting and testing out tactics.
Maybe you want to make the memory very wide to accommodate getting multiple cell values at once.
Maybe you will unroll your loop a bit more to compute multiple cell values at once by having more than one cell calculator.
Clever iteration strategies can optimize your memory accesses for both reading and writing.
The point is that writing hardware is not necessary harder than writing software (and Chisel helps there) but it is pretty different in the approach.
I would recommend you spend a little more time with Chisel bootcamp. The 2.3_control_flow page's section on sorting is pretty similar with respect to the discussion above. You can write a one cycle sorter but the size of the hardware to do it grows rapidly, in practice it is necessary to break the problem into pieces and spread the calculation over multiple cycles.
Good luck.

Calculating PI with Fortran & CUDA

I am trying to make a simple program in PGI's fortran compiler. This simple program will use the graphics card to calculate pi using the "dart board" algorithm. After battling with this program for quite some time now I have finally got it to behave for the most part. However, I am currently stuck on passing back the results properly. I must say, this is a rather tricky program to debug since I can no longer shove any print statements into the subroutine. This program currently returns all zeros. I am not really sure what is going on, but I have two ideas. Both of which I am not sure how to fix:
The CUDA kernel is not running somehow?
I am not converting the values properly? pi_parts = pi_parts_d
Well, this is the status of my current program. All variables with _d on the end stand for the CUDA prepared device memory where all the other variables (with the exception of the CUDA kernel) are typical Fortran CPU prepared variables. Now there are some print statements I have commented out that I have already tried out from CPU Fortran land. These commands were to check if I really was generating the random numbers properly. As for the CUDA method, I have currently commented out the calculations and replaced z to statically equal to 1 just to see something happen.
module calcPi
contains
attributes(global) subroutine pi_darts(x, y, results, N)
use cudafor
implicit none
integer :: id
integer, value :: N
real, dimension(N) :: x, y, results
real :: z
id = (blockIdx%x-1)*blockDim%x + threadIdx%x
if (id .lt. N) then
! SQRT NOT NEEDED, SQRT(1) === 1
! Anything above and below 1 would stay the same even with the applied
! sqrt function. Therefore using the sqrt function wastes GPU time.
z = 1.0
!z = x(id)*x(id)+y(id)*y(id)
!if (z .lt. 1.0) then
! z = 1.0
!else
! z = 0.0
!endif
results(id) = z
endif
end subroutine pi_darts
end module calcPi
program final_project
use calcPi
use cudafor
implicit none
integer, parameter :: N = 400
integer :: i
real, dimension(N) :: x, y, pi_parts
real, dimension(N), device :: x_d, y_d, pi_parts_d
type(dim3) :: grid, tBlock
! Initialize the random number generaters seed
call random_seed()
! Make sure we initialize the parts with 0
pi_parts = 0
! Prepare the random numbers (These cannot be generated from inside the
! cuda kernel)
call random_number(x)
call random_number(y)
!write(*,*) x, y
! Convert the random numbers into graphics card memory land!
x_d = x
y_d = y
pi_parts_d = pi_parts
! For the cuda kernel
tBlock = dim3(256,1,1)
grid = dim3((N/tBlock%x)+1,1,1)
! Start the cuda kernel
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
! Transform the results into CPU Memory
pi_parts = pi_parts_d
write(*,*) pi_parts
write(*,*) 'PI: ', 4.0*sum(pi_parts)/N
end program final_project
EDIT TO CODE:
Changed various lines to reflect the fixes mentioned by: Robert Crovella. Current status: error caught by cuda-memcheck revealing: Program hit error 8 on CUDA API call to cudaLaunch on my machine.
If there is any method I can use to test this program please let me know. I am throwing darts and seeing where they land for my current style of debugging with CUDA. Not the most ideal, but it will have to do until I find another way.
May the Fortran Gods have mercy on my soul at this dark hour.
When I compile and run your program I get a segfault. This is due to the last parameter you are passing to the kernel (N_d):
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N_d)
Since N is a scalar quantity, the kernel is expecting to use it directly, rather than as a pointer. So when you pass a pointer to device data (N_d), the process of setting up the kernel generates a seg fault (in host code!) as it attempts to access the value N, which should be passed directly as:
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
When I make that change to the code you have posted, I then get actual printed output (instead of a seg fault), which is an array of ones and zeroes (256 ones, followed by 144 zeroes, for a total of N=400 values), followed by the calculated PI value (which happens to be 2.56 in this case (4*256/400), since you have made the kernel basically a dummy kernel).
This line of code is also probably not doing what you want:
grid = dim3(N/tBlock%x,1,1)
With N = 400 and tBlock%x = 256 (from previous code lines), the result of the calculation is 1 (ie. grid ends up at (1,1,1) which amounts to one threadblock). But you really want to launch 2 threadblocks, so as to cover the entire range of your data set (N = 400 elements). There's a number of ways to fix this, but for simplicity let's just always add 1 to the calculation:
grid = dim3((N/tBlock%x)+1,1,1)
Under these circumstances, when we launch grids that are larger (in terms of total threads) than our data set size (512 threads but only 400 data elements in this example) it's customary to put a thread check near the beginning of our kernel (in this case, after the initialization of id), to prevent out-of-bounds accesses, like so:
if (id .lt. N) then
(and a corresponding endif at the very end of the kernel code) This way, only the threads that correspond to actual valid data are allowed to do any work.
With the above changes, your code should be essentially functional, and you should be able to revert your kernel code to the proper statements and start to get an estimate of PI.
Note that you can check the CUDA API for error return codes, and you can also run your code with cuda-memcheck to get an idea of whether the kernel is making out-of-bounds accesses. Niether of these would have helped with this particular seg fault, however.

cublas one function call produced three executions

I called the cublas_Sgemm_v2 function for 10236 times with first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values to the function call are (588, 588, 20).
There are the items listed in the nvprof results.
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected and why is that? Can someone explain what do the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?
I also have other function calls to cublas_Sgemm_v2 with different transpose settings and I only see one item per each function call.
UPDATE:
As #Marco13 asked, I put more results here:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines is looks like certain transposition types can be more efficient than the others, and certain matrix sizes are more economic in terms of computation time over matrix size. What is the guideline of ensuring economic computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, then there is only one item in the nvprof result. The time it take is only a little less than the sum of the two items in the above table. So there is no much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick websearch brings brings some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) are very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all the four cases may be worthwhile.
In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases that have been assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicated that this indeed is the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transosed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.

Operate on all rows of a matrix without loop, octave

Suppose I have a matrix A and I want to obtain the following:
for i=1:m
A(i,:) = something which depends on i;
endfor
Is there a way to obtain that without the loop?
Added: Ok, I've understood I have to be more specific.
I have two matrices B and C (all the matrices we are considering have m rows).
I want to record in the i-th row of A the product of the polynomials written in the i-th rows of B and C (so using the loop I would call the conv function).
Any ideas?
I don't think it's possible to do this without a for loop as conv only accepts vector inputs, but I might be wrong. I can't see a way of using either bsxfun or arrayfun in conjunction with conv and matrix inputs. I might be wrong though... I stand to be corrected.
That is a very very general question, and not possible to answer with more details. Mainly on what i will be involved with. Suppose the following
for i = 1:m
A(i,:) += i;
endfor
It could be written with the much more efficient:
A .+ (1:m)'
Just compare:
octave> n = 1000;
octave> A = B = rand (n);
octave> tic; for i = 1:n, B(i,:) += i; endfor; toc
Elapsed time is 0.051 seconds.
octave> tic; C = A.+ (1:n)'; toc
Elapsed time is 0.01 seconds.
octave> isequal (C, B)
ans = 1
If you have a very old version of octave, you can instead do bsxfun (#plus, A, (i:m)').
However, if i on the right side of the expression will be used for indexing some other variable, then solution would be different. Maybe, the solution is cumsum, or some other of cumfoo function.
Your question is basically, "how do I vectorize code?", which is a really large subject, without telling us what you're trying to vectorize.

any rules of thumb how to smooth FFT spectrum to prevent artifacts when hand-tweaking?

I've got a FFT magnitude spectrum and I want to create a filter from it that selectively passes periodic noise sources (e.g. sinewave spurs) and zero's out the frequency bins associated with the random background noise. I understand sharp transitions in the freq domain will create ringing artifacts once this filter is IFFT back to the time domain... and so I'm wondering if there are any rules of thumb how to smooth the transitions in such a filter to avoid such ringing.
For example, if the FFT has 1M frequency bins, and there are five spurs poking out of the background noise floor, I'd like to zero all bins except the peak bin associated with each of the five spurs. The question is how to handle the neighboring spur bins to prevent artifacts in the time domain. For example, should the the bin on each side of a spur bin be set to 50% amplitude? Should two bins on either side of a spur bin be used (the closest one at 50%, and the next closest at 25%, etc.)? Any thoughts greatly appreciated. Thanks!
I like the following method:
Create the ideal magnitude spectrum (remembering to make it symmetrical about DC)
Inverse transform to the time domain
Rotate the block by half the blocksize
Apply a Hann window
I find it creates reasonably smooth frequency domain results, although I've never tried it on something as sharp as you're suggesting. You can probably make a sharper filter by using a Kaiser-Bessel window, but you have to pick the parameters appropriately. By sharper, I'm guessing maybe you can reduce the sidelobes by 6 dB or so.
Here's some sample Matlab/Octave code. To test the results, I used freqz(h, 1, length(h)*10);.
function [ht, htrot, htwin] = ArbBandPass(N, freqs)
%# N = desired filter length
%# freqs = array of frequencies, normalized by pi, to turn into passbands
%# returns raw, rotated, and rotated+windowed coeffs in time domain
if any(freqs >= 1) || any(freqs <= 0)
error('0 < passband frequency < 1.0 required to fit within (DC,pi)')
end
hf = zeros(N,1); %# magnitude spectrum from DC to 2*pi is intialized to 0
%# In Matlabs FFT, idx 1 -> DC, idx 2 -> bin 1, idx N/2 -> Fs/2 - 1, idx N/2 + 1 -> Fs/2, idx N -> bin -1
idxs = round(freqs * N/2)+1; %# indeces of passband freqs between DC and pi
hf(idxs) = 1; %# set desired positive frequencies to 1
hf(N - (idxs-2)) = 1; %# make sure 2-sided spectrum is symmetric, guarantees real filter coeffs in time domain
ht = ifft(hf); %# this will have a small imaginary part due to numerical error
if any(abs(imag(ht)) > 2*eps(max(abs(real(ht)))))
warning('Imaginary part of time domain signal surprisingly large - is the spectrum symmetric?')
end
ht = real(ht); %# discard tiny imag part from numerical error
htrot = [ht((N/2 + 1):end) ; ht(1:(N/2))]; %# circularly rotate time domain block by N/2 points
win = hann(N, 'periodic'); %# might want to use a window with a flatter mainlobe
htwin = htrot .* win;
htwin = htwin .* (N/sum(win)); %# normalize peak amplitude by compensating for width of window lineshape