"dimension too large" error when broadcasting to sparse matrix in octave - octave

32-bit Octave has a limit on the maximum number of elements in an array. I have recompiled from source (following the script at https://github.com/calaba/octave-3.8.2-enable-64-ubuntu-14.04 ), and now have 64-bit indexing.
Nevertheless, when I attempt to perform elementwise multiplication using a broadcast function, I get error: out of memory or dimension too large for Octave's index type
Is this a bug, or an undocumented feature? If it's a bug, does anyone have a reasonably efficient workaround?
Minimal code to reproduce the problem:
function indexerror();
% both of these are formed without error
% a = zeros (2^32, 1, 'int8');
% b = zeros (1024*1024*1024*3, 1, 'int8');
% sizemax % returns 9223372036854775806
nnz = 1000 % number of non-zero elements
rowmax = 250000
colmax = 100000
irow = zeros(1,nnz);
icol = zeros(1,nnz);
for ind =1:nnz
irow(ind) = round(rowmax/nnz*ind);
icol(ind) = round(colmax/nnz*ind);
end
sparseMat = sparse(irow,icol,1,rowmax,colmax);
% column vector to be broadcast
broad = 1:rowmax;
broad = broad(:);
% this gives "dimension too large" error
toobig = bsxfun(#times,sparseMat,broad);
% so does this
toobig2 = sparse(repmat(broad,1,size(sparseMat,2)));
mult = sparse( sparseMat .* toobig2 ); % never made it this far
end
EDIT:
Well, I have an inefficient workaround. It's slower than using bsxfun by a factor of 3 or so (depending on the details), but it's better than having to sort through the error in the libraries. Hope someone finds this useful some day.
% loop over rows, instead of using bsxfun
mult_loop = sparse([],[],[],rowmax,colmax);
for ind =1:length(broad);
mult_loop(ind,:) = broad(ind) * sparseMat(ind,:);
end

The unfortunate answer is that yes, this is a bug. Apparently #bsxfun and repmat are returning full matrices rather than sparse. Bug has been filed here:
http://savannah.gnu.org/bugs/index.php?47175

Related

Non-linear fit Gnu Octave

I have a problem in performing a non linear fit with Gnu Octave. Basically I need to perform a global fit with some shared parameters, while keeping others fixed.
The following code works perfectly in Matlab, but Octave returns an error
error: operator *: nonconformant arguments (op1 is 34x1, op2 is 4x1)
Attached my code and the data to play with:
clear
close all
clc
pkg load optim
D = dlmread('hd', ';'); % raw data
bkg = D(1,2:end); % 4 sensors bkg
x = D(2:end,1); % input signal
Y = D(2:end,2:end); % 4 sensors reposnse
W = 1./Y; % weights
b0 = [7 .04 .01 .1 .5 2 1]; % educated guess for start the fit
%% model function
F = #(b) ((bkg + (b(1) - bkg).*(1-exp(-(b(2:5).*x).^b(6))).^b(7)) - Y) .* W;
opts = optimset("Display", "iter");
lb = [5 .001 .001 .001 .001 .01 1];
ub = [];
[b, resnorm, residual, exitflag, output, lambda, Jacob\] = ...
lsqnonlin(F,b0,lb,ub,opts)
To give more info, giving array b0, b0(1), b0(6) and b0(7) are shared among the 4 dataset, while b0(2:5) are peculiar of each dataset.
Thank you for your help and suggestions! ;)
Raw data:
0,0.3105,0.31342,0.31183,0.31117
0.013229,0.329,0.3295,0.332,0.372
0.013229,0.328,0.33,0.33,0.373
0.021324,0.33,0.3305,0.33633,0.399
0.021324,0.325,0.3265,0.333,0.397
0.037763,0.33,0.3255,0.34467,0.461
0.037763,0.327,0.3285,0.347,0.456
0.069405,0.338,0.3265,0.36533,0.587
0.069405,0.3395,0.329,0.36667,0.589
0.12991,0.357,0.3385,0.41333,0.831
0.12991,0.358,0.3385,0.41433,0.837
0.25368,0.393,0.347,0.501,1.302
0.25368,0.3915,0.3515,0.498,1.278
0.51227,0.458,0.3735,0.668,2.098
0.51227,0.47,0.3815,0.68467,2.124
1.0137,0.61,0.4175,1.008,3.357
1.0137,0.599,0.422,1,3.318
2.0162,0.89,0.5335,1.645,5.006
2.0162,0.872,0.5325,1.619,4.938
4.0192,1.411,0.716,2.674,6.595
4.0192,1.418,0.7205,2.691,6.766
8.0315,2.34,1.118,4.195,7.176
8.0315,2.33,1.126,4.161,6.74
16.04,3.759,1.751,5.9,7.174
16.04,3.762,1.748,5.911,7.151
32.102,5.418,2.942,7.164,7.149
32.102,5.406,2.941,7.164,7.175
64.142,7.016,4.478,7.174,7.176
64.142,7.018,4.402,7.175,7.175
128.32,7.176,6.078,7.175,7.176
128.32,7.175,6.107,7.175,7.173
255.72,7.165,7.162,7.165,7.165
255.72,7.165,7.164,7.166,7.166
511.71,7.165,7.165,7.165,7.165
511.71,7.165,7.165,7.166,7.164
Giving the function definition above, if you call it by F(b0) in the command windows, you will get a 34x4 matrix which is correct, since variable Y has the same size.
In that way I can (in theory) compute the standard formula for lsqnonlin (fit - measured)^2

Removing DC component for matrix in chuncks in octave

I'm new to octave and if this as been asked and answered then I'm sorry but I have no idea what the phrase is for what I'm looking for.
I trying to remove the DC component from a large matrix, but in chunks as I need to do calculations on each chuck.
What I got so far
r = dlmread('test.csv',';',0,0);
x = r(:,2);
y = r(:,3); % we work on the 3rd column
d = 1
while d <= (length(y) - 256)
e = y(d:d+256);
avg = sum(e) / length(e);
k(d:d+256) = e - avg; % this is the part I need help with, how to get the chunk with the right value into the matrix
d += 256;
endwhile
% to check the result I like to see it
plot(x, k, '.');
if I change the line into:
k(d:d+256) = e - 1024;
it works perfectly.
I know there is something like an element-wise operation, but if I use e .- avg I get this:
warning: the '.-' operator was deprecated in version 7
and it still doesn't do what I expect.
I must be missing something, any suggestions?
GNU Octave, version 7.2.0 on Linux(Manjaro).
Never mind the code works as expected.
The result (K) got corrupted because the chosen chunk size was too small for my signal. Changing 256 to 4096 got me a better result.
+ and - are always element-wise. Beware that d:d+256 are 257 elements, not 256. So if then you increment d by 256, you have one overlaying point.

why the result is different between cpu and gpu?

This is my code running on GPU
tid=threadidx%x
bid=blockidx%x
bdim=blockdim%x
isec = mesh_sec_1(lev)+bid-1
if (isec .le. mesh_sec_0(lev)) then
if(.not. sec_is_int(isec)) return
do iele = tid, sec_n_ele(isec), bdim
idx = n_ele_idx(isec)+iele
u(1:5) = fv_u(1:5,idx)
u(6 ) = fv_t( idx)
g = 0.0d0
do j= sec_iA_ls(idx), sec_iA_ls(idx+1)-1
ss = sec_jA_ls(1,j)
ee = sec_jA_ls(2,j)
tem = n_ele_idx(ss)+ee
du(1:5) = fv_u(1:5, n_ele_idx(ss)+ee)-u(1:5)
du(6 ) = fv_t( n_ele_idx(ss)+ee)-u(6 )
coe(1:3) = sec_coe_ls(1:3,j)
do k=1,6
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
end do
end do
do j=1,6
do i=1,3
fv_gra(i+(j-1)*3,idx)=g(i,j)
end do
end do
end do
end if
and next is my code running on CPU
do isec = h_mesh_sec_1(lev),h_mesh_sec_0(lev)
if(.not. h_sec_is_int(isec)) cycle
do iele=1,h_sec_n_ele(isec)
idx = h_n_ele_idx(isec)+iele
u(1:5) = h_fv_u(1:5,idx)
u(6 ) = h_fv_t( idx)
g = 0.0d0
do j= h_sec_iA_ls(idx),h_sec_iA_ls(idx+1)-1
ss = h_sec_jA_ls(1,j)
ee = h_sec_jA_ls(2,j)
du(1:5) = h_fv_u(1:5,h_n_ele_idx(ss)+ee)-u(1:5)
du(6 ) = h_fv_t( h_n_ele_idx(ss)+ee)-u(6 )
do k=1,6
g(1:3,k)= g(1:3,k) + du(k)*h_sec_coe_ls(1:3,j)
end do
end do
do j=1,6
do i=1,3
h_fv_gra(i+(j-1)*3,idx) = g(i,j)
enddo
enddo
end do
end do
The variable between h_* and * shows it belong to cpu and gpu separately.
The result is same at many points, but at some points they are a little different. I add the check code like this.
do i =1,size(h_fv_gra,1)
do j = 1,size(h_fv_gra,2)
if(hd_fv_gra(i,j)-h_fv_gra(i,j) .ge. 1.0d-9) then
print *,hd_fv_gra(i,j)-h_fv_gra(i,j),i,j
end if
end do
end do
The hd_* is a copy of the gpu result. we can see the difference:
1.8626451492309570E-009 13 14306
1.8626451492309570E-009 13 14465
1.8626451492309570E-009 13 14472
1.8626451492309570E-009 14 14128
1.8626451492309570E-009 14 14146
1.8626451492309570E-009 14 14150
1.8626451492309570E-009 14 14153
1.8626451492309570E-009 14 14155
1.8626451492309570E-009 14 14156
So I am confused about that. The precision of Cuda should not as large as this. Any reply will be welcomed.
In addition, I don't know how to print the variables in GPU codes, which can help me debug.
In your code, calculation of g value most probably benefits from Fused Multiply Add (fma) optimization in CUDA.
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
On the CPU side, this is not impossible but strongly depends on compiler choices (and the actual CPU running the code should it implement fma).
To enforce usage of separate multiply and add, you want to use intrinsics from CUDA, as defined here, such as :
__device__ ​ double __dadd_rn ( double x, double y ) Add two floating point values in round-to-nearest-even mode.
and
__device__ ​ double __dmul_rn ( double x, double y ) Multiply two floating point values in round-to-nearest-even mode.
with a rounding mode identical to the one defined on CPU (it depends on the CPU architecture, whether it be Power or Intel x86 or other).
Alternate approach is to pass the --fmad false option to ptxas when compiling cuda, using the --ptxas-options option from nvcc detailed here.

Using arrayfun to apply two arguments of a function on every combination

Let i = [1 2] and j = [3 5]. Now in octave:
arrayfun(#(x,y) x+y,i,j)
we get [4 7]. But I want to apply the function on the combinations of i vs. j to get [i(1)+j(1) i(1)+j(2) i(2)+j(1) i(2)+j(2)]=[4 6 5 7].
How do I accomplish this? I know I can go with for-loopsl but I want vectorized-code because it's faster.
In Octave, for finding summations between two vectors, you can use a truly vectorized approach with broadcasting like so -
out = reshape(ii(:).' + jj(:),[],1)
Here's a runtime test on ideone for the input vectors of size 1 x 100 each -
-------------------- With FOR-LOOP
Elapsed time is 0.148444 seconds.
-------------------- With BROADCASTING
Elapsed time is 0.00038299 seconds.
If you want to keep it generic to accommodate operations other than just summations, you can use anonymous functions like so -
func1 = #(I,J) I+J;
out = reshape(func1(ii,jj.'),1,[])
In MATLAB, you could accomplish the same with two bsxfun alternatives as listed next.
I. bsxfun with Anonymous Function -
func1 = #(I,J) I+J;
out = reshape(bsxfun(func1,ii(:).',jj(:)),1,[]);
II. bsxfun with Built-in #plus -
out = reshape(bsxfun(#plus,ii(:).',jj(:)),1,[]);
With the input vectors of size 1 x 10000 each, the runtimes at my end were -
-------------------- With FOR-LOOP
Elapsed time is 1.193941 seconds.
-------------------- With BSXFUN ANONYMOUS
Elapsed time is 0.252825 seconds.
-------------------- With BSXFUN BUILTIN
Elapsed time is 0.215066 seconds.
First, your first example is not the best because the most efficient way to accomplish what you're doing with arrayfun would be to vectorize:
a = [1 2];
b = [3 5];
out = a+b
Second, in Matlab at least, arrayfun is not necessarily faster than a simple for loop. arrayfun is mainly a convenience (especially for it's more advanced options). Try this simple timing example yourself:
a = 1:1e5;
b = a+1;
y = arrayfun(#(x,y)x+y,a,b); % Warm up
tic
y = arrayfun(#(x,y)x+y,a,b);
toc
y = zeros(1,numel(a));
for k = 1:numel(a)
y(k) = a(k)+b(k); % Warm up
end
tic
y = zeros(1,numel(a));
for k = 1:numel(a)
y(k) = a(k)+b(k);
end
toc
In Matlab R2015a, the for loop method is over 70 times faster run from the Command window and over 260 times faster when run from an M-file function. Octave may be different, but you should experiment.
Finally, you can accomplish what you want using meshgrid:
a = [1 2];
b = [3 5];
[x,y] = meshgrid(a,b);
out = x(:).'+y(:).'
which returns [4 6 5 7] as in your question. You can also use ndgrid to get output in a different order.

How to calculate a large size FFT using smaller sized FFTs?

If I have an FFT implementation of a certain size M (power of 2), how can I calculate the FFT of a set of size P=k*M, where k is a power of 2 as well?
#define M 256
#define P 1024
complex float x[P];
complex float X[P];
// Use FFT_M(y) to calculate X = FFT_P(x) here
[The question is expressed in a general sense on purpose. I know FFT calculation is a huge field and many architecture specific optimizations were researched and developed, but what I am trying to understand is how is this doable in the more abstract level. Note that I am no FFT (or DFT, for that matter) expert, so if an explanation can be laid down in simple terms that would be appreciated]
Here's an algorithm for computing an FFT of size P using two smaller FFT functions, of sizes M and N (the original question call the sizes M and k).
Inputs:
P is the size of the large FFT you wish to compute.
M, N are selected such that MN=P.
x[0...P-1] is the input data.
Setup:
U is a 2D array with M rows and N columns.
y is a vector of length P, which will hold FFT of x.
Algorithm:
step 1. Fill U from x by columns, so that U looks like this:
x(0) x(M) ... x(P-M)
x(1) x(M+1) ... x(P-M+1)
x(2) x(M+2) ... x(P-M+2)
... ... ... ...
x(M-1) x(2M-1) ... x(P-1)
step 2. Replace each row of U with its own FFT (of length N).
step 3. Multiply each element of U(m,n) by exp(-2*pi*j*m*n/P).
step 4. Replace each column of U with its own FFT (of length M).
step 5. Read out the elements of U by rows into y, like this:
y(0) y(1) ... y(N-1)
y(N) y(N+1) ... y(2N-1)
y(2N) y(2N+1) ... y(3N-1)
... ... ... ...
y(P-N) y(P-N-1) ... y(P-1)
Here is MATLAB code which implements this algorithm. You can test it by typing fft_decomposition(randn(256,1), 8);
function y = fft_decomposition(x, M)
% y = fft_decomposition(x, M)
% Computes FFT by decomposing into smaller FFTs.
%
% Inputs:
% x is a 1D array of the input data.
% M is the size of one of the FFTs to use.
%
% Outputs:
% y is the FFT of x. It has been computed using FFTs of size M and
% length(x)/M.
%
% Note that this implementation doesn't explicitly use the 2D array U; it
% works on samples of x in-place.
q = 1; % Offset because MATLAB starts at one. Set to 0 for C code.
x_original = x;
P = length(x);
if mod(P,M)~=0, error('Invalid block size.'); end;
N = P/M;
% step 2: FFT-N on rows of U.
for m = 0 : M-1
x(q+(m:M:P-1)) = fft(x(q+(m:M:P-1)));
end;
% step 3: Twiddle factors.
for m = 0 : M-1
for n = 0 : N-1
x(m+n*M+q) = x(m+n*M+q) * exp(-2*pi*j*m*n/P);
end;
end;
% step 4: FFT-M on columns of U.
for n = 0 : N-1
x(q+n*M+(0:M-1)) = fft(x(q+n*M+(0:M-1)));
end;
% step 5: Re-arrange samples for output.
y = zeros(size(x));
for m = 0 : M-1
for n = 0 : N-1
y(m*N+n+q) = x(m+n*M+q);
end;
end;
err = max(abs(y-fft(x_original)));
fprintf( 1, 'The largest error amplitude is %g\n', err);
return;
% End of fft_decomposition().
kevin_o's response worked quite well. I took his code and eliminated the loops using some basic Matlab tricks. It functionally is identical to his version
function y = fft_decomposition(x, M)
% y = fft_decomposition(x, M)
% Computes FFT by decomposing into smaller FFTs.
%
% Inputs:
% x is a 1D array of the input data.
% M is the size of one of the FFTs to use.
%
% Outputs:
% y is the FFT of x. It has been computed using FFTs of size M and
% length(x)/M.
%
% Note that this implementation doesn't explicitly use the 2D array U; it
% works on samples of x in-place.
q = 1; % Offset because MATLAB starts at one. Set to 0 for C code.
x_original = x;
P = length(x);
if mod(P,M)~=0, error('Invalid block size.'); end;
N = P/M;
% step 2: FFT-N on rows of U.
X=fft(reshape(x,M,N),[],2);
% step 3: Twiddle factors.
X=X.*exp(-j*2*pi*(0:M-1)'*(0:N-1)/P);
% step 4: FFT-M on columns of U.
X=fft(X);
% step 5: Re-arrange samples for output.
x_twiddle=bsxfun(#plus,M*(0:N-1)',(0:M-1))+q;
y=X(x_twiddle(:));
% err = max(abs(y-fft(x_original)));
% fprintf( 1, 'The largest error amplitude is %g\n', err);
return;
% End of fft_decomposition()
You could just use the last log2(k) passes of a radix-2 FFT, assuming the previous FFT results are from appropriately interleaved data subsets.
Well an FFT is basically a recursive type of Fourier Transform. It relies on the fact that as wikipedia puts it:
The best-known FFT algorithms depend upon the factorization of N, but there are FFTs with O(N log N) complexity for >all N, even for prime N. Many FFT algorithms only depend on the fact that e^(-2pi*i/N) is an N-th primitive root of unity, and >thus can be applied to analogous transforms over any finite field, such as number-theoretic transforms. Since the >inverse DFT is the same as the DFT, but with the opposite sign in the exponent and a 1/N factor, any FFT algorithm >can easily be adapted for it.
So this has pretty much already been done in the FFT. If you are talking about getting longer period signals out of your transform you are better off doing an DFT over the data sets of limited frequencies. There might be a way to do it from the frequency domain but IDK if anyone has actually done it. You could be the first!!!! :)