why the result is different between cpu and gpu? - cuda

This is my code running on GPU
tid=threadidx%x
bid=blockidx%x
bdim=blockdim%x
isec = mesh_sec_1(lev)+bid-1
if (isec .le. mesh_sec_0(lev)) then
if(.not. sec_is_int(isec)) return
do iele = tid, sec_n_ele(isec), bdim
idx = n_ele_idx(isec)+iele
u(1:5) = fv_u(1:5,idx)
u(6 ) = fv_t( idx)
g = 0.0d0
do j= sec_iA_ls(idx), sec_iA_ls(idx+1)-1
ss = sec_jA_ls(1,j)
ee = sec_jA_ls(2,j)
tem = n_ele_idx(ss)+ee
du(1:5) = fv_u(1:5, n_ele_idx(ss)+ee)-u(1:5)
du(6 ) = fv_t( n_ele_idx(ss)+ee)-u(6 )
coe(1:3) = sec_coe_ls(1:3,j)
do k=1,6
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
end do
end do
do j=1,6
do i=1,3
fv_gra(i+(j-1)*3,idx)=g(i,j)
end do
end do
end do
end if
and next is my code running on CPU
do isec = h_mesh_sec_1(lev),h_mesh_sec_0(lev)
if(.not. h_sec_is_int(isec)) cycle
do iele=1,h_sec_n_ele(isec)
idx = h_n_ele_idx(isec)+iele
u(1:5) = h_fv_u(1:5,idx)
u(6 ) = h_fv_t( idx)
g = 0.0d0
do j= h_sec_iA_ls(idx),h_sec_iA_ls(idx+1)-1
ss = h_sec_jA_ls(1,j)
ee = h_sec_jA_ls(2,j)
du(1:5) = h_fv_u(1:5,h_n_ele_idx(ss)+ee)-u(1:5)
du(6 ) = h_fv_t( h_n_ele_idx(ss)+ee)-u(6 )
do k=1,6
g(1:3,k)= g(1:3,k) + du(k)*h_sec_coe_ls(1:3,j)
end do
end do
do j=1,6
do i=1,3
h_fv_gra(i+(j-1)*3,idx) = g(i,j)
enddo
enddo
end do
end do
The variable between h_* and * shows it belong to cpu and gpu separately.
The result is same at many points, but at some points they are a little different. I add the check code like this.
do i =1,size(h_fv_gra,1)
do j = 1,size(h_fv_gra,2)
if(hd_fv_gra(i,j)-h_fv_gra(i,j) .ge. 1.0d-9) then
print *,hd_fv_gra(i,j)-h_fv_gra(i,j),i,j
end if
end do
end do
The hd_* is a copy of the gpu result. we can see the difference:
1.8626451492309570E-009 13 14306
1.8626451492309570E-009 13 14465
1.8626451492309570E-009 13 14472
1.8626451492309570E-009 14 14128
1.8626451492309570E-009 14 14146
1.8626451492309570E-009 14 14150
1.8626451492309570E-009 14 14153
1.8626451492309570E-009 14 14155
1.8626451492309570E-009 14 14156
So I am confused about that. The precision of Cuda should not as large as this. Any reply will be welcomed.
In addition, I don't know how to print the variables in GPU codes, which can help me debug.

In your code, calculation of g value most probably benefits from Fused Multiply Add (fma) optimization in CUDA.
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
On the CPU side, this is not impossible but strongly depends on compiler choices (and the actual CPU running the code should it implement fma).
To enforce usage of separate multiply and add, you want to use intrinsics from CUDA, as defined here, such as :
__device__ ​ double __dadd_rn ( double x, double y ) Add two floating point values in round-to-nearest-even mode.
and
__device__ ​ double __dmul_rn ( double x, double y ) Multiply two floating point values in round-to-nearest-even mode.
with a rounding mode identical to the one defined on CPU (it depends on the CPU architecture, whether it be Power or Intel x86 or other).
Alternate approach is to pass the --fmad false option to ptxas when compiling cuda, using the --ptxas-options option from nvcc detailed here.

Related

Removing DC component for matrix in chuncks in octave

I'm new to octave and if this as been asked and answered then I'm sorry but I have no idea what the phrase is for what I'm looking for.
I trying to remove the DC component from a large matrix, but in chunks as I need to do calculations on each chuck.
What I got so far
r = dlmread('test.csv',';',0,0);
x = r(:,2);
y = r(:,3); % we work on the 3rd column
d = 1
while d <= (length(y) - 256)
e = y(d:d+256);
avg = sum(e) / length(e);
k(d:d+256) = e - avg; % this is the part I need help with, how to get the chunk with the right value into the matrix
d += 256;
endwhile
% to check the result I like to see it
plot(x, k, '.');
if I change the line into:
k(d:d+256) = e - 1024;
it works perfectly.
I know there is something like an element-wise operation, but if I use e .- avg I get this:
warning: the '.-' operator was deprecated in version 7
and it still doesn't do what I expect.
I must be missing something, any suggestions?
GNU Octave, version 7.2.0 on Linux(Manjaro).
Never mind the code works as expected.
The result (K) got corrupted because the chosen chunk size was too small for my signal. Changing 256 to 4096 got me a better result.
+ and - are always element-wise. Beware that d:d+256 are 257 elements, not 256. So if then you increment d by 256, you have one overlaying point.

"dimension too large" error when broadcasting to sparse matrix in octave

32-bit Octave has a limit on the maximum number of elements in an array. I have recompiled from source (following the script at https://github.com/calaba/octave-3.8.2-enable-64-ubuntu-14.04 ), and now have 64-bit indexing.
Nevertheless, when I attempt to perform elementwise multiplication using a broadcast function, I get error: out of memory or dimension too large for Octave's index type
Is this a bug, or an undocumented feature? If it's a bug, does anyone have a reasonably efficient workaround?
Minimal code to reproduce the problem:
function indexerror();
% both of these are formed without error
% a = zeros (2^32, 1, 'int8');
% b = zeros (1024*1024*1024*3, 1, 'int8');
% sizemax % returns 9223372036854775806
nnz = 1000 % number of non-zero elements
rowmax = 250000
colmax = 100000
irow = zeros(1,nnz);
icol = zeros(1,nnz);
for ind =1:nnz
irow(ind) = round(rowmax/nnz*ind);
icol(ind) = round(colmax/nnz*ind);
end
sparseMat = sparse(irow,icol,1,rowmax,colmax);
% column vector to be broadcast
broad = 1:rowmax;
broad = broad(:);
% this gives "dimension too large" error
toobig = bsxfun(#times,sparseMat,broad);
% so does this
toobig2 = sparse(repmat(broad,1,size(sparseMat,2)));
mult = sparse( sparseMat .* toobig2 ); % never made it this far
end
EDIT:
Well, I have an inefficient workaround. It's slower than using bsxfun by a factor of 3 or so (depending on the details), but it's better than having to sort through the error in the libraries. Hope someone finds this useful some day.
% loop over rows, instead of using bsxfun
mult_loop = sparse([],[],[],rowmax,colmax);
for ind =1:length(broad);
mult_loop(ind,:) = broad(ind) * sparseMat(ind,:);
end
The unfortunate answer is that yes, this is a bug. Apparently #bsxfun and repmat are returning full matrices rather than sparse. Bug has been filed here:
http://savannah.gnu.org/bugs/index.php?47175

Using arrayfun to apply two arguments of a function on every combination

Let i = [1 2] and j = [3 5]. Now in octave:
arrayfun(#(x,y) x+y,i,j)
we get [4 7]. But I want to apply the function on the combinations of i vs. j to get [i(1)+j(1) i(1)+j(2) i(2)+j(1) i(2)+j(2)]=[4 6 5 7].
How do I accomplish this? I know I can go with for-loopsl but I want vectorized-code because it's faster.
In Octave, for finding summations between two vectors, you can use a truly vectorized approach with broadcasting like so -
out = reshape(ii(:).' + jj(:),[],1)
Here's a runtime test on ideone for the input vectors of size 1 x 100 each -
-------------------- With FOR-LOOP
Elapsed time is 0.148444 seconds.
-------------------- With BROADCASTING
Elapsed time is 0.00038299 seconds.
If you want to keep it generic to accommodate operations other than just summations, you can use anonymous functions like so -
func1 = #(I,J) I+J;
out = reshape(func1(ii,jj.'),1,[])
In MATLAB, you could accomplish the same with two bsxfun alternatives as listed next.
I. bsxfun with Anonymous Function -
func1 = #(I,J) I+J;
out = reshape(bsxfun(func1,ii(:).',jj(:)),1,[]);
II. bsxfun with Built-in #plus -
out = reshape(bsxfun(#plus,ii(:).',jj(:)),1,[]);
With the input vectors of size 1 x 10000 each, the runtimes at my end were -
-------------------- With FOR-LOOP
Elapsed time is 1.193941 seconds.
-------------------- With BSXFUN ANONYMOUS
Elapsed time is 0.252825 seconds.
-------------------- With BSXFUN BUILTIN
Elapsed time is 0.215066 seconds.
First, your first example is not the best because the most efficient way to accomplish what you're doing with arrayfun would be to vectorize:
a = [1 2];
b = [3 5];
out = a+b
Second, in Matlab at least, arrayfun is not necessarily faster than a simple for loop. arrayfun is mainly a convenience (especially for it's more advanced options). Try this simple timing example yourself:
a = 1:1e5;
b = a+1;
y = arrayfun(#(x,y)x+y,a,b); % Warm up
tic
y = arrayfun(#(x,y)x+y,a,b);
toc
y = zeros(1,numel(a));
for k = 1:numel(a)
y(k) = a(k)+b(k); % Warm up
end
tic
y = zeros(1,numel(a));
for k = 1:numel(a)
y(k) = a(k)+b(k);
end
toc
In Matlab R2015a, the for loop method is over 70 times faster run from the Command window and over 260 times faster when run from an M-file function. Octave may be different, but you should experiment.
Finally, you can accomplish what you want using meshgrid:
a = [1 2];
b = [3 5];
[x,y] = meshgrid(a,b);
out = x(:).'+y(:).'
which returns [4 6 5 7] as in your question. You can also use ndgrid to get output in a different order.

SPSS syntax of a quadratic term with interaction

How looks the syntax of a regression with a quadratic term and interaction in SPSS? In R the code would be:
fit <- lm(c ~ a*b + a*I(b^2), dat)
or
fit <- lm(c ~ a*(b+I(b^2), dat)
Thanks for help.
Using REGRESSION you need to actually make the variables in the SPSS data file before submitting the command. So if your variables were named the same:
COMPUTE ab = a*b. /*Interaction*/.
COMPUTE bsq = b**2. /*squared term*/.
COMPUTE absq = a*bsq. /*Interaction with squared term*/.
Then these can be placed on the right hand side of your regression equation.
REGRESSION VARIABLES=a,b,bsq,absq,c
/DEPENDENT=c
/METHOD=ENTER a,b,bsq,absq.
I thought you could only do factor variables for the interactions - but I was wrong, you can do continuous variables as well (sorry!). Here is an example using MIXED (still you need to make the seperate variables if using REGRESSION).
INPUT PROGRAM.
LOOP Case = 1 TO 200000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
COMPUTE a = RV.BERNOULLI(0.5).
COMPUTE b = RV.NORMAL(0,1).
COMPUTE ab = a*b /*Interaction*/.
COMPUTE bsq = b**2 /*squared term*/.
COMPUTE absq = a*bsq /*Interaction with squared term*/.
COMPUTE c = 0.5 + 0.2*a + 0.1*b -0.05*ab + .03*bsq -.001*absq + RV.NORMAL(0,1).
VARIABLE LEVEL a (NOMINAL).
RECODE a (0 = 2)(ELSE = COPY).
MIXED c BY a WITH b bsq
/FIXED = a b b*b a*b
/PRINT SOLUTION.

Runge-Kutta 4 With CUDA Fortran

I am trying to convert this FORTRAN program (motion of pendulum) to CUDA FORTRAN but I can use only 1 block with two threads. Is there any way to use more then 2 threads....
MODULE CB
REAL :: Q,B,W
END MODULE CB
PROGRAM PENDULUM
USE CB
IMPLICIT NONE
INTEGER, PARAMETER :: N=10,L=100,M=1
INTEGER :: I,count_rate,count_max,count(2)
REAL :: PI,H,T,Y1,Y2,G1,G1F,G2,G2F
REAL :: DK11,DK21,DK12,DK22,DK13,DK23,DK14,DK24
REAL, DIMENSION (2,N) :: Y
PI = 4.0*ATAN(1.0)
H = 3.0*PI/L
Q = 0.5
B = 0.9
W = 2.0/3.0
Y(1,1) = 0.0
Y(2,1) = 2.0
DO I = 1, N-1
T = H*I
Y1 = Y(1,I)
Y2 = Y(2,I)
DK11 = H*G1F(Y1,Y2,T)
DK21 = H*G2F(Y1,Y2,T)
DK12 = H*G1F((Y1+DK11/2.0),(Y2+DK21/2.0),(T+H/2.0))
DK22 = H*G2F((Y1+DK11/2.0),(Y2+DK21/2.0),(T+H/2.0))
DK13 = H*G1F((Y1+DK12/2.0),(Y2+DK22/2.0),(T+H/2.0))
DK23 = H*G2F((Y1+DK12/2.0),(Y2+DK22/2.0),(T+H/2.0))
DK14 = H*G1F((Y1+DK13),(Y2+DK23),(T+H))
DK24 = H*G2F((Y1+DK13),(Y2+DK23),(T+H))
Y(1,I+1) = Y(1,I)+(DK11+2.0*(DK12+DK13)+DK14)/6.0
Y(2,I+1) = Y(2,I)+(DK21+2.0*(DK22+DK23)+DK24)/6.0
! Bring theta back to the region [-pi,pi]
Y(1,I+1) = Y(1,I+1)-2.0*PI*NINT(Y(1,I+1)/(2.0*PI))
END DO
call system_clock ( count(2), count_rate, count_max )
WRITE (6,"(2F16.8)") (Y(1,I),Y(2,I),I=1,N,M)
END PROGRAM PENDULUM
FUNCTION G1F (Y1,Y2,T) RESULT (G1)
USE CB
IMPLICIT NONE
REAL :: Y1,Y2,T,G1
G1 = Y2
END FUNCTION G1F
FUNCTION G2F (Y1,Y2,T) RESULT (G2)
USE CB
IMPLICIT NONE
REAL :: Y1,Y2,T,G2
G2 = -Q*Y2-SIN(Y1)+B*COS(W*T)
END FUNCTION G2F
CUDA FORTRAN VERSION OF PROGRAM
MODULE KERNEL
CONTAINS
attributes(global) subroutine mykernel(Y_d,N,L,M)
INTEGER,value:: N,L,M
INTEGER ::tid
REAL:: Y_d(:,:)
REAL :: PI,H,T,G1,G1F,G2,G2F
REAL,shared :: DK11,DK21,DK12,DK22,DK13,DK23,DK14,DK24,Y1,Y2
PI = 4.0*ATAN(1.0)
H = 3.0*PI/L
Y_d(1,1) = 0.0
Y_d(2,1) = 2.0
tid=threadidx%x
DO I = 1, N-1
T = H*I
Y1 = Y_d(1,I)
Y2 = Y_d(2,I)
if(tid==1)then
DK11 = H*G1F(Y1,Y2,T)
else
DK21 = H*G2F(Y1,Y2,T)
endif
call syncthreads ()
if(tid==1)then
DK12 = H*G1F((Y1+DK11/2.0),(Y2+DK21/2.0),(T+H/2.0))
else
DK22 = H*G2F((Y1+DK11/2.0),(Y2+DK21/2.0),(T+H/2.0))
endif
call syncthreads ()
if(tid==1)then
DK13 = H*G1F((Y1+DK12/2.0),(Y2+DK22/2.0),(T+H/2.0))
else
DK23 = H*G2F((Y1+DK12/2.0),(Y2+DK22/2.0),(T+H/2.0))
endif
call syncthreads ()
if(tid==1)then
DK14 = H*G1F((Y1+DK13),(Y2+DK23),(T+H))
else
DK24 = H*G2F((Y1+DK13),(Y2+DK23),(T+H))
endif
call syncthreads ()
if(tid==1)then
Y_d(1,I+1) = Y1+(DK11+2.0*(DK12+DK13)+DK14)/6.0
else
Y_d(2,I+1) = Y2+(DK21+2.0*(DK22+DK23)+DK24)/6.0
endif
Y_d(1,I+1) = Y_d(1,I+1)-2.0*PI*NINT(Y_d(1,I+1)/(2.0*PI))
call syncthreads ()
END DO
end subroutine mykernel
attributes(device) FUNCTION G1F (Y1,Y2,T) RESULT (G1)
IMPLICIT NONE
REAL :: Y1,Y2,T,G1
G1 = Y2
END FUNCTION G1F
attributes(device) FUNCTION G2F (Y1,Y2,T) RESULT (G2)
IMPLICIT NONE
REAL :: Y1,Y2,T,G2
G2 = -0.5*Y2-SIN(Y1)+0.9*COS((2.0/3.0)*T)
END FUNCTION G2F
END MODULE KERNEL
PROGRAM PENDULUM
use cudafor
use KERNEL
IMPLICIT NONE
INTEGER, PARAMETER :: N=100000,L=1000,M=1
INTEGER :: I,d,count_max,count_rate
REAL,device :: Y_d(2,N)
REAL, DIMENSION (2,N) :: Y
INTEGER :: count(2)
call mykernel<<<1,2>>>(Y_d,N,L,M)
Y=Y_d
WRITE (6,"(2F16.8)") (Y(1,I),Y(2,I),I=1,N,M)
END PROGRAM PENDULUM
You can see that only two independent threads of execution are possible by doing a data-dependency analysis of your original serial code. It's easiest to think of this as an "outer" and an "inner" part.
The "outer" part is the dependence of Y(1:2,i+1) on Y(1:2,i). At each time step, you need to use the values of Y(1:2,i) to calculate Y(1:2,i+1), so it's not possible to perform the calculations for multiple time steps in parallel, simply because of the serial dependence structure -- you need to know what happens at time i to calculate what happens at time i+1, you need to know what happens at time i+1 to calculate what happens at time i+2, and so on. The best that you can hope to do is to calculate Y(1,i+1) and Y(2,i+1) in parallel, which is exactly what you do.
The "inner" part is based on the dependencies between the intermediate values in the Runge-Kutta scheme, the DK11, DK12, etc. values in your code. When calculating Y(1:2,i+1), each of the DK[n,m] depends on Y(1:2,i) and for m > 1, each of the DK[n,m] depends on both DK[1,m-1] and DK[2,m-1]. If you draw a graph of these dependencies (which my ASCII art skills aren't really good enough for!), you'll see that there are at each step of the calculation only two possible sub-calculations that can be performed in parallel.
The result of all this is that you cannot do better than two parallel threads for this calculation. As one of the commenters above said, you can certainly do much better if you're simulating a particle system or some other mechanical system with multiple independent degrees of freedom, which you can then integrate in parallel.