I'm struggling to get some code running that explores the shared memory features to get a fast matrix multiply. But every time I try this I run into errors that I cannot fathom.
import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
block_size = (m,n)
grid_size = (int(m/n),int(m/n))
@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]
    c[row][column] = sum
and testing
mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()
keeps resulting in the following error:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
After chatting with the provider of one answer, I've updated the function, but I still cannot make this work. To compute the sum for each element in the output c we need to loop over the columns of A and the rows of B, using i as the index. We therefore have n*n products. I think the i is correct in the sum, but I cannot seem to get the correct index for the row and column of a and b in the expression for the sum.
import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]
    c[row][column] = sum
Your block size is invalid. CUDA devices have a limit of 1024 threads per block. When I run your code I see this:
/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
327 _logger.critical(msg, _getpid(), self.pid)
328 raise CudaDriverError("CUDA initialized before forking")
--> 329 raise CudaAPIError(retcode, msg)
330
331 def get_device(self, devnum=0):
CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
When I fix that I see this:
$ cuda-memcheck python somethingsometing.py
========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
========= at 0x000008b0 in cudapy::__main__::mm$241(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
========= by thread (15,11,0) in block (3,2,0)
========= Address 0x00000ec0 is out of bounds
The why is pretty obvious:
for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]
row and column are coordinates in the execution grid, not in the local shared memory tile, and similarly i is bounded by the shape of a, not the shape of a_cache (note also that you seem to have lapsed into C-style 2D array indexing syntax about halfway through the code, which is a potential bug if you don't understand the difference between the two in Python).
To fix it you will have to change the indexing and then implement the rest of the code for multiplication (i.e. you must iteratively load the whole row and column slices through the local shared tiles to compute the full dot product for each row/column pair which a block will process).
Note also that:
- The dimensions you have selected for c are wrong (should be m x m)
- The grid size you run the kernel on is also wrong because the dimensions of C are wrong and so your code could never calculate the whole matrix
- Even after fixing all of this, it is likely that the results of the multiplication will be incorrect at anything other than trivial sizes because of integer overflow.
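The overflow remark can be checked with plain Python integers, which never overflow. The sketch below is an illustration only: the helper name max_product_entry is hypothetical, and it assumes the inputs are built exactly as in the question (np.arange(m*n) reshaped to m x n and n x m). Under that assumption the largest entry of the product is the dot product of the last row of a with the last column of b.

```python
INT32_MAX = 2**31 - 1

def max_product_entry(m, n):
    """Largest entry of a @ b for a = arange(m*n).reshape(m, n) and
    b = arange(m*n).reshape(n, m) (hypothetical helper, exact integers)."""
    # All entries are non-negative and grow with the row index of a and the
    # column index of b, so the maximum is last row of a . last column of b.
    last_row_a = [(m - 1) * n + k for k in range(n)]
    last_col_b = [k * m + (m - 1) for k in range(n)]
    return sum(x * y for x, y in zip(last_row_a, last_col_b))

for m, n in [(128, 32), (1024, 256)]:
    biggest = max_product_entry(m, n)
    verdict = "overflows int32" if biggest > INT32_MAX else "fits in int32"
    print(m, n, biggest, verdict)
```

With these particular inputs the 128 x 32 case still fits in int32, while the 1024 x 256 case does not, so whether overflow occurs depends on both the size and the values of the data.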
@disruptive: Hi, did you find any solution to your problem?
I had the same problem as you but I solved it by restarting the kernel of Jupyter notebook.
My code is slightly different than yours:
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    col, row = cuda.grid(2)
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row][col]
    b_cache[cuda.threadIdx.y, cuda.threadIdx.x] = b[col][row]
    for i in range(a.shape[1]):
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]
        cuda.syncthreads()
        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]
        # Wait until all threads finish computing
        cuda.syncthreads()
    c[row][col] = sum
Please let me know if you have any update.
This is the correct solution:
import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    # TODO: use each thread to populate one element each a_cache and b_cache
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    TPB = int(N)
    # Integer division: range() needs a whole number of tiles
    for i in range(a.shape[1] // TPB):
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):  # a.shape[1]):
            # TODO: calculate the `sum` value correctly using values from the cache
            sum += a_cache[tx][j] * b_cache[j][ty]
        cuda.syncthreads()
    c[x][y] = sum
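To see why the kernel has to walk over several tiles before writing c, it can help to run the same indexing on the CPU. The following is a minimal sketch, not the GPU code itself: tiled_matmul is a hypothetical helper working on plain lists, each (tx, ty) pair stands in for one thread of a tpb x tpb block, and all dimensions are assumed to be multiples of tpb.

```python
def tiled_matmul(a, b, tpb):
    """Multiply list-of-lists matrices a (m x k) and b (k x n) tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    assert m % tpb == 0 and k % tpb == 0 and n % tpb == 0
    c = [[0] * n for _ in range(m)]
    for bx in range(m // tpb):          # block index along the rows of c
        for by in range(n // tpb):      # block index along the columns of c
            for i in range(k // tpb):   # walk the tiles along the shared dimension
                # "Load" one tile of a and one tile of b
                # (these would live in shared memory on the GPU).
                a_cache = [row[i * tpb:(i + 1) * tpb]
                           for row in a[bx * tpb:(bx + 1) * tpb]]
                b_cache = [row[by * tpb:(by + 1) * tpb]
                           for row in b[i * tpb:(i + 1) * tpb]]
                # Each (tx, ty) pair plays the role of one thread in the block
                # and only ever indexes the caches with tile-local indices.
                for tx in range(tpb):
                    for ty in range(tpb):
                        acc = 0
                        for j in range(tpb):
                            acc += a_cache[tx][j] * b_cache[j][ty]
                        c[bx * tpb + tx][by * tpb + ty] += acc
    return c

a = [[r * 4 + col for col in range(4)] for r in range(8)]   # 8 x 4
b = [[r * 8 + col for col in range(8)] for r in range(4)]   # 4 x 8
expected = [[sum(a[r][k] * b[k][col] for k in range(4)) for col in range(8)]
            for r in range(8)]
print(tiled_matmul(a, b, 2) == expected)
```

The partial sums from each tile pair accumulate into the same output element, which is what the outer loop over i does in the kernel.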
Related
I am trying to implement a custom loss function in a Pytorch Autoencoder.
The loss function tries to maximize the cosine similarity between a given output tensor U (a vector) and 100 random vectors J where both U and J have the same dimension of [300]. This is repeated for each batch.
Suppose we have 30 items per batch, then the output tensor is
train_Y.shape = [30,300]
Random_vectors.shape = [30,100,300]
I can implement the loss function in two ways:
All_Y =[]
for Y,z_r in zip(train_y, random_vectors):
    Y_cosine_list =[]
    for z in z_r:
        cosi = torch.dot(Y,z) / (torch.norm(Y)*torch.norm(z))
        Y_cosine_list.append(cosi)
    All_Y.append(Y_cosine_list)
All_Y = torch.tensor(All_Y).to(device)
train_loss = torch.sum(torch.abs(All_Y))/dim_0
train_loss = torch.tensor(train_loss.data, requires_grad = True)
or
train_Y = torch.zeros([dim_0, 100])
for i, (Y,z_r) in enumerate(zip(train_Y, random_vectors)):
    for j,z in enumerate(z_r):
        train_Y[i,j] = cos(Y,z)
train_Y = train_Y.to(device)
train_loss = torch.sum(torch.abs(train_Y))/dim_0
The second one is more elegant and to the point. However it is giving a "Cuda illegal memory access error". I have checked that the memory is not exceeded in either case. Is there anything wrong with the second implementation?
The first implementation is inelegant and I am not sure that it makes sense from a neural net optimization perspective. But it does not give errors and I am able to complete training for all the epochs.
Ps: I have tried encapsulating this code block in a loss_fn method but I get the same illegal memory access error.
I have tried everything that I could find for the illegal memory access error - changing GPUs, removing a torch.stack block etc. But I can't seem to get rid of the problem.
Here is a vectorized way to do it
class CosineLoss(nn.Module):
    def __init__(self, ):
        super().__init__()
        pass

    def forward(self, x, y):
        """
        Args:
            x (torch.tensor): [batchsize, N, M] - tensor.
            y (torch.tensor): [batchsize, M] - tensor.
        Returns:
            torch.tensor: scalar mean cosine loss
        """
        # dot product along dimension 'm' i.e multiply and sum along 'm'.
        dotp = torch.einsum("bm, bnm -> bn", y, x)
        # L2 norm along dimension 'm' and multiply by broadcasting
        length = torch.norm(y, dim=-1)[:, None]*torch.norm(x, dim=-1)
        # cosine = dotproduct of unit vectors
        cos = dotp/length
        return cos.mean()

def test():
    b, n, m = 30, 100, 300
    train_Y = torch.randn(b, m, device='cuda')
    random_vectors = torch.randn(b, n, m, requires_grad=True, device='cuda')
    print(f'{random_vectors.grad = }')
    cosineloss = CosineLoss()
    loss = cosineloss(random_vectors, train_Y)
    print(f'{loss = }')
    loss.backward()
    print(f'{random_vectors.grad.shape = }')
References:
einsum
broadcasting
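When checking a vectorized loss like the one above, a slow reference on tiny inputs is useful. The sketch below is plain Python with hypothetical helper names (cosine, cosine_losses), not part of either post. It returns two reductions, because the vectorized module returns the mean cosine over all pairs while the loops in the question compute sum(|cos|) divided by the batch size; these are not the same quantity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cosine_losses(train_y, random_vectors):
    """train_y: [batch][dim]; random_vectors: [batch][n][dim]."""
    cos = [[cosine(y, z) for z in z_r]
           for y, z_r in zip(train_y, random_vectors)]
    flat = [c for row in cos for c in row]
    mean_cos = sum(flat) / len(flat)                          # mean over all pairs
    abs_sum_per_item = sum(abs(c) for c in flat) / len(cos)   # sum(|cos|) / batch
    return mean_cos, abs_sum_per_item

train_y = [[1.0, 0.0], [0.0, 2.0]]
random_vectors = [[[1.0, 0.0], [0.0, 1.0]],
                  [[0.0, -1.0], [1.0, 0.0]]]
print(cosine_losses(train_y, random_vectors))
```

For this toy batch the cosines are 1, 0, -1 and 0, so the first value is 0 and the second is 1.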
Consider the following CUDA kernel, which computes the mean of each row of a 2-D matrix.
using CUDA
function mean!(x, n, out)
    """out = sum(x, dims=2)"""
    row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:n
        @inbounds out[row_idx] += x[row_idx, i]
    end
    out[row_idx] /= n
    return
end
using Test
nrow, ncol = 1024, 10
x = CuArray{Float64, 2}(rand(nrow, ncol))
y = CuArray{Float64, 1}(zeros(nrow))
@cuda threads=256 blocks=4 row_sum!(x, size(x)[2], y)
@test isapprox(y, sum(x, dims=2)) # test passed
Also consider the following CUDA kernel
function add!(a, b, c)
    """ c = a .+ b """
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return
end
a = CuArray{Float64, 1}(zeros(nrow))
b = CuArray{Float64, 1}(ones(nrow))
c = CuArray{Float64, 1}(zeros(nrow))
@cuda threads=256 blocks=4 add!(a, b, c)
@test all(c .== a .+ b) # test passed
Now, suppose I wanted to write another kernel that uses the intermediate results of mean!(). For example,
function g(x, y)
    """ mean(x, dims=2) + mean(y, dims=2) """
    xrow, xcol = size(x)
    yrow, ycol = size(y)
    mean1 = CuArray{Float64, 1}(undef, xrow)
    @cuda threads=256 blocks=4 mean!(x, xcol, mean1)
    mean2 = CuArray{Float64, 1}(zeros(yrow))
    @cuda threads=256 blocks=4 mean!(y, ycol, mean2)
    out = CuArray{Float64, 1}(zeros(yrow))
    @cuda threads=256 blocks=4 add!(mean1, mean2, out)
    return out
end
(Of course, g() isn't technically a kernel since it returns something.)
My question is whether g() is "correct". In particular, is g() wasting time by transferring data between the GPU/CPU?
For example, if my understanding is correct, one way g() could be optimized is by initializing mean2 the same way we initialize mean1. This is because when constructing mean2, we're actually first creating zeros(yrow) on the CPU, then passing this to the CuArray constructor to be copied to the GPU. In contrast, mean1 is constructed but uninitialized (due to the undef) and therefore avoids this extra transfer.
To summarize, how do I save/use intermediate kernel results while avoiding data transfers between the CPU/GPU as much as possible?
You can generate arrays or vectors of zeros directly on GPU!
Try:
CUDA.zeros(Float64, nrow)
Some benchmarks:
julia> @btime CUDA.zeros(Float64, 1000,1000)
  12.600 μs (26 allocations: 1.22 KiB)
1000×1000 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
...
julia> @btime CuArray(zeros(1000,1000))
  3.551 ms (8 allocations: 7.63 MiB)
1000×1000 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
...
I want to calculate the relative error between two arrays. The pure numpy code is:
# a1, a2 are the two arrays
np.abs( 1-a2/a1 ).max()
How can I use numba.cuda to accelerate the above code?
My idea so far:
@cuda.jit
def calculate(a1, a2):
    start = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    grid = cuda.gridDim.x*cuda.blockDim.x
    for id in range(start, a1.size, grid):
        r = abs(1-a2[id]/a1[id])
ca1 = cuda.to_device(a1)
ca2 = cuda.to_device(a2)
But how can I compare the r values between different threads?
One possible method to do this is to write your own shared memory parallel reduction.
As indicated in the comments, another possible method is to use numba's built-in reduce decorator.
Here is an example demonstrating both:
$ cat t79.py
from numba import cuda, float32, vectorize
import numpy as np
from numpy import random

#values of 0..10 are legal here
TPBP2 = 9
TPB = 2**TPBP2
TPBH = TPB//2
ds = 4096

#method 1: standard cuda parallel max-finding reduction
@cuda.jit
def max_error(a1, a2, err):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    x = cuda.grid(1)
    st = cuda.gridsize(1)
    tx = cuda.threadIdx.x
    s[tx] = 0
    cuda.syncthreads()
    for i in range(x, a1.size, st):
        s[tx] = max(s[tx], abs(1-a2[i]/a1[i]))
    mid = TPBH
    for i in range(TPBP2):
        cuda.syncthreads()
        if tx < mid:
            s[tx] = max(s[tx], s[tx+mid])
        mid >>= 1
    if tx == 0:
        err[cuda.blockIdx.x] = s[0]

# data
# for best performance we should choose blocks based on GPU occupancy
# but for demonstration since we don't know the GPU:
blocks = (ds+TPB-1)//TPB
a1= np.random.rand(ds).astype(np.float32)
a1 += 1
a2= np.random.rand(ds).astype(np.float32)
err = np.zeros(blocks).astype(np.float32)
# Start the kernel
max_error[blocks, TPB](a1,a2, err)
# we could perform another stage of GPU reduction here, but for simplicity:
my_err = np.max(err)
print(my_err)

#method 2: using numba features
@vectorize(['float32(float32,float32)'], target = 'cuda')
def my_error(a1,a2):
    return abs(1-a2/a1)

@cuda.reduce
def max_reduce(a,b):
    return max(a,b)

r = my_error(a1,a2)
my_err = max_reduce(r)
print(my_err)
$ python t79.py
0.9999707
0.9999707
$
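The halving loop in method 1 is easier to follow when simulated on the CPU. This is only a sketch of the reduction pattern, with a hypothetical helper name (block_max_reduce): the list s stands in for the shared array of one block, and the inner for loop stands in for all threads with tx < mid acting in the same step between two cuda.syncthreads() calls.

```python
def block_max_reduce(values):
    """Tree reduction over a power-of-two-length list, mimicking one block."""
    s = list(values)
    n = len(s)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    mid = n // 2
    while mid >= 1:
        # Threads tx < mid each combine their slot with the slot mid away.
        # Reads hit indices >= mid and writes hit indices < mid, so the
        # order within one step does not matter.
        for tx in range(mid):
            s[tx] = max(s[tx], s[tx + mid])
        mid //= 2
    return s[0]

a1 = [1.0, 2.0, 4.0, 8.0]
a2 = [1.0, 1.0, 5.0, 8.0]
errors = [abs(1 - y / x) for x, y in zip(a1, a2)]
print(block_max_reduce(errors))  # 0.5
```

Each step halves the number of active slots, so a block of 2**p threads needs p steps, which matches the loop over TPBP2 in the kernel.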
I have to solve the following boundary value problem:
-u''(x) + 6u(x) = (-4x^2 - 6) exp(x^2), u(-1) = 0, u(1) = 0
It is also defined in my MATLAB code below, but my code doesn't work: I don't get the approximate solution of my system.
I want to know where the problem is in my code, or whether the version of MATLAB that I have simply can't compile the kind of function I have used. Thanks.
Explanation of the method I have used: I have used the finite element method, also called the Galerkin method, based on the assembly matrix and the stiffness matrix. I multiplied the system by a weight function which satisfies the boundary conditions, then integrated over the elements (integration of the elementary matrices over the range ]-1,1[). I have four elementary matrices. For more information about the method I used, please check this paper (pages 6, 7, 8).
Note: the error I get when compiling my code is
The current use of "MatElt2Nd" is inconsistent with its previous use or definition in line 7
Code
function [U] = EquaDiff2(n)
% ----------------------------------
% -d²u/dx² + 6*u = (-4*x^2-6)exp(x^2)
% u(-1) = 0 u(1)= 0
%----------------------------------
function [Ke, Fe] = MatElt2Nd(x1,x2)
% function declaration,
% function computing the element matrix and vector (for the assembly matrix)
% ----------------------------------
x = [-1:2/n:1]'; % modified integration bounds
K = zeros(n+1) ;
F = zeros(n+1,1) ;
for i = 1:n
    j = i+1;
    t = [i j];
    x1 = x(i);
    x2 = x(j);
    [Ke,Fe] = MatElt2Nd(x1,x2);
    K(t,t) = K(t,t) + Ke;
    F(t) = F(t) + Fe;
end;
K(1,:) = [];
K(:,1) = [];
F(1) = [];
U = K\F;
U = [0.0;U];
t = 0:0.01:1;
return
%-------------------------------------------
% calculation of matrix Ke and vector Fe
%-------------------------------------------
function [Ke,Fe] = MatElt2Nd0(x1,x2)
% NEWly named nested function is introduced
Ke1 = 1/(x2-x1)*[ 1 -1    % no modification done
                 -1  1 ] ; % essentially the element
Ke2 =(x2-x1)* [ 2 1       % matrices
                1 2 ] ;
N = [(x-x2)/(x1-x2) (x-x1)/(x2-x1)] % shape functions
Fe =simple( int(N' * (-4*x^2-6)*exp(x^2) , x, x1, x2) ) % vector Fe ;
Ke = Ke1 + 6*Ke2 ;
return
Edit I have got a general code for that but I can't do changes in the general code to solve my system , Any help ?
General Code
% au'(x)+bu"(x)=0 for 0<=x<=d
% BC: u(0)=0 and u(d)=h
%==============================================================
% ======Example======
% Finding an approximate solution to the following BVP using 4 elements of
% equal length.
% u'(x)-u"(x)=0 : 0<=x<=1
% BC: u(0)=0 and u(1)=1
% Solution:
% >> Galerkin(4,1,-1,1,1)
% ==============================================================
% The output of this program is
% 1- The approximate solution (plotted in blue)
% 2- The exact solution (plotted in red)
% 3- The percentage error (plotted in magenta)
%=======================Program Begin==========================
function Galerkin(ne1,a,b,d,h) % Declare function
clc % Clear the command window
% Define the Coefficients of the exact solution
% The Exact solution is : u(x)=C1+C2*exp(-ax/b)
% where C2=h/(exp(-a*d/b)-1)and C1=-C2
C2=h/((exp(-a*d/b))-1);
C1=-C2;
% Define element length
le = d/ne1;
% Define x matrix
x = zeros (ne1+1,1); %
for i=2:ne1 +1
    x(i,1) = x(i-1,1)+le;
end
% K1 matrix corresponding to the diffusion term (u"(x))
K1 = (b/le) * [1,-1;-1,1]
% K2 matrix corresponding to the convection term (u'(x))
K2 = a*[-1/2 1/2;-1/2 1/2]
% Element stiffness Matrix
Ke = K1+K2
% Global stiffness matrix
%********************Begin Assembly***************************
k = zeros(ne1+1);
for i=1:ne1+1
    for j=1:ne1 +1
        if (i==j)
            if(i==1)
                k(i,j)=Ke(1,1);
            elseif(i==ne1+1)
                k(i,j)=Ke(2,2);
            else
                k(i,j)=Ke(1,1)+Ke(2,2);
            end
        elseif(i==j+1)
            k(i,j)=Ke(1,2);
        elseif(j==i+1)
            k(i,j)=Ke(2,1);
        else
            k(i,j)=0;
        end
    end
end
%********************End Assembly*****************************
%The Global f Matrix
f = zeros(ne1+1,1);
%BC apply u(0) = 0
f(1,1) = 0;
%BC apply u(d) = h
f(ne1+1,1) = h;
% Display the Global stiffness matrix before striking row
K_Global=k
%Striking first row (u1=0)
k(1,1) = 1;
for i=2:ne1+1
    k(1,i) = 0;
    k(ne1+1,i) = 0;
end
k(ne1+1,ne1+1) = 1;
% Display the solvable stiffness matrix
K_strike=k
%solving the result and finding the displacement matrix, {u}
u=inv(k)*f
hold on
% ======Calculating Approximate Solution and plotting============
syms X
U_sym=sym(zeros(ne1,1));
dU_sym=sym(zeros(ne1,1));
for i=1:ne1
    N1x=1-((X-x(i))/le);
    N2x=(X-x(i))/le;
    U_X=(u(i)*N1x)+(u(i+1)*N2x);
    U_sym(i)=U_X;
    dU_sym(i)=diff(U_sym(i));
    subplot(3,1,1)
    hold on
    ezplot(U_sym(i),[x(i) x(i+1)])
    subplot(3,1,2)
    hold on
    % du/dx approximate
    ezplot(dU_sym(i),[x(i) x(i+1)])
end
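For comparison with the boundary value problem -u'' + 6u = (-4x^2 - 6) exp(x^2), u(-1) = u(1) = 0 from this question, here is a minimal Python sketch of a linear-element Galerkin solver. It is not a port of either MATLAB listing and rests on several assumptions of its own: the function name solve_bvp_fem is hypothetical, the mesh is uniform, the element matrix is the standard (1/h)[1 -1; -1 1] + c(h/6)[2 1; 1 2], the load vector uses two-point Gauss quadrature instead of symbolic integration, and both boundary unknowns are eliminated before a tridiagonal (Thomas) solve.

```python
import math

def solve_bvp_fem(f, n, c=6.0, left=-1.0, right=1.0):
    """Linear-element Galerkin solve of -u'' + c*u = f, u(left) = u(right) = 0."""
    h = (right - left) / n
    x = [left + i * h for i in range(n + 1)]
    diag = [0.0] * (n + 1)          # main diagonal of the global matrix
    off = [0.0] * n                 # K[i][i+1] == K[i+1][i] (symmetric)
    F = [0.0] * (n + 1)             # global load vector
    k_dd = 1.0 / h + c * h / 3.0    # diagonal entry of the element matrix
    k_od = -1.0 / h + c * h / 6.0   # off-diagonal entry of the element matrix
    g = 1.0 / math.sqrt(3.0)        # two-point Gauss abscissa on [-1, 1]
    for e in range(n):              # assembly over elements [x[e], x[e+1]]
        x1, x2 = x[e], x[e + 1]
        diag[e] += k_dd
        diag[e + 1] += k_dd
        off[e] += k_od
        for xi in (-g, g):          # both Gauss weights equal 1
            xq = 0.5 * (x1 + x2) + 0.5 * h * xi
            w = 0.5 * h * f(xq)
            F[e] += w * (x2 - xq) / h       # shape function N1 at xq
            F[e + 1] += w * (xq - x1) / h   # shape function N2 at xq
    # Dirichlet conditions: keep only interior nodes 1..n-1.
    m = n - 1
    d = diag[1:n]
    o = off[1:n - 1]
    rhs = F[1:n]
    for i in range(1, m):           # forward elimination
        factor = o[i - 1] / d[i - 1]
        d[i] -= factor * o[i - 1]
        rhs[i] -= factor * rhs[i - 1]
    u_int = [0.0] * m
    u_int[-1] = rhs[-1] / d[-1]
    for i in range(m - 2, -1, -1):  # back substitution
        u_int[i] = (rhs[i] - o[i] * u_int[i + 1]) / d[i]
    return x, [0.0] + u_int + [0.0]

# Right-hand side of the problem in the question
x, u = solve_bvp_fem(lambda t: (-4 * t**2 - 6) * math.exp(t**2), 20)
print(u[10])  # approximate value of u at x = 0
```

A quick sanity check is a manufactured problem: with f(x) = 8 - 6x^2 the exact solution is u(x) = 1 - x^2, and the nodal values returned by the sketch should approach it as n grows.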
I am writing a function to compute the intersection between two sorted arrays (which may contain duplicates). So if the input is [0,3,7,7,7,9, 12] and [2,7,7,8, 12] the output should be [7,7,12] for example.
Here is my code:
cimport cython

@cython.wraparound(False)
@cython.cdivision(True)
@cython.boundscheck(False)
def sorting(int[:] A, int[:] B):
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef int lenA = A.shape[0]
    cdef int lenB = B.shape[0]
    intersect = []
    while (i < lenA and j < lenB):
        if A[i] == B[j]:
            intersect.append(A[i])
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        elif A[i] < B[j]:
            i += 1
    return intersect
As you will see, I use a list to store the answers and append to add the answers as they arrive. I am happy to return a python or numpy array if that will speed things up.
How can I avoid append to speed up the cython?
For this kind of thing you usually want to pre-allocate the array (it's basically free to shrink it later). In this case it can't be longer than the shorter of your input arrays, so that gives you a starting size:
cdef int[::1] intersect = np.empty(A.shape[0] if A.shape[0]<B.shape[0] else B.shape[0], dtype=np.intc)
(np.empty allocates that many elements, and np.intc matches the C int of the memoryview; np.int has been removed from recent NumPy releases.)
You then just keep track of which index you're at in that array (say k), so append is replaced by:
intersect[k] = A[i]
k += 1
At the end you can either return the memoryview intersect[:k] or convert it to a numpy array with np.asarray(intersect[:k]).
As an aside: I'd remove the Cython directive @cython.cdivision(True) since you aren't doing any division. I believe you should be thinking about whether these directives are useful and if they apply to your code rather than blindly copying them in out of habit.
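The pre-allocate-then-trim pattern can be tried in plain Python before moving it into Cython. This is a sketch of the idea only: intersect_sorted is a hypothetical name, and a Python list stands in for the typed memoryview, so it shows the control flow rather than the speed-up.

```python
def intersect_sorted(A, B):
    """Intersection of two sorted sequences, keeping duplicates."""
    out = [0] * min(len(A), len(B))   # the result can never be longer than this
    i = j = k = 0
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            out[k] = A[i]             # replaces intersect.append(A[i])
            k += 1
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        else:
            i += 1
    return out[:k]                    # trim the unused tail

print(intersect_sorted([0, 3, 7, 7, 7, 9, 12], [2, 7, 7, 8, 12]))  # [7, 7, 12]
```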