Generalized Levenshtein Distance in Cython

I have a reasonably fast Levenshtein distance in Cython, but it is much slower than the Levenshtein library for Python. Does anyone know why? Thanks!
The first function computes a distance between strings x and y in the style of Levenshtein, but with additional tuning parameters d (the substitution cost) and e (the insertion/deletion cost). It is computed via dynamic programming. The second function is a wrapper that computes a distance table for a large number of strings.
cimport cython
from libc.stdlib cimport malloc, free
import numpy as np
cimport numpy as np

cdef int distance(str x, str y, int d, int e):
    cdef int m = len(x) + 1
    cdef int n = len(y) + 1
    cdef int* M = <int*>malloc(m*n*sizeof(int))
    cdef int i = 0
    cdef int j = 0
    cdef int sub
    for ii in range(m):
        M[ii] = ii
    for jj in range(n):
        M[m*jj] = jj
    for j in range(n - 1):
        for i in range(m - 1):
            if x[i] == y[j]:
                sub = 0
            else:
                sub = d
            M[i + 1 + m*(j + 1)] = min(min(M[i + m*(j + 1)] + e, M[i + 1 + m*j] + e), M[i + m*j] + sub)
    out = M[m*n - 1]
    free(M)
    return out
def distance_table(np.ndarray X, int d, int e):
    M = len(X)
    D = np.zeros((M, M))
    cdef int i
    cdef int j
    for i in range(M):
        for j in range(i):
            D[i, j] = distance(str(X[i]), str(X[j]), d, e)
        print(i)
    D = D + D.T
    return D
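One plausible contributor, not diagnosed in the thread, is that x[i] and y[j] still index Python str objects and min() resolves to the Python builtin inside the hot loop, while python-Levenshtein runs entirely in C. Below is a minimal sketch of a byte-level variant; it is illustrative only, the name distance_bytes and the bytes-based interface are not from the original post, and byte-wise comparison matches the original only for ASCII-range text.

# Hypothetical byte-level variant (not from the original post), shown only to
# illustrate keeping the inner loop at C level.
from libc.stdlib cimport malloc, free

cdef int distance_bytes(bytes x, bytes y, int d, int e):
    cdef char* xs = x              # borrow the raw byte buffers
    cdef char* ys = y
    cdef int m = len(x) + 1
    cdef int n = len(y) + 1
    cdef int* M = <int*>malloc(m*n*sizeof(int))
    cdef int i, j, sub, best, cand, out
    for i in range(m):
        M[i] = i
    for j in range(n):
        M[m*j] = j
    for j in range(n - 1):
        for i in range(m - 1):
            sub = 0 if xs[i] == ys[j] else d
            best = M[i + m*(j + 1)] + e        # deletion
            cand = M[i + 1 + m*j] + e          # insertion
            if cand < best:
                best = cand
            cand = M[i + m*j] + sub            # match / substitution
            if cand < best:
                best = cand
            M[i + 1 + m*(j + 1)] = best
    out = M[m*n - 1]
    free(M)
    return out

The wrapper would then pass X[i].encode() instead of str(X[i]). Compiling with cython -a and checking that the inner loop contains no yellow (Python-interacting) lines is the usual way to confirm where the remaining time goes.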

Related

Julia CUDA - Reduce matrix columns

Consider the following kernel, which reduces along the rows of a 2-D matrix
function row_sum!(x, ncol, out)
    """out = sum(x, dims=2)"""
    row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:ncol
        @inbounds out[row_idx] += x[row_idx, i]
    end
    return
end

N = 1024
x = CUDA.rand(Float64, N, 2*N)
out = CUDA.zeros(Float64, N)
@cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out)
isapprox(out, sum(x, dims=2)) # true
How do I write a similar kernel except for reducing along the columns (of a 2-D matrix)? In particular, how do I get the index of each column, similar to how we got the index of each row with row_idx?
Here is the code:
function col_sum!(x, nrow, out)
    """out = sum(x, dims=1)"""
    col_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:nrow
        @inbounds out[col_idx] += x[i, col_idx]
    end
    return
end

N = 1024
x = CUDA.rand(Float64, N, 2N)
out = CUDA.zeros(Float64, 2N)
@cuda threads=256 blocks=8 col_sum!(x, size(x, 1), out)
And here is the test:
julia> isapprox(out, vec(sum(x, dims=1)))
true
As you can see, the result vector now has size 2N instead of N, so the number of blocks had to be adapted accordingly (multiplied by 2, giving 8 instead of 4).
More materials can be found here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

CUDA out of memory error when doing matrix multiplication using Numba

I need to multiply a matrix with its transpose and I am running out of memory on my GPU with the error message numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
I expect my matrix to have around 10k rows and 100k columns, so multiplying it by its transpose will give a square matrix of 10k rows and 10k columns. The matrix only contains 0s and 1s.
This is the script that I am running.
from numba import cuda, uint16
import numba
import numpy
import math
import time

TPB = 16

@cuda.jit()
def matmul_shared_mem(A, B, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint16)
    sB = cuda.shared.array((TPB, TPB), dtype=uint16)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    if x >= C.shape[0] and y >= C.shape[1]:
        return
    tmp = 0.
    for i in range(int(A.shape[1] / TPB)):
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()
    C[x, y] = tmp

A = numpy.random.randint(2, size=(TPB * 625, 50000))
B = A.transpose()
C_shared_mem = cuda.device_array((A.shape[0], B.shape[1]))
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
start_gpu_shared_memory = time.time()
matmul_shared_mem[blocks_per_grid, threads_per_block](A, B, C_shared_mem)
cuda.synchronize()
end_gpu_shared_memory = time.time()
time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
Update 1:
Based on your suggestions, I made some changes, but I am still running out of memory.
import numpy as np
import numba as nb

colm = int(200000/8)
rows = 100000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)

from numba import cuda, uint8, int32
import numba
import numpy as np
import math
import time

TPB = 8
TPB1 = 9

@cuda.jit()
def bit_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    if bx >= by:
        tmp = int32(0)
        for i in range((A.shape[1]+TPB-1) // TPB):
            if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
                sA[ty, tx] = A[y, i*TPB+tx]
            else:
                sA[ty, tx] = 0
            if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
                sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
            else:
                sB[ty, tx] = 0
            cuda.syncthreads()
            for j in range(TPB):
                tmp1 = sA[ty,j] & sB[tx, j]
                test = uint8(1)
                for k in range(8):
                    if (tmp1 & test) > 0:
                        tmp += 1
                    test <<= 1
            cuda.syncthreads()
        if y < C.shape[0] and x < C.shape[1]:
            C[y, x] = tmp

C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
start_gpu_shared_memory = time.time()
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()
time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
Any idea how I can fix this?
The following method should reduce the amount of device memory required for the calculation of A x AT. We'll use the following ideas:
since the input array (A) only takes on values of 0,1, we'll reduce the storage for that array down to the minimum convenient size, int8, i.e. one byte per element
since the B array is just the transpose of the A array, there is no need to handle it explicitly. We can derive it from the A array, somewhat similar to here, although that is performing AT x A
the matrix multiplication of A x AT involves taking the dot-product of the rows of matrix A, as indicated here
we will provide the A transposed version in the sB array using adjusted indexing
there are a number of other changes to your code, to address various errors and also to improve load/store efficiency, such as a general reversal of your usage of x, y indices
I've also fixed your usage of syncthreads and modified the code to allow arbitrary values for row and column dimensions
Here is a worked example:
$ cat t62.py
from numba import cuda, int32, int8
import numba
import numpy as np
import math
import time

TPB = 32
TPB1 = TPB+1

@cuda.jit()
def byte_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=int8)
    sB = cuda.shared.array((TPB, TPB1), dtype=int8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # uncomment and indent remainder of kernel to only do the "symmetric half" of calculation
    # if bx >= by:
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp += int32(sA[ty,j]) * int32(sB[tx, j])
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

rows = 1041
cols = 1043
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (rows*cols+rows*rows*4)//1048576, 'MB')
A = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AT = A.transpose()
CU = np.matmul(A,AT, dtype = np.int32)
C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
byte_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
start_gpu_shared_memory = time.time()
byte_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()
time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
test = np.array_equal(C, CU)
print(test)
if test == False:
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i,j] != CU[i,j]:
                print(i, ' ' , j ,' ' , C[i,j] , ' ' , CU[i,j])
$ python t62.py
host mem: 10 MB device mem: 5 MB
GPU time(shared memory):0.019593000411987305
True
$
Notes:
most of the runtime of the above code will be spent in Python (in the np.matmul() operation, which is really only used to verify results and should not be necessary for an actual implementation), not in the GPU portion. As the matrices are made larger, the code will run much more slowly.
as mentioned in the comments, the result of A x AT is a symmetric matrix. My code does not take advantage of this; however, we could crudely exploit it by uncommenting the if test at the beginning of the kernel and then indenting the remainder of the kernel. This will, of course, cause the host-code np.array_equal test to fail.
the device memory consumption for this is calculated in the code. For your largest values in the comments (rows = 30k, cols = 200k) this would amount to about 10GB, so it will still not run on your 8GB GPU.
I have created a version of this code which packs 8 elements per byte for the A matrix, which would further reduce memory demand; however, writing that code to handle arbitrary column dimensions (vs. multiples of 8) proves to be rather messy. That code could, however, get the total device memory consumption down to about 5GB for the 30k rows and 200k columns case (a short host-side packing sketch follows these notes).
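As a side note on that last point (this is not from the original answer): when the number of columns is already a multiple of 8, NumPy's packbits produces the same MSB-first byte packing on the host that the compute/bitpack helpers in this thread build by hand:

import numpy as np

rows, cols = 1041, 1048                       # cols must be a multiple of 8 here
AU = np.random.randint(2, size=(rows, cols), dtype=np.int8)

# Pack 8 binary elements per byte along the columns (MSB first by default),
# matching the shift-and-or packing used in the code above.
A = np.packbits(AU.astype(np.uint8), axis=1)
print(A.shape, A.dtype)                       # (1041, 131) uint8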
Here is the one-bit-per-element version; it requires that the number of columns be evenly divisible by 8:
$ cat t61.py
from numba import cuda, uint8, int32
import numba
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

colm = 131
rows = 1041
cols = int(colm*8)
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (((rows*cols)//8)+rows*rows*4)//1048576, 'MB')
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AUT = AU.transpose()
CU = np.matmul(AU,AUT,dtype=np.int32)
A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))
C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
start_gpu_shared_memory = time.time()
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()
time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
test = np.array_equal(C, CU)
print(test)
if test == False:
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i,j] != CU[i,j]:
                print(i, ' ' , j ,' ' , C[i,j] , ' ' , CU[i,j])
                break
$ python t61.py
host mem: 10 MB device mem: 4 MB
GPU time(shared memory):0.009343624114990234
True
$
EDIT: Responding to some questions in the comments and the updates, and now taking into account that the A matrix may have significantly more than 30k rows (which makes the C matrix grow as well): if the A matrix can be fit in GPU memory, we can reduce the memory demand of the C matrix by computing it in pieces. These pieces are groups of rows computed together, which I refer to as a row_slice of the C matrix. The following code demonstrates that this can be achieved with relatively minor changes to the code above:
$ cat t63.py
from numba import cuda, uint8, int32
import numba as nb
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_slice_A_AT(A, C, row_offset):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y+row_offset < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y+row_offset, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def bitpack(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

colm = 131
rows = 1535
cols = int(colm*8)
row_slice = 512
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (((rows*cols)//8)+row_slice*rows*4)//1048576, 'MB')
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
CU = np.matmul(AU,AU.T,dtype=np.int32)
A = np.empty((rows,colm), dtype=np.uint8)
bitpack(A, AU)
A_dev = cuda.to_device(A)
threads_per_block = (TPB, TPB)
C = np.empty((row_slice, A.shape[0]), dtype=np.int32)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(row_slice / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C, i*row_slice)
    lower = i*row_slice
    upper = min(lower+row_slice, CU.shape[0])
    width = upper-lower
    test = np.array_equal(C[:width,:], CU[i*row_slice:i*row_slice+width,:])
    print(test)
cuda.synchronize()
C_dev = cuda.device_array_like(C)
start_gpu_shared_memory = time.time()
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i*row_slice)
cuda.synchronize()
end_gpu_shared_memory = time.time()
time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
$ python t63.py
host mem: 21 MB device mem: 3 MB
True
True
True
GPU time(shared memory):0.010116815567016602
$
This means that, as suggested, for the case of rows = 100k and columns = 200k given in the latest update to the question, we should be able to divide the C matrix into chunks of, say, 5k rows. The memory usage for the A matrix would be 2.5GB, while for the C matrix, since we only compute a 5k-row slice at a time, the device storage required would be 100k*5k*4 bytes, i.e. 2GB for this example.
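A quick back-of-the-envelope check of those figures (using the same packed-A plus int32-C sizing as the scripts above):

# Device-memory estimate for rows = 100k, cols = 200k with 5k-row slices of C.
rows, cols, row_slice = 100000, 200000, 5000

a_packed = rows * (cols // 8)            # bit-packed A: one byte per 8 elements
c_slice = row_slice * rows * 4           # one int32 slice of C at a time

print(a_packed / 1e9, "GB for A")        # 2.5 GB
print(c_slice / 1e9, "GB per C slice")   # 2.0 GB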
After some further study, we can speed up the host-code matmul operation by switching from the int8 datatype to the float32 datatype. This makes that op quite a bit faster, but the GPU code still seems to be about 4x faster than that:
$ cat t64.py
from numba import cuda, uint8, int32
import numba as nb
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_slice_A_AT(A, C, row_offset):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y+row_offset < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y+row_offset, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

@nb.njit('void(uint8[:,:],float32[:,:])', parallel=True)
def bitpack(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = int(AU[i,offset]) << 7
            res |= int(AU[i,offset+1]) << 6
            res |= int(AU[i,offset+2]) << 5
            res |= int(AU[i,offset+3]) << 4
            res |= int(AU[i,offset+4]) << 3
            res |= int(AU[i,offset+5]) << 2
            res |= int(AU[i,offset+6]) << 1
            res |= int(AU[i,offset+7])
            A[i,j] = res

colm = 1000
rows = 6000
cols = int(colm*8)
row_slice = 512
print('host mem: ', (rows*cols*4+rows*colm+rows*rows*4+rows*row_slice*4)//1048576, 'MB device mem: ', (((rows*cols)//8)+row_slice*rows*4)//1048576, 'MB')
t1 = time.time()
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AU = AU.astype(np.float32)
print("randint:" + str(time.time()-t1))
t1 = time.time()
#CU = np.empty((rows, rows), dtype=np.int32)
CU = np.matmul(AU,AU.T,dtype=np.float32)
print("matmul:" + str(time.time()-t1))
t1 = time.time()
A = np.empty((rows,colm), dtype=np.uint8)
print("np.empty:" + str(time.time()-t1))
t1 = time.time()
bitpack(A, AU)
print("bitpack:" + str(time.time()-t1))
t1 = time.time()
A_dev = cuda.to_device(A)
threads_per_block = (TPB, TPB)
C = np.empty((row_slice, A.shape[0]), dtype=np.int32)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(row_slice / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C, i*row_slice)
    lower = i*row_slice
    upper = min(lower+row_slice, CU.shape[0])
    width = upper-lower
    test = np.array_equal(C[:width,:], CU[i*row_slice:i*row_slice+width,:])
    print(test)
cuda.synchronize()
C_dev = cuda.device_array_like(C)
start_gpu_shared_memory = time.time()
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i*row_slice)
cuda.synchronize()
end_gpu_shared_memory = time.time()
time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
$ python t64.py
host mem: 337 MB device mem: 17 MB
randint:0.1817936897277832
matmul:3.498671531677246
np.empty:7.62939453125e-05
bitpack:0.03707313537597656
True
True
True
True
True
True
True
True
True
True
True
True
GPU time(shared memory):0.8318064212799072
$
I haven't thoroughly tested these codes. Bugs may exist. Use at your own risk.
For attribution, the numba bit-packing code seems to have come from here.

Error "expression must have integral or enum type" in thats code:

Error "expression must have integral or enum type" in thats code:
__global__ void VectorKernel(float *a, float *b, float *c, int n)
{
    int i = threadIdx.x;
    float y = 0, z = 0;
    if (i < n)
        y = (b-a) / n;
    for (float j = y; j <= n; j++) {
        z = (((j+y) - j) / 6) * function(j) + 4 * (function((j + (y+j)) / 2)) + function(y+j);
        c = c + z;
    }
}
The error happens at "z", in this part:
c = c + z;
(I'm a beginner in CUDA programming.)
c is a pointer. Pointer arithmetic requires a pointer and an integer type expression.
If you want to add z to the float pointed to by c, you should change the expression to:
*c = *c + z;
When you write c = c + z and get an error like this, you should suspect that your types are mismatched.
c is a float * and z is a float; a float cannot be added to a pointer.
What you probably want to do is store the result of *c + z in the memory location pointed at by c, in which case you'd write:
*c = *c + z.

How to statically type a list of gmpy2.mpq in Cython?

I implemented a Gauss estimation function based on gmpy2.mpq. After profiling, it proved to be the bottleneck of my program. I tried Cython to optimize it.
Cython is fantastic and the speed has already increased about two times, but I'm trying to make it faster by static typing.
There are two problems when I try to do this:
gmpy2.mpq is a function instead of a type. How do I statically type it?
How do I statically type a list of a given type?
It would also be great if there were a faster alternative to list.
I have attached the code just FYI.
from numpy import ndarray
from gmpy2 import mpq
# SingularError and NoSolutionError are the asker's own exception classes (not shown).

def gauss_estimate(a, b):
    """Gauss estimation for integers

    :param a: a n*m 2d sequence of integers, where n>=m
    :param b: an n-length 1d sequence of integers
    :returns: a m-length 1d sequence of mpq such that a@x=b; gmpy2.mpq is a fast
        implementation of fractions
    :raises: a SingularError if the rank of a is smaller than m
    :raises: a NoSolutionError if the rank of a is larger than m
    """
    cdef int n, m, i, j
    cdef list aa, bb
    if isinstance(a, ndarray):
        a = a.tolist()
    if isinstance(b, ndarray):
        b = b.tolist()
    aa = [[mpq(aii) for aii in ai] for ai in a]
    n = len(aa)
    m = len(aa[0])
    if n < m:
        raise ValueError('Wrong shape of a')
    for ai in aa:
        if len(ai) != m:
            raise ValueError('Wrong shape of ai')
    bb = [mpq(bi) for bi in b]
    if len(bb) != n:
        raise ValueError('Wrong shape of b')
    for i in range(m):
        if aa[i][i] == 0:
            for j in range(i, n):
                if aa[j][i] != 0:
                    aa[i], aa[j] = aa[j], aa[i]
                    bb[i], bb[j] = bb[j], bb[i]
                    break
            else:
                raise SingularError('The rank of a is smaller than m')
        bb[i] /= aa[i][i]
        for j in reversed(range(i, m)):
            aa[i][j] /= aa[i][i]
        for j in range(i+1, n):
            bb[j] -= aa[j][i] * bb[i]
            for k in reversed(range(i, m)):
                aa[j][k] -= aa[j][i] * aa[i][k]
            assert aa[j][i] == 0
    for i in range(m, n):
        if bb[i] != 0:
            raise NoSolutionError('No solution found')
    for i in reversed(range(m)):
        for j in range(0, i):
            bb[j] -= bb[i] * aa[j][i]
    for i in range(m):
        assert aa[i][i] == 1
    return tuple(bb[:m])

Tweaking a Function in Python

I am trying to get the following code to do a few more tricks:
class App(Frame):
    def __init__(self, master):
        Frame.__init__(self, master)
        self.grid()
        self.create_widgets()

    def create_widgets(self):
        self.answerLabel = Label(self, text="Output List:")
        self.answerLabel.grid(row=2, column=1, sticky=W)

    def psiFunction(self):
        j = int(self.indexEntry.get())
        valueList = list(self.listEntry.get())
        x = map(int, valueList)
        if x[0] != 0:
            x.insert(0, 0)
        rtn = []
        for n2 in range(0, len(x) * j - 2):
            n = n2 / j
            r = n2 - n * j
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
        self.answer = Label(self, text=rtn)
        self.answer.grid(row=2, column=2, sticky=W)

if __name__ == "__main__":
    root = Tk()
In particular, I am trying to get it to calculate len(x) * j - 1 terms, and to work for a variety of parameter values. If you try running it, you should find that you get errors for larger parameter values. For example, with the list 0, 1, 2, 3, 4 and the parameter j = 3, we should run through the program and get 0 1 2 3 4 5 6 7 8 9 10 11 12. However, I get an error that the last value is 'out of range' if I try to compute it.
I believe it's an issue with my function as defined. The problem with larger parameters seems to have something to do with the way the parameter is tied to the n value. Consider 0, 1, 2, 3: it works great if I use 2 as my parameter (called index in the function) but fails if I use 3.
EDIT:
def psi_j(x, j):
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 / j
        r = n2 - n * j
        if r == 0:
            rtn.append(j * x[n])
        else:
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
        print 'n2 =', n2, ': n =', n, ' r =', r, ' rtn =', rtn
    return rtn
For example, if we have psi_j(x, 2) with x = [0, 1, 2, 3, 4], we will be able to get [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] with an error on 12.
The idea, though, is that we should be able to calculate that last term. It is the 12th term of our output sequence, and 12 = 3*4 + 0 => 3*x[4] + 0*(x[n+1] - x[n]). Now, there is no 5th term to calculate, so that's definitely an issue, but we do not need that term since the second part of the equation is zero. Is there a way to write this into the equation?
If we think about the example data [0, 1, 2, 3] and a j of 3, the problem is that we're trying to get x[4] in the last iteration.
len(x) * j - 2 for this data is 10
range(0, 10) is 0 through 9.
Manually processing our last iteration allows us to resolve the code to this:
n = 3 # or 9 / 3
r = 0 # or 9 - 3 * 3
rtn.append(3 * x[3] + 0 * (x[3 + 1] - x[3]))
We have code trying to reach x[3 + 1], which doesn't exist when we only have indices 0 through 3.
To fix this, we could rewrite the code like this.
n = n2 / j
r = n2 - n * j
if r == 0:
    rtn.append(j * x[n])
else:
    rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
If r is 0, then (x[n + 1] - x[n]) is irrelevant.
Please correct me if my math is wrong on that. I can't see a case where n >= len(x) and r != 0, but if that's possible, then my solution is invalid.
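For reference, here is the suggested r == 0 guard dropped into the asker's psi_j and run on the example from the question (a quick check in Python 3, so // replaces the integer division that plain / performed in the original Python 2 code):

def psi_j(x, j):
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 // j                  # integer division, as in the Python 2 original
        r = n2 - n * j
        if r == 0:
            rtn.append(j * x[n])     # slope term not needed, so x[n + 1] is never touched
        else:
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
    return rtn

print(psi_j([0, 1, 2, 3, 4], 3))     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], no IndexError on the last term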
Without understanding what the purpose of the function is (is it a kind of filter, or a smoothing function?), I picked it out of the GUI stuff and tested it alone:
def psiFunction(j, valueList):
    x = map(int, valueList)
    if x[0] != 0:
        x.insert(0, 0)
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 / j
        r = n2 - n * j
        print "n =", n, "max_n2 =", len(x) * j - 2, "n2 =", n2, "lx =", len(x), "r =", r
        val = j * x[n] + r * (x[n + 1] - x[n])
        rtn.append(val)
        print j * x[n], r * (x[n + 1] - x[n]), val
    return rtn

if __name__ == '__main__':
    print psiFunction(3, [0, 1, 2, 3, 4])
Calling this module leads to some debugging output and, at the end, the mentioned error message.
Obviously, your x[n + 1] access fails, as n is 4 there, so n + 1 is 5, one too many for indexing the x array, which has length 5 and thus indices 0 to 4.
EDIT: Your psi_j() gives me the same behaviour.
Let me continue guessing: Whatever we want to do, we have to ensure that n + 1 stays below len(x). So maybe a
for n2 in range(0, (len(x) - 1) * j):
would be helpful. It only produces the numbers 0..11, but I think this is the only thing that can be expected out of it: the last items can only be
3*3 + 0*(4-3)
3*3 + 1*(4-3)
3*3 + 2*(4-3)
and stop. And this is achieved with the limit I mention here.