Julia CUDA - Reduce matrix columns

Consider the following kernel, which reduces along the rows of a 2-D matrix:

using CUDA

function row_sum!(x, ncol, out)
    """out = sum(x, dims=2)"""
    row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:ncol
        @inbounds out[row_idx] += x[row_idx, i]
    end
    return
end

N = 1024
x = CUDA.rand(Float64, N, 2*N)
out = CUDA.zeros(Float64, N)
@cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out)
isapprox(out, sum(x, dims=2)) # true
How do I write a similar kernel except for reducing along the columns (of a 2-D matrix)? In particular, how do I get the index of each column, similar to how we got the index of each row with row_idx?

Here is the code:

function col_sum!(x, nrow, out)
    """out = sum(x, dims=1)"""
    col_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:nrow
        @inbounds out[col_idx] += x[i, col_idx]
    end
    return
end

N = 1024
x = CUDA.rand(Float64, N, 2N)
out = CUDA.zeros(Float64, 2N)
@cuda threads=256 blocks=8 col_sum!(x, size(x, 1), out)
And here is the test:
julia> isapprox(out, vec(sum(x, dims=1)))
true
As you can see, the size of the result vector is now 2N instead of N, so the number of blocks has to be adapted accordingly (multiplied by 2, giving 8 instead of 4, so that 256 threads x 8 blocks = 2048 = 2N threads cover every column).
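Note that the launch configuration above covers the output exactly (256 * 8 = 2048 = 2N). As a minimal sketch (my addition, not part of the original answer), a bounds check keeps the kernel safe when threads * blocks overshoots the number of columns:

function col_sum_guarded!(x, nrow, out)
    col_idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard: surplus threads whose index falls past the last column do nothing.
    if col_idx <= length(out)
        for i = 1:nrow
            @inbounds out[col_idx] += x[i, col_idx]
        end
    end
    return
end

# cld is ceiling division, so every column still gets a thread:
# @cuda threads=256 blocks=cld(size(x, 2), 256) col_sum_guarded!(x, size(x, 1), out)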
More materials can be found here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

Related

How to plot Iterations in Julia

I coded a function picircle() that estimates pi.
Now I would like to plot this function for N values.
function Plotpi()
    p = 100 # precision of π
    N = 5
    for i in 1:N
        picircle(p)
    end
end
3.2238805970149254
3.044776119402985
3.1641791044776117
3.1243781094527363
3.084577114427861
Now I am not sure how to plot the function. I tried plot(PP()) but it didn't work.
Here I defined picircle:
function picircle(n)
n = n
L = 2n+1
x = range(-1, 1, length=L)
y = rand(L)
center = (0,0)
radius = 1
n_in_circle = 0
for i in 1:L
if norm((x[i], y[i]) .- center) < radius
n_in_circle += 1
end
end
println(4 * n_in_circle / L)
end
Your problem is that your functions don't actually return anything:
julia> x = Plotpi()
3.263681592039801
3.0646766169154227
2.845771144278607
3.18407960199005
3.044776119402985
julia> x
julia> typeof(x)
Nothing
The numbers you see are just printed to the REPL, and print doesn't return any value:
julia> x = print(5)
5
julia> typeof(x)
Nothing
So you probably just want to change your function so that it returns what you want to plot:
julia> function picircle(n)
           n = n
           L = 2n+1
           x = range(-1, 1, length=L)
           y = rand(L)
           center = (0,0)
           radius = 1
           n_in_circle = 0
           for i in 1:L
               if norm((x[i], y[i]) .- center) < radius
                   n_in_circle += 1
               end
           end
           4 * n_in_circle / L
       end
Then:
julia> x = picircle(100)
3.263681592039801
julia> x
3.263681592039801
So now the value of the function is actually returned (rather than just printed to the console). You don't really need a separate function if you just want to do this multiple times and plot the results; a comprehension will do. Here's an example comparing the variability of the estimate with 100 draws vs 20 draws:
julia> using Plots
julia> histogram([picircle(100) for _ ∈ 1:1_000], label = "100 draws", alpha = 0.5)
julia> histogram!([picircle(20) for _ ∈ 1:1_000], label = "20 draws", alpha = 0.5)
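If you want the original goal of plotting the successive estimates themselves rather than their distribution, a minimal sketch (my addition, reusing the corrected picircle above):

using Plots

estimates = [picircle(100) for _ in 1:50]  # 50 runs of the estimator
plot(estimates, xlabel = "run", ylabel = "π estimate", label = "picircle(100)")
hline!([π], label = "π")  # reference line at the true value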

kmer counts with cython implementation

I have this function implemented in Cython:
def count_kmers_cython(str string, list alphabet, int kmin, int kmax):
    """
    Count occurrence of kmers in a given string.
    """
    counter = {}
    cdef int i
    cdef int j
    cdef int N = len(string)
    limits = range(kmin, kmax + 1)
    for i in range(0, N - kmax + 1):
        for j in limits:
            kmer = string[i:i+j]
            counter[kmer] = counter.get(kmer, 0) + 1
    return counter
Can I do better with Cython? Is there any way to improve it?
I am new to Cython; this is my first attempt.
I will use this to count kmers in DNA with the alphabet restricted to 'ACGT'. The input strings are typical bacterial genomes (130 kb to over 14 Mb, where 1 kb = 1000 bp).
The kmer sizes will be in the range 3 < k < 16.
I wish to know if I could go further and maybe use cython in this function to:
import math

from termcolor import colored  # for the colored progress message

def compute_kmer_stats(kmer_list, counts, len_genome, max_e):
    """
    This function computes the z_score to find under/over represented kmers
    according to a cut off e-value.
    Inputs:
        kmer_list - a list of kmers
        counts - a dictionary-type with k-mers as keys and counts as values.
        len_genome - the total length of the sequence(s).
        max_e - cut off e-values to report under/over represented kmers.
    Outputs:
        results - a list of lists as [k-mer, observed count, expected count, z-score, e-value]
    """
    print(colored('Starting to compute the kmer statistics...\n',
                  'red',
                  attrs=['bold']))
    results = []
    # number of tests, used to convert p-value to e-value.
    n = len(list(kmer_list))
    for kmer in kmer_list:
        k = len(kmer)
        prefix, sufix, center = counts[kmer[:-1]], counts[kmer[1:]], counts[kmer[1:-1]]
        # avoid zero division error
        if center == 0:
            expected = 0
        else:
            expected = (prefix * sufix) // center
        observed = counts[kmer]
        sigma = math.sqrt(expected * (1 - expected / (len_genome - k + 1)))
        # avoid zero division error
        if sigma == 0.0:
            z_score = 0.0
        else:
            z_score = ((observed - expected) / sigma)
        # pvalue for all kmers/palindromes under represented
        p_value_under = (math.erfc(-z_score / math.sqrt(2)) / 2)
        # pvalue for all kmers/palindromes over represented
        p_value_over = (math.erfc(z_score / math.sqrt(2)) / 2)
        # evalue for all kmers/palindromes under represented
        e_value_under = (n * p_value_under)
        # evalue for all kmers/palindromes over represented
        e_value_over = (n * p_value_over)
        if e_value_under <= max_e:
            results.append([kmer, observed, expected, z_score, p_value_under, e_value_under])
        elif e_value_over <= max_e:
            results.append([kmer, observed, expected, z_score, p_value_over, e_value_over])
    return results
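For reference, the statistic this loop computes for each k-mer $w$ of length $k$ can be written compactly as (my transcription of the code above):

$$E(w) = \frac{C(w_{1..k-1})\, C(w_{2..k})}{C(w_{2..k-1})}, \qquad \sigma(w) = \sqrt{E(w)\Bigl(1 - \frac{E(w)}{L - k + 1}\Bigr)}, \qquad Z(w) = \frac{C(w) - E(w)}{\sigma(w)},$$

where $C(\cdot)$ is an observed count and $L$ is len_genome; the one-sided p-values $\operatorname{erfc}(\mp Z/\sqrt{2})/2$ are then turned into e-values by multiplying by the number of k-mers tested (a Bonferroni correction).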
OBS - Thank you CodeSurgeon for the help. I know there are other tools to count kmers efficiently, but I am learning Python, so I am trying to write my own functions and code.

Silhouette function gives me an error: element number 2 undefined in return list

I am trying to check the performance of k-means with the silhouette function, but I am getting an error.
I am calling the function like this: [out1, out2] = silhouette(normalized, idx); or [out1, out2] = silhouette(normalized, idx, 'cosine');
The definition of the function is function [si, h] = silhouette(X, clust, metric).
I expect to get a number between -1 and +1, but instead I am getting error: element number 2 undefined in return list.
My code for the silhouette function:

function [si, h] = silhouette(X, clust, metric)
  % Nan Zhou
  % Code Matlab 'silhouette' into Octave function
  % Sichuan University, Macquarie University
  % zhnanx@gmail.com

  % input parameters
  %   X, n-by-p data matrix
  %     - Rows of X correspond to points, columns correspond to coordinates.
  %   clust, clusters defined for X; n-by-1 vector
  %   metric, e.g. Euclidean, sqEuclidean, cosine

  % return values
  %   si, silhouette values, n-by-1 vector
  %   h, figure handle, waiting to be solved in the future

  % algorithm reference
  %   - Peter J. Rousseeuw (1987)
  %   - Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis
  %   - doi:10.1016/0377-0427(87)90125-7

  % check size
  if (size(X, 1) != size(clust, 1))
    error("First dimension of X <%d> doesn't match that of clust <%d>",...
          size(X, 1), size(clust, 1));
  endif

  % check metric
  if (! exist('metric', 'var'))
    metric = 'sqEuclidean';
  endif

  %%%% function set
  function [dist] = EuclideanDist(x, y)
    dist = sqrt((x - y) * (x - y)');
  endfunction

  function [dist] = sqEuclideanDist(x, y)
    dist = (x - y) * (x - y)';
  endfunction

  function [dist] = cosineDist(x, y)
    cosineValue = dot(x,y)/(norm(x,2)*norm(y,2));
    dist = 1 - cosineValue;
  endfunction
  %%% end function set

  % calculating
  si = zeros(size(X, 1), 1);
  %h  (note: h is declared as an output but never assigned anywhere in this
  %    function, which is why "[out1, out2] = silhouette(...)" fails with
  %    "element number 2 undefined in return list")

  % calculate values of si one by one
  for iii = 1:length(si)
    %%% distance of iii to all others
    iii2all = zeros(size(X, 1), 1);
    for jjj = 1:size(X, 1)
      switch (metric)
        case 'Euclidean'
          iii2all(jjj) = EuclideanDist(X(iii, :), X(jjj, :));
        case 'sqEuclidean'
          iii2all(jjj) = sqEuclideanDist(X(iii, :), X(jjj, :));
        case 'cosine'
          iii2all(jjj) = cosineDist(X(iii, :), X(jjj, :));
        otherwise
          error('Invalid metric.');
      endswitch
    endfor
    %%% end distance to all

    %%% allocate values to clusters
    clusterIDs = unique(clust); % eg [1; 2; 3; 4]
    groupedValues = {};
    for jjj = 1:length(clusterIDs)
      groupedValues{clusterIDs(jjj)} = [iii2all(clust == clusterIDs(jjj))];
    endfor
    %%% end allocation

    %%% calculate a(i)
    % dist of object iii to all other objects in the same cluster
    a_iii = groupedValues{clust(iii)};
    % average distance of iii to all other objects in the same cluster
    a_i = sum(a_iii) / (size(a_iii, 1) - 1);
    %disp(a_i);pause;
    %%% end a(i)

    %%% calculate b(i)
    clusterIDs_new = clusterIDs;
    % remove the cluster iii is in
    clusterIDs_new(find(clusterIDs_new == clust(iii))) = [];
    % average distance of iii to all objects of another cluster
    a_iii_2others = zeros(length(clusterIDs_new), 1);
    for jjj = 1:length(clusterIDs_new)
      values_another = groupedValues{clusterIDs_new(jjj)};
      a_iii_2others(jjj) = mean(values_another);
    endfor
    b_i = min(a_iii_2others);
    %disp(b_i);disp('---');pause;
    %%% end b(i)

    %%% calculate s(i)
    si(iii) = (b_i - a_i) / max([a_i; b_i]);
    %%% end s(i)
  endfor
end
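For reference, the per-point quantity the main loop implements is Rousseeuw's silhouette (my restatement of the code above):

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1],$$

where $a(i)$ is the mean distance from point $i$ to the other members of its own cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other single cluster.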

What is an efficient way to broadcast a value to all the threads after using blockReduce from Cuda Unbound?

let column_mean_partial = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
if threadIdx.x = 0 then
    means.[col] <- column_mean_partial
    column_mean_shared := column_mean_partial
__syncthreads()
let column_mean = !column_mean_shared
The above code snippet calculates the mean of every column of a 2-D matrix. As only thread 0 of a block has the fully reduced value, I store it in shared memory (column_mean_shared), call __syncthreads(), and then broadcast it to all the threads in the block, since they need that value in order to calculate the variance.
Would there be a better way to broadcast the value or is the above efficient enough already?
I hadn't expected much when I posted this question, but it turns out that for large matrices, such as 128x10000, there is a much better way. I wrote a warpReduce kernel with a block size of 32, which allows it to do the whole reduction using shuffle xor.
For a 128x100000 matrix over 100 iterations, the first version, which used 64 blocks per grid (and 32 threads per block), took 0.5s. For the CUB row reduce it took 0.25s.
When I increased the blocks per grid to 256, I got a nearly 4x speedup to about 1.5s. At 384 blocks per grid it takes 1.1s, and increasing the number of blocks does not seem to improve performance from there.
For the problem sizes that I am interested in, the improvements are not nearly as dramatic.
For the 128x1024 and 128x512 cases, 10000 iterations:
For 1024: 1s vs 0.82s in favor of warpReduce.
For 512: 0.914s vs 0.873s in favor of warpReduce.
For small matrices, any speedups from parallelism are eaten away by kernel launch times it seems.
For 256: 0.94s vs 0.78s in favor of warpReduce.
For 160: 0.88s vs 0.77s in favor of warpReduce.
It was tested using a GTX 970.
It is likely that the figures would be different for Kepler and earlier NVIDIA cards, as Maxwell raised the limit on resident blocks per SM from 16 to 32, which improves multiprocessor occupancy.
I am satisfied with this as the performance improvement is nice and I actually hadn't been able to write a kernel that uses shared memory correctly before reaching for the Cub block reduce. I forgot how painful Cuda can be sometimes. It is amazing that the version that uses no shared memory is so competitive.
Here are the two modules that I tested with:
type rowModule(target) =
    inherit GPUModule(target)

    let grid_size = 64
    let block_size = 128
    let blockReducer = BlockReduce.RakingCommutativeOnly<float32>(dim3(block_size,1,1), worker.Device.Arch)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // Point block_start to where the column starts in the array.
        let mutable col = blockIdx.x
        let temp_storage = blockReducer.TempStorage.AllocateShared()
        let column_mean_shared = __shared__.Variable()
        while col < num_cols do
            // row is the row index
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index
                row <- row + blockDim.x
            let column_mean_partial = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then
                means.[col] <- column_mean_partial
                column_mean_shared := column_mean_partial
            __syncthreads()
            let column_mean = !column_mean_shared
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                // Accumulate the variances.
                acc <- acc + (x.[idx] - column_mean) * (x.[idx] - column_mean)
                // Increment the row index
                row <- row + blockDim.x
            let variance_sum = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then stds.[col] <- sqrt(variance_sum)
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
type rowWarpModule(target) =
    inherit GPUModule(target)

    let grid_size = 384
    let block_size = 32

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // Point block_start to where the column starts in the array.
        let mutable col = blockIdx.x
        while col < num_cols do
            // row is the row index
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index
                row <- row + blockDim.x
            let inline butterflyWarpReduce (value:float32) =
                let v1 = value + __shfl_xor value 16 32
                let v2 = v1 + __shfl_xor v1 8 32
                let v3 = v2 + __shfl_xor v2 4 32
                let v4 = v3 + __shfl_xor v3 2 32
                v4 + __shfl_xor v4 1 32
            let column_mean = (butterflyWarpReduce acc) / (float32 num_rows)
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the array
                let idx = row + col * num_rows
                // Accumulate the variances.
                acc <- acc + (x.[idx] - column_mean) * (x.[idx] - column_mean)
                // Increment the row index
                row <- row + blockDim.x
            let variance_sum = (butterflyWarpReduce acc) / (float32 num_rows)
            if threadIdx.x = 0 then
                stds.[col] <- sqrt(variance_sum)
                means.[col] <- column_mean
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
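For readers coming from the Julia sections above, the same butterfly (XOR) warp reduction can be sketched with CUDA.jl's shuffle intrinsics; this is my illustrative sketch (warp_sum is a hypothetical helper, not part of the original post), meant to be called from inside a kernel:

using CUDA

# Butterfly (XOR) warp reduction: after the five exchanges below, every lane
# of the 32-thread warp holds the full sum, so the "broadcast" comes for free,
# with no shared memory and no synchronization.
function warp_sum(val::Float32)
    mask = 0xffffffff  # all 32 lanes participate
    val += shfl_xor_sync(mask, val, 16)
    val += shfl_xor_sync(mask, val, 8)
    val += shfl_xor_sync(mask, val, 4)
    val += shfl_xor_sync(mask, val, 2)
    val += shfl_xor_sync(mask, val, 1)
    return val
end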

Tweaking a Function in Python

I am trying to get the following code to do a few more tricks:
from Tkinter import *  # Python 2

class App(Frame):
    def __init__(self, master):
        Frame.__init__(self, master)
        self.grid()
        self.create_widgets()

    def create_widgets(self):
        # indexEntry and listEntry are Entry widgets assumed to be created
        # elsewhere in the full program; only answerLabel is shown here.
        self.answerLabel = Label(self, text="Output List:")
        self.answerLabel.grid(row=2, column=1, sticky=W)

    def psiFunction(self):
        j = int(self.indexEntry.get())
        valueList = list(self.listEntry.get())
        x = map(int, valueList)
        if x[0] != 0:
            x.insert(0, 0)
        rtn = []
        for n2 in range(0, len(x) * j - 2):
            n = n2 / j
            r = n2 - n * j
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
        self.answer = Label(self, text=rtn)
        self.answer.grid(row=2, column=2, sticky=W)

if __name__ == "__main__":
    root = Tk()
    app = App(root)     # assumed standard Tkinter startup
    root.mainloop()     # (not part of the original snippet)
In particular, I am trying to get it to calculate len(x) * j - 1 terms, and to work for a variety of parameter values. If you try running it, you should find that you get errors for larger parameter values. For example, with the list 0,1,2,3,4 and parameter j=3, we should run through the program and get 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. However, I get an error that the last value is 'out of range' if I try to compute it.
I believe it's an issue with my function as defined. It seems the issue with parameters has something to do with the way it ties the parameter to the n value. Consider 0123. It works great if I use 2 as my parameter (called index in the function) but fails if I use 3.
EDIT:
def psi_j(x, j):
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 / j
        r = n2 - n * j
        if r == 0:
            rtn.append(j * x[n])
        else:
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
        print 'n2 =', n2, ': n =', n, ' r =', r, ' rtn =', rtn
    return rtn
For example if we have psi_j(x,2) with x = [0,1,2,3,4] we will be able to get [0,1,2,3,4,5,6,7,8,9,10,11] with an error on 12.
The idea though is that we should be able to calculate that last term. It is the 12th term of our output sequence, and 12 = 3*4+0 => 3*x[4] + 0*(x[n+1]-x[n]). Now, there is no 5th term to calculate so that's definitely an issue but we do not need that term since the second part of the equation is zero. Is there a way to write this into the equation?
If we think about the example data [0, 1, 2, 3] and a j of 3, the problem is that we're trying to get x[4] in the last iteration.
len(x) * j - 2 for this data is 10
range(0, 10) is 0 through 9.
Manually processing our last iteration allows us to resolve the code to this:
n = 3 # or 9 / 3
r = 0 # or 9 - 3 * 3
rtn.append(3 * x[3] + 0 * (x[3 + 1] - x[3]))
We have code trying to reach x[3 + 1], which doesn't exist when we only have indices 0 through 3.
To fix this, we could rewrite the code like this:

n = n2 / j
r = n2 - n * j
if r == 0:
    rtn.append(j * x[n])
else:
    rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
If r is 0, then (x[n + 1] - x[n]) is irrelevant.
Please correct me if my math is wrong on that. I can't see a case where n >= len(x) and r != 0, but if that's possible, then my solution is invalid.
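A short check of that edge case (my addition): writing $n_2 = nj + r$ with $0 \le r < j$, the loop's largest index is $n_2^{\max} = \operatorname{len}(x)\,j - 3$, so when $n = \operatorname{len}(x) - 1$ the remainder can reach $r = j - 3$. For $j \le 3$ this forces $r = 0$, where the term collapses to $j\,x_n$ and $x_{n+1}$ is never evaluated; for $j > 3$, iterations with $n = \operatorname{len}(x) - 1$ and $r > 0$ can still occur, which is exactly the case the caveat above warns about.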
Without understanding what the purpose of the function is (is it a kind of filter? or a smoothing function?), I picked it out of the GUI stuff and tested it alone:
def psiFunction(j, valueList):
    x = map(int, valueList)
    if x[0] != 0:
        x.insert(0, 0)
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 / j
        r = n2 - n * j
        print "n =", n, "max_n2 =", len(x) * j - 2, "n2 =", n2, "lx =", len(x), "r =", r
        val = j * x[n] + r * (x[n + 1] - x[n])
        rtn.append(val)
        print j * x[n], r * (x[n + 1] - x[n]), val
    return rtn

if __name__ == '__main__':
    print psiFunction(3, [0, 1, 2, 3, 4])
Calling this module leads to some debugging output and, at the end, the mentioned error message.
Obviously, your x[n + 1] access fails, as n is 4 there, so n + 1 is 5, one too many for accessing the x array, which has length 5 and thus indices from 0 to 4.
EDIT: Your psi_j() gives me the same behaviour.
Let me continue guessing: Whatever we want to do, we have to ensure that n + 1 stays below len(x). So maybe a
for n2 in range(0, (len(x) - 1) * j):
would be helpful. It only produces the numbers 0..11, but I think this is the only thing that can be expected of it: the last items can only be
3*3 + 0*(4-3)
3*3 + 1*(4-3)
3*3 + 2*(4-3)
and stop. And this is achieved with the limit I mention here.