CUDA matrix multiplication by columns

I'm trying to do matrix multiplication in CUDA. My implementation is different from the CUDA example.
The CUDA example (from the CUDA samples) performs matrix multiplication by multiplying each value in the row of the first matrix by each value in the column of the second matrix, then summing the products and storing the result in an output vector at the index of the row from the first matrix.
My implementation multiplies each value in a column of the first matrix by the single value of the second matrix whose row index equals that column index. It then accumulates into an output vector in global memory, with every index of that vector being updated.
The cuda example implementation can have a single thread update each index in the output vector, whereas my implementation can have multiple threads updating each index.
The results I get contain only some of the updates. For example, if an index should receive 4 updates, only 1 or 2 of them appear.
I think that the threads might be interfering with each other since they're all trying to write to the same indices of the vector in global memory. So maybe, while one thread is writing to an index, the other might not be able to insert its value and update the index?
Just wondering if this assessment makes sense.
For example. To multiply the following two matrices:
[3 0 0 2 [1 [a
3 0 0 2 x 2 = b
3 0 0 0 3 c
0 1 1 0] 4] d]
The CUDA sample does matrix multiplication in the following way, using 4 threads, where a, b, c, d are stored in global memory:
Thread 0: 3*1 + 0*2 + 0*3 + 2*4 = a
Thread 1: 3*1 + 0*2 + 0*3 + 2*4 = b
Thread 2: 3*1 + 0*2 + 0*3 + 0*4 = c
Thread 3: 0*1 + 1*2 + 1*3 + 0*4 = d
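In kernel form, that sample-style scheme corresponds to something like the sketch below, with one thread per output entry. The names (mat, vec, out) and the column-major layout are illustrative assumptions, not the actual sample code:
__global__ void rowDotKernel(const float *mat, const float *vec, float *out, int n)
{
    // One thread per output entry: thread i computes the full dot
    // product of row i with the vector and writes out[i] exactly once,
    // so no two threads ever touch the same output location.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += mat[i + j * n] * vec[j]; // row i, column j (column-major)
        out[i] = sum;
    }
}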
My implementation looks like this:
a = b = c = d = 0
Thread 0:
a += 3*1
b += 3*1
c += 3*1
d += 0*1
Thread 1:
a += 0*2
b += 0*2
c += 0*2
d += 1*2
Thread 2:
a += 0*3
b += 0*3
c += 0*3
d += 1*3
Thread 3:
a += 2*4
b += 2*4
c += 0*4
d += 0*4
So at one time all four threads could be trying to update one of the indices.

In order to fix this issue, I used atomicAdd to do the += operation. When a thread performs the operation a += 3*1 (for example), it does three things.
It gets the previous value of a
It updates the value by doing 3*1 + previous value of a
It then stores the new value into a
Using atomicAdd guarantees that these three steps complete without interruption from other threads. Without it, thread 0 could read the previous value of a, and while thread 0 is computing its update, thread 1 could read the same previous value and perform its own update; whichever thread writes last overwrites the other, and one addition is lost.
In other words, if a += 3*1 is used instead of atomicAdd(&a, 3*1), thread 1 can interfere and change a before thread 0 finishes what it's doing. It creates a race condition.
atomicAdd is an atomic += operation. Note that the operand must live in global or shared memory, not in a thread-local variable, so you would use code like the following:
__global__ void kernel(int *a){
    atomicAdd(a, 3*1); // atomically performs *a += 3*1
}
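Applied to the column-wise scheme above, the kernel could look like the sketch below (again with illustrative names and a column-major layout; this outlines the approach rather than reproducing the original code):
__global__ void colAtomicKernel(const float *mat, const float *vec, float *out, int n)
{
    // One thread per column: thread j scales column j by vec[j] and
    // accumulates into every output entry. All threads write to the
    // same out[i] locations, so each update must be atomic.
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {
        float v = vec[j];
        for (int i = 0; i < n; i++)
            atomicAdd(&out[i], mat[i + j * n] * v); // safe read-modify-write
    }
}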

Related

What is the correlation between dimensional nature of threads and the dimensions of the data itself in CUDA?

I've read as a beginner that using a 2D block of threads is the simplest way to deal with a 2D dataset. I am trying to implement the following matrix operations in sequence:
Swap elements at odd and even positions of each row in the matrix:
1 2           2 1
3 4  becomes  4 3
Reflect the elements of the matrix across the principal diagonal:
2 1           2 4
4 3  becomes  1 3
To implement this, I wrote the following kernel:
__global__ void swap_and_reflect(float *d_input, float *d_output, int M, int N)
{
    int j = threadIdx.x;
    int i = threadIdx.y;
    for(int t=0;t<M*N;t++)
        d_output[t] = d_input[t];
    float temp = 0.0;
    if (j%2 == 0){
        temp = d_output[j];
        d_output[j] = d_output[j+1];
        d_output[j+1] = temp;
    }
    __syncthreads(); // Wait for swap to complete
    if (i!=j){
        temp = d_output[i];
        d_output[i] = d_output[j];
        d_output[j] = temp;
    }
}
The reflection does not happen as expected. At this point I find myself confused about how the 2D structure of the executing threads maps onto the 2D structure of the matrix itself.
Could you please correct my understanding of the multi-dimensional arrangement of threads and how it correlates with the dimensionality of the data itself? I believe this is why I have the reflection part wrong.
Any pointers/resources that could help me visualize/understand this correctly would be of immense help.
Thank you for reading.
The thread indices are laid out in your hypothetical 2x2 block in (x,y) pairs as
(0,0) (0,1)
(1,0) (1,1)
and the ordering is
thread ID   (x,y) pair
---------   ----------
    0         (0,0)
    1         (1,0)
    2         (0,1)
    3         (1,1)
You need to choose an ordering for your array in memory and then modify your kernel accordingly, for example:
if (i<j){ // only one thread of each (i,j)/(j,i) pair performs the swap
    temp = d_output[i+2*j];
    d_output[i+2*j] = d_output[j+2*i];
    d_output[j+2*i] = temp;
}

Writing the Fibonacci Sequence Elegantly in Python

I am trying to improve my programming skills by writing functions in multiple ways; this teaches me new ways of writing code and helps me understand other people's styles. Below is a function that calculates the sum of all even numbers in the Fibonacci sequence up to a maximum value. Do you have any recommendations for writing this algorithm differently, maybe more compactly or more pythonically?
def calcFibonacciSumOfEvenOnly():
    MAX_VALUE = 4000000
    sumOfEven = 0
    prev = 1
    curr = 2
    while curr <= MAX_VALUE:
        if curr % 2 == 0:
            sumOfEven += curr
        temp = curr
        curr += prev
        prev = temp
    return sumOfEven
I do not want to write this function recursively since I know it takes up a lot of memory even though it is quite simple to write.
You can use a generator to produce even numbers of a fibonacci sequence up to the given max value, and then obtain the sum of the generated numbers:
def even_fibs_up_to(m):
    a, b = 0, 1
    while a <= m:
        if a % 2 == 0:
            yield a
        a, b = b, a + b
So that:
print(sum(even_fibs_up_to(50)))
would output: 44 (0 + 2 + 8 + 34 = 44)

Is this a CUDA thread synchronization issue or something else?

I am very new to parallel programming and Stack Overflow. I am working on a matrix multiplication implementation using CUDA. I am using column-order float arrays as matrix representations.
The algorithm I developed is a bit unique and goes as follows. Given an n x m matrix A and an m x k matrix B, I launch an n x k grid of blocks with m threads in each block. Essentially, I launch a block for every entry in the resulting matrix, with each thread computing one multiplication for that entry. For example,
1 0 0 0 1 2
0 1 0 * 3 4 5
0 0 1 6 7 8
For the first entry in the resulting matrix (row 0 of A dotted with column 0 of B) I would launch each thread with
thread 0 computing 1 * 0
thread 1 computing 0 * 3
thread 2 computing 0 * 6
with each thread adding its product into a 0-initialized result matrix.
Right now, I am not getting a correct answer. I am getting this over and over again
0 0 2
0 0 5
0 0 8
My kernel function is below. Could this be a thread synchronization problem or am I screwing up array indexing or something?
/* @param d_A: column-order matrix
 * @param d_B: column-order matrix
 * @param d_result: 0-initialized matrix that kernels write to
 * @param dim_A: dimensionality of A (number of rows)
 * @param dim_B: dimensionality of B (number of rows)
 */
__global__ void dot(float *d_A, float *d_B, float *d_result, int dim_A, int dim_B) {
    int n = blockIdx.x;
    int k = blockIdx.y;
    int m = threadIdx.x;
    float a = d_A[(m * dim_A) + n];
    float b = d_B[(k * dim_B) + m];
    //d_result[(k * dim_A) + n] += (a * b);
    __syncthreads();
    float temp = d_result[(k*dim_A) + n];
    __syncthreads();
    temp = temp + (a * b);
    __syncthreads();
    d_result[(k*dim_A) + n] = temp;
    __syncthreads();
}
The whole idea of using __syncthreads() here is wrong. This API call has block scope.
1. __syncthreads();
2. float temp = d_result[(k*dim_A) + n];
3. __syncthreads();
4. temp = temp + (a * b);
5. __syncthreads();
6. d_result[(k*dim_A) + n] = temp;
7. __syncthreads();
The local variable float temp has thread scope, so surrounding it with a synchronization barrier is senseless.
The pointer d_result points to global memory, and the barrier is senseless there too: __syncthreads() only synchronizes the threads of a single block, and no barrier is available (and may never be) that synchronizes threads globally across all blocks.
Typically __syncthreads() is needed when shared memory is used in the computation, and shared memory is what you would want to use here. The CUDA C Programming Guide shows how to use shared memory and __syncthreads() properly in its tiled matrix multiplication example.
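To make the block scope of __syncthreads() concrete, here is a hedged sketch that keeps the one-block-per-output-entry design but accumulates the per-thread products in shared memory with a tree reduction. It assumes blockDim.x (that is, m) is a power of two and that the kernel is launched with m * sizeof(float) bytes of dynamic shared memory, e.g. dot_shared<<<dim3(n, k), m, m * sizeof(float)>>>(...):
__global__ void dot_shared(const float *d_A, const float *d_B, float *d_result,
                           int dim_A, int dim_B)
{
    extern __shared__ float partial[]; // one slot per thread

    int n = blockIdx.x;  // row of the output entry
    int k = blockIdx.y;  // column of the output entry
    int m = threadIdx.x; // position along the inner dimension

    // Each thread contributes exactly one product for this entry.
    partial[m] = d_A[(m * dim_A) + n] * d_B[(k * dim_B) + m];
    __syncthreads(); // all products are now visible block-wide

    // Tree reduction in shared memory; a barrier is needed at every
    // step because threads read values written by other threads.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (m < s)
            partial[m] += partial[m + s];
        __syncthreads();
    }

    // A single thread writes the finished entry, so there is no race.
    if (m == 0)
        d_result[(k * dim_A) + n] = partial[0];
}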

Iterating through matrix rows in Octave without using an index or for loop

I am trying to understand if it's possible to use Octave more efficiently by removing the for loop I'm using to calculate a formula on each row of a matrix X:
myscalar = 0
for i = 1:size(X, 1)
  myscalar += X(i, :) * y(i) % y is a vector of dimension size(X, 1)
  ...
The formula is more complicated than adding to a scalar. The question here is really how to iterate through the rows of X without an index, so that I can eliminate the for loop.
Yes, you can use broadcasting for this (you will need Octave 3.6.0 or later). If you know Python, this is the same idea as NumPy broadcasting. Simply multiply the matrix by the column. Finally, cumsum does the addition, but we only want the last row.
newx = X .* y;
myscalars = cumsum (newx, 1) (end,:);
or in one line without temp variables
myscalars = cumsum (X .* y, 1) (end,:);
If the sizes are right, broadcasting is automatically performed. For example:
octave> a = [ 1 2 3
              1 2 3
              1 2 3];
octave> b = [ 1 0 2];
octave> a .* b'
warning: product: automatic broadcasting operation applied
ans =
   1   2   3
   0   0   0
   2   4   6
octave> a .* b
warning: product: automatic broadcasting operation applied
ans =
   1   0   6
   1   0   6
   1   0   6
The reason for the warning is that this is a new feature that may confuse users and does not exist in Matlab. You can turn it off permanently by adding warning ("off", "Octave:broadcast") to your .octaverc file.
For anyone using an older version of Octave, the same can be accomplished by calling bsxfun directly.
myscalars = cumsum (bsxfun (@times, X, y), 1) (end,:);

f(n), understanding the equation

I've been tasked with writing MIPS instruction code for the following formula:
f(n) = 3 f(n-1) + 2 f(n-2)
f(0) = 1
f(1) = 1
I'm having issues understanding what the formula actually means.
From what I understand we are passing an int n to the doubly recursive program.
So, for example, would the equation be:
f(n) = 3*(n-1) + 2*(n-2)
If n = 10, the equation would be:
f(10) = 3*(10-1) + 2*(10-2)
I know I'm not getting this right at all because it wouldn't be recursive. Any light you could shed on what the equation actually means would be great. I should be able to write the MIPS code once I understand the equation.
I think it's a difference equation.
You're given two starting values:
f(0) = 1
f(1) = 1
f(n) = 3*f(n-1) + 2*f(n-2)
So now you can keep going like this:
f(2) = 3*f(1) + 2*f(0) = 3 + 2 = 5
f(3) = 3*f(2) + 2*f(1) = 15 + 2 = 17
So your recursive method would look like this (I'll write Java-like notation):
public int f(int n) {
    if (n == 0) {
        return 1;
    } else if (n == 1) {
        return 1;
    } else {
        return 3*f(n-1) + 2*f(n-2); // see? the recursion happens here.
    }
}
You have two base cases:
f(0) = 1
f(1) = 1
Anything else uses the recursive formula. For example, let's calculate f(4). It's not one of the base cases, so we must use the full equation. Plugging in n=4 we get:
f(4) = 3 f(4-1) + 2 f(4-2) = 3 f(3) + 2 f(2)
Hm, not done yet. To calculate f(4) we need to know what f(3) and f(2) are. Neither of those are base cases, so we've got to do some recursive calculations. All right...
f(3) = 3 f(3-1) + 2 f(3-2) = 3 f(2) + 2 f(1)
f(2) = 3 f(2-1) + 2 f(2-2) = 3 f(1) + 2 f(0)
There we go! We've reached bottom. f(2) is defined in terms of f(1) and f(0), and we know what those two values are. We were given those, so we don't need to do any more recursive calculations.
f(2) = 3 f(1) + 2 f(0) = 3×1 + 2×1 = 5
Now that we know what f(2) is, we can unwind our recursive chain and solve f(3).
f(3) = 3 f(2) + 2 f(1) = 3×5 + 2×1 = 17
And finally, we unwind one more time and solve f(4).
f(4) = 3 f(3) + 2 f(2) = 3×17 + 2×5 = 61
No, I think you're right and it is recursive. It seems to be a variation of the Fibonacci Sequence, a classic recursive problem
Remember, a recursive algorithm has 2 parts:
The base case
The recursive call
The base case specifies the point at which you cannot recurse anymore. For example, if you are sorting recursively, the base case is a list of length 1 (since a single item is trivially sorted).
So (assuming n is not negative), you have 2 base cases: n = 0 and n = 1. If your function receives an n value equal to 0 or 1, then it doesn't make sense to recurse anymore
With that in mind, your code should look something like this:
function f(int n):
    # check for base case
    # if not the base case, perform recursion
So let's use Fibonacci as an example.
In a Fibonacci sequence, each number is the sum of the 2 numbers before it. So, given the sequence 1, 2 the next number is obviously 1 + 2 = 3 and the number after that is 2 + 3 = 5, 3 + 5 = 8 and so on. Put generically, the nth Fibonacci number is the (n - 1)th Fibonacci Number plus the (n - 2)th Fibonacci Number, or f(n) = f(n - 1) + f(n - 2)
But where does the sequence start? This is where the base case comes in. Fibonacci defined his sequence as starting from 1, 1. This means that for our purposes, f(0) = f(1) = 1. So...
function fibonacci(int n):
    if n == 0 or n == 1:
        # for any n less than 2
        return 1
    elif n >= 2:
        # for any n 2 or greater
        return fibonacci(n-1) + fibonacci(n-2)
    else:
        # this must be n < 0
        # throw some error
Note that one of the reasons Fibonacci is taught along with recursion is because it shows that sometimes recursion is a bad idea. I won't get into it here but for large n this recursive approach is very inefficient. The alternative is to have 2 global variables, n1 and n2 such that...
n1 = 1
n2 = 1
print n1
print n2
loop:
    n = n1 + n2
    n2 = n1
    n1 = n
    print n
will print the sequence.
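Translated to the original recurrence f(n) = 3 f(n-1) + 2 f(n-2), the same iterative idea looks like this in C (a sketch to check your values against before hand-coding the MIPS version):
int f(int n) {
    // Base cases: f(0) = f(1) = 1.
    if (n < 2)
        return 1;
    int prev2 = 1; // f(i-2)
    int prev1 = 1; // f(i-1)
    int cur = 1;
    for (int i = 2; i <= n; i++) {
        cur = 3 * prev1 + 2 * prev2; // f(i) = 3 f(i-1) + 2 f(i-2)
        prev2 = prev1;
        prev1 = cur;
    }
    return cur; // e.g. f(4) = 61, matching the worked example above
}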