Understanding wmma::col_major vs wmma::row_major in tensor cores API - cuda

I'm having a hard time understanding this. Let's say that I fill my matrices like this:
A[i] = [a1 a2 a3 a4 a5 a6...]
B[i] = [b1 b2 b3 b4 b5 b6...]
C[i] = [c1 c2 c3 c4 c5 c6...]
And I want them to appear, e.g., matrix_a: 2x3, as
A = | a1 a2 a3 |
    | a4 a5 a6 |
To me this is a row_major approach. Now let's move on to tensor cores. We have:
C = A*B, where the matrix dimensions are given by m,n,k (small letters):
matrix_a: mxk
matrix_b: kxn
matrix_c: mxn
We also set up the fragments of the matrices with dimensions M, N, K (capital letters). Then in wmma I see the following fragment declarations:
case I
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b_frag;
But I also have seen
case II
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
and even
case III
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b_frag;
I really don't get why we need these three cases. To me, Case II alone is enough to solve this, isn't it?
To make it simple, let's focus on Block(0,0). Then the load_matrix_sync calls reduce to
Case I:
wmma::load_matrix_sync(a_frag, ...i * K * m, m);
wmma::load_matrix_sync(b_frag, ...i * K, k);
Case II:
wmma::load_matrix_sync(a_frag, ...i * K, k);
wmma::load_matrix_sync(b_frag, ...i * K * n, n);
To me it looks like Case I is computing B.T*A.T, because b_frag now is the transpose with dimension NxK and is loaded horizontally, i.e., 0, K, 2K, 3K, while a_frag now is the transpose with dimension KxM and is loaded vertically, i.e., 0, K*m, 2K*m. In contrast, in Case II we explicitly solve A*B: a_frag with dimension MxK is loaded horizontally (0, K, 2K, ...) and b_frag with dimension KxN is loaded vertically (0, K*n, 2K*n, ...).
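To check my reading of Case II, here is a minimal sketch of how I would write it in full context (assuming 16x16x16 fragments, a single block computing only the (0,0) tile of C, and made-up device pointers a, b, c to row-major matrices):
#include <mma.h>
using namespace nvcuda;

// Case II sketch: both operands row-major; one block computes only the
// (0,0) output tile. a is (m x k) and b is (k x n), both row-major.
__global__ void wmma_case2(const half *a, const half *b, float *c,
                           int n, int k) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    for (int i = 0; i < k / 16; i++) {
        // row-major A: the i-th K-tile of row block 0 starts i*K elements
        // in, loaded with row stride k (offsets 0, K, 2K, ...)
        wmma::load_matrix_sync(a_frag, a + i * 16, k);
        // row-major B: the i-th K-tile of column block 0 starts i*K rows
        // down, i.e. i*K*n elements in, with row stride n (0, K*n, 2K*n, ...)
        wmma::load_matrix_sync(b_frag, b + i * 16 * n, n);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(c, c_frag, n, wmma::mem_row_major);
}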
So why would we need these three cases? Actually I don't even know what Case III is used for. Is it also computing A*B?
Thanks

Related

Sympy unable to ignore multiplicative constant when doing an integral

I'm trying to run the following double integral.
from sympy import integrate, symbols, sqrt
b, x, y, N, L, h = symbols('b x y N L h', real=True, positive=True)
arg = h*((y-b)**2+x**2)/sqrt((y-b)**2 + x**2 + h**2)**5
Nc = integrate(arg,(b,-L/2,L/2))
Nc = integrate(Nc,(y,-N/2,N/2))
If I run this it takes 583 seconds to complete. If I remove the leading term with h, e.g.
arg = ((y-b)**2+x**2)/sqrt((y-b)**2 + x**2 + h**2)**5
Then the first integral takes the same time, but the second integral only takes 3.2 s. When I put the same integral into Maple, it was evaluated in 0.3 s. What is hanging up sympy here?

Solve for the coefficients of (functions of) the independent variable in a symbolic equation

Using Octave's symbolic package, I define a symbolic function of t like this:
>> syms a b c d t real;
>> f = poly2sym([a b c], t) + d * exp(t)
f = (sym)
    a⋅t^2 + b⋅t + c + d⋅ℯ^t
I also have another function with known coefficients:
>> g = poly2sym([2 3 5], t) + 7 * exp(t)
g = (sym)
    2⋅t^2 + 3⋅t + 7⋅ℯ^t + 5
I would like to solve f == g for the coefficients a, b, c, d such that the equation holds for all values of t. That is, I simply want to equate the coefficients of t^2 in both equations, and the coefficients of exp(t), etc. I am looking for this solution:
a = 2
b = 3
c = 5
d = 7
When I try to solve the equation using solve, this is what I get:
>> solve(f == g, a, b, c, d)
ans = (sym)
    (-b⋅t - c - d⋅ℯ^t + 2⋅t^2 + 3⋅t + 7⋅ℯ^t + 5)/t^2
It solves for a in terms of b, c, d, t. This is understandable, since in essence there is no difference between the variables b, c and t. But I was wondering if there was a method to somehow separate the terms (using their symbolic form w.r.t. the variable t) and solve the resulting system of linear equations on a, b, c, d.
Note: The function I wrote here is a minimal example. What I am really trying to do is to solve a linear ordinary differential equation using the method of undetermined coefficients. For example, I define something like y = a*exp(-t) + b*t*exp(-t), and solve for diff(y, t, t) + diff(y,t) + y == t*exp(-t). But I believe solving the problem with simpler functions will lead me to the right direction.
I have found a terribly slow and dirty method to get the job done. The coefficients have to be linear in a, b, ... though.
The idea is to follow these steps:
Write the equation in f - g form (which equals zero)
Use expand() to separate the terms
Use children() to get the terms in the equation as a symbolic vector
Now that we have the terms in a vector, we can find those that are the same function of t and add their coefficients together. The way I tested this was by checking whether the division of two terms still had t as a symbolic variable
For each term, find other terms with the same function of t, add all these coefficients together, save the obtained equation in a vector
Pass the vector of created equations to solve()
This code solves the equation I wrote in the note at the end of my question:
pkg load symbolic
syms t a b real;
y = a * exp(-t) + b * t * exp(-t);
lhs = diff(y, t, t) + diff(y, t) + y;
rhs = t * exp(-t);
expr = expand(lhs - rhs);
chd = children(expr);
used = false(size(chd));
equations = [];
for z = 1:length(chd)
  if used(z)
    continue
  endif
  coefficients = 0;
  for zz = z + 1:length(chd)
    if used(zz)
      continue
    endif
    division = chd(zz) / chd(z);
    vars = findsymbols(division);
    if sum(has(vars, t)) == 0 # division result has no t
      used(zz) = true;
      coefficients += division;
    endif
  endfor
  coefficients += 1; # for chd(z)
  vars = findsymbols(chd(z));
  nott = vars(!has(vars, t));
  if length(nott)
    coefficients *= nott;
  endif
  equations = [equations, expand(coefficients)];
endfor
solution = solve(equations == 0);
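As a hand check of what solve() should return here (a derivation, not produced by the code): with y = (a + b*t)*exp(-t) we get y' = (b - a - b*t)*exp(-t) and y'' = (a - 2*b + b*t)*exp(-t), so y'' + y' + y = (a - b + b*t)*exp(-t). Equating this to t*exp(-t) gives b = 1 and a - b = 0, hence a = b = 1.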

Find (num * (pow(b, p) - 1) / den) % mod where p is very large (10^18)

I want to find (num * (pow(b, p) - 1) / den) % mod. I know about binary exponentiation, but we can't apply it straightforwardly here. It is guaranteed that the numerator is divisible by the denominator. That means
[num * (pow(b, p) - 1)] % den == 0
constraints on mod: 1 <= mod <= 10^9, and mod might be prime or composite
constraints on b: 1 <= b <= 10
constraints on p: 1 <= p <= 10^18
constraints on num: 1 <= num <= 10^9
constraints on den: 1 <= den <= 10^9
Here pow(b, p) means b raised to the power p (b^p). How can I do it with binary exponentiation?
Your expression should be rewritten to simplify it. First let k = num/den, with k an integer according to your question.
So you have to compute
(k × (b^p - 1)) mod m = ((k mod m) × ((b^p - 1) mod m)) mod m
                      = ((k mod m) × (((b^p mod m) - 1) mod m)) mod m
                      = ((k mod m) × (((b^p mod m) + m - 1) mod m)) mod m    (1)
So the real problem is to compute b^p mod m.
Many languages (Python, Java, etc.) already have modular exponentiation in their standard libraries. Consult the documentation and use it. Otherwise, here is a C implementation.
unsigned long long modexp(unsigned long long b, unsigned long long e, unsigned long long m) {
    if (m == 1) return 0;
    unsigned long long res = 1;
    unsigned long long bb = b % m;   // current value of b^(2^t) mod m
    while (e) {
        if (e & 1)
            res = (res * bb) % m;    // fold in b^(2^t) when bit t of e is set
        e >>= 1;
        bb = (bb * bb) % m;
    }
    return res;
}
The implementation uses unsigned long long to fit your constraints. It relies on the classical trick of binary exponentiation: all values of b^l, where l is a power of two (l = 2^t), are computed and stored in the variable bb, and if the corresponding t-th bit of e is set, that value of b^l is folded into the result. Bit testing is done by checking the successive parities of e while shifting e rightward at each step.
Lastly, the fact that (a×b) mod m = ((a mod m)×(b mod m)) mod m is used to avoid computations on very large numbers. We always have res < m and bb < m, hence res and bb are representable in standard integers.
Then you just have to apply (1) to get the final result.
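For completeness, a sketch in the same C style of how (1) could be assembled, under the answer's assumption that k = num/den is an integer (the function name is illustrative):
unsigned long long solve_mod(unsigned long long num, unsigned long long den,
                             unsigned long long b, unsigned long long p,
                             unsigned long long m) {
    unsigned long long k = num / den;                     // integer by assumption
    unsigned long long t = (modexp(b, p, m) + m - 1) % m; // (b^p - 1) mod m
    return ((k % m) * t) % m;                             // formula (1)
}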
EDIT, according to the clarifications given in the comments:
To compute n=(3^p-1)/2 mod m, one can remark that
(3^p - 1)/2 = x*m + n    (as 3^p - 1 is even, x is an integer, 0 ≤ n < m)
3^p - 1 = x*2*m + 2n     (0 ≤ 2n < 2m)
so 2n = (3^p - 1) mod 2m
We can just apply the previous method with a modulo of 2*m, and divide the result (that will be even) by 2.
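A sketch of this trick in code, reusing modexp() above (the function name is mine; with m ≤ 10^9, 2*m and all intermediate products still fit in unsigned long long):
unsigned long long half_pow3_mod(unsigned long long p, unsigned long long m) {
    unsigned long long m2 = 2 * m;               // work modulo 2m
    unsigned long long t = modexp(3, p, m2);     // 3^p mod 2m
    unsigned long long twon = (t + m2 - 1) % m2; // 2n = (3^p - 1) mod 2m
    return twon / 2;                             // n; twon is even by construction
}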

MIPS Programming instruction count issue

I wrote this MIPS code to find the GCF, but I am confused about getting the number of instructions executed by this code. I need to find a linear function of the number of times the remainder must be calculated before an answer. I tried running this code step by step with QtSpim but am not sure how to proceed.
gcf:
    addiu $sp,$sp,-4     # adjust the stack for an item
    sw    $ra,0($sp)     # save return address
    rem   $t4,$a0,$a1    # r = a % b
    beq   $t4,$zero,L1   # if (r == 0) go to L1
    add   $a0,$zero,$a1  # a = b
    add   $a1,$zero,$t4  # b = r
    j     gcf
L1:
    add   $v0,$zero,$a1  # return b
    addiu $sp,$sp,4      # pop 1 item
    jr    $ra            # return to caller
There is absolutely nothing new to show here: the algorithm you just implemented is the Euclidean algorithm, and it is well known in the literature¹.
I will nonetheless write an informal analysis here, as link-only answers are evil.
First let's rewrite the code in a high-level formulation:
unsigned int gcd(unsigned int a, unsigned int b)
{
    if (a % b == 0)
        return b;
    return gcd(b, a % b);
}
The choice of unsigned int vs. int was dictated by the MIPS ISA, which makes rem undefined for negative operands.
Our goal is to find a function T(a, b) that gives the number of steps the algorithm requires to compute the GCD of a and b.
Since a direct approach leads to nothing, we try inverting the problem.
What pairs (a, b) make T(a, b) = 1; in other words, what pairs make gcd(a, b) terminate in one step?
We clearly must have a % b = 0, which means that a must be a multiple of b.
There are actually a (countably) infinite number of such pairs; we can limit ourselves to the pair with the smallest a and b².
To recap, to have T(a, b) = 1 we need a = n·b, and we pick the pair (a, b) = (1, 1).
Now, given a pair (c, d) that requires N steps, how do we find a new pair (a, b) such that T(a, b) = T(c, d) + 1?
Since gcd(a, b) must take one step more than gcd(c, d), and since from gcd(a, b) the next step is gcd(b, a % b), we must have:
c = b => b = c
d = a % b => d = a % c => a = c + d
The step d = a % c => a = c + d comes from the minimality of a: we need the smallest a that, when divided by c, gives remainder d, so we can take a = c + d since (c + d) % c = c % c + d % c = 0 + d = d.
For d % c = d to be true we need that d < c.
Our base pair was (1, 1), which doesn't satisfy this hypothesis; luckily we can take (2, 1) as the base pair (convince yourself that T(2, 1) = 1).
Then we have:
gcd(3, 2) = gcd(2, 1) = 1
T(3, 2) = 1 + T(2, 1) = 1 + 1 = 2
gcd(5, 3) = gcd(3, 2) = 1
T(5, 3) = 1 + T(3, 2) = 1 + 2 = 3
gcd(8, 5) = gcd(5, 3) = 1
T(8, 5) = 1 + T(5, 3) = 1 + 3 = 4
...
If we look at the pairs (2, 1), (3, 2), (5, 3), (8, 5), ... we see that the n-th pair (starting from 1) is made of the numbers (F(n+1), F(n)),
where F(n) is the n-th Fibonacci number.
We then have:
T(F(n+1), F(n)) = n
Regarding Fibonacci numbers, we know that F(n) ∝ φ^n.
We are now going to use all the trickery of asymptotic analysis; in particular, in the limit of big-O notation, φ^n and φ^(n+1) are the same.
Also, we won't use the big-O symbol explicitly; we rather assume that each equality is true in the limit. This is an abuse, but it makes the analysis more compact.
We can assume, without loss of generality, that N is an upper bound for both numbers in the pair and that it is proportional to φ^n.
From N ∝ φ^n we get log_φ(N) = n, which can be rewritten as log(N)/log(φ) = n (where the logs are in base 10 and log(φ) can be taken to be 1/5).
Thus we finally have 5·log(N) = n, or, written in reverse order,
n = 5·log(N)
where n is the number of steps taken by gcd(a, b) with 0 < b < a < N.
We can further show that if a = n·g and b = m·g with n and m coprime, then T(a, b) = T(n, m); thus the restriction to minimal pairs is not a loss of generality.
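Tying this back to the original question: with the MIPS code above, each pass through gcf executes seven instructions (addiu, sw, rem, beq, then either add, add, j or add, addiu, jr), so the instruction count is roughly 7n, i.e., linear in the number of remainder computations. And a small C check of the step count against the 5·log(N) estimate (a sketch, not part of the derivation above):
#include <stdio.h>
#include <math.h>

/* Count the steps T(a, b) of the Euclidean algorithm and compare them
   with the 5*log10(N) estimate on Fibonacci pairs (the worst case). */
static unsigned int steps(unsigned int a, unsigned int b) {
    unsigned int n = 1;
    while (a % b != 0) {
        unsigned int r = a % b;
        a = b;
        b = r;
        ++n;
    }
    return n;
}

int main(void) {
    unsigned int f1 = 1, f2 = 2;   /* start from the base pair (2, 1) */
    for (int i = 1; i <= 20; ++i) {
        printf("T(%u, %u) = %u   5*log10(N) = %.1f\n",
               f2, f1, steps(f2, f1), 5.0 * log10((double)f2));
        unsigned int f3 = f1 + f2;
        f1 = f2;
        f2 = f3;
    }
    return 0;
}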
¹ In the eventuality that you rediscovered this algorithm yourself, I strongly advise against continuing to read this answer. You surely have a sharp mind that would benefit more from a challenge than from an answer.
² We'll later see that this won't give rise to a loss of generality.

Transpose matrix multiplication in cuBLAS howto

The problem is simple: I have two matrices, A and B, that are M by N, where M >> N. I want to first take the transpose of A, and then multiply that by B (A^T * B) to put that into C, which is N by N. I have everything set up for A and B, but how do I call cublasSgemm properly without it returning the wrong answer?
I understand that cuBLAS has a cublasOperation_t enum for transposing things as part of the call, but somehow I'm not quite using it correctly. My matrices A and B are in row-major order, i.e. [ row1 ][ row2 ][ row3 ]... in device memory. That means for A to be interpreted as A-transposed, BLAS needs to know my A is in column-major order. My current code looks like below:
float *A, *B, *C;
// initialize A, B, C as device arrays, fill them with values
// initialize m = num_row_A, n = num_row_B, and k = num_col_A;
// set lda = m, ldb = k, ldc = m;
// alpha = 1, beta = 0;
// set up cuBlas handle ...
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
My questions:
Am I setting up m, k, n correctly?
What about lda, ldb, ldc?
Thanks!
Since cuBLAS always assumes that matrices are stored in column-major order, you could either transpose your matrices into column-major first using cublasSgeam(), or you could treat your matrix A, stored in row-major, as a new matrix AT stored in column-major. The matrix AT is then exactly the transpose of A. Do the same for B. Then you can calculate the matrix C, stored in column-major, by C = AT * BT^T.
float* AT = A;
float* BT = B;
The leading dimension is a parameter of the storage layout; it does not change whether or not you use the transpose flag CUBLAS_OP_T.
lda = num_col_A = num_row_AT = N;
ldb = num_col_B = num_row_BT = N;
ldc = num_row_C = N;
m and n in the cuBLAS GEMM routine are the #rows and #cols of the result matrix C,
m = num_row_C = num_row_AT = num_col_A = N;
n = num_col_C = num_row_BT = num_col_B = N;
k is the common dimension of A^T and B,
k = num_col_AT = num_row_B = M;
Then you could invoke the GEMM routine by
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, AT, lda, BT, ldb, &beta, C, ldc);
If you want the matrix C to be stored in row-major order, you can calculate CT, stored in column-major, with the formula CT = BT * AT^T by
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, m, k, &alpha, BT, ldb, AT, lda, &beta, CT, ldc);
Please note that you don't have to swap m and n, since C is a square matrix in this case.
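Putting the first variant together, a minimal host-side sketch (d_A, d_B, d_C are illustrative names for device arrays assumed to be allocated and filled):
#include <cublas_v2.h>

// Computes C = A^T * B (C is N x N, column-major), where d_A and d_B hold
// M x N row-major data and are therefore treated as the column-major
// N x M matrices AT and BT.
void at_times_b(cublasHandle_t handle,
                const float *d_A, const float *d_B, float *d_C,
                int M, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_T,
                N, N, M,   // m, n, k as derived above
                &alpha,
                d_A, N,    // AT with lda = N
                d_B, N,    // BT with ldb = N
                &beta,
                d_C, N);   // C with ldc = N
}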