Non-Square Matrix Multiplication in CUDA

The code I use for matrix multiplication in CUDA lets me multiply both square and non-square matrices; however, both the Width and the Height MUST be multiples of blocksize.
So, for example, I can multiply [3][6] * [6][3] (using blocksize=3), but I can't multiply [3][2]*[2][3].
Does anyone know a way to do that? This is my kernel:
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#define blocksize 3
#define HM (1*blocksize)
#define WM (2*blocksize)
#define WN (1*blocksize)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM
__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];
    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by*blocksize;
    int colN = tx + bx*blocksize;
    float Pvalue = 0;
    for (int m = 0; m < uWM/blocksize; ++m) {
        MS[ty][tx] = M[rowM*uWM + (m*blocksize + tx)];
        NS[ty][tx] = M[colN + uWN*(m*blocksize + ty)];
        __syncthreads();
        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k]*NS[k][tx];
        __syncthreads();
        P[rowM*WP + colN] = Pvalue;
    }
}
Thanks in advance!

I think the easiest thing to do would be to just pad the blocks on the end with zeros:
for (int m = 0; m < (uWM + blocksize - 1)/blocksize; ++m) {   // round up so the last, partial tile is processed
    int colM = m*blocksize + tx;
    int rowN = m*blocksize + ty;
    if (rowM >= uWN || rowN >= uWM || colM >= uWM || colN >= uWN) {
        MS[ty][tx] = 0.f;
        NS[ty][tx] = 0.f;
    } else {
        MS[ty][tx] = M[rowM*uWM + colM];
        NS[ty][tx] = N[colN + uWN*rowN];
    }
plus or minus. (That NS line should reference N, not M, right?)
But, since I seem to be the only one here advocating using existing tuned libraries when possible -- why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.
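For example, here is a minimal cuBLAS SGEMM sketch (not the poster's code; link with -lcublas). cuBLAS expects column-major storage, so for row-major data the usual trick is to compute C^T = B^T * A^T by simply swapping the operand order, with no explicit transpose:
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int m = 3, k = 2, n = 3;          // A is m x k, B is k x n, C is m x n
    float A[m*k] = {1, 2, 3, 4, 5, 6};      // row-major
    float B[k*n] = {1, 0, 1, 0, 1, 0};      // row-major
    float C[m*n] = {0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(A));
    cudaMalloc(&dB, sizeof(B));
    cudaMalloc(&dC, sizeof(C));
    cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(B), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.f, beta = 0.f;
    // Column-major view: compute the (n x m) product B^T * A^T, which is C^T.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, m, k, &alpha, dB, n, dA, k, &beta, dC, n);

    cudaMemcpy(C, dC, sizeof(C), cudaMemcpyDeviceToHost);
    for (int i = 0; i < m*n; ++i) printf("%g ", C[i]);   // C in row-major order
    printf("\n");

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}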

The underlying performance requirement here is that either the first or second dimension of the shared memory "tile" be a round multiple of 16 - historically that is what is necessary to achieve optimal global memory bandwidth (i.e. half-warp coalesced transactions). Whether it should be the first or second dimension of the tile is dictated by whether the matrices are stored in column- or row-major order. There is nothing to say that the shared memory tile need be square, only that the leading dimension of the storage (LDA in BLAS notation) be a round multiple of 16.
You could easily template the kernel with the tile dimensions as template arguments and instantiate several versions, depending on matrix dimensions. For a given architecture, there is probably an optimal tile dimension which balances occupancy and instruction-level parallelism. The "clever" way to solve this is probably to decompose the matrix multiplication into two operations - the first doing the bulk of the work at the optimal tile size, and the second at a different size for the remaining columns. If the result is going straight back to host memory after the product is completed, the second operation might best be done on the host using an optimised BLAS, overlapped with the GPU kernel. This is the approach that many of the routines in the UTK MAGMA library use.
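A sketch of what that templating could look like (hypothetical kernel and dispatch function, not the poster's code; the boundary checks let each instantiation also handle dimensions that are not tile multiples):
// Tile-size-templated matrix multiply: P (hM x wN) = M (hM x wM) * N (wM x wN), row-major.
template <int TILE>
__global__ void matmulTiled(const float *M, const float *N, float *P,
                            int hM, int wM, int wN)
{
    __shared__ float MS[TILE][TILE];
    __shared__ float NS[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;

    for (int t = 0; t < (wM + TILE - 1) / TILE; ++t) {
        int mCol = t * TILE + threadIdx.x;
        int nRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros, so partial tiles are handled.
        MS[threadIdx.y][threadIdx.x] = (row < hM && mCol < wM) ? M[row * wM + mCol] : 0.f;
        NS[threadIdx.y][threadIdx.x] = (nRow < wM && col < wN) ? N[nRow * wN + col] : 0.f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += MS[threadIdx.y][k] * NS[k][threadIdx.x];
        __syncthreads();
    }
    if (row < hM && col < wN)
        P[row * wN + col] = acc;
}

// Host-side dispatch: instantiate a couple of tile sizes and pick whichever fits best.
void launchMatmul(const float *M, const float *N, float *P, int hM, int wM, int wN)
{
    if (hM % 32 == 0 && wN % 32 == 0) {
        dim3 block(32, 32), grid(wN / 32, hM / 32);
        matmulTiled<32><<<grid, block>>>(M, N, P, hM, wM, wN);
    } else {
        dim3 block(16, 16), grid((wN + 15) / 16, (hM + 15) / 16);
        matmulTiled<16><<<grid, block>>>(M, N, P, hM, wM, wN);
    }
}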

Related

kernels accessing host memory

I am trying to get a better grasp of memory management in CUDA. There is something that is just now occurring to me as a major gap in my understanding: how do kernels access values that, as I understand it, should be in host memory?
When vectorAdd() is called, it runs the function on the device. But only the elements are stored in device memory; the lengths of the vectors are stored on the host. How is it that the kernel does not exit with an error from trying to access foo.length, something that should be on the host?
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct {
    float *elements;
    int length;
} vector;

__global__ void vectorAdd(vector foo, vector bar) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < foo.length) { // this is the part that I do not understand
        foo.elements[idx] += bar.elements[idx];
    }
}

int main(void) {
    vector foo, bar;
    foo.length = bar.length = 50;
    cudaMalloc(&(foo.elements), sizeof(float)*50);
    cudaMalloc(&(bar.elements), sizeof(float)*50);
    // these vectors are empty, so adding is just a 0.0 += 0.0
    int blocks_per_grid = 10;
    int threads_per_block = 5;
    vectorAdd<<<blocks_per_grid, threads_per_block>>>(foo, bar);
    return 0;
}
In C and C++, a typical mechanism for making arguments available to the body of a function call is pass-by-value. The basic idea is that a separate copy of each argument is made, for use by the function.
CUDA claims compliance with C++ (subject to various limitations), and it therefore provides a mechanism for pass-by-value. On a kernel call, the CUDA compiler and runtime will make copies of the arguments, for use by the function (kernel). In the case of a kernel call, these copies are placed in a particular area of __constant__ memory, which is on the GPU and within the GPU memory space, and therefore "accessible" to device code.
So, in your example, the entire structures passed as the arguments for the parameters vector foo, vector bar are copied to GPU device memory (specifically, constant memory) by the CUDA runtime. The CUDA device code is structured by the compiler in such a way as to access these arguments as needed directly from constant memory.
Since those structures contain both the elements pointer and the scalar quantity length, both items are accessible in CUDA device code, and the compiler will structure references to them (e.g. foo.length) so as to retrieve the needed quantities from constant memory.
So the kernels are not accessing host memory in your example. The pass-by-value mechanism makes the quantities available to device code, in GPU constant memory.
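A small sketch (hypothetical, not from the question) that illustrates the point: the kernel gets its own copy of the struct, so changing a member inside the kernel does not affect the host's copy, while writes through the embedded device pointer do persist:
#include <cuda_runtime.h>
#include <stdio.h>

typedef struct {
    float *elements;
    int length;
} vector;

__global__ void fill(vector v) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < v.length)
        v.elements[idx] = (float)idx;   // persists: write goes through the device pointer
    v.length = 0;                       // does not persist: v is the kernel's private copy
}

int main(void) {
    vector v;
    v.length = 8;
    cudaMalloc(&v.elements, v.length * sizeof(float));

    fill<<<1, 8>>>(v);
    cudaDeviceSynchronize();

    float host[8];
    cudaMemcpy(host, v.elements, sizeof(host), cudaMemcpyDeviceToHost);
    printf("v.length on the host is still %d\n", v.length);   // prints 8
    for (int i = 0; i < 8; ++i) printf("%g ", host[i]);        // prints 0 1 2 ... 7
    printf("\n");

    cudaFree(v.elements);
    return 0;
}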

How do I take real FFT using STM32F407G?

I've been trying to use the FFT for real data on the STM32F407G for quite some time with no luck. However, when I use the cfft function for complex data, it works. But I don't want to process the imaginary part. So my question is: how do I use the rfft function on the STM32?
Here is the code which I have tried -
#include "stm32f407xx.h"
#include "arm_math.h"
#include "arm_const_structs.h"
#include "core_cm4.h"
#define TEST_LENGTH_SAMPLES 32
extern float32_t ffttestrealip[TEST_LENGTH_SAMPLES];
static float32_t ffttestrealop[TEST_LENGTH_SAMPLES];
uint32_t fftSize = 32;
uint8_t ifftFlag = 0;
uint8_t doBitReverse = 1;
int32_t main(void)
{
    arm_rfft_fast_instance_f32 *S;
    arm_rfft_fast_init_f32(S, fftSize);
    arm_rfft_fast_f32(S, ffttestrealip, ffttestrealop, ifftFlag);
    while(1);
}
But when I compile this, it says "error: L6047U: The size of this image (83968 bytes) exceeds the maximum allowed for this version of the linker"
When I comment out "arm_rfft_fast_init_f32(S, fftSize);", it compiles but I get a wrong result.
The 32 KB limit is a binary-size limit as opposed to a RAM limit; there is no guarantee that the binary for performing the real FFT will be smaller than the one for the complex FFT.
In reality the real FFT could be a more optimized algorithm that is also more complex, so it might require less RAM / fewer CPU cycles to compute but could still require more (cheaper) operations to complete.
I have run up to a 4096-point complex FFT on these parts with no trouble.
You're using a size-limited linker; 32 KB is the common limit for evaluation versions of IAR and Keil.
You can either pay for the full version, or build your code with ARM GCC, which is free.
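As a side note, the posted code also passes an uninitialized pointer S to the init function. A minimal sketch of the usual CMSIS-DSP call sequence (assuming the same arm_rfft_fast_* API used in the question) allocates the instance itself and passes its address:
#include "arm_math.h"

#define TEST_LENGTH_SAMPLES 32

extern float32_t ffttestrealip[TEST_LENGTH_SAMPLES];   /* real input, as in the question */
static float32_t ffttestrealop[TEST_LENGTH_SAMPLES];   /* interleaved complex output */

int32_t main(void)
{
    arm_rfft_fast_instance_f32 S;                       /* instance lives on the stack */
    arm_rfft_fast_init_f32(&S, TEST_LENGTH_SAMPLES);    /* initialize for a 32-point real FFT */
    arm_rfft_fast_f32(&S, ffttestrealip, ffttestrealop, 0 /* forward transform */);
    while (1);
}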

CUDA performance: branching and shared memory

I wish to ask two questions on performance. I have been unable to create simple code to illustrate them.
Question 1: How expensive is non-divergent branching? In my code it seems to cost even more than the equivalent of 4 non-FMA FLOPs. Note that I am speaking of the BRA PTX instruction, where the predicate is already calculated.
Question 2: I have been reading a lot about the performance of shared memory, and some articles (such as a Dr. Dobb's article) even state that it can be as fast as registers (as long as it is accessed well). In my code all threads within the warps of a block access the same shared variable. I believe shared memory is accessed in broadcast mode in this case, isn't it? Should it reach the performance of registers this way? Are there any special things that should be considered to make that work?
EDIT: I have been able to construct some simple code that gives more insight into my query.
Here it is
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;
__global__ void test()
{
    __shared__ int t[1024];
    int v = t[0];
    bool b = (v == -1);
    bool c = (v == -2);
    int myValue = 0;
    for (int i = 0; i < 800; i++)
    {
#if 1
        v = i;
#else
        v = t[i];
#endif
#if 0
        if (b) {
            printf("abs");
        }
#endif
        if (c)
        {
            printf("IT HAPPENED");
            v = 8;
        }
        myValue += v;
    }
    if (myValue == 1000)
        printf("IT HAPPENED");
}
int main(int argc, char *argv[])
{
    cudaEvent_t event_start, event_stop;
    float timestamp;
    float4 *data;
    // Initialise
    cudaDeviceReset();
    cudaSetDevice(0);
    dim3 threadsPerBlock;
    dim3 blocks;
    threadsPerBlock.x = 32;
    threadsPerBlock.y = 32;
    threadsPerBlock.z = 1;
    blocks.x = 1;
    blocks.y = 1000;
    blocks.z = 1;
    cudaEventCreate(&event_start);
    cudaEventCreate(&event_stop);
    cudaEventRecord(event_start, 0);
    test<<<blocks, threadsPerBlock, 0>>>();
    cudaEventRecord(event_stop, 0);
    cudaEventSynchronize(event_stop);
    cudaEventElapsedTime(&timestamp, event_start, event_stop);
    printf("Calculated in %f", timestamp);
}
I am running this code on a GTX680.
Now the results are as follows ..
If run as is, it takes 5.44 ms.
If I change the first #if conditional to 0 (which enables reading from shared memory), it takes 6.02 ms. Not much more, but still not good enough for me.
If I enable the second #if conditional (which inserts a branch that will never evaluate to true), it runs in 9.647040 ms. The performance reduction is very big. What is the cause and what can be done?
I have also changed the code slightly to make further checks with shared memory.
Instead of
__shared__ int t[1024]
I did
__shared__ int2 t[1024]
and wherever I access t[] I just access t[].x. I got a further drop in performance to 10 ms (another 400 microseconds). Why should this happen?
Regards
Daniel
Have you determined whether your kernel is compute bound or memory bound? Your first question would be most relevant if your kernel is compute bound, while the second would be most relevant if your kernel is memory bound. You might be getting results that are confusing or hard to reproduce if you're assuming one while it is the other.
(1) I don't think the cost of a branch has been published. You might be left to determining that experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
(2) You're right that when you access a single 32-bit value in shared memory from all the threads in a warp, the value is broadcast. But my guess would be that accessing a single value from all threads would have the same cost as accessing any combination of values as long as you don't incur any bank conflicts. So you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published. It is short enough that it is normally easily hidden.
You need to keep in mind that the compiler optimizes heavily. So if you comment out the branch, you also eliminate the evaluation of the conditional, whether or not you leave it in the source code. Thus a difference of four instructions seems very plausible for your example:
load -1,
compare v to it (and store result in b),
test b,
branch,
although I have not compiled your example and looked at the code (which is what you should do: run cuobjdump -sass on your binaries and look at the actual differences in machine code).
Using only the .x component of an int2 changes the layout in shared memory so that you go from bank-conflict-free access to a 2-way bank conflict, which causes the slight further slowdown in your example. IIRC the latency of a shared memory access is on the order of 30 cycles, which is usually easily hidden by other threads (as Roger has already mentioned).
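A small sketch of why the layout change matters (not from the question; assuming 32 banks of 4 bytes, as on Fermi/Kepler):
__global__ void bankConflictDemo(int *out)
{
    // int array: consecutive threads read consecutive 4-byte banks, so no conflict.
    __shared__ int  t [1024];
    // int2 array: element k starts at byte offset 8*k, so reading only .x makes
    // consecutive threads hit every other bank; threads i and i+16 map to the
    // same bank, i.e. a 2-way bank conflict.
    __shared__ int2 t2[1024];

    t [threadIdx.x]   = threadIdx.x;
    t2[threadIdx.x].x = threadIdx.x;
    __syncthreads();

    int a = t [threadIdx.x];      // bank (threadIdx.x % 32)
    int b = t2[threadIdx.x].x;    // bank ((2*threadIdx.x) % 32)
    out[threadIdx.x] = a + b;     // keep the loads from being optimized away
}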

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But when l_SrcIntegral is defined as a local array (as it appears in the commented-out line) instead of being allocated with malloc/free, it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
    int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
    if (l_ThreadIndex >= d_MaxParallelRows)
        return;
    int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
    if (l_Row <= d_MaxTreatedRow) {
        //float l_SrcIntegral[1700];
        float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
        for (int x = 185; x < 1407; x++) {
            for (int i = 0; i < 1700; i++)
                l_SrcIntegral[i] = i;
        }
        free(l_SrcIntegral);
    }
}
int main()
{
    cudaError_t cudaStatus;
    cudaStatus = cudaSetDevice(0);
    int l_ThreadsPerBlock = k_ThreadsPerBlock;
    int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
    int l_FirstRow = d_MinTreatedRow;
    while (l_FirstRow <= d_MaxTreatedRow) {
        printf("CUDA: FirstRow=%d\n", l_FirstRow);
        fflush(stdout);
        myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
        cudaDeviceSynchronize();
        l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
    }
    printf("CUDA: Done\n");
    return 0;
}
1.
As @aland said, you may even encounter worse performance when calculating just one row in each kernel call.
You have to think about processing the whole input if you want to use the power of massively parallel processing.
Why start multiple kernel launches with just 320 threads, each thread calculating just one row?
How about using as many blocks as you have rows and letting the threads of each block process one row?
(320 threads per block is not a good choice, check out how to reach better occupancy)
2.
If fast resources such as registers and shared memory are not enough, you have to use a tile approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in each column vector of an arbitrarily sized matrix.
Each block processes one column; in a tile loop, the threads of that block load their elements into a shared memory array. When finished, they calculate the sum using a parallel reduction, then start the next iteration.
At the end, each block has calculated the sum of its column vector.
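A minimal sketch of that idea (hypothetical kernel, assuming a row-major height x width float matrix; each block owns one column and loops over it in tiles of TILE elements, reducing each tile in shared memory):
#define TILE 256   // threads per block, and tile height; a power of two

__global__ void columnSums(const float *mat, float *sums, int height, int width)
{
    __shared__ float tile[TILE];
    int col = blockIdx.x;                         // this block's column
    float total = 0.f;

    for (int base = 0; base < height; base += TILE) {
        int row = base + threadIdx.x;
        // Range check: the last tile may run past the bottom of the matrix.
        tile[threadIdx.x] = (row < height) ? mat[row * width + col] : 0.f;
        __syncthreads();

        // Parallel reduction within the tile.
        for (int s = TILE / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            total += tile[0];
        __syncthreads();                          // tile can safely be reused next iteration
    }
    if (threadIdx.x == 0)
        sums[col] = total;
}

// Launch: one block per column, TILE threads per block.
// columnSums<<<width, TILE>>>(d_mat, d_sums, height, width);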
You can still use dynamically sized arrays in shared memory. Just pass a third argument in the <<<...>>> of the kernel call; that is the size of your shared memory per block, in bytes.
Once you're there, just bring all relevant data into your shared array, loading one or several elements per thread (whatever it takes to keep accesses coalesced). Sync the threads after the data has been brought in (only if you need to stop race conditions, i.e. to make sure the whole array is in shared memory before any computation is done) and you're good to go.
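A short sketch of that launch mechanism (hypothetical kernel; the row is assumed to fit within the shared memory available per block):
__global__ void processRow(const float *in, float *out, int rowLen)
{
    extern __shared__ float s_row[];                 // length set by the third launch parameter
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x)
        s_row[i] = in[blockIdx.x * rowLen + i];      // coalesced load of this block's row
    __syncthreads();                                 // whole row is now in shared memory
    // ... compute on s_row ...
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x)
        out[blockIdx.x * rowLen + i] = s_row[i];
}

// Third <<<>>> argument = bytes of dynamic shared memory per block.
// processRow<<<numRows, 256, rowLen * sizeof(float)>>>(d_in, d_out, rowLen);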
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but still, it could be done tessellating through blocks/threads and not nested for loops (which are VERY bad for performance!) I hope you're running your sample code using just 1 block and 1 thread, otherwise it wouldn't make much sense.

CUDA - what is this loop doing

Hey
I've seen this example kernel on a website:
__global__ void loop1(int N, float alpha, float *x, float *y) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    for (i = i0; i < N; i += blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}
It computes the equivalent of this C loop:
for (i = 0; i < N; i++) {
    y[i] = alpha*x[i] + y[i];
}
Surely the for loop inside the kernel isn't necessary, and you can just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether?
I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>, so presumably gridDim.x = 1.
You need the for loop in the kernel if your vector has more entries than you have started threads. If it's possible, it is of course more efficient to start enough threads.
Interesting kernel. The loop inside the kernel is necessary when N is greater than the total number of threads, which here is 16384 (blockDim.x*gridDim.x), but I think it's not good practice to do it (the whole point of CUDA is to use the SIMT concept). According to the CUDA Programming Guide you can have at most 65535 thread blocks per grid dimension with one kernel. Furthermore, starting from Compute Capability 2.x (Fermi) you can have at most 1024 threads per block (512 before Fermi). Also, you can (if possible) separate the code into multiple (sequential) kernels.
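For contrast, a small sketch of both options (hypothetical host-side code around the question's saxpy-style kernel):
// Option A: grid-stride loop with a fixed launch size; works for any N.
// loop1<<<64, 256>>>(N, alpha, d_x, d_y);      // 16384 threads cover N in strides

// Option B: no loop in the kernel; launch exactly enough threads to cover N.
__global__ void loop1_noloop(int N, float alpha, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N)                                   // guard the tail block
        y[i] = alpha*x[i] + y[i];
}

// Host side: round the block count up so at least N threads exist.
// int threads = 256;
// int blocks  = (N + threads - 1) / threads;   // must stay within the grid-size limit
// loop1_noloop<<<blocks, threads>>>(N, alpha, d_x, d_y);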
Much as we would like to believe that CUDA GPUs have infinite execution resources, they do not, and authors of highly optimized code are finding that unrolled for loops, often with fixed numbers of blocks, give the best performance. Makes for painful coding, but optimized CPU code is also pretty painful.
BTW, a commenter mentioned that this code would have coalescing problems, and I don't see why. If the base addresses are correctly aligned (to 64 B, since those are floats), all of the memory transactions by this code will be coalesced, provided the threads/block count is also divisible by 64.