cublas: same input and output matrix for better performance?

CUBLAS seems to be an efficient package for a single large matrix multiplication, addition, and so on. But in a common setting most computations are dependent: the next step relies on the result of the previous step.
This causes a problem. Because the output matrix has to be different from the input matrices in a CUBLAS routine (the input matrices are const), much time is spent on malloc and on copying data from device to device for these temporary matrices.
So is it possible to do something like multiply(A, A, B), where the first argument is the output matrix and the second and third are the input matrices, to avoid the extra memory manipulation time? Or is there a better workaround?
Thanks a lot !

No, it is not possible to perform in-place operations like gemm using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work).
Having said that, this comment:
.... much time is spent on malloc and on copying data from device to device for these temporary matrices.
makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device to device memory copies when using such allocations. This:
// If A, B & C are pointers to allocations in device memory
// compute C = A*B and copy result to A
multiply(C, A, B);
cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
// now A = A*B
can be replaced by
multiply(C, A, B);
float * tmp = A; A = C; C = tmp;
i.e., you only need to exchange pointers on the host to perform the equivalent of a device to device memory copy, but with no GPU time cost. This can't be used in every situation (for example, there are some in-place block operations which might still require an explicit memory transfer), but in most cases an explicit device to device memory transfer can be avoided.
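The pointer exchange above can be sketched as a host-only C++ example. This is a minimal illustration, not CUBLAS code: multiply here is a hypothetical naive stand-in for a gemm call, and repeated_product is a made-up helper name. The point is only that one scratch buffer is allocated up front and the two pointers trade roles after every product.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a gemm call: C = A * B for n x n row-major matrices.
// As with a BLAS gemm, the output must not alias either input.
void multiply(float* C, const float* A, const float* B, std::size_t n) {
    assert(C != A && C != B);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

// Computes A * B^steps with a single scratch buffer allocated once.
// After each product the pointers are swapped; no matrix data is copied.
std::vector<float> repeated_product(std::vector<float> a,
                                    const std::vector<float>& b,
                                    std::size_t n, int steps) {
    std::vector<float> scratch(n * n);
    float* cur = a.data();        // holds the current value of "A"
    float* out = scratch.data();  // receives the next product
    for (int s = 0; s < steps; ++s) {
        multiply(out, cur, b.data(), n);
        std::swap(cur, out);      // "cur" now names the freshly computed product
    }
    return std::vector<float>(cur, cur + n * n);
}
```

With CUBLAS the same pattern would apply to device pointers: the swap still happens on the host, and only the pointer values change.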
If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.

You could preallocate a buffer matrix and copy the input matrix A to the buffer before the mat-mul operation.
Memcopy(buffer, A);
Multiply(A, buffer, B);
By reusing the buffer, you don't need to allocate it every time, and the overhead is only one memory copy per mat-mul. When the matrix is large enough, that overhead is a very small portion of the total time and can be ignored, because the copy touches O(n^2) elements while the dense multiplication does O(n^3) work.

Related

Is it safe to use cudaHostRegister on only part of an allocation?

I have a C++ class container that allocates, let's say, 1GB of memory of plain objects (e.g. built-ins).
I need to copy part of the object to the GPU.
To accelerate and simplify the transfer I want to register the CPU memory as non-pageable ("pinning"), e.g. with cudaHostRegister(void*, size, ...) before copying.
(This seems to be a good way to copy further subsets of the memory with minimal logic. For example if plain cudaMemcpy is not enough.)
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB?
I may want to register only part for efficiency, but also because deep down in the call trace I might have lost track of the original allocated pointer.
In other words, can the pointer argument to cudaHostRegister be something other than an allocated pointer? In particular, can it be an arithmetic result derived from allocated memory, but still within the allocated range?
It seems to work but I don't understand if, in general, "pinning" part of an allocation can corrupt somehow the allocated block.
UPDATE: My concern is that allocation is actually mentioned in the documentation for the cudaHostRegister flag options:
cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system with no unified virtual addressing, the memory will be neither mapped nor portable.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a third-party PCIe device, and it will marked as non cache-coherent and contiguous.
cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms without cudaDevAttrPageableMemoryAccessUsesHostPageTables, this flag is required in order to register memory mapped to the CPU as read-only. Support for the use of this flag can be queried from the device attribute cudaDeviceAttrReadOnlyHostRegisterSupported. Using this flag with a current context associated with a device that does not have this attribute set will cause cudaHostRegister to error with cudaErrorNotSupported.
This is a rule-of-thumb answer rather than a proper one:
When the CUDA documentation does not guarantee that something works, you'll need to assume it doesn't. Because if it does happen to work - for you, right now, on the system you have - it might stop working in the future; or on another system; or in another usage scenario.
More specifically - memory pinning happens at page resolution, so unless the part you want to pin starts and ends on a physical page boundary, the CUDA driver will need to pin some more memory before and after the region you asked for - which it could do, but it's going an extra mile to accommodate you, and I doubt that would happen without documentation.
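To make the page-resolution point concrete, here is a small host-only sketch of the arithmetic involved. It computes the smallest page-aligned range that covers an arbitrary sub-range. The names PageRange and enclosing_pages are made up for illustration, the page size is passed in as an assumption (4096 bytes is common but not universal), and nothing here claims to describe what the CUDA driver actually does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: a page-aligned address range.
struct PageRange {
    std::uintptr_t begin;  // first byte of the first page touched
    std::size_t size;      // total bytes covered, a multiple of the page size
};

// Returns the smallest page-aligned range covering [addr, addr + size).
// page_size is an assumption supplied by the caller and must be a power of two.
PageRange enclosing_pages(std::uintptr_t addr, std::size_t size,
                          std::size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(page_size - 1);
    const std::uintptr_t begin = addr & mask;                        // round down
    const std::uintptr_t end = (addr + size + page_size - 1) & mask; // round up
    return PageRange{begin, static_cast<std::size_t>(end - begin)};
}
```

For example, a 1000-byte region starting at address 8000 with 4096-byte pages straddles two pages, so the covering range is 8192 bytes starting at 4096.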
I also suggest you file a bug report via developer.nvidia.com, asking that they clarify this point in the documentation. My experience is that there's... something like a 50% chance they'll do something about such a bug report.
Finally - you could just try it: Write a program which copies to the GPU with and without the pinning of the part-of-the-region, and see whether there's a throughput difference.
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB.
While I agree that the documentation could be clearer, I think the answer to the question is 'Yes'.
Here's why: The alternative interpretation would be that only whole memory sections returned by, say, malloc should be allowed to be registered. However, this is unworkable, because malloc could, behind the scenes, have one big section allocated, and only give the user parts of it. So even if you (the user) were cudaHostRegistering those sections returned by malloc, they'd actually be fragments of some bigger previously allocated chunk of memory anyway.
By the way, Linux has a similar kernel call to lock memory called mlock. It accepts arbitrary memory ranges.
One of the other answers claimed (until this test was posted):
If you need to copy the part-of-the-object just once to the GPU - there's no use in using cudaHostRegister(), because it will likely itself copy the data, physically, elsewhere - so you won't be saving anything
But this is incorrect: registering is worth it, if the chunk of memory being copied is big enough, even if the copying is done only once. I'm seeing about a 2x speed-up with this code (comment out the line indicated), or about 50% if unregistering is also done between the timers.
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>

int main()
{
    std::size_t giga = 1024*1024*1024;
    std::vector<char> src(giga, 3);
    char* dst = 0;
    if(cudaMalloc((void**)&dst, giga)) return 1;
    cudaDeviceSynchronize();
    // steady_clock is monotonic, which suits interval timing
    auto t0 = std::chrono::steady_clock::now();
    // pin 1/8 of the buffer, starting in the middle of the allocation
    if(cudaHostRegister(src.data() + src.size()/2, giga/8, cudaHostRegisterDefault)) return 1; // comment out this line
    if(cudaMemcpy(dst, src.data() + src.size()/2, giga/8, cudaMemcpyHostToDevice)) return 1;
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    auto d = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << (d / 1e6) << " seconds" << std::endl;
    // un-register and free (the unregister call just returns an error if the register line was commented out)
    cudaHostUnregister(src.data() + src.size()/2);
    cudaFree(dst);
}

non-linear optimization on the GPU (CUDA) without data transfer latency

I am trying to perform a non-linear optimization problem entirely on the GPU. Computation of the objective function and data transfer from the GPU to CPU are the bottlenecks. To solve this, I want to
heavily parallelize computation of the objective and
perform the entire optimization on the GPU.
More specifically, the problem is as follows in pseudo-code:
x = x0 // initial guess of the vector of unknowns, typically of size ~10,000
for iteration = 1 : max_iter
    D = compute_search_direction(x)
    alpha = compute_step_along_direction(x)
    x = x + D * alpha // update
end for loop
The functions compute_search_direction(x) and compute_step_along_direction(x) both call the objective function f0(x) dozens of times per iteration. The objective function is a complicated CUDA kernel; basically it is a forward Bloch simulation (the set of equations that describes the dynamics of nuclear spins in a magnetic field). The outputs of f0(x) are F (the value of the objective function, a scalar) and DF (the vector of first derivatives, with the same size as x, i.e. ~10,000). On the GPU, f0(x) is really fast, but transferring x from the CPU to the GPU and then transferring F and DF back from the GPU to the CPU takes a while (~1 second total). Because the function is called dozens of times per iteration, this leads to a pretty slow overall optimization.
Ideally, I would want to have the entire pseudo code above on the GPU. The only solution I can think of now is recursive kernels. The pseudo code above would be the "outer kernel", launched with a number of threads = 1 and a number of blocks = 1 (i.e., this kernel is not really parallel...). This kernel would then call the objective function (i.e., the "inner kernel", this one massively parallel) every time it needs to evaluate the objective function and the vector of first derivatives. Since kernel launches are asynchronous, I can force the GPU to wait until the f0 inner kernel is fully evaluated to move to the next instruction of the outer kernel (using a synchronization point).
In a sense, this is really the same as regular CUDA programming where the CPU controls kernel launches for evaluation of the objective function f0, except the CPU is replaced by an outer kernel that is not parallelized (1 thread, 1 block). However, since everything is on the GPU, there is no data transfer latency anymore.
I am testing the idea now on a simple example to test feasibility. However, this seems quite cumbersome... My questions are:
Does this make any sense to anyone else?
Is there a more direct way to achieve the same result without the added complexity of nested kernels?
It seems you are mixing up "reducing memory transfer between GPU and CPU", and "having the entire code run on device (aka. on gpu)".
In order to reduce memory transfers, you do not need to have the entire code run on GPU.
You can copy your data to the GPU once, and then switch back and forth between GPU code and CPU code. As long as you don't try to access any GPU memory from your CPU code (and vice-versa), you should be fine.
Here is pseudo-code for a correct approach to what you want to do.
// CPU code
cudaMalloc(&x, ...); // allocate memory for x on the GPU
cudaMemcpy(x, x0, size, cudaMemcpyHostToDevice); // copy x0 to the freshly allocated array
cudaMalloc(&D, ...); // allocate D and alpha before the loop
cudaMalloc(&alpha, ...);
for iteration = 1 : max_iter
    compute_search_direction<<<...>>>(x, D) // call a kernel that does the computation and stores the result in D
    compute_step_along_direction<<<...>>>(x, alpha) // stores the step in alpha, also on the GPU
    combine_result<<<...>>>(x, D, alpha) // x = x + D * alpha
end for loop
// Eventually copy x back to the CPU, if need be
Hope it helps!

How to perform basic operations (+ - * /) on GPU and store the result on it

I have the following line of code. gamma is a CPU variable that I will later need to copy to the GPU; gamma_x and delta are also stored on the CPU. Is there any way to execute the line below and store its result directly on the GPU? Basically, I want to keep gamma, gamma_x and delta on the GPU and get the output of the line on the GPU as well. It would speed up the code that follows a lot.
I tried magma_dcopy, but so far I couldn't find a way to make it work because the output of magma_ddot is a CPU double.
gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;
The very short answer is no, you can't do this, or at least not if you use magma_ddot.
However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.
In theory you could do something like this:
// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));
for (int i=....) {
    // ...
    // magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
    // dotresult is already a device pointer, so pass it directly (not its address)
    cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult);
    cudaDeviceSynchronize();
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);
    // Now dotresult holds the magma_ddot result in device memory
    // ...
}
Note that this might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally, and how CUBLAS state and asynchronous operations are handled inside Magma is completely undocumented. Having said that, if you are careful, it should be OK.
To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.

CUDA: 2 threads from different warps but same block attempt to write into same SHARED memory position: dangerous?

Will this lead to inconsistencies in shared memory?
My kernel code looks like this (pseudocode):
__shared__ uint histogram[32][64];
uint threadLane = threadIdx.x % 32;
for (data){
histogram[threadLane][data]++;
}
Will this lead to collisions, given that, in a block with 64 threads, threads with id "x" and "(x + 32)" will very often write into the same position in the matrix?
This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.
It is dangerous. It leads to race conditions.
If you cannot guarantee that each thread within a block has unique write access to a location in shared memory, then you have a problem that you need to solve by synchronization.
Take a look at this paper for a correct and efficient way of using shared memory for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf
Note that there are plenty of libraries online that allow you to compute histograms in one line, Thrust for instance.
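To see why an unsynchronized histogram[threadLane][data]++ comes out too low, it helps to write the increment as its three steps (read, add, write) and interleave two threads by hand. The sketch below is plain sequential C++ that only simulates one possible interleaving of threads x and x + 32; the function names are made up for illustration and no real threads are involved.

```cpp
#include <cassert>
#include <cstdint>

// Simulates two threads doing bin++ on the same bin with an unlucky interleaving:
// both read the old value before either one writes its result back.
std::uint32_t racy_two_increments(std::uint32_t bin) {
    std::uint32_t seen_by_x = bin;    // thread x reads the bin
    std::uint32_t seen_by_x32 = bin;  // thread x + 32 reads the same old value
    bin = seen_by_x + 1;              // thread x writes old + 1
    bin = seen_by_x32 + 1;            // thread x + 32 overwrites with old + 1
    return bin;                       // one of the two increments is lost
}

// The same two increments when each read-modify-write completes before the next starts.
std::uint32_t serialized_two_increments(std::uint32_t bin) {
    bin = bin + 1;  // thread x, whole increment
    bin = bin + 1;  // thread x + 32, whole increment
    return bin;
}
```

Starting from 5, the racy version ends at 6 while the serialized version ends at 7. In a CUDA kernel, atomicAdd on the shared memory bin makes each read-modify-write indivisible, which is one way to get the serialized behaviour.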

Correct Effective Bandwidth calculations of y = Ax+b?

I would like to calculate the bandwidth of
the matrix vector multiplication and addition (assume A is M by N):
y = A*x + b
But I am a bit confused about which reads and writes count toward the number of bytes moved to and from global memory.
Is the effective bandwidth based on:
bytesReadWrite = M*N (for reading A) + N (for reading x) + M (for reading b) + M (for writing y)
or is it
bytesReadWrite = M*N (for reading A) + M*N (for reading x) + M (for reading b) + M (for writing y)
M*N for x because we basically read the whole of x once for each row (even if we work with shared memory, we eventually read the whole x vector once per row).
Does somebody have some good advice on which is the right choice? I don't really get this...
I tend to use the first calculation, but why? Does it make sense?
Thanks a lot!!!
It's almost certainly none of the above. In terms of memory bandwidth, modern processors will load all of the items to be operated on once into Level 2 cache, and operate on them from there, after which the results will be written back out to memory for any items changed. Effectively, your bandwidth is just the sum total size for all of the elements involved.
Note: even this is an oversimplification, because it doesn't take into account the effects of streaming, not to mention memory pagination. For streaming, it's not uncommon to have a single matrix operate on a large set of data (3D graphics calculations, for example); in that case, the matrix gets loaded to L2 cache (and presumably for reasonably optimized code into the registers from there) once, and then the vectors get loaded through.
Once again, the model isn't really complete without an understanding of modern memory paging techniques; there's a gigantic difference in the above if the matrix and the vectors are stored in different memory pages, for example; not to mention serious optimizations in packing vectors together for "streaming" into L2 cache. And even then, that's assuming a CPU model of performing the matrix math; bringing a GPU into the picture changes things once again very dramatically.
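If "the sum total size for all of the elements involved" is taken at face value, each element of A, x, b and y is counted once, which is the question's first formula multiplied by the element size in bytes. A minimal sketch of that arithmetic follows. The function names are hypothetical, the element size is an assumption supplied by the caller (4 for float, 8 for double), and the elapsed time would have to come from your own measurement.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved for y = A*x + b when every element is counted exactly once:
// M*N for A, N for x, M for b (reads) and M for y (write).
double gemv_bytes(std::size_t M, std::size_t N, std::size_t elem_size) {
    const double elements = static_cast<double>(M) * static_cast<double>(N)
                          + static_cast<double>(N)
                          + 2.0 * static_cast<double>(M);
    return elements * static_cast<double>(elem_size);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured elapsed time.
double effective_bandwidth_gbs(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / 1e9 / seconds;
}
```

For M = N = 1000 with 4-byte floats this counts 4,012,000 bytes; if the kernel took 1 ms, that is about 4.012 GB/s. The figure describes how much useful data the algorithm needed per second, not how many transactions the hardware actually issued.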