Fastest way to access device vector elements directly on host - cuda

I refer you to the following page: http://code.google.com/p/thrust/wiki/QuickStartGuide#Vectors. Please see the second paragraph, where it says:
Also note that individual elements of a device_vector can be accessed
using the standard bracket notation. However, because each of these
accesses requires a call to cudaMemcpy, they should be used sparingly.
We'll look at some more efficient techniques later.
I searched all over the document but could not find these more efficient techniques. Does anyone know the fastest way to do this, i.e. how to access a device vector/device pointer from the host as quickly as possible?

The "more efficient techniques" the guide alludes to are the Thrust algorithms. It's more efficient to access (or copy across the PCI-E bus) millions of elements at once than it is to access a single element because the fixed cost of CPU/GPU communication is amortized.
There's no faster way to copy data from the GPU to the CPU than by calling cudaMemcpy, because it is the most primitive way for a CUDA programmer to implement the task.

If you have a device_vector which you need to do more processing on, try to keep the data on the device and process it with Thrust algorithms or your own kernels. If you need to read only a few values from the device_vector, just access the values directly with bracket notation. If you need to access more than a few values, copy the device_vector over to a host_vector and read the values from there.
thrust::device_vector<int> D;
...
thrust::host_vector<int> H = D;
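Here is a small sketch contrasting the two access patterns described above (the sizes and values are arbitrary, just for illustration):
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main()
{
    thrust::device_vector<int> D(1000, 42);

    // Fine for a handful of values: each bracket read is a separate cudaMemcpy.
    int first = D[0];

    // For many values, do one bulk copy and then read from host memory.
    thrust::host_vector<int> H = D;
    int sum = 0;
    for (size_t i = 0; i < H.size(); ++i)
        sum += H[i];

    std::cout << first << " " << sum << std::endl;
    return 0;
}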

Related

Does thrust::device_vector.push_back() cause a call to memcpy?

Summary
I'd like some clarification on how the thrust::device_vector works.
AFAIK, writing to an indexed location such as device_vector[i] = 7 is implemented by the host, and therefore causes a call to memcpy. Does device_vector.push_back(7) also call memcpy?
Background
I'm working on a project comparing stock prices. The prices are stored in two vectors. I iterate over the two vectors, and when there's a change in their prices relative to each other, I write that change into a new vector. So I never know how long the resulting vector is going to be. On the CPU the natural way to do this is with push_back, but I don't want to use push_back on the GPU vector if it's going to call memcpy every time.
Is there a more efficient way to build a vector piece by piece on the GPU?
Research
I've looked at this question, but it (and others) focus on the most efficient way to access elements from the host. I want to build up a vector on the GPU.
Thank you.
Does device_vector.push_back(7) also call memcpy?
No. It does, however, result in a kernel launch per call.
Is there a more efficient way to build a vector piece by piece on the GPU?
Yes.
Build it (or large segments of it) in host memory first, then copy or insert to memory on the device in a single operation. You will greatly reduce latency and increase PCI-e bus utilization by doing so.
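A hypothetical sketch of that approach for the price-change use case (the comparison logic below is a placeholder, not the asker's actual rule):
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

thrust::device_vector<float> buildChanges(const thrust::host_vector<float> &priceA,
                                          const thrust::host_vector<float> &priceB)
{
    thrust::host_vector<float> changes;               // cheap to grow in host memory
    for (size_t i = 1; i < priceA.size(); ++i) {
        float spread     = priceA[i]     - priceB[i];
        float prevSpread = priceA[i - 1] - priceB[i - 1];
        if (spread != prevSpread)
            changes.push_back(spread - prevSpread);   // record the relative change
    }
    return thrust::device_vector<float>(changes);     // one bulk host-to-device transfer
}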

Global device variables in CUDA: bad practice?

I am designing a library that has a large contingent of CUDA kernels to perform parallel computations. All the kernels will be acting on a common object, say a computational grid, which is defined using C++ style objects. The computational domain doesn't necessarily need to be accessed from the host side, so creating it on the device side and keeping it there makes sense for now. I'm wondering if the following is considered "good practice":
Suppose my computational grid class is called Domain. First I define a global device-side variable to store the computational domain:
__device__ Domain* D;
Then I initialize the computational domain using a CUDA kernel:
__global__ void initDomain(paramType P){
    D = new Domain(P);
}
Then, I perform computations using this domain with other kernels:
__global__ void doComputation(double *x, double *y){
    D->doThing(x,y);
    //...
}
If my domain remains fixed (i.e. kernels don't modify the domain once it's created), is this OK? Is there a better way? I initially tried creating the Domain object on the host side and copying it over to the device, but this turned out to be a hassle because Domain is a relatively complex type that makes it a pain to copy over using e.g. cudaMemcpy or even thrust::device_new (at least, I couldn't get it to work nicely).
Yes, it's OK.
Maybe you can improve performance by using
__constant__
With this keyword, your object will be available in all your kernels in very fast, cached memory.
In order to copy your object, you must use cudaMemcpyToSymbol. Please note there are some restrictions: your object will be read-only in your device code, and it must not have a default constructor.
You can find more information here.
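A minimal sketch of that suggestion, assuming Domain can be reduced to a plain, trivially copyable struct (the members and the doThing body below are placeholders, not the question's actual class):
#include <cuda_runtime.h>

struct Domain {
    int    nx, ny;
    double dx, dy;
    __device__ double doThing(double x, double y) const { return x * dx + y * dy; }
};

__constant__ Domain D;   // lives in constant memory: read-only in kernels, cached, fast

__global__ void doComputation(const double *x, double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = D.doThing(x[i], y[i]);
}

// Host side: build the object normally, then copy it into the __constant__ symbol once.
void setupDomain(const Domain &hostDomain)
{
    cudaMemcpyToSymbol(D, &hostDomain, sizeof(Domain));
}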
If your object is complex and hard to copy, maybe you can look at Unified Memory; then you can just pass your variable by value to your kernel.
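A loose sketch of what that could look like, assuming Domain keeps its bulk data behind pointers allocated with cudaMallocManaged (so they are valid on both host and device) and is itself trivially copyable, so it can be passed to a kernel by value. The members shown are placeholders:
#include <cuda_runtime.h>

struct Domain {
    double *field;   // managed allocation, dereferenceable on host and device
    int     n;
    __host__ __device__ void doThing(double *x, double *y) const { /* placeholder */ }
};

Domain makeDomain(int n)
{
    Domain d;
    d.n = n;
    cudaMallocManaged(&d.field, n * sizeof(double));   // shared host/device allocation
    return d;
}

__global__ void doComputation(double *x, double *y, Domain D)   // Domain passed by value
{
    D.doThing(x, y);
}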

send custom datatype/class to GPU

All tutorials and introductory material for GPGPU/CUDA tend to use flat arrays; however, I'm trying to port a piece of code that uses somewhat more sophisticated objects than a flat array.
I have a 3-dimensional std::vector whose data I want to have on the GPU. Which strategies are there to get this on the GPU?
I can think of one for now:
copy the vector's data on the host to a more simplistic structure like an array. However, this seems wasteful because 1) I have to copy the data and then send it to the GPU; and 2) I have to allocate an array whose dimensions are the maximum element count of any of the sub-vectors. For example, using a 2D vector,
imagine {{1, 2, 3, 4, .. 1000}, {1}}: in host memory this is roughly ~1001 allocated items, whereas if I were to copy it into a rectangular two-dimensional array, I would have to allocate 2*1000 elements.
Are there better strategies?
There are many methodologies for refactoring data to suit GPU computation, one of the challenges being copying data between device and host, the other challenge being representation of data (and also algorithm design) on the GPU to yield efficient use of memory bandwidth. I'll highlight 3 general approaches, focusing on ease of copying data between host and device.
1. Since you mention std::vector, you might take a look at thrust, which has vector container representations that are compatible with GPU computing. However, thrust won't conveniently handle vectors of vectors AFAIK, which is what I interpret your "3D std::vector" nomenclature to mean. So some (non-trivial) refactoring will still be involved. And thrust still doesn't let you use a vector directly in ordinary CUDA device code, although the data it contains is usable.
2. You could manually refactor the vector of vectors into flat (1D) arrays (a minimal layout sketch follows this list). You'll need one array for the data elements (length = total number of elements contained in your "3D" std::vector), plus one or more additional (1D) vectors to store the start (and implicitly the end) points of each individual sub-vector. Yes, folks will say this is inefficient because it involves indirection or pointer chasing, but so does the use of vector containers on the host. I would suggest that getting your algorithm working first is more important than worrying about one level of indirection in some aspects of your data access.
3. As you point out, the "deep-copy" issue with CUDA can be a tedious one. It's pretty new, but you might want to take a look at Unified Memory, which is available on 64-bit Windows and Linux platforms under CUDA 6, with a Kepler (cc 3.0) or newer GPU. With C++ especially, UM can be very powerful, because we can extend operators like new under the hood and provide almost seamless usage of UM for shared host/device allocations.
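A minimal sketch of approach 2, shown for a vector of vectors of int (one level of nesting, for brevity); the function name and signature are illustrative:
#include <vector>
#include <cuda_runtime.h>

void flattenAndCopy(const std::vector<std::vector<int> > &vv,
                    int **d_data, int **d_offsets)
{
    std::vector<int> data;      // all elements, packed back to back
    std::vector<int> offsets;   // offsets[i] = index where sub-vector i starts

    for (size_t i = 0; i < vv.size(); ++i) {
        offsets.push_back((int)data.size());
        data.insert(data.end(), vv[i].begin(), vv[i].end());
    }
    offsets.push_back((int)data.size());   // final entry marks the total length

    cudaMalloc((void **)d_data,    data.size()    * sizeof(int));
    cudaMalloc((void **)d_offsets, offsets.size() * sizeof(int));
    cudaMemcpy(*d_data,    data.data(),    data.size()    * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(*d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
}

// In device code, sub-vector i then spans d_data[d_offsets[i]] .. d_data[d_offsets[i+1] - 1].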

Using FFTW compatibility mode in cuFFT

I have a full project created using FFTW. I want to transition to using cuFFT. I understand that cuFFT has a "compatibility mode". But how exactly does this work? The cuFFT manual says:
After an application is working using the FFTW3 interface, users may
want to modify their code to move data to and from the GPU and use the
routines documented in the FFTW Conversion Guide for the best
performance.
Does this mean I actually need to change my individual function calls? For example, call
cufftPlan1d() instead of fftw_plan_dft_1d().
Do I also have to change my data types?
fftw_complex *inputData; // fftw data storage gets replaced..
cufftDoubleComplex *inputData; // ... by cufft data storage?
fftw_plan forwardFFT; // fftw plan gets replaced...
cufftHandle forwardFFT; // ... by cufft plan?
If I'm going to have to rewrite all of my code, what is the point of cufftSetCompatibilityMode()?
Probably what you want is the cuFFTW interface to cuFFT. I suggest you read this documentation, as it probably is close to what you have in mind. This will allow you to use cuFFT in an FFTW application with a minimum amount of changes. As indicated in the documentation, there should only be two steps required:
It is recommended that you replace the include file fftw3.h with cufftw.h
Instead of linking with the double/single-precision libraries such as fftw3/fftw3f, link with both the cuFFT and cuFFTW libraries (a sketch of both steps follows this list)
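An illustrative sketch of those two steps (the file name and build lines are examples, not taken from your project):
// #include <fftw3.h>   /* before: FFTW header */
#include <cufftw.h>     /* after: cuFFTW header; the fftw_* calls themselves stay unchanged */

int main(void)
{
    fftw_complex in[8] = {0}, out[8];
    fftw_plan p = fftw_plan_dft_1d(8, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);        /* now runs through cuFFT under the hood */
    fftw_destroy_plan(p);
    return 0;
}

/* Step 2: change the link line, e.g.
       nvcc myapp.c -lcufftw -lcufft
   instead of
       gcc  myapp.c -lfftw3   */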
Regarding the doc item you excerpted, that step (moving the data explicitly) is not required if you're just using the cuFFTW compatibility interface. However, you may not achieve maximum performance this way. If you want to achieve maximum performance, you may need to use cuFFT natively, for example so that you can explicitly manage data movement. Whether or not this is important will depend on the specific structure of your application (how many FFTs you are doing, and whether any data is shared amongst multiple FFTs, for example). If you intend to use cuFFT natively, then the following comments apply (a minimal native sketch follows these points):
Yes, you need to change your individual function calls. They must line up with function names in the API, associated header files, and library. The fftw_ function names are not in the cuFFT library.
You can inspect your data types and should discover that for the basic data types like float, double, complex, etc. they should be layout-compatible between cuFFT and FFTW. Personally I would recommend changing your data types to cuFFT data types, but there should be no functional or performance difference at this time.
Although you don't mention it, cuFFT will also require you to move the data between CPU/Host and GPU, a concept that is not relevant for FFTW.
Regarding cufftSetCompatibilityMode, the function documentation and discussion of FFTW compatibility mode is pretty clear on its purpose. It has to do with overall data layout, especially padding of data for FFTW.
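A hypothetical, minimal native-cuFFT version of a double-precision 1D transform, illustrating the points above (names such as forwardFFT and h_signal are placeholders):
#include <cufft.h>
#include <cuda_runtime.h>

void forwardFFT(const cufftDoubleComplex *h_signal, cufftDoubleComplex *h_result, int N)
{
    cufftDoubleComplex *d_signal;
    cudaMalloc((void **)&d_signal, sizeof(cufftDoubleComplex) * N);

    /* Explicit host-to-device movement: the concept that has no FFTW equivalent. */
    cudaMemcpy(d_signal, h_signal, sizeof(cufftDoubleComplex) * N, cudaMemcpyHostToDevice);

    cufftHandle plan;                                       /* replaces fftw_plan        */
    cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);                    /* replaces fftw_plan_dft_1d */
    cufftExecZ2Z(plan, d_signal, d_signal, CUFFT_FORWARD);  /* replaces fftw_execute     */

    cudaMemcpy(h_result, d_signal, sizeof(cufftDoubleComplex) * N, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_signal);
}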
Check this link out: https://developer.nvidia.com/blog/cudacasts-episode-8-accelerate-fftw-apps-cufft-55/
It says that all we need to do is change the linking.

Difference between kernels construct and parallel construct

I have studied a lot of articles and the OpenACC manual, but I still don't understand the main difference between these two constructs.
The kernels directive is the more general case, and probably the one you might think of if you've written GPU (e.g. CUDA) kernels before. kernels simply directs the compiler to work on a piece of code and produce an arbitrary number of "kernels", of arbitrary "dimensions", to be executed in sequence, in order to parallelize/offload a particular section of code to the accelerator. The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator, for example by specifying specific dimensions of parallelization. For example, the number of workers and gangs would normally be constant as part of the parallel directive (since only one underlying "kernel" is usually implied), but perhaps not on the kernels directive (since it may translate to multiple underlying "kernels").
A good treatment of this specific question is contained in this PGI article.
Quoting from the article summary:
"The OpenACC kernels and parallel constructs each try to solve the same problem, identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate. "
OpenACC directives and GPU kernels are just two ways of representing the same thing -- a section of code that can run in parallel.
OpenACC may be best when retrofitting an existing app to take advantage of a GPU and/or when it is desirable to let the compiler handle more details related to issues such as memory management. This can make it faster to write an app, with a potential cost in performance.
Kernels may be best when writing a GPU app from scratch and/or when more fine grained control is desired. This can make the app take longer to write, but may increase performance.
I think that people new to GPUs may be tempted to go with OpenACC because it looks more familiar. But I think it's actually better to go the other way: start by writing kernels, and then potentially move to OpenACC to save time in some projects. The reason is that OpenACC is a leaky abstraction. While OpenACC may make it look as if the GPU details are abstracted away, they are still there. Using OpenACC to write GPU code without understanding what is happening in the background is likely to be frustrating, with odd error messages when attempting to compile, and to result in an app that has low performance.
Parallel Construct
Defines the region of the program that should be compiled for parallel execution on the accelerator device.
The parallel loop directive is an assertion by the programmer that it is both safe and desirable to parallelize the affected loop. This relies on the programmer to have correctly identified parallelism in the code and remove anything in the code that may be unsafe to parallelize. If the programmer asserts incorrectly that the loop may be parallelized then the resulting application may produce incorrect results.
The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator. So it does not rely heavily on the compiler’s ability to automatically parallelize the code.
When parallel loop is used on two subsequent loops that access the same data a compiler may or may not copy the data back and forth between the host and the device between the two loops.
More experienced parallel programmers, who may have already identified parallel loops within their code, will likely find the parallel loop approach more desirable.
e.g. refer to:
#pragma acc parallel
{
    #pragma acc loop
    for (i = 0; i < n; i++)
        a[i] = 3.0f * (float)(i + 1);

    #pragma acc loop
    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}
- Generates one kernel
- There is no barrier between the two loops: the second loop may start before the first loop ends. (This is different from OpenMP.)
Kernels Construct
Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.
An important thing to note about the kernels construct is that the compiler will analyze the code and only parallelize when it is certain that it is safe to do so. In some cases, the compiler may not have enough information at compile time to determine whether a loop is safe to parallelize, in which case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely parallel.
The kernels construct gives the compiler maximum leeway to parallelize and optimize the code how it sees fit for the target accelerator but also relies most heavily on the compiler’s ability to automatically parallelize the code.
One more notable benefit of the kernels construct is that if multiple loops access the same data, it will only be copied to the accelerator once, which may result in less data motion.
Programmers with less parallel programming experience or whose code contains a large number of loops that need to be analyzed may find the kernels approach much simpler, as it puts more of the burden on the compiler.
e.g. refer to:
#pragma acc kernels
{
    for (i = 0; i < n; i++)
        a[i] = 3.0f * (float)(i + 1);

    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}
- Generates two kernels
- There is an implicit barrier between the two loops: the second loop will start after the first loop ends.