CUDA pipeline asynchronous memory copy from global to shared memory - cuda

I'm currently learning how to write fast CUDA kernels. I implemented a tiled matrix multiplication (block size 32x32) which only does coalesc reads/writes from/to global memory and has no bank conflicts when writing/reading from shared memory (it has ~50% of the speed of the pytorch matrix multiplication implementation). Now I tried to use pipelining (two stages) and copy memory asyncronously from global to shared memory (see here, and here).
torch::PackedTensorAccessor32<float,2,torch::RestrictPtrTraits> a; // input to the kernel
constexpr unsigned stages_count = 2;
__shared__ float s_a[stages_count][32][32];
auto block = cooperative_groups::this_thread_block();
__shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, stages_count> shared_state;
auto pipeline = cuda::make_pipeline(block, &shared_state);
for(int step=0; step<a.size(1); step+=32) {
for(int stage=0; stage<stages_count; stage++) {
pipeline.producer_acquire();
// what i would like to do (this works but is not asynchronous)
s_a[stage][threadIdx.y][threadIdx.x] = a[blockIdx.x*stages_count*32 + stage*32 + threadIdx.y][step + threadIdx.x];
// this does not work
cuda::memcpy_async(block,
&s_a[stage][threadIdx.y][0],
&a[blockIdx.x*stages_count*32 + stage*32 + threadIdx.y][step],
sizeof(float) * 32,
pipeline);
pipeline.producer_commit();
}
for(int stage=0; stage<stages_count; stage++) {
pipeline.consumer_wait();
// use shared memory
pipeline.consumer_release();
}
}
However, I don't know how to make the asynchronous memory copy work. Problem is, I think, that I don't want to copy 32*32 consecutive floats from global memory, but one tile of the matrix (32 times 32 consecutive float). Also, is it possible to somehow transpose (or e.g. use a permuted shared memory layout) while asynchronously loading to prevent later bank conflicts?

The new Hopper architecture (H100 GPU) has a new hardware feature for this, called the tensor memory accelerator (TMA). Software support will come with CUDA 12 later this year.
As far as I understand it this will allow to asynchronously copy tensor tiles with a single command. But, if it works at all on Ampere and older architectures, it might be quite slow in the same way that the emulated cuda::memcpy_async is quite slow on pre-Ampere GPUs in my experience due to missing hardware support.
Not sure if the transpose you mention will be part of that new API but it might:
TMA significantly reduces addressing overhead and improves efficiency with support for different tensor layouts (1D-5D tensors), different memory access modes, reductions, and other features.
When asynchronicity is not required, CUB provides some useful functionality for "transposing" data in cub::BlockLoad (and cub::BlockStore). A downside with these is that they use shared memory as an intermediary only and write to registers or local memory in the end, so for these kinds of tiled matrix multiplication kernels they probably are not of any help. They might add a feature for only copying to shared memory in the future. Maybe these new containers will even support asynchronicity.

Related

Using dynamic parallelism results in 30x worse performance

Note: I don't have my computer and GPU with me so this me typing from memory. I timed this and compiled it correctly so ignore any odd typos should they exist.
I don't know if the overhead of what I'm going to describe below is the problem, or if I'm doing this wrong, or why launching kernels in kernels is slower than one big kernel that has a lot of threads predicate off and not get used. Maybe this is because I'm not swamping the GPU with work that I don't notice the saturation.
Suppose we're doing something simple for the sake of this example, like multiplying all the values in a square matrix by two. The matrices can be any size, but they won't be larger than 16x16.
Now suppose I have 200 matrices all in the device memory ready to go. I launch a kernel like
// One matrix given to each block
__global__ void matrixFunc(Matrix** matrices)
{
Matrix* m = matrices[blockIdx.x];
int area = m->width * m->height;
if (threadIdx.x < area)
// Heavy calculations
}
// Assume 200 matrices, no larger than 16x16
matrixFunc<<<200, 256>>>(ptrs);
whereby I'm using one block per matrix, and an abundance of threads such that I know I'm never going to have less threads per block than cells in a matrix.
The above runs in 0.17 microseconds.
This seems wasteful. I know that I have a bunch of small matrices (so 256 threads is overkill when a 2x2 matrix can function on 4 threads), so why not launch a bunch of them dynamically from a kernel to see what the runtime overhead is? (for learning reasons)
I change my code to be like the following:
__device__ void matrixFunc(float* matrix)
{
// Heavy calculations (on threadIdx.x for the cell)
}
__global__ void matrixFuncCaller(Matrix** matrices)
{
Matrix* m = matrices[threadIdx.x];
int area = m->width * m->height;
matrixFunc<<<1, area>>>(m.data);
}
matrixFuncCaller<<<1, 200>>>(ptrs);
But this performs a lot worse at 11.3 microseconds.
I realize I could put them all on a stream, so I do that. I then change this to make a new stream:
__global__ void matrixFuncCaller(Matrix** matrices)
{
Matrix* m = matrices[threadIdx.x];
int area = m->width * m->height;
// Create `stream`
matrixFunc<<<1, area, 0, stream>>>(m.data);
// Destroy `stream`
}
This does better, it's now 3 microseconds instead of 11, but it's still much worse than 0.17 microseconds.
I want to know why this is worse.
Is this kernel launching overhead? I figure that maybe my examples are small enough such that the overhead drowns out the work seen here. In my real application which I cannot post, there is a lot more work done than just "2 * matrix", but it still is probably small enough that there might be decent overhead.
Am I doing anything wrong?
Put it shortly: the benchmark is certainly biased and the computation is latency bound.
I do not know how did you measure the timings but I do not believe "0.17 microseconds" is even possible. In fact the overhead of launching a kernel is typically few microseconds (I never saw an overhead smaller than 1 microsecond). Indeed, running a kernel should typically require a system call that are expensive and known to take an overhead of at least about 1000 cycles. An example of overhead analysis can be found in this research paper (confirming that it should takes several microseconds). Not to mention current RAM accesses should take at least 50-100 ns on mainstream x86-64 platforms and the one one of GPU requires several hundreds of cycles. While everything may fit in both the CPU and GPU cache is possible this is very unlikely to be the case regarding the kernels (and the fact the GPU may be used for other tasks during multiple kernel executions). For more information about this, please read this research paper. Thus, what you measure has certainly nothing to do with the kernel execution. To measure the overhead of the kernel, you need to care about synchronizations (eg. call cudaDeviceSynchronize) since kernels are launched asynchronously.
When multiple kernels are launched, you may pay the overhead of an implicit synchronization since the queue is certainly bounded (for sake of performance). In fact, as pointed out by #talonmies in the comments, the number of concurrent kernels is bounded to 16-128 (so less than the number of matrices).
Using multiple streams reduces the need for synchronizations hence the better performance results but there is certainly still a synchronization. That being said, for the comparison to be fair, you need to add a synchronization in all cases or measure the execution time on the GPU itself (without taking care of the launching overhead) still in all cases.
Profilers like nvvp help a lot to understand what is going on in such a case. I strongly advise you to use them.
As for the computation, please note that GPU are designed for heavy computational SIMT-friendly kernels, not low-latency kernel operating on small variable-sized matrices stored in unpredictable memory locations. In fact, the overhead of a global memory access is so big that it should be much bigger than the actual matrix computation. If you want GPUs to be useful, then you need to submit more work to them (so to provide more parallelism to them and so to overlap the high latencies). If you cannot provide more work, then the latency cannot be overlapped and if you care about microsecond latencies then GPUs are clearly not suited for the task.
By the way, not that Nvidia GPUs operate on warp of typically 32 threads. Threads should perform coalesced memory loads/stores to be efficient (otherwise they are split in many load/store requests). Operating on very small matrices like this likely prevent that. Not to mention most threads will do nothing. Flattening the matrices and sorting them by size as proposed by #sebastian in the comments help a bit but the computations and memory access will still be very inefficient for a GPU (not SIMT-friendly). Note that using less thread and make use of unrolling should also be a bit more efficient (but still far from being great). CPUs are better suited for such a task (thanks to a higher frequency, instruction-level parallelism combined with an out-of-order execution). For fast low-latency kernels like this FPGAs can be even better suited (though they are hard to program).

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should be resided in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
How can I initialize (populate) those arrays from values that are computed at the run time on the host?
I searched the web for this but there is no concise document documenting this. I would appreciate it if provided examples would be in both OpenCL and CUDA. The example for OpenCL is more important to me than CUDA.
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier which get mapped to constant memory pages at assembly. You could generate some code contain such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64kb. It seems much simpler (to me at least) to just statically declare a 64kb constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at the runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model), 64kb limit.
For cuda, I use driver API and NVRTC and create kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace ##-pattern words with size and element value information in run-time and compile like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So, it is run-time for the host, compile-time for the device. Downside is that the kernel string gets too big, has less readability and connecting classes needs explicitly linking (as if you are compiling a side C++ project) other compilation units. But for simple calculations with only your own implementations (no host-definitions used directly), it is same as runtime API.
Since large strings require extra parsing time, you can cache the ptx intermediate data and also cache the binary generated from ptx. Then you can check if kernel string has changed and needs to be re-compiled.
Are you sure just __constant__ worths the effort? Do you have some benchmark results to show that actually improves performance? (premature optimization is source of all evil). Perhaps your algorithm works with register-tiling and the source of data does not matter?
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
kernel void mykernel(
constant float* fixed_parameters,
global const uint* dynamic_input_data,
global uint* restrict output_data)
{
cl_mem fixed_parameter_buffer = clCreateBuffer(
cl_context,
CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used! It usually doesn't help to use constant memory buffers for streaming input data, this is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that are used by a large proportion of work-items. There is typically a fairly tight size limitation such as 64KiB on it - some implementations may "spill" to global memory if you try to exceed this, which will lose you any performance advantages you would gain from using constant memory.

What is the reason for K80 versus Pascal performance differences in this program that adds two arrays? [duplicate]

I was testing the new CUDA 8 along with the Pascal Titan X GPU and is expecting speed up for my code but for some reason it ends up being slower. I am on Ubuntu 16.04.
Here is the minimum code that can reproduce the result:
CUDASample.cuh
class CUDASample{
public:
void AddOneToVector(std::vector<int> &in);
};
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int mx = gridDim.x * blockDim.x;
data[y * mx + x] = data[y * mx + x] + 1.0f;
}
void CUDASample::AddOneToVector(std::vector<int> &in){
int *data;
cudaMallocManaged(reinterpret_cast<void **>(&data),
in.size() * sizeof(int),
cudaMemAttachGlobal);
for (std::size_t i = 0; i < in.size(); i++){
data[i] = in.at(i);
}
dim3 blks(in.size()/(16*32),1);
dim3 threads(32, 16);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
for (std::size_t i = 0; i < in.size(); i++){
in.at(i) = data[i];
}
cudaFree(data);
}
Main.cpp
std::vector<int> v;
for (int i = 0; i < 8192000; i++){
v.push_back(i);
}
CUDASample cudasample;
cudasample.AddOneToVector(v);
The only difference is the NVCC flag, which for the Pascal Titan X is:
-gencode arch=compute_61,code=sm_61-std=c++11;
and for the old Maxwell Titan X is:
-gencode arch=compute_52,code=sm_52-std=c++11;
EDIT: Here are the results for running NVIDIA Visual Profiling.
For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.
For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.
I further isolate the problem by replacing cudaMallocManaged into good old cudaMalloc and did some profiling and observe some interesting result.
CUDASample.cu
__global__ static void CUDAKernelAddOneToVector(int *data)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int mx = gridDim.x * blockDim.x;
data[y * mx + x] = data[y * mx + x] + 1.0f;
}
void CUDASample::AddOneToVector(std::vector<int> &in){
int *data;
cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()),
in.size() * sizeof(int), cudaMemcpyHostToDevice);
dim3 blks(in.size()/(16*32),1);
dim3 threads(32, 16);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data),
in.size() * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(data);
}
For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.
For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see...
Why is Pascal GPU so slow on running CUDA kernels when cudaMallocManaged is used? It will be a travesty if I have to revert all my existing code that uses cudaMallocManaged into cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using this results in a slow run time even the code is easier, this should be unacceptable because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong and why am I observing this kind of result?
Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this. (Also see note at the end about CUDA 9 updated behavior for windows.)
With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.
With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device(*). When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:
Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
Cause that page of memory to be migrated from the CPU to the GPU
This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent to actually move the data. Since this process will move data a page at a time, it may be signficantly less efficient than moving all the data at once, either using cudaMemcpy or else via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).
Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.
This particular code example, however, does not benefit. This was anticipated, and so the recommended use to bring the behavior in line with previous (e.g. maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.
You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.
With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.
Note that the "stall" referred to in step 1 above is possibly misleading. A memory access by itself does not trigger a stall. But if the data requested is actually needed for an operation, e.g. a multiply, then the warp will stall at the multiply operation, until the necessary data becomes available. A related point, then, is that demand-paging of data from host to device in this fashion is just another "latency" that the GPU can possibly hide in it's latency-hiding architecture, if there is sufficient other available "work" to attend to.
As an additional note, in CUDA 9, the demand-paging regime for pascal and beyond is only available on linux; the previous support for Windows advertised in CUDA 8 has been dropped. See here. On windows, even for Pascal devices and beyond, as of CUDA 9, the UM regime is the same as maxwell and prior devices; data is migrated to the GPU en-masse, at kernel launch.
(*) The assumption here is that data is "resident" on the host, i.e. already "touched" or initialized in CPU code, after the managed allocation call. The managed allocation itself creates data pages associated with the device, and when CPU code "touches" these pages, the CUDA runtime will demand-page the necessary pages to be resident in host memory, so that the CPU can use them. If you perform an allocation but never "touch" the data in CPU code (an odd situation, probably) then it will actually already be "resident" in device memory when the kernel runs, and the observed behavior will be different. But that is not the case in view for this particular example/question.
Additional information is available in this blog article.
I can reproduce this in three programms on a 1060 and a 1080. As example i use a voulme render with procedural transferfunction which was nearly interactive real time on a 960 but on a 1080 is a slight show. All data are stored in read only textures and only my transferfunctions are in Managed Memory. In difference to my other code the volume render runs especially slow, this is becaus in differece to my other code my transferfunctions are passed from the kernel to other device methods.
I belive that it is not only the calling of kernels with cudaMallocManaged data. My expierence go to that every call of a kernel or device methode has this behavior and the effect adds up. Also the basis of the volume render is in parts the provided CudaSample without Managed Memory, which runs as expected on Maxwell an pascal GPUs (1080, 1060,980Ti,980,960).
I just yesterday found this bug, because we changed all of oure reaserch systems to pascal. I will profile my software in the next days on a 980 in comapre to a 1080. I'm not yet sure if i should report a bug in the NVIDIA developer zone.
it is a BUG of NVIDIA on Windows Systems witch occurs with PASCAL architecture.
I know this since a few days, but could not write it here because i was on vacation without internet connection.
For details see the comments of: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the Bug. It should be corrected with CUDA 9. He also tells that it should be communicated to Microsoft to help the caus. But i don't found a suitable Microsoft Bug Report Page till now.

Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-time optimization of the memory controller (implemented in hardware). How many memory transactions are required to fulfill the load/store of a warp is determined at run-time. A load/store instruction of a warp may be issued repeatedly unless there is perfect coalescing.
Memory vectorization is a compile-time optimization. The number of memory transactions for a vectorized load/store is fixed. Each vector load/store instruction is issued exactly once.
Coalescable GPU load/store instructions are more expressive than SSE vector load/store instructions. E.g., a st.global.s32 PTX instruction may store into 32 arbitrary memory locations (warp size 32), whereas a movdqa SSE instruction can only store into a consecutive block of memory.
Memory coalescing in CUDA seems to guarantee efficient vectorized memory access (when accesses are coalescable), whereas on x86-SSE, we have to hope that the compiler actually vectorizes the code (it may fail to do so) or vectorize code manually with SSE intrinsics, which is more difficult for programmers.
Is this correct? Did I miss an important aspect (thread masking, maybe)?
Now, why do GPUs have run-time coalescing? This probably requires extra circuits in hardware. What are the main benefits over compile-time coalescing as in CPUs? Are there applications/memory access patterns that are harder to implement on CPUs because of missing run-time coalescing?
caveat: I don't really know / understand the architecture / microarchitecture of GPUs very well. Some of this understanding is cobbled together from the question + what other people have written in comments / answers here.
The way GPUs let one instruction operate on multiple data is very different from CPU SIMD. That's why they need special support for memory coalescing at all. CPU-SIMD can't be programmed in a way that needs it.
BTW, CPUs have cache to absorb multiple accesses to the same cache line before the actual DRAM controllers get involved. GPUs have cache too, of course.
Yes, memory-coalescing basically does at runtime what short-vector CPU SIMD does at compile time, within a single "core". The CPU-SIMD equivalent would be gather/scatter loads/stores that could optimize to a single wide access to cache for indices that were adjacent. Existing CPUs don't do this: each element accesses cache separately in a gather. You shouldn't use a gather load if you know that many indices will be adjacent; it will be faster to shuffle 128-bit or 256-bit chunks into place. For the common case where all your data is contiguous, you just use a normal vector load instruction instead of a gather load.
The point of modern short-vector CPU SIMD is to feed more work through a fetch/decode/exec pipeline without making it wider in terms of having to decode + track + exec more CPU instructions per clock cycle. Making a CPU pipeline wider quickly hits diminishing returns for most use-cases, because most code doesn't have a lot of ILP.
A general-purpose CPU spends a lot of transistors on instruction-scheduling / out-of-order execution machinery, so just making it wider to be able to run many more uops in parallel isn't viable. (https://electronics.stackexchange.com/questions/443186/why-not-make-one-big-cpu-core).
To get more throughput, we can raise the frequency, raise IPC, and use SIMD to do more work per instruction/uop that the out-of-order machinery has to track. (And we can build multiple cores on a single chip, but cache-coherent interconnects between them + L3 cache + memory controllers are hard). Modern CPUs use all of these things, so we get a total throughput capability of frequency * IPC * SIMD, and times number of cores if we multithread. They aren't viable alternatives to each other, they're orthogonal things that you have to do all of to drive lots of FLOPs or integer work through a CPU pipeline.
This is why CPU SIMD has wide fixed-width execution units, instead of a separate instruction for each scalar operation. There isn't a mechanism for one scalar instruction to flexibly be fed to multiple execution units.
Taking advantage of this requires vectorization at compile time, not just of your loads / stores but also your ALU computation. If your data isn't contiguous, you have to gather it into SIMD vectors either with scalar loads + shuffles, or with AVX2 / AVX512 gather loads that take a base address + vector of (scaled) indices.
But GPU SIMD is different. It's for massively parallel problems where you do the same thing to every element. The "pipeline" can be very lightweight because it doesn't need to support out-of-order exec or register renaming, or especially branching and exceptions. This makes it feasible to just have scalar execution units without needing to handle data in fixed chunks from contiguous addresses.
These are two very different programming models. They're both SIMD, but the details of the hardware that runs them is very different.
Each vector load/store instruction is issued exactly once.
Yes, that's logically true. In practice the internals can be slightly more complicated, e.g. AMD Ryzen splitting 256-bit vector operations into 128-bit halves, or Intel Sandybridge/IvB doing that for just loads+stores while having 256-bit wide FP ALUs.
There's a slight wrinkle with misaligned loads/stores on Intel x86 CPUs: on a cache-line split, the uop has to get replayed (from the reservation station) to do the other part of the access (to the other cache line).
In Intel terminology, the uop for a split load gets dispatched twice, but only issues + retires once.
Aligned loads/stores like movdqa, or movdqu when the memory happens to be aligned at runtime, are just a single access to L1d cache (assuming a cache hit). Unless you're on a CPU that decodes a vector instruction into two halves, like AMD for 256-bit vectors.
But that stuff is purely inside the CPU core for access to L1d cache. CPU <-> memory transactions are in whole cache lines, with write-back L1d / L2 private caches, and shared L3 on modern x86 CPUs - Which cache mapping technique is used in intel core i7 processor? (Intel since Nehalem, the start of the i3/i5/i7 series, AMD since Bulldozer I think introduced L3 caches for them.)
In a CPU, it's the write-back L1d cache that basically coalesces transactions into whole cache lines, whether you use SIMD or not.
What SIMD helps with is getting more work done inside the CPU, to keep up with faster memory. Or for problems where the data fits in L2 or L1d cache, to go really fast over that data.
Memory coalescing is related to parallel accesses: when each core in a SM will access a subsequent memory location, the memory access is optimized.
Viceversa, SIMD is a single core optimization: when a vector register is filled with operands and a SSE operation is performed, the parallelism is inside the CPU core, with one operation being performed on each internal logical unit per clock cycle.
However you are right: coalesced/uncoalesced memory access is a runtime aspect. SIMD operations are compiled in. I don't think they can compare well.
If I would make a parallelism, I would compare coalesing in GPUs to memory prefetching in CPUs. This is a very important runtime optimization as well - and I believe it's active behind the scene using SSE as well.
However there is nothing similar to colescing in Intel CPU cores. Because of cache coherency, the best you can do in optimizing parallel memory accesses, is to let each core access to independent memory regions.
Now, why do GPUs have run-time coalescing?
Graphical processing is optimized for executing a single task in parallel on adjacent elements.
For example, think to perform an operation on every pixel of an image, assigning each pixel to a different core. Now it's clear that you want to have an optimal path to load the image spreading one pixel to each core.
That's why memory coalescing is deeply buried in the GPUs architecture.

CUDA: Is texture memory still useful to speed up access times for compute capability 2.x and newer?

I'm writing an image processing app where I have to fetch pixel data in uncoalesced manner.
Initially I implemented my algorithm using global memory. Later I reimplemented it using texture memory. To my amazement it became slower! I thought, maybe something wrong with cudaMalloc/text1Dfetch style, so I changed it to cudaArray/tex2D. Nothing changed.
Then I stumbled upon Shane Cook's "CUDA Programming", where he wrote:
As compute 1.x hardware has no cache to speak of, the 6–8K of texture memory per SM provides the
only method to truly cache data on such devices. However, with the advent of Fermi and its up to 48 K
L1 cache and up to 768 K shared L2 cache, this made the usage of texture memory for its cache
properties largely obsolete. The texture cache is still present on Fermi to ensure backward compati-
bility with previous generations of code.
I have GeForce GT 620M (Fermi, compute cap. 2.1).
So I need some advice from professionals! Should I dig deeper into texture memory with its texture cache trying to optimize performance? Or I should better stick with global memory and L1/L2 cache?
Textures can indeed be useful on devices of compute capability >= 2.0.
Textures and cudaArrays can use memory stored in a space filling curve, which can allow for a better cache hit rate due to better 2D spatial locality.
The texture cache is separate from the other caches. So it has its own dedicated memory and bandwidth and reading from it does not interfere with the other caches. This can become important if there is a lot of pressure on your L1/L2 caches.
Textures also provide built in functionality such as interpolation, various addressing modes (clamp, wrap, mirror) and normalized addressing with floating point coordinates. These can be used without any extra cost and can greatly improve performance in kernels where such functionality is needed.
On early CUDA architectures, textures and cudaArrays could not be written by a kernel. On architectures of compute capability >= 2.0, they can be written via CUDA surfaces.
Determining if you should use textures or a regular buffer in global memory comes down to the intended usage and access patterns for the memory. It will be project specific.
You are using the Fermi architecture, with a device that has been rebranded into the 6xx series.
For those on the Kepler architecture, take a look at NVIDIA's Inside Kepler Presentation. In particular, the slides, Texture Performance, Texture Cache Unlocked and const __restrict Example.