Is it possible to access a hard disk or flash drive directly from the GPU (CUDA/OpenCL) and load/store content directly to and from the GPU's memory?
I am trying to avoid copying data from disk to host memory and then copying it over to the GPU's memory.
I have read about NVIDIA GPUDirect, but I am not sure whether it does what I described above. It talks about remote GPU memory and disks, but the disks in my case are local to the GPU.
The basic idea is to load contents (something like DMA) -> do some operations -> store the contents back to disk (again in DMA fashion).
I am trying to involve the CPU and RAM as little as possible here.
Please feel free to offer any suggestions about the design.
For anyone else looking for this, 'lazy unpinning' did more or less what I wanted.
Go through the following to see whether it can be helpful for you.
The most straightforward implementation using RDMA for GPUDirect would
pin memory before each transfer and unpin it right after the transfer
is complete. Unfortunately, this would perform poorly in general, as
pinning and unpinning memory are expensive operations. The rest of the
steps required to perform an RDMA transfer, however, can be performed
quickly without entering the kernel (the DMA list can be cached and
replayed using MMIO registers/command lists).
Hence, lazily unpinning memory is key to a high performance RDMA
implementation. What it implies, is keeping the memory pinned even
after the transfer has finished. This takes advantage of the fact that
it is likely that the same memory region will be used for future DMA
transfers thus lazy unpinning saves pin/unpin operations.
An example implementation of lazy unpinning would keep a set of pinned
memory regions and only unpin some of them (for example the least
recently used one) if the total size of the regions reached some
threshold, or if pinning a new region failed because of BAR space
exhaustion (see PCI BAR sizes).
Here is a link to an application guide and to the NVIDIA docs.
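To make the idea concrete, below is a minimal host-side sketch of such a pinned-region cache. It uses cudaHostRegister/cudaHostUnregister as stand-ins for the driver-level pin/unpin calls of a real GPUDirect RDMA implementation; the size budget and LRU eviction policy are illustrative assumptions, not part of the quoted guide.
// Sketch of lazy unpinning: keep a small LRU cache of pinned host regions so
// repeated transfers of the same buffer avoid the pin/unpin cost. The pin
// budget (kLimit) and the eviction policy are illustrative assumptions.
#include <cuda_runtime.h>
#include <list>
#include <cstddef>

struct PinnedRegion { void* ptr; size_t bytes; };

class PinnedCache {
    std::list<PinnedRegion> lru_;               // front = most recently used
    size_t total_ = 0;
    static const size_t kLimit = 256u << 20;    // assumed 256 MiB pin budget

public:
    // Ensure [ptr, ptr+bytes) is pinned; reuse a cached pinning if present.
    cudaError_t acquire(void* ptr, size_t bytes) {
        for (std::list<PinnedRegion>::iterator it = lru_.begin(); it != lru_.end(); ++it) {
            if (it->ptr == ptr && it->bytes == bytes) {   // cache hit: no pin/unpin cost
                lru_.splice(lru_.begin(), lru_, it);      // mark as most recently used
                return cudaSuccess;
            }
        }
        // Evict least-recently-used regions until the new one fits in the budget.
        while (!lru_.empty() && total_ + bytes > kLimit) {
            PinnedRegion victim = lru_.back();
            lru_.pop_back();
            cudaHostUnregister(victim.ptr);
            total_ -= victim.bytes;
        }
        cudaError_t err = cudaHostRegister(ptr, bytes, cudaHostRegisterDefault);
        if (err == cudaSuccess) {
            PinnedRegion r = { ptr, bytes };
            lru_.push_front(r);
            total_ += bytes;
        }
        return err;   // caller may fall back to an unpinned transfer on failure
    }
};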
Trying to use this feature, I wrote a small example on Windows x64 to implement it. In this example, the kernel "directly" accesses disk space. Actually, as @RobertCrovella mentioned previously, the operating system is doing the job, probably with some CPU work; but no supplemental coding is needed.
#include "Windows.h"
#include <cuda_runtime.h>
#include <cstdio>

// Each thread writes its coordinates twice: once near the start of the
// mapping and once about 2.5 GB further in, to touch far-apart file pages.
__global__ void kernel(int4* ptr)
{
    int4 val ; val.x = threadIdx.x ; val.y = blockDim.x ; val.z = blockIdx.x ; val.w = gridDim.x ;
    ptr[threadIdx.x + blockDim.x * blockIdx.x] = val ;
    ptr[160*1024*1024 + threadIdx.x + blockDim.x * blockIdx.x] = val ;
}

int main()
{
    // 4GB - larger than installed GPU memory
    size_t size = 256 * 1024 * 1024 * sizeof(int4) ;

    // Create the backing file and map a 4 GB view of it into the process.
    HANDLE hFile = ::CreateFileA ("GPU.dump", (GENERIC_READ | GENERIC_WRITE), 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL) ;
    HANDLE hFileMapping = ::CreateFileMappingA (hFile, 0, PAGE_READWRITE, (DWORD)(size >> 32), (DWORD)(size & 0xffffffff), 0) ;
    void* ptr = ::MapViewOfFile (hFileMapping, FILE_MAP_ALL_ACCESS, 0, 0, size) ;
    if (NULL == ptr)
    {
        printf ("could not map view of file\n") ;
        return 1 ;
    }

    // Register the mapped view as pinned, mapped host memory so the GPU can
    // address it directly; the OS pages the file in and out behind the scenes.
    ::cudaSetDeviceFlags (cudaDeviceMapHost) ;
    cudaError_t er = ::cudaHostRegister (ptr, size, cudaHostRegisterMapped) ;
    if (cudaSuccess != er)
    {
        printf ("could not register\n") ;
        return 1 ;
    }

    // Get the device-side pointer that aliases the mapped file view.
    void* d_ptr ;
    er = ::cudaHostGetDevicePointer (&d_ptr, ptr, 0) ;
    if (cudaSuccess != er)
    {
        printf ("could not get device pointer\n") ;
        return 1 ;
    }

    // The kernel writes straight through the mapping into the file.
    kernel<<<256,256>>> ((int4*)d_ptr) ;
    if (cudaSuccess != ::cudaDeviceSynchronize())
    {
        printf ("error in kernel\n") ;
        return 1 ;
    }

    if (cudaSuccess != ::cudaHostUnregister (ptr))
    {
        printf ("could not unregister\n") ;
        return 1 ;
    }

    ::UnmapViewOfFile (ptr) ;
    ::CloseHandle (hFileMapping) ;
    ::CloseHandle (hFile) ;
    ::cudaDeviceReset() ;
    printf ("DONE\n");
    return 0 ;
}
The real solution is on the horizon!
Early access: https://developer.nvidia.com/gpudirect-storage
GPUDirect® Storage (GDS) is the newest addition to the GPUDirect family. GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.
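For reference, here is a rough sketch of what the GDS read path looks like on Linux with the cuFile API. The call names follow the early-access documentation and may change between releases; error handling is omitted, and the file name and size are placeholders.
// Sketch of the GDS (cuFile) read path: storage -> GPU memory with no bounce
// buffer in host RAM. Assumes the early-access cuFile API on Linux.
#define _GNU_SOURCE            // for O_DIRECT
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    const size_t size = 16 << 20;                       // 16 MiB, arbitrary
    cuFileDriverOpen();                                 // initialise the GDS driver

    int fd = open("GPU.dump", O_RDONLY | O_DIRECT);     // O_DIRECT is required for the DMA path
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    void* d_buf = NULL;
    cudaMalloc(&d_buf, size);
    cuFileBufRegister(d_buf, size, 0);                  // optional, avoids per-call pinning

    // Read from storage directly into GPU memory.
    ssize_t n = cuFileRead(fh, d_buf, size, /*file_offset=*/0, /*devPtr_offset=*/0);

    cuFileBufDeregister(d_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cudaFree(d_buf);
    cuFileDriverClose();
    return n < 0;
}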
Related
I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, to get the best performance you should coalesce and align your memory accesses. My GPU has Compute Capability 6.1 ("Pascal"), 48 KiB of shared memory per thread block and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized into 32 banks, so that 32 threads from the same block may each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 KiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 KiB / 32 banks / sizeof(double), which equals 192:
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
        // Perform calculation on shared memory
        cache[threadIdx.x * 192] = ...
    }
    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double makes sure that each thread accesses the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code yet, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.
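For illustration, here is a sketch of the register-based variant of the "MWE added" kernel: the per-thread accumulator lives in a local variable (which the compiler will keep in a register), and global memory is written exactly once per thread at the end.
// Sketch: the "keep it in a register" variant of the MWE kernel. Only the
// accumulator changes; the arithmetic is identical to the question's kernel.
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    double acc = 0.0;                        // per-thread accumulator in a register
    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            acc += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = acc;                          // single global-memory store per thread
}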
I want all accesses from my program to go to global memory (even if the data is found in the L1/L2 cache). To this end, I found that the L1 cache can be skipped by passing these options to the nvcc compiler:
-Xptxas -dlcm=cg
CUDA documentation states this:
.cv Cache as volatile (consider cached system memory lines stale, fetch again).
So, I am assuming that when I compile with either -dlcm=cg or -dlcm=cv, the generated PTX file should be different from the one that is generated normally (the loads should be suffixed with either .cg or .cv).
My sample program:
__global__ void rh_kernel(int *datainRowX, int *datainRowY) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
        return;

    int i, x, y;
    x = datainRowX[1];
    y = datainRowY[2];
    datainRowX[0] = x + y;
}

int main(int argc, char** argv) {
    int* d_datainRowX;
    cudaMalloc((void**)&d_datainRowX, sizeof(int) * 268435456);

    int* d_datainRowY;
    cudaMalloc((void**)&d_datainRowY, sizeof(int) * 268435456);

    rh_kernel<<<1024, 1>>>(d_datainRowX, d_datainRowY);

    cudaFree(d_datainRowX); cudaFree(d_datainRowY);

    return(0);
}
I notice that whatever options I pass to the nvcc compiler ("-Xptxas -dlcm=cg", "-Xptxas -dlcm=cv", or nothing), in all three cases the generated PTX is the same. I am using the -ptx option to generate the PTX file.
What am I missing? Is there any other way to achieve what I am doing?
Thanks in advance for your time.
According to the CUDA Toolkit documentation:
L1 caching in Kepler GPUs is reserved only for local memory accesses,
such as register spills and stack data. Global loads are cached in L2
only (or in the Read-Only Data Cache).
GK110B-based products such as the Tesla K40 GPU Accelerator, GK20A,
and GK210 retain this behavior by default
L1 cache is not used for global memory reads on Kepler by default. Thus, there is no difference in the PTX when you add -Xptxas -dlcm=cg.
Disabling the L2 cache is not possible.
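If the goal is to force particular loads to be re-fetched regardless of the architecture's default caching, one option is to emit a cache operator per load with inline PTX. Below is a minimal sketch, assuming int data and a 64-bit address space; the load_cv helper and the kernel variant are made up for illustration.
// Sketch: force a single load to use the .cv cache operator ("consider cached
// lines stale, fetch again") via inline PTX, independent of -dlcm settings.
__device__ __forceinline__ int load_cv(const int *addr)
{
    int value;
    asm volatile("ld.global.cv.s32 %0, [%1];" : "=r"(value) : "l"(addr));
    return value;
}

__global__ void rh_kernel_cv(int *datainRowX, int *datainRowY)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
        return;
    int x = load_cv(&datainRowX[1]);   // re-fetched rather than served from a cached line
    int y = load_cv(&datainRowY[2]);
    datainRowX[0] = x + y;
}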
I have the simple code shown below, which does nothing but copy some data from host to device using streams. But after running nvprof, I am confused about whether cudaMemcpyAsync is really asynchronous and about my understanding of streams.
#include <stdio.h>
#include <stdlib.h>

#define NUM_STREAMS 4

cudaError_t memcpyUsingStreams (float *fDest,
                                float *fSrc,
                                int iBytes,
                                cudaMemcpyKind eDirection,
                                cudaStream_t *pCuStream)
{
    int iIndex = 0 ;
    cudaError_t cuError = cudaSuccess ;
    int iOffset = 0 ;
    /* Number of float elements handled by each stream (the offset is in
       elements, not bytes, because fDest/fSrc are float pointers). */
    int iElemsPerStream = (iBytes / NUM_STREAMS) / sizeof (float) ;

    /* Creating streams if not present */
    if (NULL == pCuStream)
    {
        pCuStream = (cudaStream_t *) malloc(NUM_STREAMS * sizeof(cudaStream_t));
        for (iIndex = 0 ; iIndex < NUM_STREAMS; iIndex++)
        {
            cuError = cudaStreamCreate (&pCuStream[iIndex]) ;
        }
    }

    if (cuError != cudaSuccess)
    {
        /* Stream creation failed: fall back to a single synchronous copy. */
        cuError = cudaMemcpy (fDest, fSrc, iBytes, eDirection) ;
    }
    else
    {
        /* Issue one chunk per stream. */
        for (iIndex = 0 ; iIndex < NUM_STREAMS; iIndex++)
        {
            iOffset = iIndex * iElemsPerStream ;
            cuError = cudaMemcpyAsync (fDest + iOffset, fSrc + iOffset,
                                       iBytes / NUM_STREAMS, eDirection,
                                       pCuStream[iIndex]) ;
        }
    }

    if (NULL != pCuStream)
    {
        for (iIndex = 0 ; iIndex < NUM_STREAMS; iIndex++)
        {
            cuError = cudaStreamDestroy (pCuStream[iIndex]) ;
        }
        free (pCuStream) ;
    }
    return cuError ;
}
int main()
{
    float *hdata = NULL ;
    float *ddata = NULL ;
    int i, j, k, index ;
    cudaStream_t *abc = NULL ;

    hdata = (float *) malloc (sizeof (float) * 256 * 256 * 256) ;
    cudaMalloc ((void **) &ddata, sizeof (float) * 256 * 256 * 256) ;

    for (i=0 ; i< 256 ; i++)
    {
        for (j=0; j< 256; j++)
        {
            for (k=0; k< 256 ; k++)
            {
                index = (((i * 256) + j) * 256) + k;
                hdata [index] = index ;
            }
        }
    }

    memcpyUsingStreams (ddata, hdata, sizeof (float) * 256 * 256 * 256, cudaMemcpyHostToDevice, abc) ;

    cudaFree (ddata) ;
    free (hdata) ;
    return 0;
}
The nvprof results are as below.
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
104.35ms 10.38ms - - - - - 16.78MB 1.62GB/s 0 1 7 [CUDA memcpy HtoD]
114.73ms 10.41ms - - - - - 16.78MB 1.61GB/s 0 1 8 [CUDA memcpy HtoD]
125.14ms 10.46ms - - - - - 16.78MB 1.60GB/s 0 1 9 [CUDA memcpy HtoD]
135.61ms 10.39ms - - - - - 16.78MB 1.61GB/s 0 1 10 [CUDA memcpy HtoD]
So I don't understand the point of using streams here, given the start times. It looks sequential to me. Please help me understand what I am doing wrong here. I am using a Tesla K20c card.
The PCI Express link that connects your GPU to the system only has one channel going to the card and one channel coming from the card. That means at most, you can have a single cudaMemcpy(Async) operation that is actually executing at any given time, per direction (i.e. one DtoH and one HtoD, at the most). All other cudaMemcpy(Async) operations will get queued up, waiting for those ahead to complete.
You cannot have two operations going in the same direction at the same time. One at a time, per direction.
As @JackOLantern states, the principal benefit of streams is to overlap memcopies and compute, or else to allow multiple kernels to execute concurrently. It also allows one DtoH copy to run concurrently with one HtoD copy.
Since your program does all HtoD copies, they all get executed serially. Each copy has to wait for the copy ahead of it to complete.
Even getting an HtoD and DtoH memcopy to execute concurrently requires a device with multiple copy engines; you can discover this about your device using deviceQuery.
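Programmatically, the same copy-engine information that deviceQuery prints is available from cudaGetDeviceProperties; a quick check might look like this sketch:
// Sketch: query the number of asynchronous copy engines on device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d copy engine(s)\n", prop.name, prop.asyncEngineCount);
    // 1 engine : one copy at a time, regardless of direction
    // 2 engines: one HtoD and one DtoH copy can run concurrently
    return 0;
}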
I should also point out, to enable concurrent behavior, you should use cudaHostAlloc, not malloc, for your host side buffers.
EDIT: The answer above has GPUs in view that have at most 2 copy engines (one per direction) and is still correct for such GPUs. However there exist some newer Pascal and Volta family member GPUs that have more than 2 copy engines. In that case, with 2 (or more) copy engines per direction, it is theoretically possible to have 2 (or more) transfers "in-flight" in that direction. However this doesn't change the characteristics of the PCIE (or NVLink) bus itself. You are still limited to the available bandwidth, and the exact low level behavior (whether such transfers appear to be "serialized" or else appear to run concurrently, but take longer due to sharing of bandwidth) should not matter much in most cases.
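Putting the points above together (pinned host memory, chunked copies, one stream per chunk), here is a minimal sketch of the overlap pattern streams are intended for; the chunk count, sizes, and kernel are illustrative.
// Sketch of the copy/compute overlap pattern. Pinned host memory
// (cudaHostAlloc) is required for cudaMemcpyAsync to be truly asynchronous.
#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int nStreams = 4;
    const int nTotal   = 1 << 24;               // total elements
    const int nChunk   = nTotal / nStreams;     // elements per stream

    float *h = NULL, *d = NULL;
    cudaHostAlloc((void**)&h, nTotal * sizeof(float), cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void**)&d, nTotal * sizeof(float));

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; i++) cudaStreamCreate(&s[i]);

    for (int i = 0; i < nStreams; i++)
    {
        int off = i * nChunk;
        // The HtoD copy of chunk i can overlap the kernel working on chunk i-1.
        cudaMemcpyAsync(d + off, h + off, nChunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(nChunk + 255) / 256, 256, 0, s[i]>>>(d + off, nChunk);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; i++) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}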
I am trying to profile some CUDA Rodinia benchmarks in terms of their SM and memory utilization, power consumption, etc. For that, I simultaneously execute the benchmark and the profiler, which essentially spawns a pthread to profile the GPU execution using the NVML library.
The issue is that the execution time of a benchmark is much higher (about 3 times) when I do not invoke the profiler along with it than when the benchmark executes together with the profiler. The frequency scaling governor for the CPU is userspace, so I do not think the CPU frequency is changing. Is it due to fluctuations in the GPU frequency?
Below is the code for the profiler.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include "nvml.h"
#include "unistd.h"

#define NUM_THREADS 1

void *PrintHello(void *threadid)
{
    long tid;
    tid = (long)threadid;
    // printf("Hello World! It's me, thread #%ld!\n", tid);
    nvmlReturn_t result;
    nvmlDevice_t device;
    nvmlUtilization_t utilization;
    nvmlClockType_t jok;
    unsigned int device_count, i, powergpu, clo;
    char version[80];

    result = nvmlInit();
    result = nvmlSystemGetDriverVersion(version, 80);
    printf("\n Driver version: %s \n\n", version);

    result = nvmlDeviceGetCount(&device_count);
    printf("Found %d device%s\n\n", device_count,
           device_count != 1 ? "s" : "");
    printf("Listing devices:\n");
    result = nvmlDeviceGetHandleByIndex(0, &device);

    /* Poll power, utilization and SM clock every 500 ms. */
    while (1)
    {
        result = nvmlDeviceGetPowerUsage(device, &powergpu);
        result = nvmlDeviceGetUtilizationRates(device, &utilization);
        printf("\n%d\n", powergpu);
        if (result == NVML_SUCCESS)
        {
            printf("%d\n", utilization.gpu);
            printf("%d\n", utilization.memory);
        }
        result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clo);
        if (result == NVML_SUCCESS)
        {
            printf("%d\n", clo);
        }
        usleep(500000);
    }
    pthread_exit(NULL);
}

int main (int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    /* Last thing that main() should do */
    pthread_exit(NULL);
}
With your profiler running, the GPU(s) are being pulled out of their sleep state (due to the access to the NVML API, which is querying data from the GPUs). This makes them respond much more quickly to a CUDA application, and so the application appears to run "faster" if you time the entire application execution (e.g. using the Linux time command).
One solution is to place the GPUs in "persistence mode" with the nvidia-smi command (use nvidia-smi --help to get command line help).
Another solution would be to do the timing from within the application, and exclude the CUDA start-up time from the timing measurement, perhaps by executing a cuda command such as cudaFree(0); prior to the start of timing.
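A minimal sketch of that in-application timing: cudaFree(0) pays the start-up cost before the timed region, and CUDA events bracket only the work of interest (the dummy kernel below stands in for the benchmark's real work).
// Sketch: time only the GPU work, not context creation.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}   // stand-in for the benchmark's kernels

int main()
{
    cudaFree(0);             // force context creation before the timed region

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy<<<1, 1>>>();       // the real benchmark work would go here
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU work: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}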
The following global barrier works on a Kepler K10 but not on a Fermi GTX 580:
__global__ void cudaKernel (float* ref1, float* ref2, int* lock, int time, int dim) {
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lid  = threadIdx.x;
    int numT = blockDim.x * gridDim.x;
    int numP = int (dim / numT);
    int numB = gridDim.x;

    for (int t = 0; t < time; ++t) {
        // compute # time t
        for (int i = 0; i < numP; ++i) {
            int idx = gid + i * numT;
            if (idx > 0 && idx < dim - 1)
                ref2 [idx] = 0.333f * ((ref1 [idx - 1] + ref1 [idx]) + ref1 [idx + 1]);
        }

        // global sync
        if (lid == 0){
            atomicSub (lock, 1);
            while (atomicCAS(lock, 0, 0) != 0);
        }
        __syncthreads();

        // copy-back # time t
        for (int i = 0; i < numP; ++i) {
            int idx = gid + i * numT;
            if (idx > 0 && idx < dim - 1)
                ref1 [idx] = ref2 [idx];
        }

        // global sync
        if (lid == 0){
            atomicAdd (lock, 1);
            while (atomicCAS(lock, numB, numB) != numB);
        }
        __syncthreads();
    }
}
So, by looking at the output sent back to the CPU, I noticed that one thread (either the first or the last thread) escapes the barrier and resumes execution earlier than the others. I'm using CUDA 5.0. The number of blocks is also always smaller than the number of SMs (in my set of runs).
Any idea why the same code wouldn't work on two architectures? What's new in Kepler that helps this global synchronization?
So I suspect the barrier code itself is probably working the same way. It's what's happening on other data structures not associated with the barrier functionality itself that is at issue, it seems.
Neither Kepler nor Fermi has L1 caches that are coherent with each other. What you have discovered (although it's not associated with your barrier code itself) is that the L1 cache behavior differs between Kepler and Fermi.
In particular, the Kepler L1 cache is not in play on global loads as described in the above link, so the caching behavior is handled at the L2 level, which is device-wide and therefore coherent. When a Kepler SMX reads its global data, it gets coherent values from L2.
On the other hand, Fermi has L1 caches that also participate in global loads (by default, although this behavior can be turned off), and the L1 caches as described in the link above are unique to each Fermi SM and are non-coherent with the L1 caches in other SMs. When a Fermi SM reads its global data, it gets values from its L1, which may be non-coherent with the L1 caches in other SMs.
This is the difference in "coherency" that you are seeing, of the data you are manipulating before and after the barrier.
As I mentioned, I believe the barrier code itself is probably working the same way on both devices.
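As an aside (not part of the answer above), a common Fermi-era workaround is to read the inter-block data through volatile-qualified pointers, which makes the compiler emit ld.volatile.global instructions that are not served from the per-SM L1. A sketch of the copy-back phase written that way, derived from the question's kernel:
// Sketch: volatile-qualified pointers force loads/stores to bypass the
// non-coherent per-SM L1 on Fermi, so each block sees device-coherent values.
__global__ void copyBack(volatile float *ref1, volatile float *ref2, int dim)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int numT = blockDim.x * gridDim.x;
    for (int idx = gid; idx < dim - 1; idx += numT)
    {
        if (idx > 0)
            ref1[idx] = ref2[idx];   // volatile load of ref2 bypasses L1
    }
}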