For a study I have to analyze the performance difference between the CPU and the GPU. My problem is that I have a .cu file containing only C++ code and a .cpp file with exactly the same code, yet the .cu file runs about 3 times faster than the .cpp file. The .cu file is compiled by NVCC, but NVCC only handles the CUDA-specific parts; since there is no CUDA code, everything should be passed through to the host C++ compiler. That is what I don't understand about the performance difference.
#include <iostream>
#include <conio.h>
#include <ctime>
#include <cuda.h>
#include <cuda_runtime.h>             // Stops underlining of __global__
#include <device_launch_parameters.h> // Stops underlining of threadIdx etc.

using namespace std;

void FindClosestCPU(float3* points, int* indices, int count) {
    // Base case, if there's 1 point don't do anything
    if(count <= 1) return;
    // Loop through every point
    for(int curPoint = 0; curPoint < count; curPoint++) {
        // This variable is nearest so far, set it to float.max
        float distToClosest = 3.40282e38f;
        // See how far it is from every other point
        for(int i = 0; i < count; i++) {
            // Don't check distance to itself
            if(i == curPoint) continue;
            float dist = sqrt((points[curPoint].x - points[i].x) *
                              (points[curPoint].x - points[i].x) +
                              (points[curPoint].y - points[i].y) *
                              (points[curPoint].y - points[i].y) +
                              (points[curPoint].z - points[i].z) *
                              (points[curPoint].z - points[i].z));
            if(dist < distToClosest) {
                distToClosest = dist;
                indices[curPoint] = i;
            }
        }
    }
}

int main()
{
    // Number of points
    const int count = 10000;
    // Arrays of points
    int *indexOfClosest = new int[count];
    float3 *points = new float3[count];
    // Create a list of random points
    for(int i = 0; i < count; i++)
    {
        points[i].x = (float)((rand()%10000) - 5000);
        points[i].y = (float)((rand()%10000) - 5000);
        points[i].z = (float)((rand()%10000) - 5000);
    }
    // This variable is used to keep track of the fastest time so far
    long fastest = 1000000;
    // Run the algorithm 2 times
    for(int q = 0; q < 2; q++)
    {
        long startTime = clock();
        // Run the algorithm
        FindClosestCPU(points, indexOfClosest, count);
        long finishTime = clock();
        cout<<"Run "<<q<<" took "<<(finishTime - startTime)<<" millis"<<endl;
        // If that run was faster update the fastest time so far
        if((finishTime - startTime) < fastest)
            fastest = (finishTime - startTime);
    }
    // Print out the fastest time
    cout<<"Fastest time: "<<fastest<<endl;
    // Print the final results to screen
    cout<<"Final results:"<<endl;
    for(int i = 0; i < 10; i++)
        cout<<i<<"."<<indexOfClosest[i]<<endl;
    // Deallocate ram
    delete[] indexOfClosest;
    delete[] points;
    _getch();
    return 0;
}
The only difference between the two files is that one is a .cu file and is compiled by NVCC, while the other is a .cpp file and is compiled normally by the C++ compiler.
As such you are not using any CUDA functions that need to run on the GPU, but you are using float3, which is part of the CUDA API and not pure C++. So when you change the extension to .cu, the code involving float3 is compiled by NVCC, and since NVCC's handling (and the flags it forwards to the host compiler) may differ from your default C++ compiler settings, a timing difference can arise during execution.

You might want to check this by passing a 'pure' C++ file with a .cu extension to NVCC and comparing the timings; ideally it will hand the whole file to the default host compiler and there would be no time difference when executing.
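Another thing worth ruling out is that the two files are simply built with different optimization settings, for example a Debug .cpp configuration versus a Release-like configuration that NVCC forwards to the host compiler. A minimal sketch (just a guess at a useful check, not something established from the question) that prints a few predefined macros from each build:

#include <cstdio>

int main() {
#ifdef __CUDACC__
    // Defined when the file goes through the NVCC front end.
    std::printf("compiled via nvcc\n");
#endif
#ifdef _MSC_VER
    std::printf("host compiler: MSVC %d\n", _MSC_VER);
#endif
#ifdef NDEBUG
    std::printf("NDEBUG set (typical for Release builds)\n");
#else
    std::printf("NDEBUG not set (typical for Debug builds)\n");
#endif
    return 0;
}

If the two binaries report different settings, that alone can explain a roughly 3x gap in a tight floating-point loop like FindClosestCPU.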
I have a printf in my __global__ code. It works as intended most of the time. However, when using a multi-GPU system (it typically happens on a 4-8 GPU system), once in a while the prints will merge. By once in a while, I mean about 100-500 lines out of 167000 lines.
I was wondering how this situation can be remedied without adding too much overhead from transferring the data back to the host (if possible). I was thinking of trying a mutex lock for printing, but I don't think that sort of thing exists for use inside a kernel. Any other solutions I could try?
Note: The actual kernel is a long-running kernel, usually taking around 20-50 minutes to complete depending on the GPU.
Note2: I barely know what I'm doing with C/C++.
Example of merged Output
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869fxLuZJu3ZkX
qVOuKlQ0ZcMrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
What it should look like
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869f
MrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a
qVOuKlQ0Zc,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
xLuZJu3ZkX,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
My Example Code:
#define BLOCKS 384
#define THREADS 64

typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;

pthread_mutex_t solutionLock;

__global__ void kernel(unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine */
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine */
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Print to console - randomStr,Hash */
    printf("%s,%s\n", random, data);
}

void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(rngSeed);
        cudaDeviceSynchronize();
    }
    cudaDeviceReset();
    return NULL;
}

int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    pthread_mutex_lock(&solutionLock);
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
I spent 4 days trying different things to no avail. I really don't understand memory management in C/C++ well enough to get past the endless segmentation-fault errors.
What I ended up doing was using Unified Memory, as it seemed the easiest way to handle the memory for both device and host, and it doesn't seem to add much overhead to the whole process. Then each CPU thread (one per GPU) can write to its own file. I ran nvprof a couple of times, and after the initial cudaMallocManaged setup, the remaining overhead seemed to be measured in microseconds. Since each loop takes 20 minutes, these are barely noticeable.
I created two __device__ functions to copy the data over to the host-accessible arrays, because I wanted to use the #pragma unroll feature. I'm not really sure whether that helps or what it even does, but I decided to do things this way.
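As I understand it, #pragma unroll simply asks the compiler to replicate the loop body instead of keeping a counted loop. Conceptually, for a small fixed-size copy:

// With a trip count known at compile time, a loop like this ...
#pragma unroll 4
for (int i = 0; i < 4; i++) dst[i] = src[i];

// ... is expanded by the compiler into straight-line code, roughly:
dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];

That removes the loop-counter overhead and can help the compiler schedule the byte copies, but for a 10- or 64-byte copy the effect is probably small.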
If anyone has further suggestions on ways to improve I am open to trying more things out.
Here is my new example code:
#define BLOCKS 384
#define THREADS 64

typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;

__device__ void mycpydigest(BYTE * __restrict__ dst, const BYTE * __restrict__ src) {
    #pragma unroll 64
    for (BYTE i = 0; i < 64; i++) {
        dst[i] = src[i];
    }
    dst[64] = '\0';
}

__device__ void mycpyrandom(BYTE * __restrict__ dst, const BYTE * __restrict__ src) {
    #pragma unroll 10
    for (BYTE i = 0; i < 10; i++) {
        dst[i] = src[i];
    }
    dst[10] = '\0';
}

__global__ void kernel(BYTE **d_random, BYTE **d_hashes, unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine */
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine */
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Send to host - randomStr & Hash */
    mycpydigest(d_hashes[idx], data);
    mycpyrandom(d_random[idx], random);
}

void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    int threadBlocks = BLOCKS * THREADS;
    BYTE **randoms;
    BYTE **hashes;
    cudaMallocManaged(&randoms, sizeof(BYTE *) * threadBlocks, cudaMemAttachGlobal);
    cudaMallocManaged(&hashes, sizeof(BYTE *) * threadBlocks, cudaMemAttachGlobal);
    for (int i = 0; i < threadBlocks; i++) {
        cudaMallocManaged(&randoms[i], sizeof(BYTE) * RANDOM_LEN, cudaMemAttachGlobal);
        cudaMallocManaged(&hashes[i], sizeof(BYTE) * DIGEST_LEN, cudaMemAttachGlobal);
    }
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(randoms, hashes, rngSeed);
        cudaDeviceSynchronize();
        print2File(randoms, hashes, threadBlocks, hi->device);
    }
    cudaFree(hashes);
    cudaFree(randoms);
    cudaDeviceReset();
    return NULL;
}

int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
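print2File is not shown above; a minimal sketch of what such a helper could look like (an assumption, not the original implementation: BYTE is unsigned char, both strings are NUL-terminated by the device-side copy helpers, and each GPU appends to its own file) is:

#include <stdio.h>

void print2File(BYTE **randoms, BYTE **hashes, int count, int device) {
    // One output file per device, appended to after every kernel launch.
    char filename[32];
    snprintf(filename, sizeof(filename), "results_gpu%d.txt", device);
    FILE *f = fopen(filename, "a");
    if (f == NULL) return;
    for (int i = 0; i < count; i++) {
        fprintf(f, "%s,%s\n", (const char *)randoms[i], (const char *)hashes[i]);
    }
    fclose(f);
}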
I want to thank @paleonix for the help in the comments. I had been working on this issue for a week before I posted, and your comments helped guide me down a different path.
I'm having trouble using atomicMin to find the minimum value in a matrix in CUDA. I'm sure it has something to do with the parameters I'm passing to the atomicMin function. The findMin function is the one to focus on; the popMatrix function is just there to populate the matrix.
#include <stdio.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>

#define SIZE 4

__global__ void popMatrix(unsigned *matrix) {
    unsigned id, num;
    curandState_t state;
    id = threadIdx.x * blockDim.x + threadIdx.y;
    // Populate matrix with random numbers
    curand_init(id, 0, 0, &state);
    num = curand(&state)%100;
    matrix[id] = num;
}

__global__ void findMin(unsigned *matrix, unsigned *temp) {
    unsigned id;
    id = threadIdx.x * blockDim.y + threadIdx.y;
    atomicMin(temp, matrix[id]);
    printf("old: %d, new: %d", matrix[id], temp);
}

int main() {
    dim3 block(SIZE, SIZE, 1);
    unsigned *arr, *harr, *temp;
    cudaMalloc(&arr, SIZE*SIZE*sizeof(unsigned));
    popMatrix<<<1,block>>>(arr);
    // Print matrix of random numbers to see if min number was picked right
    cudaMemcpy(harr, arr, SIZE*SIZE*sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < SIZE; i++) {
        for (unsigned j = 0; j < SIZE; j++) {
            printf("%d ", harr[i*SIZE+j]);
        }
        printf("\n");
    }
    temp = harr[0];
    findMin<<<1, block>>>(harr);
    return 0;
}
harr is not allocated. You should allocate it on the host side, for example with malloc, before calling cudaMemcpy. As a result, the printed values you are looking at are garbage. It is quite surprising that the program did not segfault on your machine.
Moreover, when you call the kernel findMin at the end, you pass it harr (which, judging by its name, is supposed to be on the host side); the argument needs to be a device pointer for the atomic operation to work correctly. As a result, the current kernel call is invalid.
As pointed out by @RobertCrovella, a cudaDeviceSynchronize() call is missing at the end. Moreover, you need to free your device memory using cudaFree.
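Putting those fixes together, a corrected main could look roughly like this (a sketch only; it assumes temp is meant to hold the running minimum on the device and is initialized to the largest unsigned value before the kernel runs):

int main() {
    dim3 block(SIZE, SIZE, 1);
    unsigned *arr, *temp;
    unsigned *harr = (unsigned *)malloc(SIZE*SIZE*sizeof(unsigned)); // host copy
    cudaMalloc(&arr, SIZE*SIZE*sizeof(unsigned));
    cudaMalloc(&temp, sizeof(unsigned));                             // device-side minimum
    popMatrix<<<1, block>>>(arr);
    cudaMemcpy(harr, arr, SIZE*SIZE*sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < SIZE; i++) {
        for (unsigned j = 0; j < SIZE; j++)
            printf("%u ", harr[i*SIZE+j]);
        printf("\n");
    }
    unsigned init = 0xFFFFFFFFu;                      // start from the largest unsigned value
    cudaMemcpy(temp, &init, sizeof(unsigned), cudaMemcpyHostToDevice);
    findMin<<<1, block>>>(arr, temp);                 // pass device pointers, both arguments
    cudaDeviceSynchronize();
    unsigned hmin;
    cudaMemcpy(&hmin, temp, sizeof(unsigned), cudaMemcpyDeviceToHost);
    printf("min: %u\n", hmin);
    cudaFree(arr);
    cudaFree(temp);
    free(harr);
    return 0;
}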
Consider the following program:
#include <iostream>
#include <array>
#include <unistd.h>

using clock_value_t = long long;

__device__ void gpu_sleep(clock_value_t sleep_cycles)
{
    clock_value_t start = clock64();
    clock_value_t cycles_elapsed;
    do { cycles_elapsed = clock64() - start; }
    while (cycles_elapsed < sleep_cycles);
}

__global__ void dummy(clock_value_t duration_in_cycles)
{
    gpu_sleep(duration_in_cycles);
}

int main()
{
    const clock_value_t duration_in_clocks = 1e7;
    const size_t buffer_size = 2e7;
    constexpr const auto num_streams = 8;
    std::array<char*, num_streams> host_ptrs;
    std::array<char*, num_streams> device_ptrs;
    std::array<cudaStream_t, num_streams> streams;
    for (auto i = 0; i < num_streams; i++) {
        cudaMallocHost(&host_ptrs[i], buffer_size);
        cudaMalloc(&device_ptrs[i], buffer_size);
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    }
    cudaDeviceSynchronize();
    for (auto i = 0; i < num_streams; i++) {
        cudaMemcpyAsync(device_ptrs[i], host_ptrs[i], buffer_size, cudaMemcpyDefault, streams[i]);
        dummy<<<128, 128, 0, streams[i]>>>(duration_in_clocks);
        cudaMemcpyAsync(host_ptrs[i], device_ptrs[i], buffer_size, cudaMemcpyDefault, streams[i]);
    }
    usleep(50000);
    for (auto i = 0; i < num_streams; i++) { cudaStreamSynchronize(streams[i]); }
    for (auto i = 0; i < num_streams; i++) {
        cudaFreeHost(host_ptrs[i]);
        cudaFree(device_ptrs[i]);
    }
}
I'm running it on a GTX Titan X, with CUDA 8.0.61, on Fedora 25, with driver 375.66. The timeline I'm seeing in the profiler is this:
There are a few things wrong with this picture:
As far as I can recall, there can only be one HtoD transfer at a time.
All of the memory transfers should take basically the same amount of time: they move the same amount of data, and there's nothing else interesting going on on the PCIe bus that would affect transfer rates so much.
Some DtoH bars look like they're stretched out until something happens on another stream.
There's a huge gap in which there seems to be no compute and no real I/O. Even if the DtoH transfers for all previously completed kernels were to occupy that gap, that would still leave a very significant amount of time. It actually looks like a scheduling issue rather than a profiling artifact.
So, how should I interpret this timeline? And where does the problem lie? (Hopefully not with the programmer...)
I should mention that with fewer streams (e.g. 2) the timeline looks very nice on the same software and hardware.
I ran into this issue and I cannot work out how to handle it. Any suggestion is appreciated.
I have a structure defined in a header file as follows:
Results.h
#ifndef RESULTS_H
#define RESULTS_H
struct Results
{
    double dOptSizeMWh;
    double dOrigSOCFinal;
    double dManiSOCFinal;
};
#endif
and a general declaration of the Deterministic function in Deterministic.h:
#ifndef DETERMINISTIC_H
#define DETERMINISTIC_H
Results Deterministic(int,int,int,double,double); //Deterministic(int nNoMonth, int nNOWind, int nWindLength, double dPreviousSizeMWh, double dPreviousSOC)
#endif;
This function is implemented in Deterministic.cpp:
#include "Results.h"
Results Deterministic(int nNoMonth, int nNOWind, int nWindLength, double dPreviousSizeMWh, double dPreviousSOC)
{
// returns number of rows and columns of the array created
struct Results sRes;
sRes.dOptSizeMWh = -1.0; // for the optimal size of battery in MWh
sRes.dOrigSOCFinal = -1.0; // for the SOC at the end of the window
sRes.dManiSOCFinal = -1.0; // this is set to 0.0 if final SOC is slightly below 0
//...........................////
// OTHER Calculation .......////
//...........................////
return sRes;
}
Finally, I have a main file, main.cpp, in which I call the Deterministic function and use the Results structure:
#include <Results.h>
#include <Deterministic.h>

using namespace std;

int main ()
{
    int nNoMonth = 1;    // the month that we want to use in the input
    int nWindLength = 1; // length of window, hour
    int nNODays = 1;     // number of days that we want to repeat optimization
    struct Results dValues;
    double **mRes = new double*[nNODays * 24 / nWindLength];
    for (int i = 0; i < nNODays * 24 / nWindLength; ++i) mRes[i] = new double[3];
    for (int i = 0; i < nNODays * 24 / nWindLength; i++)
    {
        if (i == 0)
        {
            dValues = Deterministic(nNoMonth, i, nWindLength, 0.0, 0.0);
        }
        else
        {
            temp0 = *(*(mRes+i-1)); double temp1 = *(*(mRes+i-1)+1); double temp2 = *(*(mRes+i-1)+2);
            if (temp2 == -1.0) {dValues = Deterministic(nNoMonth, i, nWindLength, temp0, temp1);}
            else {dValues = Deterministic(nNoMonth, i, nWindLength, *(*(mRes+i-1)), *(*(mRes+i-1)));}
        }
        *(*(mRes+i)) = dValues.dOptSizeMWh;
        *(*(mRes+i)+1) = dValues.dOrigSOCFinal;
        *(*(mRes+i)+2) = dValues.dManiSOCFinal;
    }
These are only a small portion of the code in Deterministic.cpp and main.cpp, but they define the problem. The first loop iteration (i.e., i=0) goes through without any problem, but it fails in the second iteration and beyond with this error: "R6010 - abort() has been called".
The error comes up in main.cpp where I call the Deterministic function inside the if statement.
I have no problem compiling and running the posted code (other than the missing double in front of the declaration of temp0). Without knowing what Deterministic() is actually doing, it's a bit hard to guess what the problem is (divide by zero? playing a Justin Bieber mp3?). It shouldn't have anything to do with returning a structure from a function defined in another file (translation units are a fundamental feature of the language). To find the root cause, single-step through the (complete) Deterministic() using your debugger.
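For what it's worth, that branch is easier to read (and the missing declaration harder to miss) if the pointer arithmetic is written with plain indexing; a sketch of the equivalent lines:

// Equivalent to the *(*(mRes+i-1)+k) expressions, with the missing
// declaration of temp0 fixed:
double temp0 = mRes[i-1][0];
double temp1 = mRes[i-1][1];
double temp2 = mRes[i-1][2];
if (temp2 == -1.0)
    dValues = Deterministic(nNoMonth, i, nWindLength, temp0, temp1);
else
    // note: as in the original, this branch passes mRes[i-1][0] for both
    // arguments, which may or may not be intended
    dValues = Deterministic(nNoMonth, i, nWindLength, mRes[i-1][0], mRes[i-1][0]);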
I am trying to measure the performance difference of a GPU between allocating memory using 'malloc' in a kernel function vs. using pre-allocated storage from 'cudaMalloc' on the host. To do this, I have two kernel functions, one that uses malloc, one that uses a pre-allocated array, and I time the execution of each function repeatedly.
The problem is that the first execution of each kernel function takes between 400 - 2500 microseconds, but all subsequent runs take about 15 - 30 microseconds.
Is this behavior expected, or am I witnessing some sort of carryover effect from previous runs? If this is carryover, what can I do to prevent it?
I have tried putting in a kernel function that zeros out all memory on the GPU between each timed test run to eliminate that carryover, but nothing changed. I have also tried reversing the order in which I run the tests, and that has no effect on relative or absolute execution times.
const int TEST_SIZE = 1000;

struct node {
    node* next;
    int data;
};

int main() {
    int numTests = 5;
    for (int i = 0; i < numTests; ++i) {
        memClear();
        staticTest();
        memClear();
        dynamicTest();
    }
    return 0;
}
__global__ void staticMalloc(int* sum) {
    // start a linked list
    node head[TEST_SIZE];
    // initialize nodes
    for (int j = 0; j < TEST_SIZE; j++) {
        // allocate the node & assign values
        head[j].next = NULL;
        head[j].data = j;
    }
    // verify creation by adding up values
    int total = 0;
    for (int j = 0; j < TEST_SIZE; j++) {
        total += head[j].data;
    }
    sum[0] = total;
}
/**
 * This is a test that will time execution of static allocation
 */
int staticTest() {
    int expectedValue = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedValue += i;
    }
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check static allocation time
    runTime = 0;
    runTimer.start();
    staticMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete (h_sum);
    return 0;
}
__global__ void dynamicMalloc(int* sum) {
    // start a linked list
    node* headPtr = (node*) malloc(sizeof(node));
    headPtr->data = 0;
    headPtr->next = NULL;
    node* curPtr = headPtr;
    // add nodes to test in-kernel malloc on the device
    for (int j = 1; j < TEST_SIZE; j++) {
        // allocate the node & assign values
        node* nodePtr = (node*) malloc(sizeof(node));
        nodePtr->data = j;
        nodePtr->next = NULL;
        // add it to the linked list
        curPtr->next = nodePtr;
        curPtr = nodePtr;
    }
    // verify creation by adding up values
    curPtr = headPtr;
    int total = 0;
    while (curPtr != NULL) {
        // add and increment current value
        total += curPtr->data;
        curPtr = curPtr->next;
        // clean up memory
        free(headPtr);
        headPtr = curPtr;
    }
    sum[0] = total;
}
/**
 * Host function that prepares data array and passes it to the CUDA kernel.
 */
int dynamicTest() {
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check dynamic allocation time
    runTime = 0;
    runTimer.start();
    dynamicMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete (h_sum);
    return 0;
}
__global__ void clearMemory(char *zeros) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    zeros[i] = 0;
}

void memClear() {
    char *zeros[1024]; // device pointers
    for (int i = 0; i < 1024; ++i) {
        cudaMalloc((void**) &(zeros[i]), 4 * 1024 * 1024);
        clearMemory<<<1024, 4 * 1024>>>(zeros[i]);
    }
    for (int i = 0; i < 1024; ++i) {
        cudaFree(zeros[i]);
    }
}
The first execution of a kernel takes more time because a lot of stuff has to be loaded onto the GPU (the kernel, libraries, etc.). To prove it, you can just measure how long it takes to launch an empty kernel, and you will see that it takes some time. Try something like this:
#include <chrono>

__global__ void emptyKernel() {}

// ... inside some host function:
auto start = std::chrono::steady_clock::now();
emptyKernel<<<1, 1>>>();
cudaDeviceSynchronize();   // wait for the launch to complete before stopping the timer
auto end = std::chrono::steady_clock::now();
auto firstTiming = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

start = std::chrono::steady_clock::now();
emptyKernel<<<1, 1>>>();
cudaDeviceSynchronize();
end = std::chrono::steady_clock::now();
auto secondTiming = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
You will see that secondTiming is significantly smaller than firstTiming.
The first CUDA (kernel) call initializes the CUDA system transparently. You can keep it out of your measurements by calling an empty kernel first. Note that the same initialization is required in e.g. OpenCL, but there you have to do all of that init work manually; CUDA does it for you in the background.
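One common way to force that initialization up front (one option among several) is a do-nothing runtime call or a trivial kernel launch before any timed code, for example:

cudaFree(0);               // a no-op free that still triggers lazy context creation
warmup<<<1, 1>>>();        // or a trivial kernel, assuming: __global__ void warmup() {}
cudaDeviceSynchronize();   // make sure initialization has finished before timing starts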
Then there are some problems with your timing: CUDA kernel launches are asynchronous. So (assuming your Timer class is a host timer like time()) you are currently measuring the kernel launch time (and, for the first call, the CUDA init time), not the kernel execution time.
At the very least you HAVE to do a cudaDeviceSynchronize() before starting AND stopping the timer.
You are better off using CUDA events, which can measure exactly the kernel execution time and only that; with host timers you still include the launch overhead. See https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/
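For reference, a minimal sketch of event-based timing around one of the kernel launches from the question (standard cudaEvent API):

cudaEvent_t startEvt, stopEvt;
cudaEventCreate(&startEvt);
cudaEventCreate(&stopEvt);

cudaEventRecord(startEvt);                      // recorded in the same (default) stream as the kernel
staticMalloc<<<gridsize, blocksize>>>(d_sum);
cudaEventRecord(stopEvt);

cudaEventSynchronize(stopEvt);                  // wait until the kernel and the stop event have completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, startEvt, stopEvt);   // elapsed time in milliseconds
printf("kernel took %.3f ms\n", ms);

cudaEventDestroy(startEvt);
cudaEventDestroy(stopEvt);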