I am new to Java programming and am trying to code a matrix multiplication program in jCUDA. While transferring the data from host to device and vice versa I use:
cuMemcpyHtoD(devMatrixA, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);
cuMemcpyHtoD(devMatrixB, Pointer.to(hostMatrixB), numRows * numCols * Sizeof.FLOAT);
cuMemcpyDtoH(Pointer.to(hostMatrixC), devMatrixC, numRows * numCols * Sizeof.FLOAT);
Here, devMatrixA, devMatrixB and devMatrixC are the matrices stored in device memory, and hostMatrixA, hostMatrixB and hostMatrixC are the matrices stored in host memory.
When I call the above functions for data transfer, I get the following error: 'The method to(byte[]) in the type Pointer is not applicable for the arguments (float[][])', with the 'to' in 'Pointer.to(' underlined in red. I am using Eclipse. My complete code is given below.
Pardon my Java knowledge, and please tell me if I am going in the wrong direction.
package JCudaMatrixAddition;
import static jcuda.driver.JCudaDriver.*;
import java.io.*;
import jcuda.*;
import jcuda.driver.*;
import jcuda.Pointer;
import jcuda.Sizeof;
public class JCudaMatrixAddition {
public static void main(String[] args) throws IOException
{
// Enable exceptions and omit all subsequent error checks
JCudaDriver.setExceptionsEnabled(true);
// Create the PTX file by calling the NVCC
String ptxFilename = preparePtxFile("JCudaMatrixAdditionKernel.cu");
//Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet (device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
//Load PTX file
CUmodule module = new CUmodule();
cuModuleLoad(module,ptxFilename);
//Obtain a function pointer to the Add function
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "add");
int numRows = 32;
int numCols = 32;
//Allocate and fill Host input Matrices:
float hostMatrixA[][] = new float[numRows][numCols];
float hostMatrixB[][] = new float[numRows][numCols];
float hostMatrixC[][] = new float[numRows][numCols];
for(int i = 0; i<numRows; i++)
{
for(int j = 0; j<numCols; j++)
{
hostMatrixA[i][j] = (float) 1.0;
hostMatrixB[i][j] = (float) 1.0;
}
}
// Allocate the device input data, and copy the
// host input data to the device
CUdeviceptr devMatrixA = new CUdeviceptr();
cuMemAlloc(devMatrixA, numRows * numCols * Sizeof.FLOAT);
//This is the part where it gives me the error
cuMemcpyHtoD(devMatrixA, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);
CUdeviceptr devMatrixB = new CUdeviceptr();
cuMemAlloc(devMatrixB, numRows * numCols * Sizeof.FLOAT);
//This is the part where it gives me the error
cuMemcpyHtoD(devMatrixB, Pointer.to(hostMatrixB), numRows * numCols * Sizeof.FLOAT);
//Allocate device matrix C to store output
CUdeviceptr devMatrixC = new CUdeviceptr();
cuMemAlloc(devMatrixC, numRows * numCols * Sizeof.FLOAT);
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(Pointer.to(new int[]{numRows}),
Pointer.to(new int[]{numCols}),
Pointer.to(devMatrixA),
Pointer.to(devMatrixB),
Pointer.to(devMatrixC));
//Kernel thread configuration
int blockSize = 32;
int gridSize = 1;
cuLaunchKernel(function,
gridSize, 1, 1,
blockSize, 32, 1,
0, null, kernelParameters, null);
cuCtxSynchronize();
// Allocate host output memory and copy the device output
// to the host.
//This is the part where it gives me the error
cuMemcpyDtoH(Pointer.to(hostMatrixC), devMatrixC, numRows * numCols * Sizeof.FLOAT);
//verify the result
for (int i =0; i<numRows; i++)
{
for (int j = 0; j < numCols; j++)
{
System.out.print(" "+ hostMatrixB[i][j]);
}
System.out.println("");
}
cuMemFree(devMatrixA);
cuMemFree(devMatrixB);
cuMemFree(devMatrixC);
}
You cannot copy a float[][] array from the host to the device directly.
A float[][] array is not one large array of float values; it is an array of arrays. Imagine that you could even create an array like
float array[][] = new float[3][];
array[0] = new float[42];
array[1] = null;
array[2] = new float[1234];
Such an array is simply not a contiguous memory block, and thus it cannot be copied to the device in a single operation.
When handling matrices in CUDA (not only in JCuda, but in CUDA in general), they are usually represented as 1-dimensional arrays. So in this case, you could declare your matrices as
float hostMatrixA[] = new float[numRows*numCols];
In order to access the matrix elements, you have to compute the appropriate index:
int row = ...;
int col = ...;
hostMatrix[col + row*numCols] = 123.0f; // Row-major
// Or
hostMatrix[row + col*numRows] = 123.0f; // Column-major
The difference between the last two lines is that one assumes row-major order and the other assumes column-major order. See the Wikipedia article about row-major order for details.
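To make this concrete, here is a minimal sketch of the whole round trip with a flat 1D matrix, written in plain CUDA C since the layout issue is exactly the same there (the sizes and the row-major convention are my own choices for illustration):
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    const int numRows = 32, numCols = 32;
    const size_t bytes = numRows * numCols * sizeof(float);

    // One contiguous block instead of an array of arrays.
    float *hostMatrixA = (float*)malloc(bytes);
    for (int row = 0; row < numRows; row++)
        for (int col = 0; col < numCols; col++)
            hostMatrixA[col + row * numCols] = 1.0f; // row-major indexing

    float *devMatrixA;
    cudaMalloc((void**)&devMatrixA, bytes);

    // A contiguous block can be transferred with a single copy.
    cudaMemcpy(devMatrixA, hostMatrixA, bytes, cudaMemcpyHostToDevice);

    // ... launch a kernel, then copy the result back the same way ...

    cudaFree(devMatrixA);
    free(hostMatrixA);
    return 0;
}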
Some side notes:
The CUDA matrix libraries like CUBLAS use column-major ordering, so it is probably a good idea to follow the same convention, particularly if you later want to use CUBLAS/JCublas functions. For example, the cublasSgeam function already offers the functionality to perform a matrix addition (see the sketch after these notes).
If you only want to do a matrix addition, you will not see a speedup from using CUDA/JCuda. I wrote a summary about this in this answer.
And BTW: Technically, it is possible to use "2D arrays". The JCudaDriverSample shows how this can be done. But it is rather inconvenient and not recommended for matrix operations.
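For reference, a hedged sketch of such a CUBLAS-based addition, C = A + B on column-major device matrices (the wrapper function and variable names are my own; error checking is omitted):
#include <cublas_v2.h>

// Assumes devMatrixA/B/C are numRows x numCols, column-major,
// already allocated and filled on the device.
void addWithCublas(const float *devMatrixA, const float *devMatrixB,
                   float *devMatrixC, int numRows, int numCols)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 1.0f;
    // C = alpha * A + beta * B
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                numRows, numCols,
                &alpha, devMatrixA, numRows,
                &beta, devMatrixB, numRows,
                devMatrixC, numRows);

    cublasDestroy(handle);
}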
Related
I am trying to complete homework #2 for the Udacity parallel programming course. I have run into a CUDA error that I just can't get around. The error is encountered when I launch a kernel that is meant to separate an image in the format "RGBRGBRGB" into three separate arrays, "RRR", "GGG" and "BBB". Seeing as the error "unspecified launch failure" does not give me anything specific to go on, I am not sure how to troubleshoot my issue.
Here is the "main" function called to start the entire process. I left out everything after the point where the error is encountered so that I don't post the rest of my work for someone to find later.
void your_gaussian_blur(const uchar4 * const h_inputImageRGBA, uchar4 * const d_inputImageRGBA, uchar4* const d_outputImageRGBA, const size_t numRows, const size_t numCols,
unsigned char *d_redBlurred,
unsigned char *d_greenBlurred,
unsigned char *d_blueBlurred,
const int filterWidth)
{
// Maximum number of threads per block = 512; do this
// to keep this compatible with CUDA 5 and lower
// MAX > threadsX * threadsY * threadsZ
int MAXTHREADSx = 16;
int MAXTHREADSy = 16; // 16 x 16 x 1 = 256
// We want to fill the blocks so we don't waste the blocks' threads
// I wonder if blocks can intermix in a physical core?
// Either way this method makes things "clean"; one thread per px
int nBlockX = numCols / MAXTHREADSx + 1;
int nBlockY = numRows / MAXTHREADSy + 1;
const dim3 blockSize(MAXTHREADSx, MAXTHREADSy, 1);
const dim3 gridSize(nBlockX, nBlockY, 1);
separateChannels<<<gridSize, blockSize>>>(
h_inputImageRGBA,
numRows,
numCols,
d_red,
d_green,
d_blue);
// Call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after
// launching your kernel to make sure that you didn't make any mistakes.
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
And here is the function separateChannels
//This kernel takes in an image represented as a uchar4 and splits
//it into three images consisting of only one color channel each
__global__
void separateChannels(const uchar4* const inputImageRGBA,
int numRows,
int numCols,
unsigned char* const redChannel,
unsigned char* const greenChannel,
unsigned char* const blueChannel)
{
//const int2 thread_2D_pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y);
const int col = blockIdx.x * blockDim.x + threadIdx.x;
const int row = blockIdx.y * blockDim.y + threadIdx.y;
//if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows)
// return;
if (col >= numCols || row >= numRows)
return;
//const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x;
int arrayPos = row * numCols + col;
uchar4 rgba = inputImageRGBA[arrayPos];
redChannel[arrayPos] = rgba.x;
greenChannel[arrayPos] = rgba.y;
blueChannel[arrayPos] = rgba.z;
}
I think I put in everything necessary; please let me know if not.
Without seeing the rest of the code I cannot tell for sure, but I believe you are passing a pointer to host memory as a parameter to the CUDA kernel - not a good thing to do. In the kernel launch you are passing h_inputImageRGBA, while I believe you want to pass d_inputImageRGBA.
Typically the h_ prefix stands for host memory, while d_ stands for device memory.
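In the context of the question's code, the fix would look roughly like this (a hedged sketch; it assumes the dimensions from the question, the course's checkCudaErrors helper, and that d_inputImageRGBA was allocated with cudaMalloc):
// Copy the input image to the device once...
cudaMemcpy(d_inputImageRGBA, h_inputImageRGBA,
           numRows * numCols * sizeof(uchar4), cudaMemcpyHostToDevice);

// ...then launch with the device pointer, not the host pointer.
separateChannels<<<gridSize, blockSize>>>(
    d_inputImageRGBA, // was: h_inputImageRGBA
    numRows, numCols,
    d_red, d_green, d_blue);

cudaDeviceSynchronize();
checkCudaErrors(cudaGetLastError());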
I am trying to implement a simple matrix multiplication program using shared memory in JCuda.
Following is my JCudaSharedMatrixMul.java code:
import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoad;
import static jcuda.runtime.JCuda.cudaEventCreate;
import static jcuda.runtime.JCuda.cudaEventRecord;
import static jcuda.runtime.JCuda.*;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.runtime.cudaEvent_t;
public class JCudaSharedMatrixMul
{
public static void main(String[] args) throws IOException
{
// Enable exceptions and omit all subsequent error checks
JCudaDriver.setExceptionsEnabled(true);
// Create the PTX file by calling the NVCC
String ptxFilename = preparePtxFile("JCudaSharedMatrixMulKernel.cu");
//Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet (device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
//Load PTX file
CUmodule module = new CUmodule();
cuModuleLoad(module,ptxFilename);
//Obtain a function pointer to the matrix multiplication kernel
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "jCudaSharedMatrixMulKernel");
int numRows = 16;
int numCols = 16;
//Allocate and fill Host input Matrices:
float hostMatrixA[] = new float[numRows*numCols];
float hostMatrixB[] = new float[numRows*numCols];
float hostMatrixC[] = new float[numRows*numCols];
for(int i = 0; i<numRows; i++)
{
for(int j = 0; j<numCols; j++)
{
hostMatrixA[i*numCols+j] = (float) 1;
hostMatrixB[i*numCols+j] = (float) 1;
}
}
// Allocate the device input data, and copy the
// host input data to the device
CUdeviceptr devMatrixA = new CUdeviceptr();
cuMemAlloc(devMatrixA, numRows * numCols * Sizeof.FLOAT);
cuMemcpyHtoD(devMatrixA, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);
CUdeviceptr devMatrixB = new CUdeviceptr();
cuMemAlloc(devMatrixB, numRows * numCols * Sizeof.FLOAT);
cuMemcpyHtoD(devMatrixB, Pointer.to(hostMatrixB ), numRows * numCols * Sizeof.FLOAT);
//Allocate device matrix C to store output
CUdeviceptr devMatrixC = new CUdeviceptr();
cuMemAlloc(devMatrixC, numRows * numCols * Sizeof.FLOAT);
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{numCols}),
Pointer.to(devMatrixA),
Pointer.to(devMatrixB),
Pointer.to(devMatrixC));
//Kernel thread configuration
int blockSize = 16;
int gridSize = 1;
cudaEvent_t start = new cudaEvent_t();
cudaEvent_t stop = new cudaEvent_t();
cudaEventCreate(start);
cudaEventCreate(stop);
long start_nano=System.nanoTime();
cudaEventRecord(start, null);
cuLaunchKernel(function,
gridSize, 1, 1,
blockSize, 16, 1,
250, null, kernelParameters, null);
cuCtxSynchronize();
cudaEventRecord(stop, null);
long end_nano=System.nanoTime();
float elapsedTimeMsArray[] = { Float.NaN };
cudaEventElapsedTime(elapsedTimeMsArray, start, stop);
float elapsedTimeMs = elapsedTimeMsArray[0];
System.out.println("Time Required (Using cudaevent elapsed time) = " + " " +elapsedTimeMs+
"Time Required (Using nanotime)= "+(end_nano-start_nano)/1000000);
// Allocate host output memory and copy the device output
// to the host.
cuMemcpyDtoH(Pointer.to(hostMatrixC), devMatrixC, numRows * numCols * Sizeof.FLOAT);
//verify the result
for (int i =0; i<numRows; i++)
{
for (int j = 0; j < numCols; j++)
{
System.out.print(" "+ hostMatrixC[i*numCols+j]);
}
System.out.println("");
}
cuMemFree(devMatrixA);
cuMemFree(devMatrixB);
cuMemFree(devMatrixC);
}
private static String preparePtxFile(String cuFileName) throws IOException
{
int endIndex = cuFileName.lastIndexOf('.');
if (endIndex == -1)
{
endIndex = cuFileName.length()-1;
}
String ptxFileName = cuFileName.substring(0, endIndex+1)+"ptx";
File ptxFile = new File(ptxFileName);
if (ptxFile.exists())
{
return ptxFileName;
}
File cuFile = new File(cuFileName);
if (!cuFile.exists())
{
throw new IOException("Input file not found: "+cuFileName);
}
String modelString = "-m"+System.getProperty("sun.arch.data.model");
String command = "nvcc " + modelString + " -ptx "+ cuFile.getPath()+" -o "+ptxFileName;
System.out.println("Executing\n"+command);
Process process = Runtime.getRuntime().exec(command);
String errorMessage = new String(toByteArray(process.getErrorStream()));
String outputMessage = new String(toByteArray(process.getInputStream()));
int exitValue = 0;
try
{
exitValue = process.waitFor();
}
catch (InterruptedException e)
{
Thread.currentThread().interrupt();
throw new IOException(
"Interrupted while waiting for nvcc output", e);
}
if (exitValue != 0)
{
System.out.println("nvcc process exitValue "+exitValue);
System.out.println("errorMessage:\n"+errorMessage);
System.out.println("outputMessage:\n"+outputMessage);
throw new IOException(
"Could not create .ptx file: "+errorMessage);
}
System.out.println("Finished creating PTX file");
return ptxFileName;
}
private static byte[] toByteArray(InputStream inputStream) throws IOException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte buffer[] = new byte[8192];
while (true)
{
int read = inputStream.read(buffer);
if (read == -1)
{
break;
}
baos.write(buffer, 0, read);
}
return baos.toByteArray();
}
}
Following is my JCudaSharedMatrixMulKernel.cu code:
extern "C"
__global__ void jCudaSharedMatrixMulKernel(int N,float *ad,float *bd,float *cd)
{
float pvalue=0;
int TILE=blockDim.x;
int ty=threadIdx.y;
int tx=threadIdx.x;
__shared__ float ads[4][4];
__shared__ float bds[4][4];
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
for(int i=0;i< N/TILE;++i)
{
ads[ty][tx] = ad[Row * N + (i * TILE) + tx];
bds[ty][tx] = bd[(i * TILE + ty) * N + Col];
__syncthreads();
for(int k=0;k<TILE;k++)
pvalue += ads[ty][k] * bds[k][tx];
__syncthreads();
}
cd[Row * N + Col] = pvalue;
}
In my above example total shared memory used per block is 2*4*4*4 = 128 bytes.
In cuLaunchKernel, when I set the sharedMemBytes parameter to 0 (zero), it gives me the following error:
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:282)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1795)
at JCudaSharedMatrixMul.main(JCudaSharedMatrixMul.java:121)
When I define it as 128, it gives the same error. But when I make it 129, it gives me the correct output! Any value between 129 and 49024 gives me the correct result.
My question is: why am I not able to get the correct output when I define it as 128? Also, what is the maximum shared memory that can be defined? Why does the 129-49024 range work here?
You're launching blocks of 16x16 threads:
cuLaunchKernel(function,
gridSize, 1, 1,
blockSize, 16, 1, <-- the first two params are block.x and block.y
250, null, kernelParameters, null);
so __shared__ float ads[4][4]; should not be working at all. For example, these lines of kernel code would be accessing those shared arrays out-of-bounds for some threads:
ads[ty][tx] = ad[Row * N + (i * TILE) + tx];
bds[ty][tx] = bd[(i * TILE + ty) * N + Col];
    ^   ^
    |   tx goes from 0..15 for a 16x16 threadblock
    ty goes from 0..15 for a 16x16 threadblock
Your code is broken in this respect. If you run your code with cuda-memcheck, it may catch these out-of-bounds accesses, even in your "passing" case. Looking at the matrixMulDrv CUDA sample code will be instructive; you'll see that its shared memory allocation is 2*block_size*block_size, as it should be for your case as well, and your shared memory definitions should be [16][16], not [4][4]. It may be that the shared memory allocation granularity just happens to work when you exceed 128 bytes, but there is a defect in your code.
Your shared definitions should be:
__shared__ float ads[16][16];
__shared__ float bds[16][16];
Since the above allocations are static, and the sharedMemBytes parameter specifies the dynamic shared memory allocation, for this example you don't need to allocate any dynamic shared memory (0 is OK), and it still works. The difference between static and dynamic allocation is covered here.
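For completeness, if you did want the sharedMemBytes parameter to be meaningful, the tiles would have to be declared as dynamic shared memory. A minimal sketch of that variant (my own illustration, not part of the original code), launched with sharedMemBytes = 2 * 16 * 16 * Sizeof.FLOAT:
extern "C"
__global__ void jCudaSharedMatrixMulKernelDyn(int N, float *ad, float *bd, float *cd)
{
    // One dynamically sized block of shared memory, split in two by hand.
    extern __shared__ float smem[];
    int TILE = blockDim.x;            // assumes square thread blocks
    float *ads = smem;                // first TILE*TILE floats: tile of A
    float *bds = smem + TILE * TILE;  // next TILE*TILE floats: tile of B
    int ty = threadIdx.y;
    int tx = threadIdx.x;
    int Row = blockIdx.y * TILE + ty;
    int Col = blockIdx.x * TILE + tx;
    float pvalue = 0;
    for (int i = 0; i < N / TILE; ++i)
    {
        ads[ty * TILE + tx] = ad[Row * N + (i * TILE) + tx];
        bds[ty * TILE + tx] = bd[(i * TILE + ty) * N + Col];
        __syncthreads();
        for (int k = 0; k < TILE; k++)
            pvalue += ads[ty * TILE + k] * bds[k * TILE + tx];
        __syncthreads();
    }
    cd[Row * N + Col] = pvalue;
}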
The maximum shared memory per block is given in the documentation, or can be obtained by running the deviceQuery CUDA sample code. It is 48KB for cc 2.0 and newer devices.
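You can also query the limit at runtime; a small sketch, assuming device 0:
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}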
I am trying to measure the performance difference of a GPU between allocating memory using 'malloc' in a kernel function vs. using pre-allocated storage from 'cudaMalloc' on the host. To do this, I have two kernel functions, one that uses malloc, one that uses a pre-allocated array, and I time the execution of each function repeatedly.
The problem is that the first execution of each kernel function takes between 400 - 2500 microseconds, but all subsequent runs take about 15 - 30 microseconds.
Is this behavior expected, or am I witnessing some sort of carryover effect from previous runs? If this is carryover, what can I do to prevent it?
I have tried putting in a kernel function that zeros out all memory on the GPU between each timed test run to eliminate that carryover, but nothing changed. I have also tried reversing the order in which I run the tests, and that has no effect on relative or absolute execution times.
const int TEST_SIZE = 1000;
struct node {
node* next;
int data;
};
int main() {
int numTests = 5;
for (int i = 0; i < numTests; ++i) {
memClear();
staticTest();
memClear();
dynamicTest();
}
return 0;
}
__global__ void staticMalloc(int* sum) {
// create an array of nodes (no dynamic allocation)
node head[TEST_SIZE];
// initialize nodes
for (int j = 0; j < TEST_SIZE; j++) {
// allocate the node & assign values
head[j].next = NULL;
head[j].data = j;
}
// verify creation by adding up values
int total = 0;
for (int j = 0; j < TEST_SIZE; j++) {
total += head[j].data;
}
sum[0] = total;
}
/**
* This is a test that will time execution of static allocation
*/
int staticTest() {
int expectedValue = 0;
for (int i = 0; i < TEST_SIZE; ++i) {
expectedValue += i;
}
// host output vector
int* h_sum = new int[1];
h_sum[0] = -1;
// device output vector
int* d_sum;
// vector size
size_t bytes = sizeof(int);
// allocate memory on device
cudaMalloc(&d_sum, bytes);
// only use 1 CUDA thread
dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
Timer runTimer;
int runTime = 0;
// check static allocation time
runTime = 0;
runTimer.start();
staticMalloc<<<gridsize, blocksize>>>(d_sum);
runTime += runTimer.lap();
h_sum[0] = 0;
cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_sum);
delete[] h_sum;
return 0;
}
__global__ void dynamicMalloc(int* sum) {
// start a linked list
node* headPtr = (node*) malloc(sizeof(node));
headPtr->data = 0;
headPtr->next = NULL;
node* curPtr = headPtr;
// add nodes to test cudaMalloc in device
for (int j = 1; j < TEST_SIZE; j++) {
// allocate the node & assign values
node* nodePtr = (node*) malloc(sizeof(node));
nodePtr->data = j;
nodePtr->next = NULL;
// add it to the linked list
curPtr->next = nodePtr;
curPtr = nodePtr;
}
// verify creation by adding up values
curPtr = headPtr;
int total = 0;
while (curPtr != NULL) {
// add and increment current value
total += curPtr->data;
curPtr = curPtr->next;
// clean up memory
free(headPtr);
headPtr = curPtr;
}
sum[0] = total;
}
/**
* Host function that prepares data array and passes it to the CUDA kernel.
*/
int dynamicTest() {
// host output vector
int* h_sum = new int[1];
h_sum[0] = -1;
// device output vector
int* d_sum;
// vector size
size_t bytes = sizeof(int);
// allocate memory on device
cudaMalloc(&d_sum, bytes);
// only use 1 CUDA thread
dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
Timer runTimer;
int runTime = 0;
// check dynamic allocation time
runTime = 0;
runTimer.start();
dynamicMalloc<<<gridsize, blocksize>>>(d_sum);
runTime += runTimer.lap();
h_sum[0] = 0;
cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_sum);
delete[] h_sum;
return 0;
}
__global__ void clearMemory(char *zeros) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
zeros[i] = 0;
}
void memClear() {
char *zeros[1024]; // device pointers
for (int i = 0; i < 1024; ++i) {
cudaMalloc((void**) &(zeros[i]), 4 * 1024 * 1024);
clearMemory<<<4 * 1024, 1024>>>(zeros[i]); // 4 Mi threads, one per byte (max 1024 threads per block)
}
for (int i = 0; i < 1024; ++i) {
cudaFree(zeros[i]);
}
}
The first execution of a kernel takes more time because you have to load a lot of stuff onto the GPU (the kernel, libraries, etc.). To prove it, you can measure how long it takes to launch an empty kernel, and you will see that it takes some time. Try something like:
time -> start
launch emptykernel
time -> end
firstTiming = end - start
time -> start
launch empty kernel
time -> end
secondTiming = end - start
You will see that secondTiming is significantly smaller than firstTiming.
The first CUDA (kernel) call initializes the CUDA system transparently. You can avoid this overhead by calling an empty kernel first. Note that in e.g. OpenCL you have to do all of that init work manually; CUDA does it for you in the background.
Then there are some problems with your timing: CUDA kernel calls are asynchronous. So (assuming your Timer class is a host timer, like time()) you are currently measuring the kernel launch time (plus, for the first call, the init time of CUDA), not the kernel execution time.
At the very least you HAVE to do a cudaDeviceSynchronize() before starting AND stopping the timer.
You are better off using CUDA events, which can measure exactly the kernel execution time and only that. With host timers you still include the launch overhead. See https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/
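A minimal, self-contained sketch of that pattern (my own, not from the question), with a warm-up launch to absorb the one-time init overhead:
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>(); // warm-up: triggers CUDA context initialization
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    emptyKernel<<<1, 1>>>(); // the launch we actually time
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}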
When I retrieve info from my DB with fetch_row(result), I want to select from these results and store them in a dynamic array. row[i] will be the info I need, and I'll need to store it in tagid[trigger], but only a char, not a char*, can be stored there. So what I do now is tagid[trigger] = *row[i]; but when I check the results, it isn't what I want. The number 358713020035990 needs to be in tagid:
row[i] 0x05df2090 "358713020035990" char *
tagid[i] -112 '' char
How do I get this right?
char *tagid;
int trigger;
tagid = (char *) malloc(sizeof(char));
result = mysql_store_result(conn); // only one column of integers
num_rows = mysql_num_rows(result);
while (row = mysql_fetch_row(result))
{
tagid[trigger] = *row[i];
}
If you are trying to copy the string data itself, and not just the pointer to that data, then you are going to have to use a memory copy operation, preferably a standard library function made for such purposes like strcpy or strncpy. So assuming that tagid[trigger] refers to a block of memory that is an array of type char, you could do the following:
#include <string.h>
//tagid is a dynamically allocated two-dimensional array of chars: COLUMNSIZE strings of ROWSIZE chars each
char** tagid;
tagid = malloc(sizeof(char*) * COLUMNSIZE);
for (int i=0; i < COLUMNSIZE; i++)
{
tagid[i] = malloc(sizeof(char) * ROWSIZE);
}
//copy some data into your array at row index "trigger"
int trigger = SOMEVALUE;
strncpy(tagid[trigger], row[i], ROWSIZE);
//free the memory you've allocated for your two dimensional array
for (int i=0; i < COLUMNSIZE; i++)
{
free(tagid[i]);
}
free(tagid);
The value of ROWSIZE will have to be big enough to hold your largest string plus a terminating null; otherwise the copy will be truncated if you use strncpy, or the data will overflow the array bounds and overwrite something you don't want it to if you use strcpy.
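Also note that strncpy does not null-terminate the destination when the source fills the whole buffer; a safer variant of the copy above (my own sketch) forces the terminator explicitly:
strncpy(tagid[trigger], row[i], ROWSIZE - 1);
tagid[trigger][ROWSIZE - 1] = '\0'; // guarantee termination even on truncation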
I have a CUDA kernel which I'm compiling to a cubin file without any special flags:
nvcc text.cu -cubin
It compiles, though with this message:
Advisory: Cannot tell what pointer points to, assuming global memory space
and a reference to a line in some temporary cpp file. I can get this to work by commenting out some seemingly arbitrary code which makes no sense to me.
The kernel is as follows:
__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
int localMatches = 0;
int blockId = blockIdx.x + blockIdx.y * gridDim.x;
int threadId = threadIdx.x + threadIdx.y * blockDim.x;
int blockThreads = blockDim.x * blockDim.y;
__shared__ int localMatchCounts[32];
bool breaking = false;
// each thread starts at its own offset and strides by the block's thread count
for(int i = threadId; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
{
breaking = false; // reset the flag for each candidate position
if(texts[blockId][i] == symbol[0])
{
for(int j = 1; j < symbolLength; j++)
{
if(texts[blockId][i + j] != symbol[j])
{
breaking = true;
break;
}
}
if (breaking) continue;
localMatches++;
}
}
localMatchCounts[threadId] = localMatches;
__syncthreads();
if(threadId == 0)
{
int sum = 0;
for(int i = 0; i < 32; i++)
{
sum += localMatchCounts[i];
}
matches[blockId] = sum;
}
}
If I replace the line
localMatchCounts[threadId] = localMatches;
after the first for loop with this line
localMatchCounts[threadId] = 5;
it compiles with no notices. This can also be achieved by commenting out seemingly random parts of the loop above the line. I have also tried replacing the local memory array with a normal array to no effect. Can anyone tell me what the problem is?
The system is Vista 64-bit, for what it's worth.
Edit: I fixed the code so it actually works, though it still produces the compiler notice. It does not seem as though the warning is a problem, at least with regards to correctness (it might affect performance).
Arrays of pointers like char** are problematic in kernels, since the kernels have no access to the host's memory.
It is better to allocate a single contiguous buffer and to divide it in a manner that enables parallel access.
In this case I'd define a 1D array which contains all the strings positioned one after another, and another 1D array, sized 2*numberOfStrings, which contains the offset of each string within the first array and its length:
For example - preparation for kernel:
char* buffer = st[0] + st[1] + st[2] + ....; // pseudocode: concatenate all strings into one contiguous buffer
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int cnt = 0; cnt < numberOfStrings; cnt++)
{
metadata[2 * cnt] = lastpos; // offset of string cnt within buffer
metadata[2 * cnt + 1] = length(st[cnt]); // length of string cnt
lastpos += length(st[cnt]);
}
In kernel:
currentIndex = threadId + blockId * threadsPerBlock; // i.e. threadIdx.x + blockIdx.x * blockDim.x
char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];
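To complete the picture, a hedged sketch of the host-side upload for this layout (it reuses the buffer, metadata and lastpos built above; error checking omitted):
// Allocate device copies of the flattened buffer and the metadata,
// then transfer each with a single contiguous copy.
char *d_buffer;
int *d_metadata;
cudaMalloc((void**)&d_buffer, lastpos * sizeof(char));
cudaMalloc((void**)&d_metadata, 2 * numberOfStrings * sizeof(int));
cudaMemcpy(d_buffer, buffer, lastpos * sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(d_metadata, metadata, 2 * numberOfStrings * sizeof(int),
           cudaMemcpyHostToDevice);
// The kernel then takes plain char* / int* parameters instead of char**.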
The problem seems to be associated with the char** parameter. Turning it into a char* resolved the warning, so I suspect that CUDA might have problems with this form of data. Perhaps CUDA prefers that one use its dedicated 2D array facilities in this case.