In the code below, I first bind the texture ref to an array called gpu in global memory. Then I call a kernel, getVal, in which I first set gpu[1] to 5 and then read it back through the bound texture using tex1Dfetch(ref, 1). However, tex1Dfetch() does not return the changed value of gpu[1]; it returns the old value.
Then I call another kernel, getagain, which just reads tex1Dfetch(ref, 1) again. This time I get the new value. I do not understand why I do not get the changed value in the first kernel.
#include<cuda_runtime.h>
#include<cuda.h>
#include<stdio.h>
texture<int> ref;
__global__ void getVal(int *c, int *gpu){
gpu[1] = 5;
*c = tex1Dfetch(ref, 1); // returns old value, not 5
}
__global__ void getagain(int *c){
*c = tex1Dfetch(ref, 1); // returns new value !!!????
}
int main(){
int *gpu,*c;
int i,b[10];
for( i =0 ; i < 10; i++){
b[i] = i*3;
}
cudaMalloc((void**)&gpu, sizeof(int) * 10);
cudaBindTexture(NULL, ref, gpu,10*sizeof(int));
cudaMemcpy(gpu, b, 10 * sizeof(int), cudaMemcpyHostToDevice);
cudaMalloc((void**)&c, sizeof(int));
//try changing value and reading using tex1dfetch
getVal<<<1,1>>>(c,gpu);
cudaMemcpy(&i, c,sizeof(int), cudaMemcpyDeviceToHost);
printf("the value returned by tex fetch is %d\n" , i);
cudaMemcpy(b, gpu,10*sizeof(int), cudaMemcpyDeviceToHost);
for( i =0 ; i < 10; i++){
printf("%d\n",b[i]);
}
getagain<<<1,1>>>(c);
cudaMemcpy(&i, c,sizeof(int), cudaMemcpyDeviceToHost);
printf("the value returned by tex fetch is %d\n" , i);
getchar();
}
Within the same kernel call, the texture cache does not maintain coherency with global memory. See section 3.2.10.4 of the CUDA 4.0 C Programming Guide. Coherency of the texture cache between consecutive kernel calls is achieved by the driver flushing the texture cache prior to launching a kernel.
I'm having trouble using atomicMin to find the minimum value in a matrix in CUDA. I'm sure it has something to do with the parameters I'm passing to atomicMin. The function to focus on is findMin; popMatrix just populates the matrix.
#include <stdio.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define SIZE 4
__global__ void popMatrix(unsigned *matrix) {
unsigned id, num;
curandState_t state;
id = threadIdx.x * blockDim.x + threadIdx.y;
// Populate matrix with random numbers
curand_init(id, 0, 0, &state);
num = curand(&state)%100;
matrix[id] = num;
}
__global__ void findMin(unsigned *matrix, unsigned *temp) {
unsigned id;
id = threadIdx.x * blockDim.y + threadIdx.y;
atomicMin(temp, matrix[id]);
printf("old: %d, new: %d", matrix[id], temp);
}
int main() {
dim3 block(SIZE, SIZE, 1);
unsigned *arr, *harr, *temp;
cudaMalloc(&arr, SIZE*SIZE*sizeof(unsigned));
popMatrix<<<1,block>>>(arr);
// Print matrix of random numbers to see if min number was picked right
cudaMemcpy(harr, arr, SIZE*SIZE*sizeof(unsigned), cudaMemcpyDeviceToHost);
for (unsigned i = 0; i < SIZE; i++) {
for (unsigned j = 0; j < SIZE; j++) {
printf("%d ", harr[i*SIZE+j]);
}
printf("\n");
}
temp = harr[0];
findMin<<<1, block>>>(harr);
return 0;
}
harr is not allocated. You should allocate it on the host side, for example with malloc, before calling cudaMemcpy. As a result, the printed values you see are garbage. It is quite surprising that the program did not segfault on your machine.
Moreover, when you call the kernel findMin at the end, you pass it harr, which (judging by its name) is a host pointer. The buffer must be in device memory for the atomic operation to work correctly, so the current kernel call is invalid.
As pointed out by @RobertCrovella, a cudaDeviceSynchronize() call is missing at the end. You also need to free your device memory using cudaFree.
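Since the question is about the parameters of atomicMin, it may help to spell out its contract: it takes the address of the running minimum and a candidate value, stores the smaller of the two at that address, and returns the value that was there before. The following is a host-only C++ sketch of that contract using std::atomic. The names atomic_min and find_min are hypothetical and no CUDA API is called.

```cpp
#include <atomic>
#include <cassert>

// Host-side analogue of atomicMin(address, val): store min(old, val) into
// `target` and return the value that was stored there beforehand.
unsigned atomic_min(std::atomic<unsigned>& target, unsigned val) {
    unsigned old = target.load();
    // Only attempt the exchange while val would actually lower the minimum.
    // On failure, compare_exchange_weak refreshes `old` with the current value.
    while (val < old && !target.compare_exchange_weak(old, val)) {
    }
    return old;
}

// The running minimum must be seeded with a real value (here the first
// element) before any candidate is compared against it.
unsigned find_min(const unsigned* values, int count) {
    assert(count > 0);
    std::atomic<unsigned> result(values[0]);
    for (int i = 1; i < count; ++i) {
        atomic_min(result, values[i]);
    }
    return result.load();
}
```

In the CUDA version, the equivalent of result would be a device allocation that is initialised (for example with cudaMemcpy) before the kernel launch and copied back to the host afterwards.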
I want to print the global 2D array variable d_t using printf inside main, but I get a compile warning saying:
a __device__ variable "d_t" cannot be directly read in a host function
How can I copy global 2D array variable from device to host and then print the first column of each row?
__device__ double *d_t;
__device__ size_t d_gridPitch;
__global__ void kernelFunc()
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
double* rowt = (double*)((char *)d_t + i * d_gridPitch);
rowt[0] = rowt[0] + 40000;
}
int main()
{
int size = 16;
size_t d_pitchLoc;
double *d_tLoc;
cudaMallocPitch((void**)&d_tLoc, &d_pitchLoc, size * sizeof(double), size);
cudaMemset2D(d_tLoc, d_pitchLoc, 0, size * sizeof(double), size);
cudaMemcpyToSymbol(d_gridPitch, &d_pitchLoc, sizeof(d_pitchLoc)); // copy the full size_t, not sizeof(int)
cudaMemcpyToSymbol(d_t, & d_tLoc, sizeof(d_tLoc));
kernelFunc<<<1,size>>>();
for(int i=0; i< size; i++){
double* rowt = (double*)((char *)d_t + i * d_gridPitch);
printf("%.0f, ",rowt[0]);
}
cudaDeviceReset();
return 0;
}
As indicated in comments, the cudaMemcpy2D API is designed for exactly this task. You must allocate or statically define a host memory buffer or container to act as storage for the data from the device, and then provide the pitch of that host buffer to the cudaMemcpy2D call. The API handles the pitch conversion without any further intervention on the caller side.
If you replace the print loop with something like this:
double* h_t = new double[size * size];
cudaMemcpy2D(h_t, size * sizeof(double), d_tLoc, d_pitchLoc,
size * sizeof(double), size, cudaMemcpyDeviceToHost);
for(int i=0, j=0; i< size; i++){
std::cout << h_t[i * size + j] << std::endl;
}
[Note that I'm using iostream here for the printing. CUDA uses a C++ compiler for compiling host code, and you should prefer iostream functions over cstdio because they are less error prone and support improved diagnostics on most platforms.]
You can see that the API call form is very similar to the cudaMemset2D call that I provided for you in your last question.
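To make the pitch handling concrete, here is a host-only C++ sketch of the row-by-row copy that cudaMemcpy2D performs. copy_2d and round_trip are hypothetical names, and the 64 bytes of padding are an arbitrary stand-in for whatever pitch cudaMallocPitch would return. No CUDA API is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `height` rows of `width_bytes` bytes each. Consecutive rows start
// `spitch` bytes apart in the source and `dpitch` bytes apart in the
// destination, which mirrors the parameter order of cudaMemcpy2D.
void copy_2d(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
             std::size_t width_bytes, std::size_t height) {
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width_bytes);
    }
}

// Copy a tightly packed matrix into a padded buffer and back again.
std::vector<int> round_trip(std::size_t width, std::size_t height) {
    const std::size_t host_pitch = width * sizeof(int);  // no padding on the host
    const std::size_t padded_pitch = host_pitch + 64;    // assumed padded pitch
    std::vector<int> src(width * height);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<int>(i);
    }
    std::vector<unsigned char> padded(padded_pitch * height, 0);
    copy_2d(padded.data(), padded_pitch, src.data(), host_pitch, host_pitch, height);
    std::vector<int> back(width * height, -1);
    copy_2d(back.data(), host_pitch, padded.data(), padded_pitch, host_pitch, height);
    return back;
}
```

The point of the sketch is that the two pitches are independent: the host buffer can be tightly packed while the device buffer is padded, and only the width in bytes and the row count have to agree.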
I've a simple CUDA kernel which is incrementing the value of an int by 1 and changing a bool from true to false. Here's my code.
#include <stdio.h>
__global__ void cube(int* d_var, bool* d_bool){
int idx = threadIdx.x;
//do basically nothing
__syncthreads();
*d_var = *d_var + 1;
*d_bool = false;
}
int main(int argc, char** argv){
int h_var = 1;
int* d_var;
bool h_bool = true;
bool* d_bool;
cudaMalloc((void**)&d_var, sizeof(int));
cudaMalloc((void**)&d_bool, sizeof(bool));
while(h_var < 10){
h_bool = true;
//printf("%d\n", h_bool);
//printf("%d\n", h_var);
cudaMemcpy(d_var, &h_var, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_bool, &h_bool, sizeof(bool), cudaMemcpyHostToDevice);
cube<<<10, 512>>>(d_var, d_bool);
cudaThreadSynchronize();
cudaMemcpy(&h_var, d_var, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&h_bool, d_bool, sizeof(bool), cudaMemcpyDeviceToHost);
printf("%d\n", h_var);
printf("%d\n", h_bool);
}
cudaFree(d_var);
cudaFree(d_bool);
//cudaFree(d_out);
return 0;
}
The problem is instead of 1 the output shows an increment of 2 in each step. Thus the output is
1
3
5
7
11
Could someone help me understand what's happening here?
You have a race condition. All threads in the grid associated with the kernel launch are attempting to update the same location (d_var). You'll have to properly manage this access or else you will have unpredictable results.
To understand the race condition better, you need to realize that an operation like this:
*d_var = *d_var + 1;
is performed in multiple steps by the machine. When multiple threads are each asynchronously executing those multiple steps on the same location, the results are unpredictable. One thread can overwrite what another thread has just written, and the results will not be consistent.
__syncthreads() doesn't do anything to manage multiple threads accessing the same location. Furthermore, __syncthreads() only synchronizes the threads within a block, not all of the threads in a grid.
One possible approach to manage the simultaneous access is to use atomics. Atomics will force an orderly access to a memory location by multiple threads that are attempting to do so at the same time.
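The difference between the racy update and an orderly one can be illustrated with a deterministic host-only C++ simulation. This is only a sketch: run_interleaved and run_serialized are hypothetical names, and the schedule in which every thread loads before any thread stores is one assumed worst case. Real hardware interleavings are not predictable.

```cpp
#include <cassert>
#include <vector>

// Simulate `*d_var = *d_var + 1` as three separate steps (load, add, store)
// where every simulated thread performs its load before any thread stores.
int run_interleaved(int initial, int num_threads) {
    assert(num_threads > 0);
    int shared = initial;
    std::vector<int> regs(num_threads);   // one private register per thread
    for (int& r : regs) r = shared;       // step 1: every thread loads
    for (int& r : regs) r = r + 1;        // step 2: every thread adds 1
    for (int& r : regs) shared = r;       // step 3: every thread stores
    return shared;                        // all but one update are lost
}

// Simulate the orderly access that an atomic read-modify-write enforces:
// each thread completes load, add and store before the next thread starts.
int run_serialized(int initial, int num_threads) {
    assert(num_threads > 0);
    int shared = initial;
    for (int t = 0; t < num_threads; ++t) {
        int reg = shared;  // load
        reg = reg + 1;     // add
        shared = reg;      // store
    }
    return shared;
}
```

With an initial value of 1 and 5120 simulated threads, the interleaved schedule ends at 2 while the serialized one ends at 5121. A real unsynchronized kernel can land anywhere in between.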
You could modify your kernel like this:
__global__ void cube(int* d_var, bool* d_bool){
int idx = threadIdx.x;
//do basically nothing
__syncthreads();
atomicAdd(d_var, 1); // modify this line
*d_bool = false;
}
This will now result in d_var getting updated once per thread, for each kernel launch. So if you launch a single thread (<<<1,1>>>), your variable should increase by one per kernel launch. If you launch 5120 threads (<<<10,512>>>), then your variable should increase by 5120 per kernel launch.
Note that we don't need to worry about d_bool in this case, because the only possible outcome is that it is set to false, and that is guaranteed in this case even if multiple threads are doing it.
If you only want to increment the variable by 1 per kernel launch, no matter how many threads are in the grid, then you could modify your kernel code to condition that update on only a single thread:
__global__ void cube(int* d_var, bool* d_bool){
int idx = threadIdx.x + blockDim.x*blockIdx.x; // create globally unique thread ID
//do basically nothing
__syncthreads();
if (idx == 0) // only thread 0 does the update
*d_var = *d_var + 1;
*d_bool = false; // all threads will do this
}
With that modification, I get what I consider to be expected results:
$ cat t997.cu
#include <stdio.h>
__global__ void cube(int* d_var, bool* d_bool){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
//do basically nothing
__syncthreads();
if (idx == 0)
*d_var = *d_var + 1;
*d_bool = false;
}
int main(int argc, char** argv){
int h_var = 1;
int* d_var;
bool h_bool = true;
bool* d_bool;
cudaMalloc((void**)&d_var, sizeof(int));
cudaMalloc((void**)&d_bool, sizeof(bool));
while(h_var < 10){
h_bool = true;
//printf("%d\n", h_bool);
//printf("%d\n", h_var);
cudaMemcpy(d_var, &h_var, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_bool, &h_bool, sizeof(bool), cudaMemcpyHostToDevice);
cube<<<10, 512>>>(d_var, d_bool);
cudaThreadSynchronize();
cudaMemcpy(&h_var, d_var, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&h_bool, d_bool, sizeof(bool), cudaMemcpyDeviceToHost);
printf("%d\n", h_var);
printf("%d\n", h_bool);
}
cudaFree(d_var);
cudaFree(d_bool);
//cudaFree(d_out);
return 0;
}
$ nvcc -o t997 t997.cu
$ ./t997
2
0
3
0
4
0
5
0
6
0
7
0
8
0
9
0
10
0
$
The zeros are there because your code prints the h_bool variable, and the printout starts at 2 rather than 1 because your first printout comes after the kernel has been executed.
So I'm trying to copy a jagged array from host to device. First of all here is my current understanding of cudaMalloc and cudaMemcpy:
cudaMalloc takes a pointer to the pointer to the memory block.
cudaMemcpy takes a pointer to the memory block to copy to or from.
Correct me if I'm wrong please.
Now this is my code that doesn't work (compiles fine but no output):
__global__ void kernel(int** arr)
{
for (int i=0; i<3; i++)
printf("%d\n", arr[i][0]);
}
int main()
{
int arr[][3] = {{1},{2},{3}}; // 3 arrays, 1 element each
int **d_arr;
cudaMalloc((void**)(&d_arr), sizeof(int*)*3); // allocate for 3 int pointers
for (int i=0; i<3; i++)
{
cudaMalloc( (void**) &(d_arr[i]), sizeof(int) * 1 ); // allocate for 1 int in each int pointer
cudaMemcpy(d_arr[i], arr[i], sizeof(int) * 1, cudaMemcpyHostToDevice); // copy data
}
kernel<<<1,1>>>(d_arr);
cudaDeviceSynchronize();
cudaDeviceReset();
}
So what am I doing wrong here?
Cheers
I found out why: cudaMalloc and cudaMemcpy expect pointer arguments that live on the host, not on the device.
In my for-loop I was trying to fill pointers that live on the device from code that runs on the host!
The right way is to use an intermediate variable: a pointer on the host that points to memory on the device. Allocate it, fill it with integers, then copy that pointer into the jagged array (the array of pointers) on the device.
This is the correct version:
__global__ void kernel(int** arr)
{
for (int i=0; i<3; i++)
printf("%d\n", arr[i][0]);
}
int main()
{
int arr[][3] = {{1},{2},{3}}; // 3 arrays, 1 element each
int **d_arr;
cudaMalloc((void**)(&d_arr), sizeof(int*)*3); // allocate for 3 int pointers
for (int i=0; i<3; i++)
{
int* temp;
cudaMalloc( (void**) &(temp), sizeof(int) * 1 ); // allocate for 1 int in each int pointer
cudaMemcpy(temp, arr[i], sizeof(int) * 1, cudaMemcpyHostToDevice); // copy data
cudaMemcpy(d_arr+i, &temp, sizeof(int*), cudaMemcpyHostToDevice);
}
kernel<<<1,1>>>(d_arr);
cudaDeviceSynchronize();
cudaDeviceReset();
}
Your kernel calls printf(), which used to be a host-only function (device-side printf requires compute capability 2.0 or higher). Everything is OK here. ;)
cudaMemcpy((void*)d_arr, (void*)arr, sizeof(int*)*3, cudaMemcpyHostToDevice); copies the memory addresses of your arrays on the host to the device. That makes no sense, since you then have pointers to host memory on the device.
You cannot allocate 2D arrays that particular way in CUDA. See http://www.stevenmarkford.com/allocating-2d-arrays-in-cuda/.
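One common alternative to a table of device pointers, which is not what the self-answer above does, is to flatten the jagged array into a single contiguous data block plus an array of row offsets. Both blocks can then be moved to the device with two plain cudaMemcpy calls. The following host-only C++ sketch shows only the layout. FlatJagged, flatten and at are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A jagged array stored as one contiguous block plus row start offsets.
struct FlatJagged {
    std::vector<int> data;             // all rows, back to back
    std::vector<std::size_t> offsets;  // offsets[r] is where row r starts;
                                       // offsets.back() equals data.size()
};

// Pack the rows one after another and record where each one begins.
FlatJagged flatten(const std::vector<std::vector<int>>& rows) {
    FlatJagged f;
    f.offsets.push_back(0);
    for (const std::vector<int>& r : rows) {
        f.data.insert(f.data.end(), r.begin(), r.end());
        f.offsets.push_back(f.data.size());
    }
    return f;
}

// Equivalent of arr[row][col] on the flattened layout, with bounds checks.
int at(const FlatJagged& f, std::size_t row, std::size_t col) {
    assert(row + 1 < f.offsets.size());
    assert(f.offsets[row] + col < f.offsets[row + 1]);
    return f.data[f.offsets[row] + col];
}
```

A kernel would receive the two device arrays and index them exactly as at() does, so no pointer stored in device memory ever has to be written from the host.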
I am trying to implement Sauvola binarization in CUDA. For this I read the image into a 2D array on the host and allocate memory for a 2D array on the device using pitch. After allocating the memory, I try to copy the host 2D array to the device 2D array using cudaMemcpy2D. It compiles fine, but it crashes there at runtime. I cannot work out what I am missing; kindly suggest something. The code I have written is as follows:
#include "BinMain.h"
#include "Binarization.h"
#include <stdlib.h>
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
void printDevProp(cudaDeviceProp);
void CUDA_SAFE_CALL( cudaError_t);
int main()
{
//Read an IplImage in imgOriginal as grayscale
IplImage * imgOriginal = cvLoadImage("E:\\1.tiff",CV_LOAD_IMAGE_GRAYSCALE);
//Create a size variable of type CvSize for cvCreateImage Parameter
CvSize size = cvSize(imgOriginal->width,imgOriginal->height);
//create an image for storing the result image with same height and width as imgOriginal
IplImage * imgResult = cvCreateImage(size,imgOriginal->depth,imgOriginal->nChannels);
//Create a 2D array for storing the pixels value of each of the pixel of imgOriginal grayscale image
int ** arrOriginal = (int **)malloc(imgOriginal->height * sizeof(int *));
for (int i = 0; i < imgOriginal->height; i++)
{
arrOriginal[i] = (int*)malloc(imgOriginal->width * sizeof(int));
}
//Create a 2D array for storing the returned device array
int ** arrReturn = (int **)malloc(imgOriginal->height * sizeof(int *));
for (int i = 0; i < imgOriginal->height; i++)
{
arrReturn[i] = (int*)malloc(imgOriginal->width * sizeof(int));
}
//Create a CvScalar variable to copy pixel values in 2D array (arrOriginal)
CvScalar s;
//Copying the pixl values
for(int j = 0;j<imgOriginal->height;j++)
{
for(int k =0;k<imgOriginal->width;k++)
{
s = cvGet2D(imgOriginal,j,k);
arrOriginal[j][k] = s.val[0];
}
}
//Cuda Device Property
int devCount;
cudaGetDeviceCount(&devCount);
printf("CUDA Device Query...\n");
printf("There are %d CUDA devices.\n", devCount);
// Iterate through devices
for (int i = 0; i < devCount; ++i)
{
// Get device properties
printf("\nCUDA Device #%d\n", i);
cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp, i);
printDevProp(devProp);
}
//Start the clock
clock_t start = clock();
//Allocating Device memory for 2D array using pitch
size_t host_orig_pitch = imgOriginal->width * sizeof(int)* imgOriginal->height; //host original array pitch in bytes
size_t dev_pitch; //device array pitch in bytes which will be used in cudaMallocPitch
size_t dev_pitchReturn; //device return array pitch in bytes
size_t host_ret_pitch = imgOriginal->width * sizeof(int)* imgOriginal->height; //host return array pitch in bytes
int * devArrOriginal; //device 2d array of original image
int * result; //device 2d array for returned array
int dynmicRange = 128; //Dynamic Range for calculating the threshold from sauvola's formula
//Allocating memory by using cudaMallocPitch
CUDA_SAFE_CALL(cudaMallocPitch((void**)&devArrOriginal,&dev_pitch,imgOriginal->width * sizeof(int),imgOriginal->height * sizeof(int)));
//Allocating memory for returned array
CUDA_SAFE_CALL(cudaMallocPitch((void**)&result,&dev_pitchReturn,imgOriginal->width * sizeof(int),imgOriginal->height * sizeof(int)));
//Copying 2D array from host memory to device mempry by using cudaMemCpy2D
CUDA_SAFE_CALL(cudaMemcpy2D((void*)devArrOriginal,dev_pitch,(void*)arrOriginal,host_orig_pitch,imgOriginal->width * sizeof(float),imgOriginal->height,cudaMemcpyHostToDevice));
int windowSize = 19; //Size of the window for calculating mean and variance
//Launching the kernel by calling myKernelLauncher function.
myKernelLauncher(devArrOriginal,result,windowSize,imgOriginal->width,imgOriginal->height,dev_pitch,dynmicRange);
//Calling the sauvola binarization function by passing the parameters as
//1.arrOriginal 2D array 2.Original image height 3.Original image width
//int ** result = AdaptiveBinarization(arrOriginal,imgOriginal->height,imgOriginal->width);//binarization(arrOriginal,imgOriginal->width,imgOriginal->height);
//
CUDA_SAFE_CALL(cudaMemcpy2D(arrReturn,host_ret_pitch,result,dev_pitchReturn,imgOriginal->width * sizeof(int),imgOriginal->height * sizeof(int),cudaMemcpyDeviceToHost));
//create a CvScalar variable to set the data in imgResult
CvScalar ss;
//Copy the pixel values from returned array to imgResult
for(int i=0;i<imgOriginal->height;i++)
{
for(int j=0;j<imgOriginal->width;j++)
{
ss = cvScalar(arrReturn[i][j]*255);
cvSet2D(imgResult,i,j,ss);
//k++; //No need for k if returned array is 2D
}
}
printf("Done \n");
//calculate and print the time elapsed
printf("Time elapsed: %f\n", ((double)clock() - start) / CLOCKS_PER_SEC);
//Create a windoe and show the resule image
cvNamedWindow("Result",CV_WINDOW_AUTOSIZE);
cvShowImage("Result",imgResult);
cvWaitKey(0);
getch();
//Release the various resources
cvReleaseImage(&imgResult);
cvReleaseImage(&imgOriginal);
cvDestroyWindow("Result");
for(int i = 0; i < imgOriginal->height; i++)
free(arrOriginal[i]);
free(arrOriginal);
free(result);
cudaFree(&devArrOriginal);
cudaFree(&result);
}
// Print device properties
void printDevProp(cudaDeviceProp devProp)
{
printf("Major revision number: %d\n", devProp.major);
printf("Minor revision number: %d\n", devProp.minor);
printf("Name: %s\n", devProp.name);
printf("Total global memory: %u\n", devProp.totalGlobalMem);
printf("Total shared memory per block: %u\n", devProp.sharedMemPerBlock);
printf("Total registers per block: %d\n", devProp.regsPerBlock);
printf("Warp size: %d\n", devProp.warpSize);
printf("Maximum memory pitch: %u\n", devProp.memPitch);
printf("Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
for (int i = 0; i < 3; ++i)
printf("Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
for (int i = 0; i < 3; ++i)
printf("Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
printf("Clock rate: %d\n", devProp.clockRate);
printf("Total constant memory: %u\n", devProp.totalConstMem);
printf("Texture alignment: %u\n", devProp.textureAlignment);
printf("Concurrent copy and execution: %s\n", (devProp.deviceOverlap ? "Yes" : "No"));
printf("Number of multiprocessors: %d\n", devProp.multiProcessorCount);
printf("Kernel execution timeout: %s\n", (devProp.kernelExecTimeoutEnabled ? "Yes" : "No"));
return;
}
/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
cudaError_t ret = call;
switch(ret)
{
case cudaSuccess:
break;
default :
{
printf(" ERROR at line :%i.%d' ' %s\n",
__LINE__,ret,cudaGetErrorString(ret));
exit(-1);
break;
}
}
}
The flow of the code is as follows:
1. Create a 2D array on the host from the image, and another array for the result returned from the kernel.
2. Allocate memory for a 2D array on the device using cudaMallocPitch.
3. Allocate memory for a 2D array which will be returned by the kernel.
4. Copy the original 2D array from the host to the device array using cudaMemcpy2D.
5. Launch the kernel.
6. Copy the returned device array to the host array using cudaMemcpy2D.
The program crashes when it reaches the 4th point. It is an unhandled exception stating "Unhandled exception at 0x773415de in SauvolaBinarization_CUDA_OpenCV.exe: 0xC0000005: Access violation reading location 0x01611778."
I think the problem must be in the memory allocation, but I am using this function for the first time and have no idea how it works. Kindly suggest.
First of all, you're not calling "cudaMallocPitch" properly. The "height" parameter should represent the number of rows, so instead of :
imgOriginal->height * sizeof(int)
you should simply use:
imgOriginal->height
This is fine because the number of bytes per row is already contained in the "pitch" property. The main problem, however, lies with the way you allocate the memory for the host image. When you write:
//Create a 2D array for storing the pixels value of each of the pixel of imgOriginal grayscale image
int ** arrOriginal = (int **)malloc(imgOriginal->height * sizeof(int *));
for (int i = 0; i < imgOriginal->height; i++)
{
arrOriginal[i] = (int*)malloc(imgOriginal->width * sizeof(int));
}
you are effectively creating an array of pointers to arrays. The CUDA API call that you're making:
CUDA_SAFE_CALL(cudaMemcpy2D((void*)devArrOriginal,dev_pitch,(void*)arrOriginal,host_orig_pitch,imgOriginal->width * sizeof(float),imgOriginal->height,cudaMemcpyHostToDevice));
expects that the input memory buffer is contiguous. So here's what will happen: the first row from the input image (totalling "imgOriginal->width * sizeof(float)" bytes) will be read starting with the address:
(void*)arrOriginal
However, the amount of valid data you have starting at that address is only "imgOriginal->height * sizeof(int *)" bytes. The two byte counts are very likely to be different, which will lead to the crash because you will end up reading from an unknown location.
To solve this, consider allocating "arrOriginal" as one contiguous block, such as:
int * arrOriginal = (int *)malloc(imgOriginal->height * imgOriginal->width * sizeof(int));
Also, in this case, your pitch should be:
"imgOriginal->width * sizeof(int)"
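If you want to keep row/column style access after switching to one contiguous block, a small wrapper is enough. The following host-only C++ sketch is an assumption about how that could look, not part of the original code. HostImage and its members are hypothetical names and no CUDA or OpenCV API is called.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A grayscale image held as one contiguous row-major block of ints.
struct HostImage {
    int width;
    int height;
    std::vector<int> pixels;  // width * height elements, no gaps between rows

    HostImage(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {
        assert(w > 0 && h > 0);
    }

    // Replaces arrOriginal[row][col] from the array-of-pointers version.
    int& at(int row, int col) {
        assert(row >= 0 && row < height && col >= 0 && col < width);
        return pixels[static_cast<std::size_t>(row) * width + col];
    }

    // Bytes from the start of one row to the start of the next.
    std::size_t pitch_bytes() const {
        return static_cast<std::size_t>(width) * sizeof(int);
    }

    // Start of the contiguous block.
    int* data() { return pixels.data(); }
};
```

With such a wrapper, data() and pitch_bytes() are the values you would pass as the host pointer and host pitch to cudaMemcpy2D, with width * sizeof(int) as the copy width and height as the row count.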