CUDA MemcpyHostToDevice


typedef struct {
    int M;
    int N;
    int records[NMAX][SZM];
    int times[NMAX];
    bool prime[NMAX];
} DATASET;
typedef int ITEMSET[SZM];
__device__ DATASET d_db;
DATASET db;
int main(void) {
    loadDB();
    cudaMemcpy(&d_db, &db, sizeof(DATASET), cudaMemcpyHostToDevice);
    ...
I have a device variable d_db and a variable db on the host. After I load some values into db, I want to copy it to the device. The code compiles without errors, but when I run it I get warnings about the cache, and sometimes the PC restarts. What am I doing wrong?

With __device__ variables you need to use cudaMemcpyToSymbol and cudaMemcpyFromSymbol instead of cudaMemcpy.
So in my case I have to use
cudaMemcpyToSymbol(d_db, &db, sizeof(DATASET));

Related

How does CUDA's cudaMemcpyFromSymbol work?

I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used for the actual storage of the variable. The presence of the host variable also allows the host compiler to successfully parse the program. When the device program executes, it operates on the PTX variable when it manipulates the variable by name.
If you wrote a program to print the address of a __device__ variable, the address would differ depending on whether you printed it out from the host or device:
#include <cstdio>
__device__ int device_variable = 13;
__global__ void kernel()
{
    printf("device_variable address from device: %p\n", &device_variable);
}
int main()
{
    printf("device_variable address from host: %p\n", &device_variable);
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc test_device.cu -run
device_variable address from host: 0x65f3e8
device_variable address from device: 0x403ee0000
Since neither processor agrees on the address of the variable, that makes copying to it problematic, and indeed __host__ functions are not allowed to access __device__ variables directly:
__device__ int device_variable;
int main()
{
    device_variable = 13;
    return 0;
}
$ nvcc warning.cu
error.cu(5): warning: a __device__ variable "device_variable" cannot be directly written in a host function
cudaMemcpyFromSymbol allows copying data back from a __device__ variable, provided the programmer happens to know the (mangled) name of the variable in the source program.
cudafe facilitates this by creating a mapping from mangled names to the device addresses of variables at program initialization time. The program discovers the device address of each variable by querying the CUDA driver for a driver token given its mangled name.
So the implementation of cudaMemcpyFromSymbol would look something like this in pseudocode:
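std::map<std::string, void*> names_to_addresses;  // mangled name -> device address
cudaError_t cudaMemcpyFromSymbol(void* dst, const char* symbol, size_t count, size_t offset, cudaMemcpyKind kind)
{
    // look up the device address registered for this symbol
    char* ptr = static_cast<char*>(names_to_addresses[symbol]);
    // pointer arithmetic needs a byte pointer, not void*
    return cudaMemcpy(dst, ptr + offset, count, kind);
}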
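The lookup idea in that pseudocode can be tried without a GPU. The following is a host-only toy model in C++, not the CUDA runtime: SymbolRegistry, register_variable and memcpy_from_symbol are hypothetical names invented for this sketch, and std::memcpy merely stands in for the device-to-host transfer that the real runtime performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

// Toy stand-in for the runtime's symbol table (hypothetical, host-only).
class SymbolRegistry {
public:
    // Record where the storage for `name` lives and how large it is.
    void register_variable(const std::string& name, void* address, std::size_t size) {
        assert(address != nullptr);
        entries_[name] = Entry{address, size};
    }

    // Copy `count` bytes starting `offset` bytes into the named variable.
    // Returns false for an unknown name or an out-of-range request.
    bool memcpy_from_symbol(void* dst, const std::string& name,
                            std::size_t count, std::size_t offset = 0) const {
        auto it = entries_.find(name);
        if (it == entries_.end()) return false;
        const Entry& e = it->second;
        if (offset > e.size || count > e.size - offset) return false;
        // std::memcpy stands in for the device-to-host transfer.
        std::memcpy(dst, static_cast<const char*>(e.address) + offset, count);
        return true;
    }

private:
    struct Entry {
        void* address;
        std::size_t size;
    };
    std::map<std::string, Entry> entries_;
};
```

In this model, registering a variable once at start-up plays the role of the registration call that cudafe inserts, and every later copy goes through the name rather than through an address the caller holds.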
If you look at the output of nvcc --keep, you can see for yourself the way that the program interacts with special CUDART APIs that are not normally available to create the mapping:
$ nvcc --keep test_device.cu
$ grep device_variable test_device.cudafe1.stub.c
static void __nv_cudaEntityRegisterCallback( void **__T22) { __nv_dummy_param_ref(__T22); __nv_save_fatbinhandle_for_managed_rt(__T22); __cudaRegisterEntry(__T22, ((void ( *)(void))kernel), _Z6kernelv, (-1)); __cudaRegisterVariable(__T22, __shadow_var(device_variable,::device_variable), 0, 4, 0, 0); }
If you inspect the output, you can see that cudafe has inserted a call to __cudaRegisterVariable to create the mapping for device_variable. Users should not attempt to use this API themselves.

Reduce by key on device array

I am using reduce_by_key to count the elements of an int2 array that share the same first value.
For example:
Array: <1,2> <1,3> <1,4> <2,5> <2,7>
Here 3 elements have 1 as their first value and 2 elements have 2.
Code:
struct compare_int2 : public thrust::binary_function<int2, int2, bool> {
    __host__ __device__ bool operator()(const int2 &a, const int2 &b) const {
        return (a.x == b.x);
    }
};
compare_int2 cmp;
int main()
{
    int n, i;
    scanf("%d", &n);
    int2 *h_list = (int2 *) malloc(sizeof(int2)*n);
    int *h_ones = (int *) malloc(sizeof(int)*n);
    int2 *d_list, *C;
    int *d_ones, *D;
    cudaMalloc((void **)&d_list, sizeof(int2)*n);
    cudaMalloc((void **)&d_ones, sizeof(int)*n);
    cudaMalloc((void **)&C, sizeof(int2)*n);
    cudaMalloc((void **)&D, sizeof(int)*n);
    for (i = 0; i < n; i++)
    {
        int2 p;
        printf("Value ? ");
        scanf("%d %d", &p.x, &p.y);
        h_list[i] = p;
        h_ones[i] = 1;
    }
    cudaMemcpy(d_list, h_list, sizeof(int2)*n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ones, h_ones, sizeof(int)*n, cudaMemcpyHostToDevice);
    thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D, cmp);
    return 0;
}
The above code fails with a segmentation fault. I ran it under gdb, which reported the segfault at this location:
thrust::system::detail::internal::scalar::reduce_by_key >
(keys_first=0x1304740000,keys_last=0x1304740010,values_first=0x1304740200,keys_output=0x1304740400, values_output=0x1304740600,binary_pred=...,binary_op=...)
at /usr/local/cuda-6.5/bin/../targets/x86_64-linux/include/thrust/system/detail/internal/scalar/reduce_by_key.h:61
61 InputKeyType temp_key = *keys_first
How do I use reduce_by_key on device arrays?

Thrust interprets ordinary pointers as pointing to data on the host:
thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp);
Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started guide:
You may wonder what happens when a "raw" pointer is used as an argument to a Thrust function. Like the STL, Thrust permits this usage and it will dispatch the host path of the algorithm. If the pointer in question is in fact a pointer to device memory then you'll need to wrap it with thrust::device_ptr before calling the function.
Thrust has a variety of mechanisms (e.g. device_ptr, device_vector, and execution policy) to identify to the algorithm that the data is device-resident and the device path should be used.
The simplest modification for your existing code might be to use device_ptr:
#include <thrust/device_ptr.h>
...
thrust::device_ptr<int2> dlistptr(d_list);
thrust::device_ptr<int> donesptr(d_ones);
thrust::device_ptr<int2> Cptr(C);
thrust::device_ptr<int> Dptr(D);
thrust::reduce_by_key(dlistptr, dlistptr+n, donesptr, Cptr, Dptr,cmp);
The issue described above is similar to another issue you asked about.

passing command line argument in c as parameters

I'm trying to write a function that assigns values to the members of a structure.
#include <stdio.h>
#include <string.h>
typedef struct
{
    int id;
    char *data;
} person_t;
person_t person_build(int id, char *data);
int main (int argc, char *argv[])
{
    person_t person = person_build(atoi(argv[1]), argv[2]);
    return 0;
}
person_t person_build(int id, char *data)
{
    person_t person;
    person.id = id;
    strcpy(person.data, data);
    return person;
}
This program compiled successfully.
I ran the program and passed the command-line arguments to the person_build() function as parameters:
>struct5.exe 4 Something
The operating system (Windows 7) warns me that this program has stopped working,
but when I run it without any command-line arguments (passing person_build() something other than the command-line arguments), the program works.
Can someone explain why this happens?

Your program is not working because you are accessing memory structures that you have not initialized. Specifically:
typedef struct
{
int id;
char *data;
}person_t;
This creates a structure that has a char * as a member. That char * allocates no actual memory, it simply reserves a member in the structure that can hold a memory address that should point to a value. Later, you:
strcpy(person.data, data);
You are now copying data into the memory location that person.data points at even though you have never allocated memory or initialized person.data.
You could take this approach:
person_t person_build(int id, char *data)
{
    person_t person;
    person.id = id;
    person.data = malloc(sizeof(char) * strlen(data) + 1);
    if(person.data != NULL) strcpy(person.data, data);
    return person;
}
This allocates memory of the proper size, accounting for null termination at the end of the string, verifies that the allocation was successful and only then will it attempt to copy into that memory.
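Two of the remaining obstacles are releasing the allocated copy and checking argc before reading argv[1] and argv[2]. A minimal sketch of both follows, written as C++ that uses the C library functions from the question. person_free and person_from_args are hypothetical helper names that do not appear in the original program.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

struct person_t {
    int id;
    char* data;  // owned copy of the string, or nullptr if allocation failed
};

// Build a person whose `data` member owns a heap copy of `data`.
person_t person_build(int id, const char* data) {
    assert(data != nullptr);
    person_t person;
    person.id = id;
    // +1 leaves room for the terminating '\0'.
    person.data = static_cast<char*>(std::malloc(std::strlen(data) + 1));
    if (person.data != nullptr) {
        std::strcpy(person.data, data);
    }
    return person;
}

// Release the copy made by person_build (hypothetical helper).
void person_free(person_t* person) {
    std::free(person->data);
    person->data = nullptr;
}

// Validate argc before touching argv[1] and argv[2] (hypothetical helper).
// Returns false when arguments are missing or the allocation fails.
bool person_from_args(int argc, char* argv[], person_t* out) {
    if (argc < 3) {
        return false;
    }
    *out = person_build(std::atoi(argv[1]), argv[2]);
    return out->data != nullptr;
}
```

A main function would call person_from_args(argc, argv, &person), report a usage message when it returns false, and call person_free(&person) before exiting.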
This is far from complete. I think you may have many more obstacles yet to overcome!

cuModuleGetFunction don't accepts simple kernel names, only ".entry"-tags from .ptx-file

I convert my .cu files using CUDA_COMPILE_PTX from findPackageCUDA.cmake. When I try to get the function pointers to my kernels, I face the following problem:
My kernel named Kernel1 can only be loaded correctly via cuModuleGetFunction if I use its .entry label from the resulting .ptx file, e.g. _Z7Kernel1Pj.
The problem is that this label may change each time I recompile my .cu files. That is not workable if I reference the kernels by name in a constant char*.

_Z7Kernel1Pj is a C++ mangled name. If you want a simple symbol, you can use extern "C":
extern "C" __global__ void Kernel1(...)
For example, the default CUDA Visual Studio project contains the kernel
__global__ void addKernel(int *c, const int *a, const int *b)
If you run cuobjdump -symbols on this you will see the mangled symbol name
STT_FUNC STB_GLOBAL _Z9addKernelPiPKiS1_
If you use extern "C"
extern "C" __global__ void addKernel(int *c, const int *a, const int *b)
the symbol name will now be
STT_FUNC STB_GLOBAL addKernel
Using extern "C" means giving up function overloading and namespaces for that kernel.
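The mangled labels shown above follow the Itanium C++ ABI scheme, which encodes the function name and its parameter types, so a label changes when the signature changes rather than on every rebuild. As a sketch of how to inspect such a label on the host, the helper below (demangle is a hypothetical name) wraps abi::__cxa_demangle from <cxxabi.h>. It assumes a GCC or Clang toolchain and is not available with MSVC.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium C++ ABI symbol name (GCC/Clang only).
// Returns the input unchanged when demangling fails.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);  // __cxa_demangle allocates the result with malloc
    return result;
}
```

On such a toolchain, demangle("_Z7Kernel1Pj") yields Kernel1(unsigned int*), which shows that the Pj suffix is just the encoded parameter list.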

C++: Explicit DLL Loading: First-chance Exception on non "extern C" functions

I am having trouble importing my C++ functions. If I declare them as C functions I can import them successfully. With explicit loading, if any of the functions is missing the extern "C" decoration, I get the following exception:
First-chance exception at 0x00000000 in cpp.exe: 0xC0000005: Access violation.
DLL.h:
extern "C" __declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
DLL.cpp:
#include "DLL.h"
int addC(int a, int b) {
    return a + b;
}
int addCpp(int a, int b) {
    return a + b;
}
main.cpp:
#include "../DLL/DLL.h"
#include <stdio.h>
#include <windows.h>
int main() {
    int a = 2;
    int b = 1;
    typedef int (*PFNaddC)(int,int);
    typedef int (*PFNaddCpp)(int,int);
    HMODULE hDLL = LoadLibrary(TEXT("../Debug/DLL.dll"));
    if (hDLL != NULL)
    {
        PFNaddC pfnAddC = (PFNaddC)GetProcAddress(hDLL, "addC");
        PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "addCpp");
        printf("a=%d, b=%d\n", a,b);
        printf("pfnAddC: %d\n", pfnAddC(a,b));
        printf("pfnAddCpp: %d\n", pfnAddCpp(a,b)); //EXCEPTION ON THIS LINE
    }
    getchar();
    return 0;
}
How can I import c++ functions for dynamic loading? I have found that the following code works with implicit loading by referencing the *.lib, but I would like to learn about dynamic loading.
Thank you to all in advance.
Update:
dumpbin /exports
1 00011109 ?addCpp@@YAHHH@Z = @ILT+260(?addCpp@@YAHHH@Z)
2 00011136 addC = @ILT+305(_addC)
Solution:
Create a conversion struct as found here.
Take a look at the file's exports and explicitly copy the C++ mangled name:
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "?addCpp@@YAHHH@Z");

Inevitably, the access violation on the null pointer is because GetProcAddress() returns null on error.
The problem is that C++ names are mangled by the compiler to accommodate a variety of C++ features (namespaces, classes, and overloading, among other things). So, your function addCpp() is not really named addCpp() in the resulting library. When you declare the function with extern "C", you give up overloading and the option of putting the function in a namespace, but in return you get a function whose name is not mangled, and which you can call from C code (which doesn't know anything about name mangling.)
One option to get around this is to export the functions using a .def file to rename the exported functions. There's an article, Explicitly Linking to Classes in DLLs, that describes what is necessary to do this.
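Because the crash comes from calling through a null pointer, it helps to check the lookup result before the call. The sketch below is portable C++ rather than Windows code: checked_symbol, GenericFn, add_ints and fake_lookup are all hypothetical names, with fake_lookup standing in for a loader call such as GetProcAddress so that the example runs anywhere.

```cpp
#include <stdexcept>
#include <string>

// Convert a raw symbol lookup result to the wanted function-pointer type,
// throwing instead of handing back a null pointer (hypothetical helper).
template <typename Fn, typename Raw>
Fn checked_symbol(Raw raw, const char* name) {
    if (raw == nullptr) {
        throw std::runtime_error(std::string("symbol not found: ") + name);
    }
    return reinterpret_cast<Fn>(raw);
}

// Generic function-pointer type standing in for the loader's return type.
using GenericFn = void (*)();

// Stand-in for a function exported by a library.
int add_ints(int a, int b) { return a + b; }

// Stand-in for a loader lookup: knows one exported name, null otherwise.
GenericFn fake_lookup(const std::string& name) {
    if (name == "addC") {
        return reinterpret_cast<GenericFn>(&add_ints);
    }
    return nullptr;
}
```

With the real loader, the first argument would be the value returned by GetProcAddress(hDLL, "addCpp"), and the lookup of the unmangled name would then fail with a readable error instead of an access violation at address 0.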

It's possible to just wrap a whole header file in extern "C" as follows. Then you don't need to worry about forgetting an extern "C" on one of your declarations.
#ifdef __cplusplus
extern "C" {
#endif
__declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
#ifdef __cplusplus
} /* extern "C" */
#endif
You can still use all of the C++ features that you're used to in the function bodies -- these functions are still C++ functions -- they just have restrictions on the prototypes to make them compatible with C code.
