CUDA plugin dlopen - cuda

I've written a CUDA plugin (a dynamic library), and I have a C program that loads it with dlopen() and uses dlsym() to get the plugin's functions. For my application it is very important that each time the plugin is loaded, the program gets a fresh handle from dlopen(), because the library file may have been modified in the meantime.
Therefore, after using the plugin's functions I call dlclose(). The dlopen() - dlsym() - dlclose() sequence runs repeatedly in a loop during program execution.
When I run on a computer with NVIDIA driver 256.35 (CUDA 3.0 or 3.1), I get a memory leak (I call cudaMemGetInfo() in my plugin for diagnostics).
When I run on a computer with NVIDIA driver 195.36.15 (CUDA 3.0), I get an error after the program has been running for some time: “NVIDIA: could not open the device file /dev/nvidia0 (Too many open files).”
If I don't call dlclose(), the program works fine, but then I can't replace the plugin with a new one while the program is running.
Has anyone encountered this problem?
Thanks.
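For reference, the load / call / unload cycle described above can be sketched as follows. This is only an illustration, not the poster's code: the helper name run_plugin_once, the type plugin_entry_fn, and the assumption that the plugin exports a function taking no arguments and returning int are all hypothetical.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical entry-point signature; the real plugin's interface is not shown in the post.
typedef int (*plugin_entry_fn)(void);

// Loads the library at `path`, calls the exported function `symbol` once,
// stores its return value in *result, and unloads the library again.
// Returns 0 on success and -1 on any failure.
int run_plugin_once(const char *path, const char *symbol, int *result)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    dlerror();  // clear any stale error state before dlsym()
    void *sym = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || sym == NULL) {
        std::fprintf(stderr, "dlsym failed: %s\n", err ? err : "symbol is NULL");
        dlclose(handle);
        return -1;
    }

    plugin_entry_fn entry = reinterpret_cast<plugin_entry_fn>(sym);
    *result = entry();

    // dlclose() only drops one reference; whether the library is really
    // unmapped depends on what else still holds it.
    if (dlclose(handle) != 0) {
        std::fprintf(stderr, "dlclose failed: %s\n", dlerror());
        return -1;
    }
    return 0;
}
```

Calling this helper in a loop gives the dlopen() - dlsym() - dlclose() pattern from the question. On Linux with an older glibc, link with -ldl.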

Has nobody written plugins with CUDA?
I've found a similar example in the CUDA SDK: matrixMulDynlinkJIT. I made a small change to the code. In particular, in the file cuda_drvapi_dynlink.c I modified the cuInit() function:
CUDADRIVER CudaDrvLib = NULL;
CUresult CUDAAPI cuInit(unsigned int Flags)
{
    //CUDADRIVER CudaDrvLib;
    CUresult result;
    int driverVer;
    if (CudaDrvLib != NULL) {
        dlclose (CudaDrvLib);
        CudaDrvLib = NULL;
    }
    .......
}
And in the file matrixMulDynlinkJIT.cpp I added a loop to the main() function:
int main(int argc, char** argv)
{
    printf("[ %s ]\n", sSDKsample);
    while (1) {
        // initialize CUDA
        CUfunction matrixMul = NULL;
        cutilDrvSafeCallNoSync(initCUDA(&matrixMul, argc, argv));
        .....
    }//while (1)
    cutilExit();
}
So I get the same problem as in my program (after running for some time): “NVIDIA: could not open the device file /dev/nvidia0 (Too many open files).”
But when I comment out the dlclose() call in cuda_drvapi_dynlink.c, everything works fine.
I can't understand this behavior.
Any ideas?
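Since the failure is reported as "Too many open files", one way to watch for a descriptor leak is to count the process's open file descriptors before and after each load/unload iteration. Below is a minimal Linux-only sketch that assumes /proc is mounted. The helper names count_open_fds and report_open_fds are hypothetical and are not part of CUDA or the SDK sample.

```cpp
#include <dirent.h>
#include <cstdio>

// Counts the entries in /proc/self/fd, i.e. the file descriptors currently
// open in this process. The count includes the descriptor that opendir()
// itself uses, so compare readings against each other rather than reading
// them as absolute values. Returns -1 if /proc/self/fd cannot be opened.
int count_open_fds()
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL) {
        return -1;
    }
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.') {  // skip "." and ".."
            ++count;
        }
    }
    closedir(dir);
    return count;
}

// Prints the current count with a label, e.g. once per loop iteration.
void report_open_fds(const char *label)
{
    std::printf("%s: %d open file descriptors\n", label, count_open_fds());
}
```

If the number grows by a fixed amount on every iteration of the loop, something opened during initialization is not being closed when the library is unloaded.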

Related

Unable to create a thrust device vector

I'm getting started with GPU programming and using the Thrust library to simplify things.
I have created a test program to see how it works. However, whenever I try to create a thrust::device_vector with non-zero size, the program crashes with "Run-time Check Failure #3 - The variable 'result' is being used without being initialized." (this comes from the allocator_traits.inl file), and I have no idea how to fix this.
The following is all that is needed to cause this error.
#include <thrust/device_vector.h>
int main()
{
    int N = 100;
    thrust::device_vector<int> d_a(N);
    return 0;
}
I suspect it may be a problem with how the environment is set up, so here are the details:
Created using Visual Studio 2019, in a CUDA 11.0 Runtime project (the example program given when opening this project works fine, however), with Thrust version 1.9; the GPU is a GTX 970.
This issue only seems to manifest with the Thrust version (1.9.x) associated with CUDA 11.0, and only in debug projects on Windows/Visual Studio.
Some workarounds would be to switch to building a release project, or just click "Ignore" on the dialogs that appear at runtime. According to my testing this allows ordinary run or debug at that point.
I have not confirmed it, but I believe this issue is fixed in the latest thrust (1.10.x) just released (although not part of any formal CUDA release at this moment, I would expect it to be part of some future CUDA release).
Following Robert Crovella's answer, I fixed this issue by replacing the corresponding lines of code in the Thrust library with the code from GitHub. More precisely, in the file ...\CUDA\v11.1\include\thrust\detail\allocator\allocator_traits.inl I replaced the following function
template<typename Alloc>
__host__ __device__
  typename disable_if<
    has_member_system<Alloc>::value,
    typename allocator_system<Alloc>::type
  >::type
    system(Alloc &)
{
  // return a copy of a default-constructed system
  typename allocator_system<Alloc>::type result;
  return result;
}
by
template<typename Alloc>
__host__ __device__
  typename disable_if<
    has_member_system<Alloc>::value,
    typename allocator_system<Alloc>::type
  >::type
    system(Alloc &)
{
  // return a copy of a default-constructed system
  return typename allocator_system<Alloc>::type();
}
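
The two versions differ in how the returned object is created: the first default-initializes a named local and copies it, the second returns a value-initialized temporary. The first form appears to be what the run-time check in the question objects to. Below is a small standalone sketch of that distinction using stand-in types. example_system_tag, pod_with_member and both function names are hypothetical and are not Thrust code.

```cpp
// Stand-in for an empty tag type (hypothetical).
struct example_system_tag {};

// Stand-in for a type with a data member, to make the difference observable.
struct pod_with_member { int value; };

// Old pattern: a named local with no initializer, then a copy of it.
// For types without a user-provided constructor, members are left
// indeterminate, so do not instantiate this with scalar or POD types
// and then read the result.
template <typename System>
System copy_via_named_local()
{
    System result;
    return result;
}

// New pattern: a value-initialized temporary. Scalars and members of types
// without a user-provided constructor are zero-initialized.
template <typename System>
System copy_via_value_init()
{
    return System();
}
```

For an empty tag type both forms produce an equivalent object; the second simply never creates a variable that a debug-build check could see as uninitialized.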

How do you disable CRT Exceptions?

I don't use exception handling in any of my libraries or applications. Everything is built with Exception handling off and Windows applications using MS Visual Studio are linked with nothrownew.obj. The code handles any failures itself. The fact that the CRT is out behind my back throwing exceptions and terminating the program instead of returning an error result is a problem.
For example, this crashed when provided a drive letter that no longer existed instead of just returning the errcode.
if (_tgetdcwd(driveltr ? DriveLtrToDOSDriveNum(driveltr) : 0, dirout, diroutbufsize)==NULL) {
    errcode=GetErrorFromErrno();
}
For now on Windows I've used the _set_invalid_parameter_handler() to prevent the nasty issue above. Goes something like this:
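#include <stdlib.h>
#include <tchar.h>

// Empty handler: the CRT calls this instead of terminating the program,
// and the failing function then returns its error code.
void invalid_parameter_function(const wchar_t * expression, const wchar_t * function, const wchar_t * file, unsigned int line, uintptr_t pReserved)
{
}
int _tmain(int argc, const TCHAR *argv[])
{
    _set_invalid_parameter_handler(invalid_parameter_function);
    return 0;
}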
Is there a way to ensure the CRT doesn't throw exceptions, for both MS Visual Studio 2017 and the CRT used with g++? If not, and I must set up a handler as I did above for Windows, does g++ have an equivalent?
TIA!!

Online compilation of single CUDA function

I have a function in my program called float valueAt(float3 v). It's supposed to return the value of a function at the given point. The function is user-specified. I have an interpreter for this function at the moment, but others recommended I compile the function online so it's in machine code and is faster.
How do I do this? I believe I know how to load the function when I have PTX generated, but I have no idea how to generate the PTX.
CUDA provides no way of runtime compilation of non-PTX code.
What you want can be done, but not using the standard CUDA APIs. PyCUDA provides an elegant just-in-time compilation method for CUDA C code which includes behind the scenes forking of the toolchain to compile to device code and loading using the runtime API. The (possible) downside is that you need to use Python for the top level of your application, and if you are shipping code to third parties, you might need to ship a working Python distribution too.
The only other alternative I can think of is OpenCL, which does support runtime compilation (that is all it supported until recently). The C99 language base is a lot more restrictive than what CUDA offers, and I find the APIs to be very verbose, but the runtime compilation model works well.
I've thought about this problem for a while, and while I don't think this is a "great" solution, it does seem to work so I thought I would share it.
The basic idea is to use Linux to spawn processes that compile and then run the compiled code. I think this is pretty much a no-brainer, but since I put together the pieces, I'll post instructions here in case it's useful for somebody else.
The problem statement in the question is to be able to take a file that contains a user-defined function, let's assume it is a function of a single variable f(x), i.e. y = f(x), and that x and y can be represented by float quantities.
The user would edit a file called fx.txt that contains the desired function. This file must conform to C syntax rules.
fx.txt:
y=1/x
This file then gets included in the __device__ function that will be holding it:
user_testfunc.cuh:
__device__ float fx(float x){
    float y;
#include "fx.txt"
    ;
    return y;
}
which gets included in the kernel that is called via a wrapper.
cudalib.cu:
#include <math.h>
#include "cudalib.h"
#include "user_testfunc.cuh"
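__global__ void my_kernel(float x, float *y){
    *y = fx(x);
}
float cudalib_compute_fx(float x){
    float h_y = 0.0f;   // host copy of the result
    float *d_y = NULL;  // device buffer for the result
    cudaMalloc(&d_y, sizeof(float));
    my_kernel<<<1,1>>>(x, d_y);
    // cudaMemcpy on the default stream waits for the kernel to finish
    cudaMemcpy(&h_y, d_y, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_y);      // release the device buffer so repeated calls do not leak
    return h_y;
}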
cudalib.h:
float cudalib_compute_fx(float x);
The above files get built into a shared library:
nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so
We need a main application to use this shared library.
t452.cu:
#include <stdio.h>
#include <stdlib.h>
#include "cudalib.h"
int main(int argc, char* argv[]){
    if (argc == 1){
        // recompile lib, and spawn new process
        int retval = system("nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so");
        char scmd[128];
        // snprintf cannot overrun scmd if argv[0] is long
        snprintf(scmd, sizeof(scmd), "%s skip", argv[0]);
        retval = system(scmd);
    }
    else { // compute f(x) at x = 2.0
        printf("Result is: %f\n", cudalib_compute_fx(2.0));
    }
    return 0;
}
Which is compiled like this:
nvcc -arch=sm_20 -o t452 t452.cu -L. -lmycudalib
At this point, the main application (t452) can be executed and it will produce the result of f(2.0) which is 0.5 in this case:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 0.500000
The user can then modify the fx.txt file:
$ vi fx.txt
$ cat fx.txt
y = 5/x
And just re-run the app, and the new functional behavior is used:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 2.500000
This method takes advantage of the fact that upon recompilation/replacement of a shared library, a new Linux process will pick up the new shared library. Also note that I've omitted several kinds of error checking for clarity. At a minimum I would check CUDA errors, and I would also probably delete the shared object (.so) library before recompiling it, and then test for its existence after compilation, as a basic check that the compilation succeeded.
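The delete / recompile / check-for-existence step suggested above could look roughly like this. It is a sketch only: rebuild_output and file_exists are hypothetical helper names, and the build command is passed in as a plain string rather than hard-coded.

```cpp
#include <cstdio>
#include <cstdlib>

// Returns true if `path` can be opened for reading.
static bool file_exists(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (f == NULL) {
        return false;
    }
    std::fclose(f);
    return true;
}

// Removes any stale copy of `output_path`, runs `build_cmd` through the
// shell, and reports success only if the command returned 0 and the
// output file exists afterwards.
bool rebuild_output(const char *build_cmd, const char *output_path)
{
    std::remove(output_path);          // fine if the file was not there
    if (file_exists(output_path)) {
        return false;                  // stale file could not be removed
    }
    if (std::system(build_cmd) != 0) {
        return false;                  // compiler (or shell) reported failure
    }
    return file_exists(output_path);   // basic check that output was produced
}
```

In the t452 example this would wrap the nvcc command line, with libmycudalib.so as the output path, and the second process would only be spawned when the function returns true.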
This method uses only the runtime API, so the user would have to have the CUDA toolkit installed on their machine and set up so that nvcc is available in the PATH. Using the driver API with PTX code would make this process much cleaner (and not require the toolkit on the user's machine), but AFAIK there is no way to generate PTX from CUDA C without using nvcc or a user-created toolchain built on the NVIDIA LLVM compiler tools. In the future, there may be a more "integrated" approach available in the "standard" CUDA C toolchain, or perhaps even by the driver.
A similar approach can be arranged using separate compilation and linking of device code, such that the only source code that needs to be exposed to the user is in user_testfunc.cu (and fx.txt).
EDIT: There is now a CUDA runtime compilation facility, which should be used in place of the above.
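With a runtime compilation facility, the source is handed over as a string in memory rather than as files on disk. Below is a minimal sketch of assembling the same wrapper that user_testfunc.cuh builds around fx.txt. make_device_source is a hypothetical helper, and the compiler calls that would consume the string are deliberately not shown.

```cpp
#include <string>

// Wraps a user-supplied C statement such as "y=1/x" in the same
// __device__ function that user_testfunc.cuh produces via #include.
// The user text is inserted verbatim, so it must follow C syntax rules.
std::string make_device_source(const std::string &user_statement)
{
    std::string src;
    src += "__device__ float fx(float x){\n";
    src += "float y;\n";
    src += user_statement;   // takes the place of: #include "fx.txt"
    src += "\n;\n";          // same trailing ';' as in the original wrapper
    src += "return y;\n";
    src += "}\n";
    return src;
}
```

For example, make_device_source("y=5/x") yields the text of a device function equivalent to the one compiled from the edited fx.txt above.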

Unable to use GMP with NVCC

When I try to compile the following code with c++ on OS X 10.8, it works fine - no compile errors.
#include <gmpxx.h>
int main(int argc, const char * argv[]) { }
However, when I try to do the same with nvcc, I get a ton of errors:
/usr/local/Cellar/gcc47/4.7.3/gcc/lib/gcc/x86_64-apple-darwin12.5.0/4.7.3/../../../../include/c++/4.7.3/limits(1405): error: identifier "__int128" is undefined
/usr/local/Cellar/gcc47/4.7.3/gcc/lib/gcc/x86_64-apple-darwin12.5.0/4.7.3/../../../../include/c++/4.7.3/limits(1421): error: function call is not allowed in a constant expression
...
How can I use GMP with NVCC/CUDA? To clarify, I don't intend to perform GMP calculations on the device, just the host.
1. Create a .cpp module that you compile with your host compiler, and include your GMP code there.
2. Create a separate .cu module that you compile with nvcc, and include your CUDA code there.
3. Link them together.
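The point of the split is that the header shared between the two modules exposes only plain types, so nvcc never has to parse gmpxx.h. Below is a sketch of what such a boundary could look like. The file names and the function factorial_decimal are hypothetical, and the body uses a hand-rolled digit vector purely as a stand-in so the example compiles without GMP installed.

```cpp
// ---- bigmath.h (hypothetical): the only header the .cu file includes ----
#include <string>
std::string factorial_decimal(unsigned n);

// ---- bigmath.cpp (hypothetical): built with the host compiler only ----
// In the real module this file would include <gmpxx.h> and the body could
// be written with mpz_class, for example:
//     mpz_class r = 1;
//     for (unsigned k = 2; k <= n; ++k) r *= k;
//     return r.get_str();
// The stand-in below does the same job with decimal digits in a vector.
#include <vector>
#include <cstddef>

std::string factorial_decimal(unsigned n)
{
    std::vector<int> digits(1, 1);  // least significant digit first
    for (unsigned k = 2; k <= n; ++k) {
        unsigned long long carry = 0;
        for (std::size_t i = 0; i < digits.size(); ++i) {
            unsigned long long cur =
                static_cast<unsigned long long>(digits[i]) * k + carry;
            digits[i] = static_cast<int>(cur % 10);
            carry = cur / 10;
        }
        while (carry != 0) {
            digits.push_back(static_cast<int>(carry % 10));
            carry /= 10;
        }
    }
    std::string out;
    for (std::size_t i = digits.size(); i-- > 0; ) {
        out.push_back(static_cast<char>('0' + digits[i]));
    }
    return out;
}
```

The .cu module would include only bigmath.h and call factorial_decimal() from host code; the two object files are then linked as described in step 3.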

cuda invalid resource handle

What does this error mean? I can't seem to find ANY information on it. It occurs on a cudaEventRecord.
in the project header file:
cudaEvent_t cudaEventStart;
in a .c file:
cudaEventCreate(&cudaEventStart);
printf("create event: %d\n", (int) cudaEventStart);
in my one .cu file:
printf("record event: %d\n", (int) cudaEventStart);
cudaEventRecord(cudaEventStart);
The relevant output shows what the problem with the call is: for some reason, cudaEventStart isn't a valid event resource in my .cu file:
create event: 44199920
record event: 0
Details
CUDA 3.2
GTX 480
64-bit Win7
I'm in the process of porting my code from Linux to Windows. It runs fine on the same card in Linux, and there have been only a few changes. I defined roundf and added the following:
typedef size_t off_t;
#define strtof(str,n) (float)strtod(str,n)
#include <float.h>
#define isnan(n) _isnan(n)
#define strcasecmp _stricmp
#include <io.h>
#define read _read
It isn't clear to me why any of these things should affect cuda resources. Perhaps I'm building the project incorrectly somehow...?
An invalid resource handle usually means trying to use something (pointer, symbol, texture, kernel) in a context where it was not created. A more specific answer will require a more specific question, particularly which API you are using and how/if you are using host threads anywhere in the code.