I am new to Numba and am trying to apply it to existing NumPy code that is very FLOP-intensive. The function I want to apply @jit to, however, calls other functions that, in turn, use the numpy fft module.
It seems that, in order to apply @jit to a function, it must also be applied to the functions it calls. As a consequence, I cannot apply @jit to my function - it would require applying it to all the functions it calls and, ultimately, to the functions that use the fft module, which is not supported by Numba. Is there a way around this? For instance, a way to let Numba know the data types of the variables returned by the called functions and to instruct it to leave those functions alone, so that I can apply @jit only to the one function and not to those it calls?
The usual way to solve this problem is to split the function into three parts: a header function using Numba, a pure-Python function (calling np.fft), and a footer Numba function. An overall pure-Python function coordinates the calls to the others. Having each function operate on big arrays keeps this approach fast, since the overhead of the CPython-level calls is then amortized.
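For illustration, a minimal sketch of that split could look like this (the function names and the position of the FFT in the pipeline are just assumptions to adapt to your code):

import numpy as np
from numba import njit

@njit
def header(x):               # heavy pre-processing, compiled by Numba
    return x * x

def fft_step(y):             # pure Python, free to call np.fft
    return np.fft.fft(y)

@njit
def footer(spec):            # heavy post-processing, compiled by Numba
    return np.abs(spec)

def compute(x):              # pure-Python driver coordinating the three parts
    return footer(fft_step(header(x)))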
That being said, there is an experimental Numba feature called objmode that is meant to solve this specific problem. You need to specify the types of the input/output objects of the section, and there are a few limitations mentioned in the documentation. Note that it is currently quite unstable: I have so far encountered a few non-deterministic compilation errors (while the code was valid) and crashes while using it.
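For example, a minimal sketch of the objmode escape hatch might look like this (the type string 'complex128[:]' is an assumption matching a 1-D float64 input; adapt it to your data):

import numpy as np
from numba import njit, objmode

@njit
def process(x):
    y = x * 2.0                          # compiled (nopython) code
    with objmode(spec='complex128[:]'):  # temporarily fall back to object mode
        spec = np.fft.fft(y)             # unsupported call runs as plain Python
    return np.abs(spec)                  # back in compiled code

process(np.ones(1024))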
Related
I've cloned a "PointPillars" repo for 3D detection using just a point cloud as input. But when I came to run it, I noticed it uses CUDA and Numba. Without any prior knowledge about these two, I'm asking if there is any way to remove or disable Numba and CUDA. I want to run it on a local server with CPU only, so I would appreciate your advice.
The actual code matters here.
If the usage is only of vectorize or guvectorize with the target='cuda' parameter, then "removal" of CUDA should be trivial: just remove the target parameter.
However, if there is use of the @cuda.jit decorator, or explicit copying of data between host and device, then other code refactoring would be involved. There is no simple answer in that case; the code would have to be converted to an alternate serial or parallel realization via refactoring or porting.
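For the simple case, the change amounts to this kind of edit (an illustrative kernel, not the repo's actual code):

from numba import vectorize

# Before (requires a CUDA-capable GPU):
#   @vectorize(['float32(float32, float32)'], target='cuda')
# After (compiles for and runs on the CPU):
@vectorize(['float32(float32, float32)'])
def add(a, b):
    return a + b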
I'm trying to optimize some molecular simulation code (written completely in Fortran) by using GPUs. I've developed a small subroutine that performs matrix-vector multiplication using the cuBLAS Fortran binding library (non-thunking - /usr/local/cuda/src/fortran.c on Linux).
When I tested the subroutine outside of the rest of the code (i.e. without any other external subroutine calls) everything worked. When I compiled, I used these flags: -names uppercase -assume nounderscore. Without them, I would receive undefined reference errors.
When porting this into the main function of the molecular dynamics code, the -assume nounderscore -names uppercase flags mess up all of my other function calls in the main program.
Any idea of a way around this? Please refer to my previous question where -assume nounderscore -names uppercase was suggested here
Thanks in advance!
I would try Fortran-C interop. With something like
interface
   function cublas_alloc(argument list) bind(C, name="the_binding_name")
      ! declarations of the arguments and of the result
   end function cublas_alloc
end interface
The binding name can be uppercase or lowercase, whatever you need - for example, bind(C, name="CUBLAS_ALLOC"). No underscores will be appended to that.
The iso_c_binding module might also be helpful.
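A slightly more concrete sketch might look like the following; the argument kinds and the exact bound name are assumptions that must be checked against the prototypes in /usr/local/cuda/src/fortran.c:

module cublas_iface
   use iso_c_binding, only: c_int, c_intptr_t
   implicit none
   interface
      integer(c_int) function cublas_alloc(n, elem_size, dev_ptr) &
            bind(C, name="CUBLAS_ALLOC")
         import :: c_int, c_intptr_t
         integer(c_int)      :: n, elem_size   ! passed by reference, as fortran.c expects
         integer(c_intptr_t) :: dev_ptr        ! receives the device pointer
      end function cublas_alloc
   end interface
end module cublas_iface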
Suppose I have a pointer to a __global__ function in CUDA. Is there a way to programmatically ask CUDART for a string containing its name?
I don't believe this is possible by any public API.
I have previously tried poking around in the driver itself, but that doesn't look too promising. The compiler-emitted code for the <<< >>> kernel invocation clearly registers the mangled function name with the runtime via __cudaRegisterFunction, but I couldn't see any obvious way to perform a lookup by name/value in the runtime library. The driver API equivalent cuModuleGetFunction leads to an equally opaque type from which it doesn't seem possible to extract the function name.
Edited to add:
The host compiler itself doesn't support reflection, so there are no obvious fancy language tricks that could be pulled at runtime. One possibility would be to add another preprocessor pass to the compilation trajectory to build a static kernel function lookup table before the final build. That would be rather a lot of work, but it could be done, at least for "classic" compilation where everything winds up in a single translation unit.
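A rough, hand-rolled version of that lookup-table idea (using a macro at registration time instead of an extra preprocessor pass; the names here are purely illustrative) might look like:

#include <cstdio>
#include <map>
#include <string>

__global__ void scale(float *x, float a) { x[threadIdx.x] *= a; }

static std::map<const void *, std::string> kernel_names;   // pointer -> name registry

#define REGISTER_KERNEL(k) (kernel_names[(const void *)(k)] = #k)

const char *kernel_name(const void *fn)
{
    auto it = kernel_names.find(fn);
    return it != kernel_names.end() ? it->second.c_str() : "<unknown>";
}

int main()
{
    REGISTER_KERNEL(scale);
    std::printf("%s\n", kernel_name((const void *)scale));   // prints "scale"
    return 0;
}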
I have a full project created using FFTW. I want to transition to using cuFFT. I understand that cuFFT has a "compatibility mode". But how exactly does this work? The cuFFT manual says:
After an application is working using the FFTW3 interface, users may want to modify their code to move data to and from the GPU and use the routines documented in the FFTW Conversion Guide for the best performance.
Does this mean I actually need to change my individual function calls? For example, calling cufftPlan1d() instead of fftw_plan_dft_1d().
Do I also have to change my data types?
fftw_complex *inputData;   // fftw data storage gets replaced...
cufftComplex *inputData;   // ... by cufft data storage?
fftw_plan forwardFFT;      // fftw plan gets replaced...
cufftHandle forwardFFT;    // ... by cufft plan?
If I'm going to have to rewrite all of my code, what is the point of cufftSetCompatibilityMode()?
Probably what you want is the cuFFTW interface to cuFFT. I suggest you read this documentation, as it is probably close to what you have in mind. It will allow you to use cuFFT in an FFTW application with a minimum amount of changes. As indicated in the documentation, there should only be two steps required (a short sketch follows the list):
It is recommended that you replace the include file fftw3.h with cufftw.h
Instead of linking with the double/single-precision libraries such as fftw3/fftw3f, link with both the CUFFT and CUFFTW libraries
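As a sketch of what that looks like in practice (sizes and the build line are illustrative):

#include <cufftw.h>                   /* was: #include <fftw3.h> */

int main(void)
{
    int n = 1024;
    fftw_complex *in  = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * n);

    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);                  /* executes on the GPU through cuFFT */

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
/* build, e.g.: nvcc app.c -lcufftw -lcufft   (was: gcc app.c -lfftw3) */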
Regarding the doc item you excerpted, that step (moving the data explicitly) is not required if you're just using the cuFFTW compatibility interface. However, you may not achieve maximum performance this way. If you want to achieve maximum performance, you may need to use cuFFT natively, for example so that you can explicitly manage data movement. Whether or not this is important will depend on the specific structure of your application (how many FFTs you are doing, and whether any data is shared amongst multiple FFTs, for example). If you intend to use cuFFT natively, then the following comments apply (a native sketch follows them):
Yes, you need to change your individual function calls. They must line up with function names in the API, associated header files, and library. The fftw_ function names are not in the cuFFT library.
You can inspect your data types and should discover that for the basic data types like float, double, complex, etc. they should be layout-compatible between cuFFT and FFTW. Personally I would recommend changing your data types to cuFFT data types, but there should be no functional or performance difference at this time.
Although you don't mention it, cuFFT will also require you to move the data between CPU/Host and GPU, a concept that is not relevant for FFTW.
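For reference, the same 1-D double-precision transform written natively against cuFFT looks roughly like this (a .cu file; error checking omitted, sizes illustrative):

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 1024;
    cufftDoubleComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftDoubleComplex) * n);   // data now lives on the GPU

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);                    // replaces fftw_plan_dft_1d
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);      // in-place forward transform

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}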
Regarding cufftSetCompatibilityMode, the function documentation and discussion of FFTW compatibility mode is pretty clear on its purpose. It has to do with overall data layout, especially padding of data for FFTW.
Check this link out: https://developer.nvidia.com/blog/cudacasts-episode-8-accelerate-fftw-apps-cufft-55/ - it says that all we need to do is change the linking.
So the problem I am having has not actually happened yet. I am planning out some code for a game I am currently working on, and I know that I am going to need to conserve memory usage as much as possible from step one. My question is: if I have, for example, 500k objects that will need to be constantly constructed and destructed, would it save me any memory to have the functions those classes are going to use as function pointers?

E.g. without function pointers:

class MyClass {
public:
    void Update();
    void Draw();
    ...
};

E.g. with function pointers:

class MyClass {
public:
    void *Update;
    void *Draw;
    ...
};

Would this save me any memory, or would every new instance of MyClass just access the same functions that were defined for the rest of them? If it does save me any memory, would it be enough to be worthwhile?
Assuming those are not virtual functions, you'd use more memory with function pointers.
The first example
There is no allocation for these functions (beyond the base amount required for the object itself so that new returns unique pointers, plus whatever additional members you ... ellipsized).
Non-virtual member functions are basically static functions taking a this pointer.
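In other words (an illustrative snippet, not literally what the compiler emits):

struct MyClass {
    int hp;
    void Update();                 // what you write
};

void MyClass::Update() { hp -= 1; }

// Roughly what the compiler generates under the hood:
void MyClass_Update(MyClass *self) { self->hp -= 1; }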
The advantage is that your objects are really simple, and you'll have only one place to look to find the corresponding code.
The disadvantage is that you lose some flexibility, and have to cram all your code into single update/draw functions. This will be hard to manage for anything but a tiny game.
The second example
You allocate two extra pointers per object instance. Pointers are usually 4 to 8 bytes each (depending on your target platform and your compiler).
The advantage is that you gain a lot of flexibility: you can change the function being pointed to at runtime, and you can have a multitude of functions implementing it, thus supporting (but not guaranteeing) better code organization.
The disadvantage is that it will be harder to tell which function each instance will point to when you're debugging your application, or when you're simply reading through the code.
Other options
Using function pointers this specific way (instance data members) usually makes more sense in plain C, and less sense in C++.
If you want those functions to be bound at runtime, you may want to make them virtual instead. The cost is compiler/implementation dependent, but I believe it is (usually) going to be one v-table pointer per object instance.
Plus you can take full advantage of virtual function syntax to dispatch based on type, rather than on whatever you happened to bind a pointer to. It will be easier to debug than the function pointer option, since you can simply look at the derived type to figure out what function a particular instance points to.
You also won't have to initialize those function pointers - the C++ type system would do the equivalent initialization automatically (by building the v-table, and initializing each instance's v-table pointer).
See: http://www.parashift.com/c++-faq-lite/virtual-functions.html
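To make the trade-off concrete, here is a small sketch comparing the per-object sizes of the three layouts (the exact numbers depend on your platform; the comment assumes a typical 64-bit build):

#include <cstdio>

struct PlainEntity {                        // non-virtual members: no per-object overhead
    float x, y;
    void Update() { x += 1.0f; }
};

struct VirtualEntity {                      // virtual members: one v-table pointer per object
    float x, y;
    virtual void Update() { x += 1.0f; }
    virtual ~VirtualEntity() {}
};

struct PointerEntity {                      // hand-rolled: one pointer per function, per object
    float x, y;
    void (*Update)(PointerEntity &);
    void (*Draw)(const PointerEntity &);
};

int main()
{
    // Typically prints something like 8, 16 and 24 on a 64-bit platform.
    std::printf("%zu %zu %zu\n", sizeof(PlainEntity), sizeof(VirtualEntity),
                sizeof(PointerEntity));
    return 0;
}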