Adding functions in a CUDA program

So, I think I have a very weird question.
Let's say that I already have a program on my GPU, and in that program I call a function X, but that function X is not defined yet.
I want to be able to modify that function X dynamically: completely change its code and put it into the program without recompiling the rest or losing any pointers whatsoever.
To compare it with something most of us know, I want to do what shaders allow in OpenGL: in the middle of execution, I can change the code of one shader, recompile only that shader, re-activate the program, and the new version is used.
So, is it possible? Or do I need to recompile the whole thing every time? And if I have to recompile, do I lose the various arrays that I created in global memory?
Thanks
W

If you compile with the -cuda flag using nvcc, you can get the intermediate C++ source that streams the PTX to the device. In theory, you could post-process this intermediate output to generate PTX on the fly and send it over. You might even be able to make the PTX self-modifying, but that's way out of my league.
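For the "send it over" step, the driver API can load a PTX image at runtime. A minimal sketch, assuming the generated PTX defines a kernel named X (error checking elided):

#include <cuda.h>

/* Load a PTX string at runtime and launch the kernel it contains.
   Assumes a CUDA context is already current and the PTX defines a
   kernel whose (unmangled) name is "X". Error checking elided. */
void loadAndRunPtx(const char *ptx)
{
    CUmodule   module;
    CUfunction func;

    cuModuleLoadData(&module, ptx);          /* JIT-compiles the PTX     */
    cuModuleGetFunction(&func, module, "X"); /* look the kernel up by name */

    /* launch with a 1x1 grid and no arguments, just to illustrate */
    cuLaunchKernel(func, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);

    cuModuleUnload(module); /* unload so a new version can replace it */
}

Note that device allocations belong to the context, not to a module, so buffers in global memory survive unloading one module and loading another.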


External FLASH slow verification with STM32CubeProgrammer

I'm working with an STM32F469 chip with a Micron MT25Q Quad-SPI Flash. To program the Flash, an external loader program needs to be developed. That's all working, but the problem is that verification of the QSPI Flash is extremely slow.
Looking in the log file, it shows that the Flash is being programmed in 150K-byte blocks. However, the verification is being done in 1K-byte blocks. In addition, the chip is re-initialized before each block check. I've tried this both through STM32CubeIDE and in STM32CubeProgrammer directly.
The external loader program includes the correct chip configuration information and specifies a 64K page size. I don't see how to get the programmer to use a larger block size. It looks like it understands what part of the SRAM is used and is using the balance of the 256K of on-board SRAM for programming the QSPI Flash. It could use the same size for reading the data back, or use the Verify() function in the external loader. Instead it's calling Read() and then checking the data itself.
Any thoughts or hints?
Let me add some observations on creating a new external loader. The first observation is "Don't." If you can pick a supported external chip and pin it out to use an existing loader, then do that. ST provides just four example programs, but they must have 50 external loaders. If the hardware design copies the schematic of a demo board that has an external loader, you should be fine and can avoid the development work.
The external loader is not a complete executable. It provides a set of functions to do basic operations like Init(), Erase(), Read() and Write(). The trick is that there is no main(), and no start-up code runs when the program starts.
The external loader is an ELF file, renamed to "*.stldr". The programming tool looks into the debug information to find the locations of the functions. It then sets the registers to provide the parameters, sets the PC to the function, and lets it run. There's some super-clever work going on to make this work. The programmer looks at the returned value (R0) to see whether things passed or not. It can also figure out whether the function has crashed the core or otherwise timed out.
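For reference, the set of entry points looks roughly like this; the signatures below follow ST's Loader_Src.c template, but they may differ slightly between versions, so treat them as a sketch:

#include <stdint.h>

/* Typical entry points of an STM32 external loader (per ST's
   Loader_Src.c template; exact names/types may vary by version).
   Each is called with parameters placed in registers by the
   programmer, and returns 1 for success or 0 for failure in R0. */
int Init(void);                               /* clocks, GPIO, QSPI setup */
int Write(uint32_t Address, uint32_t Size,
          uint8_t *buffer);                   /* program Size bytes       */
int SectorErase(uint32_t EraseStartAddress,
                uint32_t EraseEndAddress);    /* erase covered sectors    */
int MassErase(void);                          /* optional full-chip erase */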
What makes writing the external loader super fun is that the debugger is busy running the program, so there's no debugger available to see what your code is doing. I settled on outputting errors, and encoded information, in the return value of the called functions to give hints as to what was happening.
The external loader isn't a "full" program. Without the start-up code, lots of on-chip stuff isn't set up and some just isn't going to work. At least I couldn't figure it out. I'm not sure whether it wasn't configured right or the debugger was blocking its use. Looking at the example external loaders, they are written in a very simple way and do not call the HAL or use interrupts. You'll need to provide core set-up functions to configure the clock chains. HAL_Delay() will never return, as the timers and/or interrupts aren't working. I could never make them work and suspect the NVIC was somehow being disabled. I ended up replacing HAL_Delay() with a for loop that spun based on the core clock rate and the instruction cycles per loop.
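Something along these lines; the clock rate and cycles-per-iteration constants are assumptions that have to be calibrated against your core clock and compiler output:

#include <stdint.h>

#define CORE_CLOCK_HZ   180000000U /* assumed core clock after Init()     */
#define CYCLES_PER_LOOP 4U         /* rough guess: calibrate for your build */

/* Busy-wait replacement for HAL_Delay(): no timers, no interrupts. */
static void spin_delay_ms(uint32_t ms)
{
    uint32_t loops = (CORE_CLOCK_HZ / 1000U / CYCLES_PER_LOOP) * ms;
    for (volatile uint32_t i = 0; i < loops; ++i) {
        /* spin; volatile keeps the compiler from removing the loop */
    }
}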
The app note suggests developing a stand-alone program to debug the basic capabilities. That's a good idea, but a challenge. Prior to starting the external loader, I had the QSPI doing the needed operations, but from a C++ application calling the HAL. Creating an external loader from that was a long exercise in stripping out and replacing functionality. A hint is that the examples are written at the register level. I'm not good enough to deal directly with the QuadSPI peripheral and the chip's instruction set at the same time.
The normal start-up of a program is eliminated. Everything that's done before main() is called (e.g., in startup_stm32f469nihx.s) is up to you. This includes setting the clock chains to boost the core clock and get the peripheral buses working. The program runs in on-chip SRAM, so any initialized variables are loaded correctly. There's no data to relocate, but the stack and the uninitialized data area could/should still be zeroed.
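A minimal sketch of the zeroing, assuming the STM32Cube default linker-script symbol names (_sbss/_ebss); yours may differ, so check your .ld file:

#include <stdint.h>

/* Boundaries of the uninitialized-data section, provided by the
   linker script. _sbss/_ebss are the STM32Cube defaults and are
   an assumption here. */
extern uint32_t _sbss, _ebss;

/* Call this at the top of Init(), standing in for the work that
   startup_stm32f469nihx.s would normally do before main(). */
static void zero_bss(void)
{
    for (uint32_t *p = &_sbss; p < &_ebss; ++p)
        *p = 0;
}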
I hope this helps someone!
Today I've faced the same issue.
I was able to improve the Verify speed with two simple steps, but Verify is still much slower than programming, which is strange...
If anyone finds a way to change the 1KB block reads of STM32CubeProgrammer, I would like to know =).
Here are the changes I made to improve performance a bit.
Add a kind of lock in the Init() function to avoid multiple initializations. This was the most significant change, because I'm checking the Flash ID in my initialization process. Other approaches could be safer, but this simple code snippet worked for me.
int Init(void)
{
    /* Magic value marking that initialization already ran; the static
       persists in SRAM between calls, so re-init work is skipped. */
    static uint32_t lock;

    if (lock != 0x43213CA5)
    {
        lock = 0x43213CA5;
        /* Init procedure goes here */
    }
    return 1; /* 1 = success */
}
Cache a page instead of reading the external memory on each call. This helps most if your external memory's page read has a lot of overhead; otherwise this idea won't give relevant results.
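A sketch of the idea, using the Read() convention from the loader interface above and a hypothetical QSPI_ReadPage() helper (the helper name, buffer type, and page size are assumptions; adapt them to your template):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 0x1000U /* assumed cache granularity; tune to your part */

/* Hypothetical low-level page read implemented elsewhere in the loader. */
extern void QSPI_ReadPage(uint32_t pageAddress, uint8_t *dst);

static uint8_t  page_cache[PAGE_SIZE];
static uint32_t cached_page = 0xFFFFFFFFU; /* no page cached yet */

/* Serve the 1KB reads issued by STM32CubeProgrammer from a cached
   page, so the external memory is touched once per page, not per call. */
int Read(uint32_t Address, uint32_t Size, uint8_t *buffer)
{
    while (Size > 0)
    {
        uint32_t page   = Address & ~(PAGE_SIZE - 1U);
        uint32_t offset = Address - page;
        uint32_t chunk  = PAGE_SIZE - offset;
        if (chunk > Size)
            chunk = Size;

        if (page != cached_page) /* cache miss: fetch the whole page */
        {
            QSPI_ReadPage(page, page_cache);
            cached_page = page;
        }
        memcpy(buffer, &page_cache[offset], chunk);

        Address += chunk;
        buffer  += chunk;
        Size    -= chunk;
    }
    return 1; /* success convention of the loader interface */
}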

How to disable or remove numba and cuda from python project?

I've cloned a "PointPillars" repo for 3D detection using just a point cloud as input. But when I came to run it, I noticed it uses CUDA and Numba. Without any prior knowledge about these two, I'm asking if there is any way to remove or disable Numba and CUDA. I want to run it on a local server with CPU only, so I'd like your advice on how to solve this.
The actual code matters here.
If the usage is only of vectorize or guvectorize with the target='cuda' parameter, then "removal" of CUDA should be trivial: just remove the target parameter.
However, if there is use of the @cuda.jit decorator, or explicit copying of data between host and device, then other code refactoring would be involved. There is no simple answer in that case; the code would have to be converted to an alternate serial or parallel realization via refactoring or porting.

NVRTC and __device__ functions

I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I have identified a specific __device__ function whose performance can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__), in order to "override" an existing function?
I am pretty sure the really short answer is no.
Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.
So you can't delay-load a particular function into an existing compiled GPU payload at runtime as you can in a conventional dynamic-link loading environment. The linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you are free to JIT-link together precompiled objects containing different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
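For illustration, a minimal sketch of that JIT-link path through the driver API, assuming mainPtx and funcPtx were both built as relocatable device code (nvcc -rdc=true -ptx) and funcPtx holds the chosen version of the shared symbol (error checking elided):

#include <cuda.h>
#include <string.h>

/* Link a precompiled main kernel with one chosen version of the
   __device__ function at runtime. */
CUmodule linkWith(const char *mainPtx, const char *funcPtx)
{
    CUlinkState state;
    void       *cubin;
    size_t      cubinSize;
    CUmodule    module;

    cuLinkCreate(0, NULL, NULL, &state);
    cuLinkAddData(state, CU_JIT_INPUT_PTX,
                  (void *)mainPtx, strlen(mainPtx) + 1, "main.ptx",
                  0, NULL, NULL);
    cuLinkAddData(state, CU_JIT_INPUT_PTX,
                  (void *)funcPtx, strlen(funcPtx) + 1, "func.ptx",
                  0, NULL, NULL);
    cuLinkComplete(state, &cubin, &cubinSize); /* one instance of every
                                                  symbol required here  */
    cuModuleLoadData(&module, cubin);          /* load the linked image */
    cuLinkDestroy(state);
    return module;
}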
It looks like you have a "main" kernel with a part that is "switchable" at run time.
You can definitely do this using nvrtc. You'd need to go about doing something like this:
Instead of compiling the main kernel ahead of time, store it as a string to be compiled and linked at runtime.
Let's say the main kernel calls "myFunc", a __device__ function that is chosen at runtime.
You can generate the appropriate "myFunc" kernel based on equations at run time.
Now you can create an nvrtc program from the combined sources using nvrtcCreateProgram (for example, by passing the generated function as a header that the main source includes, or simply concatenating the strings).
That's about it. The key is to delay compiling the main kernel until you need it at run time. You may also want to cache your kernels somehow so you end up compiling only once.
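A minimal sketch of that flow; mainSrc, myFuncSrc and mainKernel are placeholder names, the kernel is assumed to be declared extern "C" so its name isn't mangled, and error checking is elided:

#include <nvrtc.h>
#include <cuda.h>
#include <stdlib.h>
#include <string.h>

/* Compile the generated myFunc together with the stored main kernel
   source as one translation unit, then load the resulting PTX. */
CUfunction buildKernel(const char *mainSrc, const char *myFuncSrc)
{
    /* concatenate so the call to myFunc resolves at compile time */
    char *src = (char *)malloc(strlen(myFuncSrc) + strlen(mainSrc) + 1);
    strcpy(src, myFuncSrc);
    strcat(src, mainSrc);

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "main.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char *ptx = (char *)malloc(ptxSize);
    nvrtcGetPTX(prog, ptx);
    nvrtcDestroyProgram(&prog);

    CUmodule   module;
    CUfunction kernel;
    cuModuleLoadData(&module, ptx);
    cuModuleGetFunction(&kernel, module, "mainKernel"); /* assumed name */

    free(ptx);
    free(src);
    return kernel; /* cache this so you compile only once */
}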
There is one problem I foresee: nvrtc may not find the curand device calls, which may cause some issues. One workaround would be to look at the header the device function call is in and use nvcc to compile the appropriate device function to PTX. You can store the resulting PTX as text and use cuLinkAddData to link it with your module. You can find more information in this section.

Given a pointer to a __global__ function, can I retrieve its name?

Suppose I have a pointer to a __global__ function in CUDA. Is there a way to programmatically ask CUDART for a string containing its name?
I don't believe this is possible by any public API.
I have previously tried poking around in the driver itself, but that doesn't look too promising. The compiler-emitted code for <<< >>> kernel invocations clearly registers the mangled function name with the runtime via __cudaRegisterFunction, but I couldn't see any obvious way to perform a lookup by name/value in the runtime library. The driver API equivalent, cuModuleGetFunction, leads to an equally opaque type from which it doesn't seem possible to extract the function name.
Edited to add:
The host compiler itself doesn't support reflection, so there are no obvious fancy language tricks that could be pulled at runtime. One possibility would be to add another preprocessor pass to the compilation trajectory to build a static kernel-function lookup table before the final build. That would be rather a lot of work, but it could be done, at least for "classic" compilation where everything winds up in a single translation unit.
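To give the flavor of it, here is a hand-rolled sketch of such a table, with a hypothetical REGISTER_KERNEL macro standing in for the generated pass (the pointer used as the key is the host-side stub that <<< >>> launches go through):

#include <cstdio>
#include <map>
#include <string>

__global__ void myKernel(int *out) { *out = 42; }

/* Pointer-to-name table. A generated preprocessor pass would emit the
   registrations; here they are written by hand for illustration. */
static std::map<const void *, std::string> kernelNames;

#define REGISTER_KERNEL(k) (kernelNames[(const void *)&(k)] = #k)

static const char *kernelName(const void *fptr)
{
    auto it = kernelNames.find(fptr);
    return it == kernelNames.end() ? "<unknown>" : it->second.c_str();
}

int main()
{
    REGISTER_KERNEL(myKernel);
    std::printf("%s\n", kernelName((const void *)&myKernel)); /* myKernel */
    return 0;
}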

Is there any advantage to building bytecode over regular ActionScript?

I'm very curious: is there any advantage to building a SWF using bytecode rather than regular ActionScript?
From what I've read there are some ways to speed up code a little, but are there some things that AS3 code cannot do?
EDIT:
Please do not focus on coding style, problems, type checks and syntax. Let's say I'll have an external SWF with one class written in bytecode, and now I'd like to know what benefits I can get from lower-level coding.
The Flash compiler doesn't use all the opcodes available in the Flash Player, so, in theory, you can indeed get some performance increase by "manually" writing opcodes.
For example, the Haxe compiler can make use of the Alchemy opcodes and provide a performance boost compared to Flash:
Access Alchemy OpCodes
There are OpCodes for memory allocation hidden in the SWF player which are used by Adobe Alchemy. Haxe has the ability to access them giving you low level memory access which can allow HUGE speed increases.
http://haxe.org/doc/why
I don't know how safe it is to use these opcodes though. Since they are not supported by the official Flash compiler, they might be dropped in a future release of the Flash Player.
Just adding to what Laurent posted. ASC is the ActionScript compiler used by Adobe products so far (the MXML compiler uses code from ASC, and the compiler in Flash CS is a customized build of it; neither greatly changes the actual compiler, acting instead as the front end / "glue" to the linker and other utilities).
ASC is not an optimizing compiler. What that means is that it doesn't do the optimizations usually possible when compiling to a lower-level language. It analyzes the generated bytecode only enough to be mostly certain it's not erroneous. (It is still possible for the compiler to generate erroneous code from valid AS3 code.)
There isn't a one-to-one correspondence between all valid bytecode and AS3, which means that AS3 limits you to a subset of what is possible in bytecode. It is entirely possible that in the resulting bytecode you would see an easier way to get at a certain value while it is on the stack, but there will be no tool in AS3 to get it from there. For example, you could avoid creating a loop iterator by relying on the first register containing the iterator, if you know that the rest of the code inside the loop will never read or write the first register. Obviously there are a lot more "features" you can discover when you analyze the way bytecode is processed.
But it is important to understand that by optimizing locally you will hardly achieve anything significant unless your goal is fine-grained and very specific, such as a particular cryptographic algorithm or a string-parsing routine. Optimizing larger pieces of a program by hand is really difficult. In fact, so difficult that you will need a tool at least to verify that you are actually optimizing. In the end, you will find yourself using this tool to generate the optimized variants of code and test them, and this is how compilers are built :)