How to deallocate memory allocated by dlopen()?

I am trying to fix this issue. I have read several questions about dlopen, but it is still not clear to me. It seems dlopen allocates memory with either calloc or malloc, but how do I deallocate this memory?
Similar code is flagged with a leak problem here for "dl":
(snip)
Event alloc_fn: Called allocation function "dlopen"
Event var_assign: Assigned variable "dl" to storage returned from "dlopen(&"libc.so.6",1)"
261 dl = dlopen("libc.so.6", RTLD_LAZY);
At conditional (1): "dl" taking true path
262 if (dl) {
Event noescape: Variable "dl" not freed or pointed-to in function "dlsym"
263 func = dlsym(dl, "fdopen");
264 }
265 assert(func != NULL);
266 }
Event leaked_storage: Variable "dl" goes out of scope
267 return (*func)(fd, mode);
(snip)
Is this a bug, or do we need to ignore it? If it needs to be fixed, could anyone please guide me on how to fix it?
Thanks,
Boobesh

Read the dlopen(3) man page carefully. Also read Drepper's paper How To Write Shared Libraries, which gives a lot of interesting explanations.
You can release the resources acquired by dlopen by calling dlclose.
Be careful about resources used or provided by the plugin that you have dlopen-ed. The plugin might have so-called constructor or destructor functions, e.g. functions declared with the GCC function attribute constructor, etc.
You could also not care and never bother to call dlclose, accepting some leakage. In practice, on Linux, you can call dlopen many hundreds of thousands of times without pain (only your address space will grow); see my manydl.c example.
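For the code in the question, a minimal sketch of a leak-free version might look like this (the wrapper name my_fdopen and its exact signature are illustrative, not from the original; link with -ldl). Note that the looked-up function must be called before the handle is closed:
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

FILE *my_fdopen(int fd, const char *mode)
{
    FILE *(*func)(int, const char *) = NULL;
    FILE *result;
    void *dl = dlopen("libc.so.6", RTLD_LAZY);

    if (dl)
        func = (FILE *(*)(int, const char *))dlsym(dl, "fdopen");
    assert(func != NULL);          /* dl is therefore non-NULL here */
    result = func(fd, mode);       /* use the symbol while the handle is open */
    dlclose(dl);                   /* release the handle; the leak report goes away */
    return result;
}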


How to clear an exception in a handler in RISC-V?

The following is my trap routine on an FE310 SiFive HiFive1 Rev B board.
my_trap_routine:
    csrr t0, mcause    // read mcause
    csrr t1, mepc      // read mepc
    mret
Now, I generated a load access fault exception and execution jumped into the trap routine. How do I clear the exception inside the handler so that execution doesn't keep jumping back into the trap routine again and again?
You have to advance the exception program counter (mepc), so that you return to the next instruction in the user / interrupted code.
This is fairly simple in RISC-V unless the compressed instruction set is in use, in which case you have to decode the excepting instruction to determine how far to advance the PC.
Fortunately, it is a pretty simple decode, but you need to be aware that RISC-V allows instruction lengths to vary in 2-byte increments; see the sketch below.
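A minimal sketch of such a handler, for illustration only: a real handler must save and restore every register it touches, and reading the instruction back through mepc assumes the faulting instruction's own address is readable (true for code running from flash or RAM on the FE310).
my_trap_routine:
    csrr t0, mcause       // read the trap cause (not otherwise used in this sketch)
    csrr t1, mepc         // address of the faulting instruction
    lhu  t2, 0(t1)        // fetch the instruction's low 16 bits
    andi t2, t2, 3        // low two bits == 0b11 means a full 32-bit instruction
    addi t1, t1, 2        // advance 2 bytes for a compressed instruction...
    li   t3, 3
    bne  t2, t3, 1f
    addi t1, t1, 2        // ...plus 2 more for a 32-bit instruction
1:
    csrw mepc, t1         // resume at the instruction after the faulting one
    mret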

How to make an atomic device function in CUDA?

My kernel writes its results to some global device variables,
so I need the function that writes them to be atomic.
Is that possible?
If it is not, I am trying to use atomicExch() on every global variable.
But some of them are structs or floats, not ints.
As far as I know, atomic operations are for int.
How can I deal with this problem?
Thanks.
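(On the float part of the question: atomicExch() does in fact have a float overload, and other floating-point read-modify-write operations are usually built from atomicCAS() by reinterpreting the bits. A minimal sketch of that pattern for a float add follows; it is the standard technique from the CUDA programming guide, not code from this thread. A whole struct, by contrast, is generally too wide for a single hardware atomic and needs a lock or a redesign.)
__device__ float atomicAddFloat(float *addr, float val)
{
    // Treat the float's bits as an int so atomicCAS can operate on them.
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        // Install assumed + val only if the word hasn't changed in the meantime.
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}
(On devices of compute capability 2.0 and later, atomicAdd() for float exists natively, so this pattern is mainly useful for operations the hardware doesn't provide.)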
I found the reason why the cudaErrorLaunchOutOfResources error was raised.
My kernel used 28 registers, but the project settings did not allow for that:
CUDA C/C++ -> Device -> Max Used Register was set to 0. After changing it to 30, the error disappeared.

Why is the first cudaMalloc the only bottleneck?

I defined this function:
void cuda_entering_function(...)
{
    StructA *host_input, *dev_input;
    StructB *host_output, *dev_output;

    host_input = (StructA*)malloc(sizeof(StructA));
    host_output = (StructB*)malloc(sizeof(StructB));
    cudaMalloc((void**)&dev_input, sizeof(StructA));
    cudaMalloc((void**)&dev_output, sizeof(StructB));
    ... some more other cudaMalloc()s and cudaMemcpy()s ...
    cudaKernel<<< ... >>>(dev_input, dev_output);
    ...
}
This function is called several times (about 5 to 15 times) throughout my program, and I measured the program's performance using gettimeofday().
Then I found that the bottleneck of cuda_entering_function() is the very first cudaMalloc() in my whole program. Over 95% of the total execution time of cuda_entering_function() was consumed by that first cudaMalloc(), and this still happens when I change the size of the first cudaMalloc()'s allocation or change the order of the cudaMalloc() calls.
What is the reason, and is there any way to reduce the first CUDA allocation time?
The first cudaMalloc is responsible for the initialization of the device too, because it's the first call to any function involving the device. This is why you take such a hit: it's overhead due to the use of CUDA and your GPU. You should make sure that your application can gain a sufficient speedup to compensate for the overhead.
In general, people use a call to an initialization function in order to set up their device. In this answer, you can see that apparently a call to cudaFree(0) is the canonical way to do so. This sample shows the use of cudaSetDevice, which can be a good habit if you ever work on machines with several CUDA-ready devices.
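A minimal sketch of such an initialization step, assuming a single-GPU machine (the device index 0 and the function name init_cuda are assumptions for this example):
#include <cuda_runtime.h>

void init_cuda(void)
{
    cudaSetDevice(0);  // select the GPU explicitly (index 0 assumed here)
    cudaFree(0);       // harmless call that forces context creation right now
}
Calling this once at program startup moves the one-time context-creation cost out of cuda_entering_function(), so the gettimeofday() measurements afterwards reflect only the real allocation and kernel work.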

Error in assigning values to bytes in a 2D array of registers in Verilog

Hi, when I write this piece of code:
module memo(out1);
  reg [3:0] mem [2:0];          // three 4-bit words
  output wire [3:0] out1;
  initial
  begin
    mem[0][3:0] = 4'b0000;
    mem[1][3:0] = 4'b1000;
    mem[2][3:0] = 4'b1010;
  end
  assign out1 = mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal mem<0> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]? Thanks for your help!
Your module has no inputs and a single output, out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're basically initializing mem and then only using mem[1]. You could equivalently have a module which just assigns out1 the value 4'b1000 (mem never changes). So yes, you did initialize the array, but because you didn't use any of the other values, the Xilinx tools are optimizing your module during synthesis and "trimming the fat." If you were to simulate this module (say, in ModelSim) you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've come to the conclusion that your code is unsynthesizable. It appears to me that you could definitely synthesize it; it's just a rather roundabout way to assign a single value of 4'b1000.
With regard to using initial blocks to store values in block RAM (e.g. to make a ROM), that's fine; I've done that several times without issue. A common use for this is to store coefficients in block RAM, which are read out later. That stated, the way this module is written there's no way to read anything out of mem anyway.
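For illustration, here is a minimal variant with an address input (the addr port is an addition for this sketch, not part of the original design). Because every word of mem is now reachable, XST has nothing to trim:
module memo(addr, out1);
  input  wire [1:0] addr;      // selects which word of mem to read
  output wire [3:0] out1;
  reg [3:0] mem [2:0];         // three 4-bit words
  initial
  begin
    mem[0] = 4'b0000;
    mem[1] = 4'b1000;
    mem[2] = 4'b1010;
  end
  assign out1 = mem[addr];     // addr = 3 is unmapped and reads as x in simulation
endmodule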

How are functions modified at run time and then propagated to multiple threads?

With Clojure (and other Lisp dialects) you can modify running code. So, when a function is modified at runtime, is that change made available to multiple threads?
I'm trying to figure out how it works technically in a concurrent setting: if several threads are using a function foo, what happens when I redefine (say using defn) the function foo?
There has to be some synchronization going on: when and how does such synchronization happen and what does it cost?
Say, on the JVM, is the function referenced using a volatile reference? If so, does that mean that every single "function lookup" has to pay the volatile-read cost?
In Clojure, functions are instances of the IFn interface, and they are almost always stored in vars. Vars are Clojure's mechanism for thread-local values.
When you define a function, that sets the "root binding" of the var to reference the new function.
Other threads see whatever the current value of the root binding of the var is, but they can't change that value. This prevents any two threads from having to fight over the value of the var, because only the root thread can set it.
Threads can choose to use a new value of the var if they need to, by calling binding, which gives them their own thread-local value that they are free to change at will, because no other thread can read it.
A good understanding of vars is well worth a little study; they are a very useful concurrency device once you get used to them.
ps: the root thread is usually the REPL
pps: you are of course free to store your functions in something other than vars if, for instance, you need to atomically update a group of functions, though this is rare.
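A small REPL sketch of both mechanisms (the names f and *g* are made up for this example; note that binding requires the var to be declared ^:dynamic):
(defn f [] :original)             ; defn sets the root binding of #'f
(defn f [] :redefined)            ; re-sets the root binding; every thread now sees it
(f)                               ; => :redefined

(def ^:dynamic *g* (fn [] :root))
(binding [*g* (fn [] :thread-local)]
  (*g*))                          ; => :thread-local, visible only on this thread
(*g*)                             ; => :root again outside the binding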